README

RuleFit (described in Friedman & Popescu) is a clever model combining tree ensembles and linear models. The goal is to produce a model with comparable performance to a tree ensemble, with the interpretability of a linear model.

Specifics to xrf

Comparison to alternatives

Several implementations of RuleFit are available for R: pre, horserule, and rulefit. xrf improves on some aspects of these by: * Usually building more accurate models at fixed number of parameters * Usually building models faster * Building models that predict for new factor-levels * Providing a more concise and limited interface * Tested & actively maintained for fewer bugs

On the last point, as of April 2019, the ‘pre’ and ‘rulefit’ packages fail to build a model on the census income example below due to bugs.

Example

Here we predict whether an individual’s income is greater than $50,000 using census data.

library(RCurl)
library(xrf)

# grabbing data from uci
census_income_text <- getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data')
census_income <- read.csv(textConnection(census_income_text), header=F, stringsAsFactors = F)
colnames(census_income) <- c('age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
                            'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
                            'hours_per_week', 'native_country', 'above_50k')
                            
m_xrf <- xrf(above_50k ~ ., census_income, family = 'binomial', 
             xgb_control = list(nrounds = 100, max_depth = 3))

Comparison

Here we employ out-of-the-box RuleFit from pre & xrf, as well as raw xgboost & glmnet models.

library(dplyr)
library(pre)
library(rulefit) # installed via devtools::install_git("https://github.com/Zelazny7/rulefit")
library(horserule)
library(glmnet)
library(xgboost)

auc <- function (prediction, actual) {
  stopifnot(length(unique(actual)) == 2)
  stopifnot(length(prediction) == length(actual))
  mann_whit <- wilcox.test(prediction ~ actual)$statistic
  unname(1 - mann_whit/(sum(actual) * as.double(sum(!actual))))
}

census_income <- census_income %>%
  # pre is picky about data types 
  mutate_if(is.character, as.factor) %>%
  mutate(
    above_50k = as.character(above_50k) == ' >50K'
  )

set.seed(55455)
train_ix <- sample(nrow(census_income), floor(nrow(census_income) * .66))
census_train <- census_income[train_ix, ]
census_test <- census_income[-train_ix, ]
census_mat <- model.matrix(above_50k ~ ., census_income)
census_train_mat <- census_mat[train_ix, ]
census_test_mat <- census_mat[-train_ix, ]

# note, as of 2019-04-17, the pre example fails to work (with an error for a new level in model.frame). as such, the below comparison is not one to one
system.time(m_pre <- pre(above_50k ~ ., na.omit(census_train), 
                         family = 'binomial', ntrees = 100, maxdepth = 3, tree.unbiased = TRUE))
# note, as of 2019-04-25, this example fails by attempting to access names() of a sparse matrix (seems it should be using colnames())
system.time({
  m_gbm <- gbm.fit(census_train_mat, census_train$above_50k, distribution="bernoulli", interaction.depth = 3, shrinkage = 0.1, verbose = FALSE)
  rf_plan <- rulefit(m_gbm, n.trees = 100)
  m_rf <- train(rf_plan, census_train_mat, y = census_train$above_50k, family = "binomial")
})
# note that this involves no hyperparameter tuning
system.time({
  m_hrf <- HorseRuleFit(census_train_mat, census_train$above_50k, ntree=100, L=3)
})
system.time(m_xrf <- xrf(above_50k ~ ., census_train, family = 'binomial', 
             xgb_control = list(nrounds = 100, max_depth = 3)))
m_xgb <- xgboost(census_train_mat, census_train$above_50k, max_depth = 3, nrounds = 100, objective = 'binary:logistic')
m_glm <- cv.glmnet(census_train_mat, census_train$above_50k, alpha = 1)

auc(predict(m_pre, census_test), census_test$above_50k)
auc(predict(m_hrf, census_test_mat), census_test$above_50k)
auc(predict(m_xrf, census_test), census_test$above_50k)
auc(predict(m_glm, newx = census_test_mat, s = 'lambda.min'), census_test$above_50k)
auc(predict(m_xgb, newdata = census_test_mat, s = 'lambda.min'), census_test$above_50k)

xrf has the highest accuracy among RuleFit models, nearly comparable to the tree ensemble. horserule delivered a model much faster, but its accuracy was no better than the LASSO fit. This is likely due to aggressive rule regularization, which may be tuned by altering the out of the box hyperparameters.

De-overlapping rules

Overlapped rules occur when two or more rules belonging to the same subspace are not in mutual exclusion; de-overlapping guarantees that all rules belonging to the same subspace are in mutual exclusion. For example, the rules:

belong to the same subspace but are not in mutual exclusion (e.g. at age=45 & income=45,000). They can be de-overlapped to, for example:

which is one of infinite possible de-overlappings; this de-overlapping is ideal because * it is small * it allows all “effects” from the original 2 rules to be exactly captured (i.e. none of the boundaries are broken)

Model	Time (s)
xrf	73
pre	211*
horserule	29

Model	AUC
xrf	.924
pre	.906*
horserule	.892
xgboost	.926
glmnet	.892

The following example is somewhat contrived (in that it only uses one feature), but demonstrates how de-overlapping can prove useful in interpreting your model. To de-overlap the derived ruleset, simply specify deoverlap=TRUE:

coef_overlap %>%
 filter(coefficient_lambda.1se != 0) %>%
 arrange(rule)

Notice that the rules are not in exclusion. To understand the impact of capital gain on income, we have to add up no less than 7 coefficients for any value of capital gain. Matters are made more confusing by effectively 0 coefficients on many of the ‘>=’ rules, likely a numerical issue in the LASSO as a result of the substantial collinearity.

coef_deoverlap %>%
 filter(coefficient_lambda.1se != 0) %>%
 arrange(term)

Effects are immediately available by doing a lookup in the exclusive rules. This is a great win for interpretability.

As mentioned above, this example is contrived in that it uses depth=1 trees (i.e. conjunctions of size 1). As depth increases, interpretability can suffer regardless de-overlapping if the final ruleset is non-sparse. However, for certain problems, particularly small depth or sparse effects, de-overlapping can be a boon for interpretability.

xrf

Install

About

Specifics to xrf

Comparison to alternatives

Example

Comparison

De-overlapping rules