rfPermute is a package that provides a variety of functions for evaluating the results of Random Forest classification and regression models. The central feature of the package is the estimation of statistical significance (p-values) of variable importance measures as generated by the randomForest package by Andy Liaw and Matthew Wiener.
To install rfPermute from CRAN:
install.packages('rfPermute')
To install the latest version from GitHub:
# make sure you have devtools installed
if (!require('devtools')) install.packages('devtools')
# install from GitHub
devtools::install_github('EricArcher/rfPermute')
If you do not currently have randomForest installed, installing rfPermute will install it for you.
To start, let's create a simple randomForest classification model that we can use throughout this tutorial. The model will use the mtcars dataset that ships with R. We'll build a model to classify whether a car has an automatic or manual transmission (column am) based on the other features of the cars. We'll specify that we want importance scores returned (importance = TRUE) and use a smaller number of trees (ntree = 300), but will leave the rest of the parameters at their defaults.
library(randomForest)
data(mtcars)
# copy the mtcars data frame so we can modify it
df <- mtcars
# recode the transmission column labels to be more informative
df$am <- factor(c("Automatic", "Manual")[df$am + 1])
# straight Random Forest classification model with random sampling with replacement
rf <- randomForest(am ~ ., df, ntree = 300, importance = TRUE)
rf
Call:
randomForest(formula = am ~ ., data = df, ntree = 300, importance = TRUE)
Type of random forest: classification
Number of trees: 300
No. of variables tried at each split: 3
OOB estimate of error rate: 12.5%
Confusion matrix:
Automatic Manual class.error
Automatic 17 2 0.1053
Manual 2 11 0.1538
We see that the model seems to do a relatively good classification job with an overall OOB error rate of 0.125.
The most important diagnostic for a Random Forest model is the trace plot. This shows how the error rate (OOB error for classification models, MSE for regression models) changes as trees are added to the forest. The forest should have enough trees that the error rate has stabilized. The plotTrace() function is a rewrite of randomForest::plot.randomForest() using ggplot2 graphics, with a legend added:
library(rfPermute)
plotTrace(rf)
Here, it looks stable over 300 trees, but we should run it for a few more to make sure. We’ll increase the number of trees to 1000:
rf <- randomForest(am ~ ., df, importance = TRUE, ntree = 1000)
rf
Call:
randomForest(formula = am ~ ., data = df, importance = TRUE, ntree = 1000)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 3
OOB estimate of error rate: 12.5%
Confusion matrix:
Automatic Manual class.error
Automatic 17 2 0.1053
Manual 2 11 0.1538
Notice that this does not appreciably change the confusion matrix, but the trace (below) shows good stability by the end of the run:
plotTrace(rf)
Another feature to examine is the rate at which cases are selected for use in the model. We want to make sure that enough trees have been built that each sample's usage frequency (the fraction of trees in which it is "inbag") is close to its expectation. We can see the distribution of this inbag rate with the plotInbag() function. To demonstrate its use, we will build three standard Random Forest models with an increasing number of trees. The first has only 50 trees:
rf.50 <- randomForest(am ~ ., data = df, ntree = 50)
plotInbag(rf.50)
Since we are sampling at random with replacement, the expectation is that a sample will be in approximately 63% of the trees. The red line shows this expectation, with a fairly wide distribution around this value in this particular model.
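The ~63% expectation follows from bootstrap sampling: with n cases drawn with replacement, the probability that a given case appears in a tree is 1 - (1 - 1/n)^n, which approaches 1 - 1/e as n grows. A quick base-R check:

```r
# expected inbag probability under bootstrap sampling (with replacement):
# P(inbag) = 1 - (1 - 1/n)^n, approaching 1 - 1/e as n grows
n <- nrow(mtcars)                 # 32 cases
p.inbag <- 1 - (1 - 1/n)^n
round(p.inbag, 3)                 # ~0.638 for n = 32
round(1 - exp(-1), 3)             # limiting value: ~0.632
```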
Here is a summary of the inbag rates for the model with 50 trees:
summary((rf.50$ntree - rf.50$oob.times) / rf.50$ntree)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.440 0.620 0.660 0.637 0.680 0.720
The difference in this distribution can be seen if we increase the number of trees to 200:
rf.200 <- randomForest(am ~ ., data = df, ntree = 200)
plotInbag(rf.200)
Here we see the x-axis range narrowing, indicating that the variance of the sample inbag rates is shrinking and the rates are converging on the expected value:
summary((rf.200$ntree - rf.200$oob.times) / rf.200$ntree)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.580 0.619 0.645 0.638 0.655 0.695
Finally, as an extreme case, here is a 10,000-tree model:
rf.10k <- randomForest(am ~ ., data = df, ntree = 10000)
plotInbag(rf.10k)
We see that the variance in inbag rates has narrowed considerably. Ideally, we want this variance to be relatively small to ensure that most samples appear in a high percentage of the trees in the forest:
summary((rf.10k$ntree - rf.10k$oob.times) / rf.10k$ntree)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.628 0.637 0.639 0.638 0.640 0.646
If you have classes with unequal sample sizes, the default settings will produce a model that tends to perform better on the larger class, regardless of how well the predictors can classify individual cases. To avoid this bias, it is suggested that a balanced model be created, where the classes are represented by an equal number of samples in each tree and sampling is done without replacement. The balancedSampsize() function is provided to create such a sampsize vector. By default, it selects half of the smallest class size as the sample size for every class:
sampsize <- balancedSampsize(df$am)
sampsize
Automatic Manual
7 7
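The default is easy to reproduce by hand; here is a hypothetical base-R sketch, assuming (from the output above) that the half is rounded up:

```r
# hypothetical base-R equivalent of balancedSampsize()'s default:
# half the smallest class size (rounded up) for every class
balanced.n <- function(y) {
  freq <- table(y)
  n <- ceiling(min(freq) / 2)
  setNames(rep(n, length(freq)), names(freq))
}

am <- factor(c("Automatic", "Manual")[mtcars$am + 1])
balanced.n(am)   # Automatic: 7, Manual: 7 (smallest class is Manual, n = 13)
```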
Here is the new balanced model:
rf.balanced <- randomForest(am ~ ., data = df, replace = FALSE, sampsize = sampsize, ntree = 1000)
confusionMatrix(rf.balanced)
Automatic Manual pct.correct LCI_0.95 UCI_0.95
Automatic 16 3 84.21 60.42 96.62
Manual 2 11 84.62 54.55 98.08
Overall NA NA 84.38 67.21 94.72
If you have a model with stratified sampling, or you sample without replacement, you will need to specify these conditions to the plotInbag() function so the expectations can be correctly computed.
plotTrace(rf.balanced)
plotInbag(rf.balanced, replace = FALSE, sampsize = sampsize)
Notice how there are two modes in the distribution here because of the stratified sampling without replacement. We see that the distribution is relatively well centered on the expectations. Along with the trace plot, this indicates that the model is stable and cases are being used in a sufficient number of trees that casewise variability in the predictors is being incorporated.
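The two modes follow directly from the stratified design: drawing 7 cases without replacement from each class means a case's expected inbag rate is sampsize divided by its class size, which differs between the two classes:

```r
# expected inbag rate per class when drawing 7 cases without replacement
# from each class: sampsize / class size -- hence the two modes
class.n <- table(factor(c("Automatic", "Manual")[mtcars$am + 1]))
round(7 / class.n, 3)   # Automatic ~0.368 (7/19), Manual ~0.538 (7/13)
```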
For many uses, the measurements of variable (predictor) importance from a Random Forest model are of interest. These give a relative ranking of how important each predictor is to the model classification ability. Here is the standard output of importance scores from randomForest:
round(importance(rf), 3)
Automatic Manual MeanDecreaseAccuracy MeanDecreaseGini
mpg 4.177 8.294 8.255 1.225
cyl 3.129 3.611 4.412 0.280
disp 7.629 9.857 11.190 1.645
hp 3.084 5.354 5.354 0.827
drat 8.666 13.386 14.162 1.990
wt 15.982 23.012 24.273 3.770
qsec 6.854 4.898 6.995 1.433
vs 5.110 1.338 4.435 0.193
gear 14.842 18.964 19.470 3.209
carb -0.690 3.723 1.951 0.382
… and a graphical visualization of their distribution:
varImpPlot(rf)
We see that wt and gear are the most important predictor variables, while cyl, vs, and carb are the least important. These variable importance scores are derived by randomly permuting each predictor variable, one at a time, in the out-of-bag (OOB) data for each tree and summarizing how much worse the model prediction becomes. A predictor that is very important to the overall model will degrade performance when permuted. Conversely, a variable with no predictive power (not strongly correlated with the response variable anywhere in the parameter space) will hardly affect the predictive capability of the model, as its influence was low to begin with. The degree to which performance degrades is the Mean Decrease in Accuracy summarized in the importance scores.
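As a rough illustration of the idea (a whole-model simplification, not the exact per-tree OOB computation randomForest performs), one can permute a single predictor and watch the prediction accuracy degrade:

```r
library(randomForest)

set.seed(42)
df <- mtcars
df$am <- factor(c("Automatic", "Manual")[df$am + 1])
rf <- randomForest(am ~ ., df, ntree = 500)

# compare accuracy with a predictor intact vs. permuted
acc <- function(d) mean(predict(rf, d) == d$am)
df.perm <- df
df.perm$wt <- sample(df.perm$wt)   # destroy wt's relationship to am
c(intact = acc(df), wt.permuted = acc(df.perm))
```

Accuracy typically drops when an important predictor like wt is shuffled, while shuffling an unimportant one changes little.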
However, there is nothing in this output that suggests which of the importance scores are "significant" and which are not. That is, where is the line between predictors that really influence the model's accuracy and those that provide no more predictive power than a random predictor? By "random" predictor, we mean a predictor with values distributed at random relative to the response variable (here, transmission type, am). Answering this requires creating a null distribution of importance scores against which we can compare the observed importance scores. Because the importance scores come from predictions made by the model taking the covariance of predictors into account, we need to maintain that covariance structure when creating this null distribution.
In rfPermute, we satisfy this requirement by permuting only the response variable and then refitting the same model. Variable importance scores are computed as before; however, in this case they are computed as if the entire matrix of predictors were randomly distributed with respect to the response. Repeating this procedure multiple times generates a null distribution of model-specific and, in the case of classification models, class-specific importance scores for each predictor variable. An observed importance score for which only a small fraction of the null samples are as large or larger (a p-value, typically 5%) can be considered significantly more important than a random predictor.
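Conceptually, the procedure amounts to a loop like the following sketch (rfPermute() does this for you, with proper bookkeeping and parallelization, so you would not write it yourself):

```r
library(randomForest)

df <- mtcars
df$am <- factor(c("Automatic", "Manual")[df$am + 1])

# null distribution of wt's MeanDecreaseAccuracy: permute the response,
# refit, and record the importance score -- a conceptual sketch only
null.imp <- replicate(50, {
  df.null <- df
  df.null$am <- sample(df.null$am)   # break the response/predictor link
  rf.null <- randomForest(am ~ ., df.null, ntree = 100, importance = TRUE)
  importance(rf.null)["wt", "MeanDecreaseAccuracy"]
})

rf.obs <- randomForest(am ~ ., df, ntree = 100, importance = TRUE)
obs <- importance(rf.obs)["wt", "MeanDecreaseAccuracy"]
mean(null.imp >= obs)   # fraction of null values >= observed: a rough p-value
```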
rfPermute computes these null distributions and p-values through an interface that wraps randomForest. An rfPermute model is run with a call to rfPermute(). The model construction and arguments are the same as a call to randomForest(). rfPermute() adds an argument specifying the number of permutation replicates for the null distributions (num.rep). Because this process runs a full Random Forest model many times, it can be computationally expensive, so the runs can be distributed over multiple cores, specified with the num.cores argument. Here is an example with the same model as above:
rp <- rfPermute(am ~ ., df, ntree = 300, num.rep = 1000, num.cores = 6)
rp
An rfPermute model
Type of random forest: classification
Number of trees: 300
No. of variables tried at each split: 3
No. of permutation replicates: 1000
Start time: 2022-04-18 10:45:22
End time: 2022-04-18 10:45:28
Run time: 5.79 secs
Automatic Manual pct.correct LCI_0.95 UCI_0.95
Automatic 17 2 89.5 66.9 98.7
Manual 2 11 84.6 54.6 98.1
Overall NA NA 87.5 71.0 96.5
This output gives summary information about the rfPermute model that we’ll cover in more detail later. However, notice that the core confusion matrix is the same as the original randomForest model showing that the base classification model is the same.
NOTE: Like the randomForest() function, rfPermute() can be run using a formula interface as above, or with the default matrix and vector specification interface. All arguments available in randomForest() are also available in rfPermute(), and are specified in the same way. The only exception is that rfPermute() is always run with the importance = TRUE argument set, and the importance matrix is always available in the output. See ?randomForest for more information on specifying models.
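For reference, the matrix/vector interface for the model above would look something like this sketch (num.rep kept small here so it runs quickly):

```r
library(rfPermute)

df <- mtcars
df$am <- factor(c("Automatic", "Manual")[df$am + 1])

# the same model via the default matrix/vector interface
x <- df[, setdiff(names(df), "am")]
y <- df$am
rp.xy <- rfPermute(x = x, y = y, ntree = 300, num.rep = 100)
```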
To extract the variable importance scores from an rfPermute model, we again use the importance() function, which has a method for rfPermute model objects. This function returns a matrix with the importance scores of each type followed by their p-values.
imp <- importance(rp)
round(imp, 3)
Automatic Automatic.pval Manual Manual.pval MeanDecreaseAccuracy
wt 10.074 0.002 12.256 0.001 13.017
gear 7.088 0.002 10.090 0.001 10.318
disp 5.527 0.031 6.443 0.013 7.140
drat 5.321 0.052 6.120 0.028 7.123
mpg 3.625 0.129 3.212 0.070 4.610
hp 3.085 0.181 3.972 0.039 4.400
qsec 4.118 0.097 2.312 0.150 4.120
vs 3.267 0.046 1.469 0.124 3.271
cyl 1.546 0.228 -0.051 0.358 1.073
carb 1.002 0.286 0.799 0.255 1.057
MeanDecreaseAccuracy.pval MeanDecreaseGini MeanDecreaseGini.pval
wt 0.001 4.166 0.001
gear 0.001 2.697 0.001
disp 0.017 1.829 0.714
drat 0.022 1.778 0.671
mpg 0.070 0.993 1.000
hp 0.070 1.069 0.998
qsec 0.090 1.698 0.962
vs 0.037 0.189 0.537
cyl 0.212 0.175 0.949
carb 0.246 0.353 1.000
By default, the matrix is sorted by the MeanDecreaseAccuracy column, but this can be changed with the sort.by argument to the function.
The importance scores can also be visualized as a bar chart or density chart with plotImportance():
plotImportance(rp)
This shows a bar graph of the importance score for each type, with significant scores (p-values <= 0.05) colored red. We can see that not all of the predictors are significant. Additionally, although the distributions of importance scores are the same for both transmission classes, there is a difference in which predictors are significantly important for classifying manual vs. automatic transmissions.
Note that this function as well as many others can be used with both rfPermute and randomForest models:
plotImportance(rf)
Importance scores can also be visualized with heatmaps. Here is a heatmap of the original rfPermute model for just the class-specific importance scores.
plotImportance(rp, plot.type = "heatmap", imp.type = c("Manual", "Automatic"), size = 6)
In this default option, the heatmap shows importance scores scaled by ranked importance. The white diamonds in the heatmap indicate the significant (p < 0.05) scores.
The null distributions of importance values are stored in the rfPermute object and can be viewed with the plotNull() function, which also shows where the observed value falls relative to the distribution.
plotNull(rp, imp.type = "MeanDecreaseAccuracy", preds = c("disp", "carb"))
For classification models, it can be useful to inspect the distributions of the raw predictor values within each class, ordered by variable importance. The plotImpPreds() function provides this visualization. It requires the original data frame (df here), and you must specify the column name in that data frame that gives the response class (am here).
plotImpPreds(rp, df, "am")
This produces violin plots for each predictor variable (the panels), separated by the class column (the x-axis). The predictors are sorted from top left to lower right by their importance scores. Here we see that most of the difference between transmission types is in wt, while there is more overlap in a variable like cyl.
The most basic summary is the confusion matrix, which can be computed with the confusionMatrix() function:
confusionMatrix(rp)
Automatic Manual pct.correct LCI_0.95 UCI_0.95
Automatic 17 2 89.47 66.86 98.70
Manual 2 11 84.62 54.55 98.08
Overall NA NA 87.50 71.01 96.49
In this output, the first two rows give information for the two original classes (automatic and manual transmissions). The third row gives summary information for the model overall. The first two columns contain the actual confusion matrix: the values are the number of samples originally from the class in each row that were classified by the model into the class in each column. That is, the rows are the original class and the columns are the predicted class, so the diagonal holds the number of samples correctly classified in each class. The remaining columns are:
- pct.correct: the percent of samples in the class (row) that were correctly classified.
- LCI_0.95 and UCI_0.95: the lower and upper bounds of the 95% confidence interval of the percent correctly classified. These values come from a binomial distribution and represent the uncertainty in the estimate of the classification accuracy of the class, given the sample size of that class.

The confusionMatrix() function also provides the ability to compute the binomial probability that pct.correct is greater than a given threshold value, taking the sample size of the class into account. This is controlled with the threshold argument.
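As an aside, the confidence intervals above can be reproduced in base R with binom.test(), assuming they are exact (Clopper-Pearson) binomial intervals:

```r
# 17 of 19 automatic-transmission cases were correctly classified;
# binom.test() gives an exact binomial confidence interval on that rate
ci <- binom.test(17, 19)$conf.int
round(ci * 100, 2)   # should match LCI_0.95 / UCI_0.95 for the Automatic row
```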
# threshold = 0.6
confusionMatrix(rp, threshold = 0.6)
Automatic Manual pct.correct LCI_0.95 UCI_0.95 Pr.gt_0.6
Automatic 17 2 89.47 66.86 98.70 0.267616
Manual 2 11 84.62 54.55 98.08 0.003006
Overall NA NA 87.50 71.01 96.49 0.999864
# threshold = 0.8
confusionMatrix(rp, threshold = 0.8)
Automatic Manual pct.correct LCI_0.95 UCI_0.95 Pr.gt_0.8
Automatic 17 2 89.47 66.86 98.70 5.610e-04
Manual 2 11 84.62 54.55 98.08 2.650e-08
Overall NA NA 87.50 71.01 96.49 9.069e-01
# threshold = 0.95
confusionMatrix(rp, threshold = 0.95)
Automatic Manual pct.correct LCI_0.95 UCI_0.95 Pr.gt_0.95
Automatic 17 2 89.47 66.86 98.70 7.643e-12
Manual 2 11 84.62 54.55 98.08 3.594e-20
Overall NA NA 87.50 71.01 96.49 7.381e-02
Notice that as threshold is increased, the value of the last column (Pr.gt_0.##) decreases. This is expected, as the probability that the true percent correctly classifiable exceeds the threshold should become smaller as the threshold approaches 1.
The confusion matrix can also be visualized as a heatmap with the plotConfMat() function:
plotConfMat(rp)
The classPriors() function compares the model's classification rates to what would be expected at random. Its output contains:

- prior: the expected percent of samples that would be correctly classified if the model were classifying samples at random. This is based on the relative sample sizes of the classes. If all classes have the same sample size, the priors will all be 1 / k, where k is the number of classes.
- class_p.value: the binomial probability of observing the number of correctly classified samples in each class, given the size of that class, under the assumption that prior is the correct rate of correct classification.

classPriors(rp, sampsize = NULL)
prior class_p.value
Automatic 59.38 6.992e-04
Manual 40.62 1.642e-04
Overall 51.76 3.150e-06
classPriors(rp, sampsize = c(4, 4))
prior class_p.value
Automatic 50 3.815e-05
Manual 50 1.709e-03
Overall 50 1.278e-06
classPriors(rp, sampsize = c(4, 1))
prior class_p.value
Automatic 80.00 8.287e-02
Manual 20.00 4.342e-08
Overall 55.62 2.019e-05
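With the default sampsize, the class priors in the first output above follow directly from the class frequencies, and the overall prior matches the expected random-agreement rate (the sum of squared class frequencies). A base-R check, assuming that construction:

```r
# class frequencies in the mtcars transmission data
p <- prop.table(table(factor(c("Automatic", "Manual")[mtcars$am + 1])))
round(100 * p, 2)          # Automatic 59.38, Manual 40.62
round(100 * sum(p^2), 2)   # 51.76 -- the Overall prior above
```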
In classification models, we often want to explore the individual classifications and their probabilities of assignment. We can extract a data frame of this information with the casePredictions() function:
casePredictions(rp)
id original predicted is.correct Automatic Manual
1 Mazda RX4 Manual Manual TRUE 0.194175 0.80583
2 Mazda RX4 Wag Manual Manual TRUE 0.225225 0.77477
3 Datsun 710 Manual Manual TRUE 0.326531 0.67347
4 Hornet 4 Drive Automatic Automatic TRUE 0.811321 0.18868
5 Hornet Sportabout Automatic Automatic TRUE 0.972222 0.02778
6 Valiant Automatic Automatic TRUE 0.976000 0.02400
7 Duster 360 Automatic Automatic TRUE 0.957627 0.04237
8 Merc 240D Automatic Automatic TRUE 0.593496 0.40650
9 Merc 230 Automatic Manual FALSE 0.444444 0.55556
10 Merc 280 Automatic Automatic TRUE 0.747899 0.25210
11 Merc 280C Automatic Automatic TRUE 0.790476 0.20952
12 Merc 450SE Automatic Automatic TRUE 1.000000 0.00000
13 Merc 450SL Automatic Automatic TRUE 1.000000 0.00000
14 Merc 450SLC Automatic Automatic TRUE 1.000000 0.00000
15 Cadillac Fleetwood Automatic Automatic TRUE 1.000000 0.00000
16 Lincoln Continental Automatic Automatic TRUE 1.000000 0.00000
17 Chrysler Imperial Automatic Automatic TRUE 1.000000 0.00000
18 Fiat 128 Manual Manual TRUE 0.062500 0.93750
19 Honda Civic Manual Manual TRUE 0.033333 0.96667
20 Toyota Corolla Manual Manual TRUE 0.161616 0.83838
21 Toyota Corona Automatic Manual FALSE 0.303571 0.69643
22 Dodge Challenger Automatic Automatic TRUE 0.980000 0.02000
23 AMC Javelin Automatic Automatic TRUE 1.000000 0.00000
24 Camaro Z28 Automatic Automatic TRUE 0.767857 0.23214
25 Pontiac Firebird Automatic Automatic TRUE 0.983333 0.01667
26 Fiat X1-9 Manual Manual TRUE 0.009346 0.99065
27 Porsche 914-2 Manual Manual TRUE 0.057143 0.94286
28 Lotus Europa Manual Manual TRUE 0.119048 0.88095
29 Ford Pantera L Manual Automatic FALSE 0.651376 0.34862
30 Ferrari Dino Manual Manual TRUE 0.271845 0.72816
31 Maserati Bora Manual Automatic FALSE 0.757282 0.24272
32 Volvo 142E Manual Manual TRUE 0.264706 0.73529
This data frame gives the id of each case, the original class of that case, the predicted class of the case, and whether or not the assignment is.correct. The remaining columns give the proportion of trees in the forest that voted for each class when the sample was OOB. These can be thought of as the assignment probabilities for each case and will sum to 1 across all classes.
A distribution of these assignment probabilities can be visualized with the plotVotes() function:
plotVotes(rp)
The individual samples are arrayed along the x-axis, sorted by class and probability of assignment. The y-axis shows the probability of assignment to each class which is denoted by color in the figure. Models showing strong assignments will have a steep break at the class panel with most individuals in each class having high probabilities to their original class.
The cutoff probability for assignment defaults to 1/k, where k is the number of classes. For example, in a two-class model this is 0.5, while in a three-class model it is 0.333. Whichever class receives that fraction of votes or greater in the forest is the predicted class. Sometimes it is useful to explore how the class-specific classification rates change if this cutoff is made more stringent, as this tells us how strongly cases were assigned to classes. We can get a sense of this by looking at the percent that would be correctly classified at different cutoffs with the pctCorrect() function:
pctCorrect(rp, pct = c(0.6, 0.8, 0.95))
class pct.correct_0.6 pct.correct_0.8 pct.correct_0.95
1 Automatic 84.21 68.42 63.16
2 Manual 84.62 53.85 15.38
3 Overall 84.38 62.50 43.75
Notice that as the cutoff increases, the percent correct at each cutoff decreases, but it is clear that a higher proportion of automatic transmission cases are correctly classified with a higher probability. That is, the model is more certain of these classifications than of the manual transmission cases.
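These summaries can also be reproduced by hand from casePredictions(), assuming (as the values above suggest) that a case counts as correct at a cutoff when the vote share for its true class meets the cutoff:

```r
library(rfPermute)

df <- mtcars
df$am <- factor(c("Automatic", "Manual")[df$am + 1])
rp <- rfPermute(am ~ ., df, ntree = 300, num.rep = 100)

# percent of cases per class whose vote share for the TRUE class >= 0.8
# (an assumption about pctCorrect()'s definition, for illustration)
cp <- casePredictions(rp)
true.vote <- ifelse(cp$original == "Automatic", cp$Automatic, cp$Manual)
tapply(true.vote >= 0.8, cp$original, mean) * 100
```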
A final exploratory feature of assignment probabilities is an examination of the difference in distributions of correct classifications and misclassifications by class. In models with strong performance, most misclassified individuals are expected to be weakly classified (have low assignment probabilities) to the wrong class, while correctly classified individuals are strongly classified to their own class. The plotPredictedProbs() function lets us explore these distributions:
plotPredictedProbs(rp)
We can see that most of the automatic transmission cases are correctly classified with assignment probabilities greater than 0.7, while the misclassified cases have probabilities below this value. The same pattern holds for the manual transmission classifications.
We can visualize the distribution of individual cases in the Random Forest with a proximity plot. Case proximity is a summary of the relative nearness of cases in trees across the whole forest and is produced by setting proximity = TRUE in a call to rfPermute() or randomForest(). The plotProximity() function then visualizes this matrix using multi-dimensional scaling:
rp <- rfPermute(am ~ ., df, num.rep = 1000, proximity = TRUE, num.cores = 6)
plotProximity(rp)
Here, the color of the inner dot indicates the original class, while the outer circle indicates the class to which the case was classified. In addition, 95% confidence ellipses are added for each class.
If you wish to combine rfPermute models (say, runs from different systems), this can be done with the combineRP() function, a wrapper for randomForest::combine() that also combines the null distributions and recalculates the relevant p-values:
data(iris)
rp1 <- rfPermute(Species ~ ., iris, ntree = 10, norm.votes = FALSE)
rp2 <- rfPermute(Species ~ ., iris, ntree = 10, norm.votes = FALSE)
rp3 <- rfPermute(Species ~ ., iris, ntree = 10, norm.votes = FALSE)
print(rp1)
An rfPermute model
Type of random forest: classification
Number of trees: 10
No. of variables tried at each split: 2
No. of permutation replicates: 100
Start time: 2022-04-18 10:45:57
End time: 2022-04-18 10:45:58
Run time: 1.02 secs
setosa versicolor virginica pct.correct LCI_0.95 UCI_0.95
setosa 50 0 0 100 92.9 100.0
versicolor 0 48 2 96 86.3 99.5
virginica 0 7 43 86 73.3 94.2
Overall NA NA NA 94 88.9 97.2
print(rp2)
An rfPermute model
Type of random forest: classification
Number of trees: 10
No. of variables tried at each split: 2
No. of permutation replicates: 100
Start time: 2022-04-18 10:45:58
End time: 2022-04-18 10:45:59
Run time: 1 secs
setosa versicolor virginica pct.correct LCI_0.95 UCI_0.95
setosa 50 0 0 100.0 92.9 100.0
versicolor 0 48 2 96.0 86.3 99.5
virginica 0 7 42 85.7 72.8 94.1
Overall NA NA NA 94.0 88.8 97.2
print(rp3)
An rfPermute model
Type of random forest: classification
Number of trees: 10
No. of variables tried at each split: 2
No. of permutation replicates: 100
Start time: 2022-04-18 10:46:00
End time: 2022-04-18 10:46:01
Run time: 0.996 secs
setosa versicolor virginica pct.correct LCI_0.95 UCI_0.95
setosa 50 0 0 100.0 92.9 100.0
versicolor 0 47 3 94.0 83.5 98.7
virginica 0 7 43 86.0 73.3 94.2
Overall NA NA NA 93.3 88.1 96.8
rp.all <- combineRP(rp1, rp2, rp3)
print(rp.all)
An rfPermute model
Type of random forest: classification
Number of trees: 30
No. of variables tried at each split: 2
No. of permutation replicates: 300
Start time: 2022-04-18 10:45:57
End time: 2022-04-18 10:45:57
Run time: 0 secs
setosa versicolor virginica pct.correct LCI_0.95 UCI_0.95
setosa 50 0 0 100.0 92.9 100.0
versicolor 0 48 2 96.0 86.3 99.5
virginica 0 5 45 90.0 78.2 96.7
Overall NA NA NA 95.3 90.6 98.1
The package also provides a helper function for preparing a data frame for a Random Forest classification model. It removes predictors that are constant and cases with missing data.
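That sort of cleanup is straightforward to sketch in base R (this is an illustrative stand-in, not the package function itself):

```r
# illustrative base-R cleanup: drop constant predictors, then drop
# rows containing missing values
prep.df <- function(d) {
  varies <- vapply(d, function(x) length(unique(x[!is.na(x)])) > 1, logical(1))
  d <- d[, varies, drop = FALSE]
  d[complete.cases(d), , drop = FALSE]
}

test.df <- data.frame(y = 1:5, x1 = c(1, 2, NA, 4, 5), x2 = rep(7, 5))
prep.df(test.df)   # x2 (constant) dropped; row 3 (NA in x1) removed
```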