This package has been created to create odds plot for the results of a logistic regression.
The package uses caret to train the model and the final model parameter is used to generate the application.
First we load the required packages. The example dataset we are going to use to work with OddsPlotty is the breast cancer data:
library(mlbench)
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(tibble)
library(ggplot2)
library(OddsPlotty)
library(e1071)
library(ggthemes)
Then we are going to load the data we need for the breast cancer data:
data("BreastCancer", package = "mlbench")
#Use complete cases of breast cancer
breast <- BreastCancer[complete.cases(BreastCancer), ] #Create a copy
breast <- breast[, -1]
head(breast, 10)
#> Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei
#> 1 5 1 1 1 2 1
#> 2 5 4 4 5 7 10
#> 3 3 1 1 1 2 2
#> 4 6 8 8 1 3 4
#> 5 4 1 1 3 2 1
#> 6 8 10 10 8 7 10
#> 7 1 1 1 1 2 10
#> 8 2 1 2 1 2 1
#> 9 2 1 1 1 2 1
#> 10 4 2 1 1 2 1
#> Bl.cromatin Normal.nucleoli Mitoses Class
#> 1 3 1 1 benign
#> 2 3 2 1 benign
#> 3 3 1 1 benign
#> 4 3 7 1 benign
#> 5 3 1 1 benign
#> 6 9 7 1 malignant
#> 7 3 1 1 benign
#> 8 3 1 1 benign
#> 9 1 1 5 benign
#> 10 2 1 1 benign
#Convert the class to a factor - Beningn (0) and Malignant (1)
breast$Class <- factor(breast$Class)
str(breast)
#> 'data.frame': 683 obs. of 10 variables:
#> $ Cl.thickness : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
#> $ Cell.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
#> $ Cell.shape : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
#> $ Marg.adhesion : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
#> $ Epith.c.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
#> $ Bare.nuclei : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
#> $ Bl.cromatin : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
#> $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
#> $ Mitoses : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
#> $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
This takes care of the class encoding but now we need to code the factors to numeric
for(i in 1:9) {
breast[, i] <- as.numeric(as.character(breast[, i]))
}
#Loops through the first columns - 1 to 9 and changes them from factors to a numerical representation
str(breast)
#> 'data.frame': 683 obs. of 10 variables:
#> $ Cl.thickness : num 5 5 3 6 4 8 1 2 2 4 ...
#> $ Cell.size : num 1 4 1 8 1 10 1 1 1 2 ...
#> $ Cell.shape : num 1 4 1 8 1 10 1 2 1 1 ...
#> $ Marg.adhesion : num 1 5 1 1 3 8 1 1 1 1 ...
#> $ Epith.c.size : num 2 7 2 3 2 7 2 2 2 2 ...
#> $ Bare.nuclei : num 1 10 2 4 1 10 10 1 1 1 ...
#> $ Bl.cromatin : num 3 3 3 3 3 9 3 3 1 2 ...
#> $ Normal.nucleoli: num 1 2 1 7 1 7 1 1 1 1 ...
#> $ Mitoses : num 1 1 1 1 1 1 1 1 5 1 ...
#> $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
This has now changed the data into a numerical value and this can now be used in the GLM model.
I will use Caret to train the Generalised Linear Model (GLM) aka Logistic Regression, as this is the package that best supports the odds plot statistics. Please note: I am training on the full dataset and not undertaking a data partitioning method, as perhaps seen in logistic regression.
library(caret)
glm_model <- caret::train(Class ~ .,
data = breast,
method = "glm",
family = "binomial")
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm_model)
#>
#> Call:
#> NULL
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -3.4841 -0.1153 -0.0619 0.0222 2.4698
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -10.10394 1.17488 -8.600 < 2e-16 ***
#> Cl.thickness 0.53501 0.14202 3.767 0.000165 ***
#> Cell.size -0.00628 0.20908 -0.030 0.976039
#> Cell.shape 0.32271 0.23060 1.399 0.161688
#> Marg.adhesion 0.33064 0.12345 2.678 0.007400 **
#> Epith.c.size 0.09663 0.15659 0.617 0.537159
#> Bare.nuclei 0.38303 0.09384 4.082 4.47e-05 ***
#> Bl.cromatin 0.44719 0.17138 2.609 0.009073 **
#> Normal.nucleoli 0.21303 0.11287 1.887 0.059115 .
#> Mitoses 0.53484 0.32877 1.627 0.103788
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 884.35 on 682 degrees of freedom
#> Residual deviance: 102.89 on 673 degrees of freedom
#> AIC: 122.89
#>
#> Number of Fisher Scoring iterations: 8
Once the model is trained we can inspect the results with OddsPlotty:
The below shows how to visualise and expose the plot from the saved list in OddsPlotty.
#> Waiting for profiling to be done...
Each odds plot has an associated tibble under the hood for querying. To access the tibble use:
#> # A tibble: 9 x 4
#> OR lower upper vars
#> <dbl> <dbl> <dbl> <chr>
#> 1 1.71 1.32 2.31 Cl.thickness
#> 2 0.994 0.674 1.55 Cell.size
#> 3 1.38 0.862 2.16 Cell.shape
#> 4 1.39 1.10 1.80 Marg.adhesion
#> 5 1.10 0.805 1.50 Epith.c.size
#> 6 1.47 1.23 1.78 Bare.nuclei
#> 7 1.56 1.13 2.23 Bl.cromatin
#> 8 1.24 0.998 1.56 Normal.nucleoli
#> 9 1.71 0.993 3.02 Mitoses
Additional parameters for the plot can be fed in:
#> Waiting for profiling to be done...
Another example of how to use a different theme:
library(OddsPlotty)
library(ggthemes)
plotty <- OddsPlotty::odds_plot(glm_model$finalModel,
title = "Odds Plot with ggthemes Tufte Theme",
subtitle = "Showing odds of cancer based on various factors",
point_col = "#00f2ff",
error_bar_colour = "black",
point_size = .5,
error_bar_width = .8,
h_line_color = "red")
#> Waiting for profiling to be done...
plotty$odds_plot + ggthemes::theme_tufte()
As TidyModels utilises the underlying lm and glm packages, which CARET also uses, then the way to train with a TidyModels wrapper would be as follows. I will use the cancer data we have been working with for consistency.
The model requires the raw odds, not the Log Odds, as these get converted in the package. Please make sure that you set the exponentiate parameter to FALSE.
library(tidymodels)
#> -- Attaching packages -------------------------------------- tidymodels 0.1.1 --
#> v broom 0.7.2 v recipes 0.1.15
#> v dials 0.0.9 v rsample 0.0.8
#> v dplyr 1.0.4 v tidyr 1.1.2
#> v infer 0.5.3 v tune 0.1.2
#> v modeldata 0.1.0 v workflows 0.2.1
#> v parsnip 0.1.5 v yardstick 0.0.7
#> v purrr 0.3.4
#> -- Conflicts ----------------------------------------- tidymodels_conflicts() --
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x purrr::lift() masks caret::lift()
#> x yardstick::precision() masks caret::precision()
#> x yardstick::recall() masks caret::recall()
#> x yardstick::sensitivity() masks caret::sensitivity()
#> x yardstick::specificity() masks caret::specificity()
#> x recipes::step() masks stats::step()
#> x tune::tune() masks e1071::tune()
fitted_logistic_model<- logistic_reg() %>%
# Set the engine
set_engine("glm") %>%
# Set the mode - this will always be classification for logstic regression
set_mode("classification") %>%
# Fit the model
fit(Class ~ ., data = breast)
The next step would be to expose fitted_logistic_model$fit
list item. The fit parameter produces the same output as we would expect from the CARET package’s finalModel
we have used in the above caret cases.
In terms of Machine Learning, you could visualise the Odds Plots just on the training set, the only additional steps would be to add a testing split.
To visualise the odds plot with a TidyModels object can be achieved by exposing the fit
list item from the trained TidyModels object (see example below):
# Create odds plot for TidyModels object
tidy_odds_plot <- OddsPlotty::odds_plot(fitted_logistic_model$fit,
title="TidyModels Odds Plot",
point_col = "#6b95ff",
h_line_color = "red")
#> Waiting for profiling to be done...
# Output plot and data table
tidy_odds_plot$odds_plot + ggthemes::theme_gdocs()+ #Use ggthemes
theme(legend.position="none") #Turn off legend
#Generate tibble returning exp(odds) and
tidy_odds_plot$odds_data
#> # A tibble: 9 x 4
#> OR lower upper vars
#> <dbl> <dbl> <dbl> <chr>
#> 1 1.71 1.32 2.31 Cl.thickness
#> 2 0.994 0.674 1.55 Cell.size
#> 3 1.38 0.862 2.16 Cell.shape
#> 4 1.39 1.10 1.80 Marg.adhesion
#> 5 1.10 0.805 1.50 Epith.c.size
#> 6 1.47 1.23 1.78 Bare.nuclei
#> 7 1.56 1.13 2.23 Bl.cromatin
#> 8 1.24 0.998 1.56 Normal.nucleoli
#> 9 1.71 0.993 3.02 Mitoses
This package was created by Gary Hutson https://twitter.com/StatsGary and the package remains part of his work.