Introducing OutlierTree

David Cortes

2024-09-05

This short vignette illustrates basic usage of the OutlierTree library for outlier detection, using the hypothyroid dataset which is bundled with it.

OutlierTree flags suspicious values within an observation and contrasts them against the normal values in a human-readable format, optionally adding the conditions in the data that make the observation more suspicious. It does so in much the same way one would do it manually: by checking extreme values in sorted order and filtering observations according to the values of other variables (e.g. whether some other variable is TRUE or FALSE).
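
As a rough illustration of that manual procedure, the base R sketch below filters a toy data frame by a logical column and then looks at the sorted values of a numeric column. The data and the names `toy`, `age.pregnant` and `suspect` are made up for this example, and the largest-gap rule is only a stand-in for the idea; it is not the criterion the library applies internally.

```r
# Toy data, invented for illustration only (not taken from the hypothyroid dataset)
toy <- data.frame(
    age      = c(25, 31, 28, 35, 74, 62, 58, 70, 66, 45),
    pregnant = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

# Step 1: sorted values of the whole column; 74 sits right next to 70 and 66
sort(toy$age)

# Step 2: filter on another variable, then sort again
age.pregnant <- sort(toy$age[toy$pregnant])
age.pregnant

# Step 3: find the largest gap between consecutive sorted values
gaps <- diff(age.pregnant)
largest.gap <- max(gaps)

# The value just above the largest gap is the one worth a second look
suspect <- age.pregnant[which.max(gaps) + 1]
suspect
```

Taken over the whole column, the value 74 is unremarkable; it only stands out once the rows are restricted to `pregnant == TRUE`, which is the kind of conditional outlier this vignette is about.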

For a full description of the procedure, see "Explainable outlier detection through decision tree conditioning".

A look at the dataset

This dataset describes hospital patients who might have hypo- or hyperthyroidism. Each observation is an anonymized patient whose demographic characteristics, drug intake, and hormone indicators were recorded, along with the judgement about their condition.

It contains many interesting outliers which have something obviously wrong when examined visually, but which would nevertheless be missed by other outlier detection methods.

library(outliertree)
data(hypothyroid)
summary(hypothyroid)
      age           sex       on.thyroxine    query.on.thyroxine
 Min.   :  1.00   F   :1817   Mode :logical   Mode :logical     
 1st Qu.: 36.00   M   : 849   FALSE:2442      FALSE:2733        
 Median : 54.00   NA's: 106   TRUE :330       TRUE :39          
 Mean   : 51.75                                                 
 3rd Qu.: 67.00                                                 
 Max.   :455.00                                                 
 NA's   :1                                                      
 on.antithyroid.medication    sick          pregnant       thyroid.surgery
 Mode :logical             Mode :logical   Mode :logical   Mode :logical  
 FALSE:2738                FALSE:2663      FALSE:1882      FALSE:2734     
 TRUE :34                  TRUE :109       TRUE :41        TRUE :38       
                                           NA's :849                      
                                                                          
                                                                          
                                                                          
 I131.treatment  query.hypothyroid query.hyperthyroid  lithium       
 Mode :logical   Mode :logical     Mode :logical      Mode :logical  
 FALSE:2724      FALSE:2611        FALSE:2600         FALSE:2758     
 TRUE :48        TRUE :161         TRUE :172          TRUE :14       
                                                                     
                                                                     
                                                                     
                                                                     
   tumor           goitre        hypopituitary     psych        
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:2747      FALSE:2701      FALSE:2771      FALSE:2638     
 TRUE :25        TRUE :71        TRUE :1         TRUE :134      
                                                                
                                                                
                                                                
                                                                
      TSH                T3             TT4             T4U        
 Min.   :  0.005   Min.   : 0.05   Min.   :  2.0   Min.   :0.3100  
 1st Qu.:  0.430   1st Qu.: 1.60   1st Qu.: 88.0   1st Qu.:0.8800  
 Median :  1.400   Median : 2.00   Median :104.0   Median :0.9800  
 Mean   :  4.509   Mean   : 2.03   Mean   :109.1   Mean   :0.9986  
 3rd Qu.:  2.600   3rd Qu.: 2.40   3rd Qu.:125.0   3rd Qu.:1.0800  
 Max.   :478.000   Max.   :10.60   Max.   :430.0   Max.   :2.1200  
 NA's   :277       NA's   :576     NA's   :179     NA's   :291     
 referral.source       diagnosis         FTI         
 other:1615      compensated: 154   Min.   :  2.381  
 STMW :  91      negative   :2553   1st Qu.: 92.857  
 SVHC : 275      primary    :  63   Median :107.175  
 SVHD :  31      secondary  :   2   Mean   :110.791  
 SVI  : 760                         3rd Qu.:124.272  
                                    Max.   :394.495  
                                    NA's   :292      

Finding outliers

otree <- outlier.tree(hypothyroid, nthreads=1)
Reporting top 8 outliers [out of 8 found]

row [531] - suspicious column: [hypopituitary] - suspicious value: [TRUE]
    distribution: 99.964% different [norm. obs: 2772]


row [623] - suspicious column: [age] - suspicious value: [455.00]
    distribution: 99.964% <= 94.00 - [mean: 51.60] - [sd: 18.98] - [norm. obs: 2770]


row [2230] - suspicious column: [T3] - suspicious value: [10.60]
    distribution: 99.951% <= 7.10 - [mean: 1.98] - [sd: 0.75] - [norm. obs: 2050]
    given:
        [query.hyperthyroid] = [FALSE]


row [1138] - suspicious column: [age] - suspicious value: [75.00]
    distribution: 95.122% <= 42.00 - [mean: 31.46] - [sd: 5.28] - [norm. obs: 39]
    given:
        [pregnant] = [TRUE]


row [2211] - suspicious column: [age] - suspicious value: [73.00]
    distribution: 95.122% <= 42.00 - [mean: 31.46] - [sd: 5.28] - [norm. obs: 39]
    given:
        [pregnant] = [TRUE]


row [1438] - suspicious column: [FTI] - suspicious value: [394.50]
    distribution: 99.618% <= 232.08 - [mean: 132.68] - [sd: 28.23] - [norm. obs: 261]
    given:
        [TT4] > [123.00] (value: 430.00)
        [referral.source] != [other] (value: STMW)


row [745] - suspicious column: [TT4] - suspicious value: [239.00]
    distribution: 98.571% <= 177.00 - [mean: 135.23] - [sd: 12.57] - [norm. obs: 69]
    given:
        [FTI] between (97.96, 128.12] (value: 112.74)
        [T4U] > [1.12] (value: 2.12)
        [age] > [55.00] (value: 87.00)


row [1412] - suspicious column: [TT4] - suspicious value: [430.00]
    distribution: 99.762% <= 230.00 - [mean: 111.88] - [sd: 31.88] - [norm. obs: 420]
    given:
        [T3] is NA

In other words, the model considers it abnormal to be pregnant at the age of 75, or not to be classified as hyperthyroid despite having very high thyroid hormone levels.
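
To get a feel for how far those values sit from their groups, one can standardize them with the mean and standard deviation printed in the report. This is plain arithmetic on the figures above, not a reproduction of the library's own scoring; the data frame `flagged` is just a scratch name for this sketch.

```r
# Values copied by hand from the printed report above
flagged <- data.frame(
    row   = c(623, 2230, 1138),
    value = c(455.00, 10.60, 75.00),
    mean  = c(51.60, 1.98, 31.46),
    sd    = c(18.98, 0.75, 5.28)
)

# Number of group standard deviations between each value and its group mean
flagged$z <- (flagged$value - flagged$mean) / flagged$sd
flagged
```

Each of the three values lies more than eight standard deviations above the mean of its group, and the age of 455 is more than twenty-one standard deviations away.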

A closer look at some of those outliers

A look at the distributions within the clusters in which some outliers were flagged:

# 'pregnant' has missing values; which() keeps only the rows that are TRUE
pregnant <- hypothyroid[which(hypothyroid$pregnant),]
hist(pregnant$age, breaks=50, col="navy",
     main="Age distribution among pregnant patients",
     xlab="Age")

non.hyperthyr <- hypothyroid[!hypothyroid$query.hyperthyroid,]
hist(non.hyperthyr$T3, breaks=50, col="darkred",
     main="T3 hormone levels\n(Non-hyperthyroidal patients)",
     xlab="T3 blood concentration")
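
The flagged rows themselves can be pulled up alongside those histograms. The sketch below assumes that the row numbers in the report index directly into the data frame, as the printed output suggests; the choice of columns and the names `flagged.rows` and `pregnant.idx` are arbitrary.

```r
library(outliertree)
data(hypothyroid)

# Rows reported as suspicious given pregnant = TRUE
flagged.rows <- hypothyroid[c(1138, 2211), c("age", "sex", "pregnant")]
flagged.rows

# Age range among the remaining pregnant patients, for comparison
pregnant.idx <- setdiff(which(hypothyroid$pregnant), c(1138, 2211))
range(hypothyroid$age[pregnant.idx], na.rm = TRUE)
```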

Handling results

The identified outliers, along with all the relevant information, are returned as a list of lists, which can be inspected manually to extract the exact conditions (see the documentation for more details).

The returned list nevertheless has a class of its own, which provides pretty-printing and slicing:

outliers <- predict(otree, hypothyroid, outliers_print=FALSE)
outliers[1:700]
Reporting top 2 outliers [out of 2 found]

row [531] - suspicious column: [hypopituitary] - suspicious value: [TRUE]
    distribution: 99.964% different [norm. obs: 2772]


row [623] - suspicious column: [age] - suspicious value: [455.00]
    distribution: 99.964% <= 94.00 - [mean: 51.60] - [sd: 18.98] - [norm. obs: 2770]
outliers[1138]
Reporting top 1 outliers [out of 1 found]

row [1138] - suspicious column: [age] - suspicious value: [75.00]
    distribution: 95.122% <= 42.00 - [mean: 31.46] - [sd: 5.28] - [norm. obs: 39]
    given:
        [pregnant] = [TRUE]
outliers[[1138]]
$suspicous_value
$suspicous_value$column
[1] "age"

$suspicous_value$value
[1] 75

$suspicous_value$decimals
[1] 0


$group_statistics
$group_statistics$upper_thr
[1] 42

$group_statistics$pct_below
[1] 0.9512195

$group_statistics$mean
[1] 31.46154

$group_statistics$sd
[1] 5.28078

$group_statistics$n_obs
[1] 39


$conditions
$conditions[[1]]
$conditions[[1]]$column
[1] "pregnant"

$conditions[[1]]$value_this
[1] TRUE

$conditions[[1]]$comparison
[1] "="

$conditions[[1]]$value_comp
[1] TRUE



$tree_depth
[1] 1

$uses_NA_branch
[1] FALSE

$outlier_score
[1] 0.01297346
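
The fields printed above can also be collected programmatically. The following is a minimal sketch that flattens the conditions of one flagged row into a data frame, relying only on the field names visible in the output above (including `suspicous_value`, spelled as the library prints it). Collapsing `value_comp` into one string is a precaution based on the assumption that other comparison types may carry more than one value; `entry` and `conds` are scratch names.

```r
library(outliertree)
data(hypothyroid)

# Same calls as earlier in this vignette (fitting prints its report again)
otree <- outlier.tree(hypothyroid, nthreads=1)
outliers <- predict(otree, hypothyroid, outliers_print=FALSE)

# Flatten the conditions of one flagged row into a data frame
entry <- outliers[[1138]]
conds <- do.call(rbind, lapply(entry$conditions, function(cond) {
    data.frame(
        column     = cond$column,
        comparison = cond$comparison,
        # Assumption: some comparisons may hold several values, so collapse them
        value_comp = paste(format(cond$value_comp), collapse=", "),
        value_this = paste(format(cond$value_this), collapse=", "),
        stringsAsFactors = FALSE
    )
}))
conds

# One-line summary built from the same entry
sprintf("%s = %s (outlier score: %.4f)",
        entry$suspicous_value$column,
        format(entry$suspicous_value$value),
        entry$outlier_score)
```

For row 1138 this yields a single-row data frame describing the `pregnant` condition, which is easier to filter or join than the nested list.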