The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code! We will use fewer than 10 lines of code and just 6 function names to explore the penguins dataset:
function | package | description |
---|---|---|
library() | {base} | load a package |
filter() | {dplyr} | subset rows using column values |
describe() | {explore} | describe variables of the table |
explore() | {explore} | explore a variable graphically |
explore_all() | {explore} | explore all variables of the table |
explain_tree() | {explore} | explain a target using a decision tree |
The penguins dataset comes with the palmerpenguins package. It has 344 observations and 8 variables (https://github.com/allisonhorst/palmerpenguins).
Furthermore, we use the packages {dplyr} for filter() and %>%, and {explore} for data exploration.
library(dplyr)
library(explore)
penguins <- use_data_penguins()
# equivalent to
# penguins <- palmerpenguins::penguins
penguins %>% describe()
#> # A tibble: 8 x 8
#> variable type na na_pct unique min mean max
#> <chr> <chr> <int> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 species fct 0 0 3 NA NA NA
#> 2 island fct 0 0 3 NA NA NA
#> 3 bill_length_mm dbl 2 0.6 165 32.1 43.9 59.6
#> 4 bill_depth_mm dbl 2 0.6 81 13.1 17.2 21.5
#> 5 flipper_length_mm int 2 0.6 56 172 201. 231
#> 6 body_mass_g int 2 0.6 95 2700 4202. 6300
#> 7 sex fct 11 3.2 3 NA NA NA
#> 8 year int 0 0 3 2007 2008. 2009
There are some NA values (unknown values) in the data.
The variable containing the most NAs is sex (11 observations). flipper_length_mm and the other measurement variables each contain only 2 observations with NAs.
For the data exploration we use only penguins with a known flipper length.
This reduces the data from 344 to 342 penguins.
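A minimal sketch of this filtering step, assuming the filtered table is stored in a variable called data (the name used later in this walkthrough). The condition !is.na(flipper_length_mm) is one way to keep only penguins with a known flipper length; it is an assumption, and a different but equivalent condition would work just as well.

```r
library(dplyr)
library(explore)

penguins <- use_data_penguins()

# keep only rows where the flipper length is known (not NA)
# (the exact filter condition is an assumption)
data <- penguins %>%
  filter(!is.na(flipper_length_mm))

# number of penguins before and after filtering
nrow(penguins)
nrow(data)
```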
What is the relationship between all the variables and species?
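One way to answer this question graphically is explore_all() with species as the target. The following is a sketch under the assumption that data holds the penguins with a known flipper length; the filter condition is again an assumption.

```r
library(dplyr)
library(explore)

# penguins with known flipper length (filter condition is an assumption)
data <- use_data_penguins() %>%
  filter(!is.na(flipper_length_mm))

# one plot per variable, each split by the target variable species
data %>% explore_all(target = species)
```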
We already see some strong patterns in the data. flipper_length_mm separates species Gentoo, and bill_length_mm separates species Adelie from Chinstrap. We also see that Chinstrap and Gentoo are located on separate islands.
Now we explain species using a decision tree:
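A minimal sketch of this step, assuming data holds the penguins with a known flipper length (the filter condition is an assumption). explain_tree() is called with its default settings and species as the target.

```r
library(dplyr)
library(explore)

# penguins with known flipper length (filter condition is an assumption)
data <- use_data_penguins() %>%
  filter(!is.na(flipper_length_mm))

# fit and plot a decision tree that explains the target variable species
data %>% explain_tree(target = species)
```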
We found an easy way to determine the species using just flipper_length_mm and bill_length_mm:
- If flipper_length_mm >= 207, it is a Gentoo penguin (95% right)
- If flipper_length_mm < 207 and bill_length_mm < 43, it is an Adelie penguin (97% right)
- If flipper_length_mm < 207 and bill_length_mm >= 43, it is a Chinstrap penguin (92% right)

Now let's take a closer look at these variables:
data %>%
  explore(
    flipper_length_mm, bill_length_mm,
    target = species,
    color = c("darkorange", "purple", "lightseagreen")
  )
The plot shows a good, though not perfect, separation between the 3 species!
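The rules read off the decision tree can also be written as a small function, which makes the two split points explicit. This is only a sketch: classify_penguin() is a hypothetical helper that is not part of {explore}, and the thresholds 207 and 43 are simply the values quoted in the rules above.

```r
# Hypothetical helper (not part of {explore}): applies the decision rules
# quoted above to vectors of flipper and bill lengths (both in mm).
classify_penguin <- function(flipper_length_mm, bill_length_mm) {
  ifelse(
    flipper_length_mm >= 207, "Gentoo",
    ifelse(bill_length_mm < 43, "Adelie", "Chinstrap")
  )
}

# three made-up example penguins, one per rule
classify_penguin(
  flipper_length_mm = c(215, 190, 195),
  bill_length_mm    = c(47, 39, 49)
)
#> [1] "Gentoo"    "Adelie"    "Chinstrap"
```

Because the function is vectorised, it could also be applied to the columns of the filtered table to compare the predicted labels with the species column.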