messy

When teaching examples using R, instructors often using nice datasets - but these aren’t very realistic, and aren’t what students will later encounter in the real world. Real datasets have typos, missing values encoded in strange ways, and weird spaces. The {messy} R package takes a clean dataset, and randomly adds these things in - giving students the opportunity to practice their data cleaning and wrangling skills without having to change all of your examples.

Installation

Install from CRAN using:

install.packages("messy")

Install development version from GitHub using:

remotes::install_github("nrennie/messy")

Usage

For more in-depth usage instructions, see the package documentation at nrennie.rbind.io/messy which has examples of each function.

The simplest way to use the {messy} package is applying the messy() function:

set.seed(1234)
messy(ToothGrowth[1:10,])

    len supp dose
1   4.2   VC  0.5
2  11.5 <NA> <NA>
3  7.3    VC  0.5
4   5.8  (VC  0.5
5   6.4   VC <NA>
6    10   VC  0.5
7  11.2 <NA>  0.5
8  11.2   VC  0.5
9  5.2    VC  0.5
10    7   VC 0.5

You can vary the amount of messiness for each function, and chain together different functions to create customised messy data:

set.seed(1234)
ToothGrowth[1:10,] |> 
  make_missing(cols = "supp", missing = " ") |> 
  make_missing(cols = c("len", "dose"), missing = c(NA, 999)) |> 
  add_whitespace(cols = "supp", messiness = 0.5) |> 
  add_special_chars(cols = "supp")

    len supp dose
1   4.2   VC  0.5
2  11.5  VC    NA
3   7.3   VC  0.5
4   5.8 *VC   0.5
5   6.4  VC   0.5
6  10.0   VC  0.5
7  11.2       0.5
8  11.2  V#C   NA
9   5.2  !VC  0.5
10  7.0 VC*   0.5