Lecture II: Data Munging

Skills

  1. Generalization: Arrange, count, summarise
  2. Extraction: Filter, select, mutate

Abilities You’ll Gain

  1. Be the “family doctor” to your dataset
  2. Store the valuable data easily! And quickly!

Your “Patient” Today

Demographic statistics popularized by Hans Rosling’s TED talks.

library(gapminder)
gapminder

Generalization

Knowing about the Data

head(gapminder, n = 6)

## Systematic view

str(gapminder)

Knowing professionally

Hey, I’m a professional~ I want to see the data systematically, for example by finding out

  • The number of observations
  • The number and names of the variables included
  • Or the structure of the entire data frame

gapminder
nrow(gapminder)
ncol(gapminder)
names(gapminder)
str(gapminder)

Knowing about the Variable

Q: Tell me something about the population variable in the dataset: how many countries’ populations we have, what the average is, what the largest and smallest populations are, and many other things! By the way, what type is pop stored as?

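# Demonstrated on year; swap in pop to answer the question above
head(gapminder$year, n = 10)        # first 10 values
mean(gapminder$year, na.rm = TRUE)  # average, ignoring missing values
median(gapminder$year)
min(gapminder$year)
max(gapminder$year)
length(gapminder$year)              # number of values
summary(gapminder$year)             # quartiles, mean, min, and max

# Storage class and underlying type (shown on gdpPercap)
class(gapminder$gdpPercap)
typeof(gapminder$gdpPercap)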
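
min() and max() return only the values, not the rows they belong to. One way to see who holds the largest and smallest population in base R is which.max() and which.min(). A minimal sketch; the object names (i_max, largest, and so on) are illustrative, not from the lecture:

```r
library(gapminder)

# Row positions of the largest and smallest population values
i_max <- which.max(gapminder$pop)
i_min <- which.min(gapminder$pop)

# Pull the full rows so country and year appear next to the value
largest  <- gapminder[i_max, ]
smallest <- gapminder[i_min, ]
largest
smallest

# Number of distinct countries in the data
n_countries <- length(unique(gapminder$country))
n_countries
```

Note that which.max() returns only the first position if several rows tie for the maximum.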

Fancier Moves

Welcome to the Tidyverse

A widely used toolkit for data manipulation

  • Started by Hadley Wickham
  • A growing set of packages, actually!

Installation:

## install.packages("tidyverse")
library("tidyverse")

We focus on dplyr today.

Five Guns of dplyr

Each does one thing, and does it well.

Composability: Make Everything Different

Making code more readable.
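
To see why composing steps helps readability, compare the same task written as nested calls and as a pipeline. This is only a sketch: the task (the three most populous countries in 2007) and the object names nested and piped are chosen for illustration, and the verbs are the ones covered in the later slides.

```r
library(dplyr)
library(gapminder)

# Nested calls: you have to read from the inside out
nested <- head(arrange(filter(gapminder, year == 2007), desc(pop)), n = 3)

# Piped calls: read top to bottom, one step per line
piped <- gapminder %>%
  filter(year == 2007) %>%
  arrange(desc(pop)) %>%
  head(n = 3)

# Both versions return the same rows
all.equal(nested, piped)
```

x %>% f(y) is just another way of writing f(x, y), so nothing changes except the reading order.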

Shortcut for %>%:

  • Ctrl + Shift + M (Win)
  • Cmd + Shift + M (Mac)

Data Overview

Beyond the base

You still remember str(), right?

str(gapminder)
glimpse(gapminder)

View in Order

Q: Which countries have the largest populations? And the smallest?

gapminder
gapminder %>% 
  arrange(pop)

arrange(gapminder, desc(pop))

“Give me some numbers”!

Q: How many observations do we have in each continent? Do we have the same number of observations for each country within a continent?

gapminder %>% 
  count(continent)

# gapminder %>% 
#   add_count(continent)
gapminder %>% 
  count(continent, country)

What does count() give?
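
One way to answer this: count() behaves like a shorthand for group_by() followed by summarise(n = n()), returning one row per group plus a column n. The sketch below checks that on gapminder; by_count, by_hand, and per_country are illustrative names.

```r
library(dplyr)
library(gapminder)

# count() in one step
by_count <- gapminder %>%
  count(continent)

# The same table built by hand
by_hand <- gapminder %>%
  group_by(continent) %>%
  summarise(n = n())

by_count
by_hand

# Do all countries have the same number of observations?
# Count rows per country, then list the distinct counts.
per_country <- gapminder %>%
  count(continent, country) %>%
  distinct(n)
per_country
```

If per_country has a single row, every country contributes the same number of observations.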

Stats to variables

Q: What was the average GDP per capita and median life expectancy?

gapminder %>% 
  summarise(mean_gdp = mean(gdpPercap), median_life = median(lifeExp))

Summarise in groups

Q: What was the average GDP per capita and median life expectancy in each continent?

gapminder %>% 
  group_by(continent) %>% 
  summarise(mean_gdp = mean(gdpPercap), median_life = median(lifeExp))

Extraction

Rows

Q: Which countries had the largest population in 2007?

gapminder %>% 
  arrange(desc(pop))
gapminder %>% 
  filter(year == 2007) %>% 
  arrange(desc(pop))

Which country had the largest population in the decade ending with 2007? (Tip: use %in% in the condition.)
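
A possible solution sketch. It rests on an assumption: "the decade ending with 2007" is read here as the years 1998 to 2007, and since gapminder records data every five years, that range matches only 2002 and 2007. The names decade and top_decade are illustrative.

```r
library(dplyr)
library(gapminder)

# Assumption: the decade ending with 2007 means 1998 through 2007
decade <- 1998:2007

top_decade <- gapminder %>%
  filter(year %in% decade) %>%  # keep rows whose year is in the set
  arrange(desc(pop))            # largest population first

top_decade
```

%in% tests membership in a set of values, which is shorter than chaining several year == ... conditions with |.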

Columns

Q: What if I want

  1. Only country, year, and population
  2. Everything except continent
  3. Variables starting with “co”

gapminder %>% 
  select(country, year, pop)
gapminder %>% 
  select(-continent)

gapminder %>% 
  select(starts_with("co"))

Combo Attack

Q: What’s the life expectancy of the country that had the largest population in 2007? Show the country name, population, and life expectancy together, please.

gapminder
gapminder %>% 
  filter(year == 2007) %>% 
  arrange(desc(pop)) %>% 
  select(country, pop, lifeExp)

Modification

Q: What’s the total GDP of each country?

gapminder %>% 
  mutate(gdp = pop * gdpPercap) %>% 
  select(country, pop, gdpPercap, gdp)

Batch Modification

Q: How do we round all the numeric variables to integers?

gapminder %>% 
  mutate_if(is.double, round, digits = 0)
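
In dplyr 1.0.0 and later, the scoped verbs such as mutate_if() are superseded by across(). An equivalent sketch, assuming a recent dplyr is installed (rounded is an illustrative name):

```r
library(dplyr)
library(gapminder)

# Round every double column to 0 decimal places;
# integer columns (year, pop) and factors are left untouched
rounded <- gapminder %>%
  mutate(across(where(is.double), ~ round(.x, digits = 0)))

rounded
```

Both forms touch only the columns for which is.double() is TRUE, here lifeExp and gdpPercap.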

Important note

When doing gapminder %>% ..., you are NOT adding to or changing gapminder itself. If you want to keep the changes, assign the result to an object.

gapminderNew <- gapminder %>% ...
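
A small check of this point, reusing the total-GDP example. The object name gapminder_gdp is illustrative.

```r
library(dplyr)
library(gapminder)

# The pipeline computes a result, but nothing is stored
gapminder %>%
  mutate(gdp = pop * gdpPercap)

# gapminder itself still has no gdp column
"gdp" %in% names(gapminder)

# Assign the result to keep it
gapminder_gdp <- gapminder %>%
  mutate(gdp = pop * gdpPercap)

# The new object does have the column
"gdp" %in% names(gapminder_gdp)
```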

Take-Home Points

  1. Be clear about the logic before the moves;
  2. Use the dplyr functions wisely and in combo;
    • Generalization: arrange, count, summarise
    • Extraction: filter, select, mutate
  3. Don’t forget group_by and mutate_if

Bonus: Combining variables

Q: How do I fill the missing values in x, and combine y and z into one variable?

df_toy %>% 
  mutate(x = coalesce(x, 0L),
         yz = coalesce(y, z))
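
df_toy is not defined in these slides. To run the snippet, here is a sketch with a made-up df_toy; its values are purely hypothetical, chosen so that x has a gap and y and z complement each other (df_filled is also an illustrative name).

```r
library(dplyr)

# Hypothetical toy data; the original df_toy is not shown in the slides
df_toy <- tibble(
  x = c(1L, NA, 3L),
  y = c("a", NA, NA),
  z = c(NA, "b", "c")
)

df_filled <- df_toy %>%
  mutate(x  = coalesce(x, 0L),   # replace missing x with 0
         yz = coalesce(y, z))    # take y, fall back to z where y is NA

df_filled
```

coalesce() returns, position by position, the first non-missing value among its arguments, which is why it handles both filling and combining.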

Thank you!

  yuehu@tsinghua.edu.cn

  https://sammo3182.github.io/

  sammo3182