Before diving into the specifics, let’s clarify some vocabulary:
tidyverse
in R,
dplyr
is a package for data manipulation. It provides a set
of functions that can be used to filter, select, arrange, mutate,
summarize, and join data.tidyverse
in R, dbplyr
is a database backend for dplyr
.
It allows you to write R code that is then translated into SQL
queries.A connection to a database is an object that allows you to interact
with it from R. The allofus
package relies on the
bigrquery
R package to create a connection to the Google
BigQuery database when you run aou_connect()
.
The object con
is used to refer to the connection to the
All of Us database. When you run the aou_connect()
function, it also gets stored as the default connection for a session,
so you don’t need to include it in other functions from the
allofus
package. For instance, you can run
aou_tables()
to see a list of tables in the database:
#> # A tibble: 68 × 2
#> table_name columns
#> <chr> <chr>
#> 1 concept_ancestor ancestor_concept_id, descendant_concept_id, min_lev…
#> 2 cb_criteria_ancestor ancestor_id, descendant_id
#> 3 attribute_definition attribute_definition_id, attribute_name, attribute_…
#> 4 cdm_source cdm_source_name, cdm_source_abbreviation, cdm_holde…
#> 5 concept_class concept_class_id, concept_class_name, concept_class…
#> 6 concept concept_id, concept_name, domain_id, vocabulary_id,…
#> 7 concept_synonym concept_id, concept_synonym_name, language_concept_…
#> 8 cb_criteria_relationship concept_id_1, concept_id_2
#> 9 concept_relationship concept_id_1, concept_id_2, relationship_id, valid_…
#> 10 condition_occurrence condition_occurrence_id, person_id, condition_conce…
#> # ℹ 58 more rows
To access a table in the database, use
tbl(con, "tablename")
. The resulting object is reference to
a table in a database that allows you to interact with it from R. In
order to connect to a table in the database, you must first create the
database connection (con
). For example, to create a
reference to the person table:
Although the person_tbl
object behaves similarly to a
data frame, it is not actually a data frame. Instead, it is actually a
SQL query that only gets run when you need to access the data. This is a
feature of the dbplyr
package that allows you to manipulate
data without actually retrieving it. The SQL query behind the
person_tbl
object is:
When you print person_tbl
, you’ll get something like
this:
#> # Source: table<person> [?? x 23]
#> # Database: BigQueryConnection
#> person_id gender_concept_id year_of_birth month_of_birth day_of_birth
#> <int64> <int64> <int64> <int64> <int64>
#> 1 xxxxxxx 903096 1955 NA NA
#> 2 xxxxxxx 903096 1978 NA NA
#> 3 xxxxxxx 903096 2000 NA NA
#> 4 xxxxxxx 903096 1988 NA NA
#> 5 xxxxxxx 903096 1993 NA NA
#> 6 xxxxxxx 903096 1959 NA NA
#> 7 xxxxxxx 903096 1976 NA NA
#> 8 xxxxxxx 903096 1961 NA NA
#> 9 xxxxxxx 903096 1952 NA NA
#> 10 xxxxxxx 903096 1980 NA NA
#> # ℹ more rows
#> # ℹ 18 more variables: birth_datetime <dttm>, race_concept_id <int64>,
#> # ethnicity_concept_id <int64>, location_id <int64>, provider_id <int64>,
#> # care_site_id <int64>, person_source_value <chr>, gender_source_value <chr>,
#> # gender_source_concept_id <int64>, race_source_value <chr>,
#> # race_source_concept_id <int64>, ethnicity_source_value <chr>,
#> # ethnicity_source_concept_id <int64>, …
You’ll only see the first 10 rows of the person table (person ids were omitted from the output). This allows you to see what the data looks like without loading the entire table into R, which can be slow or even crash your session.
Instead, you want to perform as much data manipulation as possible on the database. This is more efficient because the operations are translated into SQL and executed on the server, which is faster and requires less memory than processing in R.
Before bringing data into R, you can manipulate it on the database
using the dplyr
functions. This allows you to perform
operations on the database without bringing the data into R’s
memory.
For example, you can subset the person table to women born after 1980:
Before we print out young_women
, the SQL query has not
actually been run. In fact, person_tbl
is not run either.
When we do print it, it will run the following SQL:
SELECT `person`.*
FROM `person`
WHERE (`gender_concept_id` = 45878463.0) AND (`year_of_birth` > 1980.0)
and print the first 10 rows:
#> # Source: SQL [?? x 23]
#> # Database: BigQueryConnection
#> person_id gender_concept_id year_of_birth month_of_birth day_of_birth
#> <int64> <int64> <int64> <int64> <int64>
#> 1 xxxxxxx 45878463 1992 NA NA
#> 2 xxxxxxx 45878463 1989 NA NA
#> 3 xxxxxxx 45878463 1981 NA NA
#> 4 xxxxxxx 45878463 1990 NA NA
#> 5 xxxxxxx 45878463 1990 NA NA
#> 6 xxxxxxx 45878463 1985 NA NA
#> 7 xxxxxxx 45878463 1987 NA NA
#> 8 xxxxxxx 45878463 1986 NA NA
#> 9 xxxxxxx 45878463 1983 NA NA
#> 10 xxxxxxx 45878463 1998 NA NA
# ℹ more rows
# ℹ 18 more variables: birth_datetime <dttm>, race_concept_id <int64>,
# ethnicity_concept_id <int64>, location_id <int64>, provider_id <int64>,
# care_site_id <int64>, person_source_value <chr>, gender_source_value <chr>,
# gender_source_concept_id <int64>, race_source_value <chr>,
# race_source_concept_id <int64>, ethnicity_source_value <chr>,
# ethnicity_source_concept_id <int64>, …
Note that we don’t know how many observations match these conditions
yet (the dimensions are [?? x 23]
), because it hasn’t been
fully executed – only the first 10 rows. To get the total number of
observations, we can use tally()
:
#> # Source: SQL [1 x 1]
#> # Database: BigQueryConnection
#> n
#> <int64>
#> 1 76135
This is actually a SQL query that only results in 1 row, so we do get to see the entire thing. It’s much faster to run than to bring the entire table into R and then count the number of rows, because the code is executed on the database:
SELECT count(*) AS `n`
FROM (
SELECT `person`.*
FROM `person`
WHERE (`gender_concept_id` = 45878463.0) AND (`year_of_birth` > 1980.0)
)
collect()
FunctionWe can bring the result of a query into the local R session using
collect()
:
This brings the table into your local R workspace as a tibble, or
dataframe. This is useful for performing operations that cannot be
performed directly on the database, such as certain statistical analyses
or plotting. For example, if you’re planning to run a regression
analysis on the filtered data, you would first use
collect()
to bring the data into R.
We can bring in the result of tally()
as well:
#> A tibble: 1 × 1
#> n
#> <int64>
#> 76135
Or even the entire person table, although that’s not recommended because it’s so large!
#> # A tibble: 413457 × 23
#> person_id gender_concept_id year_of_birth month_of_birth day_of_birth
#> <int64> <int64> <int64> <int64> <int64>
#> xxxxxxx 903096 1955 NA NA
#> xxxxxxx 903096 1978 NA NA
#> xxxxxxx 903096 2000 NA NA
#> xxxxxxx 903096 1988 NA NA
#> xxxxxxx 903096 1993 NA NA
#> xxxxxxx 903096 1959 NA NA
#> xxxxxxx 903096 1976 NA NA
#> xxxxxxx 903096 1961 NA NA
#> xxxxxxx 903096 1952 NA NA
#> xxxxxxx 903096 1980 NA NA
The All of Us data is spread across multiple tables, for the most
part corresponding to the OMOP Common Data Model. This allows for
efficient storage and retrieval of data, but it can be a bit tricky to
work with at first. Fortunately, dbplyr
makes it easy to
join tables together.
For example, how did we know that
gender_concept_id == 45878463
referred to women? We can
look up the names of concept ids in the concept
table:
#> # Source: SQL [?? x 2]
#> # Database: BigQueryConnection
#> concept_id concept_name
#> <int64> <chr>
#> 1 38003166 Durable Medical Equipment - General Classification
#> 2 35805830 DexaBEAM
#> 3 38003221 Blood - Plasma
#> 4 1147839 survey_conduct.survey_start_date
#> 5 8623 log reduction
#> 6 38004063 Rehabilitation Practitioner
#> 7 38003186 Radiology - Diagnostic - General Classification
#> 8 35805115 VNCOP-B
#> 9 35805457 VAdCA
#> 10 8581 heartbeat
#> # ℹ more rows
We just want to extract the names of the gender concept ids. To do
this, we can join the person
table with the
concept
table. So that we can see the full range of gender
ids, first we will count them:
genders_in_aou <- person_tbl %>%
count(gender_concept_id) %>%
left_join(concept_tbl, by = join_by(gender_concept_id == concept_id))
genders_in_aou
#> # Source: SQL [9 x 3]
#> # Database: BigQueryConnection
#> gender_concept_id n concept_name
#> <int64> <int64> <chr>
#> 1 1177221 602 I prefer not to answer
#> 2 0 97 No matching concept
#> 3 45878463 247453 Female
#> 4 1585843 407 Gender Identity: Additional Options
#> 5 903096 7356 PMI: Skip
#> 6 45880669 154241 Male
#> 7 1585842 562 Gender Identity: Transgender
#> 8 1585841 1213 Gender Identity: Non Binary
#> 9 2000000002 1526 Not man only, not woman only, prefer not to answer,…
The result of this SQL query is just 9 rows, so we get to see all of them. Both the counting and the joining were done directly on the database, so this was very efficient.
aou_join()
The allofus
package includes a function called
aou_join()
that makes it easy to join tables together. It
includes some additional checks to help avoid mistakes in joining. For
example, if we wanted to join the person
table with the
observation
table, dropping people with no observations, we
could do it like this:
Warning message:
“There are shared column names not specified in the `by` argument.
→ These column names now end in '_x' and '_y'.
ℹ You can change these suffixes using the `suffix` argument but it cannot
contain periods (`.`).
→ Consider specifing all shared columns in the `by` argument.
→ Or if these additional shared columns are `NA`, remove them prior to joining.”
The warning message tells us that the person
and
observation
tables share some column names that we didn’t
specify as part of the join argument. That is because both tables have a
column called provider_id
. We can see this by looking at
the column names of the obs
table that have the default
added suffix, “_x” and “_y”:
#> [1] "provider_id_x" "provider_id_y"
Because this is often a mistake occurring because we are not working
with the tables directly, aou_join()
warns us about this.
We can avoid this warning by specifying all of the columns that we want
to join on and removing the columns that we don’t want to join on. For
example, we could remove the provider_id
column from the
person
table before joining:
Unfortunately, we can’t join a table on the database with a dataframe in R. If you end up with one of each, you have a couple of options:
allofus
functions have a collect = FALSE
argument, but sometimes it’s unavoidable.collect()
and then join it with the dataframe in R. This
can be inefficient if part of the reason for joining is to subset the
table down to only data you care about.aou_atlas_cohort()
, and you want to bring in activity data
for those participants, you could run:show_query()
Understanding the SQL code that dbplyr
generates can be
insightful, especially if you’re debugging or simply curious about the
translation from R to SQL. To view the SQL query that corresponds to
your dbplyr
operations, use the show_query()
function:
SELECT
`edjnngldox`.`person_id` AS `person_id`,
`gender_concept_id`,
`year_of_birth`,
`month_of_birth`,
`day_of_birth`,
`birth_datetime`,
`race_concept_id`,
`ethnicity_concept_id`,
`location_id`,
`care_site_id`,
`person_source_value`,
`gender_source_value`,
`gender_source_concept_id`,
`race_source_value`,
`race_source_concept_id`,
`ethnicity_source_value`,
`ethnicity_source_concept_id`,
`state_of_residence_concept_id`,
`state_of_residence_source_value`,
`sex_at_birth_concept_id`,
`sex_at_birth_source_concept_id`,
`sex_at_birth_source_value`,
`observation_id`,
`observation_concept_id`,
`observation_date`,
`observation_datetime`,
`observation_type_concept_id`,
`value_as_number`,
`value_as_string`,
`value_as_concept_id`,
`qualifier_concept_id`,
`unit_concept_id`,
`zwcwezaowf`.`provider_id` AS `provider_id`,
`visit_occurrence_id`,
`visit_detail_id`,
`observation_source_value`,
`observation_source_concept_id`,
`unit_source_value`,
`qualifier_source_value`,
`value_source_concept_id`,
`value_source_value`,
`questionnaire_response_id`
FROM `person` `edjnngldox`
INNER JOIN `observation` `zwcwezaowf`
ON (`edjnngldox`.`person_id` = `zwcwezaowf`.`person_id`)
This function prints the SQL query that would be sent to the
database. It’s a great way to learn SQL and understand how
dbplyr
optimizes data manipulation. (Why the gibberish
table names? Bugs in previous versions of dbplyr
resulted
in table names that would break the query, and giving them unique names
is a workaround.)
Another approach to working with the data is to write SQL code
directly. This is especially useful for complex queries that are
difficult to express in dplyr
syntax. The
allofus
package includes a function called
aou_sql()
that makes it easy to run SQL code directly on
the database. For example, we could count the number of people in the
person
table like this:
There are a few important things to note about this code. First, the
CDR
variable is a special variable referring to what All of
Us calls the “curated data repository”. When writing SQL code directly,
we don’t need the database connection object con
, instead
we need to direct the code to the correct tables by preceding the table
names with “{CDR}”. This means we can’t run the code we get from
show_query()
without modification. For example, we could
count the number of young women in the dataset, as we did above with the
dbplyr
approach, like this:
aou_sql("
SELECT count(*) AS `n`
FROM (
SELECT `person`.*
FROM {CDR}.`person`
WHERE (`gender_concept_id` = 45878463.0) AND (`year_of_birth` > 1980.0)
)
")
Second, the aou_sql()
function returns a dataframe – the
entire result of the SQL query is brought into memory. This means that
we want to run an entire query at once, instead of breaking it into
multiple steps like we did with dbplyr
.