This vignette deals with situations where you want to transform
tables of your dm
and then update an existing table or add
a new table to the dm
. There are two straightforward
approaches:
dm
,dm
object by zooming to a table
and manipulating it while maintaining the key relations whenever
possible.Both approaches aim at maintaining the key relations whenever
possible. We will explore the second approach here. For the first
approach, see vignette("tech-dm-zoom")
.
dm
“Zooming” to a table of a dm
means:
dm
is kept,
including the originally zoomed tabledm_zoomed
is produced, presenting a
view of the table for transformationsselect()
, mutate()
and other table
manipulation functions{dm} provides methods for many of the {dplyr}-verbs for a
dm_zoomed
which behave the way you are used to, affecting
only the zoomed table and leaving the rest of the dm
untouched. When you are finished with transforming the table, there are
three options to proceed:
dm_update_zoomed()
if you want to replace the
originally zoomed table with the new tabledm_insert_zoomed()
if you are creating a new table
for your dm
dm_discard_zoomed()
if you do not need the result
and want to discard itWhen employing one of the first two options, the resulting table in
the dm
will have all the primary and foreign keys available
that could be tracked from the originally zoomed table.
So much for the theory, but how does it look and feel? To explore this, we once more make use of our trusted {nycflights13} data.
Imagine you want to have a column in flights
, specifying
if a flight left before noon or after. Just like with {dplyr}, we can
tackle this with mutate()
. Let us do this step by step:
library(dm)
library(dplyr)
flights_dm <- dm_nycflights13()
flights_dm
flights_zoomed <-
flights_dm %>%
dm_zoom_to(flights)
# The print output for a `dm_zoomed` looks very much like that from a normal `tibble`.
flights_zoomed
flights_zoomed_mutate <-
flights_zoomed %>%
mutate(am_pm_dep = if_else(dep_time < 1200, "am", "pm")) %>%
# in order to see our changes in the output we use `select()` for reordering the columns
select(year:dep_time, am_pm_dep, everything())
flights_zoomed_mutate
# To update the original `dm` with a new `flights` table we use `dm_update_zoomed()`:
updated_flights_dm <-
flights_zoomed_mutate %>%
dm_update_zoomed()
# The only difference in the `dm` print output is the increased number of columns
updated_flights_dm
# The schematic view of the data model remains unchanged
dm_draw(updated_flights_dm)
The same course of action could, for example, be employed to create a
surrogate key for a table, a synthetic simple key that replaces a
compound key. We can do this for the weather
table.
library(tidyr)
weather_zoomed <-
flights_dm %>%
dm_zoom_to(weather)
weather_zoomed
# Maybe there is some hidden candidate for a primary key that we overlooked
enum_pk_candidates(weather_zoomed)
# Seems we have to construct a column with unique values
# This can be done by combining column `origin` with `time_hour`, if the latter
# is converted to a single time zone first; all within the `dm`:
weather_zoomed_mutate <-
weather_zoomed %>%
# first convert all times to the same time zone:
mutate(time_hour_fmt = format(time_hour, tz = "UTC")) %>%
# paste together as character the airport code and the time
unite("origin_slot_id", origin, time_hour_fmt) %>%
select(origin_slot_id, everything())
# check if we the result is as expected:
enum_pk_candidates(weather_zoomed_mutate) %>% filter(candidate)
flights_upd_weather_dm <-
weather_zoomed_mutate %>%
dm_update_zoomed() %>%
dm_add_pk(weather, origin_slot_id)
flights_upd_weather_dm
# creating the coveted FK relation between `flights` and `weather`
extended_flights_dm <-
flights_upd_weather_dm %>%
dm_zoom_to(flights) %>%
mutate(time_hour_fmt = format(time_hour, tz = "UTC")) %>%
# need to keep `origin` as FK to airports, so `remove = FALSE`
unite("origin_slot_id", origin, time_hour_fmt, remove = FALSE) %>%
dm_update_zoomed() %>%
dm_add_fk(flights, origin_slot_id, weather)
extended_flights_dm %>% dm_draw()
dm
If you look at the dm
created by
dm_nycflights13(cycle = TRUE)
, you see that two columns of
flights
relate to one and the same table,
airports
. One column stands for the departure airport and
the other for the arrival airport.
This generates a cycle which leads to failures with many operations
that only work on cycle-free data models, such as
dm_flatten_to_tbl()
, dm_filter()
or
dm_wrap_to_tbl()
. In such cases, it can be beneficial to
“disentangle” the dm
by duplicating the referred table. One
way to do this in the {dm}-framework is as follows:
disentangled_flights_dm <-
dm_nycflights13(cycle = TRUE) %>%
# zooming and immediately inserting essentially creates a copy of the original table
dm_zoom_to(airports) %>%
# reinserting the `airports` table under the name `destination`
dm_insert_zoomed("destination") %>%
# renaming the originally zoomed table
dm_rename_tbl(origin = airports) %>%
# Key relations are also duplicated, so the wrong ones need to be removed
dm_rm_fk(flights, dest, origin) %>%
dm_rm_fk(flights, origin, destination)
dm_draw(disentangled_flights_dm)
In a future update, we will provide a more convenient way to
“disentangle” dm
objects, so that the individual steps will
be done automatically.
dm
Here is an example for adding a summary of a table as a new table to
a dm
(FK-relations are taken care of automatically):
If you would like to join some or all of the columns of one table to
another, you can make use of one of the join
-methods for a
dm_zoomed
. In addition to the usual arguments for the
{dplyr}-joins, by supplying the select
argument you can
specify which columns of the RHS-table you want to be included in the
join. For the syntax please see the example below. The LHS-table of a
join is always the zoomed table.
joined_flights_dm <-
flights_dm %>%
dm_zoom_to(flights) %>%
# let's first reduce the number of columns of flights
select(-dep_delay:-arr_delay, -air_time:-time_hour) %>%
# in the {dm}-method for the joins you can specify which columns you want to add to the zoomed table
left_join(planes, select = c(tailnum, plane_type = type)) %>%
dm_insert_zoomed("flights_plane_type")
# this is how the table looks now
joined_flights_dm$flights_plane_type
# also here, the FK-relations are transferred to the new table
dm_draw(joined_flights_dm)
At each point, you can retrieve the zoomed table by calling
pull_tbl()
on a dm_zoomed
. To use our last
example once more:
Currently, not all {dplyr}-verbs have their own method for a
dm_zoomed
object, so be aware that in some cases it will
still be necessary to resort to extracting one or more tables from a
dm
and reinserting a transformed version back into the
dm
object. The supported functions are:
group_by()
, ungroup()
,
summarise()
, mutate()
,
transmute()
, filter()
, select()
,
relocate()
, rename()
, distinct()
,
arrange()
, slice()
, left_join()
,
inner_join()
, full_join()
,
right_join()
, semi_join()
, and
anti_join()
.
The same is true for {tidyr}-functions. Methods are provided for:
unite()
and separate()
.
There might be situations when you would like the key relations
to remain intact, but they are dropped nevertheless. This is because a
rigid logic is implemented, that does drop a key when its associated
column is acted upon with e.g. a mutate()
call. In these
cases, the key relations will need to be re-established after finishing
with the manipulations.
For each implemented {dplyr}-verb, there is a logic for tracking
key relations between the tables. Up to {dm} version 0.2.4 we tried to
track the columns in a very detailed manner. This has become
increasingly difficult, especially with dplyr::across()
. As
of {dm} 0.2.5, we give more responsibility to the {dm} user: Now those
columns are tracked whose names remain in the resulting
table. Affected by these changes are the methods for:
mutate()
, transmute()
,
distinct()
. When using one of these functions, be aware
that if you want to replace a key column with a column with a different
content but of the same name, this column will automatically become a
key column.