library(autodb)
#>
#> Attaching package: 'autodb'
#> The following object is masked from 'package:stats':
#>
#> decompose
To avoid confusion, this vignette uses consistent terminology: a database consists of relations, which have records with the same attributes. This is in line with the terms used in the relational model; in Statistics, we would talk about tables, which have rows with values for the same variables, with a column for each variable.
For the original data set, which is a flat table, we refer to a data frame instead, which also consists of records with attributes.
When we talk about the type of values an attribute can take, we talk about a class, as in the relational model and R, rather than a value type, as in many other programming languages.
autodb is described in the DESCRIPTION file in the following way:
“Automatic normalisation of a data frame to third normal form, with the intention of easing the process of data cleaning. (Usage to design your actual database for you is not advised.)”
Let us begin there.
Database normalisation works on the principle that our data should only express a piece of information in a single place. This is done to reduce the chance that a modification will accidentally change or remove other pieces of information.
Data cleaning works on the principle that we remove duplicate observations and structural errors, and have a clear understanding of any associations between different observations, or parts of observations. This is done to ensure a high quality for the data used in analysis, but I have found that it is equally important for making sure that I understand the domain I am hoping to model.
Normalisation is helpful for both of these areas. I usually use the original, flat, “tidy” data for the analysis itself. However, for checking the data’s consistency, and for using the data to learn about the domain, normalised data is often helpful, whether it’s used for the analysis or not.
I usually do data checking under the following conditions:
It’s in this context that I would usually have to do the normalisation by hand, and I would like to have a tool that does it semi-automatically. This package is intended to be that tool, using a minimal amount of additional libraries.
Good database design is a hard and involved process, which requires a strong grasp of the domain that the database will hold information on. Please don’t try to use this package to do it on autopilot. In particular, the conditions described above mean we do things differently to good practice for normalised relational databases; for example, relations are rejoined with the base merge function.
autodb gets a given data frame to third normal form:
every attribute depends on the whole key(s), and non-key attributes
depend on nothing but the key(s). This was chosen because there is an
existing algorithm, Bernstein’s synthesis, for normalising to third
normal form, and because it’s the highest normal form that is attainable
with arbitrary original data frames. There is an additional enhancement
available as an option: see the section on avoidable attributes for more
details.
For most of these simple examples, we use the
ChickWeight
dataset, included with base R.
summary(ChickWeight)
#> weight Time Chick Diet
#> Min. : 35.0 Min. : 0.00 13 : 12 1:220
#> 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
#> Median :103.0 Median :10.00 20 : 12 3:120
#> Mean :121.8 Mean :10.72 10 : 12 4:118
#> 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
#> Max. :373.0 Max. :21.00 19 : 12
#> (Other):506
autodb
The simplest way to do normalisation is with the autodb
function:
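A minimal call, keeping the default arguments, produces the database object db that is plotted below:
db <- autodb(ChickWeight)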
There is no plotting method within the autodb
package
itself. Instead, there are functions to write normalised objects as
inputs to the Graphviz visualisation software, using the DOT
language.
The generic function to do this is gv
, which calls the
gv.database
method on our database, db
:
db_text <- gv(db)
cat(db_text)
#> digraph {
#> rankdir = "LR"
#> node [shape=plaintext];
#>
#> "Chick" [label = <
#> <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
#> <TR><TD COLSPAN="3">Chick (50 records)</TD></TR>
#> <TR><TD PORT="TO_chick">Chick</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_chick">ordered</TD></TR>
#> <TR><TD PORT="TO_diet">Diet</TD><TD></TD><TD PORT="FROM_diet">factor</TD></TR>
#> </TABLE>>];
#> "Time_Chick" [label = <
#> <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
#> <TR><TD COLSPAN="3">Time_Chick (578 records)</TD></TR>
#> <TR><TD PORT="TO_time">Time</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_time">numeric</TD></TR>
#> <TR><TD PORT="TO_chick">Chick</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_chick">ordered</TD></TR>
#> <TR><TD PORT="TO_weight">weight</TD><TD></TD><TD PORT="FROM_weight">numeric</TD></TR>
#> </TABLE>>];
#>
#> "Time_Chick":FROM_chick -> "Chick":TO_chick;
#> }
This can be saved to a file for passing to Graphviz elsewhere, or can
be passed into an R function that renders Graphviz code, such as
grViz
in the DiagrammeR
package:
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  show <- function(x) DiagrammeR::grViz(gv(x))
  maybe_plot <- function(x) DiagrammeR::grViz(gv(x))
} else {
  show <- print
  maybe_plot <- function(x) invisible(NULL)
}
Each relation is represented as a box, with the top row giving the relation name and the number of records, and the other rows detailing the attributes. In the middle is a grid of cells: each column of cells represents a (candidate) key for the relation, where attributes in that key have their cell filled in black.
We can see that ChickWeight
was split into two
relations, with a weight for each unique combination of Chick number and
time, and the Chick number by itself determining the diet. We can also
see the number of records in each relation, and the classes of the
attributes in those relations.
gv
also has methods for data.frame
, and for
database_schema
, which is introduced later.
Having looked at the final result first, we now look at the
individual steps. The first step is to find the (non-trivial and
minimal) dependencies present in the given data frame. There are various
ways to do this; by default, the package uses DFD, a depth-first search
algorithm. We run this using the discover
function, setting
progress
to TRUE
to see the steps taken:
deps <- discover(ChickWeight, accuracy = 1, progress = TRUE)
#> formatting numerical/complex variables with 7 significant digits
#> simplifying data types
#> constructing powerset
#> dependant weight
#> determinants available, starting search
#> dependant Time
#> determinants available, starting search
#> dependant Chick
#> determinants available, starting search
#> dependant Diet
#> determinants available, starting search
#> DFD complete
The accuracy
argument expects a number between zero and
one, determining what proportion of a data frame’s records need to
satisfy a given dependency for the algorithm to consider it valid.
Setting accuracy
to one, as done here, limits us to exact
dependencies. We’ll look at other values later, in the section on
approximate dependencies.
deps
#> 2 functional dependencies
#> 4 attributes: weight, Time, Chick, Diet
#> Time, Chick -> weight
#> Chick -> Diet
After simplifying the data to something quicker to iterate over – specifically, by converting all attributes to have integer values – the algorithm takes each attribute in turn as a possible dependant, and finds sets of the other attributes that determine it.
The result is a list of functional dependencies, in the format
determinant set -> dependant
, with an attribute named
attrs_order
that gives all the attribute names in their
original order. Each of these three parts has its own (generic)
extraction function:
detset(deps)
#> [[1]]
#> [1] "Time" "Chick"
#>
#> [[2]]
#> [1] "Chick"
dependant(deps)
#> [1] "weight" "Diet"
attrs_order(deps)
#> [1] "weight" "Time" "Chick" "Diet"
The former two, especially, are useful for filtering, as we’ll see later.
Now that we have a list of discovered dependencies, we can construct a database schema, where the relation schemas are normalised to third normal form. This is done using a version of Bernstein’s synthesis.
schema <- synthesise(deps)
schema
#> 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#> key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#> key 1: Time, Chick
Like the database before, we can also plot this database schema:
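Assuming the show helper defined above, and that gv accepts the schema object, this is a one-line call:
show(schema)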
This is similar to the database plot given before, but some information that requires the data frame itself is missing. We do have class information about the attributes, extracted from the data frame during dependency discovery, but we have no record counts for the individual relation schemas. However, we do have automatically-generated names for the individual relations.
Additionally, at this point we have no connections between the relation schemas, since Bernstein’s synthesis doesn’t supply information about foreign key references. We could use this database schema to build a database, but we’d rather add the foreign key references first.
Let’s look at a different data set for a moment, to see some cases where we don’t want to use the dependencies as given. We’ll use the Titanic data set, also provided with base R. This data is in array form, so we first convert it to data frame form:
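The conversion, and the table below, can be produced along these lines:
knitr::kable(as.data.frame(Titanic))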
Class | Sex | Age | Survived | Freq |
---|---|---|---|---|
1st | Male | Child | No | 0 |
2nd | Male | Child | No | 0 |
3rd | Male | Child | No | 35 |
Crew | Male | Child | No | 0 |
1st | Female | Child | No | 0 |
2nd | Female | Child | No | 0 |
3rd | Female | Child | No | 17 |
Crew | Female | Child | No | 0 |
1st | Male | Adult | No | 118 |
2nd | Male | Adult | No | 154 |
3rd | Male | Adult | No | 387 |
Crew | Male | Adult | No | 670 |
1st | Female | Adult | No | 4 |
2nd | Female | Adult | No | 13 |
3rd | Female | Adult | No | 89 |
Crew | Female | Adult | No | 3 |
1st | Male | Child | Yes | 5 |
2nd | Male | Child | Yes | 11 |
3rd | Male | Child | Yes | 13 |
Crew | Male | Child | Yes | 0 |
1st | Female | Child | Yes | 1 |
2nd | Female | Child | Yes | 13 |
3rd | Female | Child | Yes | 14 |
Crew | Female | Child | Yes | 0 |
1st | Male | Adult | Yes | 57 |
2nd | Male | Adult | Yes | 14 |
3rd | Male | Adult | Yes | 75 |
Crew | Male | Adult | Yes | 192 |
1st | Female | Adult | Yes | 140 |
2nd | Female | Adult | Yes | 80 |
3rd | Female | Adult | Yes | 76 |
Crew | Female | Adult | Yes | 20 |
This is a simple set of data, with a single count observation,
Freq
, for each combination of the four determining
attributes. In other words, the relation is already normalised, so we
only expect one relation in the normalised database.
If we use autodb
again, we get the following
database:
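A sketch of that call, with titanic_db as an illustrative name:
titanic_db <- autodb(as.data.frame(Titanic))
show(titanic_db)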
Oops! The DFD algorithm found some functional dependencies where the count could be used to determine another attribute. These are clearly spurious: frequency count can’t causally determine age, for example. However, the algorithm still finds these dependencies, because the counts are unique often enough to make these dependencies hold in the given data.
There are two approaches we can take to eliminate these spurious
dependencies: not letting them be detected in the first place, and
removing them before using synthesise
.
To stop them being detected, we can put constraints on what is
discovered by discover
: we can ask for certain attributes
to not be considered as determinants, or we can exclude attributes that
inherit from certain classes. In this example, we could exclude
Freq
from being considered:
titanic_deps_freqonly <- discover(as.data.frame(Titanic), 1, exclude = "Freq")
titanic_deps_freqonly
#> 1 functional dependency
#> 5 attributes: Class, Sex, Age, Survived, Freq
#> Class, Sex, Age, Survived -> Freq
Alternatively, we could exclude all attributes that inherit from “numeric”:
identical(titanic_deps_freqonly, discover(as.data.frame(Titanic), 1, exclude_class = "numeric"))
#> [1] TRUE
These can both be used as arguments to autodb
too:
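For example, either of these calls should give a database without the spurious relations:
show(autodb(as.data.frame(Titanic), exclude = "Freq"))
show(autodb(as.data.frame(Titanic), exclude_class = "numeric"))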
In general, excluding numeric attributes as determinants is useful, because we expect non-integer numbers to be measurements, not parts of a primary key.
Alternatively, we could remove unwanted dependencies before using
decompose
. Here are the found dependencies, if we don’t
exclude anything:
titanic_deps <- discover(as.data.frame(Titanic), 1)
titanic_deps
#> 3 functional dependencies
#> 5 attributes: Class, Sex, Age, Survived, Freq
#> Sex, Survived, Freq -> Age
#> Class, Survived, Freq -> Age
#> Class, Sex, Age, Survived -> Freq
We can remove the unwanted dependencies, where Age
is
the dependant:
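For example, filtering on the dependant with the extraction function shown earlier:
titanic_deps_filtered <- titanic_deps[dependant(titanic_deps) != "Age"]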
Getting back to our ChickWeight
example, we now have a
database schema, consisting of a list of relation
schemas.
However, we have no information about how these relation schemas are
linked to each other. In particular, we have no information about
foreign keys. We can add this information using
autoref
:
linked_schema <- autoref(schema)
linked_schema
#> database schema with 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#> key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#> key 1: Time, Chick
#> references:
#> Time_Chick.{Chick} -> Chick.{Chick}
We could also have used normalise
, instead of
synthesise
and autoref
separately:
normalise(deps)
#> database schema with 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#> key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#> key 1: Time, Chick
#> references:
#> Time_Chick.{Chick} -> Chick.{Chick}
Plotting this updated database schema shows the same relation schemas as before, linked together by foreign key references:
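Again assuming the show helper from earlier:
show(linked_schema)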
Finally, once we have our normalised database schema, we can apply it
to our original data frame, or a new one with the same structure. This
results in a normalised database, as we got from using
autodb
:
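A sketch of that step, with db_from_schema as an illustrative name:
db_from_schema <- decompose(ChickWeight, linked_schema)
show(db_from_schema)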
We now have the record counts added.
We can reverse the process of turning a data frame into a database
with the rejoin
function. This may not be identical to
ChickWeight
, since the rows may have been rearranged.
However, we can use the convenience function df_equiv
to
check for equivalence under row reordering:
rejoined <- rejoin(db)
summary(rejoined)
#> weight Time Chick Diet
#> Min. : 35.0 Min. : 0.00 13 : 12 1:220
#> 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
#> Median :103.0 Median :10.00 20 : 12 3:120
#> Mean :121.8 Mean :10.72 10 : 12 4:118
#> 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
#> Max. :373.0 Max. :21.00 19 : 12
#> (Other):506
identical(rejoined, ChickWeight)
#> [1] FALSE
df_equiv(rejoined, ChickWeight)
#> [1] TRUE
When rejoined, the relation attributes will be in the original order. However, the record order might have changed.
Included in the package is a 447-by-25 data frame called
nudge
:
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  DiagrammeR::grViz(gv(nudge, name = "nudge"))
} else {
  summary(nudge)
}
This is the data set for a meta-analysis, looking at the effectiveness of “nudge” interventions. Measurements are taken within a three-layer hierarchy: publications contain studies, which contain effect size measurements. We expect this hierarchy to be apparent in the normalisation.
Getting full dependency information for this relation can take a long time, so here we search for a reduced set, not considering numeric or sample size attributes as determinants:
nudge_deps <- discover(
nudge,
accuracy = 1,
exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
exclude_class = "numeric"
)
nudge_schema <- normalise(nudge_deps, remove_avoidable = TRUE)
show(nudge_schema)
We can see a relation, with many attributes, determined by the effect
size ID number, es_id
. This contains all of the numeric
measurements, as expected in the relation for the lowest level in the
hierarchy. As also expected, this has a foreign key reference to a
relation for study-level information, determined by the study ID number,
study_id
.
However, the publication-level relation this refers to is not
determined by the publication ID, publication_id
, as we
would expect; neither is it determined by the publication reference.
Instead, it is solely determined by the publication’s title: to use the
ID, we would need to supplement it with other information, like the
publication year, or other choices that come from the middle column of
relations, which look likely to be spurious.
This suggests that some publication ID numbers have been erroneously assigned to several publications, which we can easily test:
nudge_database <- decompose(nudge, nudge_schema)
nudge_title_relation <- records(nudge_database)$title
nudge_pid_duplicates <- unique(nudge_title_relation$publication_id[
duplicated(nudge_title_relation$publication_id)
])
knitr::kable(subset(nudge_title_relation, publication_id %in% nudge_pid_duplicates))
 | title | publication_id | reference | year |
---|---|---|---|---|
44 | Enhanced active choice: A new method to motivate behavior change | 95 | Keller et al. (2011) | 2011 |
130 | Nudging product choices: The effect of position change on snack bar choice | 95 | Keller et al. (2015) | 2015 |
We’d also expect reference
to determine the publication,
but this is also not the case: it needs additional information, such as
binary_outcome
.
This means that there are publications that share a reference:
nudge_reference_duplicates <- unique(nudge_title_relation$reference[
duplicated(nudge_title_relation$reference)
])
knitr::kable(subset(nudge_title_relation, reference %in% nudge_reference_duplicates))
 | title | publication_id | reference | year |
---|---|---|---|---|
214 | Nudge vs superbugs: A behavioural economics trial to reduce the overprescribing of antibiotics | 18 | BETA (2018) | 2018 |
399 | Energy labels that make cents | 19 | BETA (2018) | 2018 |
BETA is the Behavioural Economics Team of the Australian Government, so it’s not surprising that they’d have multiple publications/reports per year. Duplicate references are not necessarily an error, but they would be awkward if the references were to be used.
There are other, clearly spurious, relations involving
reference
. One of these combines it with
type_experiment
to determine location
. Since
reference
is a publication-level attribute, and
location
is a study-level attribute, this can’t be
right.
(As far as I’m aware, the publication ID and reference errors mentioned above would not have affected the meta-analysis results.)
The publication ID and the reference clearly have issues, so we need to decide what to do about them. Here, I’ll show what is usually the correct approach, which is to remove the spurious functional dependencies; we’ll look at the other options later.
Specifically, we remove all functional dependencies that we’d consider to be spurious: those where the publication ID or reference is part of a multiple-attribute determinant set, unless that determinant set is just those two attributes:
nudge_deps_filtered <- nudge_deps[
lengths(detset(nudge_deps)) == 1 |
vapply(
detset(nudge_deps),
\(ds) length(setdiff(ds, c("publication_id", "reference"))) != 1,
logical(1)
)
]
nudge_schema_filtered <- normalise(nudge_deps_filtered, remove_avoidable = TRUE)
show(nudge_schema_filtered)
We now, finally, have the study location in a non-spurious relation. If we want to, we could also add the dependencies we fail to find, due to excluding the sample size attributes from determinant sets:
nudge_deps_size <- discover(nudge[, startsWith(names(nudge), "n_")], 1)
nudge_deps_size
#> 3 functional dependencies
#> 4 attributes: n_study, n_comparison, n_control, n_intervention
#> n_control, n_intervention -> n_comparison
#> n_comparison, n_intervention -> n_control
#> n_comparison, n_control -> n_intervention
nudge_deps_final <- c(nudge_deps_filtered, nudge_deps_size)
nudge_schema_final <- normalise(nudge_deps_final, remove_avoidable = TRUE)
nudge_database_final <- decompose(nudge, nudge_schema_final)
show(nudge_schema_final)
The new schema confirms what we might expect, that the total sample size, and the sizes of control and treatment arms, only have two degrees of freedom: knowing two of them determines the other.
We’d expect this to be because the total sample size is the sum of the other two. However, the functional dependency can’t determine this. Indeed, if we check for this more specific condition, we find that it doesn’t hold:
knitr::kable(unique(subset(
nudge,
n_comparison != n_control + n_intervention,
c(
es_id,
reference,
title,
n_study,
n_comparison,
n_control,
n_intervention
)
)))
 | es_id | reference | title | n_study | n_comparison | n_control | n_intervention |
---|---|---|---|---|---|---|---|
237 | 180 | Hedlin & Sunstein (2016) | Does active choosing promote green energy use? Experimental evidence | 1037 | 1037 | 345 | 346 |
302 | 181 | Hedlin & Sunstein (2016) | Does active choosing promote green energy use? Experimental evidence | 1037 | 1037 | 345 | 346 |
Looking through the paper, the arm sample sizes aren’t given explicitly, but the numbers here are consistent with the parts of the study that have two treatments.
Automatic search for functional dependencies cannot check domain-specific constraints like the summation constraint above, but it can give schemas that suggest constraints to check for, such as the two-degrees relation above.
Going back to the final database schema, removing dependencies has
revealed extra information about study locations, via the new
title_type_experiment
relation: studies of the same
experiment type in a publication always have the same location. Looking
at the resulting database shows that this removes many entries of what
would be redundant location information if kept in the study
relation:
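For example, plotting the final database with the show helper from earlier:
show(nudge_database_final)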
While this is not a dependency we could expect to hold if more data was collected, it’s a reasonable dependency for the given data set, which won’t be added to.
The additional relation mentioned at the end of the previous section is also notable for what we’ll now discuss: alternative ways to get rid of spurious structure. How can we do this? We could:

- remove the offending attributes from the data frame before dependency discovery – for example, publication_id above, since it doesn’t do its intended job of unique publication identification – and re-do the entire process (re-running discover can be expensive; we don’t always want to throw out everything to do with an attribute, e.g. reference -> year, or title -> reference);
- pass the offending attributes to exclude, to disallow them in determinant sets, and re-do the entire process (same issues as above, except that we keep title -> reference, since an excluded reference can still be a dependant);
- remove the spurious functional dependencies before normalisation, as we did above;
- remove the offending relation schemas after normalisation.

What happens if we remove the offending schemas?
nudge_schema_relfiltered <- nudge_schema[
!grepl("publication_id_", names(nudge_schema), fixed = TRUE) &
!grepl("_publication_id", names(nudge_schema), fixed = TRUE) &
!grepl("reference_", names(nudge_schema), fixed = TRUE) &
!grepl("_reference", names(nudge_schema), fixed = TRUE)
]
Subsetting a database schema also removes any foreign key references involving removed schemas, so the resulting schema is still valid. However, any foreign key reference chains with removed schemas in the middle are broken. Amongst others, in this case we lose the reference between the three relations for the three hierarchical data levels:
Re-running autoref
has no effect, because the
es_id
section doesn’t contain any key for the
title
section:
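A sketch of that check, assuming autoref accepts the subsetted database schema, as the text implies:
show(autoref(nudge_schema_relfiltered))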
Because the database subschema including the es_id
relation only has publication_id
for distinguishing
publications, effect sizes for the two publications with the same ID
cannot be uniquely associated with their publication: removing schemas
irreversibly separates two parts of the database schema.
Furthermore, compared to nudge_schema_filtered
, there is
a relation schema missing: directly removing schemas loses the structure
information that results in the title_type_experiment
relation. How does this happen?
The relation isn’t present in the original schema, nudge_schema, because the full set of functional dependencies makes its implied dependency transitive. Specifically, nudge_deps contains the following functional dependencies:
example_fds <- functional_dependency(
list(
list("title", "reference"),
list(c("reference", "type_experiment"), "location"),
list(c("title", "type_experiment"), "location")
),
c("title", "reference", "type_experiment", "location")
)
example_fds
#> 3 functional dependencies
#> 4 attributes: title, reference, type_experiment, location
#> title -> reference
#> reference, type_experiment -> location
#> title, type_experiment -> location
The first two dependencies make the third transitive, so it’s ignored
when constructing the relation schemas: it’s implied, and enforced, by
the title
and reference_type_experiment
relation schemas.
However, when reference, type_experiment -> location
is removed as spurious,
title, type_experiment -> location
is no longer
transitive, so it now appears as its own relation schema.
When we instead remove the reference_type_experiment
schema, this implicitly also throws away the
title, type_experiment -> location
dependency that it
co-implies. We throw away non-spurious information that we don’t intend
to.
Between the risk of irreversibly breaking the database schema’s structure, and the risk of unintentionally removing additional structure information, I hope it’s clear why directly removing relations should be avoided.
Larger datasets can often have entry errors, without an easy way to remove or deal with them. For this reason, we might be interested in “approximate” functional dependencies, which hold after removing a bounded number of violating records.
Suppose that we normalise nudge
again, without any
manual dependency removal, but allowing approximate dependencies. We
could cheat, knowing that the questionable data example found above
showed there to be two questionable records: one for a duplicated
publication ID, and one for a duplicated reference. Since
nudge
has 447 records, we can get rid of the resulting
questionable dependencies by setting the accuracy to allow two
discrepant records.
nudge_approx_cheat_database_schema <- discover(
nudge,
accuracy = 1 - 2/nrow(nudge),
exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
exclude_class = "numeric"
) |>
normalise()
show(nudge_approx_cheat_database_schema)
Currently decompose
doesn’t account for approximate
dependencies, resulting in invalid databases, so here we just work with
the database schema.
Compare this to the database schema we arrived at manually:
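For reference, the manually-filtered schema can be plotted again with the show helper from earlier:
show(nudge_schema_filtered)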
One notable difference is that, with discrepant records allowed, approximation is approximately determined by some publication- and study-level attributes, rather than functionally determined at the effect-size level.
Lowering the accuracy results in more dependencies being found, and so in more relations. The number of relations can get very large. For example, suppose we instead set the accuracy to 0.99, returning any approximate dependencies that hold in at least 443 records.
nudge_approx_database_schema <- discover(
nudge,
accuracy = 0.99,
exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
exclude_class = "numeric"
) |>
normalise()
show(nudge_approx_database_schema)
This is a little overwhelming, so we’ll use a utility function called
reduce
. This returns only the named “main” relation, and
any relations for which it is a descendant via foreign key references.
Reducing the approximate schema, with the effect size table as the main
table, gives us this set of relations:
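A sketch of that reduction, assuming the effect-size relation is named es_id as in the plots above:
show(reduce(nudge_approx_database_schema, "es_id"))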
Questionable intermediate relations aside, we can see that there is now a publication-level relation with the publication ID as a key, since it determines the publication attributes once we remove one of the duplicate records we discovered before.
reduce
can also be used on databases, where it considers
each relation with the largest number of records as a main table. It
needs to be used with caution: while the relations it removes are often
spurious, it’s possible to find databases where relations not linked to
the main relation by foreign key references are required to rejoin to
the original data frame, so reduction can remove necessary relations.
Its intent is mostly to make glancing at database plots more
manageable.
The next normal form after third normal form (3NF) is Boyce-Codd normal form (BCNF). Ensuring BCNF is enforced by the database is trickier, as in some cases it can’t be enforced with just relations and foreign key constraints.
However, the package includes an option to convert to enhanced third normal form, also known as LTK form, which can be so enforced. This enhancement is tangential to BCNF, and could also be used to enhance schemas in BCNF.
In brief, the standard normal forms only put constraints on the attributes present in the relations, one relation at a time. The enhancement is a constraint on the attributes present in a relation, while considering their presence in other relations. If an attribute in a relation can be removed, and still be determined from that relation by joining it to others, then the attribute is “avoidable”, and can be removed. If the attribute is in any of the relation’s keys, those keys are replaced by keys that use the attributes not being removed. This removes attributes from relations without removing any information from the database as a whole.
For example, we can take this simple example from Chapter 6 of The Theory of Relational Databases, by David Maier:
avoid_deps <- functional_dependency(
list(
list("A", "B"),
list("B", "A"),
list(c("A", "C"), "D"),
list(c("A", "C"), "E"),
list(c("B", "D"), "C")
),
attrs_order = c("A", "B", "C", "D", "E")
)
avoid_deps
#> 5 functional dependencies
#> 5 attributes: A, B, C, D, E
#> A -> B
#> B -> A
#> A, C -> D
#> A, C -> E
#> B, D -> C
avoid_schema <- normalise(avoid_deps)
show(avoid_schema)
Attributes A
and B
are equivalent, since
relation A
has them both as a key. In other words, relation
A
is a simple lookup relation. Because of this, we could
remove B
from relation A_C
, and replace the
key B, D
with A, D
, which is equivalent when
accounting for relation A
.
We can have this removal of avoidable attributes done automatically,
using the remove_avoidable
flag for
normalise
:
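A sketch of that call, with avoid_schema_removed as an illustrative name:
avoid_schema_removed <- normalise(avoid_deps, remove_avoidable = TRUE)
show(avoid_schema_removed)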
This schema is now in LTK form, with no remaining avoidable
attributes. We could have also removed A
from relation
A_C
instead of B
, so this process may not have
a unique result. The package’s implementation prefers to remove
attributes that appear later in the original relation.
Sometimes, the structure for some attributes depends on whether other attributes are missing. Here are two possible reasons:
For example, I commonly get data sets where certain values can be given as either single values, or endpoints of an interval, and this “union type” data is spread across several attributes:
df_options <- data.frame(
id = 1:20,
value = c(2.3, 2.3, 5.7, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
lower_bound = c(NA_real_, NA_real_, NA_real_, 2.4, 0, 1, 0, 5.6, 2.4, 5.3, 5.3, 2.4, 2.4, 2.4, 2.4, 2.4, 2.4, 2.4, 5.6, 2.4),
upper_bound = c(NA_real_, NA_real_, NA_real_, 7.1, 10, 10, 13.1, 25.8, 10, 13.1, 10, 25.8, 25.8, 25.8, 25.8,13.1, 13.1, 25.8, 25.8, 25.8),
interval_distribution = c(NA, NA, NA, "uniform", "uniform", "uniform", "uniform", "uniform", "Beta", "Beta", "Beta", "Beta", "Kumaraswamy", "Kumaraswamy", "Kumaraswamy", "Kumaraswamy", "PERT", "PERT", "PERT", "PERT"),
param1 = c(NA, NA, NA, NA, NA, NA, NA, NA, 1, 1, 1, 2, 2, 2.1, 2, 2, 2, 1, 2, 2),
param2 = c(NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 2, 2, 2, 1, 1, 1, NA, NA, NA, NA)
)
df_options$interval_distribution <- factor(df_options$interval_distribution)
knitr::kable(df_options)
id | value | lower_bound | upper_bound | interval_distribution | param1 | param2 |
---|---|---|---|---|---|---|
1 | 2.3 | NA | NA | NA | NA | NA |
2 | 2.3 | NA | NA | NA | NA | NA |
3 | 5.7 | NA | NA | NA | NA | NA |
4 | NA | 2.4 | 7.1 | uniform | NA | NA |
5 | NA | 0.0 | 10.0 | uniform | NA | NA |
6 | NA | 1.0 | 10.0 | uniform | NA | NA |
7 | NA | 0.0 | 13.1 | uniform | NA | NA |
8 | NA | 5.6 | 25.8 | uniform | NA | NA |
9 | NA | 2.4 | 10.0 | Beta | 1.0 | 1 |
10 | NA | 5.3 | 13.1 | Beta | 1.0 | 2 |
11 | NA | 5.3 | 10.0 | Beta | 1.0 | 2 |
12 | NA | 2.4 | 25.8 | Beta | 2.0 | 2 |
13 | NA | 2.4 | 25.8 | Kumaraswamy | 2.0 | 2 |
14 | NA | 2.4 | 25.8 | Kumaraswamy | 2.1 | 1 |
15 | NA | 2.4 | 25.8 | Kumaraswamy | 2.0 | 1 |
16 | NA | 2.4 | 13.1 | Kumaraswamy | 2.0 | 1 |
17 | NA | 2.4 | 13.1 | PERT | 2.0 | NA |
18 | NA | 2.4 | 25.8 | PERT | 1.0 | NA |
19 | NA | 5.6 | 25.8 | PERT | 2.0 | NA |
20 | NA | 2.4 | 25.8 | PERT | 2.0 | NA |
autodb
has no way to know about these relationships,
because it doesn’t treat missing values as a special case (see the
“Missing values” section in “Planned extensions” below). The resulting
schema ignores them:
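For example, running autodb directly on the data frame and plotting the result with the show helper from earlier:
show(autodb(df_options))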
However, from looking at the data ourselves, we can see a clearly-implied structure: each record holds either a single value or an interval, and only intervals have an interval distribution, which in turn determines which parameters are present.
As a general hack for finding these sorts of relationships, I like taking each attribute with missing values, and adding a companion attribute that states whether the value is present or missing:
df_options_with_presence <- data.frame(
id = df_options$id,
value = df_options$value,
value_present = !is.na(df_options$value),
lower_bound = df_options$lower_bound,
lower_bound_present = !is.na(df_options$lower_bound),
upper_bound = df_options$upper_bound,
upper_bound_present = !is.na(df_options$upper_bound),
interval_distribution = df_options$interval_distribution,
interval_distribution_present = !is.na(df_options$interval_distribution),
param1 = df_options$param1,
param1_present = !is.na(df_options$param1),
param2 = df_options$param2,
param2_present = !is.na(df_options$param2)
)
This is not always practical, because adding columns rapidly increases the running time for the dependency search. However, when it’s practical, it helps to make the structure more apparent in the database schema:
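A sketch of that run, with db_options_with_presence as an illustrative name:
db_options_with_presence <- autodb(df_options_with_presence)
show(db_options_with_presence)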
This shows that the values, bounds, and interval distributions inform the presence of each other: they drive the main union type.
 | value_present | lower_bound_present | upper_bound_present | interval_distribution_present |
---|---|---|---|---|
1 | TRUE | FALSE | FALSE | FALSE |
4 | FALSE | TRUE | TRUE | TRUE |
We can also see how the interval distribution determines how many parameters are required:
 | interval_distribution | value_present | param1_present | param2_present |
---|---|---|---|---|
1 | NA | TRUE | FALSE | FALSE |
4 | uniform | FALSE | FALSE | FALSE |
9 | Beta | FALSE | TRUE | TRUE |
13 | Kumaraswamy | FALSE | TRUE | TRUE |
17 | PERT | FALSE | TRUE | FALSE |
There’s clearly additional structure here, since there’s an existence
constraint: if param2
is present, then param1
must also be present. If we notice a situation like this, we can attempt
further investigation by splitting the data, and passing each part into
autodb
separately:
db_options_with_presence_p1absent <- autodb(subset(
df_options_with_presence,
!param1_present
))
show(db_options_with_presence_p1absent)
In this case, the constraint is reflected by the first subset, where
param1
is absent, always having param2
absent,
as indicated by both param2_present
and param2
being constant:
param1 | param1_present | param2 | param2_present |
---|---|---|---|
NA | FALSE | NA | FALSE |
Since this data splitting is currently a manual process, there are practical limits to how far we can take this.
It’s expected that an R user might pass a data frame into the package that has duplicate records. At the moment, these are kept when searching for dependencies. This can affect results for approximate dependencies, since it affects the record counts. However, duplicates are removed when the data frame is decomposed into a database. I’m not yet certain whether they are best handled by removing them, or by simply returning an error if there are duplicate records.
Strictly speaking, autodb
does not search for functional
dependencies, because these can’t account for missing values. Instead,
it searches for a weaker variant, which has been called a
literal functional dependency (LFD) in the literature.
The functional dependency X -> Y
holds if, for any
two possible records whose values in X
are equal,
their values in Y
are also equal. This requires
attribute values to take the equality operator, ==
in R. Additionally, they must take it under binary logic: whether two
values are equal can only be true or false, not missing. This disallows
the presence of missing values (NA
and NaN
),
since testing for equality with a missing value returns a missing
result.
The literal functional dependency X -> Y
holds, or
the functional dependency X -> Y
holds literally, if,
for any two possible records whose values in X
are
identical, their values in Y
are also
identical, where values are identical if both are non-missing
and equal, or both are missing. In other words, missing values are
treated as not equal to non-missing values, but as equal to each other.
This requires attribute classes to take the identity operator,
identical()
with default optional parameter values in R.
Again, they must take it under binary logic, but the identity operator
cannot return non-binary results anyway: In R, for example,
identical()
can only return TRUE
or
FALSE
, never NA
. There is, therefore, no
constraint on missing values in attribute classes, or on attribute
classes in general.
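As a quick base-R illustration of the difference between the two operators:
NA == NA          # NA: equality cannot give a binary answer for missing values
identical(NA, NA) # TRUE: identity treats two missing values as identical
identical(NA, 1L) # FALSE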
LFDs are more generic than standard functional dependencies: since practically every class takes the identity operator, they practically make no assumptions about the attribute classes present in a data set.
There are other FD variants that handle missing values, but, unlike most of them, LFDs still satisfy Armstrong’s axioms, allowing them to be used in normalisation in the same way as normal FDs. For example, they still respect transitivity: if X -> Y and Y -> Z literally, then X -> Z literally.
However, ignoring the special status of missing values in this way ignores important structural information. In the relational model, and ideally in relational databases, we want to have no missing values at all, partly to avoid awkward questions about what trinary logic means with respect to filtering or joining tables. For example, take the following data frame:
df_nas <- data.frame(
patient = c(1L, 2L, 3L, 4L),
trial_entry_date = as.Date(c("2022/05/02", "2022/06/06", "2022/04/01", "2022/03/19")),
death_date = as.Date(c(NA, NA, "2022/10/07", NA))
)
knitr::kable(df_nas)
patient | trial_entry_date | death_date |
---|---|---|
1 | 2022-05-02 | NA |
2 | 2022-06-06 | NA |
3 | 2022-04-01 | 2022-10-07 |
4 | 2022-03-19 | NA |
autodb
currently treats NA
as just another
value, which results in the initial data frame not being split in the
database schema:
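For example, assuming the show helper from earlier:
show(autodb(df_nas))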
In this case, a missing death date represents no death, and we would prefer to move death information to a separate relation, containing only patients with a death date, in this case patient 3:
ideal_db <- decompose(
df_nas,
database_schema(
relation_schema(
list(
patient = list(c("patient", "trial_entry_date"), list("patient")),
patient_death = list(c("patient", "death_date"), list("patient"))
),
names(df_nas)
),
list(list("patient_death", "patient", "patient", "patient"))
)
)
records(ideal_db)$patient_death <- subset(records(ideal_db)$patient_death, !is.na(death_date))
show(ideal_db)
This makes more of the data’s structure apparent, and enforceable, in the database structure, since it makes it clear that the death date is optional.
Removing missing values in this way requires additional steps in
dependency detection, as well as normalisation, than autodb
currently has: it requires keeping additional information about when
attributes are missing, compared to the others.
Rejoining databases, and checking relations satisfy foreign key
constraints, is done using merge.data.frame
. This means
that data classes that don’t work properly in merge
aren’t
guaranteed to work properly in autodb
.
Any such issues come from how certain data classes are handled during
merges in the base language, so they are issues with R, rather than with
autodb
, and I have no plans to fix them. If
autodb
seems to have odd failures, check that used data
classes behave correctly in merges.
For example, in older versions of R, the built-in POSIXct date/time
class didn’t have values merged properly, because the merge didn’t
account for differences in time zone / daylight saving time. This would
result in, for example, the nycflights13::weather
data set
violating the foreign key constraints of its own discovered schema,
since one foreign key used a POSIXct attribute.
A more complex example, that still applies and probably always will,
is a merge where two attributes being merged on have different classes.
In general, this is allowed: since autodb
is written for R,
a dynamically-typed language, it follows SQLite in not constraining the
user much when it comes to data classes in schemas. For primitive
classes, R’s class coercion usually makes things work as you’d
expect.
However, merging a factor attribute with a non-factor, non-character attribute should be handled with caution if the latter contains values not in the factor’s level set.
For example, we can define these data frames:
df_badmerge_int <- cbind(
expand.grid(
a = c(NA, 0L, 1L),
b = c(NA, FALSE, TRUE)
),
row = 1:9
)
df_badmerge_factor <- df_badmerge_int
df_badmerge_factor$a <- as.factor(df_badmerge_factor$a)
knitr::kable(df_badmerge_int)
a | b | row |
---|---|---|
NA | NA | 1 |
0 | NA | 2 |
1 | NA | 3 |
NA | FALSE | 4 |
0 | FALSE | 5 |
1 | FALSE | 6 |
NA | TRUE | 7 |
0 | TRUE | 8 |
1 | TRUE | 9 |
df_badmerge_logical <- df_badmerge_int
df_badmerge_logical$a <- as.logical(df_badmerge_logical$a)
names(df_badmerge_logical)[[3]] <- "row2"
knitr::kable(df_badmerge_logical)
a | b | row2 |
---|---|---|
NA | NA | 1 |
FALSE | NA | 2 |
TRUE | NA | 3 |
NA | FALSE | 4 |
FALSE | FALSE | 5 |
TRUE | FALSE | 6 |
NA | TRUE | 7 |
FALSE | TRUE | 8 |
TRUE | TRUE | 9 |
We can then merge the logical-a data frame with each of the other two, keeping the row attributes to track which records were merged.
Whichever other data frame we merge with, the two sets of
a
values have different classes, so R does coercion. When
merging with just a
, this gives the result we’d expect, for
both other data frames and regardless of merge order. For the integer
version, the logical values are coerced to integers:
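A sketch of those merges, assuming the b column is dropped so that only a is merged on, and the row-tracking columns are kept:
knitr::kable(merge(df_badmerge_int[, c("a", "row")], df_badmerge_logical[, c("a", "row2")]))
knitr::kable(merge(df_badmerge_logical[, c("a", "row2")], df_badmerge_int[, c("a", "row")]))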
a | row | row2 |
---|---|---|
0 | 2 | 2 |
0 | 2 | 5 |
0 | 2 | 8 |
0 | 5 | 2 |
0 | 5 | 5 |
0 | 5 | 8 |
0 | 8 | 2 |
0 | 8 | 5 |
0 | 8 | 8 |
1 | 6 | 6 |
1 | 6 | 3 |
1 | 6 | 9 |
1 | 3 | 6 |
1 | 3 | 3 |
1 | 3 | 9 |
1 | 9 | 6 |
1 | 9 | 3 |
1 | 9 | 9 |
NA | 1 | 1 |
NA | 1 | 7 |
NA | 1 | 4 |
NA | 7 | 1 |
NA | 7 | 7 |
NA | 7 | 4 |
NA | 4 | 1 |
NA | 4 | 7 |
NA | 4 | 4 |
a | row2 | row |
---|---|---|
FALSE | 2 | 2 |
FALSE | 2 | 5 |
FALSE | 2 | 8 |
FALSE | 5 | 2 |
FALSE | 5 | 5 |
FALSE | 5 | 8 |
FALSE | 8 | 2 |
FALSE | 8 | 5 |
FALSE | 8 | 8 |
TRUE | 6 | 6 |
TRUE | 6 | 3 |
TRUE | 6 | 9 |
TRUE | 3 | 6 |
TRUE | 3 | 3 |
TRUE | 3 | 9 |
TRUE | 9 | 6 |
TRUE | 9 | 3 |
TRUE | 9 | 9 |
NA | 1 | 1 |
NA | 1 | 7 |
NA | 1 | 4 |
NA | 7 | 1 |
NA | 7 | 7 |
NA | 7 | 4 |
NA | 4 | 1 |
NA | 4 | 7 |
NA | 4 | 4 |
For the factor version, the logical values are coerced to factors,
but they don’t match any of the given levels, so they all become
NA
:
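A sketch of the corresponding factor-version merges, under the same assumptions as above:
knitr::kable(merge(df_badmerge_factor[, c("a", "row")], df_badmerge_logical[, c("a", "row2")]))
knitr::kable(merge(df_badmerge_logical[, c("a", "row2")], df_badmerge_factor[, c("a", "row")]))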
a | row | row2 |
---|---|---|
NA | 7 | 7 |
NA | 7 | 4 |
NA | 7 | 1 |
NA | 4 | 7 |
NA | 4 | 4 |
NA | 4 | 1 |
NA | 1 | 7 |
NA | 1 | 4 |
NA | 1 | 1 |
a | row2 | row |
---|---|---|
NA | 7 | 7 |
NA | 7 | 4 |
NA | 7 | 1 |
NA | 4 | 7 |
NA | 4 | 4 |
NA | 4 | 1 |
NA | 1 | 7 |
NA | 1 | 4 |
NA | 1 | 1 |
However, we see unexpected behaviour with the factor version, when
also merging on another attribute, b
: the merge result now
depends on the input order. With the factor version first, the result is
similar to before:
knitr::kable(merge(
df_badmerge_factor,
df_badmerge_logical
))
#> Warning in `[<-.factor`(`*tmp*`, ri, value = c(NA, FALSE, TRUE, NA, FALSE, :
#> invalid factor level, NA generated
a | b | row | row2 |
---|---|---|---|
NA | FALSE | 4 | 4 |
NA | FALSE | 4 | 5 |
NA | FALSE | 4 | 6 |
NA | NA | 1 | 1 |
NA | NA | 1 | 2 |
NA | NA | 1 | 3 |
NA | TRUE | 7 | 7 |
NA | TRUE | 7 | 8 |
NA | TRUE | 7 | 9 |
With the logical version first, however, only the logical
a
values that are NA
before coercion are kept,
rather than all of them:
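That is, the same merge as before, but with the logical data frame as the first argument:
knitr::kable(merge(
  df_badmerge_logical,
  df_badmerge_factor
))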
a | b | row2 | row |
---|---|---|---|
NA | FALSE | 4 | 4 |
NA | NA | 1 | 1 |
NA | TRUE | 7 | 7 |
In autodb, this could be a problem if two relations in a foreign key constraint contain the same attribute, but one of them stores it as a factor in its records, due to separate insertions. Letting an attribute’s class vary across relations should, therefore, be done with caution.
Bernstein’s synthesis is guaranteed to minimise the number of relations created for a given set of functional dependencies, and removing avoidable attributes can reduce the number of attributes in those relations. However, there can still be redundant keys. For example, we can take the following set of functional dependencies:
fds_redkey <- functional_dependency(
list(
list("a", "b"),
list("d", "c"),
list(c("b", "d"), "a"),
list("a", "c"),
list(c("b", "c"), "d")
),
letters[1:4]
)
fds_redkey
#> 5 functional dependencies
#> 4 attributes: a, b, c, d
#> a -> b
#> d -> c
#> b, d -> a
#> a -> c
#> b, c -> d
Normalising gives the following relations:
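A sketch of that step, with schema_redkey as an illustrative name:
schema_redkey <- normalise(fds_redkey, remove_avoidable = TRUE)
show(schema_redkey)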
These relations have some redundancy: relation a
implies
{b, d} -> c
, but relation d
implies that
{d} -> c
. This isn’t resolved by removing avoidable
attributes, because d
still needs to be in relation
a
: we just need to remove {b, d}
as a key.
However, this is resolved if we instead use this set of functional
dependencies, which is equivalent to the previous set:
fds_redkey_fix <- functional_dependency(
list(
list("a", "b"),
list("d", "c"),
list(c("b", "c"), "a"),
list("a", "d")
),
letters[1:4]
)
fds_redkey_fix
#> 4 functional dependencies
#> 4 attributes: a, b, c, d
#> a -> b
#> d -> c
#> b, c -> a
#> a -> d
schema_redkey_fix <- normalise(fds_redkey_fix, remove_avoidable = TRUE)
show(schema_redkey_fix)
Unfortunately, there’s no way in the package to find better sets like this.
As mentioned in the section on avoidable attributes, normal forms only refer to one relation at a time: referring to a database as following a normal form just means its relations follow it. Even if we also add the foreign key references that normal forms don’t require, this leaves room for database schemas that satisfy normal forms, but that we wouldn’t want to use.
For a simple example, take this database schema, whose relation schemas are in third normal form:
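As a hypothetical illustration (not necessarily the original example), take two relation schemas, each in third normal form, built with the relation_schema constructor used earlier:
simple_3nf <- relation_schema(
  list(
    a = list(c("a", "b"), list("a")),
    b = list(c("b", "c"), list("b"))
  ),
  letters[1:3]
)
show(simple_3nf)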
If we now create copies of these relations, with the intention that copies always contain the same data, then all relations are still in third normal form, and so we’d say the database is too:
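Continuing the hypothetical illustration, the copied schemas might look like this:
simple_3nf_copies <- relation_schema(
  list(
    a = list(c("a", "b"), list("a")),
    a.copy = list(c("a", "b"), list("a")),
    b = list(c("b", "c"), list("b")),
    b.copy = list(c("b", "c"), list("b"))
  ),
  letters[1:3]
)
show(simple_3nf_copies)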
However, no one would claim that this is a good database design, since there is clearly a large amount of data redundancy. Higher normal forms would not change this.