Using autodb

library(autodb)
#> 
#> Attaching package: 'autodb'
#> The following object is masked from 'package:stats':
#> 
#>     decompose

Terminology

To avoid confusion, this vignette uses consistent terminology: a database consists of relations, which contain records that share the same attributes. This follows the terms used in the relational model; in statistics, we would instead talk about tables, which have rows with values for the same variables, with a column for each variable.

For the original data set, which is a flat table, we refer to a data frame instead, which also consists of records with attributes.

When we talk about the type of values an attribute can take, we talk about a class, as in the relational model and R, rather than a value type, as in many other programming languages.

Motivation

autodb is described in the DESCRIPTION file in the following way:

“Automatic normalisation of a data frame to third normal form, with the intention of easing the process of data cleaning. (Usage to design your actual database for you is not advised.)”

Let us begin there.

For data cleaning

Database normalisation works on the principle that our data should only express a piece of information in a single place. This is done to reduce the chance that a modification will accidentally change or remove other pieces of information.

Data cleaning works on the principle that we remove duplicate observations and structural errors, and have a clear understanding of any associations between different observations, or parts of observations. This is done to ensure a high quality for the data used in analysis, but I have found that it is equally important for making sure that I understand the domain I am hoping to model.

Normalisation is helpful for both of these areas. I usually use the original, flat, “tidy” data for the analysis itself. However, for checking the data’s consistency, and for using the data to learn about the domain, normalised data is often helpful, whether it’s used for the analysis or not.

I usually do data checking under the following conditions:

It’s in this context that I would usually have to do the normalisation by hand, and I would like a tool that does it semi-automatically. This package is intended to be that tool, using a minimal number of additional packages.

Not for database design

Good database design is a hard and involved process, which requires a strong grasp of the domain that the database will hold information on. Please don’t try to use this package to do it on autopilot. In particular, the conditions described above mean we do things differently from good practice for normalised relational databases:

To third normal form

autodb normalises a given data frame to third normal form: every attribute depends on the whole of the key(s), and non-key attributes depend on nothing but the key(s). This was chosen because there is an existing algorithm, Bernstein’s synthesis, for normalising to third normal form, and because it’s the highest normal form that can always be attained for arbitrary original data frames while preserving all the functional dependencies. There is an additional enhancement available as an option: see the section on avoidable attributes for more details.

Simple examples

For most of these simple examples, we use the ChickWeight dataset, included with base R.

summary(ChickWeight)
#>      weight           Time           Chick     Diet   
#>  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
#>  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
#>  Median :103.0   Median :10.00   20     : 12   3:120  
#>  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
#>  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
#>  Max.   :373.0   Max.   :21.00   19     : 12          
#>                                  (Other):506

Normalisation with autodb

The simplest way to do normalisation is with the autodb function:

db <- autodb(ChickWeight)
db
#> database with 2 relations
#> 4 attributes: weight, Time, Chick, Diet
#> relation Chick: Chick, Diet; 50 records
#>   key 1: Chick
#> relation Time_Chick: Time, Chick, weight; 578 records
#>   key 1: Time, Chick
#> references:
#> Time_Chick.{Chick} -> Chick.{Chick}

Plots

There is no plotting method within the autodb package itself. Instead, there are functions to write normalised objects as inputs to the Graphviz visualisation software, using the DOT language.

The generic function to do this is gv, which calls the gv.database method on our database, db:

db_text <- gv(db)
cat(db_text)
#> digraph {
#>   rankdir = "LR"
#>   node [shape=plaintext];
#> 
#>   "Chick" [label = <
#>     <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
#>     <TR><TD COLSPAN="3">Chick (50 records)</TD></TR>
#>     <TR><TD PORT="TO_chick">Chick</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_chick">ordered</TD></TR>
#>     <TR><TD PORT="TO_diet">Diet</TD><TD></TD><TD PORT="FROM_diet">factor</TD></TR>
#>     </TABLE>>];
#>   "Time_Chick" [label = <
#>     <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4">
#>     <TR><TD COLSPAN="3">Time_Chick (578 records)</TD></TR>
#>     <TR><TD PORT="TO_time">Time</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_time">numeric</TD></TR>
#>     <TR><TD PORT="TO_chick">Chick</TD><TD BGCOLOR="black"></TD><TD PORT="FROM_chick">ordered</TD></TR>
#>     <TR><TD PORT="TO_weight">weight</TD><TD></TD><TD PORT="FROM_weight">numeric</TD></TR>
#>     </TABLE>>];
#> 
#>   "Time_Chick":FROM_chick -> "Chick":TO_chick;
#> }

This can be saved to a file for passing to Graphviz elsewhere, or can be passed into an R function that renders Graphviz code, such as grViz in the DiagrammeR package:

if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  # DiagrammeR is available: both helpers render the Graphviz code as a plot
  show <- function(x) DiagrammeR::grViz(gv(x))
  maybe_plot <- function(x) DiagrammeR::grViz(gv(x))
} else {
  # fall back to printing, and skip the plots entirely
  show <- print
  maybe_plot <- function(x) invisible(NULL)
}
maybe_plot(db)

Each relation is represented as a box, with the top row giving the relation name and the number of records, and the other rows detailing the attributes. In the middle is a grid of cells: each column of cells represents a (candidate) key for the relation, where attributes in that key have their cell filled in black.

We can see that ChickWeight was split into two relations, with a weight for each unique combination of Chick number and time, and the Chick number by itself determining the diet. We can also see the number of records in each relation, and the classes of the attributes in those relations.

gv also has methods for data.frame, and for database_schema, which is introduced later.
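For example, gv can be called directly on the original data frame, mirroring how the nudge data frame is plotted later on (rendered here only if DiagrammeR is available):

if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  DiagrammeR::grViz(gv(ChickWeight, name = "ChickWeight"))
}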

Finding functional dependencies

Having looked at the final result first, we now look at the individual steps. The first step is to find the (non-trivial and minimal) dependencies present in the given data frame. There are various ways to do this; by default, the package uses DFD, a depth-first search algorithm. We run this using the discover function, setting progress to TRUE to see the steps taken:

deps <- discover(ChickWeight, accuracy = 1, progress = TRUE)
#> formatting numerical/complex variables with 7 significant digits
#> simplifying data types
#> constructing powerset
#> dependant weight
#> determinants available, starting search
#> dependant Time
#> determinants available, starting search
#> dependant Chick
#> determinants available, starting search
#> dependant Diet
#> determinants available, starting search
#> DFD complete

The accuracy argument expects a number between zero and one, determining what proportion of a data frame’s records need to satisfy a given dependency for the algorithm to consider it valid. Setting accuracy to one, as done here, limits us to exact dependencies. We’ll look at other values later, in the section on approximate dependencies.

deps
#> 2 functional dependencies
#> 4 attributes: weight, Time, Chick, Diet
#> Time, Chick -> weight
#>       Chick -> Diet

After simplifying the data to something quicker to iterate over – specifically, by converting all attributes to have integer values – the algorithm takes each attribute in turn as a possible dependant, and finds sets of the other attributes that determine it.

The result is a list of functional dependencies, in the format determinant set -> dependant, with an attribute named attrs_order that gives all the attribute names in their original order. Each of these three parts has its own (generic) extraction function:

detset(deps)
#> [[1]]
#> [1] "Time"  "Chick"
#> 
#> [[2]]
#> [1] "Chick"
dependant(deps)
#> [1] "weight" "Diet"
attrs_order(deps)
#> [1] "weight" "Time"   "Chick"  "Diet"

The first two, especially, are useful for filtering, as we’ll see later.
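For instance, here is a quick taste of the kind of filtering used later in this vignette, just subsetting the dependencies found above:

# keep only dependencies with a single-attribute determinant set
deps[lengths(detset(deps)) == 1]
# keep only dependencies with Diet as the dependant
deps[dependant(deps) == "Diet"]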

Normalisation

Now that we have a list of discovered dependencies, we can construct a database schema, where the relation schemas are normalised to third normal form. This is done using a version of Bernstein’s synthesis.

schema <- synthesise(deps)
schema
#> 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#>   key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#>   key 1: Time, Chick

As with the database earlier, we can also plot this database schema:

maybe_plot(schema)

This is similar to the database plot given before, but some information that requires the data frame itself is missing. We do have class information about the attributes, extracted from the data frame during dependency discovery, and automatically-generated names for the individual relation schemas, but we have no record counts for them.

Additionally, at this point we have no connections between the relation schemas, since Bernstein’s synthesis doesn’t supply information about foreign key references. We could use this database schema to build a database, but we’d rather add the foreign key references first.

Tuning detection and normalisation

Let’s turn to a different data set for a moment, to look at some cases where we don’t want to use the dependencies as given. We’ll use the Titanic data set, also provided with base R. This data is in array form, so we first convert it to a data frame:

knitr::kable(as.data.frame(Titanic))
Class Sex Age Survived Freq
1st Male Child No 0
2nd Male Child No 0
3rd Male Child No 35
Crew Male Child No 0
1st Female Child No 0
2nd Female Child No 0
3rd Female Child No 17
Crew Female Child No 0
1st Male Adult No 118
2nd Male Adult No 154
3rd Male Adult No 387
Crew Male Adult No 670
1st Female Adult No 4
2nd Female Adult No 13
3rd Female Adult No 89
Crew Female Adult No 3
1st Male Child Yes 5
2nd Male Child Yes 11
3rd Male Child Yes 13
Crew Male Child Yes 0
1st Female Child Yes 1
2nd Female Child Yes 13
3rd Female Child Yes 14
Crew Female Child Yes 0
1st Male Adult Yes 57
2nd Male Adult Yes 14
3rd Male Adult Yes 75
Crew Male Adult Yes 192
1st Female Adult Yes 140
2nd Female Adult Yes 80
3rd Female Adult Yes 76
Crew Female Adult Yes 20

This is a simple set of data, with a single count observation, Freq, for each combination of the four determining attributes. In other words, the relation is already normalised, so we only expect one relation in the normalised database.

If we use autodb again, we get the following database:

show(autodb(as.data.frame(Titanic)))

Oops! The DFD algorithm found some functional dependencies where the count could be used to determine another attribute. These are clearly spurious: frequency count can’t causally determine age, for example. However, the algorithm still finds these dependencies, because the counts are unique often enough to make these dependencies hold in the given data.

There are two approaches we can take to eliminate these spurious dependencies: not letting them be detected in the first place, and removing them before using synthesise.

To stop them being detected, we can put constraints on what is discovered by discover: we can ask for certain attributes to not be considered as determinants, or we can exclude attributes that inherit from certain classes. In this example, we could exclude Freq from being considered:

titanic_deps_freqonly <- discover(as.data.frame(Titanic), 1, exclude = "Freq")
titanic_deps_freqonly
#> 1 functional dependency
#> 5 attributes: Class, Sex, Age, Survived, Freq
#> Class, Sex, Age, Survived -> Freq

Alternatively, we could exclude all attributes that inherit from “numeric”:

identical(titanic_deps_freqonly, discover(as.data.frame(Titanic), 1, exclude_class = "numeric"))
#> [1] TRUE

These can both be used as arguments to autodb too:

show(autodb(as.data.frame(Titanic), exclude = "Freq"))

Excluding numeric attributes as determinants is often useful in general, because we expect non-integer numbers to be measurements, not parts of a primary key.

Alternatively, we could remove unwanted dependencies after discovery, before using synthesise. Here are the dependencies found if we don’t exclude anything:

titanic_deps <- discover(as.data.frame(Titanic), 1)
titanic_deps
#> 3 functional dependencies
#> 5 attributes: Class, Sex, Age, Survived, Freq
#>       Sex, Survived, Freq -> Age
#>     Class, Survived, Freq -> Age
#> Class, Sex, Age, Survived -> Freq

We can remove the unwanted dependencies, where Age is the dependant, by keeping only those where Freq is the dependant:

titanic_deps[dependant(titanic_deps) == "Freq"]
#> 1 functional dependency
#> 5 attributes: Class, Sex, Age, Survived, Freq
#> Class, Sex, Age, Survived -> Freq

Adding foreign key references

Getting back to our ChickWeight example, we now have a database schema, consisting of a list of relation schemas. However, we have no information about how these relation schemas are linked to each other. In particular, we have no information about foreign keys. We can add this information using autoref:

linked_schema <- autoref(schema)
linked_schema
#> database schema with 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#>   key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#>   key 1: Time, Chick
#> references:
#> Time_Chick.{Chick} -> Chick.{Chick}

We could also have used normalise, instead of synthesise and autoref separately:

normalise(deps)
#> database schema with 2 relation schemas
#> 4 attributes: weight, Time, Chick, Diet
#> schema Chick: Chick, Diet
#>   key 1: Chick
#> schema Time_Chick: Time, Chick, weight
#>   key 1: Time, Chick
#> references:
#> Time_Chick.{Chick} -> Chick.{Chick}

Plotting this updated database schema shows the same relation schemas as before, linked together by foreign key references:

maybe_plot(linked_schema)

Decomposing the original relation

Finally, once we have our normalised database schema, we can apply it to our original data frame, or a new one with the same structure. This results in a normalised database, as we got from using autodb:

db2 <- decompose(ChickWeight, linked_schema)
show(db2)

We now have the record counts added.

Rejoining a database back into a data frame

We can reverse the process of turning a data frame into a database with the rejoin function. This may not be identical to ChickWeight, since the rows may have been rearranged. However, we can use the convenience function df_equiv to check for equivalence under row reordering:

rejoined <- rejoin(db)
summary(rejoined)
#>      weight           Time           Chick     Diet   
#>  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
#>  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
#>  Median :103.0   Median :10.00   20     : 12   3:120  
#>  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
#>  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
#>  Max.   :373.0   Max.   :21.00   19     : 12          
#>                                  (Other):506
identical(rejoined, ChickWeight)
#> [1] FALSE
df_equiv(rejoined, ChickWeight)
#> [1] TRUE

When rejoined, the relation attributes will be in the original order. However, the record order might have changed.

A larger example

Included in the package is a 447-by-25 data frame called nudge:

if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  DiagrammeR::grViz(gv(nudge, name = "nudge"))
} else {
  summary(nudge)
}

This is the data set for a meta-analysis, looking at the effectiveness of “nudge” interventions. Measurements are taken within a three-layer hierarchy: publications contain studies, which contain effect size measurements. We expect this hierarchy to be apparent in the normalisation.

Initial decomposition

Getting full dependency information for this relation can take a long time, so here we search for a reduced set, not considering numeric or sample size attributes as determinants:

nudge_deps <- discover(
  nudge,
  accuracy = 1,
  exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
  exclude_class = "numeric"
)
nudge_schema <- normalise(nudge_deps, remove_avoidable = TRUE)
show(nudge_schema)

We can see a relation, with many attributes, determined by the effect size ID number, es_id. This contains all of the numeric measurements, as expected in the relation for the lowest level in the hierarchy. As also expected, this has a foreign key reference to a relation for study-level information, determined by the study ID number, study_id.

However, the publication-level relation this refers to is not determined by the publication ID, publication_id, as we would expect; neither is it determined by the publication reference. Instead, it is solely determined by the publication’s title: to use the ID, we would need to supplement it with other information, like the publication year, or other choices that come from the middle column of relations, which look likely to be spurious.

This suggests that some publication ID numbers have been erroneously assigned to several publications, which we can easily test:

nudge_database <- decompose(nudge, nudge_schema)
nudge_title_relation <- records(nudge_database)$title
nudge_pid_duplicates <- unique(nudge_title_relation$publication_id[
  duplicated(nudge_title_relation$publication_id)
])
knitr::kable(subset(nudge_title_relation, publication_id %in% nudge_pid_duplicates))
|     | title | publication_id | reference | year |
|-----|-------|----------------|-----------|------|
| 44  | Enhanced active choice: A new method to motivate behavior change | 95 | Keller et al. (2011) | 2011 |
| 130 | Nudging product choices: The effect of position change on snack bar choice | 95 | Keller et al. (2015) | 2015 |

We’d also expect reference to determine the publication, but this is also not the case: it needs additional information, such as binary_outcome.

This means that there are publications that share a reference:

nudge_reference_duplicates <- unique(nudge_title_relation$reference[
  duplicated(nudge_title_relation$reference)
])
knitr::kable(subset(nudge_title_relation, reference %in% nudge_reference_duplicates))
|     | title | publication_id | reference | year |
|-----|-------|----------------|-----------|------|
| 214 | Nudge vs superbugs: A behavioural economics trial to reduce the overprescribing of antibiotics | 18 | BETA (2018) | 2018 |
| 399 | Energy labels that make cents | 19 | BETA (2018) | 2018 |

BETA is the Behavioural Economics Team of the Australian Government, so it’s not surprising that they’d have multiple publications/reports per year. Duplicate references are not necessarily an error, but they would be awkward if the references were to be used.

There are other, clearly spurious, relations involving reference. One of these combines it with type_experiment to determine location. Since reference is a publication-level attribute, and location is a study-level attribute, this can’t be right.

(As far as I’m aware, the publication ID and reference errors mentioned above would not have affected the meta-analysis results.)

Removing spurious structure

The publication ID and the reference clearly have issues, so we need to decide what to do about them. Here, I’ll show what is usually the correct approach, which is to remove the spurious functional dependencies. We’ll look at the other options later.

The better option is to remove all the functional dependencies that we’d consider spurious: those where the publication ID or reference is part of a multiple-attribute determinant set, and the determinant set isn’t just those two attributes:

nudge_deps_filtered <- nudge_deps[
  lengths(detset(nudge_deps)) == 1 |
    vapply(
      detset(nudge_deps),
      \(ds) length(setdiff(ds, c("publication_id", "reference"))) != 1,
      logical(1)
    )
]
nudge_schema_filtered <- normalise(nudge_deps_filtered, remove_avoidable = TRUE)
show(nudge_schema_filtered)

We now, finally, have the study location in a non-spurious relation. If we want to, we can also add back the dependencies that we failed to find because the sample size attributes were excluded from determinant sets:

nudge_deps_size <- discover(nudge[, startsWith(names(nudge), "n_")], 1)
nudge_deps_size
#> 3 functional dependencies
#> 4 attributes: n_study, n_comparison, n_control, n_intervention
#>    n_control, n_intervention -> n_comparison
#> n_comparison, n_intervention -> n_control
#>      n_comparison, n_control -> n_intervention
nudge_deps_final <- c(nudge_deps_filtered, nudge_deps_size)
nudge_schema_final <- normalise(nudge_deps_final, remove_avoidable = TRUE)
nudge_database_final <- decompose(nudge, nudge_schema_final)
show(nudge_schema_final)

The new schema confirms what we might expect: the comparison’s total sample size and the sizes of the control and intervention arms have only two degrees of freedom, since knowing any two of them determines the third.

We’d expect this to be because the total sample size is the sum of the other two. However, functional dependencies alone can’t establish that. Indeed, if we check for this more specific condition, we find that it doesn’t always hold:

knitr::kable(unique(subset(
  nudge,
  n_comparison != n_control + n_intervention,
  c(
    es_id,
    reference,
    title,
    n_study,
    n_comparison,
    n_control,
    n_intervention
  )
)))
|     | es_id | reference | title | n_study | n_comparison | n_control | n_intervention |
|-----|-------|-----------|-------|---------|--------------|-----------|----------------|
| 237 | 180 | Hedlin & Sunstein (2016) | Does active choosing promote green energy use? Experimental evidence | 1037 | 1037 | 345 | 346 |
| 302 | 181 | Hedlin & Sunstein (2016) | Does active choosing promote green energy use? Experimental evidence | 1037 | 1037 | 345 | 346 |

Looking through the paper, the arm sample sizes aren’t given explicitly, but the numbers here are consistent with the parts of the study that have two treatments.

An automatic search for functional dependencies cannot check domain-specific constraints like the summation constraint above, but it can produce schemas that suggest constraints to check for, such as the two-degrees-of-freedom relationship above.

Going back to the final database schema, removing dependencies has revealed extra information about study locations, via the new title_type_experiment relation: studies of the same experiment type in a publication always have the same location. Looking at the resulting database shows that this removes many entries of what would be redundant location information if kept in the study relation:

show(nudge_database_final)

While this is not a dependency we could expect to hold if more data was collected, it’s a reasonable dependency for the given data set, which won’t be added to.

How not to remove spurious structure

The additional relation mentioned at the end of the previous section is also relevant to what we discuss now: alternative ways to get rid of spurious structure, and why to avoid them.

What happens if we remove the offending schemas?

nudge_schema_relfiltered <- nudge_schema[
  !grepl("publication_id_", names(nudge_schema), fixed = TRUE) &
    !grepl("_publication_id", names(nudge_schema), fixed = TRUE) &
    !grepl("reference_", names(nudge_schema), fixed = TRUE) &
    !grepl("_reference", names(nudge_schema), fixed = TRUE)
]

Subsetting a database schema also removes any foreign key references involving removed schemas, so the resulting schema is still valid. However, any foreign key reference chains with removed schemas in the middle are broken. Amongst others, in this case we lose the reference between the three relations for the three hierarchical data levels:

show(nudge_schema_relfiltered)

Re-running autoref has no effect, because the es_id relation schema doesn’t contain any key of the title relation schema:

identical(autoref(nudge_schema_relfiltered), nudge_schema_relfiltered)
#> [1] TRUE

Because the database subschema including the es_id relation only has publication_id for distinguishing publications, effect sizes for the two publications with the same ID cannot be uniquely associated with their publication: removing schemas irreversibly separates two parts of the database schema.

Furthermore, compared to nudge_schema_filtered, there is a relation schema missing: directly removing schemas loses the structure information that results in the title_type_experiment relation. How does this happen?

The relation isn’t present in the original schema, nudge_schema, because the full set of functional dependencies makes its implied dependency transitive. Specifically, nudge_deps contains the following functional dependencies:

example_fds <- functional_dependency(
  list(
    list("title", "reference"),
    list(c("reference", "type_experiment"), "location"),
    list(c("title", "type_experiment"), "location")
  ),
  c("title", "reference", "type_experiment", "location")
)
example_fds
#> 3 functional dependencies
#> 4 attributes: title, reference, type_experiment, location
#>                      title -> reference
#> reference, type_experiment -> location
#>     title, type_experiment -> location

The first two dependencies make the third transitive, so it’s ignored when constructing the relation schemas: it’s implied, and enforced, by the title and reference_type_experiment relation schemas.

show(normalise(example_fds, ensure_lossless = FALSE))

However, when reference, type_experiment -> location is removed as spurious, title, type_experiment -> location is no longer transitive, so it now appears as its own relation schema.

show(normalise(example_fds[-2], ensure_lossless = FALSE))

When we instead remove the reference_type_experiment schema, we implicitly also throw away the title, type_experiment -> location dependency that it co-implies: we throw away non-spurious information that we didn’t intend to remove.

Between the risk of irreversibly breaking the database schema’s structure, and the risk of unintentionally removing additional structure information, I hope it’s clear why directly removing relations should be avoided.

Other features

Approximate dependencies and database reduction

Larger data sets often have entry errors, with no easy way to remove or deal with them. For this reason, we might be interested in “approximate” functional dependencies, which hold after removing a bounded number of violating records.

Suppose that we normalise nudge again, without any manual dependency removal, but allowing approximate dependencies. We could cheat, since the examples above showed there to be two questionable records: one with a duplicated publication ID, and one with a duplicated reference. Since nudge has 447 records, we can get rid of the resulting spurious dependencies by setting the accuracy to allow up to two discrepant records.

nudge_approx_cheat_database_schema <- discover(
  nudge,
  accuracy = 1 - 2/nrow(nudge),
  exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
  exclude_class = "numeric"
) |>
  normalise()
show(nudge_approx_cheat_database_schema)

Currently decompose doesn’t account for approximate dependencies, so decomposing with this schema would result in an invalid database; here we just work with the database schema.

Compare this to the database schema we arrived at manually:
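To make the comparison concrete, we can re-display the manually filtered schema from earlier (nudge_schema_filtered; the final schema with the sample-size dependencies added back would serve equally well):

show(nudge_schema_filtered)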

Lowering the accuracy results in more dependencies being found, and so in more relations. The number of relations can get very large. For example, suppose we instead set the accuracy to 0.99, returning any approximate dependencies that hold in at least 443 records.

nudge_approx_database_schema <- discover(
  nudge,
  accuracy = 0.99,
  exclude = c("n_study", "n_comparison", "n_control", "n_intervention"),
  exclude_class = "numeric"
) |>
  normalise()
show(nudge_approx_database_schema)

This is a little overwhelming, so we’ll use a utility function called reduce. This returns only the named “main” relation, plus any relations that it references, directly or indirectly, via foreign keys. Reducing the approximate schema, with the effect size relation as the main one, gives us this set of relations:

show(reduce(nudge_approx_database_schema, "es_id"))

Questionable intermediate relations aside, we can see that there is now a publication-level relation with the publication ID as a key, since the ID determines the publication attributes once one of the duplicate records discovered before is disregarded.

reduce can also be used on databases, where it takes the relation(s) with the largest number of records as the main table(s). It needs to be used with caution: while the relations it removes are often spurious, there are databases where relations not linked to the main relation by foreign key references are still required to rejoin the original data frame, so reduction can remove necessary relations. Its intent is mostly to make glancing at database plots more manageable.
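For instance, we could apply it to the final nudge database from earlier; this sketch assumes, per the description above, that the database method chooses the main relation(s) automatically, so no relation name needs to be given:

show(reduce(nudge_database_final))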

Avoidable attributes

The next normal form after third normal form (3NF) is Boyce-Codd normal form (BCNF). Ensuring BCNF is enforced by the database is trickier, since in some cases it can’t be enforced with just relations and foreign key constraints.

However, the package includes an option to convert to enhanced third normal form, also known as LTK form, which can be so enforced. This enhancement is tangential to BCNF, and could also be used to enhance schemas in BCNF.

In brief, the standard normal forms only constrain the attributes present in each relation considered on its own. The enhancement constrains the attributes present in a relation while accounting for their presence in other relations. If an attribute in a relation can be removed, and still be determined from that relation by joining it to others, then the attribute is “avoidable”, and can be removed. If the attribute appears in any of the relation’s keys, those keys are replaced by keys that use the attributes not being removed. This removes attributes from relations without removing any information from the database as a whole.

For example, we can take this simple example from Chapter 6 of The Theory of Relational Databases, by David Maier:

avoid_deps <- functional_dependency(
  list(
    list("A", "B"),
    list("B", "A"),
    list(c("A", "C"), "D"),
    list(c("A", "C"), "E"),
    list(c("B", "D"), "C")
  ),
  attrs_order = c("A", "B", "C", "D", "E")
)
avoid_deps
#> 5 functional dependencies
#> 5 attributes: A, B, C, D, E
#>    A -> B
#>    B -> A
#> A, C -> D
#> A, C -> E
#> B, D -> C
avoid_schema <- normalise(avoid_deps)
show(avoid_schema)

Attributes A and B are equivalent, since relation A has each of them as a key. In other words, relation A is a simple lookup relation. Because of this, we could remove B from relation A_C, and replace the key B, D with A, D, which is equivalent once we account for relation A.

We can have this removal of avoidable attributes done automatically, using the remove_avoidable flag for normalise:

avoid_schema_removed <- normalise(
  avoid_deps,
  remove_avoidable = TRUE
)
show(avoid_schema_removed)

This schema is now in LTK form, with no remaining avoidable attributes. We could have also removed A from relation A_C instead of B, so this process may not have a unique result. The package’s implementation prefers to remove attributes that appear later in the original relation.

Handling some special cases

Structure conditional on value missingness

Sometimes, the structure for some attributes depends on whether other attributes are missing. Here are two possible reasons:

For example, I commonly get data sets where certain values can be given as either single values, or endpoints of an interval, and this “union type” data is spread across several attributes:

df_options <- data.frame(
  id = 1:20,
  value = c(2.3, 2.3, 5.7, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
  lower_bound = c(NA_real_, NA_real_, NA_real_, 2.4, 0, 1, 0, 5.6, 2.4, 5.3, 5.3, 2.4, 2.4, 2.4, 2.4, 2.4, 2.4, 2.4, 5.6, 2.4),
  upper_bound = c(NA_real_, NA_real_, NA_real_, 7.1, 10, 10, 13.1, 25.8, 10, 13.1, 10, 25.8, 25.8, 25.8, 25.8,13.1, 13.1, 25.8, 25.8, 25.8),
  interval_distribution = c(NA, NA, NA, "uniform", "uniform", "uniform", "uniform", "uniform", "Beta", "Beta", "Beta", "Beta", "Kumaraswamy", "Kumaraswamy", "Kumaraswamy", "Kumaraswamy", "PERT", "PERT", "PERT", "PERT"),
  param1 = c(NA, NA, NA, NA, NA, NA, NA, NA, 1, 1, 1, 2, 2, 2.1, 2, 2, 2, 1, 2, 2),
  param2 = c(NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 2, 2, 2, 1, 1, 1, NA, NA, NA, NA)
)
df_options$interval_distribution <- factor(df_options$interval_distribution)
knitr::kable(df_options)
id value lower_bound upper_bound interval_distribution param1 param2
1 2.3 NA NA NA NA NA
2 2.3 NA NA NA NA NA
3 5.7 NA NA NA NA NA
4 NA 2.4 7.1 uniform NA NA
5 NA 0.0 10.0 uniform NA NA
6 NA 1.0 10.0 uniform NA NA
7 NA 0.0 13.1 uniform NA NA
8 NA 5.6 25.8 uniform NA NA
9 NA 2.4 10.0 Beta 1.0 1
10 NA 5.3 13.1 Beta 1.0 2
11 NA 5.3 10.0 Beta 1.0 2
12 NA 2.4 25.8 Beta 2.0 2
13 NA 2.4 25.8 Kumaraswamy 2.0 2
14 NA 2.4 25.8 Kumaraswamy 2.1 1
15 NA 2.4 25.8 Kumaraswamy 2.0 1
16 NA 2.4 13.1 Kumaraswamy 2.0 1
17 NA 2.4 13.1 PERT 2.0 NA
18 NA 2.4 25.8 PERT 1.0 NA
19 NA 5.6 25.8 PERT 2.0 NA
20 NA 2.4 25.8 PERT 2.0 NA

autodb has no way to know about these relationships, because it doesn’t treat missing values as a special case (see the “Missing values” section in “Planned extensions” below). The resulting schema ignores them:

db_options <- autodb(df_options)
show(db_options)

However, from looking at the data ourselves, we can see a clearly-implied structure:

As a general hack for finding these sorts of relationships, I like taking each attribute with missing values, and adding a companion attribute that states whether the value is present or missing:

df_options_with_presence <- data.frame(
  id = df_options$id,
  value = df_options$value,
  value_present = !is.na(df_options$value),
  lower_bound = df_options$lower_bound,
  lower_bound_present = !is.na(df_options$lower_bound),
  upper_bound = df_options$upper_bound,
  upper_bound_present = !is.na(df_options$upper_bound),
  interval_distribution = df_options$interval_distribution,
  interval_distribution_present = !is.na(df_options$interval_distribution),
  param1 = df_options$param1,
  param1_present = !is.na(df_options$param1),
  param2 = df_options$param2,
  param2_present = !is.na(df_options$param2)
)
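The same companion columns can also be generated programmatically rather than written out by hand. A minimal sketch (the name df_options_with_presence2 is just for illustration; the resulting column order differs from the manual version above, which doesn’t matter for dependency discovery):

# build a "<name>_present" companion column for every attribute except id
presence <- lapply(df_options[-1], \(col) !is.na(col))
names(presence) <- paste0(names(df_options)[-1], "_present")
df_options_with_presence2 <- cbind(df_options, as.data.frame(presence))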

This is not always practical, because adding columns rapidly increases the running time for the dependency search. However, when it’s practical, it helps to make the structure more apparent in the database schema:

db_options_with_presence <- autodb(df_options_with_presence)
show(db_options_with_presence)

This shows that the values, bounds, and interval distributions inform the presence of each other: they drive the main union type.

knitr::kable(records(db_options_with_presence)$value_present)
|   | value_present | lower_bound_present | upper_bound_present | interval_distribution_present |
|---|---------------|---------------------|---------------------|-------------------------------|
| 1 | TRUE | FALSE | FALSE | FALSE |
| 4 | FALSE | TRUE | TRUE | TRUE |

We can also see how the interval distribution determines how many parameters are required:

knitr::kable(records(db_options_with_presence)$interval_distribution)
|    | interval_distribution | value_present | param1_present | param2_present |
|----|-----------------------|---------------|----------------|----------------|
| 1  | NA | TRUE | FALSE | FALSE |
| 4  | uniform | FALSE | FALSE | FALSE |
| 9  | Beta | FALSE | TRUE | TRUE |
| 13 | Kumaraswamy | FALSE | TRUE | TRUE |
| 17 | PERT | FALSE | TRUE | FALSE |

There’s clearly additional structure here, since there’s an existence constraint: if param2 is present, then param1 must also be present. If we notice a situation like this, we can attempt further investigation by splitting the data, and passing each part into autodb separately:

db_options_with_presence_p1absent <- autodb(subset(
  df_options_with_presence,
  !param1_present
))
show(db_options_with_presence_p1absent)
show(autodb(subset(
  df_options_with_presence,
  param1_present
)))

In this case, the constraint is reflected in the first subset, where param1 is absent: param2 is always absent too, as indicated by both param2_present and param2 being constant:

knitr::kable(records(db_options_with_presence_p1absent)$constants)
param1 param1_present param2 param2_present
NA FALSE NA FALSE

Since this data splitting is currently a manual process, there are practical limits to how far we can take this.

Planned extensions

Handling of duplicate records

It’s expected that an R user might pass a data frame with duplicate records into the package. At the moment, these are kept when searching for dependencies; this can affect results for approximate dependencies, since it changes the record counts. However, duplicates are removed when the data frame is decomposed into a database. I’m not yet certain whether duplicates are best handled by removing them, or by simply returning an error when they are present.
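As a small illustration of the behaviour described above (the object name chick_dup is just for this sketch):

# duplicating every record leaves the exact dependencies unchanged...
chick_dup <- rbind(ChickWeight, ChickWeight)
discover(chick_dup, accuracy = 1)
# ...and, per the behaviour described above, the duplicates are dropped again
# when decomposing, so the relation record counts match those of db
autodb(chick_dup)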

Missing values

Strictly speaking, autodb does not search for functional dependencies, because these can’t account for missing values. Instead, it searches for a weaker variant, which has been called a literal functional dependency (LFD) in the literature.

The functional dependency X -> Y holds if, for any two possible records whose values in X are equal, their values in Y are also equal. This requires attribute values to take the equality operator, == in R. Additionally, they must take it under binary logic: whether two values are equal can only be true or false, not missing. This disallows the presence of missing values (NA and NaN), since testing for equality with a missing value returns a missing result.

The literal functional dependency X -> Y holds, or the functional dependency X -> Y holds literally, if, for any two possible records whose values in X are identical, their values in Y are also identical, where values are identical if both are non-missing and equal, or both are missing. In other words, missing values are treated as not equal to non-missing values, but as equal to each other. This requires attribute classes to take the identity operator, identical() with default optional parameter values in R. Again, they must take it under binary logic, but the identity operator cannot return non-binary results anyway: In R, for example, identical() can only return TRUE or FALSE, never NA. There is, therefore, no constraint on missing values in attribute classes, or on attribute classes in general.
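As a quick base-R illustration of the distinction between the two definitions:

# standard FDs rely on equality, which is trinary in the presence of missing values
NA_real_ == NA_real_           # NA, not TRUE
NA_real_ == 1                  # NA, not FALSE
# LFDs rely on identity, which is always binary in R
identical(NA_real_, NA_real_)  # TRUE: missing values are identical to each other
identical(NA_real_, 1)         # FALSE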

LFDs are more general than standard functional dependencies: since practically every class supports the identity operator, they make essentially no assumptions about the attribute classes present in a data set.

There are other FD variants that handle missing values, but, unlike most of them, LFDs still satisfy Armstrong’s axioms, allowing them to be used in normalisation in the same way as normal FDs. For example, they still respect transitivity: if X -> Y and Y -> Z literally, then X -> Z literally.

However, treating missing values like any other value in this way discards important structural information. In the relational model, and ideally in relational databases, we want to have no missing values at all, partly to avoid awkward questions about what trinary logic means with respect to filtering or joining tables. For example, take the following data frame:

df_nas <- data.frame(
  patient = c(1L, 2L, 3L, 4L),
  trial_entry_date = as.Date(c("2022/05/02", "2022/06/06", "2022/04/01", "2022/03/19")),
  death_date = as.Date(c(NA, NA, "2022/10/07", NA))
)
knitr::kable(df_nas)
patient trial_entry_date death_date
1 2022-05-02 NA
2 2022-06-06 NA
3 2022-04-01 2022-10-07
4 2022-03-19 NA

autodb currently treats NA as just another value, which results in the initial data frame not being split in the database schema:

show(autodb(df_nas))

In this case, a missing death date represents no death, and we would prefer to move death information to a separate relation, containing only patients with a death date, in this case patient 3:

ideal_db <- decompose(
  df_nas,
  database_schema(
    relation_schema(
      list(
        patient = list(c("patient", "trial_entry_date"), list("patient")),
        patient_death = list(c("patient", "death_date"), list("patient"))
      ),
      names(df_nas)
    ),
    list(list("patient_death", "patient", "patient", "patient"))
  )
)
records(ideal_db)$patient_death <- subset(records(ideal_db)$patient_death, !is.na(death_date))
show(ideal_db)

This makes more of the data’s structure apparent, and enforceable, in the database itself, since it makes clear that the death date is optional.

Removing missing values in this way requires additional steps in dependency detection and normalisation, beyond what autodb currently does: it requires keeping extra information about when attributes are missing, relative to the others.

Limitations

Table merges don’t fix issues with merge.data.frame

Rejoining databases, and checking that relations satisfy foreign key constraints, is done using merge.data.frame. This means that data classes that don’t work properly in merge aren’t guaranteed to work properly in autodb.

Any such issues come from how certain data classes are handled during merges in the base language, so they are issues with R, rather than with autodb, and I have no plans to fix them. If autodb seems to fail oddly, check that the data classes in use behave correctly in merges.

For example, in older versions of R, the built-in POSIXct date/time class didn’t have values merged properly, because the merge didn’t account for differences in time zone / daylight saving time. This would result in, for example, the nycflights13::weather data set violating the foreign key constraints of its own discovered schema, since one foreign key used a POSIXct attribute.

A more complex example, which still applies and probably always will, is a merge where the two attributes being merged on have different classes. In general, this is allowed: since autodb is written for R, a dynamically-typed language, it follows SQLite in not constraining the user much when it comes to data classes in schemas. For primitive classes, R’s class coercion usually makes things work as you’d expect.

However, merging a factor attribute with a non-factor, non-character attribute should be handled with caution if the latter contains values not in the factor’s level set.

For example, we can define these data frames:

df_badmerge_int <- cbind(
  expand.grid(
    a = c(NA, 0L, 1L),
    b = c(NA, FALSE, TRUE)
  ),
  row = 1:9
)
df_badmerge_factor <- df_badmerge_int
df_badmerge_factor$a <- as.factor(df_badmerge_factor$a)
knitr::kable(df_badmerge_int)
a b row
NA NA 1
0 NA 2
1 NA 3
NA FALSE 4
0 FALSE 5
1 FALSE 6
NA TRUE 7
0 TRUE 8
1 TRUE 9
df_badmerge_logical <- df_badmerge_int
df_badmerge_logical$a <- as.logical(df_badmerge_logical$a)
names(df_badmerge_logical)[[3]] <- "row2"
knitr::kable(df_badmerge_logical)
a b row2
NA NA 1
FALSE NA 2
TRUE NA 3
NA FALSE 4
FALSE FALSE 5
TRUE FALSE 6
NA TRUE 7
FALSE TRUE 8
TRUE TRUE 9

We can then merge the data frame with the logical a against each of the other two, keeping the row attributes to track which records were merged.

Whichever other data frame we merge with, the two sets of a values have different classes, so R coerces one to the other. When merging on just a, this gives the result we’d expect, for both other data frames and regardless of merge order. For the integer version, the logical values are coerced to integers:

knitr::kable(merge(
  df_badmerge_int[, c("a", "row")],
  df_badmerge_logical[, c("a", "row2")]
))
a row row2
0 2 2
0 2 5
0 2 8
0 5 2
0 5 5
0 5 8
0 8 2
0 8 5
0 8 8
1 6 6
1 6 3
1 6 9
1 3 6
1 3 3
1 3 9
1 9 6
1 9 3
1 9 9
NA 1 1
NA 1 7
NA 1 4
NA 7 1
NA 7 7
NA 7 4
NA 4 1
NA 4 7
NA 4 4
knitr::kable(merge(
  df_badmerge_logical[, c("a", "row2")],
  df_badmerge_int[, c("a", "row")]
))
a row2 row
FALSE 2 2
FALSE 2 5
FALSE 2 8
FALSE 5 2
FALSE 5 5
FALSE 5 8
FALSE 8 2
FALSE 8 5
FALSE 8 8
TRUE 6 6
TRUE 6 3
TRUE 6 9
TRUE 3 6
TRUE 3 3
TRUE 3 9
TRUE 9 6
TRUE 9 3
TRUE 9 9
NA 1 1
NA 1 7
NA 1 4
NA 7 1
NA 7 7
NA 7 4
NA 4 1
NA 4 7
NA 4 4

For the factor version, the logical values are coerced to factors, but they don’t match any of the given levels, so they all become NA:

knitr::kable(merge(
  df_badmerge_factor[, c("a", "row")],
  df_badmerge_logical[, c("a", "row2")]
))
a row row2
NA 7 7
NA 7 4
NA 7 1
NA 4 7
NA 4 4
NA 4 1
NA 1 7
NA 1 4
NA 1 1
knitr::kable(merge(
  df_badmerge_logical[, c("a", "row2")],
  df_badmerge_factor[, c("a", "row")]
))
a row2 row
NA 7 7
NA 7 4
NA 7 1
NA 4 7
NA 4 4
NA 4 1
NA 1 7
NA 1 4
NA 1 1

However, we see unexpected behaviour with the factor version, when also merging on another attribute, b: the merge result now depends on the input order. With the factor version first, the result is similar to before:

knitr::kable(merge(
  df_badmerge_factor,
  df_badmerge_logical
))
#> Warning in `[<-.factor`(`*tmp*`, ri, value = c(NA, FALSE, TRUE, NA, FALSE, :
#> invalid factor level, NA generated
a b row row2
NA FALSE 4 4
NA FALSE 4 5
NA FALSE 4 6
NA NA 1 1
NA NA 1 2
NA NA 1 3
NA TRUE 7 7
NA TRUE 7 8
NA TRUE 7 9

With the logical version first, however, only the logical a values that are NA before coercion are kept, rather than all of them:

knitr::kable(merge(
  df_badmerge_logical,
  df_badmerge_factor
))
a b row2 row
NA FALSE 4 4
NA NA 1 1
NA TRUE 7 7

In autodb, this could be a problem if two relations in a foreign key constraint contain the same attribute, but one of them has it store factors in records, due to separate insertions. Letting an attribute’s class vary across relations should, therefore, be done with caution.

Synthesis doesn’t minimise relation key count

Bernstein’s synthesis is guaranteed to minimise the number of relations created for a given set of functional dependencies, and removing avoidable attributes can reduce the number of attributes in those relations. However, there can still be redundant keys. For example, we can take the following set of functional dependencies:

fds_redkey <- functional_dependency(
  list(
    list("a", "b"),
    list("d", "c"),
    list(c("b", "d"), "a"),
    list("a", "c"),
    list(c("b", "c"), "d")
  ),
  letters[1:4]
)
fds_redkey
#> 5 functional dependencies
#> 4 attributes: a, b, c, d
#>    a -> b
#>    d -> c
#> b, d -> a
#>    a -> c
#> b, c -> d

Normalising gives the following relations:

schema_redkey <- normalise(fds_redkey, remove_avoidable = TRUE)
show(schema_redkey)

These relations have some redundancy: relation a implies {b, d} -> c, but relation d implies that {d} -> c. This isn’t resolved by removing avoidable attributes, because d still needs to be in relation a: we just need to remove {b, d} as a key. However, this is resolved if we instead use this set of functional dependencies, which is equivalent to the previous set:

fds_redkey_fix <- functional_dependency(
  list(
    list("a", "b"),
    list("d", "c"),
    list(c("b", "c"), "a"),
    list("a", "d")
  ),
  letters[1:4]
)
fds_redkey_fix
#> 4 functional dependencies
#> 4 attributes: a, b, c, d
#>    a -> b
#>    d -> c
#> b, c -> a
#>    a -> d
schema_redkey_fix <- normalise(fds_redkey_fix, remove_avoidable = TRUE)
show(schema_redkey_fix)

Unfortunately, there’s no way in the package to find better sets like this.

Normalisation isn’t all there is to database design

As mentioned in the section on avoidable attributes, normal forms only refer to one relation at a time: referring to a database as following a normal form just means its relations follow it. Even if we also add the foreign key references that normal forms don’t require, this leaves room for database schemas that satisfy normal forms, but that we wouldn’t want to use.

For a simple example, take this database schema, whose relation schemas are in third normal form:

dup_db <- autodb(ChickWeight)
show(dup_db)

If we now create copies of these relations, with the intention that copies always contain the same data, then all relations are still in third normal form, and so we’d say the database is too:

show(dup_db[c(1, 1, 2, 2, 2)])

However, no one would claim that this is a good database design, since there is clearly a large amount of data redundancy. Higher normal forms would not change this.