Some minor changes to tests, as part of re-submission to CRAN.
Some performance improvements, as part of re-submission to CRAN.
* Added `df_records`, for converting a data frame into a list of row values.
  These are sometimes more useful than data frames, e.g. for checking which
  rows of a data frame are present in another one.
* Improved performance of `database` for larger data sets, specifically the
  validity checks for the data satisfying the foreign key references.
* Improved performance of `create` for `relation_schema` and
  `database_schema`, by removing validity checks. If the input is valid,
  these are redundant.
* Improved performance of `autodb`, by skipping removal of extraneous
  attributes. The removal would be applied to the results of `discover`,
  which contain no extraneous attributes.

Continuing efforts to prepare for submission to CRAN.
* Added an argument to `autodb`, `discover`, and `df_equiv`, to round to a
  number of significant digits. Due to the nature of floating-point, and the
  definition of a functional dependency, floating-point values can’t be
  compared using equality (`==`), or by `all.equal`, for the purposes of
  functional dependency discovery / validation, and have the result be
  consistent between different machines. Because of this, floating-point
  variables are now rounded to a limited number of significant digits by
  default before processing. If the data frame is being loaded from a file,
  we recommend reading any numerical/complex variables as character values
  (strings), if it’s appropriate, to avoid loss of precision.
* `df_equiv` now checks rows for exact matches, outside of the rounding
  mentioned above. Previously, it compared rows using `match`, which gave no
  control over float precision.
* `relation_schema`, `relation`, `database_schema`, and `database` now only
  return a name-based subset successfully if all of the given names exist in
  the object.

Some minor changes to documentation and tests, to allow for package updates and submission to CRAN.
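As a rough illustration of the floating-point rounding change described in the list above, the base R sketch below shows why neither `==` nor `all.equal` gives a usable notion of equality for doubles, and how rounding to significant digits with `signif` restores one. It is not the package’s implementation, and the choice of 7 significant digits is an arbitrary assumption for the example.

```r
# Exact equality fails for values that "should" be equal.
x <- 0.1 + 0.2
y <- 0.3
exact_equal <- x == y

# all.equal() compares within a tolerance, so it is not transitive:
# a ~ b and b ~ c does not imply a ~ c.
a <- 1
b <- 1 + 1e-8
c_val <- 1 + 2e-8
ab <- isTRUE(all.equal(a, b))
bc <- isTRUE(all.equal(b, c_val))
ac <- isTRUE(all.equal(a, c_val))

# Rounding first (7 significant digits is an arbitrary choice here)
# lets exact comparison be used afterwards.
rounded_equal <- signif(x, 7) == signif(y, 7)

print(c(exact_equal = exact_equal, ab = ab, bc = bc, ac = ac,
        rounded_equal = rounded_equal))
```

Here `ab` and `bc` hold while `ac` does not, so tolerance-based comparison cannot stand in for the equality that a functional dependency is defined with; comparing rounded values exactly avoids that.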
* Added `format` and `as.data.frame` methods for `functional_dependency`,
  `relation_schema`, `database_schema`, `relation`, and `database`. This
  allows them to be columns in a data frame at initial construction. I’m not
  sure why you’d want to put them in a data frame column, but it’s
  consistent with the idea that the objects from these classes should mostly
  be treatable as vectors. Be warned: they don’t currently work in tibbles.
* Added an `as.character` method for `functional_dependency`. The optional
  `align_arrows` argument can add padding to one side, in order to make the
  arrows align when they’re printed to different lines. These options are
  used to align arrows in its `print` method, and in its `format` method for
  when it is printed as a data frame column.
* Added `==` and `!=` implementations for `functional_dependency`. These
  ignore differences in `attrs_order`: differently-ordered determinant sets
  are considered equal.
* Added a `rename_attrs` method for `functional_dependency`.
* Added a `dependants` argument to `discover`, which limits the functional
  dependency search to those with a dependant in the given set of column
  names, defaulting to all of them. This should significantly speed up
  searches where only some dependants are of interest.
* Added a `detset_limit` argument for `discover`/`autodb`, which limits the
  FD search to only look for dependencies with the determinant set size
  under a given limit. For DFD, this usually doesn’t significantly reduce
  the search time, but it won’t make it worse. It will be useful once other
  search algorithms are implemented.
* Added an `all` argument to `insert`, `FALSE` by default. If `TRUE`, then
  `insert` returns an error if the data to insert doesn’t include all
  attributes for the elements being inserted into, rather than skipping
  those elements. This helps to prevent accidental no-ops.
* `progress = TRUE` now keeps the output display up to date when using a
  console-based version of R.
* Fixed `gv` to account for Graphviz requiring certain characters, namely
  the set `"<>&`, to be escaped in HTML-like labels, and removed completely
  in attribute values.
* Fixed `df_equiv` to properly handle data frames with zero columns or
  duplicate rows.
* Fixed `database_schema` and `database`, and reference re-assignments, to
  allow references to be given with the referee’s key not in attribute
  order.

The general theme for this version is classes for intermediate results: functional dependencies, schemas, and databases now have fleshed-out classes, with methods to keep them self-consistent. They all have their own constructors, for users to create their own, instead of having to generate them from a given data frame.
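As a sketch of the label escaping mentioned for `gv` in the list of changes above: the characters `"`, `<`, `>`, and `&` are special in Graphviz HTML-like labels, so they are replaced with character entities. The helper name `escape_html_label` is hypothetical, chosen for illustration, and this is not the package’s actual code.

```r
# Hypothetical helper: escape characters that are special in
# Graphviz HTML-like labels. "&" must be handled first, so that the
# ampersands introduced by the other replacements are not re-escaped.
escape_html_label <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub("\"", "&quot;", x, fixed = TRUE)
}

escaped <- escape_html_label("a<b> & \"c\"")
print(escaped)
```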
* Renamed `dfd` to `discover`, to reflect the generalisation to allow the
  use of other methods. At the moment, this just includes DFD.
* Removed `flatten` from exported functions, in favour of flattening the
  functional dependencies in `dfd`/`discover` instead. Since `flatten` was
  usually called anyway, and its output is more readable since adding a
  `print` method for it, there was little reason to keep the old
  `dfd`/`discover` output format, where functional dependencies were grouped
  by dependant.
* Renamed `cross_reference`
  to `autoref`, to better reflect its purpose as generating foreign key
  references.
* Renamed `normalise` to `synthesise`, to reflect its only creating relation
  schemas, not foreign key references. The new function named `normalise` is
  now a wrapper that calls both `synthesise` and `autoref`, since in most
  cases we don’t need to do these steps separately. Additionally,
  `ensure_lossless` is now an argument for `synthesise` rather than
  `autoref`: this is a more natural place to put it, since `synthesise`
  creates relations, and `autoref` adds foreign key references.
* Functional dependencies now have their own `[[` method, so code that used
  `[[` to extract determinant sets or dependants from functional
  dependencies will no longer work. These should be extracted with the new
  `detset` and `dependant` functions instead.
* The `database`
  class has its own subsetting methods, so components must be extracted with
  `records`, `keys`, and so on.
* The `database` class no longer assigns a `parents` attribute to each
  relation, since this duplicates the foreign key reference information
  given in `references`.
* The `database` class no longer has a `name` attribute. This was only used
  to name the graph when using the `gv` function, so is now an argument for
  the `database` method of `gv` instead, bringing its arguments into line
  with those of the other methods.
* `relationships` in `database_schema` and `database` objects are now called
  `references`, to better reflect their being foreign key constraints, and
  they are stored in a format that better reflects this: instead of an
  element for each pair of attributes in a foreign key, there is one element
  for the whole foreign key, containing all of the involved attributes.
  Similarly, they are now printed in the format “child.{c1, c2, …} ->
  parent.{p1, p2, …}” instead of “child.c1 -> parent.p1; child.c2 ->
  parent.p2; …”.
* `cross_reference`/`autoref` now defaults to generating more than one
  foreign key reference per parent-child relation pair, rather than keeping
  only the one with the first child key by priority order. This can result
  in some confusion on plots, since references are still plotted one
  attribute pair at a time.
* Added a `functional_dependency`
  class for flattened functional dependency sets. The attributes vector is
  now stored as an attribute, so that the dependencies can be accessed as a
  simple list without list subsetting operators. There are also `detset`,
  `dependant`, and `attrs_order` generic functions for extracting the
  relevant parts. `detset` and `dependant`, in particular, should be useful
  for the purposes of filtering predicates.
* Added a `relation_schema` class for relational schema sets, as returned by
  `synthesise`. The attributes and keys are now stored together in a named
  list, with the attribute order vector `attrs_order` stored as an
  attribute. As with `functional_dependency`, this lets the schemas be
  accessed like a vector. There is also `merge_empty_keys` for combining
  schemas with an empty key, and `attrs`, `keys`, and `attrs_order` generic
  functions for extracting the relevant parts.
* Added a `database_schema` class for database schemas, as returned by
  `normalise`. This inherits from `relation_schema`, and has foreign key
  references as an additional `references` attribute. There is a
  `merge_empty_keys` method that conserves validity of the foreign key
  references. Additionally, when the names of the contained relation schemas
  are changed using `names<-`, the references are changed to use the new
  names.
* Added a `relation` class for vectors of relations containing data. Since a
  `database_schema` is just a `relation_schema` vector with foreign key
  references added, the `relation` class was added as the equivalent
  underlying vector for the `database` class. A user of the package probably
  won’t need to use it.
* `database`
  is now a wrapper class around `relation`, that adds foreign key
  references, and handles them separately in its methods.
* Added `[`, `[[`, and, except for `functional_dependency`, `$` subsetting
  operators for these classes, along with their replacement equivalents,
  `[<-` etc., to allow treating them as vectors of relation schemas or
  relations. Subsetting also removes any foreign key references in
  `database_schema` and `database` objects that are no longer relevant.
  These methods prevent the subsetting operators from being used to access
  the object’s internal components, so many of the generic functions
  mentioned above were written to allow access in a more principled manner,
  not requiring knowledge of how the structure is implemented.
* Added a `c` method for vector-like concatenation. There are two
  non-trivial aspects to this. Firstly, when concatenating objects with
  different `attrs_order` attributes, `c` merges the orders to keep them
  consistent, if possible. Secondly, for `database_schema` and `database`,
  foreign key references are changed to reflect any changes made to relation
  names to keep them unique.
* Added a `unique` method for vector-like removal of duplicate schemas /
  relations. This conserves validity of foreign key references for
  `database_schema` and `database` objects. For `relation` and `database`
  objects, relations count as duplicates even if their records are in a
  different order.
* Added a `names<-` method for consistently changing relation (schema)
  names. In particular, for databases and database schemas, this ensures the
  names are also changed in references.
* The classes above, except for `functional_dependency`, have a
  `rename_attrs` method for renaming the attributes across the whole object.
  This renames them in all schemas, relations, references, and so on.
* Added a `create`
  generic function, for creating `relation` and `database` objects from
  `relation_schema` and `database_schema` objects, respectively. The created
  objects contain no data. This function is roughly equivalent to
  `CREATE TABLE` in SQL, but the vectorised nature of the relation classes
  means that several tables are created at once.
* Added an `insert` generic function for `relation` and `database` objects,
  which takes a data frame of new data, and inserts it into any relation in
  the object whose attributes are all present in the new data. This is
  roughly equivalent to SQL’s `INSERT`, but works over multiple relations at
  once, and means there’s now a way to put data into a `database` outside of
  `decompose`. Indeed, `decompose` is now equivalent to calling `create`,
  then calling `insert` with all the relations.
* Changed `normalise` to prefer to remove dependencies with dependants and
  determinant sets later in table order, and with larger dependant sets.
  This brings it more in line with similar decisions made in other package
  functions.
* Updated `dfd`/`discover` to improve computation time.
* Added a `skip_bijections` option to `dfd`/`discover`, to speed up
  functional dependency searches where there are pairwise-equivalent
  attributes present.
* `autodb` documentation link to page with database format information.
* Fixed `df_equiv` to work with data.frame columns that are lists.
* Fixed `dfd`/`discover` treating similar numeric values as equal, resulting
  in data frames not being insertable into their own schema.
* Fixed `database`
  checks not handling doubles correctly. Specifically, foreign key reference
  checks involve merging tables together, and `merge` operates on doubles
  with a tolerance that’s set within an internal method, so merges can
  create duplicates that need to be removed afterwards.
* Fixed `rejoin` in the case where merges are based on doubles, sometimes
  resulting in duplicates.
* Fixed `normalise`’s return output to be invariant to the given order of
  the `functional_dependency` input.
* Fixed `normalise` returning relations with attributes in the wrong order
  in certain cases where `remove_avoidable = TRUE`.
* Fixed `gv` giving Graphviz code that could result in incorrect diagrams:
  relation and attribute names were converted to lower case, and not checked
  for uniqueness afterwards. This could result in incorrect foreign key
  references being drawn. The fix also accounts for a current bug in
  Graphviz, where edges between HTML-style node ports ignore case for the
  port labels.
* Added a `NEWS.md` file to track changes to the package.
* `autodb`, `dfd`, `gv`, and `rejoin`.
* Changed `decompose` to return an error if the data.frame doesn’t satisfy
  the functional dependencies implied by the schema. This will return an
  error when using `decompose` with a schema derived from the same
  data.frame if any approximate dependencies were included. Previously,
  using `decompose` or `dfd` with approximate dependencies would result in
  constructing a database with duplicate key values, since there’s currently
  no handling of approximate dependencies during database construction, and
  records ignored in approximate dependencies were being kept. This is
  incorrect behaviour; `decompose` support for approximate dependencies will
  be added back once the package can properly handle them.
* Made `reduce`
  generic, and added a method for database schemas. Currently this method
  requires explicitly naming the main relations, rather than inferring them.
* Updated `nudge` data documentation, improved commentary on publication
  references in vignette.
* `autodb`, due to approximate dependencies now returning an error in
  `decompose`.
* Changed `print.database` to refer to records instead of rows.
* Fixed a bug in `normalise` that resulted in relations having duplicate
  keys.
* Fixed a bug in `normalise` that resulted in schemas that didn’t reproduce
  the given functional dependencies.
* Fixed `dfd`’s data simplification step for POSIXct datetimes, in the case
  where two times only differ by standard/daylight-savings time (e.g.
  1:00:00 EST vs. 1:00:00 EDT on the same day).
* Fixed `dfd` with `cache = TRUE`, where data frame column names that are
  also argument names for `paste` could result in an error.
* Fixed `gv` methods, which included Graphviz syntax errors when given
  relations with zero-length names. `gv.data.frame` now requires `name` to
  be non-empty; `gv.database_schema` and `gv.database` replace zero-length
  names.
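
The `paste`-related problem in the list above can be reproduced in base R. This is a minimal sketch of the general issue, assuming row values are combined by passing the columns to `paste` via `do.call`; whether `dfd` did exactly this internally is an assumption, and the example data frame is made up.

```r
# A data frame whose column name clashes with paste()'s `sep` argument.
df <- data.frame(sep = c("a", "b"), x = c("1", "2"))

# Passing named columns means `sep` is taken as the separator argument,
# which must be a single string, so this call fails.
clash <- tryCatch(
  do.call(paste, df),
  error = function(e) "error"
)

# Dropping the names makes every column a positional argument.
safe <- do.call(paste, unname(as.list(df)))

print(clash)
print(safe)
```

Stripping the names before the call is one simple way to make the result independent of what the columns happen to be called.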