This is a developer vignette to explain how caching works and what we learned on the way.
The main caching features were implemented in the following two pull requests:
#538: Implemented simple caching and utilities for managing
caches. Input text is styled as a whole and added to the cache
afterwards. This makes most sense given that the very same expression
will probably never be passed to styler, unless it is already compliant
with the style guide. Apart from the (negligible) inode, caching text
has a memory cost of 0. Speed boosts only result if the whole text
passed to styler is compliant to the style guide in use. Changing one
line in a file with hundreds of lines means each line will be styled
again. This is a major drawback and makes the cache only useful for a
use with a pre-commit framework (the initial motivation) or when
functions like style_pkg()
are run often and most files
were not changed.
#578: Adds a second layer of caching by caching top-level
expressions individually. This will bring speed boosts to the situation
where very little is changed but there are many top-level expressions.
Hence, changing one line in a big file will invalidate the cache for the
expression the line is part of, i.e. when changing
x <- 2
to x = 2
below, styler will have to
restyle the function definition, but not another(call)
and
all other expressions that were not changed.
While #538 also required a lot of thought, this is not necessarily visible in the diff. The main challenge was to figure out how the caching should work conceptually and where we best insert the functionality as well as how to make caching work for edge cases like trailing blank lines etc. For details on the conceptual side and requirements, see #538.
In comparison, the diff in #578 is much larger. We can walk through the main changes introduced here:
Each nest gained a column is_cached to indicate if an
expression is cached. It’s only ever set for the top-level nest, but set
to NA
for all other nests. Also, comments are not cached
because they are essentially top level terminals which are very cheap to
style (also because hardly any rule concerns them) and because each
comment is a top-level expression, simply styling them is cheaper than
checking for each of them if it is in the cache.
Each nest also gained a column block to denote the block
to which it belongs for styling. Running each top-level expression
through parse_transform_serialize_r()
separately is
relatively expensive. We prefer to put multiple top-level expressions
into a block and process the block. This is done with
parse_transform_serialize_r_block()
. Note that before we
implemented this PR, all top-level expressions were sent through
parse_transform_serialize_r()
as one block. Leaving out
some exceptions in this explanation, we always put uncached top-level
expressions in a block and cached top-level expressions into a block and
then style the uncached ones.
Apart from the actual styling, a very costly part of formatting
code with styler is to compute the nested parse data with
compute_parse_data_nested()
. When caching top-level
expressions, it is evident that building up the nested structure for
cached code is unnecessary because we don’t actually style it, but
simply return text
. For this reason, we introduce the
concept of a shallow nest. It can only occur at the top level. For the
top-level expressions we know that they are cached, we remove all
children before building up the nested parse table and let them act as
terminals
and will later simply return their
text
. Hence, in the nested parse table, no cached
expressions have children.
Because we now style blocks of expressions and we want to
preserve the line breaks between them, we need to keep track of all
blank lines between expressions, which was not necessary previously
because all expressions were in a block and the blank lines separating
them were stored in newlines
and lag_newlines
except for all blank lines before the first expression.
Because we wanted to cache by expression, but process by block of
expression, we needed to decompose the block into individual expressions
and add them to the cache once we obtained the final text. We could
probably also have added expressions to the cache before we put the text
together, but the problem is that at some point we turn the nested
structure into a flat structure and as this must happen with a
post_visit()
approach, we’d have to implement a complicated
routine to check if we are now about to put together all top-level
expressions and then if yes write them to the cache. A simple (but maybe
not so elegant) parsing of the output as implemented in
cache_by_expression()
seemed reasonable in terms of
limiting complexity and keeping efficiency.
For more detailed explanation and documentation, please consult the help files of the internals.