Climate projection data sets are produced in a variety of formats but all conform to the CF Metadata Conventions. NetCDF data files, in particular, are highly structured and relatively easy to process. That said, it is very important to maintain a proper processing workflow so that the small changes in the climate projections are preserved and revealed by the analysis. This document describes the basic workflow for data with varying calendars.
Individual files containing climate projections contain global, regional or local data, typically on a rectangular latitude-longitude grid, for a single parameter such as “near-surface temperature”, and for a number of time steps. An analysis workflow then consists of a number of steps:
apply(data, 1:2, tapply, f, fun)
(following the CF Metadata Conventions, dimensions 1 and 2 are
“longitude” and “latitude”, respectively; the third dimension is
“time”). Repeat for the data suite for each ensemble member.
Apart from the first step of obtaining the data, the steps lend
themselves well to automation. The catch, however, is in the factor f
to use with tapply(). The different models (in your ensemble) use
different calendars, meaning that different factors are required. The
CFtime package can help out.
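The shape of the result of the apply()/tapply() call above is easy to get wrong, so a minimal base R sketch may help. The toy array and the hand-made month factor are assumptions for illustration only; with real data the factor has to come from the calendar of the file.

```r
# Toy data: 3 longitudes x 2 latitudes x 24 monthly time steps (2 years).
set.seed(42)
data <- array(rnorm(3 * 2 * 24), dim = c(3, 2, 24))

# Hand-made month factor, valid only for this toy series.
f <- factor(rep(sprintf("%02d", 1:12), times = 2))

# For every (longitude, latitude) cell, average the time series per factor level.
res <- apply(data, 1:2, tapply, f, mean)

# The factor levels become the first dimension, followed by longitude and latitude.
dim(res)  # 12 3 2
```

Note that the time dimension is replaced by the factor levels and moves to the front of the result, which is why later indexing is of the form `res[month, lon, lat]`.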
The CFfactor() function produces a factor that respects the calendar
of the data files. The function comes in two operating modes:
# Setting up
library(ncdf4)
library(CFtime)
fn <- list.files(path = system.file("extdata", package = "CFtime"), full.names = TRUE)[1]
nc <- nc_open(fn)
cf <- CFtime(nc$dim$time$units,
             nc$dim$time$calendar,
             nc$dim$time$vals)
# Create monthly factors for a baseline epoch and early, mid and late 21st century epochs
baseline <- CFfactor(cf, epoch = 1991:2020)
future <- CFfactor(cf, epoch = list(early = 2021:2040, mid = 2041:2060, late = 2061:2080))
str(baseline)
#> Factor w/ 12 levels "01","02","03",..: 1 1 1 1 1 1 1 1 1 1 ...
#> - attr(*, "epoch")= int 30
#> - attr(*, "period")= chr "month"
str(future)
#> List of 3
#> $ early: Factor w/ 12 levels "01","02","03",..: NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "epoch")= int 20
#> ..- attr(*, "period")= chr "month"
#> $ mid : Factor w/ 12 levels "01","02","03",..: NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "epoch")= int 20
#> ..- attr(*, "period")= chr "month"
#> $ late : Factor w/ 12 levels "01","02","03",..: NA NA NA NA NA NA NA NA NA NA ...
#> ..- attr(*, "epoch")= int 20
#> ..- attr(*, "period")= chr "month"
Building on the examples above of opening a file, creating a CFtime
instance and a suitable factor for one data suite, here daily rainfall,
the actual processing of the data into precipitation anomalies for 3
periods relative to a baseline period could look like this:
# Read the data from the NetCDF file.
# Keep degenerate dimensions so that we have a predictable data structure: 3-dimensional array.
# Converts units of kg m-2 s-1 to mm/day.
pr <- ncvar_get(nc, "pr", collapse_degen = FALSE) * 86400
# Assign dimnames(), optional.
dimnames(pr) <- list(nc$dim$lon$vals, nc$dim$lat$vals, as_timestamp(cf))
# Get a global attribute from the file
experiment <- ncatt_get(nc, "")$experiment_id
nc_close(nc)
# Calculate the daily average precipitation per month for the baseline period
# and the three future epochs.
pr_base <- apply(pr, 1:2, tapply, baseline, mean) # an array
pr_future <- lapply(future, function(f) apply(pr, 1:2, tapply, f, mean)) # a list of arrays
# Calculate the precipitation anomalies for the future epochs against the baseline.
# Working with daily averages per month so we can simply subtract and then multiply by days
# per month for each of the factor levels using the CF calendar.
ano <- mapply(function(pr, f) {(pr - pr_base) * CFfactor_units(cf, f)}, pr_future, future, SIMPLIFY = FALSE)
# Plot the results
plot(1:12, ano$early[,1,1], type = "o", col = "blue", ylim = c(-50, 40), xlim = c(1, 12),
     main = paste0("Hamilton, New Zealand\nExperiment: ", experiment),
     xlab = "month", ylab = "Precipitation anomaly (mm)")
lines(1:12, ano$mid[,1,1], type = "o", col = "green")
lines(1:12, ano$late[,1,1], type = "o", col = "red")
Looks like Hadley will be needing rubber boots in spring and autumn back home!
The interesting feature, working from opening the NetCDF file down to
plotting, is that the specifics of the CF calendar that the data suite
uses do not have to be considered anywhere in the processing workflow:
the CFtime package provides the functionality. Data suites using
another CF calendar are processed exactly the same.
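The anomaly arithmetic used above (difference in daily averages, multiplied by the number of days per month) can be sketched in base R. All numbers here are hypothetical, and the day counts are written out by hand where the workflow above obtains them from CFfactor_units().

```r
# Hypothetical daily average precipitation (mm/day) for January and June.
base_mm_day   <- c("01" = 3.0, "06" = 2.5)
future_mm_day <- c("01" = 3.5, "06" = 2.0)

# Days per month under two CF calendars, written out by hand for this sketch.
days <- list(
  standard  = c("01" = 31, "06" = 30),
  `360_day` = c("01" = 30, "06" = 30)
)

# Monthly anomaly (mm) = difference in daily average * days in the month.
ano_toy <- lapply(days, function(d) (future_mm_day - base_mm_day) * d)

# January: 0.5 mm/day * 31 days = 15.5 mm in the standard calendar,
# but 0.5 mm/day * 30 days = 15 mm in the 360_day calendar.
ano_toy
```

The same daily averages thus give different monthly totals depending on the calendar, which is why the multiplication has to use calendar-aware day counts.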
Different climate projection data sets can use different calendars. It is absolutely essential to respect the calendar of the different data sets because the underlying solar and atmospheric physics are based on those calendars as well.
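To see why one factor cannot be shared between models, it may help to count the daily time steps in the same 30-year period under three common CF calendars. This is a base R sketch: the "standard" count uses R's Date class, which matches the CF standard calendar for these years, while the other two counts follow directly from the calendar definitions.

```r
years <- 1991:2020

# standard calendar: real calendar days, including leap days.
n_standard <- length(seq(as.Date("1991-01-01"), as.Date("2020-12-31"), by = "day"))

# noleap (365_day) calendar: every year has 365 days.
n_noleap <- length(years) * 365L

# 360_day calendar: every year has 12 months of 30 days.
n_360 <- length(years) * 360L

c(standard = n_standard, noleap = n_noleap, `360_day` = n_360)
# 10958, 10950 and 10800 time steps, respectively
```

A factor built for one of these series has the wrong length for the other two, so each model needs a factor derived from its own calendar.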
In a typical situation, a researcher would construct a multi-model ensemble to remove or reduce the bias in any given model. The data sets composing the ensemble might well use different calendars. The correct way of constructing an ensemble is to perform the desired analysis on every ensemble member individually, to combine the results only in the final step, and then to perform any ensemble operations such as computing confidence intervals. The design of the CFtime package makes this easy through its heavy use of lists. Building on the previous example, let’s make a multi-model ensemble of 2 models (not much of an ensemble, but such are the limitations of including data with packages; the example easily extends to a larger set of ensemble members).
# Get the list of files that make up the ensemble members, here:
# GFDL ESM4 and MRI ESM2 models for experiment SSP2-4.5, precipitation, CMIP6 2015-01-01 to 2099-12-31
lf <- list.files(path = system.file("extdata", package = "CFtime"), full.names = TRUE)
# Loop over the files individually
# ano is here a list with each element holding the results for a single model
ano <- lapply(lf, function(fn) {
  nc <- nc_open(fn)
  cf <- CFtime(nc$dim$time$units, nc$dim$time$calendar, nc$dim$time$vals)
  pr <- ncvar_get(nc, "pr", collapse_degen = FALSE) * 86400
  nc_close(nc)
  baseline <- CFfactor(cf, epoch = 1991:2020)
  pr_base <- apply(pr, 1:2, tapply, baseline, mean)
  future <- CFfactor(cf, epoch = list(early = 2021:2040, mid = 2041:2060, late = 2061:2080))
  pr_future <- lapply(future, function(f) apply(pr, 1:2, tapply, f, mean))
  mapply(function(pr, f) {(pr - pr_base) * CFfactor_units(cf, f)}, pr_future, future, SIMPLIFY = FALSE)
})
# Epoch names
epochs <- c("early", "mid", "late")
dim(epochs) <- 3
# Build the ensemble for each epoch
# For each epoch, grab the data for each of the ensemble members, simplify to an array
# and take the mean per row (months, in this case)
ensemble <- apply(epochs, 1, function(e) {
  rowMeans(sapply(ano, function(a) a[[e]], simplify = TRUE))})
colnames(ensemble) <- epochs
rownames(ensemble) <- rownames(ano[[1]][[1]])
ensemble
#> early mid late
#> 01 -22.901333 -15.2675652 -1.6380748
#> 02 -25.430060 -21.4016013 -15.3427744
#> 03 -21.895792 -14.7434749 -22.4187823
#> 04 6.763958 6.9498244 15.2008249
#> 05 -2.635598 -15.3054439 -17.0106058
#> 06 -43.152012 -47.3442148 -32.1797467
#> 07 1.072942 10.4940924 3.9473240
#> 08 4.124084 -6.0917940 -12.9178847
#> 09 9.231852 -0.2038321 2.7198366
#> 10 5.148302 10.3044202 12.0060866
#> 11 16.186108 25.9156463 8.2168498
#> 12 -3.211510 -0.2036319 0.7604947
Here we simply compute the average of the monthly precipitation anomaly over the ensemble members. In a more typical scenario, you would use the values from the individual models and apply a more suitable analysis, such as calculating the confidence interval or model agreement.
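As a rough illustration of what such an analysis could look like, the sketch below computes the ensemble mean, the spread between models and a simple sign-agreement score in base R. The three models and their anomaly values are hypothetical, and sign agreement is only one of several possible agreement measures.

```r
# Hypothetical monthly anomalies (mm) for three models and three months.
members <- list(
  model_a = c("01" = -20, "02" = -30, "03" = 5),
  model_b = c("01" = -25, "02" = -10, "03" = -4),
  model_c = c("01" = -18, "02" = 12, "03" = 6)
)

# Months in rows, models in columns.
m <- do.call(cbind, members)

ens_mean <- rowMeans(m)      # ensemble mean per month
ens_sd   <- apply(m, 1, sd)  # spread between models per month

# Fraction of models whose anomaly has the same sign as the ensemble mean.
agreement <- rowMeans(sign(m) == sign(ens_mean))

data.frame(mean = ens_mean, sd = ens_sd, agreement = agreement)
```

In this toy case all models agree on a drier January, while only two out of three agree on the sign of the change in the other months.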
One significant advantage of this processing workflow is that it is
easily parallelized: the bulk of the work goes into computing the
anomalies, ano, and this is embarrassingly parallel because each
ensemble member reads its own data and produces independent output.
Use package future or something similar to make the code run on all
available CPU cores.
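A minimal sketch of the idea, using the parallel package that ships with R rather than future, and simulated arrays instead of NetCDF files. The function process_member() is a hypothetical stand-in for the per-model work (open the file, build the factors, compute the anomalies), and the cluster size of 2 is an arbitrary choice.

```r
library(parallel)

# Hypothetical stand-in for the per-model processing. It uses only its
# argument, so it can run on a worker without access to other objects.
process_member <- function(x) {
  f <- factor(rep(sprintf("%02d", 1:12), length.out = dim(x)[3]))
  apply(x, 1:2, tapply, f, mean)
}

# Four simulated "ensemble members": 2 x 2 grid cells, 24 monthly time steps.
set.seed(1)
members <- replicate(4, array(rnorm(2 * 2 * 24), dim = c(2, 2, 24)), simplify = FALSE)

# Run the members on 2 worker processes, then release the workers.
cl <- makeCluster(2)
res_par <- parLapply(cl, members, process_member)
stopCluster(cl)

# The sequential equivalent, for comparison.
res_seq <- lapply(members, process_member)
all.equal(res_par, res_seq)
```

In the real workflow the worker function would take a file name, so each worker opens and closes its own NetCDF file and only the comparatively small anomaly arrays travel back to the main process.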
Due to the large size of typical climate projection data files, it is common to have a data suite that is contained in multiple files. A case in point is the CORDEX data set which breaks up the experiment period of 2006 - 2100 into 19 files of 5 years each, with each file covering a single parameter (temperature, precipitation, etc) over an entire domain (such as Europe, South Asia, Central America and the Caribbean, etc). The CFtime package can streamline processing of such multi-file data suites as well.
Assuming that you have your CORDEX files in a directory on disk,
organized by domain and other properties such as the variable, GCM/RCM
combination, experiment, etc, the process of preparing the files for
processing could be encoded in a function as below. The argument fn is
a vector of file names to process, and var is the variable contained in
the files. (There are no checks on argument sanity here, which should
really be included. This function only makes sense for a single
[domain, GCM/RCM, experiment, variable] combination. Also be aware of
data size: CORDEX files are huge and stitching all domain data together
will easily exhaust available memory, leading to very large swap files
and very poor performance. Use the CFsubset() function to read temporal
chunks of data to avoid such problems.)
library(ncdf4)
library(abind)
prepare_CORDEX <- function(fn, var) {
  offsets <- vector("list", length(fn))
  data <- vector("list", length(fn))
  for (i in seq_along(fn)) {
    nc <- nc_open(fn[i])
    if (i == 1) {
      # Create an "empty" CFtime object, without elements
      cf <- CFtime(nc$dim$time$units, nc$dim$time$calendar)
    }
    # Make lists of all datum offsets and data arrays
    offsets[[i]] <- as.vector(nc$dim$time$vals)
    data[[i]] <- ncvar_get(nc, var,
                           start = c(10, 10, 1), count = c(100, 100, -1), # spatial subsetting
                           collapse_degen = FALSE)
    nc_close(nc)
  }
  # Create a list for output with the CFtime instance assigned the offsets and
  # the data bound in a single 3-dimensional array
  list(CFtime = cf + unlist(offsets), data = abind(data, along = 3))
}
Calling this function like
prepare_CORDEX(list.files(path = "~/CC/CORDEX/CAM", pattern = "^pr.*\\.nc$", full.names = TRUE), "pr")
will process all NetCDF files with precipitation data in that directory
and return a list holding the resulting CFtime instance, which describes
the full temporal extent covered by the data files, and the data bound
on the temporal dimension, ready for further processing.
When working like this it is imperative that the offsets and the data
arrays are added to their final structures in exactly the same order.
It is not necessary that the offsets (and the data) themselves are in
order, but the correspondence between offsets and data needs to be
maintained. (list.files() produces a list in alphabetical order by
default, which for most climate projection files produces offsets in
chronological order.)
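The correspondence requirement can be illustrated with a small base R sketch. The two "file chunks" below are hypothetical and are deliberately listed out of chronological order. For brevity the arrays are joined with array() instead of abind(), which gives the same result when binding along the last dimension.

```r
# Two hypothetical chunks, the later period listed first.
offsets <- list(c(31, 32, 33), c(0, 1, 2))
data <- list(array(4:6, dim = c(1, 1, 3)),  # values belonging to offsets 31:33
             array(1:3, dim = c(1, 1, 3)))  # values belonging to offsets 0:2

# Combine both in the same order, so element i of the offsets still
# describes slice i of the data.
all_offsets <- unlist(offsets)
all_data <- array(unlist(data), dim = c(1, 1, length(all_offsets)))

# If chronological order is wanted, apply one and the same permutation to both.
idx <- order(all_offsets)
sorted_offsets <- all_offsets[idx]
sorted_data <- all_data[, , idx, drop = FALSE]
```

Sorting only one of the two structures would silently attach the wrong time stamp to every data slice, without raising an error.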
The results presented contain modified data from Copernicus Climate Change Service information, 2023-2024. Neither the European Commission nor ECMWF is responsible for any use that may be made of the Copernicus information or data it contains.
We acknowledge the World Climate Research Programme, which, through its Working Group on Coupled Modelling, coordinated and promoted CMIP6. We thank the climate modeling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the data and providing access, and the multiple funding agencies who support CMIP6 and ESGF.
The two datasets used as examples in this vignette carry the following license statements: