In this vignette, we will discuss the use of both area and
cluster-level models using the SUMMER package. We will use the simulated
surveys in the DemoData
dataset in the package. The
DemoData
object is a list that contains full birth history
data from simulated surveys with stratified cluster sampling design,
similar to most of the DHS surveys. It has been pre-processed into the
person-month format, where for each list entry, each row represents one
person-month record. Each record contains columns for the cluster ID
(clustid
), household ID (id
), strata
membership (strata
) and survey weights
(weights
). The region and time period associated with each
person-month record has also been computed and saved in the dataset. Raw
DHS data can be processed into this format using the
getBirths
function.
We first load the package and data. We will use the dplyr package in some data processing steps, and ggplot2 and patchwork for visualization.
DemoData
contains model survey data provided by DHS.
Note that this data is simulated and does not represent any real
country’s data. DemoData
is obtained by processing the raw
DHS birth data (in .dta format) in R. The raw file of birth recodes can
be downloaded from the DHS website https://dhsprogram.com/data/Download-Model-Datasets.cfm.
For this demo dataset, no registration is needed. For real DHS survey
datasets, permission to access needs to be registered with DHS directly.
DemoData
contains a small sample of the observations in
this dataset randomly assigned to \(5\)
example DHS surveys.
For more details, the following code snippet demonstrates how to
split the raw demo data into person-month format from similar to that in
the DemoData
object. Notice that to read the file from
early version of stata, the package readstata13
is
required. The following script is based on the example dataset
ZZBR62FL.DTA
available from the DHS website. We use the
interaction of v024 and v025 as the strata indicator for the purpose of
demonstration. We can see from range(data$v008)
that the
CMC code for the date of interview corresponds to the year
1900 + floor(1386/12)= 2015
. In practice, however, the
survey year is usually known. The survey year variable allows the
mis-recorded data. Dates after the surveyyear
will be
removed. Thus for survey taking place over multiple years, the later
year is suggested to be used as surveyyear
. If set to NA
then no checking will be performed.
library(readstata13)
my_fp <- "data/ZZBR62DT/ZZBR62FL.DTA"
dat <- getBirths(filepath = my_fp, surveyyear = 2015, strata = c("v024", "v025"))
dat <- dat[, c("v001", "v002", "v024", "per5", "ageGrpD", "v005", "strata", "died")]
colnames(dat) <- c("clustid", "id", "region", "time", "age", "weights", "strata",
"died")
Back to the pre-processed dataset, DemoData
is a list of
\(5\) data frames where each row
represent one person-month record and contains the \(8\) variables as shown below. Notice that
time
variable is turned into 5-year bins from
80-84
to 10-14
.
## Length Class Mode
## 1999 8 data.frame list
## 2003 8 data.frame list
## 2007 8 data.frame list
## 2011 8 data.frame list
## 2015 8 data.frame list
## clustid id region time age weights strata died
## 1 1 1 eastern 00-04 0 1.1 eastern.rural 0
## 2 1 1 eastern 00-04 1-11 1.1 eastern.rural 0
## 3 1 1 eastern 00-04 1-11 1.1 eastern.rural 0
## 4 1 1 eastern 00-04 1-11 1.1 eastern.rural 0
## 5 1 1 eastern 00-04 1-11 1.1 eastern.rural 0
## 6 1 1 eastern 00-04 1-11 1.1 eastern.rural 0
For demonstration purpose, we associate the regions in this simulated
dataset with an example map stored in the DemoMap
object in
the SUMMER package. DemoMap
contains geographic data from
the 1995 Uganda Admin 1 regions defined by DHS. It contains a
SpatialPolygonsDataFrame object geo
and the corresponding
spatial adjacency matrix mat
.
The spatial adjacency matrix can also be created directly from the SpatialPolygonsDataFrame by
Finally, more details on the pipeline of downloading and processing raw DHS data and shapefiles can be found in the vignette A Case Study of Estimating Subnational U5MR using DHS data.
First, we obtain Horvitz-Thompson estimators using
getDirectList
. We specify the survey design as the
two-stage stratified cluster sampling where strata are specified in the
strata
column, and clusters are specified by both the
cluster ID (clusterid
) and household ID
(id
).
years <- levels(DemoData[[1]]$time)
data_multi <- getDirectList(births = DemoData, years = years, regionVar = "region",
timeVar = "time", clusterVar = "~clustid+id", ageVar = "age", weightsVar = "weights",
geo.recode = NULL)
Before fitting the model, we also aggregate estimates based on different surveys into a single set of estimates, using the inverse design-based variances as the weights.
## [1] 150 10
## [1] 30 10
With the combined direct estimates, we are ready to fit the smoothing
models described in Li et al. (2019).
First, we ignore the subnational estimates, and fit a model with
temporal random effects only. In this part, we use the subset of data
region variable being “All”. In fitting this model, we first define the
list of time periods we wish to project the estimates on. First we can
fit a Random Walk 2 only model defined on the 5-year period. The
argument m = 1
specifies that the random walk is in the
same temporal resolution as the input data. See the next example for the
case where the random walk model is specified at a higher
resolution.
years.all <- c(years, "15-19")
fit1 <- smoothDirect(data = data, Amat = NULL, year.label = years.all, year.range = c(1985,
2019), time.model = "rw2", m = 1)
## ----------------------------------
## Smoothed Direct Model
## Main temporal model: rw2
## Number of time periods: 35
## Temporal resolution: period model (m = 1)
## ----------------------------------
## This is INLA_24.12.11 built 2024-12-11 20:07:43 UTC.
## - See www.r-inla.org/contact-us for how to get help.
## - List available models/likelihoods/etc with inla.list.models()
## - Use inla.doc(<NAME>) to access documentation
## ----------------------------------
## Smoothed Direct Model
## Main temporal model: rw2
## Number of time periods: 35
## Temporal resolution: period model (m = 1)
## ----------------------------------
## Fixed Effects
## mean sd 0.025quant 0.5quant 0.97quant mode kld
## (Intercept) -1.4 0.081 -1.6 -1.4 -1.3 -1.4 0
## ----------------------------------
## Random Effects
## Name Model
## 1 time.struct RW2 model
## 2 time.unstruct IID model
## ----------------------------------
## Model hyperparameters
## mean sd 0.025quant 0.5quant 0.97quant mode
## Precision for time.struct 1212 9512 5.4 140 6883 9.8
## Precision for time.unstruct 1191 6732 11.2 207 6959 23.3
## [,1]
## log marginal-likelihood (integration) -2.1
## log marginal-likelihood (Gaussian) -1.9
When data is sparse, direct estimates at yearly level may be unstable. This is why we used 5-year periods as the model’s temporal resolution in this example. When performing temporal smoothing, however, we can define the temporal smoother on the yearly scale instead. Notice that the direct estimates are still calculated in 5-year intervals, but the smoothed estimator now produce estimates at both yearly and period resolutions.
fit2 <- smoothDirect(data = data, Amat = NULL, year.label = years.all, year.range = c(1985,
2019), time.model = "rw2", m = 5, type.st = 4)
## ----------------------------------
## Smoothed Direct Model
## Main temporal model: rw2
## Number of time periods: 35
## Temporal resolution: yearly model (m = 5)
## ----------------------------------
## ----------------------------------
## Smoothed Direct Model
## Main temporal model: rw2
## Number of time periods: 35
## Temporal resolution: yearly model (m = 5)
## ----------------------------------
## Fixed Effects
## mean sd 0.025quant 0.5quant 0.97quant mode kld
## (Intercept) -1.4 0.074 -1.6 -1.4 -1.3 -1.4 0
## ----------------------------------
## Random Effects
## Name Model
## 1 time.struct RGeneric2
## 2 time.unstruct RGeneric2
## ----------------------------------
## Model hyperparameters
## mean sd 0.025quant 0.5quant 0.97quant mode
## Theta1 for time.struct 5.0 1.9 1.6 5.0 8.8 4.6
## Theta1 for time.unstruct 4.3 1.8 1.1 4.2 7.8 3.9
## [,1]
## log marginal-likelihood (integration) -75
## log marginal-likelihood (Gaussian) -75
The marginal posteriors are already stored in the fitted object. We use the following function to extract and re-arrange them.
We can compare the results visually using the function below.
Now we fit the full model on all subnational regions. First, we use the Random Walk 2 model defined on the 5-year period.
fit3 <- smoothDirect(data = data, Amat = mat, year.label = years.all, year.range = c(1985,
2019), time.model = "rw2", m = 1, type.st = 4)
## ----------------------------------
## Smoothed Direct Model
## Main temporal model: rw2
## Number of time periods: 35
## Temporal resolution: period model (m = 1)
## Spatial effect: bym2
## Number of regions: 4
## Interaction temporal model: rw2
## Interaction type: 4
## ----------------------------------
Similarly we can also estimate the Random Walk 2 random effects on the yearly scale.
fit4 <- smoothDirect(data = data, Amat = mat, year.label = years.all, year.range = c(1985,
2019), time.model = "rw2", m = 5, type.st = 4)
## ----------------------------------
## Smoothed Direct Model
## Main temporal model: rw2
## Number of time periods: 35
## Temporal resolution: yearly model (m = 5)
## Spatial effect: bym2
## Number of regions: 4
## Interaction temporal model: rw2
## Interaction type: 4
## ----------------------------------
The figures below shows the comparison of the subnational model with different temporal scales.
g1 <- plot(out3, is.yearly = FALSE) + ggtitle("Subnational period model")
g2 <- plot(out4, is.yearly = TRUE) + ggtitle("Subnational yearly model")
g1 + g2
We can also add back the direct estimates for comparison when plotting the smoothed estimates.
plot(out4, is.yearly = TRUE, data.add = data_multi, option.add = list(point = "mean",
by = "surveyYears")) + facet_wrap(~region, scales = "free")
Finally, we show the estimates over time on maps.
We now describe the fitting of the cluster-level model for U5MR described in Martin et al. (2020) and Fuglstad, Li, and Wakefield (2021). For this simulated dataset, the strata variable is coded as region crossed by urban/rural status. For our analysis with urban/rural stratified model, we first construct a new strata variable that contains only the urban/rural status, i.e., the additional stratification within each region.
for (i in 1:length(DemoData)) {
strata <- DemoData[[i]]$strata
DemoData[[i]]$strata[grep("urban", strata)] <- "urban"
DemoData[[i]]$strata[grep("rural", strata)] <- "rural"
}
To fit the cluster-level model, we calculate the number of
person-months and number of deaths for each cluster, time period, and
age group. We first create the data frame using the
getCounts
function for each survey and combine them into a
single data frame.
The model fitting function smoothCluster
expects columns
with specific names. We rename the cluster ID and time period columns to
be ‘cluster’ and ‘years’. The response variable is ‘Y’ and the binomial
total is ‘total’.
counts.all <- NULL
for (i in 1:length(DemoData)) {
vars <- c("clustid", "region", "strata", "time", "age")
counts <- getCounts(DemoData[[i]][, c(vars, "died")], variables = "died", by = vars,
drop = TRUE)
counts <- counts %>%
mutate(cluster = clustid, years = time, Y = died)
counts$survey <- names(DemoData)[i]
counts.all <- rbind(counts.all, counts)
}
With the created data frame, we fit the cluster-level model using the
smoothCluster
function. Notice that here we need to specify
the age groups (age.group
), the length of each age group
(age.n
) in months, and how the age groups are mapped to the
temporal random effects (age.time.group
). In the default
case, age.time.group = c(1, 2, 3, 3, 3, 3)
means the first
two age groups each has its own temporal trend, the the following four
age groups share the same temporal trend. We start with the default
temporal model of random walk or order 2 on the 5-year periods in this
dataset (with real data, we can use a finer temporal resolution). For
the space-time interaction, we use an AR(1) prior (specified by
st.time.model
) interacting with a spatial ICAR prior, with
random linear trends in each area (specified by
pc.st.slope.u
and pc.st.slope.alpha
). We add
survey iid effects to the model as well using
survey.effect = TRUE
argument. The temporal main effects
are defined for each stratum separately (specified by
strata.time.effect = TRUE
, so in total six random walks are
used to model the main temporal effect.
periods <- c("85-89", "90-94", "95-99", "00-04", "05-09", "10-14")
fit.bb <- smoothCluster(data = counts.all, Amat = DemoMap$Amat,
family = "betabinomial",
year.label = c(periods, "15-19"),
age.group = c("0", "1-11", "12-23", "24-35", "36-47", "48-59"),
age.n = c(1, 11, 12, 12, 12, 12),
age.time.group = c(1, 2, 3, 3, 3, 3),
time.model = "rw2",
st.time.model = "ar1",
pc.st.slope.u = 1, pc.st.slope.alpha = 0.01,
survey.effect = TRUE,
strata.time.effect = TRUE)
## ----------------------------------
## Cluster-level model
## Main temporal model: rw2
## Number of time periods: 7
## Spatial effect: bym2
## Number of regions: 4
## Interaction temporal model: ar1
## Interaction type: 4
## Interaction random slopes: yes
## Number of age groups: 6
## Stratification: yes
## Number of age-specific fixed effect intercept per stratum: 6
## Number of age-specific trends per stratum: 3
## Strata-specific temporal trends: yes
## Survey effect: yes
## ----------------------------------
## ----------------------------------
## Cluster-level model
## Main temporal model: rw2
## Number of time periods: 7
## Spatial effect: bym2
## Number of regions: 4
## Interaction temporal model: ar1
## Interaction type: 4
## Interaction random slopes: yes
## Number of age groups: 6
## Stratification: yes
## Number of age group fixed effect intercept per stratum: 6
## Number of age-specific trends per stratum: 3
## Strata-specific temporal trends: yes
## Survey effect: yes
## ----------------------------------
## Fixed Effects
## mean sd 0.025quant 0.5quant 0.97quant mode kld
## age.intercept0:rural -3.0 0.14 -3.3 -3.0 -2.7 -3.0 0
## age.intercept1-11:rural -4.7 0.11 -4.9 -4.7 -4.5 -4.7 0
## age.intercept12-23:rural -5.8 0.15 -6.1 -5.8 -5.5 -5.8 0
## age.intercept24-35:rural -6.6 0.20 -6.9 -6.6 -6.2 -6.6 0
## age.intercept36-47:rural -6.9 0.23 -7.3 -6.9 -6.4 -6.9 0
## age.intercept48-59:rural -7.3 0.29 -7.9 -7.3 -6.8 -7.3 0
## age.intercept0:urban -2.7 0.13 -3.0 -2.7 -2.5 -2.7 0
## age.intercept1-11:urban -5.0 0.13 -5.2 -5.0 -4.7 -5.0 0
## age.intercept12-23:urban -5.7 0.17 -6.1 -5.7 -5.4 -5.7 0
## age.intercept24-35:urban -7.1 0.29 -7.6 -7.1 -6.5 -7.1 0
## age.intercept36-47:urban -7.6 0.37 -8.3 -7.6 -6.9 -7.6 0
## age.intercept48-59:urban -8.0 0.46 -8.9 -8.0 -7.1 -8.0 0
##
## Slope fixed effect index:
## time.slope.group1: 0:rural
## time.slope.group2: 1-11:rural
## time.slope.group3: 12-23:rural, 24-35:rural, 36-47:rural, 48-59:rural
## time.slope.group4: 0:urban
## time.slope.group5: 1-11:urban
## time.slope.group6: 12-23:urban, 24-35:urban, 36-47:urban, 48-59:urban
## ----------------------------------
## Random Effects
## Name Model
## 1 time.struct RW2 model
## 2 time.unstruct IID model
## 3 region.struct BYM2 model
## 4 region.int Besags ICAR model
## 5 st.slope.id IID model
## 6 survey.id IID model
## ----------------------------------
## Model hyperparameters
## mean sd 0.025quant
## overdispersion for the betabinomial observations 0.002 0.001 0.001
## Precision for time.struct 90.937 110.544 11.888
## Precision for time.unstruct 1083.231 4741.238 16.135
## Precision for region.struct 265.294 721.638 5.409
## Phi for region.struct 0.343 0.237 0.028
## Precision for region.int 588.715 2349.763 6.307
## Group PACF1 for region.int 0.899 0.186 0.321
## Precision for st.slope.id 89.504 296.081 1.182
## 0.5quant 0.97quant mode
## overdispersion for the betabinomial observations 0.002 0.005 0.002
## Precision for time.struct 58.262 347.017 28.141
## Precision for time.unstruct 249.503 6291.515 35.575
## Precision for region.struct 91.291 1441.425 11.633
## Phi for region.struct 0.295 0.842 0.081
## Precision for region.int 142.946 3426.994 11.757
## Group PACF1 for region.int 0.970 0.999 1.000
## Precision for st.slope.id 25.615 508.394 2.245
## [,1]
## log marginal-likelihood (integration) -3454
## log marginal-likelihood (Gaussian) -3449
In this example, we do not have any bias adjustment in this simple
model. If the ratio adjustments to U5MR are known, they could be entered
with the bias.adj
and bias.adj.by
arguments
when fitting the model.
Posterior samples from the model are taken and summarized using the
getSmoothed
function. For models with a large number of
areas and time points, this step may take some time to compute. The
save.draws
argument makes it possible to save the raw
posterior draws, so that we can use them again in other functions or
recompute different posterior credible intervals.
## ---------------------------------------------
## Stratified estimates stored in ...$stratified
## Aggregated estimates stored in ...$overall
## ---------------------------------------------
## Estimates computed for 7 time period(s) and 4 area(s)
## No strata weights has been supplied. Overall estimates are not calculated.
## Posterior draws are saved in the output. You can use 'getSmoothed(..., draws = ...$draws)' next time to speed up the call.
## 1000 posterior draws taken.
For example, to recompute the posterior CI directly using the existing draws:
The est.bb
object above computes the U5MR estimates by
urban/rural strata. In order to obtain the overall region-specific U5MR,
we need additional information on the population fractions of each
stratum within regions. As an demonstration, we simulate the population
totals over the years with
pop.base <- expand.grid(region = c("central", "eastern", "northern", "western"),
strata = c("urban", "rural"))
pop.base$population <- round(runif(dim(pop.base)[1], 1000, 20000))
periods.all <- c(periods, "15-19")
pop <- NULL
for (i in 1:length(periods.all)) {
tmp <- pop.base
tmp$population <- pop.base$population + round(rnorm(dim(pop.base)[1], mean = 0,
sd = 200))
tmp$years <- periods.all[i]
pop <- rbind(pop, tmp)
}
head(pop)
## region strata population years
## 1 central urban 16641 85-89
## 2 eastern urban 2662 85-89
## 3 northern urban 4307 85-89
## 4 western urban 13222 85-89
## 5 central rural 16821 85-89
## 6 eastern rural 17979 85-89
In order to compute the aggregated estimates, we need the proportion of urban/rural populations within each region in each time period.
weight.strata <- expand.grid(region = c("central", "eastern", "northern", "western"),
years = periods.all)
weight.strata$urban <- weight.strata$rural <- NA
for (i in 1:dim(weight.strata)[1]) {
which.u <- which(pop$region == weight.strata$region[i] & pop$years == weight.strata$years[i] &
pop$strata == "urban")
which.r <- which(pop$region == weight.strata$region[i] & pop$years == weight.strata$years[i] &
pop$strata == "rural")
weight.strata[i, "urban"] <- pop$population[which.u]/(pop$population[which.u] +
pop$population[which.r])
weight.strata[i, "rural"] <- 1 - weight.strata[i, "urban"]
}
head(weight.strata)
## region years rural urban
## 1 central 85-89 0.50 0.50
## 2 eastern 85-89 0.87 0.13
## 3 northern 85-89 0.37 0.63
## 4 western 85-89 0.36 0.64
## 5 central 90-94 0.51 0.49
## 6 eastern 90-94 0.87 0.13
Now we can recompute the smoothed estimates with the population fractions.
est.bb <- getSmoothed(fit.bb, nsim = 1000, CI = 0.95, save.draws = TRUE, weight.strata = weight.strata)
head(est.bb$overall)
## region years time area variance median mean upper lower rural urban
## 5 central 85-89 1 1 0.00112 0.23 0.23 0.30 0.17 0.50 0.50
## 6 central 90-94 2 1 0.00051 0.21 0.21 0.26 0.17 0.51 0.49
## 7 central 95-99 3 1 0.00034 0.19 0.19 0.23 0.16 0.50 0.50
## 1 central 00-04 4 1 0.00027 0.18 0.18 0.22 0.15 0.51 0.49
## 2 central 05-09 5 1 0.00025 0.17 0.17 0.20 0.14 0.50 0.50
## 3 central 10-14 6 1 0.00037 0.15 0.15 0.18 0.12 0.50 0.50
## is.yearly years.num
## 5 FALSE NA
## 6 FALSE NA
## 7 FALSE NA
## 1 FALSE NA
## 2 FALSE NA
## 3 FALSE NA
We can compare the stratum-specific and aggregated U5MR estimates now.
There are extensive visualization tools in the package. We can add
hatching to the map visualizations to indicate uncertainty of the
estimates using hatchPlot
with similar syntax.
hatchPlot(est.bb$overall,
geo = DemoMap$geo, by.data = "region", by.geo = "REGNAME",
is.long = TRUE, variables = "years", values = "median",
lower = "lower", upper = "upper", hatch = "red",
ncol = 4, direction = -1, per1000 = TRUE, legend.label = "U5MR")
Next we show the posterior densities of the estimates, where each
each panel correspond to one time period, and the regions are ordered by
the posterior median of estimates in the last year (specified by the
order = -1
argument).
ridgePlot(draws = est.bb, year.plot = periods.all,
ncol = 4, per1000 = TRUE, order = -1, direction = -1)
It can also be plotted for each region over time as well.
ridgePlot(draws = est.bb, year.plot = periods.all,
ncol = 4, per1000 = TRUE, by.year = FALSE, direction = -1)
When external estimates are available on the national level, we may
want the subnational estimates to aggregate to the national estimates
for internal validation purposes. Methods to benchmark to the national
mean estimates were proposed in the literature. The vignette A Case
Study of Estimating Subnational U5MR using DHS data demonstrates
the approach described in Li et al. (2019)
for the area-level model. The bias.adj
argument in the
smoothCluster()
function allows one to adjust the estimates
if the benchmarking ratio is known, using the approximation methods
described in Wakefield et al. (2019). We
refer readers to the package document for more details. Here we
demonstrate a post-processing method described by Okonek and Wakefield (2022). It has the
additional advantage that both the mean and uncertainty of the national
estimates are accounted for in the benchmarking process.
To demonstrate the function, we first simulate national estimates. For a realistic simulation, we reuse the national direct estimates and add a constant bias term. We assume these estimates are from an external source. Thus we expect the estimates after benchmarking will be larger in general.
national <- data.frame(years = periods.all, est = out1$median + 0.01, sd = runif(7,
0.01, 0.03))
head(national)
## years est sd
## 1 85-89 0.24 0.020
## 2 90-94 0.23 0.012
## 3 95-99 0.21 0.025
## 4 00-04 0.22 0.023
## 5 05-09 0.20 0.018
## 6 10-14 0.17 0.025
In order to perform the benchmarking process, we need the proportion of population by region over the time periods. We can compute these from the simulated population before.
weight.region <- expand.grid(region = c("central", "eastern", "northern", "western"),
years = periods.all)
weight.region$proportion <- NA
for (i in 1:dim(weight.region)[1]) {
which <- which(pop$region == weight.strata$region[i] & pop$years == weight.strata$years[i])
which.year <- which(pop$years == weight.strata$years[i])
weight.region[i, "proportion"] <- sum(pop$population[which])/sum(pop$population[which.year])
}
head(weight.region)
## region years proportion
## 1 central 85-89 0.410
## 2 eastern 85-89 0.253
## 3 northern 85-89 0.084
## 4 western 85-89 0.253
## 5 central 90-94 0.413
## 6 eastern 90-94 0.256
We perform the benchmarking procedure using the
Benchmark
function
est.bb.bench <- Benchmark(est.bb, national, weight.region = weight.region, estVar = "est",
sdVar = "sd", timeVar = "years")
Since the benchmarking procedure is based on a rejection sampler, the number of posterior draws accepted after benchmarking can be small, as in this example. We recommend increasing the number of draws first, i.e.,
est.bb <- getSmoothed(fit.bb, nsim = 20000, CI = 0.95, save.draws = TRUE, weight.strata = weight.strata)
est.bb.bench <- Benchmark(est.bb, national, weight.region = weight.region, estVar = "est",
sdVar = "sd", timeVar = "years")
We can compare the posterior median and variance before and after benchmarking.
par(mfrow = c(1, 2))
plot(est.bb$overall$median, est.bb.bench$overall$median, xlab = "Before benchmarking",
ylab = "After benchmarking", main = "Posterior median")
abline(c(0, 1))
plot(est.bb$overall$variance, est.bb.bench$overall$variance, xlab = "Before benchmarking",
ylab = "After benchmarking", main = "Posterior variance")
abline(c(0, 1))
As an sanity check, we can compute the simple weighted average of the posterior median of all regions to form a set of rough national estimates. It can be seen that the benchmarked estimates are indeed higher and closer to the simulated national estimates benchmarked against.
compare <- national
compare$before <- NA
compare$after <- NA
for (i in 1:dim(compare)[1]) {
weights <- subset(weight.region, years == national$years[i])
sub <- subset(est.bb$overall, years == national$years[i])
sub <- merge(sub, weights)
sub.bench <- subset(est.bb.bench$overall, years == national$years[i])
sub.bench <- merge(sub.bench, weights)
compare$before[i] <- sum(sub$proportion * sub$median)
compare$after[i] <- sum(sub.bench$proportion * sub.bench$median)
}
plot(compare$est, compare$after, col = 2, pch = 10, xlim = range(c(compare$est, compare$before,
compare$after)), ylim = range(c(compare$est, compare$before, compare$after)),
xlab = "External national estimates", ylab = "Weighted posterior median after benchmarking",
main = "Sanity check: compare weighted average of area medians to benchmark")
points(compare$est, compare$before)
abline(c(0, 1))
legend("topleft", c("Before benchmarking", "After benchmarking"), pch = c(1, 10),
col = c(1, 2))