The sampler R package is designed to enable data scientists to design, draw, and analyze simple or complex samples using data frames. It enables you to load machine-readable files (e.g. .csv, .tsv, etc.) in R containing a sampling frame or collected data, store them as objects, and perform sampling techniques and analysis using clear and concise methods.
Specifically, a data scientist can use the sampler R package to:
The sampler R package builds a bridge for survey administrators between the free and open-source R environment and no-to-low cost Open Data Kit (ODK)-based toolkits such as Ona and ELMO. The sampler package is available via CRAN or GitHub for use in R and R Studio.
install.packages("sampler")
library(sampler)
install.packages("devtools"); library(devtools)
::install_github("mbaldassaro/sampler"); library(sampler) devtools
The sampler R package includes the following datasets:
albania
: dataset containing 2017 Albania election
results by polling station published by the Central Election Commission
and opened by the Coalition of Domestic Observers & Democracy
Internationalopening
: dataset containing 2017 Albania election
observation findings on polling station opening process by the Coalition
of Domestic Observers (CDO) CDO conducted a statistically-based
observation (SBO) exercise, deploying observers to a random sample of
polling stations for the 25 June 2017 Albanian elections. This is a
subset of observation data collected by CDO observers that includes data
that was used to perform statistical analysisFull documentation of datasets and functions can be found on RDocumentation
rsampcalc(N, e, ci=95,p=0.5, over=0)
Where: * N
is population universe (e.g. 10000, nrow(df))
* e
is tolerable margin of error (integer or float, e.g. 5,
2.5) * ci
(optional) is confidence level for establishing a
confidence interval using z-score (defaults to 95; restricted to 80, 85,
90, 95 or 99 as input) * p
(optional) is anticipated
response distribution (defaults to 0.5; takes value between 0 and 1 as
input) * over
(optional) is desired oversampling proportion
(defaults to 0; takes value between 0 and 1 as input)
Returns appropriate sample size (rounded up to nearest integer)
Example:
rsampcalc(N=5361, e=3, ci=95, p=0.5, over=0.1)
Source: Sampling Design & Analysis, S. Lohr, 1999, equation 2.17
rsamp(df, n, over=0, rep=FALSE)
Where: * df
is object containing full sampling data
frame * n
is sample size (integer or object containing
sample size) * over
(optional) is desired oversampling
proportion (defaults to 0; takes value between 0 and 1 as input) *
rep
(optional) is boolean for a sample with repalcement
(TRUE) or without replacement (defaults to FALSE)
Returns a simple random sample of size n
Example:
rsamp(albania, n=360, over=0.1, rep=FALSE)
or
<- rsampcalc(nrow(albania), 3, 95, 0.5)
size rsamp(albania, size)
ssampcalc(df, n, strata, over=0)
Where: * df
is object containing sampling data frame *
n
is sample size (integer) or object containing sample size
* strata
is variable in sampling data frame by which to
stratify * over
(optional) is desired oversampling
proportion (defaults to 0; takes value between 0 and 1 as input)
Returns proportional sample size per strata (rounded up to nearest integer)
Example:
ssampcalc(df=albania, n=544, strata=qarku, over=0.05)
or
<- rsampcalc(nrow(albania), 3, 95, 0.5)
size ssampcalc(albania, size, qarku)
Source: Sampling Design & Analysis, S. Lohr, 1999, 4.4
ssamp(df, n, strata, over=1)
Where: * df
is object containing full sampling data
frame * n
is sample size (integer, or object containing
sample size) * strata
is variable in sampling data frame by
which to stratify (e.g. region) * over
(optional) is
desired oversampling proportion (defaults to 0; takes value between 0
and 1 as input)
Returns stratified sample using proportional allocation without replacement
Example:
ssamp(df=albania, n=360, strata=qarku, over=0.1)
or
<- rsampcalc(nrow(albania), 3, 95, 0.5)
size ssamp(albania, size, qarku)
psampcalc(df, n, strata, unit, over=0)
Where: * df
is object containing full sampling data
frame * n
is sample size (integer) or object containing
sample size * strata
is variable in sampling data frame by
which to stratify * unit
is variable in sampling data frame
containing sub-units (e.g. population) * over
(optional) is
desired oversampling proportion (defaults to 0; takes value between 0
and 1 as input)
Returns sample size per strata based on sub-units (rounded up to nearest integer)
Example
psampcalc(df=albania, n=544, strata=qarku, unit=zgjedhes, over=0.1)
Source: Sampling Design & Analysis, S. Lohr, 1999, 4.4
rpro(df, col_name, ci=95, na="", N=0)
Where: * df
is object containing data frame on which to
perform analysis (e.g. data) * col_name
is variable in data
frame for which you want to calculate proportion and margin of error *
ci
(optional) is confidence level for establishing a
confidence interval using z-score (defaults to 95; restricted to 80, 85,
90, 95 or 99 as input) * na
(optional) is value that you
want to filter and exclude (defaults to include everything) *
N
(optional) is population universe (e.g. 10000, nrow(df));
if N value is passed as an argument, margin of error will be calculated
using fpc
Returns table of responses (n), proportions (midpoint), margins of error, lower and upper bounds by factor for a given variable
Example:
rpro(df=opening, col_name=openTime, ci=95, na="n/a", N=5361)
Source: Sampling Design & Analysis, S. Lohr, 1999, Equation 2.15
spro(fulldf, sampdf, strata, col_name, ci=95, na="")
Where: * fulldf
is object containing original data frame
used to draw sample * sampdf
is object containing data
frame on which to perform analysis * strata
is variable in
both data frames by which to stratify * col_name
is
variable in data frame for which you want to calculate proportion and
margin of error * ci
(optional) is confidence level for
establishing a confidence interval using z-score (defaults to 95;
restricted to 80, 85, 90, 95 or 99 as input) * na
(optional) is value that you want to filter and exclude (defaults to
include everything)
Returns table of responses (n), proportions (midpoint), margins of error, lower and upper bounds by factor for a given variable in a stratified sample
Example:
spro(fulldf=albania, sampdf=opening, strata=qarku, col_name=openTime, ci=95, na="n/a")
Source: Sampling Design & Analysis, S. Lohr, 1999, 4.6 & 4.7
cpro(df, numerator, denominator, ci=95, na="", N=0)
Where: * df
is object containing data frame on which to
perform analysis * numerator
is variable in data frame for
which you want to calculate proportion and margin of error *
denominator
is variable in data frame containing population
of unequal cluster sizes * ci
(optional) is confidence
level for establishing a confidence interval using z-score (defaults to
95; restricted to 80, 85, 90, 95 or 99 as input) * na
(optional) is value that you want to filter and exclude (defaults to
include everything) * N
(optional) is population universe
(e.g. 10000, nrow(df)); if N value is passed as an argument, margin of
error will be calculated using fpc
Returns table of responses (n), proportions (midpoint), margins of error, lower and upper bounds by factor for a given variable in an unequal-sized cluster sample
Example:
<- ssamp(albania, 890, qarku)
alresults cpro(df=alresults, numerator=totalVoters, denominator=zgjedhes, ci=95)
cpro(df=alresults, numerator=pd, denominator=validVotes, ci=95, N=5361)
Source 1: Survey Sampling, L. Kish, 1965, Equation 6.3.4
Source 2: Sampling Techniques, W.G. Cochran, 1977, Equation 3.34
rmissing(sampdf, colldf, col_name)
Where: * sampdf
is object containing data frame of
sample points * colldf
is object containing data frame of
collected data * col_name
is common variable (i.e. key) in
data frames by which to check for missing points
Returns table of sample points missing from collected data
Example:
<- rsamp(df=albania, 544)
alsample <- rsamp(df=alsample, 390)
alreceived rmissing(sampdf=alsample, colldf=alreceived, col_name=qvKod)
smissing(sampdf, colldf, strata, col_name)
Where: * sampdf
is object containing data frame of
sample points * colldf
is object containing data frame of
collected data * strata
is variable in both data frames by
which to stratify * col_name
is common variable (i.e. key)
in data frames by which to check for missing points
Returns table of number of sample points by strata missing from collected data
Example:
<- rsamp(df=albania, 544)
alsample <- rsamp(df=alsample, 390)
alreceived smissing(sampdf=alsample, colldf=alreceived, strata=qarku, col_name=qvKod)
dupe(df, col_name)
Where: * df
is object containing data frame of collected
data * col_name
is variable within data frame by which to
filter for duplicate values
Returns table of duplicate values within collected data
Example:
<- rsamp(df=albania, n=390, rep=TRUE)
aldupe dupe(df=aldupe, col_name=qvKod)
dedupe(df, col_name)
Where: * df
is object containing data frame of collected
data * col_name
is variable within data frame by which to
filter for duplicate values
Returns table of observations based on unique values within collected data
Example:
<- rsamp(df=albania, n=390, rep=TRUE)
aldupe dedupe(df=aldupe, col_name=qvKod)