simtrait
: Simulate Complex Traits from Genotypes\(\DeclareMathOperator{\E}{E}\) \(\DeclareMathOperator{\Cov}{Cov}\)
This vignette has three main parts:
Practical examples that show how to use the functions and demonstrate that the random traits generated by this package have the desired covariance structure.
The mathematical trait model that motivated this package.
The algorithm implementation, which follows straightforwardly from the model when ancestral allele frequencies are known. However, more painful details are necessary when ancestral allele frequencies must be estimated from the genotypes, which induces biases that fortunately can be corrected.
In this section we first simulated an admixed population using bnpsd
, then we simulate traits using simtrait
. In particular, we simulate a large number of traits to demonstrate that their sample covariance matrix is as expected.
library(popkin) # to create plots of our covariance matrices
library(bnpsd) # to simulate an admixed population
library(simtrait) # this package
# dimensions of data/model
# number of loci
<- 10000
m_loci # number of individuals, smaller than usual for easier visualizations
<- 30
n_ind # number of intermediate subpops (source populations for admixed individuals)
<- 3
k_subpops
# define population structure
# FST values for 3 subpopulations (proportional/unnormalized)
<- 1 : k_subpops
inbr_subpops <- 0.5 # bias coeff of standard Fst estimator
bias_coeff <- 0.3 # desired final Fst
Fst <- admix_prop_1d_linear(
obj n_ind = n_ind,
k_subpops = k_subpops,
bias_coeff = bias_coeff,
coanc_subpops = inbr_subpops,
fst = Fst
)<- obj$admix_proportions
admix_proportions # rescaled Fst vector for intermediate subpops
<- obj$coanc_subpops
inbr_subpops
# get pop structure parameters of the admixed individuals
<- coanc_admix(admix_proportions, inbr_subpops)
coancestry <- coanc_to_kinship(coancestry)
kinship
# draw allele freqs and genotypes
<- draw_all_admix(admix_proportions, inbr_subpops, m_loci)
out <- out$X # genotypes
X <- out$p_anc # ancestral AFs p_anc
First we simulate one trait. Note that we pick non-default values for the mean mu
and variance factor sigma_sq
to validate that these can be anything we want.
# parameters of simulation
<- 100
m_causal <- 0.8
herit # default 0, let's try a non-trivial case
<- 1
mu # default 1, also let's see that this more complicated case works well
<- 1.5
sigma_sq
# create simulated trait
# case of exact p_anc
<- sim_trait(
obj X = X,
m_causal = m_causal,
herit = herit,
p_anc = p_anc,
mu = mu,
sigma_sq = sigma_sq
)# trait vector
length(obj$trait)
#> [1] 30
n_ind#> [1] 30
$trait
obj#> [1] 1.00069332 2.53888790 1.42660255 -0.88322626 3.08248240 1.61879629
#> [7] 1.10440061 3.00185264 1.91014736 0.83263612 2.92201651 1.79979790
#> [13] 1.91331148 2.02561827 0.84362082 1.47471941 1.77407608 0.81792990
#> [19] 0.22485294 1.39820207 1.28270478 -0.30555166 1.95875281 0.95595471
#> [25] 0.74833719 1.34793993 -0.13496653 1.29827423 0.05408678 1.08624785
# randomly-picked causal locus indexes
length( obj$causal_indexes )
#> [1] 100
m_causal#> [1] 100
head( obj$causal_indexes ) # show partially...
#> [1] 359 4241 4242 712 7874 3393
# regression coefficients vector
length( obj$causal_coeffs )
#> [1] 100
m_causal#> [1] 100
head( obj$causal_coeffs ) # show partially...
#> [1] 0.394236347 0.003328288 -0.039226445 -0.071400398 -0.062547204
#> [6] 0.180616273
The interesting validation is simulation a large number of random traits, from which we can estimate a sample covariance matrix to compare to the desired theoretical one. We shall compare this to other versions of the simulation. We distinguish this version as having random coefficients (RC) and employing true allele frequencies p_anc
.
# the theoretical covariance matrix of the trait is calculated by cov_trait
<- cov_trait(kinship = kinship, herit = herit, sigma_sq = sigma_sq)
V
# simulate these many traits
<- 1000
n_traits # store in this matrix, initialize with zeroes
<- matrix(data = 0, nrow = n_traits, ncol = n_ind)
Y_rc_freq # start loop
for (i in 1 : n_traits) {
<- sim_trait(
obj X = X,
m_causal = m_causal,
herit = herit,
p_anc = p_anc,
mu = mu,
sigma_sq = sigma_sq
)<- obj$trait # store in i^th row
Y_rc_freq[i,]
}# estimate sample covariance
<- cov(Y_rc_freq) V_rc_freq
First let’s verify that the mean is as expected. Below the red line marks the desired mean.
<- par(mgp = c(2, 0.5, 0))
par_orig # reduce margins from default
par(mar = c(3.5, 3, 0, 0) + 0.2)
# visualize distribution
boxplot(
list(
'RC freq' = rowMeans(Y_rc_freq)
),xlab = "Trait Type",
ylab = 'Sample Mean'
)# red line marks expected mean
abline(h = mu, col = 'red')
par( par_orig ) # reset `par`
Now let’s visualize the covariance matrices using plot_popkin
from the popkin
package. Since both matrices have large diagonals, we shrink them somewhat using inbr_diag
also from the popkin
package.
plot_popkin(
list(V, V_rc_freq),
titles = c('Theoretical', 'RC freq'),
leg_title = 'Covariance',
panel_letters_adj = 0,
# set margin for title (top is non-zero)
mar = c(0, 2)
)
This plot verifies that the empirical covariance matches the theoretical expectation!
For real data, true ancestral allele frequencies are unknown. A reasonable trait can still be simulated in these cases, but this solution no longer has theoretical guarantees to yield the desired mean value in particular. This solution relies on a known mean kinship to compensate for the biases of estimated ancestral allele frequencies. A good kinship matrix estimate can be obtained using the popkin
package.
For simplicity, here we use the true kinship matrix rather than an estimate:
# store this in new matrix
<- matrix(data = 0, nrow = n_traits, ncol = n_ind)
Y_rc_kin # start loop
for (i in 1 : n_traits) {
<- sim_trait(
obj X = X,
m_causal = m_causal,
herit = herit,
# whole kinship matrix can be passed instead of just mean
kinship = kinship,
mu = mu,
sigma_sq = sigma_sq
)<- obj$trait # store in i^th row
Y_rc_kin[i,]
}# estimate sample covariance
<- cov(Y_rc_kin) V_rc_kin
First let’s verify the means again. Recall the red line marks the desired mean. Below the original sample (simulated using the true p_anc
) is shown first as “RC freq,” while the new sample based on the kinship matrix is “RC kinship”:
<- par(mgp = c(2, 0.5, 0))
par_orig # reduce margins from default
par(mar = c(3.5, 3, 0, 0) + 0.2)
# visualize distribution
boxplot(
list(
"RC freq" = rowMeans(Y_rc_freq),
"RC kinship" = rowMeans(Y_rc_kin)
),xlab = "Trait Type",
ylab = 'Sample Mean'
)# red line marks expected mean
abline(h = mu, col = 'red')
par( par_orig ) # reset `par`
Now we compare all three matrices:
plot_popkin(
list(V, V_rc_freq, V_rc_kin),
titles = c('Theoretical', 'RC freq', 'RC kinship'),
leg_title = 'Covariance',
panel_letters_adj = 0,
mar = c(0, 2)
)
This plot shows again good agreement between the sample covariance matrix of traits simulated without true ancestral allele frequencies (“RC kinship”) and the desired “theoretical” covariance matrix.
An alternative approach for simulating traits is by drawing them from a Multivariate Normal (MVN) model with the desired mean and covariance structure. This is often called the infinitesimal model, since it follows from the central limit theorem under the assumption that there are infinite causal loci, each with an infinitesimal effect size. A trait simulated this way has no use in GWAS tests, as there are no causal loci (in other words, the null hypothesis holds across the genome). However, these traits have a heritability that can be estimated, and in fact this infinitesimal model is assumed by approaches that estimate heritability by fitting variance components, such as GCTA (Yang et al. 2011).
We draw the MVN traits this way:
# This function simulates trait replicates in one call,
# generating a matrix comparable to the previous ones.
<- sim_trait_mvn(
Y_mvn rep = n_traits,
kinship = kinship,
herit = herit,
mu = mu,
sigma_sq = sigma_sq
)# estimate sample covariance
<- cov(Y_mvn) V_mvn
First let’s verify the means again. Recall the red line marks the desired mean. The new sample is denoted as “MVN,” and the other two are as in the previous sections (traits simulated from genotypes, using the true p_anc
or with bias corrections from the kinship matrix):
<- par(mgp = c(2, 0.5, 0))
par_orig # reduce margins from default
par(mar = c(3.5, 3, 0, 0) + 0.2)
# visualize distribution
boxplot(
list(
"RC freq" = rowMeans(Y_rc_freq),
"RC kinship" = rowMeans(Y_rc_kin),
"MVN" = rowMeans(Y_mvn)
),xlab = "Trait Type",
ylab = 'Sample Mean'
)# red line marks expected mean
abline(h = mu, col = 'red')
par( par_orig ) # reset `par`
Now we compare all four covariance matrices:
plot_popkin(
list(V, V_rc_freq, V_rc_kin, V_mvn),
titles = c('Theoretical', 'RC freq', 'RC kinship', 'MVN'),
leg_title = 'Covariance',
panel_letters_adj = 0,
mar = c(0, 2),
leg_width = 0.4
)
This plot shows again good agreement between the sample covariance matrix of traits simulated under the infinitesimal model (“MVN”) and the desired “theoretical” covariance matrix and first two simulations.
This package can also simulate traits from a model where coefficients are larger for rarer variants, which may be more realistic for disease traits where selection prevents common variants from having large coefficients, while allowing rare variants to have larger coefficients. The effect size of locus \(i\) is its variance contribution, equal to \(2 \beta^2_i p_i(1-p_i)\) for outbred individuals, where \(\beta_i\) is the regression coefficient and \(p_i\) is the ancestral allele frequency. The limit of strong negative and other modes of selection forces effect sizes to be equal for all loci, so the coefficient at locus \(i\) is proportional to \(1 / \sqrt{p_i(1-p_i)}\). As in the other models, the coefficients are rescaled to yield the desired heritability and variance factor. This is related to previous models proposed in the literature, for example (Speed et al. 2012).
This time we simulate both the true allele frequency version, “freq,” and the “kinship” version that unbiases sample allele frequencies.
# store this in new matrix
<- matrix(data = 0, nrow = n_traits, ncol = n_ind)
Y_fes_freq <- matrix(data = 0, nrow = n_traits, ncol = n_ind)
Y_fes_kin # start loop
for (i in 1 : n_traits) {
<- sim_trait(
obj X = X,
m_causal = m_causal,
herit = herit,
p_anc = p_anc,
mu = mu,
sigma_sq = sigma_sq,
fes = TRUE # only diff from orig run
)<- obj$trait # store in i^th row
Y_fes_freq[i,]
<- sim_trait(
obj X = X,
m_causal = m_causal,
herit = herit,
kinship = kinship,
mu = mu,
sigma_sq = sigma_sq,
fes = TRUE # only diff from orig run
)<- obj$trait # store in i^th row
Y_fes_kin[i,]
}# estimate sample covariance
<- cov(Y_fes_freq)
V_fes_freq <- cov(Y_fes_kin) V_fes_kin
First let’s verify the means again. Recall the red line marks the desired mean. The new samples are labeled as “FES freq” and “FES kinship,” and the first three are as in the previous sections (traits simulated from genotypes and the random coefficients (RC) model, using the true p_anc
or with bias corrections from the kinship matrix, and the MVN traits):
<- par(mgp = c(2, 0.5, 0))
par_orig # reduce margins from default
par(mar = c(3.5, 3, 0, 0) + 0.2)
# visualize distribution
boxplot(
list(
"RC freq" = rowMeans(Y_rc_freq),
"RC kinship" = rowMeans(Y_rc_kin),
"MVN" = rowMeans(Y_mvn),
"FES freq" = rowMeans(Y_fes_freq),
"FES kinship" = rowMeans(Y_fes_kin)
),xlab = "Trait Type",
ylab = 'Sample Mean'
)# red line marks expected mean
abline(h = mu, col = 'red')
par( par_orig ) # reset `par`
Now we compare all covariance matrices:
plot_popkin(
list( V, V_rc_freq, V_rc_kin, V_mvn, V_fes_freq, V_fes_kin ),
titles = c('Theoretical', 'RC freq', 'RC kinship', 'MVN', 'FES freq', 'FES kinship'),
leg_title = 'Covariance',
panel_letters_adj = 0,
mar = c(0, 2),
leg_width = 0.4,
layout_rows = 2
)
These plots show again good agreement between the sample covariance matrix of traits simulated under this “fixed effect sizes” model (“FES freq” and “FES kinship”) and the desired “theoretical” covariance matrix and first three simulations.
In closing these simulation results, let’s make more direct and detailed comparisons between the empirical covariances matrices and the desired theoretical one:
<- par(mgp = c(2, 0.5, 0))
par_orig # create multipanel figure
par( mfrow = c(2, 3) )
# reduce margins from default
par(mar = c(3.5, 3, 0, 0) + 0.2)
plot( V, V_rc_freq, xlab = 'Theoretical Cov', ylab = 'RC freq Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_rc_kin, xlab = 'Theoretical Cov', ylab = 'RC kinship Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_mvn, xlab = 'Theoretical Cov', ylab = 'MVN Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_fes_freq, xlab = 'Theoretical Cov', ylab = 'FES freq Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_fes_kin, xlab = 'Theoretical Cov', ylab = 'FES kinship Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
barplot(
c(
rmsd( V, V_rc_freq ),
rmsd( V, V_rc_kin ),
rmsd( V, V_mvn ),
rmsd( V, V_fes_freq ),
rmsd( V, V_fes_kin )
),names.arg = c('RC freq', 'RC kin', 'MVN', 'FES freq', 'FES kin'),
ylab = 'RMSD from Theoretical',
las = 3
)
par( par_orig ) # reset `par`
Empirical covariance matrices estimate the theoretical one without bias when ancestral allele frequencies are used (“freq” versions) and with the MVN model, all of which have a comparable small mean squared error (RMSD in figure above). However, use of estimated allele frequencies, despite the necessary kinship bias correction (“kin” versions), still resulted in empirical covariances that are upwardly biased estimators of the desired theoretical values, which is slight for “RC kin” but severe for “FES kin.”
We can also simulate random group effects not determined by genetics (i.e., functions of the kinship matrix), but rather a result of the environment (any categorical covariates, which may represent geography, culture, etc). The code can simulate several such levels of groups, each level drawing random Normal effects with mean zero and a given fixed variance parameter. For the illustration we shall simulate two levels, each determined by the index or coordinate of the individual, but these two levels are not nested.
# first level, first half is all one group called "a"
# second half is another group called "b"
<- round( n_ind / 2 )
n_half <- c(
labs1 rep.int( 'a', n_half ),
rep.int( 'b', n_ind - n_half )
)# second level will be in thirds instead
# each level is independent, so group names can repeat across levels
<- round( n_ind / 3 )
n_third <- c(
labs2 rep.int( 'a', n_third ),
rep.int( 'b', n_third ),
rep.int( 'c', n_ind - 2 * n_third )
)# combine!
<- cbind( labs1, labs2 )
labs # set heritability and variance effects for these levels
# reduce heritability to allow for large group variances
<- 0.3
herit <- c( 0.3, 0.2 ) labs_sigma_sq
Let’s redraw all traits to have these block effects, and validate them by comparing their empirical covariances to the desired, theoretical covariance. It is done as before except the parameters labs
and labs_sigma_sq
are now passed to the functions cov_trait
, sim_trait
, and sim_trait_mvn
:
<- cov_trait( kinship = kinship, herit = herit, sigma_sq = sigma_sq,
V labs = labs, labs_sigma_sq = labs_sigma_sq )
# store in this matrix, initialize with zeroes
<- matrix(data = 0, nrow = n_traits, ncol = n_ind)
Y_rc_freq <- matrix(data = 0, nrow = n_traits, ncol = n_ind)
Y_rc_kin <- matrix(data = 0, nrow = n_traits, ncol = n_ind)
Y_fes_freq <- matrix(data = 0, nrow = n_traits, ncol = n_ind)
Y_fes_kin # start loop
for (i in 1 : n_traits) {
<- sim_trait( X=X, m_causal=m_causal, herit=herit, mu=mu, sigma_sq=sigma_sq,
Y_rc_freq[i,] labs=labs, labs_sigma_sq=labs_sigma_sq, p_anc=p_anc )$trait
<- sim_trait( X=X, m_causal=m_causal, herit=herit, mu=mu, sigma_sq=sigma_sq,
Y_rc_kin[i,] labs=labs, labs_sigma_sq=labs_sigma_sq, kinship=kinship )$trait
<- sim_trait( X=X, m_causal=m_causal, herit=herit, mu=mu, sigma_sq=sigma_sq,
Y_fes_freq[i,] labs=labs, labs_sigma_sq=labs_sigma_sq, p_anc=p_anc, fes=TRUE )$trait
<- sim_trait( X=X, m_causal=m_causal, herit=herit, mu=mu, sigma_sq=sigma_sq,
Y_fes_kin[i,] labs=labs, labs_sigma_sq=labs_sigma_sq, kinship=kinship, fes=TRUE )$trait
}<- sim_trait_mvn( rep = n_traits, kinship = kinship, herit = herit,
Y_mvn mu = mu, sigma_sq = sigma_sq, labs = labs, labs_sigma_sq = labs_sigma_sq )
# estimate sample covariance
<- cov(Y_rc_freq)
V_rc_freq <- cov(Y_rc_kin)
V_rc_kin <- cov(Y_fes_freq)
V_fes_freq <- cov(Y_fes_kin)
V_fes_kin <- cov(Y_mvn) V_mvn
We verify the means as before. Recalling that the red line marks the desired mean, we see that all means are as requested:
<- par(mgp = c(2, 0.5, 0))
par_orig # reduce margins from default
par(mar = c(3.5, 3, 0, 0) + 0.2)
# visualize distribution
boxplot(
list(
"RC freq" = rowMeans(Y_rc_freq),
"RC kinship" = rowMeans(Y_rc_kin),
"MVN" = rowMeans(Y_mvn),
"FES freq" = rowMeans(Y_fes_freq),
"FES kinship" = rowMeans(Y_fes_kin)
),xlab = "Trait Type",
ylab = 'Sample Mean'
)# red line marks expected mean
abline(h = mu, col = 'red')
par( par_orig ) # reset `par`
Now we compare all covariance matrices. We clearly see the group effects (covariance diagonal blocks), and in this case the kinship effect is much reduced (we reduced the heritability), so all simulations (which only differ in their genetic effect) agree even more than in the earlier high heritability example:
plot_popkin(
list( V, V_rc_freq, V_rc_kin, V_mvn, V_fes_freq, V_fes_kin ),
titles = c('Theoretical', 'RC freq', 'RC kinship', 'MVN', 'FES freq', 'FES kinship'),
leg_title = 'Covariance',
panel_letters_adj = 0,
mar = c(0, 2),
leg_width = 0.4,
layout_rows = 2
)
Lastly, here is the direct comparisons between the empirical covariances matrices and the desired theoretical one:
<- par(mgp = c(2, 0.5, 0))
par_orig # create multipanel figure
par( mfrow = c(2, 3) )
# reduce margins from default
par(mar = c(3.5, 3, 0, 0) + 0.2)
plot( V, V_rc_freq, xlab = 'Theoretical Cov', ylab = 'RC freq Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_rc_kin, xlab = 'Theoretical Cov', ylab = 'RC kinship Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_mvn, xlab = 'Theoretical Cov', ylab = 'MVN Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_fes_freq, xlab = 'Theoretical Cov', ylab = 'FES freq Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
plot( V, V_fes_kin, xlab = 'Theoretical Cov', ylab = 'FES kinship Cov' )
abline( 0, 1, lty = 2, col = 'gray' )
barplot(
c(
rmsd( V, V_rc_freq ),
rmsd( V, V_rc_kin ),
rmsd( V, V_mvn ),
rmsd( V, V_fes_freq ),
rmsd( V, V_fes_kin )
),names.arg = c('RC freq', 'RC kin', 'MVN', 'FES freq', 'FES kin'),
ylab = 'RMSD from Theoretical',
las = 3
)
par( par_orig ) # reset `par`
All errors are low except “FES kin” still has a noticeable upward bias and the largest error, although this is still a small error compared to the previous high heritability simulation.
Here is a brief summary of the trait model, which explains what this package does internally.
Suppose there are \(n\) individuals and \(m\) (causal) loci. For simplicity we shall assume that every locus has a regression coefficient, although in practice many of these coefficients will be zero. The following variables are part of the model:
Variable | Dimensions | Description |
---|---|---|
\(\mathbf{X}\) | \(n \times m\) | Genotypes |
\(\mathbf{x}_i\) | \(n \times 1\) | Genotype vector at locus \(i\) |
\(\mathbf{y}\) | \(n \times 1\) | Trait |
\(\mathbf{\beta}\) | \(m \times 1\) | Regression coefficients |
\(\mathbf{\epsilon}\) | \(n \times 1\) | Non-genetic effects |
\(\mathbf{p}\) | \(m \times 1\) | Ancestral allele frequencies |
\(\mathbf{\Phi}\) | \(n \times n\) | Kinship matrix |
\(\alpha\) | \(1 \times 1\) | Intercept coefficient |
\(\mu\) | \(1 \times 1\) | Trait mean |
\(h^2\) | \(1 \times 1\) | Heritability |
\(\sigma^2\) | \(1 \times 1\) | Trait variance factor |
\(\mathbf{1}\) | \(n \times 1\) | Vector of ones |
\(\mathbf{0}\) | \(n \times 1\) | Vector of zeroes |
\(\mathbf{I}\) | \(n \times n\) | Identity matrix |
We assume the linear polygenic model for a quantitative trait: \[ \mathbf{y} = \alpha \mathbf{1} + \mathbf{X} \mathbf{\beta} + \mathbf{\epsilon}. \] To analyze the covariance structure of the trait, we shall assume that \(\alpha\) and \(\mathbf{\beta}\) are fixed parameters, while \(\mathbf{X} = (\mathbf{x}_i)\) and \(\mathbf{\epsilon}\) are random with expectations and covariances of \[\begin{align*} \E[\mathbf{X}] &= 2 \mathbf{1} \mathbf{p}^\intercal , \\ \Cov(\mathbf{x}_i) &= 4 p_i (1-p_i) \mathbf{\Phi} , \\ \E[\mathbf{\epsilon}] &= \mathbf{0} , \\ \Cov(\mathbf{\epsilon}) &= (1-h^2) \sigma^2 \mathbf{I} , \end{align*}\] where \(\mathbf{p} = (p_i)\). The expectation of the trait is therefore \[\begin{align*} \E[\mathbf{y}] &= \alpha \mathbf{1} + \E[\mathbf{X}] \mathbf{\beta} + \E[\mathbf{\epsilon}] \\ &= \alpha \mathbf{1} + 2 \mathbf{1} \mathbf{p}^\intercal \mathbf{\beta} , \end{align*}\] which can be written as \[\begin{align*} \E[\mathbf{y}] = \mu \mathbf{1} , \quad \text{where} \quad \mu = \alpha + 2 \mathbf{p}^\intercal \mathbf{\beta} . \end{align*}\] The covariance matrix of the trait is \[\begin{align*} \Cov(\mathbf{y}) &= \sum_{i=1}^m \Cov(\mathbf{x}_i) \beta_i^2 + \Cov(\mathbf{\epsilon}) \\ &= \mathbf{\Phi} \sum_{i=1}^m 4 p_i (1-p_i) \beta_i^2 + (1-h^2) \sigma^2 \mathbf{I} , \end{align*}\] where \(\mathbf{\beta} = (\beta_i)\). Therefore, we can write the covariance in terms of the heritability and the overall variance scale: \[\begin{align*} \Cov(\mathbf{y}) &= \sigma^2 \left( 2 h^2 \mathbf{\Phi} + (1-h^2) \mathbf{I} \right) , \quad \text{where} \\ \sigma^2 h^2 &= \sum_{i=1}^m 2 p_i (1-p_i) \beta_i^2 . \end{align*}\] The factor of two in front of \(\mathbf{\Phi}\) is traditionally there since for an unstructured population \(2 \mathbf{\Phi} = \mathbf{I}\), in which case the trait covariance simplifies to \(\Cov(\mathbf{y}) = \sigma^2 \mathbf{I}\) for any value of \(h^2\). More broadly, the variance of the trait for any outbred individual is \(\sigma^2\) under this parametrization.
In all cases the user sets the heritability and other parameters but not the coefficients \(\mathbf{\beta}\) directly. To choose \(\mathbf{\beta}\), the algorithm initially draws random coefficients or sets them using a formula (depending on the model) and scales them to yield the desired covariance structure.
The user provides a genotype matrix and sets the number of causal loci. The algorithm selects random loci to be the causal ones. From this moment on \(\mathbf{X}\) will contain only those causal loci.
Under the random coefficients (RC) model, the initial coefficients are drawn independently from a standard normal distribution: \[ \beta_i \sim \text{N}(0,1). \]
Under the fixed effect sizes (FES) model, the initial coefficients are \[ \beta_i = 1 / \sqrt{p_i(1-p_i)}. \] (When \(p_i\) are unknown, their sample estimates are used for this step.) Lastly, the sign of \(\beta_i\) is drawn randomly (it is negative with probability 0.5).
Again, whichever form these coefficients take, they are rescaled to result in the desired heritability and variance factor using the procedures described next.
Below we divide the algorithm into two steps: (1) scaling the coefficients, and (2) centering the trait. Each step forks into two cases: whether the true ancestral allele frequencies are known or not (the latter requires a known mean kinship).
Here we assume that \(\mathbf{p} = (p_i)\) is provided by the user. The user has also provided the desired values of both \(h^2\) and \(\sigma^2\). The initial genetic variance factor is \[ \sigma^2_0 = \sum_{i=1}^m 2 p_i (1-p_i) \beta_i^2. \] We obtain the desired variance by dividing each \(\beta_i\) by \(\sigma_0\) (which results in a variance of 1) and then multiply by \(h \sigma\) (which finally results in the desired variance of \(h^2 \sigma^2\)). Combining both steps, the update is \[ \mathbf{\beta} \leftarrow \mathbf{\beta} \frac{ h \sigma }{\sigma_0}. \]
When \(\mathbf{p}\) isn’t known, sample estimates \(\mathbf{\hat{p}}\) are constructed from the genotype data. Let \[ \hat{p}_i = \frac{1}{2n} \mathbf{1}^\intercal \mathbf{x}_i . \] Although this estimator is unbiased (\(\E[\mathbf{\hat{p}}] = \mathbf{p}\)), the resulting variance estimates of interest are downwardly biased (Ochoa and Storey 2016): \[ \E \left[ \hat{p}_i \left( 1-\hat{p}_i \right) \right] = p_i(1-p_i) (1 - \bar{\varphi}), \] where \(\bar{\varphi} = \frac{1}{n^2} \mathbf{1}^\intercal \mathbf{\Phi} \mathbf{1}\) is the mean kinship coefficient in the data. Therefore the initial genetic variance factor, estimated as \[ \hat{\sigma}^2_0 = \sum_{i=1}^m 2 \hat{p}_i (1-\hat{p}_i) \beta_i^2, \] has an expectation of \[ \E \left[ \hat{\sigma}^2_0 \right] = \sigma^2_0 (1 - \bar{\varphi}) \] Since this additional factor \((1 - \bar{\varphi})\) is known in this setting, the adjusted update \[ \mathbf{\beta} \leftarrow \mathbf{\beta} \frac{ h \sigma \sqrt{1-\bar{\varphi}} }{\hat{\sigma}_0} \] also results in the desired variance.
This is the preferred approach as it is the only case that guarantees success. Given our model, we obtain the desired overall trait mean \(\mu\) by choosing the intercept to be \[ \alpha = \mu - 2 \mathbf{p}^\intercal \mathbf{\beta} \]
The solution that this version of the algorithm takes is to choose the intercept \[\begin{align*} \alpha &= \mu - 2 \hat{\bar{p}} \mathbf{1}_m^\intercal \mathbf{\beta} , \quad \text{where} \\ \hat{\bar{p}} &= \frac{1}{m} \mathbf{1}_m^\intercal \mathbf{\hat{p}} = \frac{1}{2 m n} \mathbf{1}_m^\intercal \mathbf{X}^\intercal \mathbf{1}_n = \frac{1}{2} \bar{X} , \end{align*}\] where \(\mathbf{1}_m\) above are length-\(m\) vectors of ones. This works very well in practice since \(\mathbf{\beta}\) is drawn randomly, so it is uncorrelated with the true \(\mathbf{p}\) (this is true in FES too since the sign of each coefficient is random). In this setting it suffices to consider each coefficient \(\beta_i\) as acting on the average locus, which is treated as having a random ancestral allele frequency \(p_i\), and all that matters is the global mean of \(p_i\) values.
Now let’s discuss why the obvious way of centering the trait without known ancestral allele frequencies doesn’t work. Why not use the sample allele frequencies as \[ \alpha = \mu - 2 \mathbf{\hat{p}}^\intercal \mathbf{\beta} \quad ? \] Centering the trait this way is equivalent to centering genotypes at each locus: \[ \mathbf{y} = \alpha \mathbf{1} + \sum_{i=1}^m (\mathbf{x}_i - 2 \hat{p}_i \mathbf{1}) \beta_i + \mathbf{\epsilon}. \] However, this operation introduces a distortion in the covariance of the genotypes (Ochoa and Storey 2016): \[ \Cov \left( \mathbf{x}_i - 2 \hat{p}_i \mathbf{1} \right) = p_i (1-p_i) \left( \mathbf{\Phi} + \bar{\varphi} \mathbf{1}\mathbf{1}^\intercal - \mathbf{\varphi} \mathbf{1}^\intercal - \mathbf{1} \mathbf{\varphi}^\intercal \right), \] where \(\mathbf{\varphi} = \frac{1}{n} \mathbf{\Phi} \mathbf{1}\). These undesirable distortions propagate to the trait, which we confirmed in simulations (not shown). It is not clear how these distortions can be corrected for after centering the trait as shown above.
Note that the intercept version we chose instead does not induce this genotype centering, which prevents the undesirable distortions in the trait covariance.