Generate phenotypes and covariates

Firstly, combine data from COVID-19 testing results, hospital inpatient data, and death records to generate multiple COVID-19 related phenotypes, using the makePhenotypes function. The risk.factor phenotype, extracts and formats several of the most well-known risk factors for COVID-19.

## Obtain phenotypes

pheno <- makePhenotypes(ukb.data = "ukb0000.tab", 
                        res.eng = "20210426_covid19_result_england.txt", 
                        res.wal = "20210426_covid19_result_wales.txt", 
                        res.sco = "20210426_covid19_result_scotland.txt", 
                        death.file = "20210408_death.txt",  
                        death.cause.file = "20210408_death_cause.txt", 
                        hesin.file = "20210502_hesin.txt", 
                        hesin_diag.file = "20210502_hesin_diag.txt", 
                        hesin_oper.file = "20210502_hesin_oper.txt", 
                        hesin_critical.file "20210222_hesin_critical.txt", 
                        code.file "coding19.tsv")

## Obtain covariates

covariates <- risk.factor(ukb.data = "ukb0000.tab", 
                        ABO.data = "covid19_misc.txt.gz", 
                        hesin.file =  "20210502_hesin.txt", 
                        res.eng = "20210426_covid19_result_england.txt", 
                        res.wal = "20210426_covid19_result_wales.txt", 
                        res.sco = "20210426_covid19_result_scotland.txt")

Running a GWAS

Quality control and files for GWAS

Now carry out sample and variant level QC, using data supplied by UK Biobank.
For sampleQC, ancestry must be specified as “WhiteBritish” or “all”.
For variantQC, MAF and INFO (imputation quality) filters can be specified, or defaults to MAF = 0.001 and INFO = 0.5.

sampleQC(ukb.data = "ukb0000.tab", 
         withdrawnFile = "w0000_20200820.csv", 
         ancestry="all", 
         software="SAIGE") 

variantQC(snpQC = "ukb_snp_qc.txt", 
          mfiDir = "qcDir", 
          mafFilt=0.001, 
          infoFilt=0.5) 

Sample QC outputs

Lists of IDs to include/exclude are outputted, with appropriate format.

If selected software is “plink”:

  • sampleExclude_plink.txt
  • sampleInclude_plink.txt

If selected software is “SAIGE”:

  • sampleExclude_plink.txt
  • sampleInclude_plink.txt
  • sampleExclude_SAIGE.txt
  • sampleInclude_SAIGE.txt

Variant QC outputs

  • snpIncludeSNPIDs_minMa0.001_minInfo0.8.txt - SNPs to include in GWAS (SNPIDs)
  • snpIncludeRSIDs_minMa0.001_minInfo0.8.txt - SNPs to include in GWAS (rsIDs)
  • snpsIncludeRSIDs_GRM.txt - Subset of SNPs to be used for generated Genetic Relatedness Matrix

Create phenotype files for input

The below commands generates a phenotype file that can be used to run a GWAS using SAIGE, of COVID +ve individuals vs COVID -ve.
By default, this function appends 20PCs of ancestry, and genotyping array to the phenotype file. User can also (optionally) specify additional covariates (eg as generated by the “risk.factor” function, above).

makeGWASFiles(ukb.data = "ukb0000.tab", 
              pheno = pheno, 
              covariates = covariates, 
              phe.name = "pos.neg", 
              cov.name = c("sex", "age"), 
              includeSampsFile = sampleInclude_SAIGE.txt,
              software = "SAIGE", 
              prefix = "phenotypes_SAIGE")

This outputs the file phenotypes_SAIGE.txt

Data filtering

Filtering of the genetic data requires:

For running SAIGE, we need to create:
1. a Plink format file, with SNPs to be used for calculating the genetic relatedness matrix.

Directly genotyped data comes in Plink format, and split by chromosome.


## list of Plink files
for CHR in {1..22};
do
  echo ukb_cal_chr${CHR}_v2.bed ukb_snp_chr${CHR}_v2.bim ukb0000.fam >> plinkList.txt
done


## merge chromosomes together
plink \
  --merge-list plinkList.txt \
  --make-bed \
  --out allChr_temp

## extract variants and European samples for GRM calculations
plink \
  --bfile cleaningTemp/allChr_temp \
  --keep sampleInclude_plink.txt \
  --extract snpsIncludeRSIDs_GRM.txt \
  --make-bed \
  --out grmSNPs

rm allChr_temp*

This creates binary Plink files: grmSNPs.bed/fam/bim.

  1. filtered bgen files for running the GWAS

Imputed data comes in bgen format, and split by chromosome. These files should be filtered prior to running to the GWAS, but be kept split by chromosome, so the GWAS can be run in parallel.


## Filter chromosome 1

CHR=1

qctool \
  -g ukb_imp_chr${CHR}_v3.bgen  \
  -s ukb0000_imp_chr${CHR}_v3_s487296.sample \
  -incl-snpids snpIncludeRSIDs_minMa0.001_minInfo0.8.txt \
  -incl-samples sampleInclude_SAIGE.txt \
  -og filtered_chr${CHR}.bgen \
  -os filtered_chr${CHR}.sample

Running a GWAS with SAIGE

We like to run SAIGE using a container image, with Singularity (https://sylabs.io/singularity/), and this is what is shown below.

SAIGE can be installed and run in many ways. See https://github.com/weizhouUMICH/SAIGE for details (including the download for the container (Docker) image).

SAIGE runs in two steps:

  1. Fitting the null model

Here we are fitting the null model, with COVID +ve individuals vs COVID -ve as the phenotype (“pos.neg”), and sex, age, 10PCs of ancestry, and genotyping array as covariates.


singularity run ./saige_0.36.3.2.sif step1_fitNULLGLMM.R     \
        --plinkFile=grmSNPs \
        --phenoFile=phenotypes_SAIGE.txt \
        --phenoCol=pos.neg \
        --covarColList=sex,age,array,pc1,pc2,pc3,pc4,pc5,pc6,pc7,pc8,pc9,pc10 \
        --sampleIDColinphenoFile=ID \
        --traitType=binary       \
        --outputPrefix=nullModel \
        --nThreads=4 \
        --LOCO=FALSE

This outputs the files nullModel.rda and nullModel.varianceRatio.txt to be used in step 2.

  1. association analysis

Using the output from step 1, run the association analysis.

Note - a little reformatting of the data is needed first: the sample file header needs to be removed, and the bgen files needs indexing with bgenix (bgenix is included in the SAIGE download).


# Remove header from sample file
sed -e '1,2d' filtered_chr${CHR}.sample > filtered_chr${CHR}_noHead.sample

# Index bgen file
bgenix -g filtered_chr${CHR}.bgen -index -clobber

# Run GWAS
singularity run ./saige_0.36.3.2.sif step2_SPAtests.R   \
        --bgenFile=filtered_chr${CHR}.bgen \
        --bgenFileIndex=filtered_chr${CHR}.bgen.bgi \
        --minMAF=0.001 \
        --sampleFile=filtered_chr${CHR}_noHead.sample \
        --GMMATmodelFile=nullModel.rda \
        --varianceRatioFile=nullModel.varianceRatio.txt \
        --SAIGEOutputFile=chr${CHR}_results.txt \
        --numLinesOutput=2 \
        --IsOutputAFinCaseCtrl=TRUE