Tutorial for GWAS with SLOPE

How to do GWAS?

This tutorial will guide you on how to perform GWAS with SLOPE. Analysis consists of three simple steps.

Reading data

You need to provide paths to three files:

.fam file with information about observations including phenotype. By default this file is assumed to have six column, with last one containing phenotype. For details see documentation of function readPhenotype
.map file with mapping information about snps. This file is not required for subsequent analysis, but is highly recommended. Note that lack of mapping information will result in less informative plots and results summary.
.raw file with snps. We assume that snps were previously exported from PLINK with command
plink --file input --recodeAD --out output, where input is you name of .ped file.

library(geneSLOPE)
famFile <- system.file("extdata", "plinkPhenotypeExample.fam", package = "geneSLOPE")
mapFile <- system.file("extdata", "plinkMapExample.map", package = "geneSLOPE")
snpsFile <- system.file("extdata", "plinkDataExample.raw", package = "geneSLOPE")

phenotype <- read_phenotype(filename = famFile)

When you have phenotype you can move to reading snp data. Depending on data size reading SNPs may long time. As data is very large, snps are filtered with their marginal test p-value. All snps which p-values are larger than threshold \(pValMax\) will be truncated. For details on how to choose \(pValMax\) see How changing parameters affects my analysis?

screening.result <- screen_snps(snpsFile, mapFile, phenotype, pValMax = 0.05, 
                      chunkSize = 1e2, verbose=FALSE)

Parameter verbose=FALSE suppresses progress bar. Default value is TRUE.

User look into result of reading and screening dataset

summary(screening.result)

## Object of class screeningResult
## $X: data matrix
##   90  observations
##   52  snps
## 1000  SNPs were screened
## 52  snps had p-value smaller than  0.05  in marginal test

When data is successfully read, one can move to the second step of analysis.

Clumping highly correlated genes

Next step is clumping. Highly correlated snps will be clustered. For details see How this procedure works exactly? \(rho\) controls number and size of clumps. For details see How changing parameters affects my analysis?

clumping.result <- clump_snps(screening.result, rho = 0.3, verbose = FALSE)

What is the result of clumping procedure?

summary(clumping.result)

## Object of class clumpingResult
## 52  SNPs grouped in  41  clumps
## Mean clump size  1.268293 
## Min clump size  1 
## Max clump size  4

We can also plot our results

plot(clumping.result)

If we are interested in specific chromosome we can “zoom it”

plot(clumping.result, chromosomeNumber = 1)

It is possible to identify interactively clump number that contains SNP of interest. The procedure is the following. First plot the whole genome, then run function and click on SNP of interest.

plot(clumping.result)
identify_clump(clumping.result)

Knowing clump number one can zoom into it.

plot(clumping.result, clumpNumber = 1)

Running SLOPE on result of clumping procedure

Last step of analysis is using SLOPE

slope.result <- select_snps(clumping.result, fdr=0.1)

## Warning in select_snps(clumping.result, fdr = 0.1): All lambdas are equal. SLOPE does not guarantee
##             False Discovery Rate control

As before one can plot and summarize results

summary(slope.result)

## Object of class selectionResult
## 2 snps selected out of 41 clump representatives
## Effect size for selected snps (absolute values)
##  Min:  3.640963 
##  Mean:  3.768299 
##  Max:  3.895635 
## R square of the final model:  0.9756304 
## Kink value:  1

plot(slope.result)

Like with result of clumping, it is possible to identify interactively clump number which contains specific SNP selected by SLOPE. The procedure is the following. First plot the whole genome, then run function and click on SNP of interest.

plot(slope.result)
identify_clump(slope.result)

When clump is identified one can zoom into it

plot(slope.result, clumpNumber = 1)

It is easy to get information about selected SNPs. To get indices of columns in original SNP matrix they refer to use

slope.result$selectedSnpsNumbers

##  rs2719295_T rs17546815_T 
##          222          573

If .map file was given, then one can get more information about SNPs

slope.result$X_info[slope.result$selectedSnpsNumbers,]

##     chromosome         rs genetic_distance_(morgans)
## 222          8  rs2719295                    34.2919
## 573         11 rs17546815                   113.7420
##     base_pair_position_(bp_units)
## 222                      34291873
## 573                     113741770

For information about SNPs that are part of specific clump use

summary(slope.result, clumpNumber = 1)

## Summary of 1 selected clump
##     chromosome         rs genetic_distance_(morgans)
## 222          8  rs2719295                    34.2919
## 377          2 rs11124642                    38.6580
## 598          2  rs4672803                   217.2340
## 906          6   rs325120                   147.8600
##     base_pair_position_(bp_units)
## 222                      34291873
## 377                      38657967
## 598                     217233745
## 906                     147860161

How changing parameters affects my analysis?

There are three numerical parameters that influence result

\(pValMax\) is the threshold p-value for marginal test. When data is loaded to R, initial screening of snps is performed. For every snp, test for slope coefficient in simple linear regression model \(lm(phenotype \sim snp)\) is performed. All snps with p-value larger than pValMax are discarded. Setting this parameter to too large will significantly increase number of snps on which clumping procedure will be performed. This may cause two technical threats
- Computer might run out of RAM memory
- Clumping procedure might take a lot of time
\(rho\) is threshold for correlation between snps in clumping procedure. For given snp, every that is correlated with it at least at level \(rho\) will be clumped together. Setting this parameter too high (say 0.7) will cause SLOPE to work on highly correlated snps which might affect FDR control
\(fdr\) false discover rate (FDR) for SLOPE procedure. The higher \(fdr\) is, the more variables will be accepted to the model. Contrary, small \(fdr\) yields more conservative models

How this procedure works exactly?

Clumping Procedure for SLOPE (CPS)

Input: \(rho \in (0, 1)\);

for each SNPs calculate p-value for simple linear regression test, i.e. after assuming linear regression model with a single explanatory variable and testing if slope parameter is nonzero. Vector created from p-values gives hierarchy which is used in next steps;
select index, Idx, corresponding to the smallest p-value, create group of all SNPs correlated with SNPs Idx at least on level Rho (Pearson correlation);
define this group as clump and Idx as representative of clump. Exclude entire clump from remained SNPs;
repeat two previous steps until each SNP is assigned to some clump.