How To Create a Mask File

library(smer)

The Sparse Marginal Epistasis test (SME) requires a HDF5 file containing binary masking data. The effectiveness of the test relies on how informative the mask is for a specific trait. For instance, to test for marginal epistasis related to transcriptional activity, DNase I hypersensitivity sites data can be used to condition the detection on accessible chromatin regions.

Sparse Marginal Epistasis test (SME) schematic DHS data Figure 1. The illustration shows open chromatin data. DNase I hypersensitive sites indicate transcriptional activity. The DNase-seq signal is converted to a binary mask, where genetic variants in inaccessible chromatin regions are excluded from the covariance matrix for marginal epistasis.

Mask File Format

The sme() function expects the mask data to be in an HDF5 file. The HDF5 format includes two primary object types:

Mask Format Requirements

The mask data should be organized into the following groups and datasets:

Groups:

The required group names can be configured as input parameters of sme(). The defaults are ld and gxg.

Datasets:

Important: All indices in the mask file must be zero-based and should correspond to the zero-based row indices of the PLINK .bim file. This includes the dataset index (<j> in gxg/<j>) and the data itself. This zero-based indexing is necessary because the mask data is read by a C++ subroutine in sme(), which uses zero-based indexing, unlike R’s one-based indexing for SNP indices in the function call.

Creating and Using Mask Files

The package provides utility functions to create, write, and read valid mask files for sme().

hdf5_file <- tempfile()

# Group names
gxg_h5_group <- "gxg"
ld_h5_group <- "ld"

# Data (still in 1-based R indexing)
include_gxg_snps <- 1:10
exclude_ld_snps <- 5:6

# Focal SNP (still in 1-based R indexing)
focal_snp <- 4

# Dataset names
dataset_name_pattern <- "%s/%s"
# 0-based index!
gxg_dataset <- sprintf(dataset_name_pattern, gxg_h5_group, focal_snp - 1)
ld_dataset <- sprintf(dataset_name_pattern, ld_h5_group, focal_snp - 1)

# Create an empty HDF5 file
create_hdf5_file(hdf5_file)

# Write LD data
write_hdf5_dataset(hdf5_file, ld_dataset, exclude_ld_snps - 1) # 0-based index!

# Write GXG data
write_hdf5_dataset(hdf5_file, gxg_dataset, include_gxg_snps - 1)

We can check that the data was written correctly.

ld_read <- read_hdf5_dataset(hdf5_file, ld_dataset)
gxg_read <- read_hdf5_dataset(hdf5_file, gxg_dataset)

print(sprintf("Zero-based indices of SNPs to exclude: %s", str(ld_read)))
#>  int [1:2] 4 5
#> character(0)
print(sprintf("Zero-based indices of SNPs to include: %s", str(gxg_read)))
#>  int [1:10] 0 1 2 3 4 5 6 7 8 9
#> character(0)

SessionInfo

sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sequoia 15.2
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Berlin
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] GenomicRanges_1.56.2 GenomeInfoDb_1.40.1  IRanges_2.38.1      
#> [4] S4Vectors_0.42.1     BiocGenerics_0.50.0  ggplot2_3.5.1       
#> [7] dplyr_1.1.4          smer_0.0.1          
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9              generics_0.1.3          tidyr_1.3.1            
#>  [4] digest_0.6.37           magrittr_2.0.3          evaluate_1.0.3         
#>  [7] grid_4.4.2              iterators_1.0.14        CompQuadForm_1.4.3     
#> [10] mvtnorm_1.3-3           fastmap_1.2.0           foreach_1.5.2          
#> [13] genio_1.1.2             jsonlite_1.8.9          backports_1.5.0        
#> [16] httr_1.4.7              purrr_1.0.2             UCSC.utils_1.0.0       
#> [19] scales_1.3.0            codetools_0.2-20        jquerylib_0.1.4        
#> [22] cli_3.6.3               rlang_1.1.4             XVector_0.44.0         
#> [25] munsell_0.5.1           withr_3.0.2             cachem_1.1.0           
#> [28] yaml_2.3.10             FMStable_0.1-4          tools_4.4.2            
#> [31] parallel_4.4.2          checkmate_2.3.2         colorspace_2.1-1       
#> [34] GenomeInfoDbData_1.2.12 vctrs_0.6.5             R6_2.5.1               
#> [37] lifecycle_1.0.4         zlibbioc_1.50.0         mvMAPIT_2.0.3          
#> [40] pkgconfig_2.0.3         pillar_1.10.1           bslib_0.8.0            
#> [43] gtable_0.3.6            glue_1.8.0              Rcpp_1.0.14            
#> [46] harmonicmeanp_3.0.1     xfun_0.50               tibble_3.2.1           
#> [49] tidyselect_1.2.1        knitr_1.49              farver_2.1.2           
#> [52] htmltools_0.5.8.1       rmarkdown_2.29          labeling_0.4.3         
#> [55] compiler_4.4.2