# install.packages("simBKMRdata")
Simulating Multivariate Environmental Exposure Data and Estimating Feature Selection Thresholds
2025-04-07
In environmental health studies, the relationship between multiple environmental exposures and health outcomes, such as cognitive development, is often investigated using complex datasets that exhibit non-normality. This paper introduces an R package designed to simulate multivariate environmental exposure data and estimate feature selection thresholds to aid in research to apply and extend the Bayesian Kernel Machine Regression (BKMR) methodology. The package enables researchers to generate data from multivariate normal and multivariate skewed Gamma distributions, common in environmental exposure research, and calculate multivariate statistical features such as mean, variance, skewness, shape, and rate vectors, and correlation matrices. This package also facilitates the calculation of a Posterior Inclusion Probability (PIP) threshold for feature selection in BKMR, offering an approach that accounts for non-normal data (based on the forthcoming work of Hasan, Odom, Bursac, Lucchini, and Ibrahimou). The effectiveness of the package is demonstrated through a real-world application using data from an adolescent environmental exposure study, where metal exposures are related to IQ scores.
R package, Bayesian Kernel Machine Regression, multivariate simulation, feature selection, environmental exposure, statistical moments, multivariate Gamma distribution, skewed data
Environmental exposure to metals such as Cadmium, Mercury, Arsenic, and Lead is linked to various adverse health outcomes, especially among children (Horton et al. 2013). In particular, these exposures may influence cognitive functions, with effects that often vary based on the concentration of metals in the environment (Sanders, Claus Henn, and Wright 2015; Trasande and Liu 2011; Tyler and Allan 2014; Tolins, Ruchirawat, and Landrigan 2014). Research into these relationships frequently involves complex, multivariate datasets where environmental factors are correlated, and data distributions often exhibit skewness rather than normality (Bobb et al. 2014, 2018; Hasan et al. 2024).
The Bayesian Kernel Machine Regression (BKMR) model is a powerful method for identifying relevant environmental factors and interactions (Bobb et al. 2014, 2018; Wilson et al. 2022; Devick et al. 2022; Gerwinn 2010; Ibrahimou et al. 2024). However, its effectiveness is typically constrained by its assumption of normality in the data, which is frequently violated in environmental health studies (Bobb et al. 2014, 2018; Hasan et al. 2024; Meulen EA 2021). To allow BKMR to flexibly handle multivariate skewed data, ongoing methodological innovation is needed, and synthetic data sets are a critical component of this research. To fill this need for realistic synthetic data sets, we have developed an R package that facilitates parametric bootstrap data simulation for multivariate environmental exposures. Further, this package includes computational results to estimate a Posterior Inclusion Probability (PIP) threshold which enables feature selection for BKMR via a hypothesis testing framework with controlled test size, which is particularly useful when data deviate from multivariate normality (Hasan et al. 2024). In this chapter, we show how to estimate key statistical moments from a real data set which includes effects of metals exposures on children’s IQ scores. Then, to facilitate future methodological innovation, we show how to use these moments as parameter estimates to simulate multiple synthetic environmental exposure data sets from multivariate normal and multivariate (skewed) Gamma distributions. Also, for the real data analysis, we show how to apply our threshold equation to use BKMR results for feature selection.
The package aims to provide a robust framework for simulating non-normal environmental exposure data and estimating feature selection thresholds in the context of Bayesian Kernel Machine Regression (BKMR). The core functions of the package include:
simulate_group_data()
: Generates data using multivariate normal or Gamma distributions.Statistical Parameters Estimation: The function calculate_stats_gaussian()
calculates key statistical moments, including mean, variance, skewness, and correlation matrix, which are essential for understanding the underlying structure of the normally distributed data. The function calculate_stats_gamma()
computes the sample size, gamma distribution parameters (shape and rate), and Spearman correlation matrix for each group, based on the grouping column.
Feature Selection Threshold Calculation: The PIP threshold for feature selection is computed using the calculate_pip_threshold()
function. This function estimates the threshold based on a Four-Parameter Logistic Regression model (Richard’s curve), which adjusts for variations in data and sample size(RICHARDS 1959). The formula used to calculate the PIP threshold is:
Where:
• y: the 95th percentile of the PIP values PIPq95
• A: Left asymptote, fixed at 0.
• K: Right asymptote, bounded above by 1.
• C: Constant.
• β1, β2: Midpoint shift parameters for CV and sample size.
• x1: Log-transformed CV log2(CV).
• x2: Log-transformed sample size log10(Sample Size).
A dynamic threshold function was derived from the logistic model to adjust thresholds based on CV and sample size.
The package includes data transformation functions to prepare data for analysis:
• Scaling by Standard Deviation or MAD: The trans_ratio()
function scales data by either the standard deviation (SD) or the median absolute deviation (MAD).
• Logarithmic and Root Transformations: The trans_log()
function applies a logarithmic transformation to the data, and the trans_root()
function allows for fractional root transformations, helping to manage skewed data.
The simBKMRdata software is implemented as an R package. It depends on the following packages: bkmr(Bobb 2022), fields( et al. 2021), base(2024) and MASS(Venables and Ripley 2002). The R software and these required packages can be obtained from the CRAN website. Furthermore, daily builds of package simBKMRdata are provided on the CRAN website […..]. It has been published under GPL version 3. The source code is available on GitHub at [https://github.com/khasa006/simBKMRdata].
To install the package from CRAN, use:
# install.packages("simBKMRdata")
For GitHub installation, use:
# devtools::install_github("simBKMRdata")
library(simBKMRdata)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the dataset
data("metalExposChildren_df")
<- metalExposChildren_df %>%
analysisData_df select(QI, Cadmium:Manganese, Sex)
head(analysisData_df)
# A tibble: 6 × 7
QI Cadmium Mercury Arsenic Lead Manganese Sex
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 123 0.363 0.025 4.41 6.33 440. Female
2 123 1.01 4.12 6.21 10.1 120. Female
3 104 0.278 0.62 17.0 13.5 176. Female
4 135 0.273 0.529 2.69 2.94 110. Male
5 94 0.363 1.03 24.4 6.09 630. Male
6 119 0.443 0.025 5.68 8.22 170. Female
# Create a histogram with density plot for each metal
<- analysisData_df %>%
variablePlot select(Cadmium:Manganese) %>%
pivot_longer(cols = everything(), names_to = "Metal", values_to = "Value") %>%
ggplot(aes(x = Value, fill = Metal)) +
geom_histogram(aes(y = ..density..), bins = 30, alpha = 0.5, position = "identity") +
geom_density(aes(color = Metal), linewidth = 1) +
facet_wrap(~Metal, scales = "free") +
theme_minimal() +
labs(title = "Histogram with Density Plot for Metals",
x = "Concentration",
y = "Density") +
theme(legend.position = "none")
<- calculate_stats_gamma(
param_list data_df = analysisData_df,
group_col= "Sex",
using = "MoM"
)
param_list
$Female
$Female$sampSize
[1] 242
$Female$mean_vec
QI Cadmium Mercury Arsenic Lead Manganese
101.8925620 0.5542361 0.7013119 24.5664563 10.4392500 545.5697791
$Female$sampCorr_mat
QI Cadmium Mercury Arsenic Lead
QI 1.00000000 -0.09419925 -0.06251526 -0.1139683 -0.01904167
Cadmium -0.09419925 1.00000000 0.58836338 0.3536415 0.47882840
Mercury -0.06251526 0.58836338 1.00000000 0.5350280 0.42408005
Arsenic -0.11396827 0.35364145 0.53502800 1.0000000 0.44714397
Lead -0.01904167 0.47882840 0.42408005 0.4471440 1.00000000
Manganese -0.07923072 -0.22917974 0.02052380 0.2872210 -0.21157349
Manganese
QI -0.07923072
Cadmium -0.22917974
Mercury 0.02052380
Arsenic 0.28722102
Lead -0.21157349
Manganese 1.00000000
$Female$shape_num
QI Cadmium Mercury Arsenic Lead Manganese
44.7062637 0.3428091 0.2415481 0.8969605 1.0328424 2.3044289
$Female$rate_num
QI Cadmium Mercury Arsenic Lead Manganese
0.438758854 0.618525338 0.344423247 0.036511597 0.098938372 0.004223894
$Male
$Male$sampSize
[1] 186
$Male$mean_vec
QI Cadmium Mercury Arsenic Lead Manganese
101.1075269 0.5059338 0.7228842 25.6696247 11.0417906 511.1826055
$Male$sampCorr_mat
QI Cadmium Mercury Arsenic Lead
QI 1.000000000 -0.08158903 -0.001218806 -0.003652175 0.1694279
Cadmium -0.081589032 1.00000000 0.639037283 0.407157209 0.3503445
Mercury -0.001218806 0.63903728 1.000000000 0.570412246 0.4754083
Arsenic -0.003652175 0.40715721 0.570412246 1.000000000 0.4758518
Lead 0.169427922 0.35034449 0.475408267 0.475851786 1.0000000
Manganese -0.132040534 -0.12340908 0.080655296 0.330912517 -0.1507620
Manganese
QI -0.1320405
Cadmium -0.1234091
Mercury 0.0806553
Arsenic 0.3309125
Lead -0.1507620
Manganese 1.0000000
$Male$shape_num
QI Cadmium Mercury Arsenic Lead Manganese
46.1001452 0.7570466 0.2816617 0.1743466 0.5440889 1.6162489
$Male$rate_num
QI Cadmium Mercury Arsenic Lead Manganese
0.455951665 1.496335196 0.389635949 0.006791941 0.049275420 0.003161784
<- simulate_group_data(
gamma_samples param_list = param_list,
data_gen_fn = generate_mvGamma_data,
group_col_name = "Sex"
)
head(gamma_samples)
V1 V2 V3 V4 V5 V6 Sex
1 90.66508 0.01162794 0.029554317 16.726579 6.667533 1724.8307 Female
2 83.18012 0.35959451 2.023619801 69.408463 15.037361 581.3868 Female
3 85.86602 0.94853231 0.075715896 39.637177 15.328454 266.7006 Female
4 122.18359 0.57090429 0.021902046 60.259552 29.351564 714.1228 Female
5 122.54090 1.89519271 6.206970673 30.056895 4.418394 917.8859 Female
6 98.98082 0.16055286 0.005175967 9.759447 6.520575 218.6911 Female
library(bkmr)
For guided examples, go to 'https://jenfb.github.io/bkmr/overview.html'
set.seed(2025)
<- analysisData_df %>%
bkmrAnalysisData_df na.omit() %>%
mutate_at(
vars(Cadmium:Manganese), ~ log10(. + 1) %>% as.vector
)
# Extract the response variable (IQ) and the exposure matrix (metals)
<- bkmrAnalysisData_df %>%
iq pull(QI) %>%
na.omit()
<- bkmrAnalysisData_df %>%
expos select(Cadmium:Manganese) %>%
data.matrix()
# Generate knot points using a cover design for Bayesian modeling
<- fields::cover.design(expos, nd = 50)$design
knots50
# Fit the BKMR model using MCMC
<- kmbayes(
modelFit y = iq, # Response variable
Z = expos, # Exposure matrix (metal concentrations)
X = NULL, # No additional covariates
iter = 2000, # Number of MCMC iterations
family = "gaussian", # Gaussian response
verbose = TRUE, # Output progress for each iteration
varsel = TRUE, # Perform variable selection
knots = knots50 # Knot points generated earlier
)
Iteration: 200 (10% completed; 1.8815 secs elapsed)
Acceptance rates for Metropolis-Hastings algorithm:
param rate
1 lambda 0.5678392
2 r/delta (overall) 0.4221106
3 r/delta (move 1) 0.3960396
4 r/delta (move 2) 0.4489796
Iteration: 400 (20% completed; 3.95538 secs elapsed)
Acceptance rates for Metropolis-Hastings algorithm:
param rate
1 lambda 0.5789474
2 r/delta (overall) 0.3809524
3 r/delta (move 1) 0.3793103
4 r/delta (move 2) 0.3826531
Iteration: 600 (30% completed; 5.83481 secs elapsed)
Acceptance rates for Metropolis-Hastings algorithm:
param rate
1 lambda 0.5959933
2 r/delta (overall) 0.3839733
3 r/delta (move 1) 0.4095238
4 r/delta (move 2) 0.3556338
Iteration: 800 (40% completed; 7.62318 secs elapsed)
Acceptance rates for Metropolis-Hastings algorithm:
param rate
1 lambda 0.5757196
2 r/delta (overall) 0.3904881
3 r/delta (move 1) 0.4272517
4 r/delta (move 2) 0.3469945
Iteration: 1000 (50% completed; 9.46968 secs elapsed)
Acceptance rates for Metropolis-Hastings algorithm:
param rate
1 lambda 0.5675676
2 r/delta (overall) 0.3713714
3 r/delta (move 1) 0.3914657
4 r/delta (move 2) 0.3478261
Iteration: 1200 (60% completed; 11.27485 secs elapsed)
Acceptance rates for Metropolis-Hastings algorithm:
param rate
1 lambda 0.5562969
2 r/delta (overall) 0.3661384
3 r/delta (move 1) 0.3930818
4 r/delta (move 2) 0.3357016
Iteration: 1400 (70% completed; 13.07163 secs elapsed)
Acceptance rates for Metropolis-Hastings algorithm:
param rate
1 lambda 0.5632595
2 r/delta (overall) 0.3624017
3 r/delta (move 1) 0.3835979
4 r/delta (move 2) 0.3374806
Iteration: 1600 (80% completed; 14.89255 secs elapsed)
Acceptance rates for Metropolis-Hastings algorithm:
param rate
1 lambda 0.5672295
2 r/delta (overall) 0.3589744
3 r/delta (move 1) 0.3779343
4 r/delta (move 2) 0.3373494
Iteration: 1800 (90% completed; 16.71744 secs elapsed)
Acceptance rates for Metropolis-Hastings algorithm:
param rate
1 lambda 0.5669817
2 r/delta (overall) 0.3613118
3 r/delta (move 1) 0.3793103
4 r/delta (move 2) 0.3408551
Iteration: 2000 (100% completed; 18.54461 secs elapsed)
Acceptance rates for Metropolis-Hastings algorithm:
param rate
1 lambda 0.5662831
2 r/delta (overall) 0.3521761
3 r/delta (move 1) 0.3783019
4 r/delta (move 2) 0.3226837
# Extract posterior inclusion probabilities (PIPs) and sort them
<- ExtractPIPs(modelFit) %>% arrange(desc(PIP)) pipFit
<- calculate_pip_threshold(y = bkmrAnalysisData_df$QI) pipThresh_num
We began by examining the real-world environmental exposure dataset derived from the cohort study led by Dr. Lucchini and colleagues. The dataset comprises measurements of five key metal concentrations — Cadmium, Mercury, Arsenic, Lead, and Manganese—in children, along with corresponding intelligence quotient (IQ) scores. Visualization of the marginal distributions of these metal concentrations using histograms and kernel density plots (Figure 1) revealed marked right-skewness across all analytes, suggesting the need for distribution-sensitive modeling approaches.
Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.
Warning: Removed 115 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 115 rows containing non-finite outside the scale range
(`stat_density()`).
To capture the empirical distributional features of the observed data, we implemented the calculate_stats_gamma()
function to estimate parameters for Gamma distributions within male and female subgroups. This method-of-moments approach yielded group-specific descriptors, including sample sizes, means, Gamma shape and rate parameters, and Spearman correlation matrices among the metals. These parameters served as foundational elements for generating realistic synthetic data that preserve sex-specific distribution and correlation structures of the environmental exposures.
With the estimated parameters, we used the simulate_group_data()
function to generate synthetic multivariate Gamma-distributed exposure data. This function internally called generate_mvGamma_data()
and produced group-labeled datasets that preserved the distributional properties of the original data. The simulated data maintained inter-variable correlations and group-specific skewness patterns, effectively replicating real-world exposure profiles.
To assess the relationship between metal exposures and cognitive outcomes, we employed Bayesian Kernel Machine Regression (BKMR) using the kmbayes() function from the bkmr package. Prior to analysis, we omit the missing values and the metal concentration variables were log-transformed using the trans_log()
function to mitigate skewness. The exposure matrix thus comprised log-transformed values of the five metals, with IQ as the continuous outcome variable.
The BKMR model was fit with 2,000 MCMC iterations, allowing for variable selection via computation of Posterior Inclusion Probabilities (PIPs). Extraction and ranking of PIPs, conducted using the ExtractPIPs()
function, revealed that Cadmium and Lead were among the most influential predictors of IQ variability, as indicated by elevated PIP values (Table 1).
Table 1. Posterior Inclusion Probabilities (PIPs) for Metal Exposures
Posterior Inclusion Probabilities (PIPs) | |
---|---|
variable | PIP |
Cadmium | 0.944 |
Lead | 0.586 |
Manganese | 0.248 |
Mercury | 0.231 |
Arsenic | 0.093 |
Notably, Cadmium and Lead emerged as likely key contributors to cognitive development outcomes. However, determining which PIP values were statistically meaningful required the use of a calibrated, data-driven threshold.
To address the challenge of determining meaningful PIP thresholds under skewed data conditions, we applied the calculate_pip_threshold()
function. This function dynamically estimates a threshold based on the dataset’s coefficient of variation (CV) and sample size using a Four-Parameter Logistic Regression model.
For the current dataset, the computed threshold value was 0.120809. This threshold adapts to sample characteristics and adjusts the selection criterion accordingly, making it especially suitable for non-normal data distributions.
By comparing each metal’s PIP against the calculated threshold, we identified a refined subset of significant exposures, which included not only cadmium and lead but also manganese and mercury. This approach allows for a statistically principled method of feature selection within the BKMR framework, ensuring robust and interpretable results even when the data deviate from normality.
The R package we developed is a valuable tool for simulating multivariate environmental exposure data and estimating feature selection thresholds for Bayesian Kernel Machine Regression (BKMR). By supporting multivariate normal and skewed Gamma distributions, the package is particularly well-suited for handling non-normal environmental exposure data, which is common in health studies.
While BKMR is an excellent tool for feature selection, its reliance on normality can limit its application to real-world datasets. Our package fills this gap by enabling researchers to model and analyze non-normally distributed data, ensuring that feature selection remains robust even in complex scenarios.
This R package provides essential tools for simulating multivariate environmental exposure data, estimating statistical moments, and calculating PIP thresholds for feature selection in Bayesian Kernel Machine Regression (BKMR). The ability to simulate data from multivariate normal and skewed Gamma distributions makes the package particularly valuable for environmental health studies where data often deviate from normality.