Helper Functions for Bayesian Kernel Machine Regression

In environmental health studies, the relationship between multiple environmental exposures and health outcomes, such as cognitive development, is often investigated using complex datasets that exhibit non-normality. This paper introduces an R package designed to simulate multivariate environmental exposure data and estimate feature selection thresholds to aid in research to apply and extend the Bayesian Kernel Machine Regression (BKMR) methodology. The package enables researchers to generate data from multivariate normal and multivariate skewed Gamma distributions, common in environmental exposure research, and calculate multivariate statistical features such as mean, variance, skewness, shape, and rate vectors, and correlation matrices. This package also facilitates the calculation of a Posterior Inclusion Probability (PIP) threshold for feature selection in BKMR, offering an approach that accounts for non-normal data (based on the forthcoming work of Hasan, Odom, Bursac, Lucchini, and Ibrahimou). The effectiveness of the package is demonstrated through a real-world application using data from an adolescent environmental exposure study, where metal exposures are related to IQ scores.

Keywords

R package, Bayesian Kernel Machine Regression, multivariate simulation, feature selection, environmental exposure, statistical moments, multivariate Gamma distribution, skewed data

Introduction

Environmental exposure to metals such as Cadmium, Mercury, Arsenic, and Lead is linked to various adverse health outcomes, especially among children (Horton et al. 2013). In particular, these exposures may influence cognitive functions, with effects that often vary based on the concentration of metals in the environment (Sanders, Claus Henn, and Wright 2015; Trasande and Liu 2011; Tyler and Allan 2014; Tolins, Ruchirawat, and Landrigan 2014). Research into these relationships frequently involves complex, multivariate datasets where environmental factors are correlated, and data distributions often exhibit skewness rather than normality (Bobb et al. 2014, 2018; Hasan et al. 2024).

The Bayesian Kernel Machine Regression (BKMR) model is a powerful method for identifying relevant environmental factors and interactions (Bobb et al. 2014, 2018; Wilson et al. 2022; Devick et al. 2022; Gerwinn 2010; Ibrahimou et al. 2024). However, its effectiveness is typically constrained by its assumption of normality in the data, which is frequently violated in environmental health studies (Bobb et al. 2014, 2018; Hasan et al. 2024; Meulen EA 2021). To allow BKMR to flexibly handle multivariate skewed data, ongoing methodological innovation is needed, and synthetic data sets are a critical component of this research. To fill this need for realistic synthetic data sets, we have developed an R package that facilitates parametric bootstrap data simulation for multivariate environmental exposures. Further, this package includes computational results to estimate a Posterior Inclusion Probability (PIP) threshold which enables feature selection for BKMR via a hypothesis testing framework with controlled test size, which is particularly useful when data deviate from multivariate normality (Hasan et al. 2024). In this chapter, we show how to estimate key statistical moments from a real data set which includes effects of metals exposures on children’s IQ scores. Then, to facilitate future methodological innovation, we show how to use these moments as parameter estimates to simulate multiple synthetic environmental exposure data sets from multivariate normal and multivariate (skewed) Gamma distributions. Also, for the real data analysis, we show how to apply our threshold equation to use BKMR results for feature selection.

Materials and Methods

Overview of the R Package

The package aims to provide a robust framework for simulating non-normal environmental exposure data and estimating feature selection thresholds in the context of Bayesian Kernel Machine Regression (BKMR). The core functions of the package include:

A dynamic threshold function was derived from the logistic model to adjust thresholds based on CV and sample size.

The package includes data transformation functions to prepare data for analysis:

• Scaling by Standard Deviation or MAD: The trans_ratio() function scales data by either the standard deviation (SD) or the median absolute deviation (MAD).

• Logarithmic and Root Transformations: The trans_log() function applies a logarithmic transformation to the data, and the trans_root() function allows for fractional root transformations, helping to manage skewed data.

Software Implementation

The simBKMRdata software is implemented as an R package. It depends on the following packages: bkmr(Bobb 2022), fields( et al. 2021), base(2024) and MASS(Venables and Ripley 2002). The R software and these required packages can be obtained from the CRAN website. Furthermore, daily builds of package simBKMRdata are provided on the CRAN website […..]. It has been published under GPL version 3. The source code is available on GitHub at [https://github.com/khasa006/simBKMRdata].

Installation

# install.packages("simBKMRdata")

# devtools::install_github("simBKMRdata")

Example Usage

Data Exploration

library(simBKMRdata)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load the dataset
data("metalExposChildren_df")


analysisData_df <- metalExposChildren_df %>%
  select(QI, Cadmium:Manganese, Sex)

head(analysisData_df)

# A tibble: 6 × 7
     QI Cadmium Mercury Arsenic  Lead Manganese Sex   
  <dbl>   <dbl>   <dbl>   <dbl> <dbl>     <dbl> <fct> 
1   123   0.363   0.025    4.41  6.33      440. Female
2   123   1.01    4.12     6.21 10.1       120. Female
3   104   0.278   0.62    17.0  13.5       176. Female
4   135   0.273   0.529    2.69  2.94      110. Male  
5    94   0.363   1.03    24.4   6.09      630. Male  
6   119   0.443   0.025    5.68  8.22      170. Female

# Create a histogram with density plot for each metal
variablePlot <- analysisData_df %>%
  select(Cadmium:Manganese) %>%
  pivot_longer(cols = everything(), names_to = "Metal", values_to = "Value") %>%
  ggplot(aes(x = Value, fill = Metal)) + 
  geom_histogram(aes(y = ..density..), bins = 30, alpha = 0.5, position = "identity") + 
  geom_density(aes(color = Metal), linewidth = 1) + 
  facet_wrap(~Metal, scales = "free") + 
  theme_minimal() + 
  labs(title = "Histogram with Density Plot for Metals", 
       x = "Concentration", 
       y = "Density") + 
  theme(legend.position = "none")

Parameter Estimation

param_list <- calculate_stats_gamma(
  data_df = analysisData_df, 
  group_col= "Sex", 
  using = "MoM"
)

param_list

$Female
$Female$sampSize
[1] 242

$Female$mean_vec
         QI     Cadmium     Mercury     Arsenic        Lead   Manganese 
101.8925620   0.5542361   0.7013119  24.5664563  10.4392500 545.5697791 

$Female$sampCorr_mat
                   QI     Cadmium     Mercury    Arsenic        Lead
QI         1.00000000 -0.09419925 -0.06251526 -0.1139683 -0.01904167
Cadmium   -0.09419925  1.00000000  0.58836338  0.3536415  0.47882840
Mercury   -0.06251526  0.58836338  1.00000000  0.5350280  0.42408005
Arsenic   -0.11396827  0.35364145  0.53502800  1.0000000  0.44714397
Lead      -0.01904167  0.47882840  0.42408005  0.4471440  1.00000000
Manganese -0.07923072 -0.22917974  0.02052380  0.2872210 -0.21157349
            Manganese
QI        -0.07923072
Cadmium   -0.22917974
Mercury    0.02052380
Arsenic    0.28722102
Lead      -0.21157349
Manganese  1.00000000

$Female$shape_num
        QI    Cadmium    Mercury    Arsenic       Lead  Manganese 
44.7062637  0.3428091  0.2415481  0.8969605  1.0328424  2.3044289 

$Female$rate_num
         QI     Cadmium     Mercury     Arsenic        Lead   Manganese 
0.438758854 0.618525338 0.344423247 0.036511597 0.098938372 0.004223894 


$Male
$Male$sampSize
[1] 186

$Male$mean_vec
         QI     Cadmium     Mercury     Arsenic        Lead   Manganese 
101.1075269   0.5059338   0.7228842  25.6696247  11.0417906 511.1826055 

$Male$sampCorr_mat
                    QI     Cadmium      Mercury      Arsenic       Lead
QI         1.000000000 -0.08158903 -0.001218806 -0.003652175  0.1694279
Cadmium   -0.081589032  1.00000000  0.639037283  0.407157209  0.3503445
Mercury   -0.001218806  0.63903728  1.000000000  0.570412246  0.4754083
Arsenic   -0.003652175  0.40715721  0.570412246  1.000000000  0.4758518
Lead       0.169427922  0.35034449  0.475408267  0.475851786  1.0000000
Manganese -0.132040534 -0.12340908  0.080655296  0.330912517 -0.1507620
           Manganese
QI        -0.1320405
Cadmium   -0.1234091
Mercury    0.0806553
Arsenic    0.3309125
Lead      -0.1507620
Manganese  1.0000000

$Male$shape_num
        QI    Cadmium    Mercury    Arsenic       Lead  Manganese 
46.1001452  0.7570466  0.2816617  0.1743466  0.5440889  1.6162489 

$Male$rate_num
         QI     Cadmium     Mercury     Arsenic        Lead   Manganese 
0.455951665 1.496335196 0.389635949 0.006791941 0.049275420 0.003161784

Data Simulation

gamma_samples <- simulate_group_data(
  param_list = param_list, 
  data_gen_fn = generate_mvGamma_data, 
  group_col_name = "Sex"
)

head(gamma_samples)

         V1         V2          V3        V4        V5        V6    Sex
1  90.66508 0.01162794 0.029554317 16.726579  6.667533 1724.8307 Female
2  83.18012 0.35959451 2.023619801 69.408463 15.037361  581.3868 Female
3  85.86602 0.94853231 0.075715896 39.637177 15.328454  266.7006 Female
4 122.18359 0.57090429 0.021902046 60.259552 29.351564  714.1228 Female
5 122.54090 1.89519271 6.206970673 30.056895  4.418394  917.8859 Female
6  98.98082 0.16055286 0.005175967  9.759447  6.520575  218.6911 Female

BKMR Analysis

library(bkmr)

For guided examples, go to 'https://jenfb.github.io/bkmr/overview.html'

set.seed(2025)

bkmrAnalysisData_df <- analysisData_df %>%
  na.omit() %>%
  mutate_at(
    vars(Cadmium:Manganese), ~ log10(. + 1) %>% as.vector
  )


# Extract the response variable (IQ) and the exposure matrix (metals)
iq <- bkmrAnalysisData_df %>% 
  pull(QI) %>% 
  na.omit()

expos <- bkmrAnalysisData_df %>%
  select(Cadmium:Manganese) %>% 
  data.matrix()

# Generate knot points using a cover design for Bayesian modeling
knots50 <- fields::cover.design(expos, nd = 50)$design

# Fit the BKMR model using MCMC
modelFit <- kmbayes(
  y = iq,         # Response variable
  Z = expos,      # Exposure matrix (metal concentrations)
  X = NULL,       # No additional covariates
  iter = 2000,    # Number of MCMC iterations
  family = "gaussian",  # Gaussian response
  verbose = TRUE,       # Output progress for each iteration
  varsel = TRUE,       # Perform variable selection
  knots = knots50      # Knot points generated earlier
)

Iteration: 200 (10% completed; 1.8815 secs elapsed)

Acceptance rates for Metropolis-Hastings algorithm:
              param      rate
1            lambda 0.5678392
2 r/delta (overall) 0.4221106
3 r/delta  (move 1) 0.3960396
4 r/delta  (move 2) 0.4489796

Iteration: 400 (20% completed; 3.95538 secs elapsed)

Acceptance rates for Metropolis-Hastings algorithm:
              param      rate
1            lambda 0.5789474
2 r/delta (overall) 0.3809524
3 r/delta  (move 1) 0.3793103
4 r/delta  (move 2) 0.3826531

Iteration: 600 (30% completed; 5.83481 secs elapsed)

Acceptance rates for Metropolis-Hastings algorithm:
              param      rate
1            lambda 0.5959933
2 r/delta (overall) 0.3839733
3 r/delta  (move 1) 0.4095238
4 r/delta  (move 2) 0.3556338

Iteration: 800 (40% completed; 7.62318 secs elapsed)

Acceptance rates for Metropolis-Hastings algorithm:
              param      rate
1            lambda 0.5757196
2 r/delta (overall) 0.3904881
3 r/delta  (move 1) 0.4272517
4 r/delta  (move 2) 0.3469945

Iteration: 1000 (50% completed; 9.46968 secs elapsed)

Acceptance rates for Metropolis-Hastings algorithm:
              param      rate
1            lambda 0.5675676
2 r/delta (overall) 0.3713714
3 r/delta  (move 1) 0.3914657
4 r/delta  (move 2) 0.3478261

Iteration: 1200 (60% completed; 11.27485 secs elapsed)

Acceptance rates for Metropolis-Hastings algorithm:
              param      rate
1            lambda 0.5562969
2 r/delta (overall) 0.3661384
3 r/delta  (move 1) 0.3930818
4 r/delta  (move 2) 0.3357016

Iteration: 1400 (70% completed; 13.07163 secs elapsed)

Acceptance rates for Metropolis-Hastings algorithm:
              param      rate
1            lambda 0.5632595
2 r/delta (overall) 0.3624017
3 r/delta  (move 1) 0.3835979
4 r/delta  (move 2) 0.3374806

Iteration: 1600 (80% completed; 14.89255 secs elapsed)

Acceptance rates for Metropolis-Hastings algorithm:
              param      rate
1            lambda 0.5672295
2 r/delta (overall) 0.3589744
3 r/delta  (move 1) 0.3779343
4 r/delta  (move 2) 0.3373494

Iteration: 1800 (90% completed; 16.71744 secs elapsed)

Acceptance rates for Metropolis-Hastings algorithm:
              param      rate
1            lambda 0.5669817
2 r/delta (overall) 0.3613118
3 r/delta  (move 1) 0.3793103
4 r/delta  (move 2) 0.3408551

Iteration: 2000 (100% completed; 18.54461 secs elapsed)

Acceptance rates for Metropolis-Hastings algorithm:
              param      rate
1            lambda 0.5662831
2 r/delta (overall) 0.3521761
3 r/delta  (move 1) 0.3783019
4 r/delta  (move 2) 0.3226837

# Extract posterior inclusion probabilities (PIPs) and sort them
pipFit <- ExtractPIPs(modelFit) %>% arrange(desc(PIP))

Dynamic Threshold Calculation

pipThresh_num <- calculate_pip_threshold(y = bkmrAnalysisData_df$QI)

Result

Exploratory Data Analysis

We began by examining the real-world environmental exposure dataset derived from the cohort study led by Dr. Lucchini and colleagues. The dataset comprises measurements of five key metal concentrations — Cadmium, Mercury, Arsenic, Lead, and Manganese—in children, along with corresponding intelligence quotient (IQ) scores. Visualization of the marginal distributions of these metal concentrations using histograms and kernel density plots (Figure 1) revealed marked right-skewness across all analytes, suggesting the need for distribution-sensitive modeling approaches.

Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(density)` instead.

Warning: Removed 115 rows containing non-finite outside the scale range
(`stat_bin()`).

Warning: Removed 115 rows containing non-finite outside the scale range
(`stat_density()`).

Parameter Estimation from Real Data

To capture the empirical distributional features of the observed data, we implemented the calculate_stats_gamma() function to estimate parameters for Gamma distributions within male and female subgroups. This method-of-moments approach yielded group-specific descriptors, including sample sizes, means, Gamma shape and rate parameters, and Spearman correlation matrices among the metals. These parameters served as foundational elements for generating realistic synthetic data that preserve sex-specific distribution and correlation structures of the environmental exposures.

Simulation of Multivariate Gamma Data

With the estimated parameters, we used the simulate_group_data() function to generate synthetic multivariate Gamma-distributed exposure data. This function internally called generate_mvGamma_data() and produced group-labeled datasets that preserved the distributional properties of the original data. The simulated data maintained inter-variable correlations and group-specific skewness patterns, effectively replicating real-world exposure profiles.

Bayesian Kernel Machine Regression (BKMR) Analysis

To assess the relationship between metal exposures and cognitive outcomes, we employed Bayesian Kernel Machine Regression (BKMR) using the kmbayes() function from the bkmr package. Prior to analysis, we omit the missing values and the metal concentration variables were log-transformed using the trans_log() function to mitigate skewness. The exposure matrix thus comprised log-transformed values of the five metals, with IQ as the continuous outcome variable.

The BKMR model was fit with 2,000 MCMC iterations, allowing for variable selection via computation of Posterior Inclusion Probabilities (PIPs). Extraction and ranking of PIPs, conducted using the ExtractPIPs() function, revealed that Cadmium and Lead were among the most influential predictors of IQ variability, as indicated by elevated PIP values (Table 1).

Posterior Inclusion Probabilities (PIPs)
variable	PIP
Cadmium	0.944
Lead	0.586
Manganese	0.248
Mercury	0.231
Arsenic	0.093

Notably, Cadmium and Lead emerged as likely key contributors to cognitive development outcomes. However, determining which PIP values were statistically meaningful required the use of a calibrated, data-driven threshold.

To address the challenge of determining meaningful PIP thresholds under skewed data conditions, we applied the calculate_pip_threshold() function. This function dynamically estimates a threshold based on the dataset’s coefficient of variation (CV) and sample size using a Four-Parameter Logistic Regression model.

For the current dataset, the computed threshold value was 0.120809. This threshold adapts to sample characteristics and adjusts the selection criterion accordingly, making it especially suitable for non-normal data distributions.

By comparing each metal’s PIP against the calculated threshold, we identified a refined subset of significant exposures, which included not only cadmium and lead but also manganese and mercury. This approach allows for a statistically principled method of feature selection within the BKMR framework, ensuring robust and interpretable results even when the data deviate from normality.

Discussion

The R package we developed is a valuable tool for simulating multivariate environmental exposure data and estimating feature selection thresholds for Bayesian Kernel Machine Regression (BKMR). By supporting multivariate normal and skewed Gamma distributions, the package is particularly well-suited for handling non-normal environmental exposure data, which is common in health studies.

While BKMR is an excellent tool for feature selection, its reliance on normality can limit its application to real-world datasets. Our package fills this gap by enabling researchers to model and analyze non-normally distributed data, ensuring that feature selection remains robust even in complex scenarios.

Conclusion

This R package provides essential tools for simulating multivariate environmental exposure data, estimating statistical moments, and calculating PIP thresholds for feature selection in Bayesian Kernel Machine Regression (BKMR). The ability to simulate data from multivariate normal and skewed Gamma distributions makes the package particularly valuable for environmental health studies where data often deviate from normality.

Bobb, Jennifer F. 2022. “Bkmr: Bayesian Kernel Machine Regression.” https://CRAN.R-project.org/package=bkmr.

Bobb, Jennifer F., Birgit Claus Henn, Linda Valeri, and Brent A. Coull. 2018. “Statistical Software for Analyzing the Health Effects of Multiple Concurrent Exposures via Bayesian Kernel Machine Regression.” Environmental Health 17 (1). https://doi.org/10.1186/s12940-018-0413-y.

Bobb, Jennifer F., Linda Valeri, Birgit Claus Henn, David C. Christiani, Robert O. Wright, Maitreyi Mazumdar, John J. Godleski, and Brent A. Coull. 2014. “Bayesian Kernel Machine Regression for Estimating the Health Effects of Multi-Pollutant Mixtures.” Biostatistics 16 (3): 493–508. https://doi.org/10.1093/biostatistics/kxu058.

Devick, Katrina L., Jennifer F. Bobb, Maitreyi Mazumdar, Birgit Claus Henn, David C. Bellinger, David C. Christiani, Robert O. Wright, Paige L. Williams, Brent A. Coull, and Linda Valeri. 2022. “Bayesian Kernel Machine Regression-Causal Mediation Analysis.” Statistics in Medicine 41 (5): 860–76. https://doi.org/10.1002/sim.9255.

Douglas Nychka, Reinhard Furrer, John Paige, and Stephan Sain. 2021. “Fields: Tools for Spatial Data.” https://github.com/dnychka/fieldsRPackage.

Gerwinn, Sebastian. 2010. “Bayesian Inference for Generalized Linear Models for Spiking Neurons.” Frontiers in Computational Neuroscience 4. https://doi.org/10.3389/fncom.2010.00012.

Hasan, Kazi Tanvir, Gabriel Odom, Zoran Bursac, and Boubakari Ibrahimou. 2024. “The Sensitivity of Bayesian Kernel Machine Regression (BKMR) to Data Distribution: A Comprehensive Simulation Analysis.” https://doi.org/10.48550/ARXIV.2411.00286.

Horton, Lindsey M., Mary E. Mortensen, Yulia Iossifova, Marlena M. Wald, and Paula Burgess. 2013. “What Do We Know of Childhood Exposures to Metals (Arsenic, Cadmium, Lead, and Mercury) in Emerging Market Countries?” International Journal of Pediatrics 2013: 1–13. https://doi.org/10.1155/2013/872596.

Ibrahimou, Boubakari, Kazi Tanvir Hasan, Shelbie Burchfield, Hamisu Salihu, Yiliang Zhu, Getachew Dagne, Mario De La Rosa, Assefa Melesse, Roberto Lucchini, and Zoran Bursac. 2024. “Assessing the Risk of Heart Attack: A Bayesian Kernel Machine Regression Analysis of Heavy Metal Mixtures.” http://dx.doi.org/10.21203/rs.3.rs-4456611/v1.

Meulen EA, van der Meulen PJ ( van der, Raymond K. 2021. “Consistent Confidence Limits, P Values, and Power of the Non-Conservative, Size – α Modified Fisher Exact Test.” Journal of Biostatistics and Biometric Applications 6 (1): 102. https://www.annexpublishers.com/articles/JBIA/6102-Consistent-Confidence-Limits-P-Values.pdf.

R Core Team. 2024. “R: A Language and Environment for Statistical Computing.” https://www.R-project.org/.

RICHARDS, F. J. 1959. “A Flexible Growth Function for Empirical Use.” Journal of Experimental Botany 10 (2): 290–301. https://doi.org/10.1093/jxb/10.2.290.

Sanders, Alison P., Birgit Claus Henn, and Robert O. Wright. 2015. “Perinatal and Childhood Exposure to Cadmium, Manganese, and Metal Mixtures and Effects on Cognition and Behavior: A Review of Recent Literature.” Current Environmental Health Reports 2 (3): 284–94. https://doi.org/10.1007/s40572-015-0058-8.

Tolins, Molly, Mathuros Ruchirawat, and Philip Landrigan. 2014. “The Developmental Neurotoxicity of Arsenic: Cognitive and Behavioral Consequences of Early Life Exposure.” Annals of Global Health 80 (4): 303. https://doi.org/10.1016/j.aogh.2014.09.005.

Trasande, Leonardo, and Yinghua Liu. 2011. “Reducing The Staggering Costs Of Environmental Disease In Children, Estimated At $76.6 Billion In 2008.” Health Affairs 30 (5): 863–70. https://doi.org/10.1377/hlthaff.2010.1239.

Tyler, Christina R., and Andrea M. Allan. 2014. “The Effects of Arsenic Exposure on Neurological and Cognitive Dysfunction in Human and Rodent Studies: A Review.” Current Environmental Health Reports 1 (2): 132–47. https://doi.org/10.1007/s40572-014-0012-1.

Venables, W. N., and B. D. Ripley. 2002. “Modern Applied Statistics with s.” https://www.stats.ox.ac.uk/pub/MASS4/.

Wilson, Ander, Hsiao-Hsien Leon Hsu, Yueh-Hsiu Mathilda Chiu, Robert O. Wright, Rosalind J. Wright, and Brent A. Coull. 2022. “Kernel Machine and Distributed Lag Models for Assessing Windows of Susceptibility to Environmental Mixtures in Children’s Health Studies.” The Annals of Applied Statistics 16 (2). https://doi.org/10.1214/21-aoas1533.

Abstract