A Brief Introduction to the spm Package

1 Introduction

The package ‘spm’ is short for ‘spatial predictive modelling, spatial predictive model(s), or spatial predictive method(s)’. It aims to introduce some novel, accurate, hybrid geostatistical and machine learning methods for spatial predictive modelling. It contains functions for a few geostatistical and machine learning methods as well as their hybrid methods. Further methods will be included in the future according to their predictive accuracy and new developments in spatial predictive modelling field.

For each method, two functions are provided. Two further functions are provided to derive averaged variable importance and relative variable influence respectievly. They all use data.frame as input data. Moreover, two functions are provided for accuracy assessment, and a further function is for converting error measures (e.g., rmse) to accuracy measure (i.e., vecv). These functions attempt to simplify and streamline the model evaluation and model application processes, which may assist users to apply these methods to their data to improve modelling efficiency as well as predictive accuracy.

2 Spatial predictive methods

The methods currently included in the package are:

two commonly used geostatistical methods: inverse distance weighting (idw) and ordinary kriging (ok);
two highly accurate machine learning methods: random forest (rf) in library(randomForest) and general boosting model (gbm) in library(gbm);
four hybrid methods: rfidw, rfok, gbmidw and gbmok;
two averaging methods: the average of rfidw and rfok (rfokrfidw), and the average of gbmidw and gbmok (gbmokgbmidw); and
For random forest method, all relevant methods above are also implemented using ranger (rg) in library(ranger), which lead to four methods: rg, rgidw, rgok and rgokrgidw. These methods are fast version of their corersponding rf methods.

3 Functions

3.1 Averaged variable importance and relative variable influence

A function, avi, is developed to derive an averaged variable importance (avi) based on random forest. And a function, rvi, to a derive relative variable influence (rvi) based on generalized boosted regression modeling.

3.2 Accuracy assessment

3.2.1 Cross validation

Of the two functions for each method, one is for assessing the predictive errors and accuracy of the method based on cross-validation. These functions are:

idwcv,
okcv,
rfcv,
rgcv
gbmcv,
rfidwcv,
rfokcv,
rgidwcv,
rgokcv,
gbmidwcv,
gbmokcv,
rfokrfidwcv,
rgokrgidwcv, and
gbmokgbmidwcv.

The procedure for these functions can be summarised in Table 1.

Table 1: A procedure for predictive accuracy assessment

Step	Description
1.1	Resample the training data into k sub-datasets for k-fold validation
1.2	for Each training and corresponding validating sub-dataset, do
1.3	Develop a predictive model based on the training sub-dataset
1.4	Validate the model using the validating sub-dataset and record the resultant predictions
1.5	Repeat above procedure for each sub-dataset
1.6	end
1.7	Calculate the predictive accuracy based on the training data and the resultant predictions

The accuracy produced from the above procedure may change with each run due to randomness associated with data resampling for cross validation. To stabilise the accuracy measure, the above steps need to be repeated n times; and the accuracy assessment function for each method provides an useful tool to realise this as demonstrated in section 4.3.

3.2.2 Accuracy and error measures

A unit, scale and data-variance independent accuracy measure, the vecv (i.e., variance explained by predictive models based on cross-validation) is implemented in function vecv for numerical data.

A few other error and accuracy measures are also available in function pred.acc for numerical data, including mean error (me), mean absolute error (mae), mean squared error (mse), relative me (rme), relative mae (rmae), root mse (rmse), relative rmse (rrmse), as well as the accuracy measure, vecv and Legates and McCabe’s E1 (e1).

pred.acc can also be sued to calculate correct classification rate (ccr), kappa (kappa), sensitivity (sens), specificity (spec) and true skill statistic (tss) for categorical data with the observed (obs) data specified as factor.

They are based on the differences between the predicted values for and the observed values of validation samples for cross-validation. For 0 and 1 data, the observed values need to be specified as factor in order to use accuracy measures for categorical data. Moreover, sens, spec, tss and rmse are for categorical data with two levels (e.g. presence and absence data).

3.2.3 Convering error measures to vecv

A function, tovecv, is developed to convert existing predictive error measures to vecv. For the definition of vecv, please see function vecv in library (spm). The error measures considered are mean square error (mse), root mse (rmse), relative rmse (rrmse), standardised rmse (srmse) and mean square reduced error (msre).

3.3 Generate spatial predictions

The other function for each method is for generating the spatial predictions using the method, including:

idwpred,
okpred,
rfpred,
rgpred,
gbmpred,
rfidwpred,
rfokpred,
rgidwpred,
rgokpred,
gbmidwpred,
gbmokpred,
rfokrfidwpred,
rgokrgidwpred and
gbmokgbmidwpred.

With certain parameters specified for a method, the cross-validation function (e.g., rfokcv) produces information on the predictive accuracy of the method, and the prediction function (e.g., rfokpred) produces spatial predictions. By using the same random seed, the predictive accuracy values produced for these methods are directly comparable.

4 Applications of these functions

Install spm using:

install.packages("spm", dependencies = c("Imports", "Suggests"))

4.1 Examples for idwcv and rfokcv

4.1.1 idwcv

The function idwcv is applied to seabed gravel content samples in the Petrel region to test the predictive accuracy of inverse distance squared, a commonly used spatial interpolation method.

library(spm)
data(petrel)
set.seed(1234)
idwcv1 <- idwcv(petrel[, c(1,2)], petrel[, 5], nmax = 12, idp = 2, predacc = "VEcv")
idwcv1
[1] 23.11333

4.1.2 rfokcv

The function rfokcv is applied to seabed gravel content samples in the Petrel region to test the predictive accuracy of RFOK.

library(spm)
data(petrel)
set.seed(1234) 
rfokcv1 <- rfokcv(petrel[, c(1,2)], petrel[, c(1,2, 6:9)], petrel[, 5], predacc = "VEcv")
rfokcv1
[1] 39.88995

This accuracy is comparable with that of idw above as their random seeds were the same, i.e., the same datasets were generated and used in the cross validation.

4.2 Select optimal parameters and predictors

4.2.1 Select optimal parameters and predictors

The cross-validation functions above enable users to find optimal parameters and/or predictors for these methods (e.g., nmax and idp for idw). Taking idw as an example as below.

data(petrel)
idp <- c((1:10)*0.2)
nmax <- c(10:20)
idwopt <- array(0,dim=c(length(idp),length(nmax)))
for (i in 1:length(idp)) {
for (j in 1:length(nmax))  {
set.seed(1234)
idwcv2.3 <- idwcv(petrel[, c(1,2)], petrel[, 5], nmax = nmax[j], idp = idp[i], predacc = "VEcv" )
idwopt[i, j] <- idwcv2.3
  }
} 
which (idwopt == max(idwopt), arr.ind = T )
>      row col
[1,]   3   3
idp[3]
> [1] 0.6
nmax[3]
> [1] 12

4.2.2 Predictive accuracy of optimised model

Predictive accuracy of optimised idw

library(spm)
data(petrel)
set.seed(1234)
idwcv1 <- idwcv(petrel[, c(1,2)], petrel[, 5], nmax = 12, idp = 0.6, predacc = "VEcv")
idwcv1
[1] 35.93557

As usual, such parameter and variable selection is time consuming if the range of relevant parameter and/or number of predictors are large, especially for gbm and gbm related methods. For the machine learning methods, it is recommended to use the existing variable selection methods to do preliminary selection and then use above procedure to do further selection if needed.

These selections may produce the most accurate predictive models, but the resultant predictions may not look real (i.e. containing artefacts) that may not satisfy the need of some clients. So the artefacts may need to be minimised by sacrificing a small amount of the accuracy to meet client’s expectation.

4.3 Stabilise the predictive accuracy produced by cross validation functions

Again taking idwcv as an example, the accuracy produced above changes with each run of idwcv, i.e., with random seed. To stabilise the accuracy, we repeatedly run idwcv many times to produce an averaged predictive accuracy.

n <- 100 # number of iterations, 60 to 100 is recommended.
measures <- NULL
for (i in 1:n) { 
 idwcv1 <- idwcv(petrel [, c(1,2)], petrel [, 5], nmax = 12, idp = 0.6, predacc = "ALL")
 measures <- rbind(measures, idwcv1$vecv)
 }
mean(measures)
[1] 33.69691

This procedure can be applied to all other cross validation functions in spm to produced stablised predictive accuracy of relevant methods.

4.4 Examples for idwpred and rfokpred

4.4.1 idwpred

Seabed gravel content samples in the Petrel region, northern Australia marine margin are used to make spatial predictions using idw by running idwpred.

library(spm)
data(petrel)
data(petrel.grid)
idwpred1 <- idwpred(petrel[, c(1,2)], petrel[, 5], petrel.grid, nmax = 12, idp = 0.6)
names(idwpred1)
[1] "LON"       "LAT"       "var1.pred" "var1.var" 
idwpred1 <- (idwpred1)[, -4] # remove the 4th column as it contains no information.
class(idwpred1)
[1] "data.frame"
names(idwpred1) <- c("longitude", "latitude", "gravel")
head(idwpred1)
        longitude latitude  gravel
470277  128.8022  -10.60239 22.00789
470278  128.8047  -10.60239 22.00805
470279  128.8072  -10.60239 22.00822
470280  128.8097  -10.60239 22.00838
470281  128.8122  -10.60239 22.00855
470282  128.8147  -10.60239 22.00873

Since grid file (e.g. petrel.grid) can be in WGS84 or UTM, the location info of the grid file were renamed to “LON” and “LAT”, and the variable to be predicted may vary with studies so was named to “var1”, in the prediction function. Users need to rename the prediction file as shown above.

4.4.2 rfokpred

Seabed gravel content samples in the Petrel sub-basin, northern Australia marine margin are used to make spatial predictions using rfok by running rfokpred.

set.seed(1234)
library(spm)
data(petrel)
data(petrel.grid)
data(petrel)
data(petrel.grid)
rfokpred1 <- rfokpred(petrel[, c(1,2)], petrel[, c(1,2, 6:9)], petrel[, 5], 
 petrel.grid[, c(1,2)], petrel.grid, ntree = 500, nmax = 11, vgm.args = ("Log"))
class(rfokpred1)
[1] "data.frame"
names(rfokpred1)

[1] "LON"         "LAT"         "Predictions" "Variances"

As shown above, the format of the output predictions by the functions for generating spatial predictions is in data.frame and can be exported to relevant format as desired, such as csv or raster, for further use. Please note that the variances are now added back to the output, but cautions need to be taken when use them because they are independent of the predictions.

Acknowledgements

This spm package and its associated documents are published with the permission of the CEO, Geoscience Australia.

References

Li, J., Potter, A., Huang, Z., Daniell, J. J. and Heap, A. 2010. Predicting Seabed Mud Content across the Australian Margin: Comparison of Statistical and Mathematical Techniques Using a Simulation Experiment. Geoscience Australia, Record 2010 /11, 146pp.

Li, J., Heap, A.D., Potter, A., Huang, Z., Daniell, J., 2011. Can we improve the spatial predictions of seabed sediments? A case study of spatial interpolation of mud content across the southwest Australian margin. Continental Shelf Research 31 1365-1376.

Li, J., Heap, A.D., Potter, A., Daniell, J., 2011. Application of machine learning methods to spatial interpolation of environmental variables. Environmental Modelling & Software 26 1647-1659.

Li, J., 2011. Novel spatial interpolation methods for environmental properties: using point samples of mud content as an example. The Survey Statistician: The Newsletter of the International Association of Survey Statisticians No. 63 15-16.

Li, J., Potter, A., Huang, Z. and Heap, A. 2012. Predicting Seabed Sand Content across the Australian Margin Using Machine Learning and Geostatistical Methods. Geoscience Australia, Record 2012/48, 115pp.

Li, J., 2016. Assessing spatial predictive models in the environmental sciences: accuracy measures, data variation and variance explained. Environmental Modelling & Software 80 1-8.

Li, J., 2017. Assessing the accuracy of predictive models for numerical data: Not r nor r2, why not? Then what? PLOS ONE 12 (8): e0183250.