What are the inter-relationships among predictor geochemical variables to the target variable silver using linear regression modeling in R?

Abstract

Question The presence of a geochemical element in the mineral exploration world can direct or distract an exploration geologist. Polymetallic mineral deposits are comprised of various geochemical elements, including silver, gold, arsenic, antimony, lead, and zinc, many of which are in high demand to support the expanding green revolution across the globe. What are the inter-relationships among predictor geochemical variables to the target variable silver?

Location A few years ago, in a remote location in southern Chihuahua, Mexico, a vast desert comprised of shrub-brush, the occasional Mesquite tree, and a variety of cacti hiding a large multi-metal mineral resource (add photo). Today this particular area is home to a large polymetallic mineral resource made up of silver with gold, arsenic, antimony, lead, and zinc.

Method A variety of statistical methods can be deployed in building predictive models, including both linear regression and non-linear regression models, to name a few. The main objective of ordinary least squares linear regression is ” to find the plane that minimizes the sum-of-squared-errors (SSE) between the observed (actual) and predicted response. What are the inter-relationships among geochemical predictor variables to the target variable silver using the linear regression method in caret-package in R?

Results Stay tuned

Conclusions Stay tuned

1. Introduction

Predictive modeling of exploration targets

Mineral exploration and mining development is essential to the world’s population. Choosing geologically favorable areas is becoming increasingly difficult, with more and more mineral deposits occurring under recent sedimentary cover in blind environments. Exploration geologists must leverage the ability to build predictive models to find minerals, including developing useful statistical models.

What is predictive modeling?

Carranza (2009) defined predictive modeling as “making descriptions, representations or predictions about an indirectly observable and complex real-world system via quantitative or qualitative analysis of relevant data.” (Carranza, 2009, p.4). He further suggests that “some quantity of data associated directly with the target variable must be available in order to create and validate a predictive model.” (Carranza, 2009, p.4).

Approaches to predictive modeling

Several different approaches can be taken, including generalization in a set of observations or data where patterns in a data set are examined (Carranza, 2009). Confirming particular deductions by testing the generalization on other test data sets where the result is not known (Bonham-Carter, 1994). A simple linear regression best-fit model of the predictor variable as a function of the target variable. Quantitative Empirical Modeling is where data of the target variable are divided into a training set and a testing set, and based on the training set, relationships between the target-variable and predictor variables are quantified and then used for prediction. An assessment of goodness of fit to data in a training set and its predictive ability against data in a testing set (Carranza, 2009).

Statistical methods used in building predictions

The use of classification and regression models is not new to science and many other domains Ayres, 2007. The R-language R-Development Core Team 2008 has developed many modeling functions for both classification and regression, however the R caret-package has stream-lined the building and predicting model functions, simplified model tuning parameters, and provided the ability to extend the model to parallel processes Kuhn, 2008. Both the stats package in R and the caret-package in R contain the functionality useful in the early project stages (e.g., data splitting and pre-processing) as well as unsupervised feature selection methods to tune models using resampling that helps in evaluating the model. The caret package depends on 25 other packages, some of which are loaded individually when a model is trained or predicted Kuhn, 2008.

The data set

A subset from a high-resolution geochemistry data set from drill core samples from a project in Mexico will be used in an area known for high concentrations of the target variable silver with or without possible predictor variables, including gold, arsenic, antimony, lead, and zinc.

Keywords: mapping mineral potential, statistical methods, predictive models, regression analysis, the R caret package.

2. Data

2.1 Data set provides a link to the data

The data set can be found at Geochemistry data set dropbox

The data set used in this analysis is from a real-world mineral deposit where a series of different geochemical elements are found in high quantities. The target variable herein is silver(Ag_ppm), and the predictor variables are gold (Au_ppm),arsenic(As_ppm), antimony(Sb_ppm), lead(Pb_pct) and zinc (Zn_pct). The data set covers an area measuring 2.0km by 1.0km by 1.2km in depth of high resolution (down hole core sample assays in every 2.0 meter intervals with minimal gaps) however after the data set was reduced in size, the samples will be randomly distributed across the area described above.

2.2 Data limitations

The data limitations in this data set include the determination of certain geochemical elements that do not provide an accurate analysis so these geochemical elements were not included in the study. There are data gaps where bedrock was not reached until a depth of 3-10 m below surface. These observations were not included in the study.

3. Data preparation

The original data file was comprised of hundreds of thousands of observations and many variables that were available for model training, tuning and evaluation. While training the model with all of the data might be ideal, it is unrealistic for the objectives of this study. The use of resampling (e.g., cross validation) will be included to evaluate the model performance by providing an average RMSE for a more accurate prediction. Ideally an outside test data set should be available so that the model performance can be assessed on data not used in the model training.

This file was reduced down to the target variable silver for values exceeding Ag ppm > 100 ppm and several predictor variables. The new data set has 501 observations and across nine variables that are known to spatially occur with silver. A new column was created for five Silver classes ranging from low to high silver values.

The data was slit into training and test sets. The test set will be used to evaluate performance (e.g. compare models) and the training set will be used all other runs. In this case 80% of the data will be used for model training and the remainder will be used for evaluating the model performance.

The model was tuned using resampling (hopefully if I can figure out my code error).

4. Building and tuning the model

Figure 1. Rough flow chart of basic steps - although not showing iterations as my code is not working right now.

The train function was used to select values (first 10 rows, last 12 rows), and to estimate model performance using resampling and can only accept numeric values.
The lm function stands for linear least squares 9in stats package) is the statistical model method that will be used.
The metric used to assess model performance is Root-Mean-Squared-Error or RMSE. The higher the error, the poorer the model performance.
trControl function - I am currently having trouble with my trControl - takes a list of control parameters as well as the number of iterations can be set here. According to Kuhn (2008), in order to obtain an accurate performance estimate in some cases more than 25 iterations is suggested.

Load libraries required

library(tidyverse) 
library(here) #* an easy way to find your file*
library(janitor) #* used for examining and cleaning dirty data*
library(skimr) #* creates compact flexible data summaries*
library(knitr) #* used for dynamic report generation in R*
library(mice) #* mutivariate chained equations*
library(reshape2) #* flexible reshaping data*
library(DataExplorer)
library(RColorBrewer) #* color brewer palettes*
library(tibble) #* simple data frame support*
library(ggplot2) #* create data visualizations using graphic grammar*
library(readr) #* to read text data csv*
library(lattice) #* Trellis graphics for R*
library(stats) #(* R stats package)
library(caret) #* classification and regression training*

Import the raw data to take a look at geochemical element correlations

geochemistry <- read_csv(here("data", "DSV_assays_ICP_20230428_AgEq_xyz_twoclean.csv"))
head(geochemistry)

## # A tibble: 6 × 29
##   Au_ppm Ag_ppm Al_pct As_ppm Ba_ppm Bi_ppm Ca_pct Cd_ppm Cr_ppm Cu_pct Fe_pct
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1  0.093   41.8   0.43    449    240      2   3.1    17.3      1 0.0013   1.93
## 2  0.065   19.1   0.36    363     70      1   3.67    2.6      2 0.0012   1.67
## 3  0.166   13.1   0.32    591     90      1   4.5     2.2      2 0.0006   2.09
## 4  0.13    26.5   0.31    551     40      1   4.33   15        1 0.0028   1.83
## 5  0.139   18.8   0.3     435     30      2   2.27   10.1      2 0.0018   1.83
## 6  0.403  213     0.36   1235     30      1   0.15  306        1 0.0275   4.98
## # ℹ 18 more variables: K_pct <dbl>, Mg_pct <dbl>, Mn_ppm <dbl>, Mo_ppm <dbl>,
## #   Na_pct <dbl>, Ni_ppm <dbl>, P_ppm <dbl>, Pb_pct <dbl>, S_pct <dbl>,
## #   Sb_ppm <dbl>, Sc_ppm <dbl>, Sr_ppm <dbl>, Th_ppm <dbl>, Ti_pct <dbl>,
## #   V_ppm <dbl>, W_ppm <dbl>, Zn_pct <dbl>, AgEq <dbl>

plot_correlation(na.omit(geochemistry), maxcat = 5L)

Figure 2. Correlation matrix showing individual geochemical elements and their association to other geochemical elements using a diverging scale from dark blue (a strong weak correlation) to dark red (strong positive correlation). Silver(Ag_ppm) has a fairly strong relationship to lead(Pb_pct), antimony(Sb_ppm), zinc(Zn_pct), and gold(Au_ppm).

Complete a univariate analysis using basic histograms and examine the frequency of distribution for all the variables for consideration in the analysis

plot_histogram(geochemistry)

Figure 3. Univariate distribution for potential predictive geochemical variables for consideration in the analysis. The distributions show that the frequency of certain variables and their values may not be useful in the analysis. The data needs to be further culled to ensure the analysis is easy to reproduce.

Cull the dataset for six predictor variables and add a column for Silver class (1 = high silver and 6 = low silver)

### Import data culled and use head method to see first few rows
geochemistry <- read_csv(here("data", "assaysICPsubset.csv"))
head(geochemistry)

## # A tibble: 6 × 9
##   Silver_class Ag_ppm Au_ppm As_ppm Sb_ppm Mn_ppm Pb_pct Zn_pct `Silver range`  
##   <chr>         <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <chr>           
## 1 Class6          213  0.403   1235     63    549  0.999  2.18  great100less250…
## 2 Class3          727  2.18     415    247   1990  3.95   0.164 greater500less1…
## 3 Class6          127  0.203   2170    241   8720  2.51   1.93  great100less250…
## 4 Class6          154  0.254   2210    167   2970  3.06   4.48  great100less250…
## 5 Class6          195  0.293   2780    217   3640  3.64   6.36  great100less250…
## 6 Class6          187  0.198    635    168    968  3.79   3.71  great100less250…

Table 1. The data set (first few rows) showing eight of the nine variables.

4. Methodology

4a. Feature engineering - model testing using in-sample-error on a simple test/train data set

Feature engineering is essential because a machine learning algorithm is only as accurate as the data we provide to it so creation of features relevant to the task (to define the relationship between silver and other variables using linear regression.

Standardization is a transformation that centers the data by removing the mean value of each feature and then scale it by dividing (non-constant) features by their standard deviation. After standardizing data the mean will be zero and the standard deviation one. This is done When the data has a Gaussian distribution in the input variables suited for the logistic regression statistical method.

The dataset is comprised of 501 observations and 9 variables

## Lets look at gold first#
geochemistry <- read_csv(here("data", "assaysICPsubset.csv")) #* Read in the data set*
model <- lm(Ag_ppm ~ Au_ppm, geochemistry[1:20, ]) #* a linear regression model fit on  the first 20 rows in the data set*

model

## 
## Call:
## lm(formula = Ag_ppm ~ Au_ppm, data = geochemistry[1:20, ])
## 
## Coefficients:
## (Intercept)       Au_ppm  
##      210.79        40.14

Predict in-sample-error using the predictor element gold(Au_pm)

predicted <- predict(model, geochemistry[1:20, ], type = "response") #* Make in-sample prediction using the predict function*

predicted

##        1        2        3        4        5        6        7        8 
## 226.9694 298.2924 218.9420 220.9890 222.5543 218.7413 213.1623 263.3734 
##        9       10       11       12       13       14       15       16 
## 264.3768 256.5502 260.5639 288.8603 309.1293 306.3198 252.5365 253.5399 
##       17       18       19       20 
## 242.1009 248.0813 257.3529 260.5639

Table 2. Shows the silver values predicted in the first 20 rows of the data set ranging between 190.7ppm Ag and 343.4ppm Ag.

Take a look at the frequency distribution across entire data set across all nine variables

library(ggplot2)

geochemistry <- read_csv(here("data", "assaysICPsubset.csv"))

model <- lm(Ag_ppm ~ ., geochemistry)

plot(geochemistry)

Figure 4. Frequency distribution scatter plots across all the variables in the data set; its clear that certain geochemical elements occur as frequently as the target variable (silver Ag_ppm) and have variable distributions by grade.

# *Predict on the full data set using ~. on all 9 variable geochemistry data set*

model <- lm(Ag_ppm ~ ., geochemistry)

predicted <- predict(model, geochemistry)

# *Compute errors: error*

error <- predicted - geochemistry[["Ag_ppm"]]

# *Calculate RMSE*

sqrt(mean(error ^ 2))

## [1] 51.99645

The Root Mean Square Error (RMSE) is close to 52ppm Ag. This RMSE was calculated using in-sample-validation and may NOT lead to a best-fit for model performance; and may lead to model over-fitting (overestimating the models accuracy on performance).

You must do out-of-sample-error measures

Out of sample error model testing is done with a real test-set where you don’t know the actual outcome; this provides key insight to ensure you choose best-fit models that continue to perform well regardless of the test data; pick models that perform well on NEW data.

Generate models that don’t over-fit the training data and generalize well - perform error metrics on NEW data

**4b. Feature engineering - model testing using out-of-sample-error on multiple test/train data sets

Split the data for training and testing to evaluate the effectiveness of the model. The data set needs to split into two sets (e.g. 70 % of the data for training and 30% (0.3) of the data for testing).

# *Fit a model to the geochemistry data*

geochemistry <- read_csv(here("data", "assaysICPsubset.csv"))

model <- lm(Ag_ppm ~ Au_ppm, geochemistry[1:20, ]) # *linear regression model fit on first 20 observations/rows in the data set*

Compare to RMSE on the last 12 rows of the data set

# *Predict out-of-sample on a new dataset*
predicted <- predict(
  model, geochemistry[21:32, ], type = "response"
) # * linear regression model fit on the last 12 observations/rows in the data set*

# *Evaluate error by root-mean-square-error of RMSE* by comparing predictions on the model to actual Ag_ppm values for the test set*

actual <- geochemistry[21:32, "Ag_ppm"] #* on the last 12 observations*
sqrt(mean((predicted - actual) ^ 2))

## [1] NA

The above could would not computer the RMSE. Try another way.

# *Compute errors: error manually*

error <- predicted - geochemistry[["Ag_ppm"]]

# *Calculate RMSE*

sqrt(mean(error ^ 2))

## [1] 239.4212

The result is an RMSE of 250 ppm Ag; a measure of the model’s average error in the last 12 rows in the data set; our model is off by 250 Ag_ppm on average. This RMSE is less than optimal for model performance

Now compare to in-sample RMSE on the full data set

#* Fit a model to a full data set*

model <- lm(Ag_ppm ~ ., geochemistry) #* linear regression lm(); fit Ag_ppm on full data set*

#* Predict in-sample on the full data set and evaluate error on models performance*
predicted2 <- predict(
  model, geochemistry, type = "response"
)

Evaluate error by root-mean-square-error (RMSE) by comparing predictions on the model to actual Ag_ppm values for test set

# Evaluate using RMSE; *compute errors: error manually*
error2 <- predicted2 - geochemistry[["Ag_ppm"]]

# *Calculate RMSE*
sqrt(mean(error ^ 2))

## [1] 239.4212

The out-of-sample RMSE of 250 Ag_ppm above suggests our model done on the full data set is worse than the earlier models run on partial data; if we had believed the in-sample-error of 52 Ag_ppm we would have incorrectly suggested that our model is better than it really is

Now run the model on multiple train-test-split subsets

#* set random seed; seed set to 42 ensures you get the same random split each time script is run*
set.seed(42) #* set seed to be more than 25 as suggested by Kuhn (2008)

#* Read in the data set*
geochemistry <- read_csv(here("data", "assaysICPsubset.csv"))

#* Use the sample() function to shuffle row indices: rows*
rows <- sample(nrow(geochemistry))

#* Create random vector of row indices called rows to randomly reorder data*
shuffled_geochemistry <- geochemistry[rows,  ]

Try an 80/20 split; first 80% into a training set; last 20% into a test set

#* choose a split point at 80% into your dataset*
split <- round(nrow(shuffled_geochemistry) * 0.80)

#* Use this point to break off the first 80 % of the data set as a training set*
shuffled_geochemistry[1:split,  ]

## # A tibble: 401 × 9
##    Silver_class Ag_ppm Au_ppm As_ppm Sb_ppm Mn_ppm Pb_pct Zn_pct `Silver range` 
##    <chr>         <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <chr>          
##  1 Class6          149  0.435   1020    147   1650  2.52    1.97 great100less25…
##  2 Class6          175  0.368   1240    137    792  3.58    2.82 great100less25…
##  3 Class6          120  0.083    241     34   2560  2.4     5.72 great100less25…
##  4 Class3          604  0.088    373   1705   2680  5.31    2.87 greater500less…
##  5 Class6          122  0.393   1120    147    823  1.82    1.72 great100less25…
##  6 Class6          109  0.183   2270    132   1330  2.16    3.04 great100less25…
##  7 Class4          361  1.44   10000    410   1770  7.08    6.11 great250less50…
##  8 Class6          184  0.966   4260    143    313  2.59    1.07 great100less25…
##  9 Class6          136  0.078     12      1   2210  0.295  10.4  great100less25…
## 10 Class6          177  0.931   1680    188   1600  2.67    2.84 great100less25…
## # ℹ 391 more rows

Table 3. Shuffled data set for 80% (401 observations) showing the first 8 of 9 variables in the data set.

#* Determine row  index to split on: split*
split <- round(nrow(geochemistry) * 0.80)

#* Create  a training-set composed of 401 observations using that index*
training_data <- geochemistry[1:split,  ]

#* Create  a test-set composed of 100 observations using that index*
testing_data <- geochemistry[(split + 1):nrow(geochemistry), ]

Because I already randomly shuffled the geochemistry data set it was easy to split off a random test-set. Shuffled_geochemistry has a total of 501 observations with 9 variables; training-set has 401 observations (80% of total obs) and the test-set has 100 observations each with 9 variables.

Predict on a single train-test-split

Now use the lm() function on the randomly split training-set and test-set to fit a model to the training-set, rather than the entire data set; use the formula interface to the linear regression function to fit a model with the specified target variable (Ag_ppm) using all other variables in the data set as predictors.

#* Fit a model to the training data set* 
mod <- lm(Ag_ppm ~ ., training_data) #* the formula is Ag_ppm ~ silver class + Au_ppm, As_ppm, Sb_ppm, Mn_ppm etc.)

Use predict() function to make predictions from that model on new data; ensure new data has all columns as training_data (any order; with different values). Predict now on the testing_set not used to train the model for an out-of-sample-error for the model

#* Make predictions on the model above on new data*
predict1 <- predict(mod, testing_data)

Calculate predictions on the test set

#* Predict on testing_data using predict()*
p <- predict(mod, testing_data)

Calculate testing_data RMSE by hand

Use the predictions on the test set above to calculate an error metric (e.g., RMSE) to see how the model performs out-of-sample rather than in-sample. To do this, calculate the errors between the predicted Ag_ppm and the actual Ag_ppm by subtracting prediction Ag_ppm value from actual Ag_ppm value.

#* Create an error vector and compute errors; extract actual value from the testing_set*
error <- p - testing_data[["Ag_ppm"]]

#* Calculate RMSE using the error vector, mean and square root*
sqrt(mean(error ^ 2))

## [1] 57.86758

This is an error close to the RMSE error on the simple train/test set but its still high suggesting the model over-fit the training set. Move from predicting using single train-test-split to predicting using cross validation (CV) through on multiple train-test-split sets and average the RMSE giving a more precise out-of-sample RMSE. This method will avoid the negative affects of outliers on the RMSE in the data set

Fitting a single model is cheaper than using cross-validation (CV) model (10 training models and 1 final model) more costly due to more time but more accurate; split the data into ten folds or train/test splits; every single point in data set occurs exactly once; assign each row to its single test-set randomly to avoid systematic bias in the data; best way to assess out-of-sample error in predictive models; each of the ten test sets are same size as training set but is composed of out-of sample predictions

Assign each row to test-set randomly to avoid biases; after you know the average RMSE you refit the model on the FULL training data set so as to fully exploit info in that data set.

#*Lets fit a cross validated model to the geochemistry data set*
#*Set the random seed for reproducibility*
set.seed(42)

I ran into a coding challenge here when I launched the cross validation multiple test-train-split steps.

4c. Visualize the accuracy of the analysis

Figure x1 and x2. x1 Linear regression scatter plot observed vs predicted values for the geochemistry test-set; x2 and predicted vs residual scatter plot.

Figure x. Scatterplots of the transformed predictors along with a regression line - using the transformed predictors is it safe to assume that the relationship between the predictors and the outcome is linear? I suspect some predictor variables will be linear and others more random distribution.

Figure y. A scree plot? Are there significant between predictor correlations? Do I need to do a PCA in the full transformed data set to answer this question? Between predictor correlations expected for Au/As, As/Sb, Ag/Pb, Pb/Zn etc.

Figure z. Between-predictor variable correlations of the transformed continuous silver (Ag_ppm) predictors? Are there significant between-predictor correlations? I suspect there will be many strong positive correlations that might create problems in developing some models like linear regression so consideration of pre-processing steps to mitigate.

Anticipated results

The relationship between certain predictor variables and the outcome is linear (e.g. best-fit Ag/Pb, Ag/Au (more random), Ag/Zn (even more random)).

Results

a description of the research conducted and the results obtained. Results presented as tables, large data sets, figures, which can include graphs, videos, diagrams and photographs. Include additional supporting data.

Discussion

Analysis and interpretation of the data that integrates new information with prior findings, states the implications of the work, and generates new hypotheses to be tested.

Future Work Stay tuned

References

Ayres, I., Ayres-Brown, A. R., & Ayres-Brown, H. J. (2007). Seeing significance: Is the 95% probability range easier to perceive? CHANCE, 20(1), 11–16

Bonham-Carter, G. F. (1994). Geographic information systems for geoscientists: Modelling with GIS. Pergamon Press, p. 180.

Carranza, E.J.M. (2009). Geochemical anomaly and mineral prospectivity mapping in GIS. Handbook of Exploration and Environmental Geochemistry 11, M. Hale, series editor, Elsevier.

Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5)

Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling

The R-language R-Development Core Team 2008