Question The presence of a geochemical element in the mineral exploration world can direct or distract an exploration geologist. Polymetallic mineral deposits are comprised of various geochemical elements, including silver, gold, arsenic, antimony, lead, and zinc, many of which are in high demand to support the expanding green revolution across the globe. What are the inter-relationships among predictor geochemical variables to the target variable silver?
Location A few years ago, in a remote location in southern Chihuahua, Mexico, a vast desert comprised of shrub-brush, the occasional Mesquite tree, and a variety of cacti hiding a large multi-metal mineral resource (add photo). Today this particular area is home to a large polymetallic mineral resource made up of silver with gold, arsenic, antimony, lead, and zinc.
Method A variety of statistical methods can be deployed in building predictive models, including both linear regression and non-linear regression models, to name a few. The main objective of ordinary least squares linear regression is ” to find the plane that minimizes the sum-of-squared-errors (SSE) between the observed (actual) and predicted response. What are the inter-relationships among geochemical predictor variables to the target variable silver using the linear regression method in caret-package in R?
Results Stay tuned
Conclusions Stay tuned
Predictive modeling of exploration targets
Mineral exploration and mining development is essential to the world’s population. Choosing geologically favorable areas is becoming increasingly difficult, with more and more mineral deposits occurring under recent sedimentary cover in blind environments. Exploration geologists must leverage the ability to build predictive models to find minerals, including developing useful statistical models.
What is predictive modeling?
Carranza (2009) defined predictive modeling as “making descriptions, representations or predictions about an indirectly observable and complex real-world system via quantitative or qualitative analysis of relevant data.” (Carranza, 2009, p.4). He further suggests that “some quantity of data associated directly with the target variable must be available in order to create and validate a predictive model.” (Carranza, 2009, p.4).
Approaches to predictive modeling
Several different approaches can be taken, including generalization in a set of observations or data where patterns in a data set are examined (Carranza, 2009). Confirming particular deductions by testing the generalization on other test data sets where the result is not known (Bonham-Carter, 1994). A simple linear regression best-fit model of the predictor variable as a function of the target variable. Quantitative Empirical Modeling is where data of the target variable are divided into a training set and a testing set, and based on the training set, relationships between the target-variable and predictor variables are quantified and then used for prediction. An assessment of goodness of fit to data in a training set and its predictive ability against data in a testing set (Carranza, 2009).
Statistical methods used in building predictions
The use of classification and regression models is not new to science and many other domains Ayres, 2007. The R-language R-Development Core Team 2008 has developed many modeling functions for both classification and regression, however the R caret-package has stream-lined the building and predicting model functions, simplified model tuning parameters, and provided the ability to extend the model to parallel processes Kuhn, 2008. Both the stats package in R and the caret-package in R contain the functionality useful in the early project stages (e.g., data splitting and pre-processing) as well as unsupervised feature selection methods to tune models using resampling that helps in evaluating the model. The caret package depends on 25 other packages, some of which are loaded individually when a model is trained or predicted Kuhn, 2008.
The data set
A subset from a high-resolution geochemistry data set from drill core samples from a project in Mexico will be used in an area known for high concentrations of the target variable silver with or without possible predictor variables, including gold, arsenic, antimony, lead, and zinc.
Keywords: mapping mineral potential, statistical methods, predictive models, regression analysis, the R caret package.
The data set can be found at Geochemistry data set dropbox
The data set used in this analysis is from a real-world mineral deposit where a series of different geochemical elements are found in high quantities. The target variable herein is silver(Ag_ppm), and the predictor variables are gold (Au_ppm),arsenic(As_ppm), antimony(Sb_ppm), lead(Pb_pct) and zinc (Zn_pct). The data set covers an area measuring 2.0km by 1.0km by 1.2km in depth of high resolution (down hole core sample assays in every 2.0 meter intervals with minimal gaps) however after the data set was reduced in size, the samples will be randomly distributed across the area described above.
The data limitations in this data set include the determination of certain geochemical elements that do not provide an accurate analysis so these geochemical elements were not included in the study. There are data gaps where bedrock was not reached until a depth of 3-10 m below surface. These observations were not included in the study.
The original data file was comprised of hundreds of thousands of observations and many variables that were available for model training, tuning and evaluation. While training the model with all of the data might be ideal, it is unrealistic for the objectives of this study. The use of resampling (e.g., cross validation) will be included to evaluate the model performance by providing an average RMSE for a more accurate prediction. Ideally an outside test data set should be available so that the model performance can be assessed on data not used in the model training.
This file was reduced down to the target variable silver for values exceeding Ag ppm > 100 ppm and several predictor variables. The new data set has 501 observations and across nine variables that are known to spatially occur with silver. A new column was created for five Silver classes ranging from low to high silver values.
The data was slit into training and test sets. The test set will be used to evaluate performance (e.g. compare models) and the training set will be used all other runs. In this case 80% of the data will be used for model training and the remainder will be used for evaluating the model performance.
The model was tuned using resampling (hopefully if I can figure out my code error).
Figure 1. Rough flow chart of basic steps - although not showing iterations as my code is not working right now.
Load libraries required
library(tidyverse)
library(here) #* an easy way to find your file*
library(janitor) #* used for examining and cleaning dirty data*
library(skimr) #* creates compact flexible data summaries*
library(knitr) #* used for dynamic report generation in R*
library(mice) #* mutivariate chained equations*
library(reshape2) #* flexible reshaping data*
library(DataExplorer)
library(RColorBrewer) #* color brewer palettes*
library(tibble) #* simple data frame support*
library(ggplot2) #* create data visualizations using graphic grammar*
library(readr) #* to read text data csv*
library(lattice) #* Trellis graphics for R*
library(stats) #(* R stats package)
library(caret) #* classification and regression training*
Import the raw data to take a look at geochemical element correlations
geochemistry <- read_csv(here("data", "DSV_assays_ICP_20230428_AgEq_xyz_twoclean.csv"))
head(geochemistry)
## # A tibble: 6 × 29
## Au_ppm Ag_ppm Al_pct As_ppm Ba_ppm Bi_ppm Ca_pct Cd_ppm Cr_ppm Cu_pct Fe_pct
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.093 41.8 0.43 449 240 2 3.1 17.3 1 0.0013 1.93
## 2 0.065 19.1 0.36 363 70 1 3.67 2.6 2 0.0012 1.67
## 3 0.166 13.1 0.32 591 90 1 4.5 2.2 2 0.0006 2.09
## 4 0.13 26.5 0.31 551 40 1 4.33 15 1 0.0028 1.83
## 5 0.139 18.8 0.3 435 30 2 2.27 10.1 2 0.0018 1.83
## 6 0.403 213 0.36 1235 30 1 0.15 306 1 0.0275 4.98
## # ℹ 18 more variables: K_pct <dbl>, Mg_pct <dbl>, Mn_ppm <dbl>, Mo_ppm <dbl>,
## # Na_pct <dbl>, Ni_ppm <dbl>, P_ppm <dbl>, Pb_pct <dbl>, S_pct <dbl>,
## # Sb_ppm <dbl>, Sc_ppm <dbl>, Sr_ppm <dbl>, Th_ppm <dbl>, Ti_pct <dbl>,
## # V_ppm <dbl>, W_ppm <dbl>, Zn_pct <dbl>, AgEq <dbl>
plot_correlation(na.omit(geochemistry), maxcat = 5L)
Figure 2. Correlation matrix showing individual geochemical elements and their association to other geochemical elements using a diverging scale from dark blue (a strong weak correlation) to dark red (strong positive correlation). Silver(Ag_ppm) has a fairly strong relationship to lead(Pb_pct), antimony(Sb_ppm), zinc(Zn_pct), and gold(Au_ppm).
Complete a univariate analysis using basic histograms and examine the frequency of distribution for all the variables for consideration in the analysis
plot_histogram(geochemistry)
Figure 3. Univariate distribution for potential predictive geochemical variables for consideration in the analysis. The distributions show that the frequency of certain variables and their values may not be useful in the analysis. The data needs to be further culled to ensure the analysis is easy to reproduce.
Cull the dataset for six predictor variables and add a column for Silver class (1 = high silver and 6 = low silver)
### Import data culled and use head method to see first few rows
geochemistry <- read_csv(here("data", "assaysICPsubset.csv"))
head(geochemistry)
## # A tibble: 6 × 9
## Silver_class Ag_ppm Au_ppm As_ppm Sb_ppm Mn_ppm Pb_pct Zn_pct `Silver range`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Class6 213 0.403 1235 63 549 0.999 2.18 great100less250…
## 2 Class3 727 2.18 415 247 1990 3.95 0.164 greater500less1…
## 3 Class6 127 0.203 2170 241 8720 2.51 1.93 great100less250…
## 4 Class6 154 0.254 2210 167 2970 3.06 4.48 great100less250…
## 5 Class6 195 0.293 2780 217 3640 3.64 6.36 great100less250…
## 6 Class6 187 0.198 635 168 968 3.79 3.71 great100less250…
Table 1. The data set (first few rows) showing eight of the nine variables.
Feature engineering is essential because a machine learning algorithm is only as accurate as the data we provide to it so creation of features relevant to the task (to define the relationship between silver and other variables using linear regression.
Standardization is a transformation that centers the data by removing the mean value of each feature and then scale it by dividing (non-constant) features by their standard deviation. After standardizing data the mean will be zero and the standard deviation one. This is done When the data has a Gaussian distribution in the input variables suited for the logistic regression statistical method.
The dataset is comprised of 501 observations and 9 variables
## Lets look at gold first#
geochemistry <- read_csv(here("data", "assaysICPsubset.csv")) #* Read in the data set*
model <- lm(Ag_ppm ~ Au_ppm, geochemistry[1:20, ]) #* a linear regression model fit on the first 20 rows in the data set*
model
##
## Call:
## lm(formula = Ag_ppm ~ Au_ppm, data = geochemistry[1:20, ])
##
## Coefficients:
## (Intercept) Au_ppm
## 210.79 40.14
Predict in-sample-error using the predictor element gold(Au_pm)
predicted <- predict(model, geochemistry[1:20, ], type = "response") #* Make in-sample prediction using the predict function*
predicted
## 1 2 3 4 5 6 7 8
## 226.9694 298.2924 218.9420 220.9890 222.5543 218.7413 213.1623 263.3734
## 9 10 11 12 13 14 15 16
## 264.3768 256.5502 260.5639 288.8603 309.1293 306.3198 252.5365 253.5399
## 17 18 19 20
## 242.1009 248.0813 257.3529 260.5639
Table 2. Shows the silver values predicted in the first 20 rows of the data set ranging between 190.7ppm Ag and 343.4ppm Ag.
Take a look at the frequency distribution across entire data set across all nine variables
library(ggplot2)
geochemistry <- read_csv(here("data", "assaysICPsubset.csv"))
model <- lm(Ag_ppm ~ ., geochemistry)
plot(geochemistry)
Figure 4. Frequency distribution scatter plots across all the variables in the data set; its clear that certain geochemical elements occur as frequently as the target variable (silver Ag_ppm) and have variable distributions by grade.
# *Predict on the full data set using ~. on all 9 variable geochemistry data set*
model <- lm(Ag_ppm ~ ., geochemistry)
predicted <- predict(model, geochemistry)
# *Compute errors: error*
error <- predicted - geochemistry[["Ag_ppm"]]
# *Calculate RMSE*
sqrt(mean(error ^ 2))
## [1] 51.99645
The Root Mean Square Error (RMSE) is close to 52ppm Ag. This RMSE was calculated using in-sample-validation and may NOT lead to a best-fit for model performance; and may lead to model over-fitting (overestimating the models accuracy on performance).
You must do out-of-sample-error measures
Out of sample error model testing is done with a real test-set where you don’t know the actual outcome; this provides key insight to ensure you choose best-fit models that continue to perform well regardless of the test data; pick models that perform well on NEW data.
Generate models that don’t over-fit the training data and generalize well - perform error metrics on NEW data
Split the data for training and testing to evaluate the effectiveness of the model. The data set needs to split into two sets (e.g. 70 % of the data for training and 30% (0.3) of the data for testing).
# *Fit a model to the geochemistry data*
geochemistry <- read_csv(here("data", "assaysICPsubset.csv"))
model <- lm(Ag_ppm ~ Au_ppm, geochemistry[1:20, ]) # *linear regression model fit on first 20 observations/rows in the data set*
Compare to RMSE on the last 12 rows of the data set
# *Predict out-of-sample on a new dataset*
predicted <- predict(
model, geochemistry[21:32, ], type = "response"
) # * linear regression model fit on the last 12 observations/rows in the data set*
# *Evaluate error by root-mean-square-error of RMSE* by comparing predictions on the model to actual Ag_ppm values for the test set*
actual <- geochemistry[21:32, "Ag_ppm"] #* on the last 12 observations*
sqrt(mean((predicted - actual) ^ 2))
## [1] NA
The above could would not computer the RMSE. Try another way.
# *Compute errors: error manually*
error <- predicted - geochemistry[["Ag_ppm"]]
# *Calculate RMSE*
sqrt(mean(error ^ 2))
## [1] 239.4212
The result is an RMSE of 250 ppm Ag; a measure of the model’s average error in the last 12 rows in the data set; our model is off by 250 Ag_ppm on average. This RMSE is less than optimal for model performance
Now compare to in-sample RMSE on the full data set
#* Fit a model to a full data set*
model <- lm(Ag_ppm ~ ., geochemistry) #* linear regression lm(); fit Ag_ppm on full data set*
#* Predict in-sample on the full data set and evaluate error on models performance*
predicted2 <- predict(
model, geochemistry, type = "response"
)
Evaluate error by root-mean-square-error (RMSE) by comparing predictions on the model to actual Ag_ppm values for test set
# Evaluate using RMSE; *compute errors: error manually*
error2 <- predicted2 - geochemistry[["Ag_ppm"]]
# *Calculate RMSE*
sqrt(mean(error ^ 2))
## [1] 239.4212
The out-of-sample RMSE of 250 Ag_ppm above suggests our model done on the full data set is worse than the earlier models run on partial data; if we had believed the in-sample-error of 52 Ag_ppm we would have incorrectly suggested that our model is better than it really is
Now run the model on multiple train-test-split subsets
#* set random seed; seed set to 42 ensures you get the same random split each time script is run*
set.seed(42) #* set seed to be more than 25 as suggested by Kuhn (2008)
#* Read in the data set*
geochemistry <- read_csv(here("data", "assaysICPsubset.csv"))
#* Use the sample() function to shuffle row indices: rows*
rows <- sample(nrow(geochemistry))
#* Create random vector of row indices called rows to randomly reorder data*
shuffled_geochemistry <- geochemistry[rows, ]
Try an 80/20 split; first 80% into a training set; last 20% into a test set
#* choose a split point at 80% into your dataset*
split <- round(nrow(shuffled_geochemistry) * 0.80)
#* Use this point to break off the first 80 % of the data set as a training set*
shuffled_geochemistry[1:split, ]
## # A tibble: 401 × 9
## Silver_class Ag_ppm Au_ppm As_ppm Sb_ppm Mn_ppm Pb_pct Zn_pct `Silver range`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Class6 149 0.435 1020 147 1650 2.52 1.97 great100less25…
## 2 Class6 175 0.368 1240 137 792 3.58 2.82 great100less25…
## 3 Class6 120 0.083 241 34 2560 2.4 5.72 great100less25…
## 4 Class3 604 0.088 373 1705 2680 5.31 2.87 greater500less…
## 5 Class6 122 0.393 1120 147 823 1.82 1.72 great100less25…
## 6 Class6 109 0.183 2270 132 1330 2.16 3.04 great100less25…
## 7 Class4 361 1.44 10000 410 1770 7.08 6.11 great250less50…
## 8 Class6 184 0.966 4260 143 313 2.59 1.07 great100less25…
## 9 Class6 136 0.078 12 1 2210 0.295 10.4 great100less25…
## 10 Class6 177 0.931 1680 188 1600 2.67 2.84 great100less25…
## # ℹ 391 more rows
Table 3. Shuffled data set for 80% (401 observations) showing the first 8 of 9 variables in the data set.
#* Determine row index to split on: split*
split <- round(nrow(geochemistry) * 0.80)
#* Create a training-set composed of 401 observations using that index*
training_data <- geochemistry[1:split, ]
#* Create a test-set composed of 100 observations using that index*
testing_data <- geochemistry[(split + 1):nrow(geochemistry), ]
Because I already randomly shuffled the geochemistry data set it was easy to split off a random test-set. Shuffled_geochemistry has a total of 501 observations with 9 variables; training-set has 401 observations (80% of total obs) and the test-set has 100 observations each with 9 variables.
Predict on a single train-test-split
Now use the lm() function on the randomly split training-set and test-set to fit a model to the training-set, rather than the entire data set; use the formula interface to the linear regression function to fit a model with the specified target variable (Ag_ppm) using all other variables in the data set as predictors.
#* Fit a model to the training data set*
mod <- lm(Ag_ppm ~ ., training_data) #* the formula is Ag_ppm ~ silver class + Au_ppm, As_ppm, Sb_ppm, Mn_ppm etc.)
Use predict() function to make predictions from that model on new data; ensure new data has all columns as training_data (any order; with different values). Predict now on the testing_set not used to train the model for an out-of-sample-error for the model
#* Make predictions on the model above on new data*
predict1 <- predict(mod, testing_data)
Calculate predictions on the test set
#* Predict on testing_data using predict()*
p <- predict(mod, testing_data)
Calculate testing_data RMSE by hand
Use the predictions on the test set above to calculate an error metric (e.g., RMSE) to see how the model performs out-of-sample rather than in-sample. To do this, calculate the errors between the predicted Ag_ppm and the actual Ag_ppm by subtracting prediction Ag_ppm value from actual Ag_ppm value.
#* Create an error vector and compute errors; extract actual value from the testing_set*
error <- p - testing_data[["Ag_ppm"]]
#* Calculate RMSE using the error vector, mean and square root*
sqrt(mean(error ^ 2))
## [1] 57.86758
This is an error close to the RMSE error on the simple train/test set but its still high suggesting the model over-fit the training set. Move from predicting using single train-test-split to predicting using cross validation (CV) through on multiple train-test-split sets and average the RMSE giving a more precise out-of-sample RMSE. This method will avoid the negative affects of outliers on the RMSE in the data set
Fitting a single model is cheaper than using cross-validation (CV) model (10 training models and 1 final model) more costly due to more time but more accurate; split the data into ten folds or train/test splits; every single point in data set occurs exactly once; assign each row to its single test-set randomly to avoid systematic bias in the data; best way to assess out-of-sample error in predictive models; each of the ten test sets are same size as training set but is composed of out-of sample predictions
Assign each row to test-set randomly to avoid biases; after you know the average RMSE you refit the model on the FULL training data set so as to fully exploit info in that data set.
#*Lets fit a cross validated model to the geochemistry data set*
#*Set the random seed for reproducibility*
set.seed(42)
I ran into a coding challenge here when I launched the cross validation multiple test-train-split steps.
Figure x1 and x2. x1 Linear regression scatter plot observed vs predicted values for the geochemistry test-set; x2 and predicted vs residual scatter plot.
Figure x. Scatterplots of the transformed predictors along with a regression line - using the transformed predictors is it safe to assume that the relationship between the predictors and the outcome is linear? I suspect some predictor variables will be linear and others more random distribution.
Figure y. A scree plot? Are there significant between predictor correlations? Do I need to do a PCA in the full transformed data set to answer this question? Between predictor correlations expected for Au/As, As/Sb, Ag/Pb, Pb/Zn etc.
Figure z. Between-predictor variable correlations of the transformed continuous silver (Ag_ppm) predictors? Are there significant between-predictor correlations? I suspect there will be many strong positive correlations that might create problems in developing some models like linear regression so consideration of pre-processing steps to mitigate.
Anticipated results
The relationship between certain predictor variables and the outcome is linear (e.g. best-fit Ag/Pb, Ag/Au (more random), Ag/Zn (even more random)).
Results
a description of the research conducted and the results obtained. Results presented as tables, large data sets, figures, which can include graphs, videos, diagrams and photographs. Include additional supporting data.
Discussion
Analysis and interpretation of the data that integrates new information with prior findings, states the implications of the work, and generates new hypotheses to be tested.
Future Work Stay tuned
References
Ayres, I., Ayres-Brown, A. R., & Ayres-Brown, H. J. (2007). Seeing significance: Is the 95% probability range easier to perceive? CHANCE, 20(1), 11–16
Bonham-Carter, G. F. (1994). Geographic information systems for geoscientists: Modelling with GIS. Pergamon Press, p. 180.
Carranza, E.J.M. (2009). Geochemical anomaly and mineral prospectivity mapping in GIS. Handbook of Exploration and Environmental Geochemistry 11, M. Hale, series editor, Elsevier.
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5)
Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling
The R-language R-Development Core Team 2008