Mineral exploration and mining development are essential to the world’s population. Is there a relationship between the presence of a particular geochemical element and the presence of silver? Linear regression is a simple approach to supervised machine learning that predicts a quantitative response. This study uses ordinary least squares linear regression in the R-caret package (Simple Linear Regression) to understand the relationship between silver and associated geochemical elements.
Various statistical methods can be deployed in building a quantitative predictive model, including many regression models. Linear regression falls under a family of algorithms frequently used in supervised machine learning. The R-caret package, short for Classification And REgression Training, is a framework for building machine learning models and includes a set of functions that attempt to streamline the process of creating predictive models. In addition, several other predictive models (algorithms) can easily be deployed from within the R-caret framework.
This report illustrates Ordinary Least Squares Linear Regression to create a best-fit model for the dependent variable as a function of the independent variable(s). For this study, the dependent (target) variable is silver (Ag_ppm), and the independent variables examined include gold (Au_ppm), arsenic (As_ppm), antimony (Sb_ppm), manganese (Mn_ppm), lead (Pb_pct), and zinc (Zn_pct).
The goal of the study is to use one variable as an independent (predictor) variable to explain the presence of the dependent (target) variable, silver. To do this successfully, we need a strong relationship between the two variables: the stronger the relationship between the predictor variable and the response variable, the better the model. The model can then be used to predict changes in the target variable silver.
In particular, the research question asks whether relationships exist between the presence of associated geochemical variable(s) and the presence of silver. Is there a relationship among certain geochemical variables, and can they be used to predict silver?
The results suggest that all the geochemical variables examined have positive linear relationships with silver; however, the strength of those relationships varies between geochemical variables. Lead and antimony have moderately strong positive linear relationships with the target variable silver, with correlation coefficients of 0.69 and 0.64, respectively. Lead has a good ability to predict silver (steep slope), whereas antimony has a poor ability to predict silver (shallow to flat slope).
This study illustrates how to build a model, assess the model’s performance using metrics, and tune a model using the R-caret package. Fitting models on multiple train-test splits using cross-validation (CV) is more computationally costly, but averaging the RMSE across the splits gives a more reliable gauge of model performance.
Keywords: R-caret package, predictive numeric modeling, machine learning, supervised learning, ordinary least squares.
Predictive modeling of exploration targets
Geologists need every tool in their toolbox to aid in a new discovery. Choosing geologically favorable areas for potential discovery is becoming increasingly difficult. What is the relationship between the presence of associated geochemical variables and the presence of silver? This study uses Ordinary Least Squares (OLS) Linear Regression in the R-caret package (Kuhn, 2008) to understand that relationship and to predict the presence of silver.
What is predictive modeling?
Carranza (2009) defined predictive modeling as “making descriptions, representations or predictions about an indirectly observable and complex real-world system via quantitative or qualitative analysis of relevant data” (Carranza 2009, p 11). Carranza further suggested that “some quantity of data associated directly with the dependent variable must be available in order to create and validate a predictive model” (Carranza 2009, p 11).
Approaches to predictive modeling
Several different approaches can be taken, including generalization from a set of observations, where patterns in a data set are examined (Carranza, 2009). Bonham-Carter (1994) suggested confirming particular deductions by testing the generalization on other test data sets where the result is not known. In Quantitative Empirical Modeling, data of the target variable are divided into a training set and a testing set; based on the training set, relationships between the dependent variable and independent variable(s) are quantified and then used for prediction. The goodness of fit to data in a training set and the model’s predictive ability against data in a testing set (Carranza, 2009) are presented here. An ordinary least squares linear regression best-fit model of the dependent variable as a function of the independent variable shows the relationship (Ludbrook, 2002).
Statistical methods used in building predictions
Classification and regression models are frequently used in supervised machine learning tasks (Alto, 2023) and can be applied to different use cases (Kuhn, 2008). The R language (R-Core Team, 2013) offers many modeling functions for both classification and regression; however, the R-caret package streamlines model building and prediction, offers simplified model tuning, and can extend model training to parallel processing (Kuhn, 2008). The caret package (short for Classification And REgression Training) is a framework for building machine learning models in R. It contains tools for data splitting, pre-processing, feature selection (both supervised and unsupervised), model tuning using re-sampling, and other functionality that is useful from the early project stages through model evaluation (Kuhn, 2008). The caret package depends on many other packages, some of which are loaded individually when a model is trained or predicted (Kuhn, 2008).
A simple linear regression model is a mathematical equation that allows us to predict a response for a given predictor value (Natural Resources Biometrics, Lumen Learning, 2023). According to Alto (2023), “Ordinary least squares (OLS) is an optimization strategy that helps you find a straight line as close as possible to your data points in a linear regression model” (Alto 2023, p 1). Linear regression is part of a larger family of algorithms that can be used in supervised machine learning tasks. Regression involves continuous numerical values as the response target: the algorithm is asked to predict a continuous number rather than a class or category (Alto, 2023). This document demonstrates the steps required to predict a dependent variable from an independent variable by employing a Simple Linear Regression algorithm (Figure 1-1 and Figure 1-2).
SLRequation
Figure 1-1. The Simple Linear Regression (SLR) algorithm, which explains the relationship between two variables, where εi is the error term and α and β are the true (but unobserved) parameters of the regression. The parameter β represents the change in the dependent variable (y) for a one-unit change in the independent variable (x). For example, if β is equal to 0.75, then when x increases by one, y increases by 0.75. The parameter α represents the value of the dependent variable when the independent variable is equal to zero (Alto, 2023).
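The model described in the Figure 1-1 caption can be written out as below. This is the standard textbook form rather than a reproduction of the original figure; the observation index i and the sample size n are notation assumed for this sketch.

```latex
y_i = \alpha + \beta x_i + \varepsilon_i, \qquad i = 1, \dots, n
```

With β = 0.75, a one-unit increase in x raises the expected value of y by 0.75, and α is the expected value of y at x = 0.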
SLRgraph
Figure 1-2. Simple Linear Regression graph. The further the data points are from the regression line, the higher the error and the less accurate the prediction of the dependent variable.
The data set used in this analysis is from a real-world mineral deposit located in southern Chihuahua, Mexico. The mineral deposit is characterized by a suite of geochemical elements found in high quantities together with silver. The dependent (target) variable herein is silver(Ag_ppm), and the independent (predictor) variables are gold(Au_ppm), arsenic(As_ppm), antimony(Sb_ppm), manganese (Mn_ppm), lead(Pb_pct), and zinc (Zn_pct).
The link below provides access to the data set used in this study. Each observation occupies a separate line in the data set and represents a single core sample ranging in length from 0.30 to 2 meters (actual sample identifications were removed for confidentiality reasons). Each column in the data set represents a different geochemical element (variable) with its analytical determinations. Along with these numeric values, two new categorical columns were created for each observation: Silver_range, which bins silver values into ranges, and Silver_class, which ranges from Class 1 for low silver to Class 4 for high silver. The subset of the data provided at the link comprises 499 observations across seven numeric variables (Ag_ppm, Au_ppm, As_ppm, Sb_ppm, Mn_ppm, Pb_pct and Zn_pct) and the two categorical variables defined above.
The reduced data set can be found at Geochemistry reduced
Statistically, it is ideal to build a model using all geochemical elements, with as much of the data as possible, in order to find the most accurate predictive model; however, for illustrative purposes, this study will demonstrate model building, tuning, and evaluation using the R-caret Package using a subset of independent variables from the data set that are associated with the dependent variable silver. The R-caret package supports model tuning across a wide variety of modeling techniques (Kuhn, 2008), including Ordinary Least Squares (OLS) regression (e.g., Simple Linear Regression).
Ideally, all the samples are used to train the model, and re-sampling (e.g., cross-validation or the bootstrap) is then used to evaluate the model’s efficiency. In addition, an external test set, not used in the model training, should be used to assess the model performance (Kuhn, 2008).
Research questions include:
Q1 - Is there a relationship between silver and other geochemical variables?
Q2 - How strong is the relationship between silver and lead?
Q4 - How large is the association between silver and lead?
Q5 - How accurately can we predict for the presence of silver with lead?
Q6 - Is the relationship between silver and lead linear?
FlowChart
Figure 3-1. A flow chart summarizing the steps taken or proposed in model building, model tuning, and model performance assessment using RMSE.
Data transformations for individual predictors
Feature engineering is the process of selecting, manipulating, and transforming raw data into features (variables) that can be used in supervised learning (Patel, 2021) for a particular machine learning task. Several potential independent variables known to be associated with elevated silver (e.g. > 100 ppm Ag) were selected for the study.
Dealing with missing values
Observations with missing values were unusual, but did exist, and were deleted from the data set used for the study.
Removing variables
Data limitations include analytical determinations for several geochemical elements that were available but were made using a less-than-optimal analytical method, resulting in inaccurate determinations; those elements were not included in the study. In addition, several tens of variables were removed from the data set because of a poor or absent association with silver.
Removing outliers
Several silver values as high as 6500ppm Ag occurred within the original data set. These caused skewed visualizations of the data in the pairwise variable scatter plot graphs so these outliers were removed.
Adding new variables and binning variables
Two new categorical variables called Silver_range and Silver_class were added to the data set so that additional aesthetics could be applied to visualizations, giving insight into how different ranges of silver values associate with the independent variables. The classes are defined as follows: Class 1 >= 100ppm Ag to as high as 250ppm Ag, Class 2 > 250ppm Ag to as high as 500ppm Ag, Class 3 > 500ppm Ag to as high as 1000ppm Ag, and Class 4 > 1000ppm Ag.
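The class boundaries above can be sketched with base R's cut(). This is a minimal illustration, not the study's own code: the assay values are invented, and the object names ag_ppm, silver_class and binned are assumptions made for the example. The labels follow the Class1 to Class4 naming visible in Table 3-1.

```r
# Hypothetical silver assays in ppm (illustrative values, not the study data)
ag_ppm <- c(100, 250, 251, 500, 727, 1000, 1001)

# Class 1: [100, 250], Class 2: (250, 500], Class 3: (500, 1000], Class 4: > 1000
silver_class <- cut(
  ag_ppm,
  breaks = c(100, 250, 500, 1000, Inf),
  labels = c("Class1", "Class2", "Class3", "Class4"),
  right = TRUE,          # each interval includes its upper bound
  include.lowest = TRUE  # so that exactly 100 ppm falls in Class1
)

binned <- data.frame(Ag_ppm = ag_ppm, Silver_class = silver_class)
print(binned)
```

Values below 100 ppm Ag would come back as NA under these breaks, which matches a data set restricted to samples at or above 100 ppm Ag.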
A machine learning project’s quality is based on the input data quality, so it’s essential to learn about the data subtleties during the data exploration stage (Lanz, 2019). The first step in any model building is to understand the data from which the model will be built, which can be done by creating graphs, charts, or other visualizations to gain deeper insight.
First, look at actual relationships between the dependent variable silver and several independent variables using scatter plots (Figures 3-2 to 3-6). A regression line was added to each image because this line can be used to predict the value of y (silver) for a given value of x (e.g., lead), and it tells the reader in what direction and by how much the response variable (e.g., silver) changes when the independent (predictor) variable changes. The first two graphs show that silver values increase similarly with lead (Pb) and antimony (Sb) at values < 500ppm Ag (Figures 3-2 and 3-3). In contrast, gold (Figure 3-4) and arsenic (Figure 3-5) do not always follow silver values. Very high arsenic values are not necessarily associated with high silver values.
A colored correlation coefficient matrix was created to explore the strength of associations between silver and the independent variables. The darker the blue tone, the stronger the association and the better the fit to a linear regression line. Silver shows almost no linear relationship with gold, with a correlation coefficient of 0.10, while silver shows a relatively strong linear relationship with lead and antimony, with correlation coefficients of 0.69 and 0.64, respectively (Figure 3-9).
Ordinary least squares (OLS) regression is an optimization strategy that helps you find a straight line as close as possible to your data points in a linear regression model (Figure 3-8). OLS is considered the most useful optimization strategy for linear regression models because it can help you find unbiased estimates of the true values of the intercept (alpha) and the slope (beta) (Alto, 2023). The goal of OLS is to find the line (or, with several predictors, the plane) that minimizes the sum-of-squared errors (SSE) between the observed and the predicted response (Kuhn, 2013).
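To make the least-squares idea concrete, the sketch below computes the intercept and slope from the standard closed-form OLS formulas for a single predictor and compares them with lm(). The data are simulated stand-ins, not the study data, and every object name (pb_pct, ag_ppm, alpha_hat, beta_hat, fit, sse) is an assumption made for the example.

```r
set.seed(42)

# Simulated stand-in data; not the study data
pb_pct <- runif(200, min = 0, max = 20)
ag_ppm <- 100 + 40 * pb_pct + rnorm(200, sd = 100)

# Closed-form OLS estimates for a single predictor
beta_hat <- sum((pb_pct - mean(pb_pct)) * (ag_ppm - mean(ag_ppm))) /
  sum((pb_pct - mean(pb_pct))^2)
alpha_hat <- mean(ag_ppm) - beta_hat * mean(pb_pct)

# lm() fits the same model by least squares
fit <- lm(ag_ppm ~ pb_pct)

# Sum of squared errors (SSE) for the fitted line
sse <- sum(residuals(fit)^2)

print(c(alpha = alpha_hat, beta = beta_hat))
print(coef(fit))
```

The two sets of coefficients should agree to numerical precision, and any other line through the same points gives a larger SSE.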
Load libraries required for the illustration
Identify independent (predictor) variables with high pairwise correlation with the dependent variable silver
## # A tibble: 6 × 9
## Ag_ppm Au_ppm As_ppm Sb_ppm Mn_ppm Pb_pct Zn_pct Silver_class Silver_range
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 213 0.403 1235 63 549 0.999 2.18 Class1 great100less250…
## 2 727 2.18 415 247 1990 3.95 0.164 Class3 greater500less1…
## 3 127 0.203 2170 241 8720 2.51 1.93 Class1 great100less250…
## 4 154 0.254 2210 167 2970 3.06 4.48 Class1 great100less250…
## 5 195 0.293 2780 217 3640 3.64 6.36 Class1 great100less250…
## 6 187 0.198 635 168 968 3.79 3.71 Class1 great100less250…
Table 3-1. Data set considered in the analysis. The silver ranges are provided for each class where the highest class, Class 4 is for all those silver values greater than 1000ppm Ag.
Look at the relationship between actual silver and lead and use shape and color to enhance the scatter plot visualization; add the regression line to show the direction and strength of the relationship.
Figure 3-2. A scatterplot of Lead (Pb_pct) vs Silver (Ag_ppm) showing the relationship between actual values of two continuous numeric variables, with points grouped by the categorical variables Silver_class (mapped to shape) and Silver_range (mapped to color). The steep, positive fitted linear regression line (95% confidence) suggests a moderate positive linear relationship between silver (Ag_ppm) and lead (Pb_pct), with silver values increasing with lead values, suggesting that lead could be used to predict silver.
Look at the relationship between silver and antimony; use shape and color to enhance the scatter plot visualization
Figure 3-3. A scatterplot of Antimony (Sb_ppm) vs Silver (Ag_ppm) showing the relationship between actual values of two continuous numeric variables, silver (Ag) > 100ppm Ag and antimony (Sb), with points grouped by the categorical variables Silver_class (mapped to shape) and Silver_range (mapped to color). The shallow, positive fitted linear regression line (95% confidence) suggests a linear relationship between silver (Ag_ppm) and antimony (Sb_ppm); however, antimony has a weak ability to predict silver at values < 500ppm Ag.
Look at the relationship between silver and gold; and use shape and color to enhance the scatter plot visualization
Figure 3-4. A scatterplot of Gold (Au_ppm) vs Silver (Ag_ppm) showing the relationship between actual values of two continuous numeric variables, silver > 100ppm Ag and gold, with points grouped by the categorical variables Silver_class (mapped to shape) and Silver_range (mapped to color). The flat, positive fitted linear regression line (95% confidence) suggests a linear relationship between silver and gold; however, the flat regression line makes it clear that gold has a weak ability to predict silver.
Look at the relationship between silver and arsenic; use shape to map silver class and color to map silver ranges
Figure 3-5. A scatterplot of Arsenic (As_ppm) vs Silver (Ag_ppm) showing the relationship between actual values of two continuous numeric variables, silver > 100ppm Ag and arsenic (As), with points grouped by the categorical variables Silver_class (mapped to shape) and Silver_range (mapped to color). The steep, positive fitted linear regression line (95% confidence) suggests a moderate positive linear relationship between silver and arsenic, with silver values increasing with arsenic values for some samples < 1000 ppm. Very high arsenic values can occur with low silver values < 250ppm Ag. Arsenic has a weak ability to predict silver for values > 1000 ppm Ag.
Look at the relationship between silver and zinc, an element often associated with high lead; use shape to map silver class and color to map silver ranges.
Figure 3-6. A scatterplot of Zinc (Zn_pct) vs Silver (Ag_ppm) showing the relationship between actual values of two continuous numeric variables, silver > 100ppm Ag and zinc, with points grouped by the categorical variables Silver_class (mapped to shape) and Silver_range (mapped to color). The shallow, positive fitted linear regression line (95% confidence) at values below 500ppm Ag suggests a moderate positive linear relationship between silver and zinc, with silver values increasing with zinc values. The ability of zinc to predict silver at values < 500ppm Ag is weak.
Look at the relationship between silver and manganese as these are often associated geochemical elements; and use shape to map silver_class and color to map silver_range
Figure 3-7. A scatterplot of Manganese (Mn_ppm) vs Silver (Ag_ppm) showing the relationship between actual values of two continuous numeric variables, silver > 100ppm Ag and manganese (Mn), with points grouped by the categorical variables Silver_class (mapped to shape) and Silver_range (mapped to color). The flat, positive fitted linear regression line (95% confidence) at values below 1000ppm Ag suggests a moderate positive linear relationship between silver and manganese, with silver values increasing with manganese values; however, given the shallow regression line, manganese has a poor ability to predict silver at values < 1000 ppm Ag.
To focus the data exploration process a new object called gc_reduced was created containing the 3 top potential independent (predictor) variables paired with silver in a correlation coefficient matrix to explore the strength of associations (darker blue has a stronger association); define the function panel.cor() where higher correlations(associations) are in a larger font to enhance the visualization (Chang, 2018).
Load the data set
Define function to show correlation coefficients use panel.cor higher correlation coefficient in larger font
Define function to show histograms use panel.hist
Combine the three panel themes (correlation coefficients, histograms, and scatterplots with red regression lines) into a single complex visualization; tell pairs() to use panel.cor for the upper panels, panel.hist for the diagonal, and panel.smooth to show trends (Chang, 2018)
Figure 3-8. Complex visualization. Scatterplots with red smoothing lines in the lower triangle, histograms on the diagonal, and correlation coefficients in the upper triangle (Chang, 2018). Each observation in the data set is represented by a point. Each scatter plot shows the relationship between a pair of continuous (numeric) variables. The red line shows the predicted values based on a LOWESS (locally weighted scatter plot smoothing) statistical model; adding the line helps the reader see trends more easily. The correlation coefficients in the upper triangle suggest that the strongest correlation is between silver (Ag_ppm) and lead (Pb_pct) at 0.69, approaching a moderately strong linear relationship. The next strongest correlation is between silver and antimony, with a moderately strong correlation coefficient of 0.64.
Create a numerical correlation (ncor) matrix with the key predictor variables colored squares, black labels at a 45 degree angle.
## Ag_ppm Sb_ppm Pb_pct Au_ppm
## Ag_ppm 1.00 0.64 0.69 0.10
## Sb_ppm 0.64 1.00 0.40 0.00
## Pb_pct 0.69 0.40 1.00 0.19
## Au_ppm 0.10 0.00 0.19 1.00
Figure 3-9. Correlation matrix between silver and three independent (predictor) variables. The stronger the correlation, the darker the blue tone, on a scale ranging from +1 (dark blue) to -1 (red, not shown here). Silver (the target variable) has the strongest correlation with lead (Pb_pct) at 0.69, approaching a moderately strong linear relationship.
In the example below, a quadratic model is fit using lm() with lead (Pb_pct) as a predictor of silver (Ag_ppm). The predict() function is then used to predict the value of silver across a range of values for the predictor (Pb).
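A numerical correlation matrix of the kind summarized in Figure 3-9 can be produced with base R's cor(). The sketch below uses simulated stand-in columns, so its coefficients will not match the 0.69 and 0.64 reported for the study; the data frame name gc_demo is an assumption, and only the numeric matrix is computed, not the colored plot.

```r
set.seed(1)
n <- 300

# Simulated stand-in columns; names mirror the study's columns, values are invented
pb <- runif(n, min = 0, max = 20)
gc_demo <- data.frame(
  Ag_ppm = 100 + 40 * pb + rnorm(n, sd = 150),
  Sb_ppm = 50 + 10 * pb + rnorm(n, sd = 100),
  Pb_pct = pb,
  Au_ppm = runif(n, min = 0, max = 3)
)

# Pairwise Pearson correlation coefficients, rounded to two decimals
ncor <- round(cor(gc_demo), 2)
print(ncor)
```

The matrix is symmetric with ones on the diagonal, so only one triangle needs to be read.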
Create a data frame with the lead % (Pb_pct) column, interpolating across the range
Predictions - for silver (dependent) variable using lead as the independent (predictor) variable
## Pb_pct Ag_ppm
## 1 0.0347000 100.4062
## 2 0.2363697 108.2642
## 3 0.4380394 116.1879
## 4 0.6397091 124.1776
## 5 0.8413788 132.2330
## 6 1.0430485 140.3543
Table 3-2. The first few rows of predicted silver (Ag_ppm) values, generated with the predict() function across a range of values for the predictor lead (Pb_pct). The table shows that predicted silver increases as lead increases.
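The prediction step summarized in Table 3-2 might look roughly like the sketch below: fit a quadratic model with lm(), build a grid of lead values across the observed range, and call predict() on that grid. The data are simulated stand-ins, and the object names (demo, model_quad, pred_grid) and the grid size of 100 points are assumptions for illustration.

```r
set.seed(7)

# Simulated stand-in data; not the study data
demo <- data.frame(Pb_pct = runif(300, min = 0.03, max = 20))
demo$Ag_ppm <- 100 + 38 * demo$Pb_pct + 0.15 * demo$Pb_pct^2 +
  rnorm(300, sd = 80)

# Quadratic model: silver as a function of lead and lead squared
model_quad <- lm(Ag_ppm ~ Pb_pct + I(Pb_pct^2), data = demo)

# Grid of evenly spaced lead values across the observed range
pred_grid <- data.frame(
  Pb_pct = seq(min(demo$Pb_pct), max(demo$Pb_pct), length.out = 100)
)

# Predicted silver for each lead value on the grid
pred_grid$Ag_ppm <- predict(model_quad, newdata = pred_grid)

head(pred_grid)
```

The column in the grid must carry the same name as the predictor used in the model formula, otherwise predict() cannot match it.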
Plot the data points along with values predicted from the model in ggplot2 (Wickham, 2016)
Figure 3-10. Predictions for the dependent variable silver using lead as the independent (predictor) variable. The model gives a moderately good fit to a positive regression line, although many values lie below the line. Based on the slope of the regression line (red), changes in lead (Pb_pct) predict similar changes (either high or low) in silver (Ag_ppm). A lead value of 5 percent (Pb_pct) predicts a value of 300ppm Ag, which is close to the value of 315ppm Ag read from Figure 3-2 at 5% lead. There is a strong sample selection bias below 4% lead, suggesting that large values of the dependent variable silver are underrepresented in the sample.
Based on a training set, relationships between the dependent variable silver (Ag_ppm) and the independent variable selected as lead(Pb_pct) are quantified and then used for prediction. The goodness of fit of data in a training set and its predictive ability against data in a testing set are presented for several models. The strength of the relationship and the accuracy of the model estimate are also presented.
-Fit regression model with train()
-Evaluate in-sample-error a metric of performance
-Evaluate out-of-sample performance using cross-validation (CV) and average RMSE
-Model tuning to improve metrics on prediction accuracy
Use a train-test split to evaluate performance and assess over-fitting. The model’s performance (accuracy) is judged by its out-of-sample error: the in-sample error (ISE) across a single train/test split is expected to be lower than the out-of-sample error (OSE) across multiple train/test splits, and a large gap between the two indicates over-fitting (GitHub Pages)
The data was split into several training sets and a test set. The test set will be used to evaluate the performance and the training set will be used on all other runs.
The simple model training was completed using 80% of the data, and the remaining 20% was used for evaluating the model performance. The lm function (short for linear model) fits the model by ordinary least squares, the statistical method used here.
The train function was used to fit the model on the selected rows (first 10 rows, last 20 rows or all rows) and to estimate model performance using re-sampling (which accepts only numeric values). The cross-validation (cv) method evaluates the model’s performance by providing an average RMSE across several train/test splits, which gives a more reliable estimate of predictive accuracy than the training data alone and helps detect over-fitting (over-estimating) of the model.
A good quantitative measure used to assess model performance is Root-Mean-Squared-Error (RMSE). The higher the error, the poorer the model performance. A low RMSE indicates close agreement between the actual observed values and the predicted values for the dependent variable.
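For reference, RMSE can be written as below. This is the standard definition; the symbols are notation assumed here, with y_i the observed silver value, \hat{y}_i the model's prediction for the same observation, and n the number of observations.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```

Because the squared errors are averaged and then square-rooted, the result is in the units of the target variable (ppm Ag in this study).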
Use train() to preprocess data BEFORE fitting models
Use train() function to tweak model parameters through CV and grid search
Use resamples() to compare multiple train/test models and select the best one
Train different models by changing the “method” argument
Use Root Mean Squared Error (RMSE) to estimate model performance: compute the in-sample error on the training set and compare it to the out-of-sample error on a test set not used in model training; use cross-validation to obtain an average out-of-sample RMSE. The in-sample error is expected to be lower than the out-of-sample error; an out-of-sample error that is much greater than the in-sample error indicates that the model over-fit the training data, suggesting poor predictability against data in the testing set.
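The resampling workflow outlined above can be sketched with caret's train() and trainControl() functions. This is a minimal example under stated assumptions, not the study's script: the data are simulated, and the names demo, ctrl and cv_model are invented for illustration. It assumes the caret package is installed.

```r
library(caret)

set.seed(42)

# Simulated stand-in data; not the study data
demo <- data.frame(Pb_pct = runif(300, min = 0, max = 20))
demo$Ag_ppm <- 100 + 40 * demo$Pb_pct + rnorm(300, sd = 120)

# Ten-fold cross-validation as the resampling scheme
ctrl <- trainControl(method = "cv", number = 10)

# Linear model of silver on lead, evaluated by cross-validation
cv_model <- train(
  Ag_ppm ~ Pb_pct,
  data = demo,
  method = "lm",
  trControl = ctrl
)

# RMSE, R-squared and MAE averaged over the ten held-out folds
print(cv_model$results[, c("RMSE", "Rsquared", "MAE")])
```

Passing verboseIter = TRUE to trainControl() prints progress for each fold, and changing the method argument of train() swaps in a different model type while keeping the same resampling scheme.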
Mod1ResidualsFitted
Figure 3-11a. Model 1 Residual vs Predicted plot- using lead as the single predictor for silver. This plot has a weak fan shape flaring out at the high silver values indicating a non-constant error variance that increases towards higher silver values and decreases towards lower silver values.
StdResModel1
Figure 3-11b. Model 1 Residual vs Leverage Plot - using lead as the single predictor for silver. This plot highlights two influential observations close to the outer Cook’s line (e.g., 127 and 480). Each observation in the model is shown with its leverage along the x-axis and its standardized residual (the difference between the predicted value for an observation and the actual value) along the y-axis. Leverage refers to the extent to which the coefficients in the regression model would change if a particular observation were removed from the data set. Many observations show a distinct pattern and fall far from the red line, suggesting there is a large difference between actual and predicted values in Model 1.
Make a new object (pred1) to hold the in-sample prediction of numeric silver values in Model 1 using the predict() function
Create a new object (error1) to hold the error on Model 1
Calculate the RMSE error - in-sample prediction on Model 1
## [1] 142.0691
The RMSE is a measure of the model’s error and has the same units as the target variable; the Model 1 prediction is off by about 142 ppm Ag on average.
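The three in-sample steps listed above (predict, compute the error, compute the RMSE) could be written roughly as follows. The data are simulated stand-ins, so the printed RMSE will differ from the 142.07 ppm Ag reported for Model 1; the object names pred1 and error1 follow the step descriptions, while demo, model1 and rmse1 are assumptions for the example.

```r
set.seed(3)

# Simulated stand-in data; not the study data
demo <- data.frame(Pb_pct = runif(200, min = 0, max = 20))
demo$Ag_ppm <- 100 + 40 * demo$Pb_pct + rnorm(200, sd = 140)

# Fit silver on lead using every row (in-sample)
model1 <- lm(Ag_ppm ~ Pb_pct, data = demo)

# In-sample predictions on the same rows used for fitting
pred1 <- predict(model1, newdata = demo)

# Error: predicted minus actual silver
error1 <- pred1 - demo$Ag_ppm

# Root mean squared error, in ppm Ag
rmse1 <- sqrt(mean(error1^2))
print(rmse1)
```

Because the model is scored on the rows it was fitted to, this figure tends to be optimistic compared with the error on new data.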
Mod2ResidualsFitted
Figure 3-12a. Residuals vs Fitted for Model 2 - using all the numeric variables as predictors for silver. This plot has a weak fan shape flaring out at the higher silver values indicating a non-constant error variance that increases towards higher silver values and decreases towards lower silver values.
Mod2StDResiduals
Figure 3-12b. Residual vs Leverage Plot for Model 2 using all variables as predictors for silver (in-sample-error) showing a single influential observation (e.g., 155) that lies outside the Cook’s line (black dashes) for consideration. This plot suggests that the difference between the predicted and actual value is smaller when using all variables to predict for silver compared to Model 1 where a single predictor variable was used.
Model 2 - Make a new object (pred2) to hold the in-sample prediction of numeric values for silver (Ag) in Model 2 using the predict() function
Model 2- Create an object for the error2 on the Model 2
Model 2 - Calculate the RMSE error of the in-sample prediction of Model 2
## [1] 118.0269
Model 1 - using a single variable lead(Pb_pct) as the independent variable to predict for silver - returned an RMSE of 142.07 ppm Ag - an indication of a poor model performance and an inaccurate prediction.
Model 2 - using all the variables in the data set to predict for silver - returned an RMSE of 118.03 ppm Ag - suggesting Model 2 had better model performance, with more accurate predictions, than Model 1.
An out-of-sample error approach uses a real test set to identify models that do not over-fit the training data (Leek, 2009), that is, models whose in-sample error does not overstate their predictive accuracy. This provides key insight to ensure you choose a best-fit model that continues to perform well on NEW data, based on an average RMSE across all train/test sets.
Model 3 - Set random seed to ensure you get the same random split each time the script is run. Kuhn(2008), the developer of R-caret, suggested using a random split higher than 25
Model 3 - Read in the data set that had the categorical variables removed
Model 3 - Create a random vector of row indices called rows to randomly re-order (shuffle) the data
Model 3 - Create a train/test set on a 80/20 split; first 80% into a training set; last 20% into a test set
Model 3 - Split off 80% of the shuffled/randomized data set as the training set
## # A tibble: 399 × 7
## Ag_ppm Au_ppm As_ppm Sb_ppm Mn_ppm Pb_pct Zn_pct
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 149 0.435 1020 147 1650 2.52 1.97
## 2 131 0.285 610 48 502 0.54 0.762
## 3 130 0.104 572 92 1130 2.2 2.4
## 4 604 0.088 373 1705 2680 5.31 2.87
## 5 122 0.393 1120 147 823 1.82 1.72
## 6 109 0.183 2270 132 1330 2.16 3.04
## 7 361 1.44 10000 410 1770 7.08 6.11
## 8 184 0.966 4260 143 313 2.59 1.07
## 9 136 0.078 12 1 2210 0.295 10.4
## 10 177 0.931 1680 188 1600 2.67 2.84
## # ℹ 389 more rows
## # A tibble: 6 × 7
## Ag_ppm Au_ppm As_ppm Sb_ppm Mn_ppm Pb_pct Zn_pct
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 149 0.435 1020 147 1650 2.52 1.97
## 2 131 0.285 610 48 502 0.54 0.762
## 3 130 0.104 572 92 1130 2.2 2.4
## 4 604 0.088 373 1705 2680 5.31 2.87
## 5 122 0.393 1120 147 823 1.82 1.72
## 6 109 0.183 2270 132 1330 2.16 3.04
Table 3-3. Shuffled data set comprising 80% (399 observations) of the original data set, showing the single dependent variable silver (Ag_ppm) first, followed by 6 numeric predictor variables.
Model 3 - Create a new object to hold the 80% split; and use round() function to determine the row to split on
Model 3 - Create a new object to hold the training data set (399 observations)
Model 3 - Create a new object for the remaining 20% testing data set (100 observations) reserved for the out-of-sample error estimation
Because I randomly shuffled the geochemistry data set, it was easy to split off a random test set. At this point Shuffled_geochemistry has a total of 499 observations; the training set has 399 observations (80% of total obs) and the test set has 100 observations (20% of total obs), each with the 7 numeric variables shown in Table 3-3. Predict on a single train/test set using all the variables first. Use the formula interface to the linear regression function lm() to fit a model with the specified target variable (Ag_ppm) using all the variables in the data set as predictors.
Model 3 - Create a new object for Model 3 for the training data (80% of the data) and fit a model to the training data using a single variable lead (Pb_pct) to predict for silver
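The shuffle-and-split steps above can be sketched in base R as follows. The data frame is a simulated stand-in with the same number of rows as the study subset (499), and the names demo, rows, shuffled, split, train_set and test_set are assumptions for illustration.

```r
set.seed(42)

# Simulated stand-in with 499 rows; not the study data
demo <- data.frame(Pb_pct = runif(499, min = 0, max = 20))
demo$Ag_ppm <- 100 + 40 * demo$Pb_pct + rnorm(499, sd = 140)

# Random permutation of the row indices, then re-order the rows
rows <- sample(nrow(demo))
shuffled <- demo[rows, ]

# Row on which to split: 80% of the rows, rounded to a whole number
split <- round(nrow(shuffled) * 0.80)

# First 80% for training, remaining 20% for testing
train_set <- shuffled[1:split, ]
test_set <- shuffled[(split + 1):nrow(shuffled), ]

print(c(train = nrow(train_set), test = nrow(test_set)))
```

With 499 rows, round(499 * 0.80) gives 399, which leaves 100 rows for the test set, matching the counts quoted in the text.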
Model 3 - Create an object to hold the prediction on the training data
Model 3 - Create an object for the error on Model 3
Model 3 - Calculate the Model 3 error RMSE
## [1] 142.9566
Model 3 - Create an object for Model 3 for the testing set (20% of the data)
Model 3 - Predict on the testing data
Model 3 - Calculate the error
Model 3 - Calculate the RMSE
## [1] 145.5438
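A minimal sketch of the fit, predict, error, and RMSE steps listed above is shown below. It assumes synthetic stand-in data and hypothetical object names (`train_set`, `test_set`, `model_lead`, `rmse`), so the numbers it prints will not match the RMSE values reported in this section.

```r
# Synthetic stand-in data and an 80/20 split (assumptions for illustration)
set.seed(42)
n <- 499
geochem <- data.frame(Pb_pct = runif(n, 0, 10))
geochem$Ag_ppm <- 80 + 30 * geochem$Pb_pct + rnorm(n, sd = 50)
shuffled  <- geochem[sample(n), ]
split_row <- round(n * 0.80)
train_set <- shuffled[1:split_row, ]
test_set  <- shuffled[(split_row + 1):n, ]

# Fit silver as a function of lead on the training data only
model_lead <- lm(Ag_ppm ~ Pb_pct, data = train_set)

# Helper: root-mean-square error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# In-sample error: predictions on the rows the model was fitted to
pred_train <- predict(model_lead, newdata = train_set)
rmse_train <- rmse(train_set$Ag_ppm, pred_train)

# Out-of-sample error: predictions on rows the model has never seen
pred_test <- predict(model_lead, newdata = test_set)
rmse_test <- rmse(test_set$Ag_ppm, pred_test)

c(train = rmse_train, test = rmse_test)
```

Comparing the two values side by side is what supports the over-fitting judgment: a test RMSE much larger than the training RMSE would suggest the model does not generalize well.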
Conclusion Model 3 - Predicting for silver using a single train/test set returned an RMSE of 142.96 ppm Ag on the training set and 145.54 ppm Ag on the testing set. These results indicate that Model 3 over fit the training data only slightly: 145.54 ppm Ag - 142.96 ppm Ag gives a difference of 2.58 ppm Ag.
Cross validation results in an average RMSE, giving a more precise assessment of the model’s performance through an out-of-sample RMSE from held-out testing data (new data). Averaging over several splits reduces the influence that outliers in any single test set have on the RMSE. Each row is assigned to a test set randomly to avoid biases, and once the average RMSE is known, the model is refit on the FULL training data set so as to fully exploit the information in that data set.
Model 4 - Use lead as the independent variable to predict for silver; train on multiple (ten) train/test splits, using cross validation to control the training
Model 4 - Set a random seed for reproducibility: every time the code is given the same seed, it produces the same random numbers (Godoy, 2022)
Model 4 - Fit the final model on ten folds (ten train/test sets), using cross validation to control the training
## + Fold01: intercept=TRUE
## - Fold01: intercept=TRUE
## + Fold02: intercept=TRUE
## - Fold02: intercept=TRUE
## + Fold03: intercept=TRUE
## - Fold03: intercept=TRUE
## + Fold04: intercept=TRUE
## - Fold04: intercept=TRUE
## + Fold05: intercept=TRUE
## - Fold05: intercept=TRUE
## + Fold06: intercept=TRUE
## - Fold06: intercept=TRUE
## + Fold07: intercept=TRUE
## - Fold07: intercept=TRUE
## + Fold08: intercept=TRUE
## - Fold08: intercept=TRUE
## + Fold09: intercept=TRUE
## - Fold09: intercept=TRUE
## + Fold10: intercept=TRUE
## - Fold10: intercept=TRUE
## Aggregating results
## Fitting final model on full training set
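Fold-by-fold log output of this kind is what caret's `train()` prints when `verboseIter = TRUE` is set in `trainControl()`. The sketch below shows one way such a ten-fold cross-validated linear model could be set up; the synthetic data, the seed value, and the object names `cv_control` and `cv_model` are assumptions, and the original report's exact call is not shown in this section.

```r
library(caret)

# Synthetic stand-in data (assumption): 499 rows, lead as the only predictor
set.seed(42)  # fixing the seed makes the random fold assignment repeatable
n <- 499
geochem <- data.frame(Pb_pct = runif(n, 0, 10))
geochem$Ag_ppm <- 80 + 30 * geochem$Pb_pct + rnorm(n, sd = 50)

# Ten-fold cross validation; verboseIter = TRUE prints the "+ Fold01" log lines
cv_control <- trainControl(method = "cv", number = 10, verboseIter = TRUE)

# Fit an ordinary least squares model inside the resampling loop
cv_model <- train(
  Ag_ppm ~ Pb_pct,
  data      = geochem,
  method    = "lm",
  trControl = cv_control
)

# RMSE averaged over the ten held-out folds
cv_model$results$RMSE

# Per-fold metrics, one row per fold
cv_model$resample
```

After resampling, `train()` refits the model on all rows passed in `data`, which corresponds to the "Fitting final model on full training set" line in the log.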
Model 4 - Predict on the testing set using the cross-validated model
## 1 2 3 4 5 6 7
## 170.63488 97.58981 115.73765 156.01579 99.70706 106.56291 104.09279
## 8 9 10 11 12 13 14
## 92.09505 216.76065 242.97421 141.90080 136.60767 125.97102 220.28940
## 15 16 17 18 19 20 21
## 189.03478 196.09228 207.68673 128.28991 157.78016 775.81508 775.81508
## 22 23 24 25 26 27 28
## 148.70624 350.85306 120.57708 198.10870 270.19597 198.10870 331.69700
## 29 30 31 32 33 34 35
## 294.39310 252.04813 235.41260 549.97524 884.19804 1007.70420 148.45418
## 36 37 38 39 40 41 42
## 115.33437 302.96291 156.51989 144.92544 113.87246 194.57995 237.42903
## 43 44 45 46 47 48 49
## 356.39823 353.37359 129.04607 257.08920 100.21116 151.22677 171.39104
## 50 51 52 53 54 55 56
## 272.21240 165.08971 243.47831 181.47318 89.77615 88.16301 242.97421
## 57 58 59 60 61 62 63
## 143.16106 209.19905 163.57739 124.10583 245.49474 273.22061 207.18262
## 64 65 66 67 68 69 70
## 268.68365 270.70008 244.48653 196.09228 129.04607 173.15542 154.75552
## 71 72 73 74 75 76 77
## 94.91804 101.82431 276.74936 229.86743 305.48345 391.68571 187.52246
## 78 79 80 81 82 83 84
## 82.09357 83.04633 81.86168 1085.84075 208.19084 204.15798 258.09741
## 85 86 87 88 89 90 91
## 161.56096 104.84895 165.08971 212.72780 196.59638 223.81815 167.61024
## 92 93 94 95 96 97 98
## 86.29782 596.85717 134.33919 130.81045 263.13848 297.41774 211.21548
## 99 100
## 173.15542 92.49833
Table 3-4. The 100 predicted values of silver (in ppm, parts per million) using lead as the predictor (independent) variable. These values are well within the range of possible silver values in the data set, where values > 1000 ppm Ag occasionally occur. The majority of the predicted values fall between 100 ppm Ag and 350 ppm Ag, which is reasonable for the data set used in the analysis. The data are plotted on the scatterplot below.
Figure 3-13. Scatterplot of the Model 4 predictions vs. the index. Model 4 - Fit a simple cross-validated model using lead to predict for silver, employing ten train/test sets.
Model 4 - Use the predictions on the test set above to calculate an error metric to see how the model performed. To do this, calculate the errors between the predicted Ag_ppm and the actual Ag_ppm by subtracting the actual Ag_ppm value from the predicted Ag_ppm value.
Model 4 - Calculate the out-of-sample error (RMSE) on the testing set
## [1] 134.7611
Conclusion Model 4 - Employing cross validation on multiple train/test sets returned an average RMSE of 134.76 ppm Ag, a value 10.78 ppm Ag lower than Model 3, where a single train/test set returned an out-of-sample RMSE of 145.54 ppm Ag on the testing set. Model 4 provides a more reliable assessment of the model’s performance using lead to predict for silver.
Suggestions for model tuning to support choosing the final model
Understanding the metrics of a numeric predictive model’s performance allows for a quick assessment of how well the model performed and provides insight into the accuracy of the predictions. Root-Mean-Square-Error (RMSE) is a single number in the units of the response, in this particular case silver units (parts-per-million), that represents how far off the predictions for silver are when using lead (Pb_pct) as the predictor.
RMSE measures how far the predicted values are from the actual values in a regression analysis. In other words, it describes how concentrated the data are around the line of best fit.
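For reference, the standard definition of RMSE can be written as below, where the symbols are introduced here for illustration: $y_i$ is the actual silver value for observation $i$, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}
```

Because the errors are squared before averaging, a few large misses (for example, on very high-grade silver samples) raise the RMSE more than many small ones.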
Compare the model’s performance using different statistical methods. In this analysis, an illustration was presented on how to build a simple model in the R-caret package using Ordinary Least Squares (OLS) Regression. In R-caret it is easy to compare models by changing the model method argument (Figure 4-1).
Figure 4-1. Suggested workflow for model training and tuning in R-caret (Kuhn, 2009)
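As a hedged illustration of comparing models by changing the `method` argument, the sketch below fits a linear model and a k-nearest-neighbours model on the same cross-validation folds and summarizes both with `resamples()`. The choice of `"knn"` as the second method, the synthetic data, and all object names are assumptions made for this example and are not part of the original study.

```r
library(caret)

# Synthetic stand-in data (assumption)
set.seed(42)
n <- 499
geochem <- data.frame(Pb_pct = runif(n, 0, 10))
geochem$Ag_ppm <- 80 + 30 * geochem$Pb_pct + rnorm(n, sd = 50)

# Build the ten training-row index sets once so both models see identical folds
folds <- createFolds(geochem$Ag_ppm, k = 10, returnTrain = TRUE)
shared_control <- trainControl(method = "cv", index = folds)

# Same formula, data, and folds; only the method argument changes
fit_lm  <- train(Ag_ppm ~ Pb_pct, data = geochem,
                 method = "lm",  trControl = shared_control)
fit_knn <- train(Ag_ppm ~ Pb_pct, data = geochem,
                 method = "knn", trControl = shared_control)

# Collect the per-fold metrics of both models and compare them
comparison <- resamples(list(lm = fit_lm, knn = fit_knn))
summary(comparison)
```

Sharing the fold indices means any difference in RMSE between the two models reflects the algorithms rather than a lucky or unlucky split.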
This study demonstrates the detailed steps taken in using supervised machine learning to build a simple numeric predictive model using linear regression in the R-caret package to explore relationships between independent predictor geochemical variables and the target variable silver. Two different model build approaches were presented and assessed for model performance: a single train/test split reporting both an in-sample error and an out-of-sample error, and cross validation, where the RMSE is averaged across multiple train/test splits to give a more precise estimate of the model’s performance.
For the various model runs, when all the variables were used to predict for silver, the average RMSE on the testing set was lower at 66.09 ppm Ag, a value close to the value on the training data at 53.22 ppm Ag, suggesting that the model did not over fit the training data too much (Figure 4-3). In comparison, Model 3, which used lead alone to predict for silver, returned an RMSE of 145.54 ppm Ag on the test set, a value close to the value on the training set at 142.96 ppm Ag, a difference of 2.58 ppm Ag, suggesting that the model did not over fit the training data. Using all variables rather than a single variable resulted in a more accurate prediction for the presence of silver.
Q1 - Is there a relationship between silver and other geochemical variables? Yes. There is a moderate positive linear relationship between silver (Ag_ppm) and lead (Pb_pct) (Figure 3-2) and between silver and antimony (Figure 3-3), with silver values increasing with lead and antimony values. For silver and gold (Figure 3-4), low gold values can occur with a wide range of silver values. This relationship is curious, as gold is known to occur with silver in the mineral species electrum (a natural gold-silver alloy). Further work is required to explore the relationship between gold and silver. Other elements have more erratic relationships with silver; for example, very high arsenic values can occur with low silver values (Figure 3-5).
Q2 - How strong is the relationship between silver and lead? A linear relationship is evident between silver and lead, with a correlation coefficient as high as 0.69, and similarly between silver and antimony at 0.64 in the subset used (Figures 3-8 and 3-9). Silver and gold show a weak relationship with a correlation coefficient of 0.10 (Figures 3-8 and 3-9). Does knowledge about these relationships help a geologist find silver? Yes. Galena, a lead sulfide (PbS) mineral, often hosts silver-bearing mineral species invisible to the naked eye.
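The correlation coefficients quoted above are the kind of value returned by base R's `cor()`. A minimal sketch follows, using synthetic stand-in vectors named after the study's columns (an assumption); the value it prints is therefore not the 0.69 reported for the real data.

```r
# Synthetic stand-in vectors (assumption), named after the study's columns
set.seed(42)
n <- 499
Pb_pct <- runif(n, 0, 10)
Ag_ppm <- 80 + 30 * Pb_pct + rnorm(n, sd = 50)

# Pearson correlation coefficient between silver and lead
r <- cor(Ag_ppm, Pb_pct)
r

# In simple linear regression, the squared correlation equals R-squared
fit <- lm(Ag_ppm ~ Pb_pct)
all.equal(r^2, summary(fit)$r.squared)

# Test of whether the correlation differs from zero, with a confidence interval
cor.test(Ag_ppm, Pb_pct)
```

Squaring the coefficient gives the share of variance in silver explained by lead alone; for the reported r of 0.69 that is roughly 48%.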
Q3 - Which geochemical elements have a relationship with silver? Of the geochemical predictor elements selected for review, lead (Pb) and antimony (Sb) have an obvious relationship with silver. In addition, gold (Au), arsenic (As), zinc (Zn) and manganese (Mn) have a weak relationship with silver at lower silver values (Figures 3-2 to 3-7).
Q4 - How large is the association between lead and silver? The association of lead with silver is moderately large (Figures 3-8 and 3-9).
Q5 - How accurately can we predict for the presence of silver with lead? The prediction for silver using lead had an RMSE of 145.54 ppm Ag on the test set (Figure 4-5). The prediction for silver using all variables in the data set had an RMSE of 66.09 ppm Ag on the test set, a lower value, suggesting that using all the variables can produce a more accurate prediction for silver.
Q6 - Is the relationship linear? There is a weak to moderate linear relationship between lead and silver, suggesting that linear regression is an appropriate tool; however, this relationship is evident mainly at lead values below 5% Pb. Values greater than 5% require further transformation (e.g., further binning) of either the predictor (lead) or the response variable (silver) to obtain a linear response.
Given that there are currently 238 training models available in caret, there are many options to explore beyond the linear regression model selected for this study.
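One way to check how many models an installed copy of caret exposes is to query `modelLookup()`, as sketched below. The count depends on the installed caret version, so it may differ from the 238 quoted above; the object names are assumptions for this example.

```r
library(caret)

# One row per model/tuning-parameter pair for every model caret knows about
model_info <- modelLookup()

# Number of distinct model codes that can be passed to train(method = ...)
n_models <- length(unique(model_info$model))
n_models

# Model codes that support regression (a numeric target such as Ag_ppm)
regression_models <- unique(model_info$model[model_info$forReg])
head(regression_models)
```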
Any model build should involve an open-minded approach to balancing time and funding as well as interpretability and accuracy. Adjustments to the parameters, methods, metrics and algorithms are always options. Reaching the best performance value with the smallest error is the goal, but is it required for the machine learning task in question? Randomness, size of subset (representative or not), number of train/test splits, evaluation metrics, and number of model runs all need to be taken into consideration during any model build.
Here are some thoughts:
Examining both numeric and categorical variables during exploratory data analysis helps the analyst get familiar with additional nuances of the data set.
Fitting a single model and reporting its in-sample error is cheaper than cross validation (CV), where multiple models are trained and the out-of-sample errors on the held-out test sets are averaged. Cross validation takes longer and hence is more costly, but the trade-off is a more precise estimate of model performance.
Using linear regression to build a predictive numeric model seeks a linear fit with all data points close to the regression line, suggesting more precise predicted values; however, outliers, those data points that fall far from the line, should not be ignored by the analyst (are they real and have they been verified?).
Plotting residuals against fitted model values may help to separate the actual strength of the predicted data values from noise.
Given that readers respond differently to the same statistical graphic depending on a variety of issues including state of mind (Whitson & Galinsky, 2008), traits common to most readers should be considered when choosing a visualization.
Matching the correct model (algorithm) to the data is an important consideration. The size of the training data should be large enough to make accurate predictions.
Predictions were generated using the predict() function, which returns a predicted response value for each observation. Linear regression is fairly easy to interpret based on how far the actual values fall from the regression line. It seems that linear regression provides lower accuracy but easier interpretability than many other models available in caret.
These points might point to future research in supervised machine learning and building numeric predictive models, statistical visualizations, the importance of exploring other statistical methods, and understanding the meaning of a model metric.
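One of the points above, plotting residuals against fitted values, can be sketched in base R as follows. The data are synthetic stand-ins and the object names are assumptions; with the real geochemistry data the pattern of points would differ.

```r
# Synthetic stand-in data (assumption)
set.seed(42)
n <- 499
geochem <- data.frame(Pb_pct = runif(n, 0, 10))
geochem$Ag_ppm <- 80 + 30 * geochem$Pb_pct + rnorm(n, sd = 50)

# Fit silver as a function of lead
fit <- lm(Ag_ppm ~ Pb_pct, data = geochem)

# Residual = actual value minus fitted value, one per observation
fitted_values <- fitted(fit)
residual_values <- resid(fit)

# Residuals vs fitted values; a structureless band around zero supports linearity
plot(fitted_values, residual_values,
     xlab = "Fitted Ag (ppm)", ylab = "Residual (ppm)",
     main = "Residuals vs fitted values")
abline(h = 0, lty = 2)  # reference line at zero residual
```

A funnel shape or a curve in this plot would hint that a straight line is not adequate over the whole range of the predictor, which relates to the question of lead values above 5% Pb raised in Q6.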
Figure 4-6. A representation of the trade-off between accuracy and interpretability (Source: O’Reilly Media, Inc.).
It is easy to lose the forest for the trees when conducting a meaningful machine learning task with the goal of a working model that produces reasonable prediction accuracy. Spending more time on data exploration than on experimenting with parameters, algorithms, functions, and other metrics was a mistake, as time became a factor.
I found this deliverable incredibly challenging, enlightening and enjoyable at the same time, and many rabbit holes were explored, some far too deeply. I learned about machine learning and the real meaning and power of an algorithm, supervised machine learning, and working in R-markdown, although I have a long way to go. I now feel like I have the tools to further explore the depths of the R-caret package, in particular the plethora of supervised machine learning numeric predictive statistical models as well as the unsupervised statistical models available.
If I had the option to start over, I would still choose the R-caret package, as I feel I have just scratched the surface of this complex and well-seasoned package that has been used successfully since its development in 2008 in the pharmaceutical world (e.g., Pfizer), an industry that needs accurate predictions from its machine learning tasks because the stakes are high.
At no point during the term project did I feel like I did not have the tools to solve a coding problem. I did, however, need to learn basic statistics, as I have never taken a statistics course. I now understand Ordinary Least Squares Linear Regression, in its one-predictor form also called Simple Linear Regression, and it turns out my assumptions were wrong on just how to interpret a basic scatterplot with a fitted regression line. The slope of the regression line (steep positive, shallow positive) describes how much the predicted response changes per unit change in the predictor; it does not by itself measure the strength of the relationship between the two variables, which is reflected in how tightly the points cluster around the line. I also understand why it is important to include residual plots to understand the spread between the actual values and predicted values and to make sure the linear model is adequate for the goal of the machine learning task.
The next package for me to dive into is the R-tidymodels meta-package, as it is setting the stage to supersede R-caret, although my first choice at this point is the package I am most familiar with, R-caret.
References
Alto, V. (2023, Feb 14). Understanding Ordinary Least Squares (OLS) Regression - Builtin. https://builtin.com/data-science/ols-regression
Bonham-Carter, G. F. (1994). Geographic information systems for geoscientists: Modelling with GIS. Pergamon Press. https://doi.org/10.1016/B978-0-08-041867-4.50001-1
Buja, A., Cook, D., Hofmann, H., Lawrence, M., Lee, E.K., Swayne, D. F., Wickham, H. (2009). Statistical Inference for Exploratory Data Analysis and model diagnostics. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 367(1906), 4361–4383. https://doi.org/10.1098/rsta.2009.0120
Carranza, E.J.M. (2009, July 14). Geochemical anomaly and mineral prospectivity mapping in GIS. Handbook of Exploration and Environmental Geochemistry, Vol 11, M. Hale (Series Editor).
Chang, W. (2018, Dec 18). R Graphics Cookbook. O’Reilly Media. Kindle Edition.
Contributed Packages. (2023, July 19). https://cran.r-project.org/
GitHub Pages| 5 Model Training and Tuning| The caret Package. GitHub. https://topepo.github.io/caret/model-training-and-tuning.html
GitHub Pages| 6 Available Models| The caret Package. GitHub. https://topepo.github.io/caret/available-models.html
Godoy, D. (2022, May 12). Random Seeds and Reproducibility. Towards Data Science. https://towardsdatascience.com/random-seeds-and-reproducibility-933da79446e3
Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5), 1–26. https://doi.org/10.18637/jss.v028.i05
Kuhn, M., Johnson, K. (2013). Applied Predictive Modeling. Springer. https://doi.org/10.1007/978-1-4614-6849-3
Kuhn, M. (2019, March 27). The caret Package. GitHub Pages. https://topepo.github.io/caret/
Lantz, B. (2019). Machine learning with R: Expert techniques for predictive modeling (3rd ed.). Packt Publishing, Birmingham. https://www.packtpub.com/product/machine-learning-with-r-third-edition/9781788295864
Leek, J. (2009). In sample and out of sample error. Github Pages. http://datasciencespecialization.github.io/courses/08_PracticalMachineLearning/004inOutSampleErrors/#1
Ludbrook, J. (2002). Statistical Techniques for Comparing Measurers and Methods of Measurement: A Critical Review. Clinical and Experimental Pharmacology and Physiology, 29(7), 527–536. https://doi.org/10.1046/j.1440-1681.2002.03686.x
Natural Resources Biometrics. Lumen Learning. (2023, July 17) Chapter 7: Correlation and Simple Linear Regression. https://courses.lumenlearning.com/suny-natural-resources-biometrics/chapter/chapter-7-correlation-and-simple-linear-regression/
O’Reilly Media, Inc. (2014). Strata Conference Santa Clara 2014: Complete Video Compilation. O’Reilly Online Learning. https://www.oreilly.com/library/view/strata-conference-santa/9781491900321/oreillyvideos1977272.html
Pardoe, I. (2020). Simple linear regression. Applied Regression Modeling (3rd ed.), 39–93. https://doi.org/10.1002/9781119615941.ch2
Patel, H. (2021, August 30). What is Feature Engineering - Importance, Tools and Techniques. Towards Data Science. https://towardsdatascience.com/what-is-feature-engineering-importance-tools-and-techniques-for-machine-learning-2080b0269f10
R Core Team (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org/.
Whitson, J. A., Galinsky, A. D. (2008). Lacking control increases illusory pattern perception. Science, 322(5898), 115–117. https://doi.org/10.1126/science.1159845
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis (2nd ed.). Springer International Publishing. Kindle Edition.