1 Project Description

This project is divided into two unrelated sections. The first compares the performance of several machine learning algorithms in predicting accident outcomes from factors associated with individual insurance applicants. The second explores the use of principal components analysis (PCA) to reduce the dimensionality of water quality measurements collected at stream sampling stations across New York State.

2 Machine Learning Classifiers

The purpose of this section is to estimate the probability that an individual (seeking auto insurance) will be in an accident. The classification models introduced here draw on a synthetic data set of ~8,000 observations. The response, ‘target_flag’, is a binary variable scored as 1 or 0, indicating that an individual has or has not been in an accident, respectively.

This work extends an earlier project that classified the same data using logistic regression; the results can be viewed online at Rpubs. In brief, the final logistic model produced an AUC of 0.82 and, at a cutoff threshold of 0.25, correctly classified 90% of automobiles associated with accidents, compared to 49% of those not associated with accidents.
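As a refresher, the sketch below shows how such a cutoff converts predicted probabilities into 0/1 class labels; simulated probabilities stand in for the earlier logistic model's predictions.

# minimal sketch of threshold-based classification
# (simulated probabilities, not the project's fitted model)

set.seed(1)
pred_prob <- runif(10)                        # stand-in predicted probabilities
pred_class <- ifelse(pred_prob >= 0.25, 1, 0) # 0.25 cutoff from the earlier project
table(pred_class)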

In this analysis, a diverse set of ‘machine learning’ classification algorithms are trained/tested and the results evaluated against those produced using the logistic model. They include:

  1. Logistic Regression (baseline model) - assumes a linear relationship between the predictors and the log-odds of the response; sensitive to over-dispersion and outliers.
  2. K-Nearest Neighbors - assumes similarity between observations is related to distance; sensitive to missing data, outliers, nominal variables, and imbalanced data
  3. Support Vector Machine - requires scaling/centering
  4. XGBoost - sensitive to multicollinearity
  5. Random Forest - sensitive to multicollinearity

Note that, unlike logistic regression, the machine learning models impose no parametric form on the relationship between the predictors and the response.

The data include the following variables:

Variable      Description
TARGET_FLAG   Was car in crash? 1 = Yes, 0 = No
TARGET_AMT    If crash, what was the cost
AGE           Age of driver
BLUEBOOK      Value of vehicle
CAR_AGE       Age of car
CAR_TYPE      Type of car
CAR_USE       Vehicle use
CLM_FREQ      # Claims (past 5 yrs)
EDUCATION     Max educational level
HOMEKIDS      # Children at home
HOME_VAL      Home value
INCOME        Annual income
JOB           Job category
KIDSDRIV      # Driving children
MSTATUS       Marital status
MVR_PTS       Motor vehicle record points
OLDCLAIM      Total claims (past 5 yrs)
PARENT1       Single parent
RED_CAR       A red car
REVOKED       License revoked (past 7 yrs)
SEX           Gender
TIF           Time as customer
TRAVTIME      Distance to work
URBANICITY    Home/work area
YOJ           Years on job
# Import libraries

library(tidyverse) # cleaning & wrangling functions
library(tidymodels) # classification
library(finetune) # faster grid searches
library(factoextra) # PCA
library(vip) # variable importance
library(janitor) # data cleaning
library(magrittr) # piping
library(flextable) #table layout and formatting
library(dlookr) # exploratory data analysis functions
library(corrplot) # correlation plots
library(themis) # downsampling
library(viridis) #coloration for box plots + ROC_AUC
library(patchwork) # easy plot layout

First we load the dataset from a GitHub repository.

# Read in data

path <- "https://raw.githubusercontent.com/sconnin/insurance_model/main/insurance_dataset.csv"

raw <- read_csv(path)

2.1 Data Cleaning & Exploratory Data Analysis

Data processing is required to put variables in tidy form for model building as well as to support initial exploration and visualization. Additional steps are taken to facilitate interpretation of feature variables.

The following processing steps are included here:

  • copying the raw data into a working dataframe
  • case consistency and removal of special characters
  • type conversion and re-leveling of select factor variables
  • removal of empty/duplicate rows and/or columns
  • subsetting to remove unnecessary columns
  • recoding factor levels for clarity
# Copy data into working dataframe for downstream use

df <- raw %>%
  clean_names() %>%                    # initial clean of column names
  remove_empty(c("rows", "cols")) %>%  # remove any empty rows and cols
  distinct() %>%                       # remove duplicate rows
  mutate_if(is_character, str_replace_all, '\\$|,|z_|<', '') %>%  # strip special chars from character variables
  dplyr::select(-index) %>%            # remove index
  mutate_if(is_character, str_replace_all, "No|no", 'N') %>%      # recode factor levels for clarity
  mutate_if(is_character, str_replace_all, "Yes|yes", 'Y') %>%
  mutate_if(is_character, str_replace_all, "Highly Urban/ Urban", 'Urban') %>%
  mutate_if(is_character, str_replace_all, "Highly Rural/ Rural", 'Rural') %>%
  mutate_at(vars(income, home_val, bluebook, oldclaim), as.numeric) %>%  # correct variable type: char to numeric
  mutate_if(is.numeric, round) %>%     # round the numeric variables
  mutate_if(is_character, ~ factor(.)) %>%  # convert character variables to factor for modeling
  mutate(education = fct_relevel(education, c("High School", "Bachelors", "Masters", "PhD"))) %>%  # relevel by educational attainment
  mutate(car_type = fct_relevel(car_type, c("Sports Car", "SUV", "Minivan", "Van", "Pickup", "Panel Truck"))) %>%
  mutate_at(vars(target_flag), factor)

2.2 Count and Frequency of Target Variable

Recall that the variable target_flag indicates whether or not an individual was in an accident (1 = Yes, 0 = No).

  • target_flag (level = 0) includes 6008 observations and accounts for ~74% of the data.

  • target_flag (level = 1) includes 2153 observations and accounts for ~26% of the data.

# Subset levels of target_flag into new dataframes for analyses

target_0 <- df%>%
    dplyr::select(-target_amt)%>%
    filter(target_flag == 0) # obs not associated with automobile accidents

target_1 <- df%>%
    dplyr::select(-target_amt)%>%
    filter(target_flag == 1)  # obs associated with automobile accidents

# Calculate the proportion of data in each subset of target_flag

df %>%
    group_by(target_flag) %>%
    summarise(n = n()) %>%
    mutate(freq = n / sum(n))%>%
    flextable()%>%
    set_caption("Table 1. Count and Frequency of Target Levels")

2.3 Summary Statistics

On the basis of the following statistical summaries we find:

  • There is one negative value (-3) for ‘car_age’ that should be investigated. This may be a data entry error.
  • A number of numerical variables have highly skewed distributions (compare mean & median).
  • A high percentage of zeros occur in the variables ‘kidsdriv’, ‘homekids’, ‘oldclaim’, ‘clm_freq’, and ‘mvr_pts’.
  • Outliers are common across the numerical variables, particularly for observations where target_flag = 0.
# Calculate summary statistics for numeric variables at each level of target_flag


target_0%>%
    dplyr::select(-target_flag)%>%
    diagnose_numeric()%>%
    dplyr::select(variables, min, mean, median, max, zero, minus, outlier)%>%
    flextable()%>%
    set_caption("Table 2. Summary Statistics for Target_Flag = 0")
target_1%>%
    dplyr::select(-target_flag)%>%
    diagnose_numeric()%>%
    dplyr::select(variables, min, mean, median, max, zero, minus, outlier)%>%
    flextable()%>%
    set_caption("Table 3. Summary Statistics for Target_Flag = 1")
target_0%>%
    diagnose_category()%>%
    flextable()%>%
    set_caption("Table 4. Summary Statistics for Categorical Variables: Target Level = 0")
target_1%>%
    diagnose_category()%>%
    flextable()%>%
    set_caption("Table 5. Summary Statistics for Categorical Variables: Target Level = 1")

We recode as missing the single observation where ‘car_age’ is < 0, given that a negative age is impossible. This appears to be a data entry error, as there are no other negative values in this variable.

# Recode the data entry error for car age (obs. value = -3) as missing

df$car_age[df$car_age < 0] <- NA

2.4 Numerical Distributions

The histograms shown in Figure 1 indicate that the distributions of ‘kidsdriv’, ‘homekids’, ‘yoj’, ‘income’, ‘oldclaim’, ‘clm_freq’, and ‘mvr_pts’ depart from a normal model.

From an interpretive standpoint, it’s worth noting the high counts of zero values for the covariates ‘tif’, ‘oldclaim’, ‘car_age’, ‘income’, and ‘home_val’, particularly the latter two. A zero value for ‘income’ may indicate that an individual is unemployed and/or a student/dependent. A zero value for ‘home_val’ may indicate that an individual is a renter, a dependent, or homeless.

These details can be important in establishing the costs and benefits of insuring a driver, and their absence (as in this dataset) can limit the discriminatory ability of a classification model.
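Were these distinctions needed as model inputs, one option (a sketch only; these features are not used in the models below) would be to encode the zeros as explicit indicator variables:

# sketch: flag zero values as candidate indicator features (not used below)
# note: NA values in income/home_val propagate to the new flags

df_flags <- df %>%
  mutate(
    zero_income   = as.integer(income == 0),    # possibly unemployed / dependent
    zero_home_val = as.integer(home_val == 0)   # possibly renter / dependent
  )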

df %>%
  dplyr::select(!target_flag) %>%
  keep(is.numeric) %>%                          # keep only numeric columns
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%  # convert to key-value pairs
  ggplot(aes(value)) +                          # plot the values
    facet_wrap(~ key, scales = "free") +        # in separate panels
    geom_histogram(fill = 'steelblue') +
  labs(title = 'Figure 1. Distributions of Numerical Covariates') +
  theme_minimal()

2.5 Outliers

The variables kidsdriv, oldclaim, homekids, clm_freq, and yoj contain the highest percentages of outliers (~6-17%, depending on target_flag level). The percentage of outliers is generally greater for observations associated with accidents (target_flag = 1) than for those without (target_flag = 0).

# Identify percentage of outliers for each covariate across levels of target_flag
    

diagnose_outlier(df) %>%
    filter(outliers_ratio > 5)%>%
    mutate(ratio_means= outliers_mean/with_mean)%>% # mean val of outliers/total mean
    arrange(desc(ratio_means))%>%
    dplyr::select(variables, outliers_ratio, ratio_means)%>%
    flextable()%>%
    set_caption("Table 6. Outlier Summary")

2.6 Missing Data

The dataset includes missing observations in one categorical variable (‘job’) and five numerical variables (‘age’, ‘car_age’, ‘home_val’, ‘income’, ‘yoj’). Across covariates (i.e., predictor variables), the extent of missingness is < 6.5% of the total observations. There are no obvious patterns in the missing data that warrant concern.

# Summarize missing data across the dataset
    
df%>%
    diagnose()%>%
    dplyr::select(-unique_count, -unique_rate)%>%
    arrange(desc(missing_count))%>%
    filter(missing_count>0)%>%
    flextable()%>%
    set_caption("Table 7. Missing Data Summary")

2.7 Pairwise Comparisons

Pairwise comparisons provide a method for identifying multicollinearity in the dataset. This is worth checking here because, as noted earlier, tree-based models can be sensitive to multicollinearity (particularly their variable importance measures).

Notable correlations within the dataset include:

  • moderate positive correlation between ‘home_val’ and ‘income’, coeff ~ 0.58
  • moderate positive correlation between ‘clm_freq’ and ‘oldclaim’, coeff ~ 0.49
  • moderate negative correlation between ‘age’ and ‘homekids’, coeff ~ -0.45
  • moderate positive correlation between ‘homekids’ and ‘kidsdriv’, coeff ~ 0.45
# Assess collinearity among numerical cols (NA rows removed; zero-inflated values retained)


corrdf<-df%>%
  na.omit()%>%
  dplyr::select(!c(target_flag, target_amt))%>%
  dplyr::select_if(is.numeric)%>%
  cor()

corrplot(corrdf, method = 'number', type='lower', title='Figure 2. Correlation Plot with Zero Values Included', mar=c(0,0,1,0)) # mar() makes title visible

We can summarize the results of the exploratory data analysis as follows:

  1. the response target_flag is unbalanced
  2. there are missing data among both continuous and categorical predictors (upwards of ~6%)
  3. outliers are present in the dataset
  4. there is moderate multicollinearity within numerical predictors
  5. the majority of predictors follow non-normal distributions
  6. zero inflation can be observed among a subset of predictors

A more complete data summary can be reviewed at Rpubs.

2.8 Model Building

The first step in model building is to create a randomized training (75%) and testing (25%) split.

set.seed(939864536)

# create random data split

data.split <- initial_split(df, prop = .75, strata = target_flag)

# create train and test sets from split

train.data <- training(data.split)
test.data<- testing(data.split)

Then we establish a resampling framework for parameter tuning and performance evaluation using 10-fold cross-validation.

set.seed(11003)

folds<-
  vfold_cv(train.data, 
           v=10,
           strata= target_flag)

Data preprocessing steps are incorporated prior to model construction using the tidymodels recipes package. Some of the selected machine learning algorithms can tolerate missing data, but others (notably KNN) cannot, and missingness can also cause grid-based parameter tuning to fail due to a lack of convergence. For these reasons, missing value imputation is included in the preprocessing recipe for all of the models examined here.

The following preprocessing steps have also been addressed:

  • data normalization
  • Yeo-Johnson transformation
  • dummy coding (categorical features)
  • removal of near zero variance features
  • removal of highly correlated features
  • downsampling to address class imbalance
# build recipe

data.recipe<-
  recipe(target_flag ~., data=train.data)%>%
  step_rm(target_amt)%>% # drop the cost outcome, which is not a predictor
  step_impute_bag(all_numeric_predictors())%>% #impute numerical data
  step_impute_mode(all_nominal_predictors())%>% #impute nominal data
  step_YeoJohnson(all_numeric_predictors())%>% # approximate near-normal distributions (optional here)
  step_normalize(all_numeric_predictors())%>% # center and scale numerical vars 
  step_dummy(all_nominal_predictors(), -all_outcomes(), one_hot = TRUE)%>% # keep ref levels with one-hot coding
  step_nzv(all_numeric_predictors())%>% # remove predictors with near-zero variance
  step_corr(all_predictors(), threshold = 0.5, method = 'spearman')%>% # address collinearity
  themis::step_downsample(target_flag) # rebalance classes (applied to training data only)

# review the preprocessed training data

prepped.data <- 
  data.recipe %>% # use the recipe object
  prep() %>% # perform the recipe on training data
  juice() # extract only the preprocessed dataframe 
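A quick check (sketch) confirms that downsampling balanced the classes in the preprocessed training data:

# verify class balance after downsampling (counts should be equal)

prepped.data %>%
  count(target_flag)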

Model development can be outlined as follows (see associated code):

  1. Model specification and selection of tuning parameters
# logistic model

log.spec <- 
  logistic_reg(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet")

# KNN 

knn.spec <- 
  nearest_neighbor(neighbors = tune()) %>% # we can adjust the number of neighbors 
  set_engine("kknn") %>% 
  set_mode("classification")

#Support Vector Machine

svm.spec <-
  svm_rbf(
  cost = tune(),
  rbf_sigma = tune(), 
  margin = tune()) %>%  
  set_engine("kernlab") %>% 
  set_mode("classification") %>% 
  translate()

# XGboost

xgb.spec <- 
  boost_tree(
    learn_rate=tune(),
    tree_depth=tune(),
    trees=1000,
    mtry=tune()
  ) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")%>%
  translate

#Random Forest

rf.spec <- 
  rand_forest(
    mtry=tune(),
    trees=1000,
    min_n=tune(),
  ) %>%
  set_mode("classification")%>%
  set_engine("ranger", importance = "impurity")
  2. Workflows to link model recipes and specifications in a streamlined process
models.wf<-
  workflow_set(
    preproc=list(Models = data.recipe),
    models = list(Logistic = log.spec, KNN=knn.spec, SVM.radial = svm.spec, XGBoost = xgb.spec, R.Forest = rf.spec)
  )
  3. Grid design, parameter tuning, and training fit.
# control settings for racing (speeds grid estimation)

race.ctrl <-
   control_race(
      save_pred = TRUE,
      save_workflow = TRUE
   )


grid.results <-
   models.wf %>%
   workflow_map(     
      'tune_race_anova', # racing-based tuning via ANOVA (finetune)
      seed = 052191211,
      resamples = folds,
      grid = 20, # grid size of 20
      control = race.ctrl,
      metrics = metric_set(roc_auc, accuracy), #recall, precision, f_meas, accuracy, kap, roc_auc, sens, spec
      verbose = TRUE
   )

# save results to local directory

#save(grid.results, file='models.Rdata')

#load('models.Rdata')

# rank results by .metric

ranks<-rank_results(grid.results)
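The top of this ranking can be previewed directly; a quick sketch:

# preview the top-ranked results for accuracy

ranks %>%
  filter(.metric == 'accuracy') %>%
  slice_head(n = 5)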

With initial parameter estimation and model fitting on the training data complete, we can evaluate model performance based on classification accuracy. On this basis, XGBoost is the highest performing model (Figure 3 and Table 8).

# plot accuracy tuning results for each model

autoplot(
  grid.results,
  rank_metric = "accuracy",  # <- how to order models
  metric = "accuracy",       # <- which metric to visualize
  select_best = TRUE) +     # <- one point per workflow
  labs(title='Figure 3. Estimated Accuracy for Classification Model Tuning Results', x='')+
  theme_light()

# select best model for each category based on accuracy

rank_results(grid.results, rank_metric = "accuracy", select_best=TRUE)%>%
  flextable()%>%
  set_caption('Table 8. Model Performance Metrics for Grid Tuning')

2.9 Model Evaluation

For the purpose of this analysis, we will update the top three models (XGBoost, Random Forest, Logistic Regression) to include parameter values optimized via the resampling process, refit the models to the full training data, and then evaluate each model against the held-out test set.

# select the best parameter set for each model type based on accuracy

boost.best<- grid.results %>% 
   extract_workflow_set_result('Models_XGBoost') %>% 
   select_best(metric = "accuracy")

rforest.best<-
  grid.results %>% 
   extract_workflow_set_result('Models_R.Forest') %>% 
   select_best(metric = "accuracy")

logistic.best<-
  grid.results %>% 
   extract_workflow_set_result('Models_Logistic') %>% 
   select_best(metric = "accuracy")

Table 9 shows the final accuracy scores and other related performance metrics for model evaluation against the test data. On this basis, the Random Forest model ranked highest, followed by XGBoost and the logistic model. The differences in performance, however, are relatively small, a point supported by the overlap in the ROC curves displayed in Figure 4.

It is also worth noting that none of the models appears to have over-fit the data, as would be indicated by markedly lower accuracy on the test data than on the training data (a quick check is sketched after the refitting code below).

# Finalize each workflow and evaluate on the held-out test split

boost.test<-
  grid.results%>%
  extract_workflow('Models_XGBoost') %>% 
  finalize_workflow(boost.best) %>% 
  last_fit(split = data.split)

rforest.test<-
  grid.results%>%
  extract_workflow('Models_R.Forest') %>% 
  finalize_workflow(rforest.best) %>% 
  last_fit(split = data.split)

logistic.test<-
  grid.results%>%
  extract_workflow('Models_Logistic') %>% 
  finalize_workflow(logistic.best) %>% 
  last_fit(split = data.split)
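As a rough overfitting check (a sketch using functions already shown above), the cross-validation accuracy can be set beside the test-set metrics:

# sketch: compare resampling (training) accuracy with test-set accuracy

rank_results(grid.results, rank_metric = "accuracy", select_best = TRUE) %>%
  filter(.metric == "accuracy") %>%
  dplyr::select(wflow_id, mean)       # cross-validation estimates

collect_metrics(rforest.test)         # test-set accuracy and roc_auc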

custom_metrics <- metric_set(accuracy, sens, spec, precision, recall, f_meas, kap, mcc)

metrics1<-rforest.test%>% 
  collect_predictions()%>%
  group_by(id)%>%
  custom_metrics(truth = target_flag, estimate = .pred_class)%>%
  select(.metric, .estimate)%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  pivot_wider(names_from = Metric, values_from = Estimate)%>%
  select(accuracy, sens, spec, precision, recall, f_meas)%>%
  rename(Accuracy = accuracy, Sensitivity=sens, Specificity=spec, Precision=precision, Recall=recall, F1=f_meas)%>%
  mutate(Model='Random Forest', .before=Accuracy)

metrics2<-boost.test%>% 
  collect_predictions()%>%
  group_by(id)%>%
  custom_metrics(truth = target_flag, estimate = .pred_class)%>%
  select(.metric, .estimate)%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  pivot_wider(names_from = Metric, values_from = Estimate)%>%
  select(accuracy, sens, spec, precision, recall, f_meas)%>%
  rename(Accuracy = accuracy, Sensitivity=sens, Specificity=spec, Precision=precision, Recall=recall, F1=f_meas)%>%
  mutate(Model='XGBoost', .before=Accuracy)

metrics3<-logistic.test%>% 
  collect_predictions()%>%
  group_by(id)%>%
  custom_metrics(truth = target_flag, estimate = .pred_class)%>%
  select(.metric, .estimate)%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  pivot_wider(names_from = Metric, values_from = Estimate)%>%
  select(accuracy, sens, spec, precision, recall, f_meas)%>%
  rename(Accuracy = accuracy, Sensitivity=sens, Specificity=spec, Precision=precision, Recall=recall, F1=f_meas)%>%
  mutate(Model='Logistic', .before=Accuracy)

bind_rows(metrics1, metrics2, metrics3)%>%
  flextable()%>%
  set_caption('Table 9. Evaluation Metrics: Test Evaluation')
auc.boost<-boost.test%>% 
  collect_predictions()%>%
  group_by(id) %>% # id identifies the train/test split
  roc_curve(target_flag, .pred_0) %>% 
  mutate(model='XGBoost')

auc.rforest<-rforest.test%>% 
  collect_predictions()%>%
  group_by(id) %>% # id identifies the train/test split
  roc_curve(target_flag, .pred_0)%>%
  mutate(model='Random Forest')


auc.logistic<-logistic.test%>% 
  collect_predictions()%>%
  group_by(id) %>% # id identifies the train/test split
  roc_curve(target_flag, .pred_0) %>% 
  mutate(model='Logistic Regression')

# plot aggregate ROC curves for the final models

bind_rows(auc.boost, auc.rforest, auc.logistic)%>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + 
  geom_path(lwd = 1.5, alpha = 0.8) +
  geom_abline(lty = 3) + 
  coord_equal()+
  labs(title='Figure 4. Comparison of Area Under the Curve for Test Predictions', y='Sensitivity (TPR)', x='1-Specificity (FPR)')+
  theme_minimal()

Variable importance rankings are shown for each model in Figures 5-7. In each case, urbanicity had the most influence on the predicted classification, followed by either home_val (Random Forest, XGBoost) or Job/Student (Logistic Regression). This is consistent with automobile accidents occurring more frequently in dense urban environments than in less populated areas. The cost of these claims, however, is not considered here.

Figures 8-10 display the confusion matrix produced by each model on the test data. With a threshold criterion of 0.5, the models favor correct classification of true negatives (0 = no accident) over true positives (1 = accident). An insurer seeking to minimize claim payouts would likely adjust the criterion (downward in this case) to favor correct classification of the latter; a sketch follows the confusion-matrix plots below.

# Plot VIP for final Models

p1<-rforest.test%>%
  extract_fit_parsnip()%>%
  vip(num_features=20)+
  labs(title='Fig. 5. Random Forest', y='Variable Importance')+
  theme_minimal()+
  theme(plot.title=element_text(size=10))

p2<-boost.test%>%
  extract_fit_parsnip()%>%
  vip(num_features=20)+
  labs(title='Fig. 6. XGBoost', y='Variable Importance')+
  theme_minimal()+
  theme(plot.title=element_text(size=10))

p3<-logistic.test%>%
  extract_fit_parsnip()%>%
  vip(num_features=20, fill='midnightblue')+
  labs(title='Fig. 7. Logistic Regression',y='Variable Importance')+
  theme_minimal()+
  theme(plot.title=element_text(size=10))

# combine vip plots

p1|p2|p3

# Plot confusion matrix for final Models 

p4<-boost.test%>% 
  collect_predictions()%>%
  conf_mat(target_flag, .pred_class)%>%
  autoplot(type = "heatmap")+
  labs(title='Fig. 8. Confusion Matrix', subtitle= 'XGBoost')+
  theme(plot.title=element_text(size=12))

p5<-rforest.test%>% 
  collect_predictions()%>%
  conf_mat(target_flag, .pred_class)%>%
  autoplot(type = "heatmap")+
  labs(title='Fig. 9. Confusion Matrix', subtitle='Random Forest')+
  theme(plot.title=element_text(size=12))

p6<-logistic.test%>% 
  collect_predictions()%>%
  conf_mat(target_flag, .pred_class)%>%
  autoplot(type = "heatmap")+
  labs(title='Fig. 10. Confusion Matrix', subtitle='Logistic Regression')+
  theme(plot.title=element_text(size=12))

p4|p5|p6
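Following the earlier note on threshold choice, the sketch below (assuming the .pred_1 probability column returned by collect_predictions()) re-derives the Random Forest confusion matrix at a 0.25 cutoff:

# sketch: Random Forest confusion matrix at a 0.25 cutoff

rforest.test %>%
  collect_predictions() %>%
  mutate(pred_25 = factor(ifelse(.pred_1 >= 0.25, 1, 0), levels = c(0, 1))) %>%
  conf_mat(truth = target_flag, estimate = pred_25)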

2.10 Classification Summary

This assessment sought to classify drivers based on their probability of an auto accident using a select set of ML algorithms, and then to compare the results to those produced by a baseline logistic regression model. Surprisingly, the ML models produced results comparable to those of the logistic classifier - a review/assessment of model assumptions for the latter can be found on Rpubs.

Given its interpretability and comparable accuracy, the logistic classifier remains a reasonable choice for accident prediction with the data available for this work.

3 Principal Components Analysis

This section highlights the use of Principal Components Analysis (PCA) to reduce the dimensionality of complex numerical data while minimizing information loss. The PCA technique creates new, uncorrelated variables (components) that correspond to linear combinations of the original data and successively maximize variance.

PCA is not intended for analysis of categorical variables. For datasets that include a mix of continuous and categorical features, multiple factor analysis is recommended.
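As a minimal, self-contained illustration (using R’s built-in iris measurements rather than the stream data), the sketch below confirms that the components produced by prcomp() are uncorrelated linear combinations with successively decreasing variance:

# minimal PCA illustration on built-in data

x <- scale(iris[, 1:4])    # center and scale four numeric columns
pca <- prcomp(x)

head(x %*% pca$rotation)   # scores are linear combinations of the inputs (same as head(pca$x))
round(cor(pca$x), 3)       # components are uncorrelated (off-diagonals ~ 0)
round(pca$sdev^2, 3)       # variances decrease across components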

To demonstrate the use of PCA, we will evaluate the water chemistry of streams distributed across New York State. The data were downloaded in .csv format from the NYS Department of Environmental Conservation website.

3.1 Data Processing

Some data processing is required prior to use of PCA. Steps include:

  1. Loading the dataset.
chem<-read_csv('Streams_ Chemistry.csv')

chem%>%head(5)%>%
  flextable()%>%
  set_caption('Table 10: Stream Water Chemistry from Sites in New York State')
  2. Cleaning the data and selecting numerical variables for comparison.
  3. Creating groupings by site and chemical measurement.
  4. Calculating the median value within groups - for comparison across locations and time.
chem %<>%
  select(site_id, sample_date, parameter_name, parameter_fraction, result_value, parameter_unit)%>%
  filter(parameter_fraction == 'TOTAL')%>%
  group_by(site_id, parameter_name, parameter_fraction)%>%
  summarise(median(result_value))%>%
  select(-parameter_fraction)%>%
  rename(id = site_id, parameter = parameter_name, median_value = 'median(result_value)')%>%
  pivot_wider(names_from = parameter, values_from = median_value)%>%
  filter_at(vars('NITROGEN, TOTAL', ALUMINUM, POTASSIUM, SODIUM, 'TOTAL SOLIDS'),all_vars(!is.na(.)))%>%
  select(1:35)%>%
  select(!c(ARSENIC, CADMIUM, LEAD, 'NITROGEN, AMMONIA (AS N)', 'NITROGEN, NITRITE', SILVER, ZINC, FLUORIDE, MERCURY, 'TOTAL SUSPENDED SOLIDS'))%>%
  mutate(id=str_remove(id,'01-'))%>%
  mutate_at(c(2:25), as.numeric)

id<-chem%>%
  select(id)
  5. Removing variables with > 30% missing data.
  6. Applying median imputation to fill missing values within columns.
# build a function for median imputation across columns

f <- function(x){
  x <- as.numeric(as.character(x))        # coerce the column to numeric (it may arrive as factor/character)
  x[is.na(x)] <- median(x, na.rm = TRUE)  # replace NAs with the column median
  x                                       # return the imputed column
}

# conduct median imputation

chem=data.frame(apply(chem,2,f))%>%
  mutate_at(c(2:25), as.numeric)%>%
  select(!id)

# combine imputed data and original id

chem<-cbind(id, chem)

#convert id to rowname

chem%<>%
  column_to_rownames('id')

3.2 Applying PCA

For the purposes of this work, the R packages stats and factoextra will be used for PCA evaluation and related visualizations. It’s worth noting that the factoextra package supports a wide variety of multivariate data analyses, including:

  1. Principal Component Analysis (PCA), which is used to summarize the information contained in continuous (i.e., quantitative) multivariate data by reducing the dimensionality of the data without losing important information.

  2. Correspondence Analysis (CA), which is an extension of principal component analysis suited to analyzing a large contingency table formed by two qualitative (categorical) variables.

  3. Multiple Correspondence Analysis (MCA), which is an adaptation of CA to a data table containing more than two categorical variables.

  4. Multiple Factor Analysis (MFA), dedicated to datasets where variables are organized into groups (qualitative and/or quantitative variables).

  5. Hierarchical Multiple Factor Analysis (HMFA), an extension of MFA for situations where the data are organized into a hierarchical structure.

  6. Factor Analysis of Mixed Data (FAMD), a particular case of MFA dedicated to analyzing a data set containing both quantitative and qualitative variables.

For more information, see the factoextra website.

In this step, we compute the PCA components.

# compute the principal components (variables centered and scaled)

components <- prcomp(chem, scale = TRUE)

# flip the sign of the loadings (the sign of each component is arbitrary)

components$rotation <- -1*components$rotation

# show components

components$rotation%>%
  as.data.frame()

3.3 Components and Variables

Figure 11 shows the percentage of variance in the dataset accounted for by the first 10 principal components. The first four components alone explain ~ 71% of this variance.
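This figure can be verified directly from the prcomp object; a minimal sketch:

# proportion of variance explained by each component

var_explained <- components$sdev^2 / sum(components$sdev^2)
round(var_explained[1:4], 3)        # individual proportions for the first four components
round(cumsum(var_explained)[4], 3)  # cumulative proportion, cf. ~71% noted above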

The importance of individual chemical measures to each component is highlighted in Figures 12-15.

  • PCA1 is dominated by solids and high water hardness, which may indicate groundwater influence.
  • PCA2 shows the influence of nitrogen and phosphorus, which is often indicative of non-point source pollution (e.g., agricultural settings).
  • PCA3 is dominated by metals (Al & Fe) and high turbidity, which may be the result of high flows or runoff in acidic environments (e.g., near mine tailings, waste sites, or where there are acidic soils).
  • PCA4 is dominated by sulfates and may indicate a predominantly groundwater source; this is consistent with the high levels of hardness.

Figure 16 offers an additional view of variable influence for PCA1 and PCA2. Note that the length of each arrow reflects a variable’s contribution, while small angles between arrows suggest collinearity. For example, potassium, chloride, magnesium, and sodium appear highly (positively) correlated. Also note that phosphorus and nitrogen are distributed along the negative range of PCA2, which distinguishes them from streams with high relative concentrations of potassium, chloride, magnesium, and sodium.

# plot the proportion of variance accounted for by each component

fviz_screeplot(components, addlabels = TRUE, ylim = c(0, 36))+
  labs(title='Figure 11. Proportion of Variance Explained by Individual PCA Components (Dimensions)')

# assess variable importance for each of the first 4 components (dimensions)

v1<-fviz_contrib(components, choice = "var", axes = 1, top = 10, title='Figure 12. Contribution of Variables to PCA1')
v2<-fviz_contrib(components, choice = "var", axes = 2, top = 10, title='Figure 13. Contribution of Variables to PCA2')
v3<-fviz_contrib(components, choice = "var", axes = 3, top = 10, title='Figure 14. Contribution of Variables to PCA3')
v4<-fviz_contrib(components, choice = "var", axes = 4, top = 10, title='Figure 15. Contribution of Variables to PCA4')

v1|v2

v3|v4

# Graph of variables. Positively correlated variables point to the same side of the plot; negatively correlated variables point to opposite sides.

fviz_pca_var(components,
             alpha.var = "contrib",
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             title = 'Figure 16. Influence of Variables on PCA1 and PCA2',
             repel = TRUE     # Avoid text overlapping
)

3.4 Site Comparisons

We can also look at how stream sampling sites are distributed relative to the components (Figure 17). For example, site 12-Caya-4.7 scores high on PCA2 (indicative of high nitrogen and phosphorus loads) and is below the mid-range for PCA1, which suggests the influence of surface water (rain & runoff) contributions.

Figure 18 can be used to distinguish or group locations in terms of their water chemistry.
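An individual station’s scores can also be read directly from the prcomp object; a sketch (the row name is assumed to match the cleaned site id cited above):

# sketch: component scores for a single station
# (the row name below is assumed; it must match a rowname of 'chem')

components$x["12-Caya-4.7", 1:2]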

#Graph of individuals. Individuals with a similar profile are grouped together.

fviz_pca_ind(components,
             col.ind = "contrib", # Color by contribution to the PCs
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             title='Figure 17. Distribution of Stream Locations (Water Quality) based on PCA1 and PCA2',
             repel = TRUE     # Avoid text overlapping
)

fviz_pca_ind(components,
             col.ind = "cos2", # Color by the quality of representation (cos2)
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             title='Figure 17 (alternate). Stream Locations based on PCA1 and PCA2, Colored by Quality of Representation',
             repel = TRUE     # Avoid text overlapping
)

#Biplot of individuals and variables
fviz_pca_biplot(components, repel = TRUE,
                title='Figure 18. Distribution of Stream Locations based on Water Quality: from PCA1 and PCA2',
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969"  # Individuals color
                )

3.5 PCA Summary

The preceding analysis shows that PCA can be used as an unsupervised method to quickly identify associations among water quality measurements that have interpretive value, as well as patterns of association between streams based on their chemical composition. Future assessments might incorporate qualitative measurements of stream condition, along with other multivariate approaches, to describe and classify these sites.