1 Project Description

This project is divided into two unrelated sections. The first compares the performance of several machine learning algorithms in predicting accident outcomes from factors associated with individual insurance applicants. The second explores the use of principal components analysis (PCA) to reduce the dimensionality of water quality measurements collected at stream sampling stations across New York State.

2 Machine Learning Classifiers

The purpose of this section is to estimate the probability that an individual (seeking auto insurance) will be in an accident. The classification models introduced here draw on a synthetic data set of ~8,000 observations. The response, ‘target_flag’, is a binary variable scored as 1 or 0, indicating that an individual has or has not been in an accident, respectively.

This work extends an earlier project that classified the same data using logistic regression; the results can be viewed online at Rpubs. In brief, the final logistic model produced an AUC of 0.82 and, at a cutoff threshold of 0.25, correctly classified 90% of automobiles associated with accidents, compared to 49% of those not associated with accidents.
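As a refresher, the sketch below shows how such a cutoff converts predicted probabilities into 0/1 class labels; simulated probabilities stand in for the earlier logistic model's predictions.

# minimal sketch of threshold-based classification
# (simulated probabilities, not the project's fitted model)

set.seed(1)
pred_prob <- runif(10)                        # stand-in predicted probabilities
pred_class <- ifelse(pred_prob >= 0.25, 1, 0) # 0.25 cutoff from the earlier project
table(pred_class)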

In this analysis, a diverse set of ‘machine learning’ classification algorithms are trained/tested and the results evaluated against those produced using the logistic model. They include:

  1. Logistic Regression (baseline model) - assumes a linear relationship between the predictors and the log-odds of the response; sensitive to over-dispersion and outliers.
  2. K-Nearest Neighbors - assumes similarity between observations is related to distance; sensitive to missing data, outliers, nominal variables, and imbalanced data
  3. Support Vector Machine - requires scaling/centering
  4. XGBoost - sensitive to multicollinearity
  5. Random Forest - sensitive to multicollinearity

Note that, unlike logistic regression, the machine learning models impose no parametric form on the relationship between the predictors and the response.

The data include the following variables:

Variable      Description
TARGET_FLAG   Was car in crash? 1 = Yes, 0 = No
TARGET_AMT    If crash, what was the cost
AGE           Age of driver
BLUEBOOK      Value of vehicle
CAR_AGE       Age of car
CAR_TYPE      Type of car
CAR_USE       Vehicle use
CLM_FREQ      # Claims (past 5 yrs)
EDUCATION     Max educational level
HOMEKIDS      # Children at home
HOME_VAL      Home value
INCOME        Annual income
JOB           Job category
KIDSDRIV      # Driving children
MSTATUS       Marital status
MVR_PTS       Motor vehicle record points
OLDCLAIM      Total claims (past 5 yrs)
PARENT1       Single parent
RED_CAR       A red car
REVOKED       License revoked (past 7 yrs)
SEX           Gender
TIF           Time as customer
TRAVTIME      Distance to work
URBANICITY    Home/work area
YOJ           Years on job
# Import libraries

library(tidyverse) # cleaning & wrangling functions
library(tidymodels) # classification
library(finetune) # faster grid searches
library(factoextra) # PCA
library(vip) # variable importance
library(janitor) # data cleaning
library(magrittr) # piping
library(flextable) #table layout and formatting
library(dlookr) # exploratory data analysis functions
library(corrplot) # correlation plots
library(themis) # downsampling
library(viridis) #coloration for box plots + ROC_AUC
library(patchwork) # easy plot layout

First we load the dataset from a GitHub repository.

# Read in data

path <- "https://raw.githubusercontent.com/sconnin/insurance_model/main/insurance_dataset.csv"

raw <- read_csv(path)

2.1 Data Cleaning & Exploratory Data Analysis

Data processing is required to put variables in tidy form for model building as well as to support initial exploration and visualization. Additional steps are taken to facilitate interpretation of feature variables.

The following processing steps are included here:

  • copying the raw data into a working dataframe
  • case consistency and removal of special characters
  • type conversion and re-leveling of select factor variables
  • removal of empty/duplicate rows and/or columns
  • subsetting to remove unnecessary columns
  • recoding factor levels for clarity
# Copy data into working dataframe for downstream use

df <- raw %>%
  clean_names() %>%                    # initial clean of column names
  remove_empty(c("rows", "cols")) %>%  # remove any empty rows and cols
  distinct() %>%                       # remove duplicate rows
  mutate_if(is_character, str_replace_all, '\\$|,|z_|<', '') %>%  # strip special chars from character variables
  dplyr::select(-index) %>%            # remove index
  mutate_if(is_character, str_replace_all, "No|no", 'N') %>%      # recode factor levels for clarity
  mutate_if(is_character, str_replace_all, "Yes|yes", 'Y') %>%
  mutate_if(is_character, str_replace_all, "Highly Urban/ Urban", 'Urban') %>%
  mutate_if(is_character, str_replace_all, "Highly Rural/ Rural", 'Rural') %>%
  mutate_at(vars(income, home_val, bluebook, oldclaim), as.numeric) %>%  # correct variable type: char to numeric
  mutate_if(is.numeric, round) %>%     # round the numeric variables
  mutate_if(is_character, ~ factor(.)) %>%  # convert character variables to factor for modeling
  mutate(education = fct_relevel(education, c("High School", "Bachelors", "Masters", "PhD"))) %>%  # relevel by educational attainment
  mutate(car_type = fct_relevel(car_type, c("Sports Car", "SUV", "Minivan", "Van", "Pickup", "Panel Truck"))) %>%
  mutate_at(vars(target_flag), factor)

2.2 Count and Frequency of Target Variable

Recall that the variable target_flag indicates whether or not an individual was in an accident (1 = Yes, 0 = No).

  • target_flag (level = 0) includes 6008 observations and accounts for ~74% of the data.

  • target_flag (level = 1) includes 2153 observations and accounts for ~26% of the data.

# Subset levels of target_flag into new dataframes for analyses

target_0 <- df%>%
    dplyr::select(-target_amt)%>%
    filter(target_flag == 0) # obs not associated with automobile accidents

target_1 <- df%>%
    dplyr::select(-target_amt)%>%
    filter(target_flag == 1)  # obs associated with automobile accidents

# Calculate the proportion of data in each subset of target_flag

df %>%
    group_by(target_flag) %>%
    summarise(n = n()) %>%
    mutate(freq = n / sum(n))%>%
    flextable()%>%
    set_caption("Table 1. Count and Frequency of Target Levels")

2.3 Summary Statistics

On the basis of the following statistical summaries we find:

  • There is one negative value (-3) for ‘car_age’ that should be investigated. This may be a data entry error.
  • A number of numerical variables have highly skewed distributions (compare mean & median).
  • A high percentage of zeros occur in the variables ‘kidsdriv’, ‘homekids’, ‘oldclaim’, ‘clm_freq’, and ‘mvr_pts’.
  • Outliers are common across the numerical variables, particularly for observations where target_flag = 0.
# Calculate summary statistics for numeric variables at each level of target_flag


target_0%>%
    dplyr::select(-target_flag)%>%
    diagnose_numeric()%>%
    dplyr::select(variables, min, mean, median, max, zero, minus, outlier)%>%
    flextable()%>%
    set_caption("Table 2. Summary Statistics for Target_Flag = 0")
target_1%>%
    dplyr::select(-target_flag)%>%
    diagnose_numeric()%>%
    dplyr::select(variables, min, mean, median, max, zero, minus, outlier)%>%
    flextable()%>%
    set_caption("Table 3. Summary Statistics for Target_Flag = 1")
target_0%>%
    diagnose_category()%>%
    flextable()%>%
    set_caption("Table 4. Summary Statistics for Categorical Variables: Target Level = 0")
target_1%>%
    diagnose_category()%>%
    flextable()%>%
    set_caption("Table 5. Summary Statistics for Categorical Variables: Target Level = 1")

We recode as missing the single observation where ‘car_age’ is < 0, given that a negative age is impossible. This appears to be a data entry error, as there are no other negative values in this variable.

# Recode the data entry error for car age (obs. value = -3) as missing

df$car_age[df$car_age < 0] <- NA

2.4 Numerical Distributions

The histograms shown in Figure 1 indicate that the distributions of ‘kidsdriv’, ‘homekids’, ‘yoj’, ‘income’, ‘oldclaim’, ‘clm_freq’, and ‘mvr_pts’ depart from a normal model.

From an interpretive standpoint, it’s worth noting the high counts of zero values for the covariates ‘tif’, ‘oldclaim’, ‘car_age’, ‘income’, and ‘home_val’, particularly the latter two. A zero value for ‘income’ may indicate that an individual is unemployed and/or a student/dependent. A zero value for ‘home_val’ may indicate that an individual is a renter, a dependent, or homeless.

These details can be important in establishing the costs and benefits of insuring a driver, and their absence (as in this dataset) can limit the discriminatory ability of a classification model.
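Were these distinctions needed as model inputs, one option (a sketch only; these features are not used in the models below) would be to encode the zeros as explicit indicator variables:

# sketch: flag zero values as candidate indicator features (not used below)
# note: NA values in income/home_val propagate to the new flags

df_flags <- df %>%
  mutate(
    zero_income   = as.integer(income == 0),    # possibly unemployed / dependent
    zero_home_val = as.integer(home_val == 0)   # possibly renter / dependent
  )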

df %>%
  dplyr::select(!target_flag) %>%
  keep(is.numeric) %>%                          # keep only numeric columns
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%  # convert to key-value pairs
  ggplot(aes(value)) +                          # plot the values
    facet_wrap(~ key, scales = "free") +        # in separate panels
    geom_histogram(fill = 'steelblue') +
  labs(title = 'Figure 1. Distributions of Numerical Covariates') +
  theme_minimal()

2.5 Outliers

The variables kidsdriv, oldclaim, homekids, clm_freq, and yoj contain the highest percentages of outliers (~6-17%, depending on target_flag level). The percentage of outliers is generally greater for observations associated with accidents (target_flag = 1) than for those without (target_flag = 0).

# Identify percentage of outliers for each covariate across levels of target_flag
    

diagnose_outlier(df) %>%
    filter(outliers_ratio > 5)%>%
    mutate(ratio_means= outliers_mean/with_mean)%>% # mean val of outliers/total mean
    arrange(desc(ratio_means))%>%
    dplyr::select(variables, outliers_ratio, ratio_means)%>%
    flextable()%>%
    set_caption("Table 6. Outlier Summary")

2.6 Missing Data

The dataset includes missing observations in one categorical variable (‘job’) and five numerical variables (‘age’, ‘car_age’, ‘home_val’, ‘income’, ‘yoj’). Across covariates (i.e., predictor variables), the extent of missingness is < 6.5% of the total observations. There are no obvious patterns in the missing data that warrant concern.

# Summarize missing data across the dataset
    
df%>%
    diagnose()%>%
    dplyr::select(-unique_count, -unique_rate)%>%
    arrange(desc(missing_count))%>%
    filter(missing_count>0)%>%
    flextable()%>%
    set_caption("Table 7. Missing Data Summary")

2.7 Pairwise Comparisons

Pairwise comparisons provide a method for identifying multicollinearity in the dataset. This is worth checking here because, as noted earlier, tree-based models can be sensitive to multicollinearity (particularly their variable importance measures).

Notable correlations within the dataset include:

  • moderate positive correlation between ‘home_val’ and ‘income’, coeff ~ 0.58
  • moderate positive correlation between ‘clm_freq’ and ‘oldclaim’, coeff ~ 0.49
  • moderate negative correlation between ‘age’ and ‘homekids’, coeff ~ -0.45
  • moderate positive correlation between ‘homekids’ and ‘kidsdriv’, coeff ~ 0.45
# Assess collinearity among numerical cols (NA rows removed; zero-inflated values retained)


corrdf<-df%>%
  na.omit()%>%
  dplyr::select(!c(target_flag, target_amt))%>%
  dplyr::select_if(is.numeric)%>%
  cor()

corrplot(corrdf, method = 'number', type='lower', title='Figure 2. Correlation Plot with Zero Values Included', mar=c(0,0,1,0)) # mar() makes title visible

We can summarize the results of the exploratory data analysis as follows:

  1. the response target_flag is unbalanced
  2. there are missing data among both continuous and categorical predictors (upwards of ~6%)
  3. outliers are present in the dataset
  4. there is moderate multicollinearity within numerical predictors
  5. the majority of predictors follow non-normal distributions
  6. zero inflation can be observed among a subset of predictors

A more complete data summary can be reviewed at Rpubs.

2.8 Model Building

The first step in model building is to create a randomized training (75%) and testing (25%) split.

set.seed(939864536)

# create random data split

data.split <- initial_split(df, prop = .75, strata = target_flag)

# create train and test sets from split

train.data <- training(data.split)
test.data<- testing(data.split)

Then we establish a resampling framework for parameter tuning and performance evaluation using 10-fold cross-validation.

set.seed(11003)

folds<-
  vfold_cv(train.data, 
           v=10,
           strata= target_flag)

Data preprocessing steps are incorporated prior to model construction using the tidymodels recipes package. Some of the selected machine learning algorithms can tolerate missing data, but others (notably KNN) cannot, and missingness can also cause grid-based parameter tuning to fail due to a lack of convergence. For these reasons, missing value imputation is included in the preprocessing recipe for all of the models examined here.

The following preprocessing steps have also been addressed:

  • data normalization
  • Yeo-Johnson transformation
  • dummy coding (categorical features)
  • removal of near zero variance features
  • removal of highly correlated features
  • downsampling to address class imbalance
# build recipe

data.recipe<-
  recipe(target_flag ~., data=train.data)%>%
  step_rm(target_amt)%>% # drop the cost outcome, which is not a predictor
  step_impute_bag(all_numeric_predictors())%>% #impute numerical data
  step_impute_mode(all_nominal_predictors())%>% #impute nominal data
  step_YeoJohnson(all_numeric_predictors())%>% # approximate near-normal distributions (optional here)
  step_normalize(all_numeric_predictors())%>% # center and scale numerical vars 
  step_dummy(all_nominal_predictors(), -all_outcomes(), one_hot = TRUE)%>% # keep ref levels with one-hot coding
  step_nzv(all_numeric_predictors())%>% # remove predictors with near-zero variance
  step_corr(all_predictors(), threshold = 0.5, method = 'spearman')%>% # address collinearity
  themis::step_downsample(target_flag) # rebalance classes (applied to training data only)

# review the preprocessed training data

prepped.data <- 
  data.recipe %>% # use the recipe object
  prep() %>% # perform the recipe on training data
  juice() # extract only the preprocessed dataframe 
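A quick check (sketch) confirms that downsampling balanced the classes in the preprocessed training data:

# verify class balance after downsampling (counts should be equal)

prepped.data %>%
  count(target_flag)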

Model development can be outlined as follows (see associated code):

  1. Model specification and selection of tuning parameters
# logistic model

log.spec <- 
  logistic_reg(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet")

# KNN 

knn.spec <- 
  nearest_neighbor(neighbors = tune()) %>% # we can adjust the number of neighbors 
  set_engine("kknn") %>% 
  set_mode("classification")

#Support Vector Machine

svm.spec <-
  svm_rbf(
  cost = tune(),
  rbf_sigma = tune(), 
  margin = tune()) %>%  
  set_engine("kernlab") %>% 
  set_mode("classification") %>% 
  translate()

# XGboost

xgb.spec <- 
  boost_tree(
    learn_rate=tune(),
    tree_depth=tune(),
    trees=1000,
    mtry=tune()
  ) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")%>%
  translate

#Random Forest

rf.spec <- 
  rand_forest(
    mtry=tune(),
    trees=1000,
    min_n=tune(),
  ) %>%
  set_mode("classification")%>%
  set_engine("ranger", importance = "impurity")
  2. Workflows to link model recipes and specifications in a streamlined process
models.wf<-
  workflow_set(
    preproc=list(Models = data.recipe),
    models = list(Logistic = log.spec, KNN=knn.spec, SVM.radial = svm.spec, XGBoost = xgb.spec, R.Forest = rf.spec)
  )
  3. Grid design, parameter tuning, and training fit.
# control settings for racing (speeds grid estimation)

race.ctrl <-
   control_race(
      save_pred = TRUE,
      save_workflow = TRUE
   )


grid.results <-
   models.wf %>%
   workflow_map(     
      'tune_race_anova', # racing-based tuning via ANOVA (finetune)
      seed = 052191211,
      resamples = folds,
      grid = 20, # grid size of 20
      control = race.ctrl,
      metrics = metric_set(roc_auc, accuracy), #recall, precision, f_meas, accuracy, kap, roc_auc, sens, spec
      verbose = TRUE
   )

# save results to local directory

#save(grid.results, file='models.Rdata')

#load('models.Rdata')

# rank results by .metric

ranks<-rank_results(grid.results)
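The top of this ranking can be previewed directly; a quick sketch:

# preview the top-ranked results for accuracy

ranks %>%
  filter(.metric == 'accuracy') %>%
  slice_head(n = 5)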

With initial parameter estimation and model fitting on the training data complete, we can evaluate model performance based on classification accuracy. On this basis, XGBoost is the highest performing model (Figure 3 and Table 8).

# plot accuracy tuning results for each model

autoplot(
  grid.results,
  rank_metric = "accuracy",  # <- how to order models
  metric = "accuracy",       # <- which metric to visualize
  select_best = TRUE) +     # <- one point per workflow
  labs(title='Figure 3. Estimated Accuracy for Classification Model Tuning Results', x='')+
  theme_light()

# select best model for each category based on accuracy

rank_results(grid.results, rank_metric = "accuracy", select_best=TRUE)%>%
  flextable()%>%
  set_caption('Table 8. Model Performance Metrics for Grid Tuning')

2.9 Model Evaluation

For the purpose of this analysis, we will update the top three models (XGBoost, Random Forest, Logistic Regression) to include parameter values optimized via the resampling process, refit the models to the full training data, and then evaluate each model against the held-out test set.

# select the best parameter set for each model type based on accuracy

boost.best<- grid.results %>% 
   extract_workflow_set_result('Models_XGBoost') %>% 
   select_best(metric = "accuracy")

rforest.best<-
  grid.results %>% 
   extract_workflow_set_result('Models_R.Forest') %>% 
   select_best(metric = "accuracy")

logistic.best<-
  grid.results %>% 
   extract_workflow_set_result('Models_Logistic') %>% 
   select_best(metric = "accuracy")

Table 9 shows the final accuracy scores and other related performance metrics for model evaluation against the test data. On this basis, the Random Forest model ranked highest, followed by XGBoost and the logistic model. The differences in performance, however, are relatively small, a point supported by the overlap in the ROC curves displayed in Figure 4.

It is also worth noting that none of the models appears to have over-fit the data, as would be indicated by markedly lower accuracy on the test data than on the training data (a quick check is sketched after the refitting code below).

# Finalize each workflow and evaluate on the held-out test split

boost.test<-
  grid.results%>%
  extract_workflow('Models_XGBoost') %>% 
  finalize_workflow(boost.best) %>% 
  last_fit(split = data.split)

rforest.test<-
  grid.results%>%
  extract_workflow('Models_R.Forest') %>% 
  finalize_workflow(rforest.best) %>% 
  last_fit(split = data.split)

logistic.test<-
  grid.results%>%
  extract_workflow('Models_Logistic') %>% 
  finalize_workflow(logistic.best) %>% 
  last_fit(split = data.split)
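As a rough overfitting check (a sketch using functions already shown above), the cross-validation accuracy can be set beside the test-set metrics:

# sketch: compare resampling (training) accuracy with test-set accuracy

rank_results(grid.results, rank_metric = "accuracy", select_best = TRUE) %>%
  filter(.metric == "accuracy") %>%
  dplyr::select(wflow_id, mean)       # cross-validation estimates

collect_metrics(rforest.test)         # test-set accuracy and roc_auc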

custom_metrics <- metric_set(accuracy, sens, spec, precision, recall, f_meas, kap, mcc)

metrics1<-rforest.test%>% 
  collect_predictions()%>%
  group_by(id)%>%
  custom_metrics(truth = target_flag, estimate = .pred_class)%>%
  select(.metric, .estimate)%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  pivot_wider(names_from = Metric, values_from = Estimate)%>%
  select(accuracy, sens, spec, precision, recall, f_meas)%>%
  rename(Accuracy = accuracy, Sensitivity=sens, Specificity=spec, Precision=precision, Recall=recall, F1=f_meas)%>%
  mutate(Model='Random Forest', .before=Accuracy)

metrics2<-boost.test%>% 
  collect_predictions()%>%
  group_by(id)%>%
  custom_metrics(truth = target_flag, estimate = .pred_class)%>%
  select(.metric, .estimate)%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  pivot_wider(names_from = Metric, values_from = Estimate)%>%
  select(accuracy, sens, spec, precision, recall, f_meas)%>%
  rename(Accuracy = accuracy, Sensitivity=sens, Specificity=spec, Precision=precision, Recall=recall, F1=f_meas)%>%
  mutate(Model='XGBoost', .before=Accuracy)

metrics3<-logistic.test%>% 
  collect_predictions()%>%
  group_by(id)%>%
  custom_metrics(truth = target_flag, estimate = .pred_class)%>%
  select(.metric, .estimate)%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  pivot_wider(names_from = Metric, values_from = Estimate)%>%
  select(accuracy, sens, spec, precision, recall, f_meas)%>%
  rename(Accuracy = accuracy, Sensitivity=sens, Specificity=spec, Precision=precision, Recall=recall, F1=f_meas)%>%
  mutate(Model='Logistic', .before=Accuracy)

bind_rows(metrics1, metrics2, metrics3)%>%
  flextable()%>%
  set_caption('Table 9. Evaluation Metrics: Test Evaluation')
auc.boost<-boost.test%>% 
  collect_predictions()%>%
  group_by(id) %>% # id identifies the train/test split
  roc_curve(target_flag, .pred_0) %>% 
  mutate(model='XGBoost')

auc.rforest<-rforest.test%>% 
  collect_predictions()%>%
  group_by(id) %>% # id identifies the train/test split
  roc_curve(target_flag, .pred_0)%>%
  mutate(model='Random Forest')


auc.logistic<-logistic.test%>% 
  collect_predictions()%>%
  group_by(id) %>% # id identifies the train/test split
  roc_curve(target_flag, .pred_0) %>% 
  mutate(model='Logistic Regression')

# plot aggregate ROC curves for the final models

bind_rows(auc.boost, auc.rforest, auc.logistic)%>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + 
  geom_path(lwd = 1.5, alpha = 0.8) +
  geom_abline(lty = 3) + 
  coord_equal()+
  labs(title='Figure 4. Comparison of Area Under the Curve for Test Predictions', y='Sensitivity (TPR)', x='1-Specificity (FPR)')+
  theme_minimal()

Variable importance rankings are shown for each model in Figures 5-7. In each case, urbanicity had the most influence on the predicted classification, followed by either home_val (Random Forest, XGBoost) or Job/Student (Logistic Regression). This is consistent with automobile accidents occurring more frequently in dense urban environments than in less populated areas. The cost of these claims, however, is not considered here.

Figures 8-10 display the confusion matrix produced by each model on the test data. With a threshold criterion of 0.5, the models favor correct classification of true negatives (0 = no accident) over true positives (1 = accident). An insurer seeking to minimize claim payouts would likely adjust the criterion (downward in this case) to favor correct classification of the latter; a sketch follows the confusion-matrix plots below.

# Plot VIP for final Models

p1<-rforest.test%>%
  extract_fit_parsnip()%>%
  vip(num_features=20)+
  labs(title='Fig. 5. Random Forest', y='Variable Importance')+
  theme_minimal()+
  theme(plot.title=element_text(size=10))

p2<-boost.test%>%
  extract_fit_parsnip()%>%
  vip(num_features=20)+
  labs(title='Fig. 6. XGBoost', y='Variable Importance')+
  theme_minimal()+
  theme(plot.title=element_text(size=10))

p3<-logistic.test%>%
  extract_fit_parsnip()%>%
  vip(num_features=20, fill='midnightblue')+
  labs(title='Fig. 7. Logistic Regression',y='Variable Importance')+
  theme_minimal()+
  theme(plot.title=element_text(size=10))

# combine vip plots

p1|p2|p3

# Plot confusion matrix for final Models 

p4<-boost.test%>% 
  collect_predictions()%>%
  conf_mat(target_flag, .pred_class)%>%
  autoplot(type = "heatmap")+
  labs(title='Fig. 8. Confusion Matrix', subtitle= 'XGBoost')+
  theme(plot.title=element_text(size=12))

p5<-rforest.test%>% 
  collect_predictions()%>%
  conf_mat(target_flag, .pred_class)%>%
  autoplot(type = "heatmap")+
  labs(title='Fig. 9. Confusion Matrix', subtitle='Random Forest')+
  theme(plot.title=element_text(size=12))

p6<-logistic.test%>% 
  collect_predictions()%>%
  conf_mat(target_flag, .pred_class)%>%
  autoplot(type = "heatmap")+
  labs(title='Fig. 10. Confusion Matrix', subtitle='Logistic Regression')+
  theme(plot.title=element_text(size=12))

p4|p5|p6
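Following the earlier note on threshold choice, the sketch below (assuming the .pred_1 probability column returned by collect_predictions()) re-derives the Random Forest confusion matrix at a 0.25 cutoff:

# sketch: Random Forest confusion matrix at a 0.25 cutoff

rforest.test %>%
  collect_predictions() %>%
  mutate(pred_25 = factor(ifelse(.pred_1 >= 0.25, 1, 0), levels = c(0, 1))) %>%
  conf_mat(truth = target_flag, estimate = pred_25)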

2.10 Classification Summary

This assessment sought to classify drivers based on their probability of an auto accident using a select set of ML algorithms, and then to compare the results to those produced by a baseline logistic regression model. Surprisingly, the ML models produced results comparable to those of the logistic classifier - a review/assessment of model assumptions for the latter can be found on Rpubs.

Given its interpretability and comparable accuracy, the logistic classifier remains a reasonable choice for accident prediction with the data available for this work.

3 Principal Components Analysis

This section highlights the use of Principal Components Analysis (PCA) to reduce the dimensionality of complex numerical data while minimizing information loss. The PCA technique creates new, uncorrelated variables (components) that correspond to linear combinations of the original data and successively maximize variance.

PCA is not intended for analysis of categorical variables. For datasets that include a mix of continuous and categorical features, multiple factor analysis is recommended.
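As a minimal, self-contained illustration (using R’s built-in iris measurements rather than the stream data), the sketch below confirms that the components produced by prcomp() are uncorrelated linear combinations with successively decreasing variance:

# minimal PCA illustration on built-in data

x <- scale(iris[, 1:4])    # center and scale four numeric columns
pca <- prcomp(x)

head(x %*% pca$rotation)   # scores are linear combinations of the inputs (same as head(pca$x))
round(cor(pca$x), 3)       # components are uncorrelated (off-diagonals ~ 0)
round(pca$sdev^2, 3)       # variances decrease across components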

To demonstrate the use of PCA, we will evaluate the water chemistry of streams distributed across New York State. The data were downloaded in .csv format from the NYS Department of Environmental Conservation website.

3.1 Data Processing

Some data processing is required prior to use of PCA. Steps include:

  1. Loading the dataset.
chem<-read_csv('Streams_ Chemistry.csv')

chem%>%head(5)%>%
  flextable()%>%
  set_caption('Table 10: Stream Water Chemistry from Sites in New York State')
  2. Cleaning the data and selecting numerical variables for comparison.
  3. Creating groupings by site and chemical measurement.
  4. Calculating the median value within groups - for comparison across locations and time.
chem %<>%
  select(site_id, sample_date, parameter_name, parameter_fraction, result_value, parameter_unit)%>%
  filter(parameter_fraction == 'TOTAL')%>%
  group_by(site_id, parameter_name, parameter_fraction)%>%
  summarise(median(result_value))%>%
  select(-parameter_fraction)%>%
  rename(id = site_id, parameter = parameter_name, median_value = 'median(result_value)')%>%
  pivot_wider(names_from = parameter, values_from = median_value)%>%
  filter_at(vars('NITROGEN, TOTAL', ALUMINUM, POTASSIUM, SODIUM, 'TOTAL SOLIDS'),all_vars(!is.na(.)))%>%
  select(1:35)%>%
  select(!c(ARSENIC, CADMIUM, LEAD, 'NITROGEN, AMMONIA (AS N)', 'NITROGEN, NITRITE', SILVER, ZINC, FLUORIDE, MERCURY, 'TOTAL SUSPENDED SOLIDS'))%>%
  mutate(id=str_remove(id,'01-'))%>%
  mutate_at(c(2:25), as.numeric)

id<-chem%>%
  select(id)
  5. Removing variables with > 30% missing data.
  6. Applying median imputation to fill missing values within columns.
# build a function for median imputation across columns

f <- function(x){
  x <- as.numeric(as.character(x))        # coerce the column to numeric (it may arrive as factor/character)
  x[is.na(x)] <- median(x, na.rm = TRUE)  # replace NAs with the column median
  x                                       # return the imputed column
}

# conduct median imputation

chem=data.frame(apply(chem,2,f))%>%
  mutate_at(c(2:25), as.numeric)%>%
  select(!id)

# combine imputed data and original id

chem<-cbind(id, chem)

#convert id to rowname

chem%<>%
  column_to_rownames('id')

3.2 Applying PCA

For the purposes of this work, the R packages stats and factoextra will be used for PCA evaluation and related visualizations. It’s worth noting that the factoextra package supports a wide variety of multivariate data analyses, including:

  1. Principal Component Analysis (PCA), which is used to summarize the information contained in continuous (i.e., quantitative) multivariate data by reducing the dimensionality of the data without losing important information.

  2. Correspondence Analysis (CA), which is an extension of principal component analysis suited to analyzing a large contingency table formed by two qualitative (categorical) variables.

  3. Multiple Correspondence Analysis (MCA), which is an adaptation of CA to a data table containing more than two categorical variables.

  4. Multiple Factor Analysis (MFA), dedicated to datasets where variables are organized into groups (qualitative and/or quantitative variables).

  5. Hierarchical Multiple Factor Analysis (HMFA), an extension of MFA for situations where the data are organized into a hierarchical structure.

  6. Factor Analysis of Mixed Data (FAMD), a particular case of MFA dedicated to analyzing a data set containing both quantitative and qualitative variables.

For more information, see the factoextra website.

In this step, we compute the PCA components.

# compute the principal components (variables centered and scaled)

components <- prcomp(chem, scale = TRUE)

# flip the sign of the loadings (the sign of each component is arbitrary)

components$rotation <- -1*components$rotation

# show components

components$rotation%>%
  as.data.frame()

3.3 Components and Variables

Figure 11 shows the percentage of variance in the dataset accounted for by the first 10 principal components. The first four components alone explain ~ 71% of this variance.
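This figure can be verified directly from the prcomp object; a minimal sketch:

# proportion of variance explained by each component

var_explained <- components$sdev^2 / sum(components$sdev^2)
round(var_explained[1:4], 3)        # individual proportions for the first four components
round(cumsum(var_explained)[4], 3)  # cumulative proportion, cf. ~71% noted above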

The importance of individual chemical measures to each component is highlighted in Figures 12-15.

  • PCA1 is dominated by solids and high water hardness, which may indicate groundwater influence.
  • PCA2 shows the influence of nitrogen and phosphorus, which is often indicative of non-point source pollution (e.g., agricultural settings).
  • PCA3 is dominated by metals (Al & Fe) and high turbidity, which may be the result of high flows or runoff in acidic environments (e.g., near mine tailings, waste sites, or where there are acidic soils).
  • PCA4 is dominated by sulfates and may indicate a predominantly groundwater source; this is consistent with the high levels of hardness.

Figure 16 offers an additional view of variable influence for PCA1 and PCA2. Note that the length of each arrow reflects a variable’s contribution, while small angles between arrows suggest collinearity. For example, potassium, chloride, magnesium, and sodium appear highly (positively) correlated. Also note that phosphorus and nitrogen are distributed along the negative range of PCA2, which distinguishes them from streams with high relative concentrations of potassium, chloride, magnesium, and sodium.

# plot the proportion of variance accounted for by each component

fviz_screeplot(components, addlabels = TRUE, ylim = c(0, 36))+
  labs(title='Figure 11. Proportion of Variance Explained by Individual PCA Components (Dimensions)')

# assess variable importance for each of the first 4 components (dimensions)

v1<-fviz_contrib(components, choice = "var", axes = 1, top = 10, title='Figure 12. Contribution of Variables to PCA1')
v2<-fviz_contrib(components, choice = "var", axes = 2, top = 10, title='Figure 13. Contribution of Variables to PCA2')
v3<-fviz_contrib(components, choice = "var", axes = 3, top = 10, title='Figure 14. Contribution of Variables to PCA3')
v4<-fviz_contrib(components, choice = "var", axes = 4, top = 10, title='Figure 15. Contribution of Variables to PCA4')

v1|v2

v3|v4

# Graph of variables. Positively correlated variables point to the same side of the plot; negatively correlated variables point to opposite sides.

fviz_pca_var(components,
             alpha.var = "contrib",
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             title = 'Figure 16. Influence of Variables on PCA1 and PCA2',
             repel = TRUE     # Avoid text overlapping
)

3.4 Site Comparisons

We can also look at how stream sampling sites are distributed relative to the components (Figure 17). For example, site 12-Caya-4.7 scores high on PCA2 (indicative of high nitrogen and phosphorus loads) and is below the mid-range for PCA1, which suggests the influence of surface water (rain & runoff) contributions.

Figure 18 can be used to distinguish or group locations in terms of their water chemistry.
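An individual station’s scores can also be read directly from the prcomp object; a sketch (the row name is assumed to match the cleaned site id cited above):

# sketch: component scores for a single station
# (the row name below is assumed; it must match a rowname of 'chem')

components$x["12-Caya-4.7", 1:2]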

#Graph of individuals. Individuals with a similar profile are grouped together.

fviz_pca_ind(components,
             col.ind = "contrib", # Color by contribution to the PCs
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             title='Figure 17. Distribution of Stream Locations (Water Quality) based on PCA1 and PCA2',
             repel = TRUE     # Avoid text overlapping
)

fviz_pca_ind(components,
             col.ind = "cos2", # Color by the quality of representation (cos2)
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             title='Figure 17 (alternate). Stream Locations based on PCA1 and PCA2, Colored by Quality of Representation',
             repel = TRUE     # Avoid text overlapping
)

#Biplot of individuals and variables
fviz_pca_biplot(components, repel = TRUE,
                title='Figure 18. Distribution of Stream Locations based on Water Quality: from PCA1 and PCA2',
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969"  # Individuals color
                )

3.5 PCA Summary

The preceding analysis shows that PCA can be used as an unsupervised method to quickly identify associations among water quality measurements that have interpretive value, as well as patterns of association between streams based on their chemical composition. Future assessments might incorporate qualitative measurements of stream condition, along with other multivariate approaches, to describe and classify these sites.