This project is divided into two unrelated sections. The first compares the performance of various machine learning algorithms in relation to their ability to predict accident outcomes based on factors associated with individual insurance applicants. The second explores the the use of principal components analysis to effectively reduce the dimensionality of water quality measurements from a wide variety of stream sampling stations in New York State.
The purpose of this section is to estimate the probability that an individual (seeking auto insurance) will be in an accident. The classification models introduced here draw on a synthetic data set of ~ 8000 observations. The response, ‘target_flag’, is a binary variable scored as either a 1 or 0 indicating that an individual has or has not been in an accident, respectively.
This work extends an earlier project to classify the same data using logistic regression. The results can be viewed online at . In brief, the final logistic model produced an AUC of .82 and, setting a cutoff threshold of 0.25, correctly classified 90% of automobiles associated with accidents, as compared to 49% of those not associated with accidents.
In this analysis, a diverse set of ‘machine learning’ classification algorithms are trained/tested and the results evaluated against those produced using the logistic model. They include:
Note that the machine learning models make no assumptions about the underlying distribution of the data. This is not the case with logistic regression.
The data include the following variables:
| Variable | Description | 
|---|---|
| Target FLAG | Was car in crash? 1=Yes, 0=No | 
| Target_ AMT | If crash, what was the cost | 
| AGE | Age of driver | 
| BLUEBOOK | Value of vehicle | 
| CAR_AGE | Age of car | 
| CAR_TYPE | Type of car | 
| CAR_USE | Vehicle use | 
| CLM_FREQ | # Claims (past 5 yrs) | 
| EDUCATION | Max educational level | 
| HOMEKIDS | # Children at home | 
| HOME_VAL | Home value | 
| INCOME | Annual income | 
| JOB | Job category | 
| KIDSDRIVE | # Driving children | 
| MSTATUS | Martial status | 
| MVR_PTS | Motor vehicle record points | 
| OLDCLAIM | Total claims (past 5 yrs) | 
| PARENT1 | Single parent | 
| RED_CAR | A red car | 
| REVOKED | License revoked (past 7 yrs) | 
| SEX | Gender | 
| TIF | Time as customer | 
| TRAVTIME | Distance to work | 
| URBANICITY | Home/work area | 
| YOJ | Years on job | 
# Import libraries
library(tidyverse) # cleaning & wrangling functions
library(tidymodels) # classification
library(finetune) # faster grid searches
library(factoextra) # PCA
library(vip) # variable importance
library(janitor) # data cleaning
library(magrittr) # piping
library(flextable) #table layout and formatting
library(dlookr) # exploratory data analysis functions
library(corrplot)
library(themis) # downsampling
library(viridis) #coloration for box plots + ROC_AUC
library(patchwork) # easy plot layoutFirst we load the dataset from a Github respository.
# Read in data
path <- "https://raw.githubusercontent.com/sconnin/insurance_model/main/insurance_dataset.csv"
raw <- read_csv(path)Data processing is required to put variables in tidy form for model building as well as to support initial exploration and visualization. Additional steps are taken to facilitate interpretation of feature variables.
The following processing steps are included here:
# Copy data into working dataframe for downstream use
df<-raw%>%
  
  clean_names%>% # initial clean of col names
    
  remove_empty(c("rows", "cols"))%>%  # remove any empty rows and cols
    
  distinct()%>%     # remove duplicate rows
    
  mutate_if(is_character, str_replace_all, '\\$|,|z_|<', '')%>%  # clean any special chars in character variables
  dplyr::select(-index)%>%  # remove index
  mutate_if(is_character, str_replace_all, "No|no",'N')%>%  
    
  mutate_if(is_character, str_replace_all, "Yes|yes",'Y')%>%
    
  mutate_if(is_character, str_replace_all, "Highly Urban/ Urban",'Urban')%>%
    
  mutate_if(is_character, str_replace_all, "Highly Rural/ Rural",'Rural')%>%
    
  mutate_at(vars(income, home_val, bluebook, oldclaim), funs(as.numeric))%>%   # correct variable type: char to numeric
  mutate_if(is.numeric, round)%>%  # round out our numerics
    
  mutate_if(is_character, ~(factor(.)))%>%  # convert all character variables to factor for modeling
    
  mutate(education = fct_relevel(education, c("High School", "Bachelors", "Masters", "PhD")))%>% # relevel to show educational attainment steps
    
  mutate(car_type = fct_relevel(car_type, c("Sports Car", "SUV", "Minivan", "Van", "Pickup", "Panel Truck")))%>%
  
  mutate_at(vars(target_flag), factor)Recall that the variable target_flag indicates whether or not an individual was in an accident (1 = Yes, 0 = No).
target_flag (level = 0) includes 6008 observations and accounts for ~74% of the data.
target_flag (level = 1) includes 2153 observations and accounts for ~26% of the data.
# Subset levels of target_flag into new dataframes for analyses
target_0 <- df%>%
    dplyr::select(-target_amt)%>%
    filter(target_flag == 0) # obs not associated with automobile accidents
target_1 <- df%>%
    dplyr::select(-target_amt)%>%
    filter(target_flag == 1)  # obs associated with automobile accidents
# Calculate the proportion of data in each subset of target_flag
df %>%
    group_by(target_flag) %>%
    summarise(n = n()) %>%
    mutate(freq = n / sum(n))%>%
    flextable()%>%
    set_caption("Table 1. Count and Frequency of Target Levels")| target_flag | n | freq | 
| 0 | 6,008 | 0.7361843 | 
| 1 | 2,153 | 0.2638157 | 
On the basis of the following statistical summaries we find:
# Calculate statistics summary for numeric variables at each level of target_flag    
target_0%>%
    dplyr::select(-target_flag)%>%
    diagnose_numeric()%>%
    dplyr::select(variables, min, mean, median, max, zero, minus, outlier)%>%
    flextable()%>%
    set_caption("Table 2. Summary Statistics for Target_Flag = 0")| variables | min | mean | median | max | zero | minus | outlier | 
| kidsdriv | 0 | 0.1393142 | 0 | 4 | 5,407 | 0 | 601 | 
| age | 16 | 45.3227901 | 46 | 81 | 0 | 0 | 47 | 
| homekids | 0 | 0.6439747 | 0 | 5 | 4,116 | 0 | 559 | 
| yoj | 0 | 10.6718337 | 12 | 23 | 379 | 0 | 398 | 
| income | 0 | 65,951.9700335 | 58,368 | 367,030 | 365 | 0 | 156 | 
| home_val | 0 | 169,075.4123566 | 175,052 | 885,282 | 1,448 | 0 | 5 | 
| travtime | 5 | 33.0251332 | 32 | 142 | 0 | 0 | 69 | 
| bluebook | 1,500 | 16,230.9487350 | 15,000 | 69,740 | 0 | 0 | 74 | 
| tif | 1 | 5.5557590 | 6 | 25 | 0 | 0 | 29 | 
| oldclaim | 0 | 3,311.5948735 | 0 | 53,986 | 4,111 | 0 | 622 | 
| clm_freq | 0 | 0.6486352 | 0 | 5 | 4,111 | 0 | 583 | 
| mvr_pts | 0 | 1.4137816 | 1 | 11 | 2,998 | 0 | 280 | 
| car_age | 0 | 8.6709220 | 9 | 28 | 3 | 0 | 2 | 
target_1%>%
    dplyr::select(-target_flag)%>%
    diagnose_numeric()%>%
    dplyr::select(variables, min, mean, median, max, zero, minus, outlier)%>%
    flextable()%>%
    set_caption("Table 3. Summary Statistics for Target_Flag = 1")| variables | min | mean | median | max | zero | minus | outlier | 
| kidsdriv | 0 | 0.2596377 | 0 | 4 | 1,773 | 0 | 380 | 
| age | 16 | 43.3012104 | 43 | 76 | 0 | 0 | 7 | 
| homekids | 0 | 0.9368323 | 0 | 5 | 1,173 | 0 | 0 | 
| yoj | 0 | 10.0167488 | 11 | 19 | 246 | 0 | 250 | 
| income | 0 | 50,641.2980910 | 43,971 | 320,127 | 250 | 0 | 77 | 
| home_val | 0 | 115,256.5541339 | 114,565 | 750,455 | 846 | 0 | 11 | 
| travtime | 5 | 34.7710172 | 34 | 97 | 0 | 0 | 12 | 
| bluebook | 1,500 | 14,255.8987459 | 12,600 | 62,240 | 0 | 0 | 36 | 
| tif | 1 | 4.7807710 | 4 | 21 | 0 | 0 | 27 | 
| oldclaim | 0 | 6,061.5499303 | 2,448 | 57,037 | 898 | 0 | 233 | 
| clm_freq | 0 | 1.2169066 | 1 | 5 | 898 | 0 | 0 | 
| mvr_pts | 0 | 2.4816535 | 2 | 13 | 714 | 0 | 11 | 
| car_age | -3 | 7.3674789 | 7 | 25 | 0 | 1 | 0 | 
target_0%>%
    diagnose_category()%>%
    flextable()%>%
    set_caption("Table 4. Summary Statistics for Categorical Variables: Target Level = 0")| variables | levels | N | freq | ratio | rank | 
| target_flag | 0 | 6,008 | 6,008 | 100.000000 | 1 | 
| parent1 | N | 6,008 | 5,407 | 89.996671 | 1 | 
| parent1 | Y | 6,008 | 601 | 10.003329 | 2 | 
| mstatus | Y | 6,008 | 3,841 | 63.931425 | 1 | 
| mstatus | N | 6,008 | 2,167 | 36.068575 | 2 | 
| sex | F | 6,008 | 3,183 | 52.979361 | 1 | 
| sex | M | 6,008 | 2,825 | 47.020639 | 2 | 
| education | High School | 6,008 | 2,355 | 39.197736 | 1 | 
| education | Bachelors | 6,008 | 1,719 | 28.611851 | 2 | 
| education | Masters | 6,008 | 1,331 | 22.153795 | 3 | 
| education | PhD | 6,008 | 603 | 10.036618 | 4 | 
| job | Blue Collar | 6,008 | 1,191 | 19.823569 | 1 | 
| job | Clerical | 6,008 | 900 | 14.980027 | 2 | 
| job | Professional | 6,008 | 870 | 14.480692 | 3 | 
| job | Manager | 6,008 | 851 | 14.164447 | 4 | 
| job | Lawyer | 6,008 | 682 | 11.351531 | 5 | 
| job | Home Maker | 6,008 | 461 | 7.673103 | 6 | 
| job | Student | 6,008 | 446 | 7.423435 | 7 | 
| job | 6,008 | 390 | 6.491345 | 8 | |
| job | Doctor | 6,008 | 217 | 3.611851 | 9 | 
| car_use | Private | 6,008 | 4,026 | 67.010652 | 1 | 
| car_use | Commercial | 6,008 | 1,982 | 32.989348 | 2 | 
| car_type | Minivan | 6,008 | 1,796 | 29.893475 | 1 | 
| car_type | SUV | 6,008 | 1,616 | 26.897470 | 2 | 
| car_type | Pickup | 6,008 | 946 | 15.745672 | 3 | 
| car_type | Sports Car | 6,008 | 603 | 10.036618 | 4 | 
| car_type | Van | 6,008 | 549 | 9.137816 | 5 | 
| car_type | Panel Truck | 6,008 | 498 | 8.288948 | 6 | 
| red_car | N | 6,008 | 4,246 | 70.672437 | 1 | 
| red_car | Y | 6,008 | 1,762 | 29.327563 | 2 | 
| revoked | N | 6,008 | 5,451 | 90.729028 | 1 | 
| revoked | Y | 6,008 | 557 | 9.270972 | 2 | 
| urbanicity | Urban | 6,008 | 4,454 | 74.134487 | 1 | 
| urbanicity | Rural | 6,008 | 1,554 | 25.865513 | 2 | 
target_1%>%
    diagnose_category()%>%
    flextable()%>%
    set_caption("Table 5. Summary Statistics for Categorical Variables: Target Level = 1")| variables | levels | N | freq | ratio | rank | 
| target_flag | 1 | 2,153 | 2,153 | 100.000000 | 1 | 
| parent1 | N | 2,153 | 1,677 | 77.891314 | 1 | 
| parent1 | Y | 2,153 | 476 | 22.108686 | 2 | 
| mstatus | N | 2,153 | 1,100 | 51.091500 | 1 | 
| mstatus | Y | 2,153 | 1,053 | 48.908500 | 2 | 
| sex | F | 2,153 | 1,192 | 55.364608 | 1 | 
| sex | M | 2,153 | 961 | 44.635392 | 2 | 
| education | High School | 2,153 | 1,178 | 54.714352 | 1 | 
| education | Bachelors | 2,153 | 523 | 24.291686 | 2 | 
| education | Masters | 2,153 | 327 | 15.188110 | 3 | 
| education | PhD | 2,153 | 125 | 5.805852 | 4 | 
| job | Blue Collar | 2,153 | 634 | 29.447283 | 1 | 
| job | Clerical | 2,153 | 371 | 17.231770 | 2 | 
| job | Student | 2,153 | 266 | 12.354854 | 3 | 
| job | Professional | 2,153 | 247 | 11.472364 | 4 | 
| job | Home Maker | 2,153 | 180 | 8.360427 | 5 | 
| job | Lawyer | 2,153 | 153 | 7.106363 | 6 | 
| job | Manager | 2,153 | 137 | 6.363214 | 7 | 
| job | 2,153 | 136 | 6.316767 | 8 | |
| job | Doctor | 2,153 | 29 | 1.346958 | 9 | 
| car_use | Private | 2,153 | 1,106 | 51.370181 | 1 | 
| car_use | Commercial | 2,153 | 1,047 | 48.629819 | 2 | 
| car_type | SUV | 2,153 | 678 | 31.490943 | 1 | 
| car_type | Pickup | 2,153 | 443 | 20.575941 | 2 | 
| car_type | Minivan | 2,153 | 349 | 16.209940 | 3 | 
| car_type | Sports Car | 2,153 | 304 | 14.119833 | 4 | 
| car_type | Van | 2,153 | 201 | 9.335810 | 5 | 
| car_type | Panel Truck | 2,153 | 178 | 8.267534 | 6 | 
| red_car | N | 2,153 | 1,537 | 71.388760 | 1 | 
| red_car | Y | 2,153 | 616 | 28.611240 | 2 | 
| revoked | N | 2,153 | 1,710 | 79.424059 | 1 | 
| revoked | Y | 2,153 | 443 | 20.575941 | 2 | 
| urbanicity | Urban | 2,153 | 2,038 | 94.658616 | 1 | 
| urbanicity | Rural | 2,153 | 115 | 5.341384 | 2 | 
We drop one observation where ‘car_age’ is < 0 given that it is impossible to have a negative age. This appears to be a data entry error given there are no other negative values in this variable category.
# Drop observation with data error for car age (obs. value = -3). 
df$car_age[df$car_age < 0] <- NAThe histograms shown in Figure 1. indicate that distributions for ‘target_flag’, ‘kidsdriv’, ‘homekids’, ‘yoj’, ‘income’, ‘oldclaim’, ‘clm_freq’, and ‘mvr_pts’ depart from a normal model.
From an interpretive stand-point, it’s worth noting the high 0 value counts for the following covariates: ‘tif’, ‘oldclaim’, ‘car_age’, ‘income’, ‘home_val’. Particularly the latter two. A 0 value for ‘income’ may indicate that an individual is unemployed and/or a student/dependent. A 0 value for ‘home_val’ may indicate that an individual is a renter, a dependent, or homeless.
These details can be important in establishing the costs:benefits of insuring a driver. And their absence (as in this dataset) can limit the discriminatory ability of a classification model.
df %>%
  dplyr::select(!target_flag)%>%
  keep(is.numeric) %>%                     # Keep only numeric columns
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(value)) +                     # Plot the values
    facet_wrap(~ key, scales = "free") +   # In separate panels
    geom_histogram(fill='steelblue')+
  labs(title='Figure 1. Distributions of Numerical Covariates')+
  theme_minimal()The variables kidsrive, oldclaim, homekids, clm_freq, an yoj contain the highest percentage of outliers (~6-17%, depending on target_flag level). The percentage of outliers is generally greater for observations associated with accidents (target_flag = 1) than without (target_flag = 0).
# Identify percentage of outliers for each covariate across levels of target_flag
    
diagnose_outlier(df) %>%
    filter(outliers_ratio > 5)%>%
    mutate(ratio_means= outliers_mean/with_mean)%>% # mean val of outliers/total mean
    arrange(desc(ratio_means))%>%
    dplyr::select(variables, outliers_ratio, ratio_means)%>%
    flextable()%>%
    set_caption("Table 6. Outlier Summary")| variables | outliers_ratio | ratio_means | 
| kidsdriv | 12.020586 | 8.31906218 | 
| oldclaim | 8.124004 | 7.51994989 | 
| target_amt | 19.850509 | 4.67982430 | 
| homekids | 10.439897 | 4.47198413 | 
| yoj | 7.940203 | 0.01205255 | 
The dataset includes missing observations in one categorical variable (‘job’) and six numerical variables (‘car_age’, ‘home-val’, ‘yoj’, ‘income’, ‘age’). Across covariates (i.e., predictor variables) the extent of missingness is < 6.5% of the total observations. There are no obvious patterns in the missing data that warrant concern.
#Basic missing tables by target level
    
df%>%
    diagnose()%>%
    dplyr::select(-unique_count, -unique_rate)%>%
    arrange(desc(missing_count))%>%
    filter(missing_count>0)%>%
    flextable()%>%
    set_caption("Table 7. Missing Data Summary")| variables | types | missing_count | missing_percent | 
| job | factor | 526 | 6.4452886 | 
| car_age | numeric | 511 | 6.2614876 | 
| home_val | numeric | 464 | 5.6855777 | 
| yoj | numeric | 454 | 5.5630437 | 
| income | numeric | 445 | 5.4527631 | 
| age | numeric | 6 | 0.0735204 | 
Pairwise comparisons provide a method to identify multicollinearities in the dataset. It is worth noting that tree-based models are sensitive to multicollinearity.
Notable correlations within the dataset include:
#Assess collinearity in numerical cols --- remove na, should also remove zeros
corrdf<-df%>%
  na.omit()%>%
  dplyr::select(!c(target_flag, target_amt))%>%
  dplyr::select_if(is.numeric)%>%
  cor()
corrplot(corrdf, method = 'number', type='lower', title='Figure 2. Correlation Plot with Zero Values Included', mar=c(0,0,1,0)) # mar() makes title visibleWe can summarize the results of the exploratory data analysis as follows:
A more complete data summary can be reviewed at .
The first step in our modeling building step is to create a randomized training (75%) and testing split (25%).
set.seed(939864536)
# create random data split
data.split <- initial_split(df, prop = .75, strata = target_flag)
# create train and test sets from split
train.data <- training(data.split)
test.data<- testing(data.split)Then we establish a resampling framework for parameter tuning and performance evaluation using 10-fold cross-validation.
set.seed(11003)
folds<-
  vfold_cv(train.data, 
           v=10,
           strata= target_flag)Data preprocessing steps are incorporated prior to model construction using the Tidymodel’s Recipe package. The selected machine learning algorithms can handle missing data, with the exception of KNN. That said missingness can cause grid-based parameter tuning to fail due to a lack of convergence. For these reasons missing value imputation is included in our preprocessing recipe for all of the models examined here.
The following preprocessing steps have also been addressed:
# build recipe
data.recipe<-
  recipe(target_flag ~., data=train.data)%>%
  step_select(!target_amt)%>%
  step_impute_bag(all_numeric_predictors())%>% #impute numerical data
  step_impute_mode(all_nominal_predictors())%>% #impute nominal data
  step_YeoJohnson(all_numeric_predictors())%>% # approximate near normal distributions (optional here)
  step_normalize(all_numeric_predictors())%>% # center and scale numerical vars 
  step_dummy(all_nominal_predictors(), -all_outcomes(), one_hot = TRUE)%>% # keep ref levels with one hot
  step_nzv(all_numeric_predictors())%>% # remove numeric vars that have zero variance (single unique value)
  step_corr(all_predictors(), threshold = 0.5, method = 'spearman')%>% # address collinearity
  themis::step_downsample(target_flag) # rebalance the dataset based on the response variable
# review data results
prepped.data <- 
  data.recipe %>% # use the recipe object
  prep() %>% # perform the recipe on training data
  juice() # extract only the preprocessed dataframe Model development can be outlined as follows (see associated code):
# logistic model
log.spec <- 
  logistic_reg(penalty = tune(), mixture = 1) %>% 
  set_engine("glmnet")
# KNN 
knn.spec <- 
  nearest_neighbor(neighbors = tune()) %>% # we can adjust the number of neighbors 
  set_engine("kknn") %>% 
  set_mode("classification")
#Support Vector Machine
svm.spec <-
  svm_rbf(
  cost = tune(),
  rbf_sigma = tune(), 
  margin = tune()) %>%  
  set_engine("kernlab") %>% 
  set_mode("classification") %>% 
  translate()
# XGboost
xgb.spec <- 
  boost_tree(
    learn_rate=tune(),
    tree_depth=tune(),
    trees=1000,
    mtry=tune()
  ) %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")%>%
  translate
#Random Forest
rf.spec <- 
  rand_forest(
    mtry=tune(),
    trees=1000,
    min_n=tune(),
  ) %>%
  set_mode("classification")%>%
  set_engine("ranger", importance = "impurity")models.wf<-
  workflow_set(
    preproc=list(Models = data.recipe),
    models = list(Logistic = log.spec, KNN=knn.spec, SVM.radial = svm.spec, XGBoost = xgb.spec, R.Forest = rf.spec)
  )# speed grid estimation
race.ctrl <-
   control_race(
      save_pred = TRUE,
      save_workflow = TRUE
   )
grid.results <-
   models.wf %>%
   workflow_map(     
      'tune_race_anova', # compute performance metrics (finetune)
      seed = 052191211,
      resamples = folds,
      grid = 20, # setting up grid size of 15
      control = race.ctrl,
      metrics = metric_set(roc_auc, accuracy), #recall, precision, f_meas, accuracy, kap, roc_auc, sens, spec
      verbose = TRUE
   )
# save results to local directory
#save(grid.results, file='models.Rdata')
#load('models.Rdata')
# rank results by .metric
ranks<-rank_results(grid.results)With initial parameter estimation and model fit to training data complete, we can evaluate model performance based on classification accuracy. On this basis, XGBoost is the highest performing models (Figure 3 and Table 8).
# plot rmse tuning results for each model
autoplot(
  grid.results,
  rank_metric = "accuracy",  # <- how to order models
  metric = "accuracy",       # <- which metric to visualize
  select_best = TRUE) +     # <- one point per workflow
  labs(title='Figure 3. Estimated Accuracy for Classification Model Tuning Results', x='')+
  theme_light()# select best model for each category based on accuracy
rank_results(grid.results, rank_metric = "accuracy", select_best=TRUE)%>%
  flextable()%>%
  set_caption('Table 8. Model Performance Metrics for Grid Tuning')| wflow_id | .config | .metric | mean | std_err | n | preprocessor | model | rank | 
| Models_XGBoost | Preprocessor1_Model19 | accuracy | 0.7161724 | 0.004783707 | 10 | recipe | boost_tree | 1 | 
| Models_XGBoost | Preprocessor1_Model19 | roc_auc | 0.8140844 | 0.005253838 | 10 | recipe | boost_tree | 1 | 
| Models_Logistic | Preprocessor1_Model15 | accuracy | 0.7114301 | 0.007563557 | 10 | recipe | logistic_reg | 2 | 
| Models_Logistic | Preprocessor1_Model15 | roc_auc | 0.8034997 | 0.005231407 | 10 | recipe | logistic_reg | 2 | 
| Models_R.Forest | Preprocessor1_Model01 | accuracy | 0.7112643 | 0.005915120 | 10 | recipe | rand_forest | 3 | 
| Models_R.Forest | Preprocessor1_Model01 | roc_auc | 0.8094962 | 0.005384637 | 10 | recipe | rand_forest | 3 | 
| Models_SVM.radial | Preprocessor1_Model11 | accuracy | 0.7078398 | 0.007061415 | 10 | recipe | svm_rbf | 4 | 
| Models_SVM.radial | Preprocessor1_Model11 | roc_auc | 0.8065030 | 0.006256920 | 10 | recipe | svm_rbf | 4 | 
| Models_KNN | Preprocessor1_Model13 | accuracy | 0.6949233 | 0.006326744 | 10 | recipe | nearest_neighbor | 5 | 
| Models_KNN | Preprocessor1_Model13 | roc_auc | 0.7533843 | 0.006521541 | 10 | recipe | nearest_neighbor | 5 | 
For the purpose of this analysis, we will update the top three models (XGBoost, Random Forest, Logistic Regression) to include parameter values optimized via. the resampling process, refit the models to the full training data, and then evaluate each model against the remaining test set.
#select best xgboost model
boost.best<- grid.results %>% 
   extract_workflow_set_result('Models_XGBoost') %>% 
   select_best(metric = "accuracy")
rforest.best<-
  grid.results %>% 
   extract_workflow_set_result('Models_R.Forest') %>% 
   select_best(metric = "accuracy")
logistic.best<-
  grid.results %>% 
   extract_workflow_set_result('Models_Logistic') %>% 
   select_best(metric = "accuracy")Table 9 shows the final accuracy scores and other related performance metrics for model evaluation against the test data. On this basis, the Random Forest model ranked highest, followed by XGBoost and the logistic model. The differences in performance, however, are relatively small. The latter is supported by the overlap in AUC (area under the curve) curves displayed in Figure 4.
It is also worth noting that none of the models appears to have over-fit the data, as might be indicated by higher accuracy scores on the test data vs. the training data.
# Finalize model on test data
boost.test<-
  grid.results%>%
  extract_workflow('Models_XGBoost') %>% 
  finalize_workflow(boost.best) %>% 
  last_fit(split = data.split)
rforest.test<-
  grid.results%>%
  extract_workflow('Models_R.Forest') %>% 
  finalize_workflow(rforest.best) %>% 
  last_fit(split = data.split)
logistic.test<-
  grid.results%>%
  extract_workflow('Models_Logistic') %>% 
  finalize_workflow(logistic.best) %>% 
  last_fit(split = data.split)
custom_metrics <- metric_set(accuracy, sens, spec, precision, recall, f_meas, kap, mcc)
metrics1<-rforest.test%>% 
  collect_predictions()%>%
  group_by(id)%>%
  custom_metrics(truth = target_flag, estimate = .pred_class)%>%
  select(.metric, .estimate)%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  pivot_wider(names_from = Metric, values_from = Estimate)%>%
  select(accuracy, sens, spec, precision, recall, f_meas)%>%
  rename(Accuracy = accuracy, Sensitivity=sens, Specificity=spec, Precision=precision, Recall=recall, F1=f_meas)%>%
  mutate(Model='Random Forest', .before=Accuracy)
metrics2<-logistic.test%>% 
  collect_predictions()%>%
  group_by(id)%>%
  custom_metrics(truth = target_flag, estimate = .pred_class)%>%
  select(.metric, .estimate)%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  pivot_wider(names_from = Metric, values_from = Estimate)%>%
  select(accuracy, sens, spec, precision, recall, f_meas)%>%
  rename(Accuracy = accuracy, Sensitivity=sens, Specificity=spec, Precision=precision, Recall=recall, F1=f_meas)%>%
  mutate(Model='XGBoost', .before=Accuracy)
metrics3<-logistic.test%>% 
  collect_predictions()%>%
  group_by(id)%>%
  custom_metrics(truth = target_flag, estimate = .pred_class)%>%
  select(.metric, .estimate)%>%
  rename(Metric = .metric, Estimate=.estimate)%>%
  pivot_wider(names_from = Metric, values_from = Estimate)%>%
  select(accuracy, sens, spec, precision, recall, f_meas)%>%
  rename(Accuracy = accuracy, Sensitivity=sens, Specificity=spec, Precision=precision, Recall=recall, F1=f_meas)%>%
  mutate(Model='Logistic', .before=Accuracy)
bind_rows(metrics1, metrics2, metrics3)%>%
  flextable()%>%
  set_caption('Table 9. Evaluation Metrics: Test Evaluation')| Model | Accuracy | Sensitivity | Specificity | Precision | Recall | F1 | 
| Random Forest | 0.7143557 | 0.6930759 | 0.7736549 | 0.8950989 | 0.6930759 | 0.7812383 | 
| XGBoost | 0.7055365 | 0.6950732 | 0.7346939 | 0.8795282 | 0.6950732 | 0.7764968 | 
| Logistic | 0.7055365 | 0.6950732 | 0.7346939 | 0.8795282 | 0.6950732 | 0.7764968 | 
auc.boost<-boost.test%>% 
  collect_predictions()%>%
  group_by(id) %>% # id contains our folds
  roc_curve(target_flag, .pred_0) %>% 
  mutate(model='XGBoost')
auc.rforest<-rforest.test%>% 
  collect_predictions()%>%
  group_by(id) %>% # id contains our folds
  roc_curve(target_flag, .pred_0)%>%
  mutate(model='Random Forest')
auc.logistic<-logistic.test%>% 
  collect_predictions()%>%
  group_by(id) %>% # id contains our folds
  roc_curve(target_flag, .pred_0) %>% 
  mutate(model='Logistic Regression')
#plot aggregate roc_auc curve for final models
bind_rows(auc.boost, auc.rforest, auc.logistic)%>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, col = model)) + 
  geom_path(lwd = 1.5, alpha = 0.8) +
  geom_abline(lty = 3) + 
  coord_equal()+
  labs(title='Figure 4. Comparison of Area Under the Curve for Test Predictions', y='Sensitivity (TPR)', x='1-Specificity (FPR)')+
  theme_minimal()Variable importance rankings as show for each model in Figures 5-7. In each case, Urbanicity had the most influence on the predicted classification followed by either home_val (Random Forest, XGBoost) or Job/Student (Logistic Regression). It follows that automobile accidents occur more frequently in dense urban environments than in other less populated areas. The cost of these claims, however, is not considered here.
Figures 8-10 display the confusion matrix produced by each model on the test data. With a threshold criterion of 0.5, the models favor correct classification of true negatives (0 = no accident) over true positives (1 = accidents). An insurer seeking to minimize claim payouts would likely adjust the criterion (downward in this case) to favor correct classification of the latter.
# Plot VIP for final Models
p1<-rforest.test%>%
  extract_fit_parsnip()%>%
  vip(num_features=20)+
  labs(title='Fig. 5. Random Forest', y='Variable Importance')+
  theme_minimal()+
  
  theme(plot.title=element_text(size=10))
p2<-boost.test%>%
  extract_fit_parsnip()%>%
  vip(num_features=20)+
  labs(title='Fig. 6. XGBoost', y='Variable Importance')+
  theme_minimal()+
  theme_minimal()+
  theme(plot.title=element_text(size=10))
p3<-logistic.test%>%
  extract_fit_parsnip()%>%
  vip(num_features=20, fill='midnightblue')+
  labs(title='Fig. 7. Logistic Regression',y='Variable Importance')+
  theme_minimal()+
  theme(plot.title=element_text(size=10))
# combine vip plots
p1|p2|p3# Plot confusion matrix for final Models 
p4<-boost.test%>% 
  collect_predictions()%>%
  conf_mat(target_flag, .pred_class)%>%
  autoplot(type = "heatmap")+
  labs(title='Fig. 8. Confusion Matrix', subtitle= 'XGBoost')+
  theme(plot.title=element_text(size=12))
p5<-rforest.test%>% 
  collect_predictions()%>%
  conf_mat(target_flag, .pred_class)%>%
  autoplot(type = "heatmap")+
  labs(title='Fig. 9. Confusion Matrix', subtitle='Random Forest')+
  theme(plot.title=element_text(size=12))
p6<-logistic.test%>% 
  collect_predictions()%>%
  conf_mat(target_flag, .pred_class)%>%
  autoplot(type = "heatmap")+
  labs(title='Fig. 10. Confusion Matrix', subtitle='Logistic Regression')+
  theme(plot.title=element_text(size=12))
p4|p5|p6This assessment sought to classify drivers based on their probability of an auto accident using a select set of ML algorithms and to then compare the results to those produced by a base logistic regression model. Surprisingly, the ML models produced results comparable to those of the logistic classifer - a review/assessment of model assumptions for the latter can be found on .
For the purposes of model interpretability & accuracy, the logistic classifier remains a reasonable choice for accident prediction using the data available for this work.
This section highlights the use of Principal Components Analysis (PCA) to reduce the dimensionality of complex numerical data while minimizing information loss. The PCA technique creates new, uncorrelated variables (components), that correspond to linear combinations of the original data and successively maximize variance.
PCA is not intended for analysis of categorical variables. For datasets that include a mix of continuous and categorical features, multiple factor analysis is recommended.
To demonstrate the use of PCA, we will evaluate the water chemistry of streams distributed across New York States. The data were downloaded in .csv format from the NYS Department of Environmental Conservation website:
Some data processing is required prior to use of PCA. Steps include:
chem<-read_csv('Streams_ Chemistry.csv')
chem%>%head(5)%>%
  flextable()%>%
  set_caption('Table 10: Stream Water Chemistry from Sites in New York State')| site_id | sample_date | sample_type | sample_matrix | parameter_name | parameter_fraction | result_value | parameter_unit | sample_comment | analytical_method | analysis_date | test_type | laboratory | lab_qualifier_source | lab_qualifier | validated | validation_date | validator_qualifier | validator_qualifier_explanation | method_detection_limit | reporting_limit | quantitation_limit | detection_limit_unit | result_lab_comment | 
| 14-NEVR-8.9 | 10/22/2020, 12:15 PM | LAB | WS | ALKALINITY, TOTAL (AS CACO3) | TOTAL | 16.4 | mg/L | SM 2320 B | 10/27/2020, 2:49 AM | initial | ALSRNY | AE | 0 | Y | 4/13/2021 | A | 1.80 | 2 | mg/l | ||||
| 14-NEVR-8.9 | 10/22/2020, 12:15 PM | LAB | WS | ALUMINUM | DISSOLVED | 11.2 | ug/L | E200.8 | 10/29/2020, 12:29 PM | initial | ALSRNY | AE | 0 | Y | 4/13/2021 | A | 2.30 | 10 | ug/l | ||||
| 14-NEVR-8.9 | 10/22/2020, 12:15 PM | LAB | WS | ARSENIC | TOTAL | ug/L | E200.8 | 10/29/2020, 1:41 PM | initial | ALSRNY | AE | U | Y | 4/13/2021 | U | Analyte was analyzed for but not detected | 0.32 | 0.32 | 1 | ug/l | not detected | ||
| 14-NEVR-8.9 | 10/22/2020, 12:15 PM | LAB | WS | CADMIUM | DISSOLVED | ug/L | E200.8 | 10/29/2020, 12:29 PM | initial | ALSRNY | AE | U | Y | 4/13/2021 | U | Analyte was analyzed for but not detected | 0.38 | 0.38 | 1 | ug/l | not detected | ||
| 14-NEVR-8.9 | 10/22/2020, 12:15 PM | LAB | WS | CALCIUM | TOTAL | 6,730.0 | ug/L | E200.7 | 10/29/2020, 7:32 PM | initial | ALSRNY | AE | 0 | Y | 4/13/2021 | A | 110.00 | 1,000 | ug/l | 
chem %<>%
  select(site_id, sample_date, parameter_name, parameter_fraction, result_value, parameter_unit)%>%
  filter(parameter_fraction == 'TOTAL')%>%
  group_by(site_id, parameter_name, parameter_fraction)%>%
  summarise(median(result_value))%>%
  select(-parameter_fraction)%>%
  rename(id = site_id, parameter = parameter_name, median_value = 'median(result_value)')%>%
  pivot_wider(names_from = parameter, values_from = median_value)%>%
  filter_at(vars('NITROGEN, TOTAL', ALUMINUM, POTASSIUM, SODIUM, 'TOTAL SOLIDS'),all_vars(!is.na(.)))%>%
  select(1:35)%>%
  select(!c(ARSENIC, CADMIUM, LEAD, 'NITROGEN, AMMONIA (AS N)', 'NITROGEN, NITRITE', SILVER, ZINC, FLUORIDE, MERCURY, 'TOTAL SUSPENDED SOLIDS'))%>%
  mutate(id=str_remove(id,'01-'))%>%
  mutate_at(c(2:25), as.numeric)
id<-chem%>%
  select(id)# build function for median imputation across colummns
f=function(x){
  x<-as.numeric(as.character(x)) #first convert each column into numeric if it is from factor
  x[is.na(x)] =median(x, na.rm=TRUE) #convert the item with NA to median value from the column
  x #display the column
}
# conduct median imputation
chem=data.frame(apply(chem,2,f))%>%
  mutate_at(c(2:25), as.numeric)%>%
  select(!id)
# combine imputed data and original id
chem<-cbind(id, chem)
#convert id to rowname
chem%<>%
  column_to_rownames('id')For the purposes of this work, the R packages Stats and Factoextra will be used for PCA evaluation and related visualizations. It’s worth noting that Factoextra package can perform a wide variety of multivariate data analyses including:
Principal Component Analysis (PCA), which is used to summarize the information contained in a continuous (i.e, quantitative) multivariate data by reducing the dimensionality of the data without loosing important information.
Correspondence Analysis (CA), which is an extension of the principal component analysis suited to analyse a large contingency table formed by two qualitative variables (or categorical data).
Multiple Correspondence Analysis (MCA), which is an adaptation of CA to a data table containing more than two categorical variables.
Multiple Factor Analysis (MFA) dedicated to datasets where variables are organized into groups (qualitative and/or quantitative variables).
Hierarchical Multiple Factor Analysis (HMFA): An extension of MFA in a situation where the data are organized into a hierarchical structure.
Factor Analysis of Mixed Data (FAMD), a particular case of the MFA, dedicated to analyze a data set containing both quantitative and qualitative variables.
For more information see the Factoextra website: 
In this step, we compute the PCA components.
# estimate pca scores by covariate
components <- prcomp(chem, scale = TRUE)
# apply rotation
components$rotation <- -1*components$rotation
# show components
components$rotation%>%
  as.data.frame()Figure 11 shows the percentage of variance in the dataset accounted for by the first 10 principal components. The first four components alone explain ~ 71% of this variance.
The importance of individual chemical measures to each component is highlighted in Figures 12-15.
Figure 16 offers an additional view of variable influence for PCA1 and PCA2. Note that the length of the arrows (variables) provides an collinearities. For example, potassium, chloride, magnesium, and sodium appear highly (positively) correlated. Also note that, phosphorus and nitrogen are distributed along the negative range of PCA2 - which distinguishes them from streams high relative concentrations of potassium, chloride, magnesium, and sodium.
# estimate proportion of variance accounted for by each PCA and feature
fviz_screeplot(components, addlabels = TRUE, ylim = c(0, 36))+
  labs(title='Figure 11. Proportion of Variance Explained by Individual PCA Components (Dimensions)')# assess variable importance to each if the first 4 components (dim)
v1<-fviz_contrib(components, choice = "var", axes = 1, top = 10, title='Figure 12. Contribution of Variables to PCA1')
v2<-fviz_contrib(components, choice = "var", axes = 2, top = 10, title='Figure 13. Contribution of Variables to PCA2')
v3<-fviz_contrib(components, choice = "var", axes = 3, top = 10, title='Figure 14. Contribution of Variables to PCA3')
v4<-fviz_contrib(components, choice = "var", axes = 4, top = 10, title='Figure 15. Contribution of Variables to PCA4')
v1|v2v3|v4#Graph of variables. Positive correlated variables point to the same side of the plot. Negative correlated variables point to opposite sides of the graph.
fviz_pca_var(components,
             alpha.var = "contrib",
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             title = 'Figure 16. Influence of Variables on PCA1 and PCA2',
             repel = TRUE     # Avoid text overlapping
)We can also look at how stream sampling sites are distributed relative to the components. For Example, site 12-Caya-4.7 scores high on PCA2 (indicative of high nitrogen and phosphorus loads) and is below the mid-range for PCA1, which suggests the influence of surface water (rain & runoff) contributions.
Figure 18 can be used to distinguish or group locations in terms of their water chemistry.
#Graph of individuals. Individuals with a similar profile are grouped together.
fviz_pca_ind(components,
             col.ind = "contrib", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             title='Figure 17. Distribution of Stream Locations (Water Quality) based on PCA1 and PCA2',
             repel = TRUE     # Avoid text overlapping
)fviz_pca_ind(components,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             title='Figure 17. Distribution of Stream Locations (Water Quality) based on PCA1 and PCA2',
             repel = TRUE     # Avoid text overlapping
)#Biplot of individuals and variables
fviz_pca_biplot(components, repel = TRUE,
                title='Figure 18. Distribution of Stream Locations based on Water Quality: from PCA1 and PCA2',
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969"  # Individuals color
                ) ## PCA Summary
From the previous analysis, it is clear that PCA can be used as an unsupervised method to quickly distinguish associations between water quality measurements that have interpretive value as well as identify patterns of association between streams based on their chemical composition. Future assessments might include qualitative measurements of stream condition along with other multivariate approaches to describe and classify these sites.