train_data <- read.csv(
"/Users/teddykelly/Downloads/insurance-training-data2-2.csv")
eval_data <- read.csv(
"/Users/teddykelly/Downloads/insurance-testing-data2-2.csv")Predicting Car Accident Likelihood and Estimating Crash Costs for an Auto Insurance Company
Motivation
Every day, millions of Americans from all different backgrounds get in their cars to drive to work, school, shopping, etc. Unfortunately, while the vast majority of drivers get to their destinations without any problems, not a day goes by without the occurrence of car accidents. Some are minor while others are fatal, however, all car accidents result in some type of damage requiring costs to be paid. The goal of this project is to develop a binary logistic regression model that can accurately predict the likelihood of someone getting in a car accident based on their background, and a multiple linear regression model that will predict the expected cost of the accident if the person does get in an accident.
Setup
Clear the environment
Changing the variable names in both datasets to be more readable.
#Converting variables with multiple categories into ordered factors before
# converting them to numerics
train_data$CAR_TYPE <- factor(
train_data$CAR_TYPE,
levels = c("Minivan", "Panel Truck", "Pickup", "Sports Car", "Van",
"z_SUV"),
ordered = TRUE
)
train_data$EDUCATION <- factor(
train_data$EDUCATION,
levels = c("<High School", "Bachelors", "Masters", "PhD",
"z_High School"),
ordered = TRUE
)
train_data$JOB <- factor(
train_data$JOB,
levels = c("", "Clerical", "Doctor", "Home Maker", "Lawyer", "Manager",
"Professional", "Student", "z_Blue Collar"),
ordered = TRUE
)
train_data <- train_data |> transmute(
crash_flag = TARGET_FLAG,
crash_cost = TARGET_AMT,
driver_age = AGE,
car_value = as.numeric(gsub("[$,]", "", BLUEBOOK)),
car_age = as.numeric(CAR_AGE),
car_type = CAR_TYPE,
car_use = ifelse(CAR_USE=='Private', 1, 0),
past_claims_count = CLM_FREQ,
educ_level = EDUCATION,
num_children_home = HOMEKIDS,
home_value = as.numeric(gsub("[$,]", "", HOME_VAL)),
income = as.numeric(gsub("[$,]", "", INCOME)),
job_category = JOB,
teen_drivers = KIDSDRIV,
married = ifelse(MSTATUS=='Yes', 1, 0),
mvr_points = MVR_PTS,
past_claims_amount = as.numeric(gsub("[$,]","", OLDCLAIM)),
single_parent = ifelse(PARENT1=='Yes', 1, 0),
red_car = ifelse(RED_CAR=='yes', 1, 0),
license_revoked = ifelse(REVOKED=='Yes', 1, 0),
gender = ifelse(SEX=='M', 1, 0),
policy_tenure = TIF,
commute_time = TRAVTIME,
urban_rural = ifelse(URBANICITY=='Highly Urban/ Urban', 1, 0),
years_on_job = YOJ
)
1 Data Exploration
Section Goal
Before developing any models for predicting the likelihood of a crash and how much a crash will cost, we must first understand the structure of the training dataset. This involves understanding the following (a quick sketch of these checks follows the list below):
- The size of the data
- Whether there are missing values
- Distribution of data for each variable
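A minimal sketch of these checks, run on the already-loaded train_data:
# Quick structural checks on the training data
dim(train_data)                 # number of observations and variables
colSums(is.na(train_data))      # missing-value count for each variable
summary(train_data$crash_cost)  # example distribution summary for one variable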
Training Dataset Structure
The total data contains about 8,000 observations and 25 variables of interest measured at a fixed point in time (cross-sectional data).
Each observation represents a customer at an auto insurance company. Each variable represents information about the corresponding customer. Note that since there are so many variables, we must make decisions about which variables to include in the regressions.
The data is split into two groups:
Training dataset: Used to train or fit the models on. It contains 6,528 observations (~80% of the data)
Evaluation dataset: Used to evaluate the performance of the trained models on new unseen data. It contains 1,633 observations (~20% of the data)
Below is a table displaying each of the 25 variables, with their names as they appear in the training data and a description of what they mean.
| Variable Name | Variable Meaning |
|---|---|
| crash_flag | Indicates whether the car was involved in a crash (1 = Yes, 0 = No) |
| crash_cost | Cost of the accident if person was in a crash |
| driver_age | Age of the driver |
| car_value | Estimated value of the vehicle (Blue Book value) |
| car_age | Age of the vehicle in years |
| car_type | Type of vehicle (e.g., SUV, Sedan, Sports) |
| car_use | How the vehicle is used (e.g., Commercial or Private) |
| past_claims_count | Number of past claims filed in the last 5 years |
| educ_level | Driver’s highest education level |
| num_children_home | Number of children living at home |
| home_value | Estimated value of the driver’s home |
| income | Driver’s annual income |
| job_category | Job category or occupation type |
| teen_drivers | Number of teenage drivers using the car |
| married | Marital status of the driver |
| mvr_points | Motor Vehicle Record (MVR) points — measure of violations |
| past_claims_amount | Total claim amount from the past 5 years |
| single_parent | Indicates if the driver is a single parent |
| red_car | Indicates if the vehicle is red |
| license_revoked | Indicates if the driver’s license was revoked in the past 7 years |
| gender | Gender of the driver |
| policy_tenure | Number of years the policy has been active |
| commute_time | Average commute time or distance to work |
| urban_rural | Urban or rural location indicator |
| years_on_job | Number of years the driver has been at their current job |
1.2 Summary Statistics
The summary statistics on the numerical and binary variables are seen below. Notice that several variables have fewer than 6,528 observations, indicating missing values that I will address in the data preparation section before building the models. Also notice that I did not include any of the categorical variables like car_type, educ_level, and job_category. I will display the summary statistics for these variables separately to make more sense of them.
labels <- c(
"Crash Indicator",
"Crash Cost",
"Driver Age",
"Vehicle Value",
"Vehicle Age in years",
"Vehicle Use",
"Past Claims count",
"Children at Home",
"Home Value",
"Annual Income",
"Teen Drivers count",
"Marital Status",
"MVR Points",
"Past Claims",
"Single Parent",
"Red Car Indicator",
"License Revoked",
"Driver Gender",
"Policy Tenure in years",
"Commute Time/Distance",
"Urbanicity",
"Years on Job"
)
stargazer(train_data, type = "text",
digits = 2, median = T, covariate.labels = labels)
===========================================================================
Statistic N Mean St. Dev. Min Median Max
---------------------------------------------------------------------------
Crash Indicator 6,528 0.26 0.44 0 0 1
Crash Cost 6,528 1,466.62 4,545.65 0.00 0.00 107,586.10
Driver Age 6,525 44.85 8.65 16 45 81
Vehicle Value 6,528 15,641.76 8,381.49 1,500 14,370 69,740
Vehicle Age in years 6,129 8.32 5.72 -3 8 27
Vehicle Use 6,528 0.63 0.48 0 1 1
Past Claims count 6,528 0.80 1.16 0 0 5
Children at Home 6,528 0.72 1.11 0 0 5
Home Value 6,160 154,298.40 128,874.60 0 160,680 885,282
Annual Income 6,174 61,649.32 47,658.41 0 53,276 367,030
Teen Drivers count 6,528 0.17 0.51 0 0 4
Marital Status 6,528 0.60 0.49 0 1 1
MVR Points 6,528 1.70 2.14 0 1 13
Past Claims 6,528 4,119.32 8,924.67 0 0 57,037
Single Parent 6,528 0.13 0.34 0 0 1
Red Car Indicator 6,528 0.29 0.45 0 0 1
License Revoked 6,528 0.13 0.33 0 0 1
Driver Gender 6,528 0.46 0.50 0 0 1
Policy Tenure in years 6,528 5.37 4.15 1 4 25
Commute Time/Distance 6,528 33.44 15.97 5 33 142
Urbanicity 6,528 0.79 0.40 0 1 1
Years on Job 6,153 10.51 4.10 0 11 23
---------------------------------------------------------------------------
Summary Statistics Analysis:
Dependent Variables
crash_flag is a binary variable that indicates whether a person was involved in an accident: 1 means the observation was in a crash and 0 means they were not. The mean value is 0.26, which means that about 26% of the people in the sample were involved in a crash. crash_cost has a mean value of $1,466.62, meaning that the average cost of accidents across all customers, both those involved and those not involved in car crashes, is about $1,466. The cost of car crashes is heavily skewed to the right, as the median value of crash_cost is zero, which confirms that most of the people in this sample were not involved in car crashes. I will have to filter the data to include only observations of people who were in a crash when running the multiple linear regression.
Key Independent Variables
driver_age is very symmetrical, as the mean and median ages of the driver are both around 45 years old. This will be an intriguing variable to look at since it is usually believed that young drivers are more reckless and that elderly drivers are less capable. income will be interesting to investigate because we would assume that wealthier people are better drivers; however, wealthy people could own fancy sports cars, which could actually be more dangerous to operate. Income is, as expected, right-skewed.
Missing Values
driver_age, car_age, home_value, income, and years_on_job all have missing observations. I will address these missing values in the data preparation section before building the logistic and linear regression models.
1.3 Histograms and Bar Charts
In the previous section, we looked at the summary statistics of the numerical and dummy variables in the dataset. Now, we will visually identify the distributions of each of the numerical and binary variables using histograms.
Dependent Variables:
# Crash Indicator Bar Chart
ggplot(data = train_data,
mapping = aes(x = factor(crash_flag))) +
labs(title = "Bar Chart of People in Crashes",
x = "In Crash (1) Not in crash (0)") +
geom_bar(fill = "blue") ggplot(data = train_data,
mapping = aes(x = crash_cost))+
labs(title = "Histogram of the cost of Crashes", x = "Cost of crash in $")+
geom_histogram(bins = 20, fill = "navy")Nearly 5,000 customers did not get into an accident, while under 2,000 did get into a car crash.
The distribution of the cost of the crashes is heavily right skewed. There are potentially some outliers that I will deal with either through imputation or a log transformation.
Independent Variables
I have broken the histograms up into two groups. The first group contains the continuous quantitative variables, and the second group contains the binary and discrete numerical variables. This way, it is much easier to identify patterns in how the data is distributed for each variable.
Group 1: Numerical Continuous Variables
# Create clean dataset without the categorical variables with lots of
# categories
clean_df_num <- train_data |> dplyr::select(driver_age, car_value, car_age,
home_value, income, past_claims_amount,
policy_tenure, commute_time,
years_on_job)
# Create a melted dataset to display the histograms of the quantitative
# variables
df_melted_num <- reshape2::melt(data = clean_df_num)
# Graphing the histograms
ggplot(data = df_melted_num,
mapping = aes(x = value)) +
geom_histogram()+
facet_wrap(~variable, scales = "free_x")Observations:
Much of the data is right-skewed because the most frequent value for many of the variables is 0.
In the data preparation section, I will address these skewness problems with log transformations.
Group 2: Binary and Numerical Discrete Variables
clean_df_cat <- train_data |> dplyr::select(-crash_cost, -driver_age,
-car_value, -car_age, -home_value,
-income,-past_claims_amount,
-policy_tenure, -commute_time,
-years_on_job, -car_type,
-educ_level, -job_category,
-crash_flag)
df_melted_cat <- reshape2::melt(data = clean_df_cat)
# Graphing the histograms
ggplot(data = df_melted_cat,
mapping = aes(x = value)) +
geom_histogram()+
facet_wrap(~variable, scales = "free_x")Categorical Variables Bar Charts
Below, I have recoded the categorical data into factors so that they can be graphed using bar charts. Notice that observations with empty values for job_category are recoded into a "Missing Values" category.
# Recode data values in the train dataset first
train_data$car_type <- dplyr::recode(train_data$car_type,
"z_SUV" = "SUV")
train_data$educ_level <- dplyr:: recode(train_data$educ_level,
"<High School" = "High School",
"z_High School" = "High School")
train_data$job_category <- dplyr::case_when(
train_data$job_category == "z_Blue Collar" ~ "Blue Collar",
is.na(train_data$job_category) | train_data$job_category ==
"" ~ "Missing Values",
TRUE ~ as.character(train_data$job_category)
)
# Create separate data frame for the categorical variables
train_categorical <- train_data |> dplyr::select(car_type, educ_level,
job_category)
Now I will create a separate data frame containing the frequencies of each category for each variable.
# Car Type
car_type_df <- data.frame(Value = names(table(train_categorical$car_type)),
Frequency = table(train_categorical$car_type))
# Graph this
ggplot(data = car_type_df,
mapping = aes(x = reorder(x = Value, X = -Frequency.Freq),
y = Frequency.Freq)) +
geom_bar(stat = "identity", fill = "blue") +
labs(title = "Bar Chart of Car Type", x = "Car Types", y = "Frequency")Most of the customers drive SUVs and Minivans
Very few customers drive vans or trucks
We will do the same for education level.
educ_level_df <- data.frame(table(train_categorical$educ_level))
# Graph this
ggplot(data = educ_level_df,
mapping = aes(x = reorder(x = Var1, X = -Freq), y = Freq)) +
geom_bar(stat = "identity", fill = "blue") +
labs(title = "Bar chart of Education Level", x = "Education Level",
y = "Frequency")The education level category with the most amount of customers is high school
The majority of customers still have some type of college education.
Now for job category.
job_cat_df <- data.frame(table(train_categorical$job_category))
#job_cat_df <- job_cat_df |> dplyr::filter(Var1 != "")
#Graph this
ggplot(data = job_cat_df,
mapping = aes(x = reorder(x = Var1, X = -Freq), y = Freq)) +
geom_bar(stat = "identity", fill = "blue") +
labs(title = "Bar Chart of Job Category", x = "Job Category",
y = "Frequency") +
theme(
axis.text.x = element_text(
angle = 45,
hjust = 1,
size = 8
)
)
The job category with the most customers is Blue Collar.
Note that there are some customers whose job category was not given.
1.4 Correlation Table
Now, I will investigate if there are any explanatory variables that are correlated with either of the response variables crash_flag or crash_cost. Correlation tables do not work well with categorical data, so I will exclude car type, education level, and job category from the correlation plot.
# Create the correlation table with specified variables
corr_tab <- cor(x = train_data[, -c(5, 6, 9, 13)])
colnames(corr_tab) <- c(
"car crash",
"crash cost",
"driver age",
"car value",
"car use",
"# of past claims",
"# of children",
"home value",
"income",
"teen drivers",
"married",
"mvr points",
"past claims amount",
"single parent",
"red car",
"license revoked",
"gender",
"policy tenure",
"commute time",
"urban",
"years on job"
)
rownames(corr_tab) <- c(
"car crash",
"crash cost",
"driver age",
"car value",
"car use",
"# of past claims",
"# of children",
"home value",
"income",
"teen drivers",
"married",
"mvr points",
"past claims amount",
"single parent",
"red car",
"license revoked",
"gender",
"policy tenure",
"commute time",
"urban",
"years on job"
)
# Use ggcorplot to create the upper correlation plot
corr_plot <- ggcorrplot(corr = corr_tab,
type = "upper",
method = "square",
title = "correlation plot",
colors = c("red", "white", "green"),
lab = T,
lab_size = 1,
insig = "pch",
tl.cex = 8,
digits = 2
)
corr_plot
Correlation Plot Analysis
It’s clear that crash_flag and crash_cost share the same variables that are positively and negatively correlated with them. This makes sense because the two dependent variables of interest are fairly strongly correlated with each other, with a value of 0.54.
Most notably, urban_rural, mvr_points, and past_claims_count have the strongest positive correlations with crash_flag and crash_cost.
The most negatively correlated variables with whether or not someone gets into a crash are married and car_use, which makes sense because, in theory, married people are typically more responsible and get into fewer crashes.
Independent variables correlated among themselves
teen_drivers and num_children_home - Positively correlated with a value of 0.47. This makes sense because customers with more children at home are likely to have more teen drivers. Note that including both of these variables may result in multicollinearity.
past_claims_amount and past_claims_count - Obvious positive correlation between the number of past claims someone has made and the total amount of money from those claims. Including both variables in the regressions may result in multicollinearity.
mvr_points and past_claims_count - Positively correlated with a value of 0.4, which makes sense because people who have more violations are likely to have filed more claims.
gender and red_car - The positive correlation value of 0.67 is a very interesting result; in theory, we don't think of car color as being correlated with gender. This result suggests that men are more likely to drive red cars.
license_revoked and past_claims_amount - Positively correlated with a value of 0.41, which makes sense because people who have had their license revoked are more likely to have made claims in the past.
single_parent and married - Negatively correlated, which is obvious since single parents are not married and vice versa.
2 Data Preparation
2.1 Missing Values
Identifying the variables that have missing values is an important step for data cleaning. Below, I have outlined the procedure I use on how to address missing values.
Determine the percentage of values that are missing for each variable that contains missing values (a quick numeric check is sketched after this list).
- For any variable that has less than 10% of its observations missing, I will keep the column.
Consider the importance of each variable to the problem.
Consider deleting the specific observations with missing values if the data is missing completely at random (MCAR).
- I will run an mcar_test from the naniar package to evaluate this.
If there is no evidence that the data is not missing completely at random, then either delete the observations with missing values or impute them.
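A minimal sketch of the per-variable missing percentages mentioned in step 1, computed directly from the training data:
# Percentage of missing values for each variable, largest first
miss_pct <- sort(colMeans(is.na(train_data)) * 100, decreasing = TRUE)
round(miss_pct[miss_pct > 0], 2)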
Below is a visual representation of the percentage of missing values that each variable contains.
plot_missing(train_data)
As we can see, the following variables contain missing values in the training data:
driver_age: 0.05% missing values - The age of the driver has a strong theoretical effect on how safely they drive, and there are very few missing values. No reason to drop the entire column.
income: 5.42% missing values - The income of customers could be very intriguing to study. Stereotypes would suggest that people with higher income are safer drivers; however, more expensive sports cars may be more dangerous. Less than 10% of values are missing and income is a variable of interest, so there is no reason to drop this entire column.
home_value: 5.64% of observations have missing values - Similar to income, theory points to people with high home values being safer drivers, but yet again, they may own fancier cars that could be more dangerous. Not many missing values, so no need to remove the entire column.
years_on_job: 5.74% of observations - Not many missing values, and this will be interesting to study since people who have worked at one place for longer may be more responsible; however, they may drive to work more often and put themselves in danger. No reason to drop the entire column.
car_age: 6.11% of observations - Older cars could get into more accidents since they are more likely to have malfunctions. Only about 6% of its values are missing, so it is better to keep this column.
All of these variables have under 10% of their values missing, and from the graph, we can see that these quantities of missing values are not alarming. Therefore, I have decided to not drop any of these variable columns outright.
Notice that summing these percentages up gives a value of about 22.96%, meaning that about 23% of the total observations in the training data have at least one missing value. I have confirmed this below:
(length(which(is.na(train_data))) / dim(train_data)[1])*100
[1] 22.96262
Therefore, deleting all rows that have at least one missing value will delete approximately 23% of the observations recorded in the dataset, which could significantly bias the results.
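For a direct check on that roughly 23% figure, complete.cases() counts the rows with at least one missing value (the ratio above divides missing cells by the number of rows, which only approximates the row-level share); a minimal sketch:
# Share of rows (observations) with at least one missing value
mean(!complete.cases(train_data)) * 100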
However, the missingness map shows that only about 1% of all individual data entries are missing, implying that imputing these missing values should introduce little to no bias into the results. I will still test below whether the missing data is missing completely at random.
Missing Completely at Random?
I will run an MCAR test to evaluate whether the missing values are indeed missing completely at random.
naniar::mcar_test(train_data)
# A tibble: 1 × 4
statistic df p.value missing.patterns
<dbl> <dbl> <dbl> <int>
1 402. 390 0.323 18
The null hypothesis is that the data is missing completely at random. The p-value of 0.323 is greater than 0.05, so we fail to reject the null hypothesis that the data is missing completely at random. Note that this does not fully prove that the data is missing completely at random, but rather that there is no strong evidence against this claim. Therefore, implementing median imputation or deleting observations will likely not bias the results from our regressions.
Imputation of Missing Variables
I have decided to impute all of the missing values using standard median imputation. I did consider the mice algorithm, which takes the relationships between the variables into account when imputing missing values. However, regression- or mean-based imputation could cause some bias, since we found that many of the variables are heavily skewed. Imputing with the median, although it does not implement the same rigorous algorithm as mice, is more resistant to the skewness of the data, leading to less biased coefficient estimates for our regressions.
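For reference, the same median imputation can be written more compactly with dplyr::across(); this is an equivalent sketch of the per-variable chunk that follows, not the code actually run.
# Equivalent, more compact median imputation over the five affected variables
library(dplyr)
train_data <- train_data |>
  mutate(across(c(driver_age, income, home_value, years_on_job, car_age),
                \(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)))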
# Median imputation
train_data$driver_age[is.na(train_data$driver_age)] <-
median(train_data$driver_age, na.rm = T)
train_data$income[is.na(train_data$income)] <-
median(train_data$income, na.rm = T)
train_data$home_value[is.na(train_data$home_value)] <-
median(train_data$home_value, na.rm = T)
train_data$years_on_job[is.na(train_data$years_on_job)] <-
median(train_data$years_on_job, na.rm = T)
train_data$car_age[is.na(train_data$car_age)] <-
median(train_data$car_age, na.rm = T)
After imputing all of the missing values in the dataset, I will now investigate any outlier values and address them in the next section.
2.2 Outlier Values
Now, I must deal with any outlier values that are present in the dataset. To observe which variables have outliers, below I have displayed box plots of the numerical independent variables in the dataset.
ggplot(data = df_melted_num,
mapping = aes(x = factor(variable), y = value)) +
geom_boxplot() +
facet_wrap(facets = ~variable, scale = "free")Box Plot Analysis:
It’s clear that almost all of the numerical predictors have some outliers, which we will need to address to reduce the bias those values might introduce into the regressions.
I do not want to delete the observations with outliers or impute those values with the mean or median, because there are a substantial number of outliers in the data.
Instead, I will transform the variables with large outliers in order to reduce their influence on the data. These variables include:
car_value, home_value, income, past_claims_amount, policy_tenure, and commute_time.
2.2.1 Transforming Variables
Below I have performed log transformations on the variables that are heavily right-skewed. Note that some of the original values for these variables are zero, meaning that taking the log would produce negative infinity, which would greatly disrupt our results. Therefore, I converted any negative-infinity values to zeros.
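An alternative worth noting: log1p(x) = log(1 + x) maps zero to zero and avoids the negative-infinity problem altogether. The sketch below uses a hypothetical log_income_alt column purely for illustration; the zero-replacement approach described above is what the chunk that follows actually uses.
# Alternative to replacing -Inf: log1p() handles zero values directly
train_data$log_income_alt <- log1p(train_data$income)  # hypothetical column, for illustration only
summary(train_data$log_income_alt)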
# Car Value
train_data$log_car_value <- log(train_data$car_value)
# home value
train_data$log_home_value <- log(train_data$home_value)
train_data$log_home_value[train_data$log_home_value == -Inf] <- 0
# income
train_data$log_income <- log(train_data$income)
train_data$log_income[train_data$log_income == -Inf] <- 0
# Past claims amount
train_data$log_past_claims_amount <- log(train_data$past_claims_amount)
train_data$log_past_claims_amount[
train_data$log_past_claims_amount == -Inf] <- 0
# Policy Tenure
train_data$log_policy_tenure <- log(train_data$policy_tenure)
# Commute Time
train_data$log_commute_time <- log(train_data$commute_time)
Below are the updated box plots of the log-transformed variables that I specified previously.
clean_df_num_log <- train_data |>
dplyr::select(log_car_value, log_home_value, log_income,
log_past_claims_amount, log_policy_tenure, log_commute_time)
# Create a melted dataset to display the histograms of the quantitative
# variables
df_melted_num_log <- reshape2::melt(data = clean_df_num_log)
ggplot(data = df_melted_num_log,
mapping = aes(x = factor(variable), y = value)) +
geom_boxplot() +
facet_wrap(facets = ~variable, scale = "free")From the box plots above, applying log transformations to those six variables appears to have reduced the number of high outliers that was causing many of the variables to be right-skewed. However, notice that there are some new low outliers especially for log_income which could be a problem. Also the distributions of some of the variables look odd which may be a problem for our regressions.
With all of the missing values imputed and the outlier values dealt with, we can finally take a look at our new summary statistics.
stargazer(train_data, type = "text", covariate.labels = labels,
median = T, title = "Updated Summary Statistics",
omit.summary.stat = "N", notes = "N = 6,528")
Updated Summary Statistics
========================================================================
Statistic Mean St. Dev. Min Median Max
------------------------------------------------------------------------
Crash Indicator 0.264 0.441 0 0 1
Crash Cost 1,466.616 4,545.654 0.000 0.000 107,586.100
Driver Age 44.853 8.649 16 45 81
Vehicle Value 15,641.760 8,381.489 1,500 14,370 69,740
Vehicle Age in years 8.296 5.547 -3 8 27
Vehicle Use 0.631 0.483 0 1 1
Past Claims count 0.799 1.159 0 0 5
Children at Home 0.719 1.106 0 0 5
Home Value 154,658.200 125,197.500 0 160,680 885,282
Annual Income 61,195.250 46,386.770 0 53,276 367,030
Teen Drivers count 0.171 0.509 0 0 4
Marital Status 0.600 0.490 0 1 1
MVR Points 1.700 2.145 0 1 13
Past Claims 4,119.320 8,924.665 0 0 57,037
Single Parent 0.131 0.338 0 0 1
Red Car Indicator 0.292 0.455 0 0 1
License Revoked 0.127 0.333 0 0 1
Driver Gender 0.465 0.499 0 0 1
Policy Tenure in years 5.367 4.150 1 4 25
Commute Time/Distance 33.442 15.966 5 33 142
Urbanicity 0.794 0.405 0 1 1
Years on Job 10.539 3.986 0 11 23
log_car_value 9.491 0.629 7.313 9.573 11.153
log_home_value 8.758 5.495 0.000 11.987 13.694
log_income 10.005 2.991 0.000 10.883 12.813
log_past_claims_amount 3.397 4.320 0.000 0.000 10.951
log_policy_tenure 1.300 0.952 0.000 1.386 3.219
log_commute_time 3.361 0.608 1.609 3.497 4.956
------------------------------------------------------------------------
N = 6,528
3 Build Models
Now it’s time to construct regression models on the training data that can accurately predict whether or not an insurance customer has been in a crash, and if so, predict how much the crash will cost.
To make predictions on whether an individual has been in a crash, I will use logistic regression because the dependent variable is binary (in a crash or not in a crash). I will construct three different logistic regression models, interpret their results, and then compare their confusion matrices to determine which model is the best.
For predicting the cost of crashes, I will create a full multiple linear regression model and then use stepAIC backward selection to create the second regression model. For these regressions, I will subset the training data so that I only run the regressions on observations of people who have been in car accidents. I will compare the coefficient estimates, their prediction accuracy metrics, and their AIC values to determine which model is the best.
3.1 Logistic Regression
train_data$educ_level <- factor(train_data$educ_level, ordered = F)
train_data$car_type <- factor(train_data$car_type, ordered = F)
3.1.1 Logistic Model 1
In my first logistic regression model I will manually select variables that I believe to most accurately predict whether a customer is involved in a crash. My estimating equation for model 1 is seen below:
\[ \begin{aligned} P(crash.flag_i=1)=\;& \beta_0+\beta_1 \cdot driver.age_{i}+\beta_2 \cdot car.age_{i}+ \\ &\beta_3 \cdot past.claims.count_{i}+\beta_4 \cdot income_{i} + \\ &\beta_5 \cdot teen.drivers_{i}+\beta_6 \cdot license.revoked_{i} +\\ &\beta_7 \cdot policy.tenure_{i}+u_i \end{aligned} \]
Variable Reasons
In my explanatory data analysis, I identified driver_age and income as two main variables that I wanted to explore.
I also included past_claims_count and license_revoked because, in theory, people who have filed more claims in the past and who have had their license revoked may be more likely to get into accidents.
I also included teen_drivers because younger drivers are assumed to be less responsible, so customers with more teen drivers may have a higher likelihood of getting into a crash. Lastly, I included policy_tenure as a variable that may have the opposite effect on crash likelihood from the previous variables: people who have been customers for longer may be more responsible drivers and less likely to be involved in a crash.
I have run the first regression below, and I will compare its regression output to the next two models.
log_reg1 <- glm(
data = train_data,
formula = crash_flag ~ driver_age + car_age + past_claims_count + income +
teen_drivers + license_revoked + policy_tenure,
family = binomial(link = "logit")
)
3.1.2 Logistic Model 2
The next model I will create will include every variable in the data set. My goal here is to generate a model and then use backward selection to find the best model with the lowest AIC. The estimating equation for the full model is seen below:
\[ \begin{aligned} P(crash.flag_i=1) = \;& \beta_0+\beta_1\cdot driver.age_{i}+\beta_2\cdot car.value_{i}+\beta_3\cdot car.age_{i}+ \\ &\delta_2\cdot car.type2_i+\cdots+\delta_6\cdot car.type6_i+ \\ &\beta_4\cdot car.use_i+\beta_5\cdot past.claims.count_i+ \\ &\gamma_2\cdot educ.level2_i+\cdots+\gamma_5\cdot educ.level5_i+ \beta_6\cdot num.children.home_i+\\ & \beta_7\cdot home.value_i+\beta_8\cdot income_i+ \sigma_2\cdot job.category2_i+\cdots+\\ & \sigma_9\cdot job.category9_i +\beta_9\cdot teen.drivers_i+\beta_{10}\cdot married_i+ \\ &\beta_{11}\cdot mvr.points_i +\beta_{12}\cdot past.claims.amount_i+\beta_{13}\cdot single.parent_i+\\ &\beta_{14}\cdot red.car_i+\beta_{15}\cdot license.revoked_i+\beta_{16}\cdot gender_i+\beta_{17}\cdot policy.tenure_i+\\ &\beta_{18}\cdot commute.time_i+\beta_{19}\cdot urban.rural_i+\beta_{20}\cdot years.on.job_i+ \\& \beta_{21}\cdot log.car.value_i+\beta_{22}\cdot log.home.value_i+ \\& \beta_{23}\cdot log.income_i+\beta_{24}\cdot log.past.claims.amount_i+\\ &\beta_{25}\cdot log.policy.tenure_i+\beta_{26}\cdot log.commute.time_i+u_i \end{aligned} \]
\(\delta_2...\delta_6\) represent the coefficients for each car type dummy variable.
\(\gamma_2...\gamma_5\) represent the coefficients for each education level dummy variable.
\(\sigma_2...\sigma_9\) represent the coefficients for each job category dummy variable.
log_reg2 <- glm(data = train_data,
formula = crash_flag ~ . -crash_cost,
family = binomial(link = "logit"))3.1.3 Logistic Model 3
Now, I will use backward selection from my full model to develop my third logistic regression model. Below is the estimating equation for this model:
\[ \begin{aligned} P(crash.flag_i=1) = \;& \beta_0+\delta_2\cdot car.type2_i+\cdots+\delta_6\cdot car.type6_i+ \\ &\beta_1\cdot car.use_i+ \gamma_2\cdot educ.level2_i+\cdots+\gamma_5\cdot educ.level5_i+ \\ & \beta_2\cdot income_i+\sigma_2\cdot job.category2_i+\cdots+\sigma_9\cdot job.category9_i +\\& \beta_3\cdot teen.drivers_i+\beta_4\cdot married_i+ \\ &\beta_5\cdot mvr.points_i +\beta_6\cdot past.claims.amount_i+\beta_7\cdot single.parent_i+\\ &\beta_8\cdot license.revoked_i+\beta_9\cdot gender_i+\\ &\beta_{10}\cdot urban.rural_i+ \beta_{11}\cdot log.car.value_i+\beta_{12}\cdot log.home.value_i+ \\& \beta_{13}\cdot log.income_i+\beta_{14}\cdot log.past.claims.amount_i+\\ &\beta_{15}\cdot log.policy.tenure_i+\beta_{16}\cdot log.commute.time_i +u_i \end{aligned} \]
back_reg <- stepAIC(object = log_reg2,
direction = c("backward"))log_reg3 <- glm(data = train_data,
formula = crash_flag ~ car_type + car_use + educ_level
+ income + job_category + teen_drivers + married
+ mvr_points + past_claims_amount + single_parent +
license_revoked + urban_rural + log_car_value +
log_home_value + log_income + log_past_claims_amount
+ log_policy_tenure + log_commute_time,
family = binomial(link = "logit"))3.1.4 Logistic Regression Coefficient Comparison
Below, I have displayed the coefficient estimates for the three logistic regression models.
# Create covariate labels for the regression table
stargazer(log_reg1, log_reg2, log_reg3, type = "text", digits = 2,
title = "Logistic Regression Results",
column.labels = c("Reg 1", "Reg 2", "Reg 3"),
dep.var.labels = c("Crash Flag"),
covariate.labels = c("Driver Age", "Car Value", "Car Age",
"Panel Truck", "Pickup", "Sports Car", "Van",
"SUV", "Car Use", "Past Claims count",
"Bachelors", "Masters", "PhD",
"Number of Children home", "Home Value",
"income", "Clerical Worker", "Doctor",
"Home Maker", "Lawyer", "Manager",
"Missing Values", "Professional", "Student",
"Teen Drivers", "Married", "MVR Points",
"Past Claims Amount", "Single Parent",
"Red Car", "License Revoked", "Gender",
"Policy Tenure", "Commute Time",
"Urban or Rural", "Years on Job",
"log Car Value", "log Home Value",
"log Income", "log Past Claims Amount",
"log Policy Tenure", "log Commute Time"))
Logistic Regression Results
========================================================
Dependent variable:
--------------------------------
Crash Flag
Reg 1 Reg 2 Reg 3
(1) (2) (3)
--------------------------------------------------------
Driver Age -0.02*** -0.004
(0.003) (0.005)
Car Value 0.0000
(0.0000)
Car Age -0.02*** 0.005
(0.01) (0.01)
Panel Truck 0.31* 0.48***
(0.18) (0.16)
Pickup 0.59*** 0.57***
(0.11) (0.11)
Sports Car 1.05*** 0.91***
(0.15) (0.12)
Van 0.56*** 0.62***
(0.14) (0.14)
SUV 0.88*** 0.74***
(0.13) (0.10)
Car Use -0.74*** -0.73***
(0.10) (0.10)
Past Claims count 0.38*** 0.04
(0.02) (0.05)
Bachelors -0.46*** -0.44***
(0.10) (0.09)
Masters -0.29 -0.26
(0.18) (0.16)
PhD -0.23 -0.19
(0.22) (0.20)
Number of Children home 0.02
(0.04)
Home Value 0.0000
(0.0000)
income -0.0000*** -0.0000** -0.0000**
(0.0000) (0.0000) (0.0000)
Clerical Worker -0.0005 -0.003
(0.12) (0.12)
Doctor -1.22*** -1.23***
(0.34) (0.34)
Home Maker -0.37** -0.40**
(0.19) (0.18)
Lawyer -0.34 -0.35*
(0.21) (0.21)
Manager -0.92*** -0.94***
(0.16) (0.16)
Missing Values -0.42** -0.43**
(0.21) (0.21)
Professional -0.23* -0.24*
(0.13) (0.13)
Student -0.50*** -0.47***
(0.17) (0.16)
Teen Drivers 0.40*** 0.46*** 0.47***
(0.05) (0.07) (0.06)
Married -0.48*** -0.44***
(0.10) (0.09)
MVR Points 0.10*** 0.10***
(0.02) (0.02)
Past Claims Amount -0.0000*** -0.0000***
(0.0000) (0.0000)
Single Parent 0.39*** 0.46***
(0.12) (0.11)
Red Car -0.02
(0.10)
License Revoked 0.89*** 1.01*** 1.02***
(0.08) (0.10) (0.10)
Gender 0.19
(0.13)
Policy Tenure -0.05*** -0.02
(0.01) (0.02)
Commute Time 0.01
(0.01)
Urban or Rural 2.33*** 2.32***
(0.13) (0.13)
Years on Job 0.01
(0.01)
log Car Value -0.46*** -0.34***
(0.14) (0.06)
log Home Value -0.03** -0.03***
(0.01) (0.01)
log Income -0.07*** -0.06***
(0.02) (0.02)
log Past Claims Amount 0.07*** 0.09***
(0.02) (0.01)
log Policy Tenure -0.18** -0.24***
(0.09) (0.03)
log Commute Time 0.23 0.36***
(0.15) (0.06)
Constant -0.06 1.35 0.14
(0.16) (1.22) (0.65)
--------------------------------------------------------
Observations 6,528 6,528 6,528
Log Likelihood -3,427.87 -2,885.99 -2,890.02
Akaike Inf. Crit. 6,871.74 5,857.98 5,844.04
========================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Before interpreting the coefficients, it is important to discuss the log-likelihood and AIC metrics that we see at the bottom of the regression table.
The log-likelihood is used to calculate the AIC for logistic regressions, and lower AIC values indicate a better model. The log-likelihood itself is also a measure of a model's performance, with higher (less negative) values being more desirable. According to the regression results, model 3 has the lowest AIC value of 5,844 and model 2 has the highest log-likelihood value of -2,885.99.
It is difficult to decide which of these models is the best based on these measures, so I will decide after constructing the confusion matrices for each model.
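As a sanity check on how these two metrics relate, AIC can be recomputed from the log-likelihood as \(AIC = 2k - 2\cdot logLik\), where k is the number of estimated parameters; a minimal sketch for model 2:
# Recompute AIC for model 2 from its log-likelihood and parameter count
ll <- logLik(log_reg2)          # log-likelihood, with its df (parameter count) attribute
k  <- attr(ll, "df")            # number of estimated parameters
2 * k - 2 * as.numeric(ll)      # AIC by hand
AIC(log_reg2)                   # built-in AIC; should agree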
Coefficient Interpretations:
I will interpret some of the coefficient estimates for model 2, since it is one of the two best models for prediction. Because there are so many variables in these regressions, I will interpret only a few of them, making sure to cover both positive and negative coefficient estimates.
Coefficient Sign
license_revoked (+): The coefficient sign is positive, which is expected since people who have had their license revoked are more likely to get into an accident.
mvr_points (+): The coefficient sign is positive, which makes sense because people with more tickets or traffic violations are more likely to get into a crash.
married (-): The coefficient sign is negative, suggesting that people who are married are less likely to get into an accident. This makes sense because we traditionally think of married people as more responsible, and thus safer drivers.
Economic magnitude
license_revoked: \(\beta_{15}=1.01\): When a customer has had their license revoked before, the log-odds of getting into a crash increase by 1.01, holding all else constant. In other words, taking \(e^{1.01}=2.75\), the odds of getting into a crash for a customer who has had their license revoked are about 2.75 times higher. This is a major result, because it means that people who have had their license revoked have more than double the odds of getting into an accident compared to people who have not. mvr_points: \(\beta_{11}=0.10\): For every additional ticket a customer receives, the log-odds of getting into a crash increase by about 0.10, holding all other factors constant. Alternatively, taking \(e^{0.10}=1.105\), each additional ticket multiplies the odds of a crash by about 1.105. The per-ticket effect is small, but if a customer has many violations, we can predict that they will be much more likely to get into an accident. married: \(\beta_{10}=-0.48\): The log-odds of a customer getting into an accident decrease by 0.48 when the customer is married rather than single. Alternatively, taking \(e^{-0.48}=0.62\), the odds of getting into an accident are multiplied by a factor of 0.62, a decrease of about 38%, for married customers. This is a useful result because it helps us assess how likely a customer is to get into an accident based on their marital status.
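The odds-ratio conversions used above can be reproduced by exponentiating the fitted coefficients from model 2 (assuming the coefficient names match the column names); a minimal sketch:
# Convert selected log-odds coefficients from model 2 into odds ratios
round(exp(coef(log_reg2)[c("license_revoked", "mvr_points", "married")]), 3)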
Statistical Significance
- The coefficient estimates for license_revoked, mvr_points, and married are all statistically significant at the \(\alpha=0.01\) level. This means that these estimates most likely did not come about by chance, and thus we can be very confident in them.
3.2 Linear Regressions
Now, we will shift our attention to predicting accident costs for customers who got into a car crash. The cost of an accident is a continuous numerical variable, so we will use a multiple linear regression model to predict it. Also, since we are only focusing on people involved in car accidents, I have created a new data frame that only includes observations of people who were involved in an accident.
df_crash <- train_data |> dplyr::filter(crash_flag == 1)
Below is a histogram of the cost of crashes, which is very skewed to the right.
ggplot(data = df_crash,
mapping = aes(x = crash_cost)) +
geom_histogram(fill = "blue", bins = 20) +
labs(title = "Histogram of the Cost of Crashes",
x = "Crash Costs")Use a log transformation on the dependent variable crash_cost to address its outliers and control for the skewness.
df_crash$log_crash_cost <- log(df_crash$crash_cost)
ggplot(data = df_crash,
mapping = aes(x = log_crash_cost)) +
geom_histogram(fill = "blue", bins = 20) +
labs(title = "Histogram of Log of Crash Costs",
x = "Log of Crash Costs")The histogram illustrates a much more symmetric distribution of the costs of the crashes. Now let’s run some multiple linear regressions using the log transformation of the crash costs to make predictions on how much a crash will cost.
3.2.1 OLS Linear Regression Model 1
The first model I will build is an OLS multiple linear regression model that contains all of the variables to predict the cost of accidents. I will then use backward selection to construct the second linear regression model. My estimating equation for the full regression model is the following:
\[ \begin{aligned} log.crash.cost_i = \;& \beta_0+\beta_1\cdot driver.age_{i}+\beta_2\cdot car.value_{i}+\beta_3\cdot car.age_{i}+ \\ &\delta_2\cdot car.type2_i+\cdots+\delta_6\cdot car.type6_i+ \\ &\beta_4\cdot car.use_i+\beta_5\cdot past.claims.count_i+ \\ &\gamma_2\cdot educ.level2_i+\cdots+\gamma_5\cdot educ.level5_i+ \beta_6\cdot num.children.home_i+\\ & \beta_7\cdot home.value_i+\beta_8\cdot income_i+ \sigma_2\cdot job.category2_i+\cdots+\\ & \sigma_9\cdot job.category9_i +\beta_9\cdot teen.drivers_i+\beta_{10}\cdot married_i+ \\ &\beta_{11}\cdot mvr.points_i +\beta_{12}\cdot past.claims.amount_i+\beta_{13}\cdot single.parent_i+\\ &\beta_{14}\cdot red.car_i+\beta_{15}\cdot license.revoked_i+\beta_{16}\cdot gender_i+\beta_{17}\cdot policy.tenure_i+\\ &\beta_{18}\cdot commute.time_i+\beta_{19}\cdot urban.rural_i+\beta_{20}\cdot years.on.job_i+ \\& \beta_{21}\cdot log.car.value_i+\beta_{22}\cdot log.home.value_i+ \\& \beta_{23}\cdot log.income_i+\beta_{24}\cdot log.past.claims.amount_i+\\ &\beta_{25}\cdot log.policy.tenure_i+\beta_{26}\cdot log.commute.time_i+u_i \end{aligned} \]
lin_full <- lm(data = df_crash, formula = log_crash_cost ~ . -crash_flag
-crash_cost)
3.2.2 OLS Linear Regression Model 2
The second model I will build is an OLS multiple linear regression model. I will use backward selection using stepAIC to determine the best linear model.
back_linear <- MASS::stepAIC(object = lin_full,
direction = c("backward"))After performing backward selection, the model with the lowest AIC contained only the following five variables:
married, mvr_points, gender, log_car_value, and log_home_value
Based on the results from using stepAIC, my estimating equation for model 2 is the following:
\[ \begin{aligned} log.crash.cost_i = \;&\beta_0+\beta_1\cdot married_i + \beta_2\cdot mvr.points_i + \beta_3\cdot gender_i\\&+\beta_4\cdot log.car.value_i+\beta_5\cdot log.home.value_i + u_i \end{aligned} \]
lin_reg1 <- lm(data = df_crash,
formula = log_crash_cost ~ married + mvr_points + gender
+ log_car_value + log_home_value)
3.2.3 Linear Regression Coefficient Comparisons
Now, we will compare the coefficient results for the two multiple linear regressions built with the full specification and with stepAIC backward selection. The coefficient estimate results are seen in the table below.
stargazer(lin_full, lin_reg1, type = 'text',
title = "Regression Results for Linear Models",
digits = 2, column.labels = c("Full Model", "stepAIC Model"),
dep.var.labels = c("Log Cost of Crash"),
covariate.labels = c("Driver Age", "Car Value", "Car Age",
"Panel Truck", "Pickup", "Sports Car",
"Van", "SUV", "Car Use",
"Past Claims count", "Bachelors",
"Masters", "PhD", "Number of Children home",
"Home Value", "income", "Clerical Worker",
"Doctor", "Home Maker", "Lawyer", "Manager",
"Missing Values", "Professional", "Student",
"Teen Drivers", "Married", "MVR Points",
"Past Claims Amount", "Single Parent",
"Red Car", "License Revoked", "Gender",
"Policy Tenure", "Commute Time",
"Urban or Rural", "Years on Job",
"log Car Value", "log Home Value",
"log Income", "log Past Claims Amount",
"log Policy Tenure", "log Commute Time"))
Regression Results for Linear Models
=====================================================================
Dependent variable:
---------------------------------------------
Log Cost of Crash
Full Model stepAIC Model
(1) (2)
---------------------------------------------------------------------
Driver Age 0.002
(0.002)
Car Value -0.0000
(0.0000)
Car Age -0.005
(0.01)
Panel Truck 0.12
(0.11)
Pickup 0.03
(0.07)
Sports Car 0.03
(0.09)
Van -0.04
(0.09)
SUV 0.07
(0.08)
Car Use 0.01
(0.06)
Past Claims count -0.04
(0.03)
Bachelors -0.02
(0.06)
Masters 0.13
(0.11)
PhD 0.18
(0.14)
Number of Children home 0.01
(0.02)
Home Value -0.0000
(0.0000)
income -0.0000
(0.0000)
Clerical Worker -0.005
(0.07)
Doctor 0.06
(0.24)
Home Maker -0.003
(0.11)
Lawyer -0.02
(0.13)
Manager -0.01
(0.10)
Missing Values -0.04
(0.13)
Professional 0.06
(0.08)
Student 0.11
(0.09)
Teen Drivers -0.05
(0.04)
Married -0.10* -0.10**
(0.06) (0.05)
MVR Points 0.02* 0.02**
(0.01) (0.01)
Past Claims Amount 0.0000
(0.0000)
Single Parent 0.04
(0.07)
Red Car 0.02
(0.06)
License Revoked -0.08
(0.06)
Gender 0.09 0.06
(0.08) (0.04)
Policy Tenure 0.01
(0.01)
Commute Time -0.0003
(0.004)
Urban or Rural 0.05
(0.09)
Years on Job -0.01
(0.01)
log Car Value 0.26*** 0.15***
(0.08) (0.03)
log Home Value 0.01* 0.01
(0.01) (0.004)
log Income 0.01
(0.01)
log Past Claims Amount 0.005
(0.01)
log Policy Tenure -0.04
(0.05)
log Commute Time -0.002
(0.09)
Constant 5.73*** 6.78***
(0.70) (0.27)
---------------------------------------------------------------------
Observations 1,722 1,722
R2 0.04 0.02
Adjusted R2 0.01 0.02
Residual Std. Error 0.79 (df = 1679) 0.78 (df = 1716)
F Statistic 1.52** (df = 42; 1679) 8.57*** (df = 5; 1716)
=====================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Coefficient Interpretations
I will use the second model for coefficient interpretations because it has the lowest AIC, and I will most likely use this model for predictions on the evaluation dataset.
Coefficient Signs
married has a negative coefficient, indicating that crashes involving unmarried people may be more expensive than crashes involving married people. mvr_points has a positive coefficient, which makes sense because people who have lots of traffic tickets are likely to pay more when they get into an accident. gender has a positive coefficient, which makes sense because men are probably more dangerous drivers and thus have to pay more when they get into an accident. log_car_value has a positive coefficient, which is expected since more expensive cars will result in a higher payout when crashes occur. log_home_value has a positive coefficient, which is interesting because we normally think of wealthier people as more responsible drivers; however, a higher home value could still play a role in how expensive a crash becomes when one does occur.
Economic Magnitude
married: \(\beta_1=-0.10\): Since the dependent variable is the log of the cost of the crashes, the interpretations are in percentage terms. On average, married people pay about 10% less than unmarried people in the event of an accident, holding all else constant. This is a fairly sizable effect, and it lets us predict that crashes involving married customers will probably not be as expensive. mvr_points: \(\beta_2=0.02\): On average, for every additional violation a customer has, the cost of a crash increases by about 2%, holding everything else constant. The per-violation effect is small, but if we know that someone has a large number of tickets, this can help predict the cost of the crash. gender: \(\beta_3=0.06\): In the event of a crash, if the customer is male, the cost of the crash will be about 6% higher than if the customer is female, holding everything else constant. This makes sense because men are usually seen as more reckless and thus face more severe penalties when involved in accidents. log_car_value: \(\beta_4=0.15\): Because both variables are in logs, this is an elasticity: for every 1% increase in car value, the cost of a crash increases by about 0.15%, holding all else constant. This still tells us that crashes involving more expensive cars will tend to result in higher payouts. log_home_value: \(\beta_5=0.01\): For every 1% increase in the customer's home value, the cost of a crash increases by about 0.01%, holding all else constant. This is a very small effect in terms of magnitude.
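These percentage interpretations can be reproduced from the fitted model; a minimal sketch using lin_reg1, with the exp()-based conversion for the dummy and count variables and the log-log coefficient read directly as an elasticity:
# Approximate percentage effects implied by the log-cost model (lin_reg1)
b <- coef(lin_reg1)
(exp(b["married"]) - 1) * 100     # % change in crash cost, married vs. unmarried
(exp(b["mvr_points"]) - 1) * 100  # % change per additional MVR point
b["log_car_value"]                # elasticity: % change in cost per 1% change in car value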
Statistical Significance
The coefficient estimates for the variables married, mvr_points, and log_car_value are all statistically significant at the \(\alpha=0.05\) level. This indicates that these estimates are unlikely to have occurred by chance, and thus we are confident in their accuracy.
The estimates for gender and log_home_value are not statistically significant, meaning we are less confident in their accuracy, as they may have arisen by chance.
4 Select Models
4.1 Logistic Regression Model Selection
I will construct confusion matrices for all three models and compare them to determine which logistic regression is the best model. Before constructing the confusion matrices, I will first make predictions on the training data using the three logistic regressions.
# Predict probabilities for each regression model
train_data$pred_log1 <- predict(log_reg1, type = "response")
train_data$pred_log2 <- predict(log_reg2, type = "response")
train_data$pred_log3 <- predict(log_reg3, type = "response")
# Create Threshold at 0.5
threshold = 0.5
# Predicted classes for the observations using the threshold (1 for crash
# and 0 for no crash)
train_data$logit1_class <- ifelse(train_data$pred_log1 >= threshold, 1, 0)
train_data$logit2_class <- ifelse(train_data$pred_log2 >= threshold, 1, 0)
train_data$logit3_class <- ifelse(train_data$pred_log3 >= threshold, 1, 0)
Now that we have our predictions for the three logistic regression models, we can construct the confusion matrices to evaluate which one has the best prediction accuracy.
Diagnostic Testing
library(caret)
conf1<- confusionMatrix(as.factor(train_data$logit1_class),
as.factor(train_data$crash_flag), positive = "1")
conf2 <- confusionMatrix(as.factor(train_data$logit2_class),
as.factor(train_data$crash_flag), positive = "1")
conf3 <- confusionMatrix(as.factor(train_data$logit3_class),
as.factor(train_data$crash_flag), positive = "1")conf_table <- data.frame(
Model = c("Model 1", "Model 2", "Model 3"),
Accuracy = c(conf1$overall["Accuracy"],
conf2$overall["Accuracy"],
conf3$overall["Accuracy"]),
Classification_Error = c(1 - conf1$overall["Accuracy"],
1 - conf2$overall["Accuracy"],
1 - conf3$overall["Accuracy"]),
Sensitivity = c(conf1$byClass["Sensitivity"],
conf2$byClass["Sensitivity"],
conf3$byClass["Sensitivity"]),
Specificity = c(conf1$byClass["Specificity"],
conf2$byClass["Specificity"],
conf3$byClass["Specificity"]),
Precision = c(conf1$byClass["Pos Pred Value"],
conf2$byClass["Pos Pred Value"],
conf3$byClass["Pos Pred Value"])
)
conf_table |> kable(caption = "Comparison of the Confusion Matrix Metrics",
digits = 3)
| Model | Accuracy | Classification_Error | Sensitivity | Specificity | Precision |
|---|---|---|---|---|---|
| Model 1 | 0.746 | 0.254 | 0.165 | 0.954 | 0.561 |
| Model 2 | 0.794 | 0.206 | 0.435 | 0.923 | 0.670 |
| Model 3 | 0.792 | 0.208 | 0.429 | 0.922 | 0.664 |
Confusion Matrix Analysis
Accuracy
- Model 2 has a slightly higher accuracy than the other models, at 79.4% correct classifications of whether someone got into an accident or not.
Classification Error
- Classification error is just one minus the accuracy, so Model 2 also has the lowest classification error.
Sensitivity (True positive Rate)
- Model 2 has the highest sensitivity at 43.5%. This means for the people who actually got into accidents, our model only correctly classified 43.5% as getting into an accident.
Specificity (True negative Rate)
- Model 1 has the highest specificity at 95.4%, while Model 2 has a specificity of about 92.3%. This means that out of the customers who did not get into a crash, model 1 correctly identified 95.4% of them as not getting into an accident, and model 2 correctly classified 92.3%.
Precision (Positive Predicted Value)
- Model 2 has the highest precision value at about 67%. This means that out of the observations our model classified as getting into an accident, about 67% of those classifications were correct.
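For reference, all of these metrics can be recomputed directly from the raw confusion-matrix counts; a minimal sketch for model 2:
# Recompute model 2's metrics from its confusion-matrix counts
tab <- table(Predicted = train_data$logit2_class, Actual = train_data$crash_flag)
TP <- tab["1", "1"]; TN <- tab["0", "0"]; FP <- tab["1", "0"]; FN <- tab["0", "1"]
c(accuracy    = (TP + TN) / sum(tab),
  sensitivity = TP / (TP + FN),
  specificity = TN / (TN + FP),
  precision   = TP / (TP + FP))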
Based on the above metrics, I have decided to choose Model 2 to make predictions on the evaluation data. Model 2 has the highest log-likelihood, the highest accuracy, the highest sensitivity, and the highest precision. Therefore, it will likely be the most suitable for prediction on the evaluation dataset.
4.2 Linear Regression Model Selection
I will compare the mean squared error, F-statistic values, and the R-squared values for the two linear regression models I built. I will then graph their residual plots to compare which of the models better follow the Gauss-Markov assumptions. After considering all of the diagnostic tests, I will make a decision on which linear regression model to use to predict the cost of the crashes in the evaluation data.
diag_table <- data.frame(
model = c("Full Model", "Model 2"),
MSE = c(mean(lin_full$residuals**2), mean(lin_reg1$residuals**2)),
R_squared = c((summary(lin_full))$adj.r.squared,
(summary(lin_reg1))$adj.r.squared),
f_stat = c((summary(lin_full))$fstatistic[1],
(summary(lin_reg1))$fstatistic[1])
)
diag_table |> kable(caption = "Comparison of Diagonostic Testing",
digits = 2)| model | MSE | R_squared | f_stat |
|---|---|---|---|
| Full Model | 0.61 | 0.01 | 1.52 |
| Model 2 | 0.61 | 0.02 | 8.57 |
MSE
- The full model has a slightly lower mean squared error than the second model; however, the difference is tiny, and the two values appear identical when rounded to two decimal places in the table above.
R-Squared
- Both adjusted R-squared values are very low, with the full model at 0.01 and model 2 at 0.02. This means the models only explain about 1-2% of the variability observed in the cost of the crashes.
F-statistic
- Both F-statistic values are relatively low; however, the F-statistic for the second model (about 8.57) is higher than that of the full model (1.52). The F-statistic for model 2 is statistically significant at the \(\alpha=0.01\) level, implying that the model as a whole is statistically significant and that we can be confident there is a relationship between the predictors and the cost of a crash.
Residual Analysis Plots
Below are the residual plots on the training data for both the full model and the updated model using backward selection.
Full Linear Model Residual Plots
autoplot(lin_full, label.size = 3, size = 1) + theme_minimal(base_size = 10)
Linear Model 2 Residual Plots
autoplot(lin_reg1, label.size = 3, size = 1) + theme_minimal(base_size = 10)
Residual Plot Analysis
The residual plots for the full and the shortened models are very similar to each other. Below, I have provided a summary of what each of the four residual plots means with regard to the models obeying the Gauss-Markov assumptions.
Residual vs Fitted
The residual vs fitted plots both show the residuals to be randomly scattered around zero with no systematic curvature.
The blue regression lines are linear around zero which indicates that the assumption of linearity holds.
Normal Q-Q
The residuals are relatively normally distributed for theoretical quantile values around zero.
However, the residuals do shift away from the normal distribution near the tails, which is likely a result of the skewness of some of the data.
Scale-Location
The scale-location plot for lin_reg1 appears to exhibit less heteroscedasticity than that of the full model. The homoscedasticity assumption appears to hold for the second model, since its scale-location plot shows roughly constant variance among the residuals.
Residuals vs Leverage
- There are no major indications of influential leverage points. Some labeled data points lie far from zero, but the blue regression line remains roughly flat, indicating they have little effect on either of the models.
Testing the VIF for both models
Our last diagnostic test for choosing between the models is to check whether either model suffers from multicollinearity by examining the variance inflation factor (VIF) for each predictor. Below are two tables of the VIF values for each model.
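For reference, the VIF for predictor \(j\) comes from regressing that predictor on all of the others:
\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\]
so a VIF of 10 means that the variance of that coefficient estimate is inflated roughly tenfold relative to the case of uncorrelated predictors.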
# Full model
vif1 <- vif(lin_full)
# Model 2
vif2 <- vif(lin_reg1)

vif1[, 1] |> kable(caption = "VIF for Full Linear Model", col.names = "VIF")

| | VIF |
|---|---|
| driver_age | 1.544652 |
| car_value | 9.990124 |
| car_age | 2.034179 |
| car_type | 8.185516 |
| car_use | 2.334061 |
| past_claims_count | 3.126323 |
| educ_level | 9.214506 |
| num_children_home | 2.360495 |
| home_value | 7.625297 |
| income | 4.401080 |
| job_category | 41.662806 |
| teen_drivers | 1.473977 |
| married | 2.350962 |
| mvr_points | 1.191089 |
| past_claims_amount | 2.808207 |
| single_parent | 2.195859 |
| red_car | 1.846706 |
| license_revoked | 1.636857 |
| gender | 3.917239 |
| policy_tenure | 7.264878 |
| commute_time | 8.029088 |
| urban_rural | 1.064013 |
| years_on_job | 2.676284 |
| log_car_value | 7.405589 |
| log_home_value | 7.000381 |
| log_income | 4.863951 |
| log_past_claims_amount | 4.838392 |
| log_policy_tenure | 7.257288 |
| log_commute_time | 7.964766 |
vif2 |> kable(caption = "VIF for Model 2", col.names = "VIF")

| | VIF |
|---|---|
| married | 1.471499 |
| mvr_points | 1.004295 |
| gender | 1.006821 |
| log_car_value | 1.017753 |
| log_home_value | 1.481523 |
VIF Analysis
Overall, the full model has higher VIF values, indicating likely multicollinearity, which inflates the standard errors of the coefficient estimates and makes them less reliable. There is one very high VIF value for job_category (about 41.7).
The VIF values for the second linear model are all very low (below 2). This suggests that the assumption of no perfect multicollinearity is not violated, so we can trust the coefficient estimates from Model 2.
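One caveat about the full-model table: for factor predictors such as job_category, car::vif() reports a generalized VIF (GVIF), and the quantity comparable to an ordinary VIF threshold is GVIF^(1/(2·Df)). A brief sketch of inspecting that column, not part of the original tables and assuming lin_full still contains the factor terms:

# For models with factor terms, car::vif() returns a matrix with columns
# GVIF, Df, and GVIF^(1/(2*Df)); the last column is on a scale comparable
# (after squaring) to an ordinary VIF.
gvif <- car::vif(lin_full)
head(sort(gvif[, "GVIF^(1/(2*Df))"], decreasing = TRUE))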
Linear Model Selection for Predictions on Evaluation Data
Based on the combination of the diagnostic tests (MSE, adjusted R-squared, and the F-statistic), the residual plots, and the VIF test for multicollinearity, I have decided to use Linear Model 2 to make predictions on the cost of car accidents in the evaluation data.
4.3 Logistic Model Evaluation Data Predictions
To make predictions for both the logistic and linear regressions on the evaluation dataset, I first had to clean the evaluation data so it would be compatible with the models trained on the training data. This process involved the same steps for renaming variables, imputing missing values, and applying log transformations that I performed on the training data. For this reason, I have hidden that code to keep the report concise and presentable; a condensed sketch of the idea is shown below.
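The following is a hypothetical sketch of the hidden cleaning step, not the exact code used. It assumes the raw evaluation columns have already been renamed to match the training variables, that numeric missing values are filled with medians, and that the log variables use a +1 offset; the actual choices in the hidden code may differ.

# Condensed sketch of the hidden evaluation-data cleaning (assumptions noted above).
impute_median <- function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)

eval_data <- eval_data |>
  dplyr::mutate(
    dplyr::across(c(income, home_value, car_value, car_age, years_on_job),
                  impute_median),
    log_car_value  = log(car_value + 1),
    log_home_value = log(home_value + 1),
    log_income     = log(income + 1)
  )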
eval_data$pred_log <- predict(log_reg2, newdata = eval_data,
                              type = "response")
eval_data$logit_class <- ifelse(eval_data$pred_log >= threshold, 1, 0)
confpred <- confusionMatrix(as.factor(eval_data$logit_class),
                            as.factor(eval_data$crash_flag),
                            positive = "1")
confpred_table <- data.frame(
model = c("Model 2"),
Accuracy = c(confpred$overall["Accuracy"]),
Classification_Error = c(1 - confpred$overall["Accuracy"]),
Sensitivity = c(confpred$byClass["Sensitivity"]),
Specificity = c(confpred$byClass["Specificity"]),
Precision = c(confpred$byClass["Pos Pred Value"])
)
confpred$table

          Reference
Prediction    0    1
         0 1105  246
         1   97  185
confpred_table |> kable(caption = "Confusion Matrix Prediction Metrics",
                        digits = 3, row.names = F)

| model | Accuracy | Classification_Error | Sensitivity | Specificity | Precision |
|---|---|---|---|---|---|
| Model 2 | 0.79 | 0.21 | 0.429 | 0.919 | 0.656 |
The table above displays the classification accuracy metrics for logistic Model 2 for the evaluation data.
Accuracy:
- Our best logistic model trained on the training data has a prediction accuracy of 79% on the evaluation data set. This means that it correctly classified 79% of the unseen customers as either getting into a crash or not getting into a crash.
Classification Error:
- The model incorrectly classified 21% of the observations in the evaluation data.
Sensitivity (True Positive Rate)
- The true positive rate is about 43%, meaning that out of the customers in the evaluation data who actually got into an accident, our logistic model correctly classified 43% of those individuals.
Specificity (True Negative Rate)
- The true negative rate is about 92%, meaning that out of the customers in the evaluation data who did not get into an accident, our logistic model correctly classified 92% of those individuals as not getting into a crash.
Precision (Positive Predictive Value)
- The positive predictive value is 65.6%, meaning that out of the customers our logistic model classified as getting into an accident, 65.6% of those insurance customers actually did get into a crash.
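As a quick sanity check, these metrics can be reproduced by hand from the confusion-matrix counts printed above; a small sketch with the counts hard-coded from that table:

# Hand-computed metrics from the confusion-matrix counts above
# (rows are predictions, columns are the reference labels).
tn <- 1105; fn <- 246; fp <- 97; tp <- 185
c(
  accuracy    = (tp + tn) / (tp + tn + fp + fn),  # ~0.790
  sensitivity = tp / (tp + fn),                   # ~0.429
  specificity = tn / (tn + fp),                   # ~0.919
  precision   = tp / (tp + fp)                    # ~0.656
)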
4.4 Linear Regression Evaluation Data Model Predictions
I created a separate data frame containing only the evaluation customers who were in a crash. I also created a log transformation of the crash cost variable so the evaluation data is compatible with our model's predictions.
df_eval_crash <- eval_data |> dplyr::filter(crash_flag == 1)
df_eval_crash$log_crash_cost <- log(df_eval_crash$crash_cost)

Now I will make predictions on the log cost of accidents for customers in the evaluation data and compare the results to the actual log values of the car accident costs.
df_eval_crash$pred_cost <- predict(lin_reg1, newdata = df_eval_crash)

I have added the predicted log crash costs to the data frame. Below is a comparison of the summary statistics for the actual log crash costs and the predicted log crash costs from Model 2.
log_cost_comp <- data.frame(
Model = c("Predictions", "Actual"),
Mean = c(mean(df_eval_crash$pred_cost), mean(df_eval_crash$log_crash_cost)),
Median = c(median(df_eval_crash$pred_cost),
median(df_eval_crash$log_crash_cost)),
Min = c(min(df_eval_crash$pred_cost), min(df_eval_crash$log_crash_cost)),
Max = c(max(df_eval_crash$pred_cost), max(df_eval_crash$log_crash_cost))
)
log_cost_comp |> kable(caption = "Summary Statistics Comparison for Predicted
                       versus Actual Log Crash Costs",
                       digits = 2)

| Model | Mean | Median | Min | Max |
|---|---|---|---|---|
| Predictions | 8.27 | 8.28 | 7.88 | 8.67 |
| Actual | 8.31 | 8.32 | 3.41 | 11.21 |
Table Analysis
- The mean and median log crash cost values are very similar, indicating that our model's predictions of the log crash costs are quite accurate.
- Using the log transformation of the crash costs in our model and for the actual crash costs helped reduce the influence of outliers, which allowed for better prediction accuracy.
- The min and max values are much more spread apart for the actual crashes, indicating potential outliers and likely skewness. However, taking the log helped mitigate these effects, which is why the mean and median values are so close to each other for the predictions and the actual values.
I have also exponentiated the log predicted costs to get a more intuitive idea of what our model predicts the typical costs of these accidents to be. Below is a table comparing the summary statistics for the exponentiated log predicted costs and the actual crash cost values.
cost_comp <- data.frame(
Model = c("Predictions", "Actual"),
Mean = c(mean(exp(df_eval_crash$pred_cost)), mean(df_eval_crash$crash_cost)),
Median = c(median(exp(df_eval_crash$pred_cost)),
median(df_eval_crash$crash_cost)),
Min = c(min(exp(df_eval_crash$pred_cost)), min(df_eval_crash$crash_cost)),
Max = c(max(exp(df_eval_crash$pred_cost)), max(df_eval_crash$crash_cost))
)
cost_comp |> kable(caption = "Summary Statistics Comparison between Actual
                   and Predicted Crash Costs for Evaluation Data", digits = 2)

| Model | Mean | Median | Min | Max |
|---|---|---|---|---|
| Predictions | 3939.48 | 3952.93 | 2632.75 | 5807.00 |
| Actual | 6270.82 | 4093.00 | 30.28 | 73783.47 |
Prediction Analysis
- The median predicted crash cost from our model is about $3,952.93, which is very close to the actual median crash cost of $4,093.
- The mean of the actual crash costs is much higher because of the high outliers in the actual data, as seen in the wide spread between the min and max values (see the note on retransformation bias after this list).
- The distribution of the exponentiated predicted values is much more symmetric, as seen in their similar median and mean values.
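Part of the gap in means also reflects retransformation bias: exponentiating predictions of the log cost estimates a median-type quantity, which understates the mean when the log-scale errors have positive variance. A hedged sketch of Duan's smearing correction, which is not part of the original analysis and assumes the training residuals of lin_reg1 are representative:

# Duan's smearing estimator: scale exp(predicted log cost) by the mean of the
# exponentiated training residuals to correct the retransformation bias.
smear_factor <- mean(exp(residuals(lin_reg1)))
df_eval_crash$pred_cost_smeared <- exp(df_eval_crash$pred_cost) * smear_factor
mean(df_eval_crash$pred_cost_smeared)  # should sit above the naive mean of ~$3,939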
Below is a comparison of the distribution of the predicted crash costs and the actual crash costs. We can see that the predictions of the car crash costs are much more normally distributed than the actual costs of the accidents.
hist1 <- ggplot(data = df_eval_crash,
mapping = aes(x = pred_cost)) +
geom_histogram(fill = "blue") +
labs(title = "Predicted Crash Costs",
x = "Predicted Crash Costs")
hist2 <- ggplot(data = df_eval_crash,
mapping = aes(x = crash_cost)) +
geom_histogram(fill = "blue") +
labs(title = "Actual Crash Costs",
x = "Actual Crash Costs")
hist1 + hist2

Conclusion
In this project, we examined an auto insurance dataset containing demographic information about customers, whether they had been in an accident, and how much the accident cost if so.
I trained three logistic regression models to predict the likelihood that someone was involved in a crash.
After comparing the logistic models using confusion matrix metrics, I determined that logistic model 2 was the best model for prediction on the evaluation data set. This model contained every variable in the dataset, including the log transformations that I created in the data cleaning section.
The following are notable variables that were statistically significantly associated with getting in more accidents:
- People who work in an urban area
- Single parents
- Unmarried people
- People who drive sports cars
- Customers who have multiple teen drivers
- People with more traffic tickets and traffic violations
This logistic model had a 79% accuracy rate, meaning that it correctly classified 79% of the observations in the evaluation data as getting into an accident or not.
I also trained two multiple linear regression models to predict the cost of crashes for customers who were involved in accidents.
After running various diagnostic tests and residual analysis, I decided that linear model 2 was the best model for prediction on the evaluation data.
The following are the most notable variables that were significantly associated with higher crash payouts:
- Unmarried customers
- People with lots of traffic tickets or traffic violations
- A higher value of the car involved
The mean and median of the predicted log crash costs are 8.27 and 8.28, respectively. After exponentiating the predicted log crash costs, the mean and median predicted crash costs are about $3,939 and $3,952, respectively.