First, empty the environment so that we can load the data into a clean workspace.
# Clear the workspace
rm(list = ls()) # Clear environment - remove all objects from your workspace
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 619491 33.1 1417732 75.8 NA 700248 37.4
## Vcells 1181940 9.1 8388608 64.0 32768 1963344 15.0
cat("\f") # Clear the console
graphics.off() # Clear all graphs
# Set working directory and path to data
# setwd("/Users/arvindsharma/Dropbox/WCAS/Econometrics/")
Now, we load the packages.
# List needed packages
packages <- c("psych", # quick summary stats for data exploration,
"stargazer", # summary stats for sharing,
"tidyverse", # data manipulation like selecting variables,
"corrplot", # correlation plots
"ggplot2", # graphing
"data.table", # reshape for graphing
"car", # vif
"MASS", # negative binomial regression
"caret", # for confusion matrix
"visdat", # finding missing values
"patchwork", # placing ggplot2 graphs together
"cowplot", # placing ggplot2 graphs together
"skimr"
)
# Install packages if not already installed, then load the packages
for (i in 1:length(packages)) {
# Install packages if they have not been installed on your machine
if (!packages[i] %in% rownames(x = installed.packages())) {
install.packages(pkgs = packages[i],
repos = "http://cran.rstudio.com/",
dependencies = TRUE
)
}
# Load all packages
library(package = packages[i],
character.only = TRUE
)
}
# Make sure which version of a command is used when it appears in two different packages
conflicted::conflict_prefer(name = "select",
winner = "dplyr")
rm(packages)
One can declare blank observations in character variables as NA when importing the raw data. This simplifies the data cleaning process a little - blank values are imported as NA rather than as empty strings.
Read the na.strings entry in ?read.csv:
- a character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields. Note that the test happens after white space is stripped from the input, so na.strings values may need their own white space stripped in advance.
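A tiny sketch of the difference, using hypothetical inline data (the txt string is made up for illustration):
txt <- "id,job\n1,teacher\n2,\n3,lawyer"
str(read.csv(text = txt)) # blank job field is read as the empty string ""
str(read.csv(text = txt, na.strings = c(""))) # blank job field is read as NA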
#setwd("/Users/arvindsharma/Dropbox/WCAS/Econometrics/")
?read.csv
df_train <- read.csv("insurance-training-data.csv" , na.strings=c("")) # treat "" as NA in character variables
df_eval <- read.csv("insurance-evaluation-data.csv", na.strings=c(""))
sapply(X = df_train,
FUN = class
)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS
## "integer" "integer" "numeric" "integer" "integer" "integer"
## YOJ INCOME PARENT1 HOME_VAL MSTATUS SEX
## "integer" "character" "character" "character" "character" "character"
## EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF
## "character" "character" "integer" "character" "character" "integer"
## CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## "character" "character" "character" "integer" "character" "integer"
## CAR_AGE URBANICITY
## "integer" "character"
skim(df_train)
| Name | df_train |
| Number of rows | 8161 |
| Number of columns | 26 |
| _______________________ | |
| Column type frequency: | |
| character | 14 |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| INCOME | 445 | 0.95 | 2 | 8 | 0 | 6612 | 0 |
| PARENT1 | 0 | 1.00 | 2 | 3 | 0 | 2 | 0 |
| HOME_VAL | 464 | 0.94 | 2 | 8 | 0 | 5106 | 0 |
| MSTATUS | 0 | 1.00 | 3 | 4 | 0 | 2 | 0 |
| SEX | 0 | 1.00 | 1 | 3 | 0 | 2 | 0 |
| EDUCATION | 0 | 1.00 | 3 | 13 | 0 | 5 | 0 |
| JOB | 526 | 0.94 | 6 | 13 | 0 | 8 | 0 |
| CAR_USE | 0 | 1.00 | 7 | 10 | 0 | 2 | 0 |
| BLUEBOOK | 0 | 1.00 | 6 | 7 | 0 | 2789 | 0 |
| CAR_TYPE | 0 | 1.00 | 3 | 11 | 0 | 6 | 0 |
| RED_CAR | 0 | 1.00 | 2 | 3 | 0 | 2 | 0 |
| OLDCLAIM | 0 | 1.00 | 2 | 7 | 0 | 2857 | 0 |
| REVOKED | 0 | 1.00 | 2 | 3 | 0 | 2 | 0 |
| URBANICITY | 0 | 1.00 | 19 | 21 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| INDEX | 0 | 1.00 | 5151.87 | 2978.89 | 1 | 2559 | 5133 | 7745 | 10302.0 | ▇▇▇▇▇ |
| TARGET_FLAG | 0 | 1.00 | 0.26 | 0.44 | 0 | 0 | 0 | 1 | 1.0 | ▇▁▁▁▃ |
| TARGET_AMT | 0 | 1.00 | 1504.32 | 4704.03 | 0 | 0 | 0 | 1036 | 107586.1 | ▇▁▁▁▁ |
| KIDSDRIV | 0 | 1.00 | 0.17 | 0.51 | 0 | 0 | 0 | 0 | 4.0 | ▇▁▁▁▁ |
| AGE | 6 | 1.00 | 44.79 | 8.63 | 16 | 39 | 45 | 51 | 81.0 | ▁▆▇▂▁ |
| HOMEKIDS | 0 | 1.00 | 0.72 | 1.12 | 0 | 0 | 0 | 1 | 5.0 | ▇▂▁▁▁ |
| YOJ | 454 | 0.94 | 10.50 | 4.09 | 0 | 9 | 11 | 13 | 23.0 | ▂▃▇▃▁ |
| TRAVTIME | 0 | 1.00 | 33.49 | 15.91 | 5 | 22 | 33 | 44 | 142.0 | ▇▇▁▁▁ |
| TIF | 0 | 1.00 | 5.35 | 4.15 | 1 | 1 | 4 | 7 | 25.0 | ▇▆▁▁▁ |
| CLM_FREQ | 0 | 1.00 | 0.80 | 1.16 | 0 | 0 | 0 | 2 | 5.0 | ▇▂▁▁▁ |
| MVR_PTS | 0 | 1.00 | 1.70 | 2.15 | 0 | 0 | 1 | 3 | 13.0 | ▇▂▁▁▁ |
| CAR_AGE | 510 | 0.94 | 8.33 | 5.70 | -3 | 1 | 8 | 12 | 28.0 | ▆▇▇▃▁ |
5 quantitative variables have missing (NA) values:

- CAR_AGE (510), i.e. 6.25% of observations.
- YOJ (454), i.e. 5.56% of observations (years on job).
- AGE (6).
- INCOME (445), i.e. 5.45% of observations.
- HOME_VAL (464), i.e. 5.69% of observations.

The first three were easily identified because R recognized them as numeric variables and automatically imputed NAs where there were blank observations. However, INCOME and HOME_VAL have $ in front of their values, so R reads them as character variables. R would then treat "missing values" as just another category, and you will have to fix that. You can remove the dollar signs with the gsub command and convert the character variables into numeric variables. I show that later in the file.

1 qualitative variable (character vector/string), JOB (526), has missing NA values too.
### NOT REQUIRED, AS SPECIFIED DURING IMPORT IN THE na.strings = c("") OPTION
# df_train<-df_train %>%
# mutate(INCOME = ifelse(INCOME == "", "NA", INCOME))
?visdat
visdat::vis_dat(df_train)
visdat::vis_miss(df_train) + theme(axis.text.x = element_text(angle = 90))
# Sum NAs across columns
dplyr::summarise(.data = df_train,
across(.cols = everything(),
.fns = ~ sum(is.na(.x))
)
) %>%
glimpse()
## Rows: 1
## Columns: 26
## $ INDEX <int> 0
## $ TARGET_FLAG <int> 0
## $ TARGET_AMT <int> 0
## $ KIDSDRIV <int> 0
## $ AGE <int> 6
## $ HOMEKIDS <int> 0
## $ YOJ <int> 454
## $ INCOME <int> 445
## $ PARENT1 <int> 0
## $ HOME_VAL <int> 464
## $ MSTATUS <int> 0
## $ SEX <int> 0
## $ EDUCATION <int> 0
## $ JOB <int> 526
## $ TRAVTIME <int> 0
## $ CAR_USE <int> 0
## $ BLUEBOOK <int> 0
## $ TIF <int> 0
## $ CAR_TYPE <int> 0
## $ RED_CAR <int> 0
## $ OLDCLAIM <int> 0
## $ CLM_FREQ <int> 0
## $ REVOKED <int> 0
## $ MVR_PTS <int> 0
## $ CAR_AGE <int> 510
## $ URBANICITY <int> 0
# Display NAs as percentage of observations
round(x = sapply(X = df_train,
FUN = function(x){100 * mean(is.na(x))}), # share of NAs, in percent
digits = 2)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS
## 0.00 0.00 0.00 0.00 0.07 0.00
## YOJ INCOME PARENT1 HOME_VAL MSTATUS SEX
## 5.56 5.45 0.00 5.69 0.00 0.00
## EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF
## 0.00 6.45 0.00 0.00 0.00 0.00
## CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 0.00 0.00 0.00 0.00 0.00 0.00
## CAR_AGE URBANICITY
## 6.25 0.00
First, I will convert the quantitative variables that were read in as qualitative by removing the $ and , signs. I will begin by creating a clean data set and fixing these variables.
df_train_clean <- df_train
df_train_clean$INCOME <- gsub(pattern = "\\$|\\,", # have to remove both $ and , - one has to put \\ before these signs, and | is the or condition
replacement = "", # replace with nothing / empty space
x = df_train_clean$INCOME # apply on INCOME vector in df_train_clean
)
df_train_clean$INCOME <- as.numeric(df_train_clean$INCOME) # replace character as numeric
# now, apply on all dollar variables - BRUTE FORCE...
df_train_clean$HOME_VAL <- gsub(pattern = "\\$|\\,", replacement = "",x = df_train_clean$HOME_VAL)
df_train_clean$BLUEBOOK <- gsub(pattern = "\\$|\\,", replacement = "",x = df_train_clean$BLUEBOOK)
df_train_clean$OLDCLAIM <- gsub(pattern = "\\$|\\,", replacement = "",x = df_train_clean$OLDCLAIM)
# finally, apply numeric conversion on these dollar variables - BRUTE FORCE...
df_train_clean$HOME_VAL <- as.numeric(df_train_clean$HOME_VAL)
df_train_clean$BLUEBOOK <- as.numeric(df_train_clean$BLUEBOOK)
df_train_clean$OLDCLAIM <- as.numeric(df_train_clean$OLDCLAIM)
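A more compact alternative (a sketch that does the same thing as the brute-force lines above): loop over a vector of the dollar-denominated column names.
# clean all four dollar columns in one loop; already-numeric columns pass through unchanged
dollar_vars <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
for (v in dollar_vars) {
  df_train_clean[[v]] <- as.numeric(gsub(pattern = "\\$|\\,", replacement = "", x = df_train_clean[[v]]))
}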
I am keeping the observations that have no missing covariates (filtering on rows, not selecting on columns). I prefer not to impute values such as means/medians. One can do a robustness check of the final model on datasets with imputed values if they create one. This leaves 6,045 observations, or about 74% of my original data. I could probably extract some more information from these observations, as some rows may only be missing unimportant variables.
I drop one observation where car age is -3 (presumably a typo), but you could have replaced it with 0 too. There are many ways to delete rows - I prefer using conditions in the subset command. I found this discrepancy when I created my summary statistics table, and chose to delete this row entirely instead of imputing a value.
# DROPPING ROWS WITH MISSING VALUES in any of the 6 variables and creating a new dataframe
df_train_clean <- dplyr::filter(.data = df_train_clean,
!is.na(CAR_AGE) & !is.na(YOJ) & !is.na(AGE) & !is.na(INCOME) & !is.na(HOME_VAL) & !is.na(JOB)
)
# DROPPING VARIABLES BY SELECTING VARIABLE NAMES so that the stargazer package can be applied to the dataframe
df_train_clean <- dplyr::select(.data = df_train_clean,
-c(INDEX)
)
# KEEPING ROWS where the Car Age variable is non-negative
table(df_train_clean$CAR_AGE<0) # only one observation / typo
##
## FALSE TRUE
## 6044 1
df_train_clean <- subset(df_train_clean, df_train_clean$CAR_AGE>=0 )
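If you preferred recoding over dropping, a one-line sketch of the alternative mentioned above (left commented out, since the row is already deleted here):
# Alternative: replace the single negative car age with 0
# df_train_clean$CAR_AGE[df_train_clean$CAR_AGE < 0] <- 0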
I will provide the summary statistics of numerical variables separately from non-numerical variables. One could also convert the non-numerical variables into factor variables coded as numbers and include them in the same table, but measures like mean and sd do not make sense for them, so I would rather display them in charts.
We have 15 quantitative variables. I use the stargazer package to create a basic summary statistics table.
About 26% of people crash their cars, and the average cost of car repairs is roughly $1,500. The positive skewness in the car crash cost variable suggests that most car damages are small, but we have some big/major crashes. Mean driver age is about 45 years, and the average distance to work is 33 (miles?). The data dictionary is not very clear on the units for this variable. An average car has been in use for 7.9 years and has a book value of about $15,000.
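To check the skewness claim directly, one can describe the damage amounts among actual crashes; a quick sketch with psych (already loaded):
# positive skew confirms a long right tail in crash costs
psych::describe(df_train_clean$TARGET_AMT[df_train_clean$TARGET_AMT > 0])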
## LABEL VARIABLES
labels_numeric <-c(
"Car Crash Dummy",
"Car Crash Cost",
"# Driving Children",
"Driver's Age",
"# Children at Home",
"Years on Job",
"Income",
"Home Value",
"Distance to Work",
"Vehicle Value",
"Time in Force",
"Claims Payout", # (Past 5 Years)
"Claims Frequency", # (Past 5 Years)
"MVR Points", # Motor Vehicle Record Points
"Car Age"
)
stargazer(df_train_clean,
type = "text", # html, latex
# out =
# summary.stat =
covariate.labels = labels_numeric, # NULL
digits = 2)
##
## ==============================================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------------------------
## Car Crash Dummy 6,044 0.26 0.44 0 1
## Car Crash Cost 6,044 1,479.66 4,553.55 0.00 85,523.65
## # Driving Children 6,044 0.17 0.52 0 4
## Driver's Age 6,044 44.63 8.71 16 81
## # Children at Home 6,044 0.74 1.13 0 5
## Years on Job 6,044 10.49 4.14 0 23
## Income 6,044 58,178.58 43,830.43 0 367,030
## Home Value 6,044 150,091.80 123,736.40 0 885,282
## Distance to Work 6,044 33.69 15.89 5 142
## Vehicle Value 6,044 15,235.58 8,041.63 1,500 65,970
## Time in Force 6,044 5.36 4.14 1 25
## Claims Payout 6,044 3,999.99 8,815.06 0 57,037
## Claims Frequency 6,044 0.78 1.15 0 5
## MVR Points 6,044 1.70 2.16 0 13
## Car Age 6,044 7.92 5.58 0 28
## --------------------------------------------------------------
We have 10 qualitative variables. Some interesting trends in the raw data follow.
Try to create a bare-bones graph first, and then embellish it.
table(df_train_clean$SEX)
##
## M z_F
## 2686 3358
# clean values for graph/tables...
df_train_clean$SEX[df_train_clean$SEX == "z_F"] <- "Female"
df_train_clean$SEX[df_train_clean$SEX == "M" ] <- "Male"
?subset # Return subsets of vectors, matrices or data frames which meet conditions.
## Help on topic 'subset' was found in the following packages:
##
## Package Library
## timeDate /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
## base /Library/Frameworks/R.framework/Resources/library
## data.table /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
##
##
## Using the first match ...
?geom_bar # There are two types of bar charts: geom_bar() and geom_col(). geom_bar() uses stat_count() by default: it counts the number of cases at each x position
# Create a bar chart with the count of people in car crash and store the output in crashes_by_gender
crashes_by_gender_0 <-
ggplot(data = subset(x = df_train_clean,
subset = TARGET_FLAG == 1 # keep rows for people who got into a crash
),
mapping = aes(x = SEX, # variable on x axis
fill = SEX # Colour related aesthetics
)
) +
geom_bar()
# par(mfrow = c(2,2))
# par(mfrow = c(1,1))
crashes_by_gender_0 # print the output
Once you have the bare-bones graph, embellish it / make it more presentable. One can add ggplot options one by one.
Please read up on ggplot options by typing ?theme in the console and reading the help file to see how to control chart elements.
Of course, in your own final report, just create one chart and not multiple charts like here (done for pedagogical purposes only).
Fix the axis labels, add a graph/plot title and notes, and store the chart in an object crashes_by_gender so that you can easily recall it in the future.
# Fix the labels - axis, add plot title and notes, and store the chart in an object so that you can easily recall it in the future.
crashes_by_gender_1 <-
crashes_by_gender_0 + labs( x = "Sex",
y = "Frequency",
title = "Women are more likely to be in a crash",
caption = "Note: Data Source is XX."
)
crashes_by_gender_1 # print your chart - I would definitely want to move the legend to the left
### Also, MULTIPLE WAYS TO ADD CHART TITLE, axis labels,...
# ggtitle("Women more likely to be in a crash") + ylab("Frequency") + xlab("") # titles and labels
I now want to shift the notes to the bottom left, using the horizontal justification hjust argument for the caption.
?element_text # In conjunction with the theme system, the element_ functions specify the display of how non-data components of the plot are drawn.
crashes_by_gender_1 + theme(plot.caption = element_text(hjust = 0) # Horizontal justification in [0, 1]
)
Note that you have to assign the result of applying theme() to the graph object crashes_by_gender back into an object (not done above) - otherwise, when you later print the object, the caption will not appear at the bottom left, even though the command gave you no errors.
Let's do that below.
crashes_by_gender_2 <-
crashes_by_gender_1 + theme(plot.caption = element_text(hjust = 0) # Horizontal justification in [0, 1]
)
crashes_by_gender_2
crashes_by_gender_3 <-
crashes_by_gender_2 + theme(legend.position = "bottom", # legend at bottom
legend.title = element_blank()
)
# crashes_by_gender_3
crashes_by_gender_4 <-
crashes_by_gender_3 + theme(legend.position = "none") # no legend
# crashes_by_gender_4
# plot the two charts - one with the legend at the bottom and one without a legend
plot_grid(crashes_by_gender_3,
crashes_by_gender_4,
labels = "AUTO")
So, in summary, we created the charts by sequentially adding elements.
# library(cowplot)
plot_grid(crashes_by_gender_0, # ggplot with data and mapping arguments, along with chart type argument
crashes_by_gender_1, # control labels
crashes_by_gender_2, # caption (and shift to bottom left)
crashes_by_gender_3, # control legend - at the bottom
crashes_by_gender_4, # control legend - remove it
labels = "AUTO")
This is an optional part and not the most important element of the chart. I control the size of the title with theme(plot.title = element_text(size=8)). Everything in a ggplot2 graph can be controlled!
?theme # Themes are a powerful way to customize the non-data components of your plots: i.e. titles, labels, fonts, background, gridlines, and legends. Themes can be used to give plots a consistent customized look. Modify a single plot's theme using theme()
g1 <- crashes_by_gender_1 + theme_bw() + theme(plot.title = element_text(size=8))
g2 <- crashes_by_gender_2 + theme_classic() + theme(plot.title = element_text(size=8))
g3 <- crashes_by_gender_3 + theme_void() + theme(plot.title = element_text(size=8))
g4 <- crashes_by_gender_4 + theme_dark() + theme(plot.title = element_text(size=8))
g1+g2+g3+g4 # requires library(patchwork)
People may feel safer in SUVs and thus drive less carefully, or maybe SUVs' center of gravity is higher than that of cars, making them easier to destabilize…
table(df_train_clean$CAR_TYPE)
##
## Minivan Panel Truck Pickup Sports Car Van z_SUV
## 1701 347 1013 714 488 1781
# clean values for graph/tables...
df_train_clean$CAR_TYPE[df_train_clean$CAR_TYPE == "z_SUV"] <- "SUV"
# create ggplot the usual way
crashes_by_car_type_0 <-
ggplot( data = subset( x = df_train_clean,
subset = TARGET_FLAG == 1 # keep rows for people who got into a crash
),
mapping = aes(x = CAR_TYPE, # variable on x axis
fill = CAR_TYPE # colour related aesthetics
)
) +
geom_bar() +
labs( x = "Car Type", # labels
y = "Frequency",
title = "SUVs are most likely to crash",
subtitle = "", # Put your subtitle here (optional) - E.G. Panel trucks least lilely to crash
caption = "Note: Car crash data."
) +
theme(plot.caption = element_text(hjust = 0)) + # shift caption to the bottom left
theme(legend.position = "none") # no legend
crashes_by_car_type_0
You can also sort the bars by height in either ascending or descending order.
There are multiple ways to reorder bar charts; just search "Order Bars in ggplot2 bar graph" on StackOverflow. One simple way is to create a function that sorts the underlying factor levels and then apply that function in ggplot, as in the sketch below.
Also, you can label the bars with the count for easier reading.
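A minimal sketch combining both ideas, assuming the cleaned data above (reorder_by_count is a hypothetical helper name):
# sort factor levels by frequency (descending) before plotting
reorder_by_count <- function(x) {
  factor(x, levels = names(sort(table(x), decreasing = TRUE)))
}
ggplot(data = subset(df_train_clean, subset = TARGET_FLAG == 1),
       mapping = aes(x = reorder_by_count(CAR_TYPE), fill = CAR_TYPE)) +
  geom_bar() +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = 1.6, color = "white", size = 3.5) +
  labs(x = "Car Type", y = "Frequency") +
  theme(legend.position = "none")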
One can use brute force or, better, use the sapply function to apply the gsub command across several columns at once.
# AUTOMATE
df_train_cleaned <- as_tibble(sapply(X = df_train_clean[c("MSTATUS", "EDUCATION", "JOB", "URBANICITY")],
FUN = function(x) {
gsub(pattern = "z_",
replacement = "",
x = x
)
}
)
)
df_train_clean[c("MSTATUS", "EDUCATION", "JOB", "URBANICITY")] <- df_train_cleaned # replacing the 4 variables with cleaned values
rm(df_train_cleaned)
# BRUTE FORCE
#df_train_clean$MSTATUS <- gsub(pattern = "z_",replacement = "",x = df_train_clean$MSTATUS)
#df_train_clean$EDUCATION <- gsub(pattern = "z_",replacement = "",x = df_train_clean$EDUCATION)
#df_train_clean$JOB <- gsub(pattern = "z_",replacement = "",x = df_train_clean$JOB)
#df_train_clean$URBANICITY<- gsub(pattern = "z_",replacement = "",x = df_train_clean$URBANICITY)
# transform raw data into a format that ggplot accepts for graphing
df_train_clean %>%
dplyr::select(JOB, TARGET_FLAG) %>% # select the 2 columns/variables
count(JOB, TARGET_FLAG) %>% # create a table of count of these 2 variables
dplyr::filter(TARGET_FLAG == 1) %>% # filter onto rows where cars crashed
ggplot(aes(x = JOB, # X axis is job category
y = n, # Y axis is count of car crashes
fill = JOB) # different colors
) +
geom_bar(stat = "identity") + # height of the bar proportional to the number of cases in each group
geom_text(aes(label=n), vjust=1.6, color="white", size=3.5) +
ggtitle("Blue Collar people are more likely to be in a crash") + ylab("Frequency") + xlab("Job")+ # titles and labels
theme(legend.position = "none")
# transform raw data into a format that ggplot accepts for graphing
df_train_clean %>%
dplyr::select(EDUCATION, TARGET_FLAG) %>% # select the 2 columns/variables
count(EDUCATION, TARGET_FLAG) %>% # create a table of count of these 2 variables
dplyr::filter(TARGET_FLAG == 1) %>% # filter onto rows where cars crashed
ggplot(aes(x = EDUCATION, # X axis is education level
y = n, # Y axis is count of car crashes
fill = EDUCATION) # different colors
) +
geom_bar(stat = "identity") + # height of the bar proportional to the number of cases in each group
geom_text(aes(label=n), vjust=1.6, color="white", size=3.5) +
ggtitle("High School students to be more likely in a crash") + ylab("Frequency") + xlab("Education")+ # titles and labels
theme(legend.position = "none")
# transform raw data into a format that ggplot accepts for graphing
df_train_clean %>%
dplyr::select(MSTATUS, TARGET_FLAG) %>% # select the 2 columns/variables
count(MSTATUS, TARGET_FLAG) %>% # create a table of count of these 2 variables
dplyr::filter(TARGET_FLAG == 1) %>% # filter onto rows where cars crashed
ggplot(aes(x = MSTATUS, # X axis is marital status
y = n, # Y axis is count of car crashes
fill = MSTATUS) # different colors
) +
geom_bar(stat = "identity") + # height of the bar proportional to the number of cases in each group
geom_text(aes(label=n), vjust=1.6, color="white", size=3.5) +
ggtitle("No difference between married and non-married individuals in terms of car carshes") + ylab("Frequency") + xlab("Marital Status") + scale_fill_manual(values=c("deepskyblue", "deeppink")) + theme(legend.position = "none")
df_train_clean %>%
# rename(SINGLE_PARENT = PARENT1) %>%
dplyr::select(PARENT1, TARGET_FLAG) %>% # select the 2 columns/variables
count(PARENT1, TARGET_FLAG) %>% # create a table of count of these 2 variables
dplyr::filter(TARGET_FLAG == 1) %>% # filter onto rows where cars crashed
ggplot(aes(x = PARENT1,
y = n, # Y axis is count of car crashes
fill = PARENT1) # different colors
) +
geom_bar(stat = "identity") + # height of the bar proportional to the number of cases in each group
geom_text(aes(label=n), vjust=1.6, color="white", size=3.5) +
ggtitle("Single Parent much less likely to be in a crash") + ylab("Frequency") + xlab("Single Parent") + scale_fill_manual(values=c("gray", "black")) + theme(legend.position = "none")
# transform raw data into a format that ggplot accepts for graphing
df_train_clean %>%
dplyr::select(CAR_USE, TARGET_FLAG) %>% # select the 2 columns/variables
count(CAR_USE, TARGET_FLAG) %>% # create a table of count of these 2 variables
dplyr::filter(TARGET_FLAG == 1) %>% # filter onto rows where cars crashed
ggplot(aes(x = CAR_USE,
y = n, # Y axis is count of car crashes
fill = CAR_USE) # different colors
) +
geom_bar(stat = "identity") + # height of the bar proportional to the number of cases in each group
geom_text(aes(label=n), vjust=1.6, color="black", size=3.5) +
ggtitle("Private cars to be more likely in a crash") + ylab("Frequency") + xlab("Car Type")+ # titles and labels
theme(legend.position = "none") + scale_fill_manual(values=c("yellow", "violet"))
Maybe one can see a red car more easily and thus steer away from it more easily.
# transform raw data into a format that ggplot accepts for graphing
df_train_clean %>%
dplyr::select(RED_CAR, TARGET_FLAG) %>% # select the 2 columns/variables
count(RED_CAR, TARGET_FLAG) %>% # create a table of count of these 2 variables
dplyr::filter(TARGET_FLAG == 1) %>% # filter onto rows where cars crashed
ggplot(aes(x = RED_CAR,
y = n, # Y axis is count of car crashes
fill = RED_CAR) # different colors
) +
geom_bar(stat = "identity") + # height of the bar proportional to the number of cases in each group
geom_text(aes(label=n), vjust=1.6, color="black", size=3.5) +
ggtitle("Red cars to less likely in a crash ") + ylab("Frequency") + xlab("Car Type")+ # titles and labels
theme(legend.position = "none") + scale_fill_manual(values=c("chartreuse3", "blue1"))
If your driver's license was revoked in the last 7 years, you could not drive for some of that period and thus are less likely to have been in a car accident.
# transform raw data into a format that ggplot accepts for graphing
df_train_clean %>%
dplyr::select(REVOKED, TARGET_FLAG) %>% # select the 2 columns/variables
count(REVOKED, TARGET_FLAG) %>% # create a table of count of these 2 variables
dplyr::filter(TARGET_FLAG == 1) %>% # filter onto rows where cars crashed
ggplot(aes(x = REVOKED,
y = n, # Y axis is count of car crashes
fill = REVOKED) # different colors
) +
geom_bar(stat = "identity") + # height of the bar proportional to the number of cases in each group
geom_text(aes(label=n), vjust=1.6, color="white", size=3.5) +
ggtitle("Revoked Licensees are less likely to be in a crash") + ylab("Frequency") + xlab("License revoked in last 7 years")+ # titles and labels
theme(legend.position = "none") + scale_fill_manual(values=c("darkorange", "deeppink4"))
Makes sense, as car density is likely to be lower in rural areas.
# transform raw data into a format that ggplot accepts for graphing
df_train_clean %>%
dplyr::select(URBANICITY, TARGET_FLAG) %>% # select the 2 columns/variables
count(URBANICITY, TARGET_FLAG) %>% # create a table of count of these 2 variables
dplyr::filter(TARGET_FLAG == 1) %>% # filter onto rows where cars crashed
ggplot(aes(x = URBANICITY,
y = n, # Y axis is count of car crashes
fill = URBANICITY) # different colors
) +
geom_bar(stat = "identity") + # height of the bar proportional to the number of cases in each group
geom_text(aes(label=n), vjust=1.6, color="white", size=3.5) +
ggtitle("Urban home/work area are much more likely to see a car crash") + ylab("Frequency") + xlab("")+ # titles and labels
theme(legend.position = "none") + scale_fill_manual(values=c("cornflowerblue", "brown4"))
Keep all the variables.
Note we have 10 factor variables here (you can keep them as character variables) and 15 quantitative (numeric/integer) variables.
The log model (of crash damage) is a much better fit than the non-log model of crash damage.
lm_model_1a <-lm(data = df_train_clean,
formula = TARGET_FLAG ~ .- TARGET_AMT)
lm_model_1b <-lm(data = df_train_clean,
formula = TARGET_AMT ~ .- TARGET_FLAG)
lm_model_1b_log <-lm(data = df_train_clean,
formula = log(TARGET_AMT + .001) ~ .- TARGET_FLAG)
vif(lm_model_1b_log)
## GVIF Df GVIF^(1/(2*Df))
## KIDSDRIV 1.313573 1 1.146112
## AGE 1.492538 1 1.221695
## HOMEKIDS 2.111876 1 1.453230
## YOJ 1.467483 1 1.211397
## INCOME 3.151934 1 1.775369
## PARENT1 1.856506 1 1.362536
## HOME_VAL 2.455181 1 1.566902
## MSTATUS 2.111076 1 1.452954
## SEX 3.223269 1 1.795347
## EDUCATION 11.267988 4 1.353570
## JOB 22.068377 7 1.247336
## TRAVTIME 1.036217 1 1.017947
## CAR_USE 2.354052 1 1.534292
## BLUEBOOK 1.873002 1 1.368577
## TIF 1.008645 1 1.004313
## CAR_TYPE 4.506696 5 1.162481
## RED_CAR 1.850800 1 1.360441
## OLDCLAIM 1.702785 1 1.304908
## CLM_FREQ 1.606960 1 1.267659
## REVOKED 1.294901 1 1.137937
## MVR_PTS 1.236740 1 1.112088
## CAR_AGE 2.048116 1 1.431124
## URBANICITY 1.251308 1 1.118619
# Convert all character columns to factor
df_factor <- as.data.frame(unclass(x = df_train_clean),
stringsAsFactors = TRUE)
# lm_model_1a_fac <-lm(data = df_factor,
# formula = TARGET_FLAG ~ .- TARGET_AMT)
# lm_model_1b_fac <-lm(data = df_factor,
# formula = TARGET_AMT ~ .- TARGET_FLAG)
stargazer(lm_model_1a, lm_model_1b, # lm_model_1a_fac, lm_model_1b_fac,
lm_model_1b_log,
type='text',
column.labels=c("Crash Dummy","Crash Damage"#,"Crash Dummy","Crash Damage"
,"Crash Damage"
) #,covariate.labels=c()
)
##
## =================================================================================
## Dependent variable:
## -------------------------------------------------
## TARGET_FLAG TARGET_AMT log(TARGET_AMT + 0.001)
## Crash Dummy Crash Damage Crash Damage
## (1) (2) (3)
## ---------------------------------------------------------------------------------
## KIDSDRIV 0.048*** 206.551* 0.712***
## (0.011) (125.535) (0.168)
##
## AGE -0.001 -4.368 -0.010
## (0.001) (7.921) (0.011)
##
## HOMEKIDS 0.001 15.650 0.027
## (0.006) (72.430) (0.097)
##
## YOJ -0.002 -2.216 -0.027
## (0.001) (16.525) (0.022)
##
## INCOME -0.00000* -0.003 -0.00001*
## (0.00000) (0.002) (0.00000)
##
## PARENT1Yes 0.085*** 468.244** 1.291***
## (0.020) (224.397) (0.300)
##
## HOME_VAL -0.00000*** -0.001 -0.00000***
## (0.00000) (0.001) (0.00000)
##
## MSTATUSYes -0.059*** -541.918*** -0.905***
## (0.015) (167.105) (0.224)
##
## SEXMale 0.032* 470.387** 0.499*
## (0.018) (203.976) (0.273)
##
## EDUCATIONBachelors -0.061*** -361.819 -0.944***
## (0.020) (226.694) (0.304)
##
## EDUCATIONHigh School 0.0002 -221.025 -0.003
## (0.016) (186.983) (0.250)
##
## EDUCATIONMasters -0.057* -302.319 -0.862*
## (0.030) (339.650) (0.455)
##
## EDUCATIONPhD 0.007 439.095 0.153
## (0.038) (426.262) (0.571)
##
## JOBClerical 0.030 -157.256 0.436
## (0.018) (209.929) (0.281)
##
## JOBDoctor -0.118*** -1,347.928*** -1.841***
## (0.044) (498.049) (0.667)
##
## JOBHome Maker -0.013 -254.283 -0.221
## (0.026) (300.733) (0.403)
##
## JOBLawyer -0.022 -225.825 -0.343
## (0.030) (341.000) (0.457)
##
## JOBManager -0.140*** -1,115.487*** -2.147***
## (0.023) (256.217) (0.343)
##
## JOBProfessional -0.022 -27.588 -0.310
## (0.021) (233.689) (0.313)
##
## JOBStudent -0.013 -433.902 -0.241
## (0.023) (264.783) (0.355)
##
## TRAVTIME 0.002*** 10.872*** 0.033***
## (0.0003) (3.616) (0.005)
##
## CAR_USEPrivate -0.131*** -755.101*** -1.987***
## (0.016) (183.177) (0.245)
##
## BLUEBOOK -0.00000*** 0.013 -0.00004***
## (0.00000) (0.010) (0.00001)
##
## TIF -0.008*** -44.585*** -0.115***
## (0.001) (13.680) (0.018)
##
## CAR_TYPEPanel Truck 0.072** 476.472 1.119**
## (0.029) (328.799) (0.440)
##
## CAR_TYPEPickup 0.069*** 406.858** 1.066***
## (0.017) (188.044) (0.252)
##
## CAR_TYPESports Car 0.155*** 1,263.493*** 2.379***
## (0.021) (236.703) (0.317)
##
## CAR_TYPESUV 0.111*** 909.604*** 1.718***
## (0.017) (194.832) (0.261)
##
## CAR_TYPEVan 0.064*** 489.637** 0.986***
## (0.021) (241.866) (0.324)
##
## RED_CARyes -0.035** -169.227 -0.511**
## (0.015) (171.095) (0.229)
##
## OLDCLAIM -0.00000*** -0.005 -0.00003***
## (0.00000) (0.008) (0.00001)
##
## CLM_FREQ 0.034*** 79.674 0.494***
## (0.005) (62.040) (0.083)
##
## REVOKEDYes 0.144*** 497.806** 2.161***
## (0.017) (195.419) (0.262)
##
## MVR_PTS 0.022*** 172.716*** 0.339***
## (0.003) (29.095) (0.039)
##
## CAR_AGE -0.001 -25.433* -0.010
## (0.001) (14.474) (0.019)
##
## URBANICITYHighly Urban/ Urban 0.294*** 1,640.830*** 4.472***
## (0.014) (153.696) (0.206)
##
## Constant 0.148*** 718.227 -4.695***
## (0.045) (516.388) (0.691)
##
## ---------------------------------------------------------------------------------
## Observations 6,044 6,044 6,044
## R2 0.237 0.077 0.237
## Adjusted R2 0.233 0.071 0.232
## Residual Std. Error (df = 6007) 0.387 4,388.962 5.877
## F Statistic (df = 36; 6007) 51.944*** 13.826*** 51.827***
## =================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Keep the relevant variables.
First, begin by removing the job variables, as they have the highest multicollinearity (JOB's GVIF was over 22); this fixes the issue at only a small cost in model fit.
Second, remove variables that are insignificant, both in magnitude and in potential theoretical effect. For instance, HOMEKIDS, YOJ, and CAR_AGE are all statistically insignificant in Model 1.
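As a compact alternative to retyping the formulas below, one can re-fit a restricted model with update(); a sketch (lm_model_2a_alt is a hypothetical name):
# refit the kitchen-sink model without the dropped variables
lm_model_2a_alt <- update(lm_model_1a, . ~ . - JOB - HOMEKIDS - YOJ - CAR_AGE)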
lm_model_2a <-lm(data = df_train_clean,
formula = TARGET_FLAG ~ .- TARGET_AMT - JOB - HOMEKIDS - YOJ - CAR_AGE)
lm_model_2b <-lm(data = df_train_clean,
formula = TARGET_AMT ~ .- TARGET_FLAG - JOB - HOMEKIDS - YOJ - CAR_AGE)
lm_model_2b_log <-lm(data = df_train_clean,
formula = log(TARGET_AMT + .001) ~ .- TARGET_FLAG - JOB - HOMEKIDS - YOJ - CAR_AGE)
vif(lm_model_2b_log)
## GVIF Df GVIF^(1/(2*Df))
## KIDSDRIV 1.075088 1 1.036864
## AGE 1.248689 1 1.117448
## INCOME 2.564212 1 1.601316
## PARENT1 1.533708 1 1.238430
## HOME_VAL 2.352907 1 1.533919
## MSTATUS 1.917462 1 1.384725
## SEX 3.159437 1 1.777480
## EDUCATION 1.915983 4 1.084673
## TRAVTIME 1.032804 1 1.016270
## CAR_USE 1.522138 1 1.233749
## BLUEBOOK 1.866322 1 1.366134
## TIF 1.005249 1 1.002621
## CAR_TYPE 4.043306 5 1.149936
## RED_CAR 1.848084 1 1.359443
## OLDCLAIM 1.699405 1 1.303612
## CLM_FREQ 1.605332 1 1.267017
## REVOKED 1.290697 1 1.136088
## MVR_PTS 1.229020 1 1.108612
## URBANICITY 1.199977 1 1.095435
# plot(lm_model_2b_log)
stargazer(lm_model_2a, lm_model_2b, # lm_model_1a_fac, lm_model_1b_fac,
lm_model_2b_log,
type='text',
column.labels=c("Crash Dummy","Crash Damage"#,"Crash Dummy","Crash Damage"
,"Crash Damage"
) #,covariate.labels=c()
)
##
## ================================================================================
## Dependent variable:
## ------------------------------------------------
## TARGET_FLAG TARGET_AMT log(TARGET_AMT + 0.001)
## Crash Dummy Crash Damage Crash Damage
## (1) (2) (3)
## --------------------------------------------------------------------------------
## KIDSDRIV 0.049*** 224.399** 0.733***
## (0.010) (113.837) (0.153)
##
## AGE -0.001* -6.417 -0.017*
## (0.001) (7.262) (0.010)
##
## INCOME -0.00000*** -0.004* -0.00001***
## (0.00000) (0.002) (0.00000)
##
## PARENT1Yes 0.086*** 485.810** 1.309***
## (0.018) (204.439) (0.275)
##
## HOME_VAL -0.00000** -0.001 -0.00000**
## (0.00000) (0.001) (0.00000)
##
## MSTATUSYes -0.061*** -569.813*** -0.933***
## (0.014) (159.633) (0.214)
##
## SEXMale 0.029 460.400** 0.457*
## (0.018) (202.423) (0.272)
##
## EDUCATIONBachelors -0.095*** -640.800*** -1.459***
## (0.017) (191.637) (0.257)
##
## EDUCATIONHigh School -0.013 -323.833* -0.198
## (0.016) (181.267) (0.243)
##
## EDUCATIONMasters -0.098*** -750.512*** -1.486***
## (0.020) (223.181) (0.300)
##
## EDUCATIONPhD -0.071*** -572.329* -1.072***
## (0.027) (304.843) (0.409)
##
## TRAVTIME 0.002*** 11.601*** 0.034***
## (0.0003) (3.619) (0.005)
##
## CAR_USEPrivate -0.144*** -921.134*** -2.182***
## (0.013) (147.643) (0.198)
##
## BLUEBOOK -0.00000*** 0.012 -0.00004***
## (0.00000) (0.010) (0.00001)
##
## TIF -0.008*** -44.180*** -0.115***
## (0.001) (13.689) (0.018)
##
## CAR_TYPEPanel Truck 0.057** 333.756 0.890**
## (0.028) (316.296) (0.425)
##
## CAR_TYPEPickup 0.063*** 331.721* 0.963***
## (0.016) (185.183) (0.249)
##
## CAR_TYPESports Car 0.151*** 1,239.269*** 2.329***
## (0.021) (236.861) (0.318)
##
## CAR_TYPESUV 0.110*** 898.172*** 1.700***
## (0.017) (194.917) (0.262)
##
## CAR_TYPEVan 0.063*** 466.371* 0.976***
## (0.021) (239.785) (0.322)
##
## RED_CARyes -0.037** -190.372 -0.541**
## (0.015) (171.373) (0.230)
##
## OLDCLAIM -0.00000*** -0.005 -0.00004***
## (0.00000) (0.008) (0.00001)
##
## CLM_FREQ 0.034*** 84.685 0.507***
## (0.005) (62.155) (0.083)
##
## REVOKEDYes 0.146*** 515.142*** 2.201***
## (0.017) (195.562) (0.263)
##
## MVR_PTS 0.023*** 180.110*** 0.355***
## (0.003) (29.072) (0.039)
##
## URBANICITYHighly Urban/ Urban 0.274*** 1,524.455*** 4.168***
## (0.013) (150.866) (0.203)
##
## Constant 0.188*** 752.485 -4.103***
## (0.041) (464.974) (0.624)
##
## --------------------------------------------------------------------------------
## Observations 6,044 6,044 6,044
## R2 0.228 0.071 0.228
## Adjusted R2 0.225 0.067 0.224
## Residual Std. Error (df = 6017) 0.389 4,399.319 5.908
## F Statistic (df = 26; 6017) 68.403*** 17.582*** 68.212***
## ================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
I will focus on Model 2a (column 1) and Model 2b log (column 3).
Model 2a (column 1):
This type of regression is also called a linear probability model (LPM), i.e., a regression model where the outcome variable is binary and one or more explanatory variables are used to predict the outcome. Explanatory variables can themselves be binary, or be continuous.
It is in level-level format.
If you have a bachelor's degree, then the car crash probability falls by 0.095 compared to those who have less than high school education (the baseline category), ceteris paribus. This is statistically significant at the 1% significance level, and the economic magnitude seems fair.
If you are living in an urban area, then the car crash probability increases by 0.274 compared to living in a rural area. This is not only significant at the 1% level but also large in economic magnitude.
Model 2b log (column 3):
It is in log-level format, as crash damages are positively skewed. Note that the \(100 \times \beta\) approximation to a percent change is only accurate for small coefficients; for large ones, use \((\exp(\beta) - 1) \times 100\%\).
If you have a bachelor's degree, then expected crash damage falls by \((1 - \exp(-1.459)) \times 100\% \approx 77\%\) (percent, not percentage points) compared to those who have less than high school education, ceteris paribus. This is statistically significant at the 1% significance level, and the economic magnitude seems large.
If you are living in an urban area, then expected crash damage increases by \((\exp(4.168) - 1) \times 100\%\), i.e., more than 6,000 percent, compared to living in a rural area. This is not only significant at the 1% level but also very large in economic magnitude.
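A quick sketch of the exact transformation, pulling coefficients from the fitted model (names follow R's default dummy coding, as in the stargazer output):
# exact percent change in expected damage: (exp(beta) - 1) * 100
b <- coef(lm_model_2b_log)
(exp(b["EDUCATIONBachelors"]) - 1) * 100 # roughly -77 percent
(exp(b["URBANICITYHighly Urban/ Urban"]) - 1) * 100 # several thousand percent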
As theory suggests, our regression provides evidence that -
Very young people tend to be risky drivers. One can include a quadratic term in age to test whether very old people are also risky (see the sketch after this list).
Commercial vehicles are more likely to crash, probably because they are driven more.
The more claims you filed in the past, the more likely you are to file in the future.
More educated people tend to drive more safely.
Home owners tend to drive more responsibly.
Rich people tend to get into fewer crashes.
Married people drive more safely.
If you get lots of traffic tickets (Motor Vehicle Record points), you tend to get into more crashes.
Red cars are, if anything, less risky here (the RED_CAR coefficient is negative), contrary to popular belief.
If your license was revoked in the past 7 years, you probably are a riskier driver.
Women have fewer crashes than men.
Long drives to work usually suggest greater risk.
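A sketch of the quadratic-age test mentioned in the first bullet (a hypothetical extension, not one of the models estimated above):
# a positive AGE^2 coefficient together with a negative AGE coefficient would indicate a U-shape
lm_age2 <- lm(TARGET_FLAG ~ AGE + I(AGE^2), data = df_train_clean)
summary(lm_age2)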
The raw data summary charts showed unconditional means. In the regressions, we control for many more factors / risk profiles, so we obtain conditional effects - these regression estimates are closer to causal estimates.
So the models look fine.
I will begin by modelling the binary dependent variable.
Model 1 is the kitchen sink approach, estimated by logit regression.
Model 2 is the kitchen sink approach restricted to the relevant variables, again by logit regression.
Model 3 uses the same relevant variables, estimated by probit regression.
Probit regression is straightforward to implement in R: probit analysis produces results similar to logistic regression, and the choice between probit and logit depends largely on individual preference.
logit_model_1a <-glm(data = df_train_clean,
formula = TARGET_FLAG ~ .- TARGET_AMT,
family = "binomial"
)
summary(logit_model_1a)
##
## Call:
## glm(formula = TARGET_FLAG ~ . - TARGET_AMT, family = "binomial",
## data = df_train_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.431e+00 3.206e-01 -7.583 3.38e-14 ***
## KIDSDRIV 3.196e-01 7.059e-02 4.527 5.99e-06 ***
## AGE -3.933e-03 4.681e-03 -0.840 0.400784
## HOMEKIDS 2.295e-02 4.275e-02 0.537 0.591322
## YOJ -9.407e-03 9.774e-03 -0.962 0.335863
## INCOME -3.384e-06 1.433e-06 -2.362 0.018185 *
## PARENT1Yes 4.135e-01 1.266e-01 3.266 0.001091 **
## HOME_VAL -1.433e-06 4.289e-07 -3.342 0.000833 ***
## MSTATUSYes -4.225e-01 1.007e-01 -4.195 2.73e-05 ***
## SEXMale 2.131e-01 1.288e-01 1.655 0.097986 .
## EDUCATIONBachelors -3.815e-01 1.325e-01 -2.878 0.003998 **
## EDUCATIONHigh School -2.509e-03 1.068e-01 -0.023 0.981260
## EDUCATIONMasters -4.487e-01 2.141e-01 -2.096 0.036081 *
## EDUCATIONPhD 8.444e-02 2.635e-01 0.320 0.748671
## JOBClerical 1.951e-01 1.205e-01 1.619 0.105370
## JOBDoctor -6.914e-01 3.288e-01 -2.103 0.035506 *
## JOBHome Maker -1.273e-01 1.784e-01 -0.714 0.475536
## JOBLawyer 4.411e-02 2.151e-01 0.205 0.837531
## JOBManager -8.811e-01 1.604e-01 -5.493 3.95e-08 ***
## JOBProfessional -9.442e-02 1.369e-01 -0.690 0.490347
## JOBStudent -1.678e-01 1.532e-01 -1.095 0.273334
## TRAVTIME 1.565e-02 2.192e-03 7.140 9.35e-13 ***
## CAR_USEPrivate -8.323e-01 1.061e-01 -7.848 4.24e-15 ***
## BLUEBOOK -2.250e-05 6.102e-06 -3.688 0.000226 ***
## TIF -5.222e-02 8.545e-03 -6.112 9.86e-10 ***
## CAR_TYPEPanel Truck 6.915e-01 1.949e-01 3.549 0.000387 ***
## CAR_TYPEPickup 5.505e-01 1.155e-01 4.767 1.87e-06 ***
## CAR_TYPESports Car 1.111e+00 1.467e-01 7.575 3.58e-14 ***
## CAR_TYPESUV 8.301e-01 1.258e-01 6.599 4.13e-11 ***
## CAR_TYPEVan 5.655e-01 1.497e-01 3.778 0.000158 ***
## RED_CARyes -2.273e-01 1.032e-01 -2.203 0.027600 *
## OLDCLAIM -1.326e-05 4.579e-06 -2.895 0.003787 **
## CLM_FREQ 1.997e-01 3.321e-02 6.014 1.81e-09 ***
## REVOKEDYes 8.496e-01 1.075e-01 7.903 2.72e-15 ***
## MVR_PTS 1.165e-01 1.588e-02 7.336 2.20e-13 ***
## CAR_AGE -3.386e-03 8.911e-03 -0.380 0.703990
## URBANICITYHighly Urban/ Urban 2.306e+00 1.244e-01 18.537 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6988.2 on 6043 degrees of freedom
## Residual deviance: 5360.5 on 6007 degrees of freedom
## AIC: 5434.5
##
## Number of Fisher Scoring iterations: 5
logit_model_2a <-glm(data = df_train_clean,
formula = TARGET_FLAG ~ .- TARGET_AMT - JOB - HOMEKIDS - YOJ - CAR_AGE,
family = "binomial"
)
#vif(logit_model_2a)
probit_model_2a <-glm(data = df_train_clean,
formula = TARGET_FLAG ~ .- TARGET_AMT - JOB - HOMEKIDS - YOJ - CAR_AGE,
family = binomial(link = "probit")
)
stargazer(logit_model_1a, logit_model_2a, probit_model_2a,
type='text',
column.labels=c("Crash Dummy","Crash Dummy","Crash Dummy")
)
##
## =================================================================
## Dependent variable:
## -----------------------------------
## TARGET_FLAG
## logistic probit
## Crash Dummy Crash Dummy Crash Dummy
## (1) (2) (3)
## -----------------------------------------------------------------
## KIDSDRIV 0.320*** 0.327*** 0.190***
## (0.071) (0.063) (0.037)
##
## AGE -0.004 -0.007* -0.004
## (0.005) (0.004) (0.002)
##
## HOMEKIDS 0.023
## (0.043)
##
## YOJ -0.009
## (0.010)
##
## INCOME -0.00000** -0.00000*** -0.00000***
## (0.00000) (0.00000) (0.00000)
##
## PARENT1Yes 0.414*** 0.431*** 0.252***
## (0.127) (0.113) (0.066)
##
## HOME_VAL -0.00000*** -0.00000*** -0.00000***
## (0.00000) (0.00000) (0.00000)
##
## MSTATUSYes -0.422*** -0.429*** -0.250***
## (0.101) (0.094) (0.054)
##
## SEXMale 0.213* 0.203 0.125*
## (0.129) (0.127) (0.072)
##
## EDUCATIONBachelors -0.381*** -0.573*** -0.329***
## (0.133) (0.112) (0.065)
##
## EDUCATIONHigh School -0.003 -0.083 -0.041
## (0.107) (0.103) (0.060)
##
## EDUCATIONMasters -0.449** -0.620*** -0.348***
## (0.214) (0.134) (0.077)
##
## EDUCATIONPhD 0.084 -0.398** -0.234**
## (0.264) (0.186) (0.106)
##
## JOBClerical 0.195
## (0.120)
##
## JOBDoctor -0.691**
## (0.329)
##
## JOBHome Maker -0.127
## (0.178)
##
## JOBLawyer 0.044
## (0.215)
##
## JOBManager -0.881***
## (0.160)
##
## JOBProfessional -0.094
## (0.137)
##
## JOBStudent -0.168
## (0.153)
##
## TRAVTIME 0.016*** 0.016*** 0.009***
## (0.002) (0.002) (0.001)
##
## CAR_USEPrivate -0.832*** -0.878*** -0.501***
## (0.106) (0.086) (0.050)
##
## BLUEBOOK -0.00002*** -0.00002*** -0.00001***
## (0.00001) (0.00001) (0.00000)
##
## TIF -0.052*** -0.051*** -0.030***
## (0.009) (0.008) (0.005)
##
## CAR_TYPEPanel Truck 0.691*** 0.579*** 0.328***
## (0.195) (0.186) (0.107)
##
## CAR_TYPEPickup 0.551*** 0.493*** 0.278***
## (0.115) (0.113) (0.064)
##
## CAR_TYPESports Car 1.111*** 1.072*** 0.620***
## (0.147) (0.145) (0.083)
##
## CAR_TYPESUV 0.830*** 0.814*** 0.467***
## (0.126) (0.125) (0.071)
##
## CAR_TYPEVan 0.565*** 0.539*** 0.302***
## (0.150) (0.148) (0.085)
##
## RED_CARyes -0.227** -0.234** -0.130**
## (0.103) (0.102) (0.059)
##
## OLDCLAIM -0.00001*** -0.00001*** -0.00001***
## (0.00000) (0.00000) (0.00000)
##
## CLM_FREQ 0.200*** 0.197*** 0.119***
## (0.033) (0.033) (0.019)
##
## REVOKEDYes 0.850*** 0.860*** 0.495***
## (0.108) (0.106) (0.062)
##
## MVR_PTS 0.116*** 0.122*** 0.071***
## (0.016) (0.016) (0.009)
##
## CAR_AGE -0.003
## (0.009)
##
## URBANICITYHighly Urban/ Urban 2.306*** 2.248*** 1.221***
## (0.124) (0.124) (0.065)
##
## Constant -2.431*** -2.223*** -1.263***
## (0.321) (0.287) (0.163)
##
## -----------------------------------------------------------------
## Observations 6,044 6,044 6,044
## Log Likelihood -2,680.261 -2,711.779 -2,717.147
## Akaike Inf. Crit. 5,434.522 5,477.557 5,488.294
## =================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
I will focus on Logit Model 2a (column 2) as it is my preferred model.
Find your preferred interpretations. For instance, the AGE coefficient (-0.007) is statistically significant at the 10% significance level, but the economic magnitude seems small.
If you have a bachelor's degree, compared to those who have less than high school education (the baseline category), the log odds of a car crash fall by 0.573, ceteris paribus. This is statistically significant at the 1% significance level, and the economic magnitude seems large: \((1 - \exp(-0.573)) \times 100\%\), or about a 44% (percent, not percentage points) decrease in the odds.
If you are living in an urban area, compared to living in a rural area (the baseline category), the log odds of a car crash increase by 2.248, ceteris paribus. This is not only significant at the 1% level but also very large in economic magnitude: \((\exp(2.248) - 1) \times 100\%\), or about an 847% (percent, not percentage points) increase in the odds.
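A sketch of these odds calculations pulled straight from the fitted model (coefficient names follow R's dummy coding):
# exp() turns log-odds coefficients into odds multipliers
odds <- exp(coef(logit_model_2a))
round(odds["EDUCATIONBachelors"], 3) # about 0.56, i.e. roughly a 44% drop in the odds
round(odds["URBANICITYHighly Urban/ Urban"], 3) # about 9.5, i.e. roughly an 847% rise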
# use the best model to fit the original data / predict in the new==original data
logit_probabilities <- predict(object = logit_model_2a,
newdata = df_train_clean,
type = "response"
)
# between 0 and 1 as expected
psych::describe(logit_probabilities)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 6044 0.26 0.22 0.2 0.24 0.21 0 0.96 0.95 0.92 -0.05 0
# convert the logit probabilities to a binary variable
logit_probability_2_binary_variable <- ifelse(test = logit_probabilities > 0.5,
yes = 1,
no = 0
)
psych::describe(logit_probability_2_binary_variable)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 6044 0.17 0.38 0 0.09 0 0 1 1 1.76 1.11 0
df_train_clean$PREDICTED_FLAG <- logit_probability_2_binary_variable
table("Actual" = df_train_clean$TARGET_FLAG,
"Predictions" = df_train_clean$PREDICTED_FLAG
)
## Predictions
## Actual 0 1
## 0 4101 342
## 1 920 681
You can construct the measures by hand, or use the caret package. Be sure to specify what counts as positive (the event of interest) in your data (a crash here).
I get about 80% accuracy, which is not great, but at least better than random guessing (a 50% chance of a right guess). Accuracy is just the sum of the diagonal terms in the confusion matrix divided by total observations. By hand, \(\dfrac{4101+681}{4101+681+920+342} \approx 79.12\%\).
Classification Error Rate is simply the complement of accuracy (1 - accuracy). It is the sum of the off-diagonal terms in the confusion matrix divided by total observations. By hand, \(\dfrac{920+342}{920+342+4101+681} \approx 20.88\%\).
Sensitivity / True Positive Rate / probability of detection: the model detects true positives (who will get into a crash) correctly only \(\dfrac{681}{681+920} \approx 42\%\) of the time. If you are a car insurance company, you can run into serious losses by selling more policies to these risky people.
Specificity / True Negative Rate: the model detects true negatives (who will NOT get into a crash) correctly most of the time, \(\dfrac{4101}{4101+342} \approx 92\%\). If you are a car insurance company, you can create profits by selling policies to these less risky people.
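Before calling caret, here is a by-hand sketch of the same measures computed from the 2x2 table (rows are actual outcomes, columns are predictions):
tab <- table(df_train_clean$TARGET_FLAG, df_train_clean$PREDICTED_FLAG)
accuracy    <- sum(diag(tab)) / sum(tab)       # about 0.7912
cer         <- 1 - accuracy                    # about 0.2088
sensitivity <- tab["1", "1"] / sum(tab["1", ]) # about 0.4254
specificity <- tab["0", "0"] / sum(tab["0", ]) # about 0.9230
c(accuracy = accuracy, CER = cer, sensitivity = sensitivity, specificity = specificity)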
df_train_clean$fac_TARGET_FLAG <- as_factor(df_train_clean$TARGET_FLAG)
df_train_clean$fac_PREDICTED_FLAG <- as_factor(df_train_clean$PREDICTED_FLAG)
cm <- caret::confusionMatrix(data = df_train_clean$fac_PREDICTED_FLAG, # predicted
reference = df_train_clean$fac_TARGET_FLAG, # truth
positive = "1" # an optional character string for the factor level that corresponds to a "positive" result. If there are only two factor levels, the first level will be used as the "positive" result
)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4101 920
## 1 342 681
##
## Accuracy : 0.7912
## 95% CI : (0.7807, 0.8014)
## No Information Rate : 0.7351
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3939
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4254
## Specificity : 0.9230
## Pos Pred Value : 0.6657
## Neg Pred Value : 0.8168
## Prevalence : 0.2649
## Detection Rate : 0.1127
## Detection Prevalence : 0.1693
## Balanced Accuracy : 0.6742
##
## 'Positive' Class : 1
##
# define Classification Error Rate (CER) function; takes a caret::confusionMatrix object
cm_cer <- function(cm){
  tab <- cm$table   # rows = predictions, columns = actual values
  TP <- tab["1", "1"]
  TN <- tab["0", "0"]
  FP <- tab["1", "0"]
  FN <- tab["0", "1"]
  return((FP + FN)/(TP + FP + TN + FN))
}
# apply Classification Error Rate (CER) function on confusion matrix
cm_cer(cm)
## [1] 0.2088021
# CER is 1-Accuracy
1 - 0.7912
## [1] 0.2088
On reporting session information, see https://www.rostrum.blog/2018/10/13/sessioninfo/
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] skimr_2.2.1 cowplot_1.2.0 patchwork_1.3.2 visdat_0.6.0
## [5] caret_7.0-1 lattice_0.22-7 MASS_7.3-65 car_3.1-3
## [9] carData_3.0-5 data.table_1.17.8 corrplot_0.95 lubridate_1.9.4
## [13] forcats_1.0.0 stringr_1.5.2 dplyr_1.1.4 purrr_1.1.0
## [17] readr_2.1.5 tidyr_1.3.1 tibble_3.3.0 ggplot2_3.5.2
## [21] tidyverse_2.0.0 stargazer_5.2.3 psych_2.5.6
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4041.110 farver_2.1.2
## [4] fastmap_1.2.0 pROC_1.19.0.1 digest_0.6.37
## [7] rpart_4.1.24 timechange_0.3.0 lifecycle_1.0.4
## [10] survival_3.8-3 magrittr_2.0.3 compiler_4.4.2
## [13] rlang_1.1.6 sass_0.4.10 tools_4.4.2
## [16] yaml_2.3.10 knitr_1.50 labeling_0.4.3
## [19] mnormt_2.1.1 repr_1.1.7 plyr_1.8.9
## [22] RColorBrewer_1.1-3 abind_1.4-8 withr_3.0.2
## [25] nnet_7.3-20 grid_4.4.2 stats4_4.4.2
## [28] e1071_1.7-16 future_1.67.0 globals_0.18.0
## [31] scales_1.4.0 iterators_1.0.14 cli_3.6.5
## [34] crayon_1.5.3 rmarkdown_2.29 generics_0.1.4
## [37] rstudioapi_0.17.1 future.apply_1.20.0 reshape2_1.4.4
## [40] tzdb_0.5.0 proxy_0.4-27 cachem_1.1.0
## [43] splines_4.4.2 parallel_4.4.2 base64enc_0.1-3
## [46] vctrs_0.6.5 hardhat_1.4.2 Matrix_1.7-4
## [49] jsonlite_2.0.0 hms_1.1.3 Formula_1.2-5
## [52] listenv_0.9.1 foreach_1.5.2 gower_1.0.2
## [55] jquerylib_0.1.4 recipes_1.3.1 glue_1.8.0
## [58] parallelly_1.45.1 codetools_0.2-20 stringi_1.8.7
## [61] gtable_0.3.6 pillar_1.11.0 htmltools_0.5.8.1
## [64] ipred_0.9-15 lava_1.8.1 R6_2.6.1
## [67] conflicted_1.2.0 evaluate_1.0.5 memoise_2.0.1
## [70] bslib_0.9.0 class_7.3-23 Rcpp_1.1.0
## [73] nlme_3.1-168 prodlim_2025.04.28 xfun_0.53
## [76] pkgconfig_2.0.3 ModelMetrics_1.2.2.2
# devtools::session_info()
Controlling output - https://sebastiansauer.github.io/figure_sizing_knitr/