1 Setup

1.1 Empty variables and functions in the environment tab/window

First, empty the environment so that we can uplaod the clean data.

# Clear the workspace
  rm(list = ls()) # Clear environment-remove all files from your workspace
  gc()            # Clear unused memory
##           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells  619491 33.1    1417732 75.8         NA   700248 37.4
## Vcells 1181940  9.1    8388608 64.0      32768  1963344 15.0
  cat("\f")       # Clear the console
  graphics.off()  # Clear all graphs

# Set working directory and path to data
# setwd("/Users/arvindsharma/Dropbox/WCAS/Econometrics/")

1.2 Load packages

Now, we load the packages.

# List needed packages

packages <- c("psych",       # quick summary stats for data exploration,
              "stargazer",   # summary stats for sharing,
              "tidyverse",   # data manipulation like selecting variables,
              "corrplot",    # correlation plots
              "ggplot2",     # graphing
              "data.table",  # reshape for graphing 
              "car",         # vif
              "MASS",        # negative binomial regression
              "caret",       # for confusion matrix
              "visdat",      # finding missing values
              "patchwork",   # placing ggplot2 graphs together
              "cowplot",     # placing ggplot2 graphs together
              "skimr"
             )

# Install packages if not already installed, then load the packages
for (i in 1:length(packages)) {
  
      # Install packages if they have not been installed on your machine 
    if (!packages[i] %in% rownames(x = installed.packages())) {
      install.packages(package      =  packages[i], 
                       repos        =  "http://cran.rstudio.com/", 
                       dependencies =  TRUE
                      )
    }
  
      # Load all packages  
    library(package        = packages[i], 
            character.only = TRUE
            )
  
}

# Makes sure which command will be used in in two different packages 

conflicted::conflict_prefer(name = "select", 
                            winner = "dplyr")

rm(packages)

2 Data Exploration

2.1 Load raw data

One can declare blank observations in character variables when importing the raw data as NA. This can simplify the data cleaning process a little - blank values will be imported as NA and not as blank spaces.

Read ?read.csv na.strings options.

- a character vector of strings which are to be interpreted as NA values. Blank fields are also considered to be missing values in logical, integer, numeric and complex fields. Note that the test happens after white space is stripped from the input, so na.strings values may need their own white space stripped in advance.

#setwd("/Users/arvindsharma/Dropbox/WCAS/Econometrics/")

?read.csv 

df_train    <- read.csv("insurance-training-data.csv"  ,  na.strings=c("")) # treat "" as NA in character variables
df_eval     <- read.csv("insurance-evaluation-data.csv",  na.strings=c(""))

sapply(X = df_train,
       FUN =  class
       )
##       INDEX TARGET_FLAG  TARGET_AMT    KIDSDRIV         AGE    HOMEKIDS 
##   "integer"   "integer"   "numeric"   "integer"   "integer"   "integer" 
##         YOJ      INCOME     PARENT1    HOME_VAL     MSTATUS         SEX 
##   "integer" "character" "character" "character" "character" "character" 
##   EDUCATION         JOB    TRAVTIME     CAR_USE    BLUEBOOK         TIF 
## "character" "character"   "integer" "character" "character"   "integer" 
##    CAR_TYPE     RED_CAR    OLDCLAIM    CLM_FREQ     REVOKED     MVR_PTS 
## "character" "character" "character"   "integer" "character"   "integer" 
##     CAR_AGE  URBANICITY 
##   "integer" "character"
skim(df_train)
Data summary
Name df_train
Number of rows 8161
Number of columns 26
_______________________
Column type frequency:
character 14
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
INCOME 445 0.95 2 8 0 6612 0
PARENT1 0 1.00 2 3 0 2 0
HOME_VAL 464 0.94 2 8 0 5106 0
MSTATUS 0 1.00 3 4 0 2 0
SEX 0 1.00 1 3 0 2 0
EDUCATION 0 1.00 3 13 0 5 0
JOB 526 0.94 6 13 0 8 0
CAR_USE 0 1.00 7 10 0 2 0
BLUEBOOK 0 1.00 6 7 0 2789 0
CAR_TYPE 0 1.00 3 11 0 6 0
RED_CAR 0 1.00 2 3 0 2 0
OLDCLAIM 0 1.00 2 7 0 2857 0
REVOKED 0 1.00 2 3 0 2 0
URBANICITY 0 1.00 19 21 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
INDEX 0 1.00 5151.87 2978.89 1 2559 5133 7745 10302.0 ▇▇▇▇▇
TARGET_FLAG 0 1.00 0.26 0.44 0 0 0 1 1.0 ▇▁▁▁▃
TARGET_AMT 0 1.00 1504.32 4704.03 0 0 0 1036 107586.1 ▇▁▁▁▁
KIDSDRIV 0 1.00 0.17 0.51 0 0 0 0 4.0 ▇▁▁▁▁
AGE 6 1.00 44.79 8.63 16 39 45 51 81.0 ▁▆▇▂▁
HOMEKIDS 0 1.00 0.72 1.12 0 0 0 1 5.0 ▇▂▁▁▁
YOJ 454 0.94 10.50 4.09 0 9 11 13 23.0 ▂▃▇▃▁
TRAVTIME 0 1.00 33.49 15.91 5 22 33 44 142.0 ▇▇▁▁▁
TIF 0 1.00 5.35 4.15 1 1 4 7 25.0 ▇▆▁▁▁
CLM_FREQ 0 1.00 0.80 1.16 0 0 0 2 5.0 ▇▂▁▁▁
MVR_PTS 0 1.00 1.70 2.15 0 0 1 3 13.0 ▇▂▁▁▁
CAR_AGE 510 0.94 8.33 5.70 -3 1 8 12 28.0 ▆▇▇▃▁

2.2 Missing Values

5 quantitative variables have missing NA values.

  • CAR_AGE (510) i.e. 6.25% observations.

  • YOJ (454) i.e. 5.56% observations. {years on job}.

  • AGE (6)

The first three were easily identified as R recognized them as numeric variables and automatically imputes NAs when there were blank observations. However, the last variables have $ in front of them and thus R thinks of them as characters variables. Here it will treat “missing values” as a category, and you will have to fix it. You can remove the dollar signs with gsub command and convert the character variables into numeric variables. I show that later in the file.

  • INCOME (445) i.e 5.45% observations.

  • HOME VALUE (464) i.e 5.69% observations.

1 qualitative variable/character vector/string variable has missing NA values too.

  • JOBS (526) i.e. 6.45% observations.
### NOT REQUIRED AS SPECIFIED DURING IMPORT COMMAND IN na.staring("") option
# df_train<-df_train %>% 
#   mutate(INCOME = ifelse(INCOME == "", "NA", INCOME))


?visdat
visdat::vis_dat(df_train) 

visdat::vis_miss(df_train) + theme(axis.text.x = element_text(angle = 90))

# Sum NAs across columns

dplyr::summarise(.data = df_train, 
                 across(.cols = everything(),
                        .fns  = ~ sum(is.na(.x))
                       )
                ) %>%
    glimpse() 
## Rows: 1
## Columns: 26
## $ INDEX       <int> 0
## $ TARGET_FLAG <int> 0
## $ TARGET_AMT  <int> 0
## $ KIDSDRIV    <int> 0
## $ AGE         <int> 6
## $ HOMEKIDS    <int> 0
## $ YOJ         <int> 454
## $ INCOME      <int> 445
## $ PARENT1     <int> 0
## $ HOME_VAL    <int> 464
## $ MSTATUS     <int> 0
## $ SEX         <int> 0
## $ EDUCATION   <int> 0
## $ JOB         <int> 526
## $ TRAVTIME    <int> 0
## $ CAR_USE     <int> 0
## $ BLUEBOOK    <int> 0
## $ TIF         <int> 0
## $ CAR_TYPE    <int> 0
## $ RED_CAR     <int> 0
## $ OLDCLAIM    <int> 0
## $ CLM_FREQ    <int> 0
## $ REVOKED     <int> 0
## $ MVR_PTS     <int> 0
## $ CAR_AGE     <int> 510
## $ URBANICITY  <int> 0
# Display NAs as percentage of observations 
round(x = sapply(X = df_train,
                 FUN = function(df){100*sum(is.na(df==TRUE)/length(df))}),
      digits = 2)
##       INDEX TARGET_FLAG  TARGET_AMT    KIDSDRIV         AGE    HOMEKIDS 
##        0.00        0.00        0.00        0.00        0.07        0.00 
##         YOJ      INCOME     PARENT1    HOME_VAL     MSTATUS         SEX 
##        5.56        5.45        0.00        5.69        0.00        0.00 
##   EDUCATION         JOB    TRAVTIME     CAR_USE    BLUEBOOK         TIF 
##        0.00        6.45        0.00        0.00        0.00        0.00 
##    CAR_TYPE     RED_CAR    OLDCLAIM    CLM_FREQ     REVOKED     MVR_PTS 
##        0.00        0.00        0.00        0.00        0.00        0.00 
##     CAR_AGE  URBANICITY 
##        6.25        0.00

2.3 Clean Training Dataset

First, I will convert quantitative variables that got read in as qualitative by removing the $ and , signs. I will begin by creating a clean data set and fix these variables.

  df_train_clean <- df_train

  df_train_clean$INCOME <- gsub(pattern = "\\$|\\,",     # have to remove both $ and , - one has to put \\ before these signs, and | is the or condition
                                replacement =  "",        # replace with nothing / empty space
                                x = df_train_clean$INCOME # apply on INCOME vector in df_train_clean  
                                )

  df_train_clean$INCOME <- as.numeric(df_train_clean$INCOME) # replace character as numeric 
  

  
    
# now, apply on all dollar variables - BRUTE FORCE...
  
df_train_clean$HOME_VAL <- gsub(pattern = "\\$|\\,", replacement = "",x = df_train_clean$HOME_VAL)

df_train_clean$BLUEBOOK <- gsub(pattern = "\\$|\\,", replacement = "",x = df_train_clean$BLUEBOOK)

df_train_clean$OLDCLAIM <- gsub(pattern = "\\$|\\,", replacement = "",x = df_train_clean$OLDCLAIM)

# finally, apply numeric conversion on these dollar variables - BRUTE FORCE...

df_train_clean$HOME_VAL <- as.numeric(df_train_clean$HOME_VAL) 
df_train_clean$BLUEBOOK <- as.numeric(df_train_clean$BLUEBOOK) 
df_train_clean$OLDCLAIM <- as.numeric(df_train_clean$OLDCLAIM) 
  • I am keeping the observations that have no missing co-variates (filtering on rows, not selecting on columns). I prefer not to impute values as means/medians. One can do a robustness check of the final model on datasets with imputed values if they create one. So I have 6447 observations or about 80% of my original data. I could probably extract some more info from these observations as some rows may only have non-important variables missing.

  • I drop one observation where car age is -3 (typo presumably), but you could have replaced it with 0 too. Many ways to delete rows - I prefer using conditions in the subset command. I found this discrepencany when I created my summary statistics table, and chose to delete this row entirely instead of imputing values.

# DROPING ROWS WITH MSSING VALUES of the 3 variables and creating a new dataframe  
df_train_clean <- dplyr::filter(.data = df_train_clean, 
                                !is.na(CAR_AGE)   &   !is.na(YOJ)   &   !is.na(AGE)    &    !is.na(INCOME)   &   !is.na(HOME_VAL)    &    !is.na(JOB)
                                )

# DROPING VARIABLES BY SELECTING VARIABLE NAMES for applying stargazer package on dataframe 
df_train_clean <- dplyr::select(.data = df_train_clean, 
                                -c(INDEX)
                                )

# KEEPING ROWS where Car Age variable is non-negtive
table(df_train_clean$CAR_AGE<0)  # only one observation / typo
## 
## FALSE  TRUE 
##  6044     1
df_train_clean <- subset(df_train_clean, df_train_clean$CAR_AGE>=0 )

2.4 Data Insights

I will provide the summary statistics of numerical variables separately from non-numerical variables. One can also convert these non-numerical variables like factor variables into numbers and include them in the same table, but measures like mean and sd do not make sense for them. So I would rather display them in charts.

2.4.1 Quantitative Variables: Summary Statistics

We have 15 quantitative variables. I use stargazer package to create basic summary statistics table.

2.4.1.1 Basic Insights

About 26% people crash their cars, and the average cost of car repairs seems to be roughly $ 1,500. The positive skewness in car crash cost variable suggests that most car damages are small, but we have some big/major crashes. Mean driver age is about 45 years, and the average distance to work is 33 (miles?). The data dictionary is not very clear on the units on this variable definition. An average car has been in use for 7.9 years and has a book value of about $15,000.

## LABEL VARIABLES 
labels_numeric <-c(
            "Car Crash Dummy",
            "Car Crash Cost",
            "# Driving Children",
            "Driver's Age",
            "# Children at Home",
            "Years on Job",
            "Income",
            "Home Value",
              "Distance to Work",
            "Vehicle Value",
            "Time in Force",
            "Claims Payout",       # (Past 5 Years)
            "Claims Frequency",    # (Past 5 Years)
            "MVR Points",          # Motor Vehicle Record Points
            "Car Age"
  )

stargazer(df_train_clean, 
          type = "text", # html, latex
          # out =
          # summary.stat = 
          covariate.labels = labels_numeric, # NULL
          digits = 2)
## 
## ==============================================================
## Statistic            N      Mean     St. Dev.   Min     Max   
## --------------------------------------------------------------
## Car Crash Dummy    6,044    0.26       0.44      0       1    
## Car Crash Cost     6,044  1,479.66   4,553.55  0.00  85,523.65
## # Driving Children 6,044    0.17       0.52      0       4    
## Driver's Age       6,044   44.63       8.71     16      81    
## # Children at Home 6,044    0.74       1.13      0       5    
## Years on Job       6,044   10.49       4.14      0      23    
## Income             6,044 58,178.58  43,830.43    0    367,030 
## Home Value         6,044 150,091.80 123,736.40   0    885,282 
## Distance to Work   6,044   33.69      15.89      5      142   
## Vehicle Value      6,044 15,235.58   8,041.63  1,500  65,970  
## Time in Force      6,044    5.36       4.14      1      25    
## Claims Payout      6,044  3,999.99   8,815.06    0    57,037  
## Claims Frequency   6,044    0.78       1.15      0       5    
## MVR Points         6,044    1.70       2.16      0      13    
## Car Age            6,044    7.92       5.58      0      28    
## --------------------------------------------------------------

2.4.2 Qualitative Variables

We have 10 qualitative variables. Some interesting trends in the raw data are -

2.4.2.1 Women more likely to be in a crash

2.4.2.1.1 Creating a ggplot2 chart
2.4.2.1.1.1 Bare Bones GGPLOT2 chart

Try to create a bare bones graph first, and then embellish it.

table(df_train_clean$SEX)
## 
##    M  z_F 
## 2686 3358
# clean values for graph/tables... 
df_train_clean$SEX[df_train_clean$SEX == "z_F"] <- "Female"
df_train_clean$SEX[df_train_clean$SEX == "M"  ] <- "Male"


?subset       # Return subsets of vectors, matrices or data frames which meet conditions.
## Help on topic 'subset' was found in the following packages:
## 
##   Package               Library
##   timeDate              /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
##   base                  /Library/Frameworks/R.framework/Resources/library
##   data.table            /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
## 
## 
## Using the first match ...
?geom_bar     # There are two types of bar charts: geom_bar() and geom_col(). geom_bar() uses stat_count() by default: it counts the number of cases at each x position


# Create a bar chart with the count of people in car crash and store the output in crashes_by_gender
crashes_by_gender_0 <-
ggplot(data = subset(x      = df_train_clean, 
                     select = TARGET_FLAG == 1    # focus on people who got into a crash
                     ), 
       mapping = aes(x    = SEX,                  # variable on x axis
                     fill = SEX                   # Colour related aesthetics
                     )
       ) +
  geom_bar() 

# par(mfrow = c(2,2))
# par(mfrow = c(1,1))

crashes_by_gender_0 # print the output 

Once you have the bare bones graph, embellish it/make it more presentable. One can add ggplot options one by one.

Please read up on ggplot options by typing ?theme in console and reading the help file to see how to control chart elements.

Of course, in your own final report, just create one chart and not multiple charts like here (for pedagogical purposes only).

Fix the labels on axis, add graph/plot title and notes, and store the chart in an object crashes by gender so that you can easily recall it in the future.

2.4.2.1.1.2 Adding labels to the Bare Bones GGPLOT2 chart
# Fix the labels - axis, add plot title and notes, and store the chart in an object so that you can easily recall it in the future.
crashes_by_gender_1 <- 
crashes_by_gender_0  +       labs( x = "Sex", 
                                   y = "Frequency", 
                               title = "Women are more likely to be in a crash",
                             caption = "Note: Data Source is XX."
                              )   

crashes_by_gender_1 # print your chart - I would definitely want to move the legend to the left

### Also, MULTIPLE WAYS TO ADD CHART TITLE, axis labels,...
# ggtitle("Women more likely to be in a crash") + ylab("Frequency") + xlab("")  # titles and labels
2.4.2.1.1.3 Adding captions to the Bare Bones GGPLOT2 chart

I now want to shift the notes to the bottom left, and use the horizontal justification hjust argument for the caption.

?element_text # In conjunction with the theme system, the element_ functions specify the display of how non-data components of the plot are drawn.

crashes_by_gender_1 + theme(plot.caption = element_text(hjust = 0) # Horizontal justification in [0, 1]
                          )

Note that you will have to store/assign the command applied on the graph object crashes_by_gender into an object (not done above) - else if you later apply to print the object it will not return the captions on bottom left in-spite of the command giving you no errors.

Lets do that below.

crashes_by_gender_2 <- 
crashes_by_gender_1 + theme(plot.caption = element_text(hjust = 0) # Horizontal justification in [0, 1]
                          )
crashes_by_gender_2

2.4.2.1.1.4 Controlling Legend of the Bare Bones GGPLOT2 chart
crashes_by_gender_3 <- 
crashes_by_gender_2 +   theme(legend.position = "bottom",            # legend at bottom 
                              legend.title    = element_blank()
                              ) 

# crashes_by_gender_3 


crashes_by_gender_4 <- 
crashes_by_gender_3 +   theme(legend.position = "none")               # no legend

# crashes_by_gender_4 

# plot the two charts - one ith legend at bottom and one without legend 
plot_grid(crashes_by_gender_3,
          crashes_by_gender_4,
          labels = "AUTO") 

So in summary we created charts by sequentially adding elements.

# library(cowplot)
plot_grid(crashes_by_gender_0,  # ggplot with data and mapping arguments, along with chart type argument 
          crashes_by_gender_1,  # control labels 
          crashes_by_gender_2,  # caption (and shift to bottom left)
          crashes_by_gender_3,  # control legend - at the bottom
          crashes_by_gender_4,  # control legend - remove it 
          labels = "AUTO") 

2.4.2.1.1.5 Controlling Theme of Bare Bones GGPLOT2 chart

This is an optional part and not the most important element of the chart. I control the size of the title by theme(plot.title = element_text(size=8)). Everything in a ggplot2 graph can be controlled !

?theme # Themes are a powerful way to customize the non-data components of your plots: i.e. titles, labels, fonts, background, gridlines, and legends. Themes can be used to give plots a consistent customized look. Modify a single plot's theme using theme()


g1 <- crashes_by_gender_1 + theme_bw()       + theme(plot.title = element_text(size=8))
g2 <- crashes_by_gender_2 + theme_classic()  + theme(plot.title = element_text(size=8))
g3 <- crashes_by_gender_3 + theme_void()     + theme(plot.title = element_text(size=8))
g4 <- crashes_by_gender_4 + theme_dark()     + theme(plot.title = element_text(size=8))


g1+g2+g3+g4 # requires library(patchwork)

2.4.2.2 SUVs are most likely to crash

People may feel safer and thus drive less carefully, or maybe their center of gravity is higher than cars making them more likely to get displaced easily…

table(df_train_clean$CAR_TYPE)
## 
##     Minivan Panel Truck      Pickup  Sports Car         Van       z_SUV 
##        1701         347        1013         714         488        1781
# clean values for graph/tables... 
df_train_clean$CAR_TYPE[df_train_clean$CAR_TYPE == "z_SUV"] <- "SUV"

# create ggplot the usual way
crashes_by_car_type_0 <-
  ggplot(  data = subset( x      = df_train_clean, 
                          select = TARGET_FLAG == 1    # focus on people who got into a crash
                        ), 
        mapping = aes(x    = CAR_TYPE,                 # variable on x axis
                     fill  = CAR_TYPE                  # colour related aesthetics
                     )
       ) +
  geom_bar()  +
  labs( x = "Car Type",                                 # labels 
        y = "Frequency", 
    title = "SUVs are most likely to crash",
 subtitle = "",        # Put your subtitle here (optional) - E.G. Panel trucks least lilely to crash
  caption = "Note: Car crash data."
                              )  +   
  theme(plot.caption = element_text(hjust = 0)) +     # shift caption to the bottom left
    theme(legend.position = "none")                   # no legend

  
  crashes_by_car_type_0

You can also sort out the bars by height in either ascending or descending order.

There are multiple ways to reorder bar charts, just search “Order Bars in ggplot2 bar graph” in StackOverflow. One simple way is to create a function that sorts the underlying factors and then apply that function in ggplot.

Also, you can label the bars with the count for easier reading.

2.4.2.2.1 Replacing all other observations starting with z_ in different variables…

Can use brute force, or better use sapply function to implement gsub command.

# AUTOMATE 
df_train_cleaned <- as_data_frame(sapply(X   = df_train_clean[c(10,12:13,25)], 
                                        FUN = function(x) {
                                                  gsub(pattern = "z_", 
                                                  replacement = "",
                                                  x = x
                                                      )
                                                }
                                       )
                                 )
## Warning: `as_data_frame()` was deprecated in tibble 2.0.0.
## ℹ Please use `as_tibble()` (with slightly different semantics) to convert to a
##   tibble, or `as.data.frame()` to convert to a data frame.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
df_train_clean[c(10,12:13,25)] <- df_train_cleaned  # replacing the 4 variables with cleaned values
rm(df_train_cleaned)


# BRUTE FORCE
#df_train_clean$MSTATUS   <- gsub(pattern = "z_",replacement = "",x = df_train_clean$MSTATUS)
#df_train_clean$EDUCATION <- gsub(pattern = "z_",replacement = "",x = df_train_clean$EDUCATION)
#df_train_clean$JOB       <- gsub(pattern = "z_",replacement = "",x = df_train_clean$JOB)
#df_train_clean$URBANICITY<- gsub(pattern = "z_",replacement = "",x = df_train_clean$URBANICITY)

2.4.2.3 Blue Collar people more likely to be in a crash

# transform raw data into a format that ggplot accepts for graphing
df_train_clean %>% 
  dplyr::select(JOB, TARGET_FLAG) %>%      # select the 2 columns/variables
  count(JOB, TARGET_FLAG) %>%              # create a table of count of these 2 variables
  dplyr::filter(TARGET_FLAG == 1) %>%           # filter onto rows where cars crashed
  ggplot(aes(x = JOB,                      # X axis is type of car
             y = n,                             # Y axis is count of car crashes
             fill = JOB)                   # different colors 
         ) +
  geom_bar(stat = "identity") +                 # height of the bar proportional to the number of cases in each group 
  geom_text(aes(label=n), vjust=1.6, color="white", size=3.5) + 
  ggtitle("Blue Collar people are more likely to be in a crash") + ylab("Frequency") + xlab("Job")+ # titles and labels
  theme(legend.position = "none") 

2.4.2.4 High School students to be more likely in a crash

# transform raw data into a format that ggplot accepts for graphing
df_train_clean %>% 
  dplyr::select(EDUCATION, TARGET_FLAG) %>%      # select the 2 columns/variables
  count(EDUCATION, TARGET_FLAG) %>%              # create a table of count of these 2 variables
  dplyr::filter(TARGET_FLAG == 1) %>%           # filter onto rows where cars crashed
  ggplot(aes(x = EDUCATION,                      # X axis is type of car
             y = n,                             # Y axis is count of car crashes
             fill = EDUCATION)                   # different colors 
         ) +
  geom_bar(stat = "identity") +                 # height of the bar proportional to the number of cases in each group 
  geom_text(aes(label=n), vjust=1.6, color="white", size=3.5) + 
  ggtitle("High School students to be more likely in a crash") + ylab("Frequency") + xlab("Education")+ # titles and labels
  theme(legend.position = "none") 

2.4.2.5 Married and non-married individuals are not different in terms of car carshes

Color Palette

# transform raw data into a format that ggplot accepts for graphing
df_train_clean %>% 
  dplyr::select(MSTATUS, TARGET_FLAG) %>%      # select the 2 columns/variables
  count(MSTATUS, TARGET_FLAG) %>%              # create a table of count of these 2 variables
  dplyr::filter(TARGET_FLAG == 1) %>%           # filter onto rows where cars crashed
  ggplot(aes(x = MSTATUS,                      # X axis is type of car
             y = n,                             # Y axis is count of car crashes
             fill = MSTATUS)                   # different colors 
         ) +
  geom_bar(stat = "identity") +                 # height of the bar proportional to the number of cases in each group 
  geom_text(aes(label=n), vjust=1.6, color="white", size=3.5) +
  ggtitle("No difference between married and non-married individuals in terms of car carshes") + ylab("Frequency") + xlab("Marital Status") +  scale_fill_manual(values=c("deepskyblue", "deeppink")) + theme(legend.position = "none") 

2.4.2.6 Single Parent much less likely to be in a crash

df_train_clean %>% 
#  rename(SINGLE_PARENT = PARENT1) %>% 
  dplyr::select(PARENT1, TARGET_FLAG) %>%      # select the 2 columns/variables
  count(PARENT1, TARGET_FLAG) %>%              # create a table of count of these 2 variables
  dplyr::filter(TARGET_FLAG == 1) %>%          # filter onto rows where cars crashed
  ggplot(aes(x = PARENT1,                      
             y = n,                            # Y axis is count of car crashes
             fill = PARENT1)                   # different colors 
         ) +
  geom_bar(stat = "identity") +                 # height of the bar proportional to the number of cases in each group 
  geom_text(aes(label=n), vjust=1.6, color="white", size=3.5) +
  ggtitle("Single Parent much less likely to be in a crash") + ylab("Frequency") + xlab("Single Parent") +  scale_fill_manual(values=c("gray", "black"))  + theme(legend.position = "none") 

2.4.2.7 Private cars to be more likely in a crash

# transform raw data into a format that ggplot accepts for graphing
df_train_clean %>% 
  dplyr::select(CAR_USE, TARGET_FLAG) %>%      # select the 2 columns/variables
  count(CAR_USE, TARGET_FLAG) %>%              # create a table of count of these 2 variables
  dplyr::filter(TARGET_FLAG == 1) %>%           # filter onto rows where cars crashed
  ggplot(aes(x = CAR_USE,                      
             y = n,                             # Y axis is count of car crashes
             fill = CAR_USE)                   # different colors 
         ) +
  geom_bar(stat = "identity") +                 # height of the bar proportional to the number of cases in each group 
  geom_text(aes(label=n), vjust=1.6, color="black", size=3.5) + 
  ggtitle("Private cars to be more likely in a crash") + ylab("Frequency") + xlab("Car Type")+ # titles and labels
  theme(legend.position = "none") +  scale_fill_manual(values=c("yellow", "violet")) 

2.4.2.8 Red cars are less likely in a crash

Maybe one can see a red car more easily and thus move away from it more easily.

# transform raw data into a format that ggplot accepts for graphing
df_train_clean %>% 
  dplyr::select(RED_CAR, TARGET_FLAG) %>%      # select the 2 columns/variables
  count(RED_CAR, TARGET_FLAG) %>%              # create a table of count of these 2 variables
  dplyr::filter(TARGET_FLAG == 1) %>%           # filter onto rows where cars crashed
  ggplot(aes(x = RED_CAR,                      
             y = n,                             # Y axis is count of car crashes
             fill = RED_CAR)                   # different colors 
         ) +
  geom_bar(stat = "identity") +                 # height of the bar proportional to the number of cases in each group 
  geom_text(aes(label=n), vjust=1.6, color="black", size=3.5) + 
  ggtitle("Red cars to less likely in a crash ") + ylab("Frequency") + xlab("Car Type")+ # titles and labels
  theme(legend.position = "none") +  scale_fill_manual(values=c("chartreuse3", "blue1")) 

2.4.2.9 Revoked Licensees are less likely to be in a crash

If your car license was revoked in the last 7 years, one cannot drive and thus is likely to not be a car accident.

# transform raw data into a format that ggplot accepts for graphing
df_train_clean %>% 
  dplyr::select(REVOKED, TARGET_FLAG) %>%      # select the 2 columns/variables
  count(REVOKED, TARGET_FLAG) %>%              # create a table of count of these 2 variables
  dplyr::filter(TARGET_FLAG == 1) %>%           # filter onto rows where cars crashed
  ggplot(aes(x = REVOKED,                      
             y = n,                             # Y axis is count of car crashes
             fill = REVOKED)                   # different colors 
         ) +
  geom_bar(stat = "identity") +                 # height of the bar proportional to the number of cases in each group 
  geom_text(aes(label=n), vjust=1.6, color="white", size=3.5) + 
  ggtitle("Revoked Licensees are less likely to be in a crash") + ylab("Frequency") + xlab("License revoked in last 7 years")+ # titles and labels
  theme(legend.position = "none") +  scale_fill_manual(values=c("darkorange", "deeppink4")) 

2.4.2.10 Urban home/work area are much more likely to see a car crash

Makes sense as car density is likely to be less in rural areas.

# transform raw data into a format that ggplot accepts for graphing
df_train_clean %>% 
  dplyr::select(URBANICITY, TARGET_FLAG) %>%      # select the 2 columns/variables
  count(URBANICITY, TARGET_FLAG) %>%              # create a table of count of these 2 variables
  dplyr::filter(TARGET_FLAG == 1) %>%           # filter onto rows where cars crashed
  ggplot(aes(x = URBANICITY,                      
             y = n,                             # Y axis is count of car crashes
             fill = URBANICITY)                   # different colors 
         ) +
  geom_bar(stat = "identity") +                 # height of the bar proportional to the number of cases in each group 
  geom_text(aes(label=n), vjust=1.6, color="white", size=3.5) + 
  ggtitle("Urban home/work area are much more likely to see a car crash") + ylab("Frequency") + xlab("")+ # titles and labels
  theme(legend.position = "none") +  scale_fill_manual(values=c("cornflowerblue", "brown4")) 

3 Modelling

3.1 Multivariate Model

3.1.1 Model 1

Keep all the variables.

Note we have 10 factor variables here (you can keep them as character variables), and 15 quantitative (numeric/integer) variables.

The log model (of crash damage) is much better fit than the non-log model of crash damage.

lm_model_1a <-lm(data    = df_train_clean,
                 formula = TARGET_FLAG ~ .- TARGET_AMT)

lm_model_1b <-lm(data    = df_train_clean,
                formula  = TARGET_AMT  ~ .- TARGET_FLAG)

lm_model_1b_log <-lm(data    = df_train_clean,
                     formula = log(TARGET_AMT + .001)  ~ .- TARGET_FLAG)
vif(lm_model_1b_log)
##                 GVIF Df GVIF^(1/(2*Df))
## KIDSDRIV    1.313573  1        1.146112
## AGE         1.492538  1        1.221695
## HOMEKIDS    2.111876  1        1.453230
## YOJ         1.467483  1        1.211397
## INCOME      3.151934  1        1.775369
## PARENT1     1.856506  1        1.362536
## HOME_VAL    2.455181  1        1.566902
## MSTATUS     2.111076  1        1.452954
## SEX         3.223269  1        1.795347
## EDUCATION  11.267988  4        1.353570
## JOB        22.068377  7        1.247336
## TRAVTIME    1.036217  1        1.017947
## CAR_USE     2.354052  1        1.534292
## BLUEBOOK    1.873002  1        1.368577
## TIF         1.008645  1        1.004313
## CAR_TYPE    4.506696  5        1.162481
## RED_CAR     1.850800  1        1.360441
## OLDCLAIM    1.702785  1        1.304908
## CLM_FREQ    1.606960  1        1.267659
## REVOKED     1.294901  1        1.137937
## MVR_PTS     1.236740  1        1.112088
## CAR_AGE     2.048116  1        1.431124
## URBANICITY  1.251308  1        1.118619
# Convert all columns to factor
 df_factor <- as.data.frame(unclass(x = df_train_clean),   
                            stringsAsFactors = TRUE)

# lm_model_1a_fac <-lm(data    = df_factor,
#                    formula = TARGET_FLAG ~ .- TARGET_AMT)

# lm_model_1b_fac <-lm(data    = df_factor,
#                    formula = TARGET_AMT  ~ .- TARGET_FLAG)


stargazer(lm_model_1a, lm_model_1b, # lm_model_1a_fac, lm_model_1b_fac,
          lm_model_1b_log,
          type='text',
          column.labels=c("Crash Dummy","Crash Damage"#,"Crash Dummy","Crash Damage"
                          ,"Crash Damage"
                          ) #,covariate.labels=c()
)
## 
## =================================================================================
##                                                Dependent variable:               
##                                 -------------------------------------------------
##                                 TARGET_FLAG  TARGET_AMT   log(TARGET_AMT + 0.001)
##                                 Crash Dummy Crash Damage       Crash Damage      
##                                     (1)          (2)                (3)          
## ---------------------------------------------------------------------------------
## KIDSDRIV                         0.048***     206.551*           0.712***        
##                                   (0.011)     (125.535)           (0.168)        
##                                                                                  
## AGE                               -0.001       -4.368             -0.010         
##                                   (0.001)      (7.921)            (0.011)        
##                                                                                  
## HOMEKIDS                           0.001       15.650              0.027         
##                                   (0.006)     (72.430)            (0.097)        
##                                                                                  
## YOJ                               -0.002       -2.216             -0.027         
##                                   (0.001)     (16.525)            (0.022)        
##                                                                                  
## INCOME                           -0.00000*     -0.003            -0.00001*       
##                                  (0.00000)     (0.002)           (0.00000)       
##                                                                                  
## PARENT1Yes                       0.085***     468.244**          1.291***        
##                                   (0.020)     (224.397)           (0.300)        
##                                                                                  
## HOME_VAL                        -0.00000***    -0.001           -0.00000***      
##                                  (0.00000)     (0.001)           (0.00000)       
##                                                                                  
## MSTATUSYes                       -0.059***   -541.918***         -0.905***       
##                                   (0.015)     (167.105)           (0.224)        
##                                                                                  
## SEXMale                           0.032*      470.387**           0.499*         
##                                   (0.018)     (203.976)           (0.273)        
##                                                                                  
## EDUCATIONBachelors               -0.061***    -361.819           -0.944***       
##                                   (0.020)     (226.694)           (0.304)        
##                                                                                  
## EDUCATIONHigh School              0.0002      -221.025            -0.003         
##                                   (0.016)     (186.983)           (0.250)        
##                                                                                  
## EDUCATIONMasters                  -0.057*     -302.319            -0.862*        
##                                   (0.030)     (339.650)           (0.455)        
##                                                                                  
## EDUCATIONPhD                       0.007       439.095             0.153         
##                                   (0.038)     (426.262)           (0.571)        
##                                                                                  
## JOBClerical                        0.030      -157.256             0.436         
##                                   (0.018)     (209.929)           (0.281)        
##                                                                                  
## JOBDoctor                        -0.118***  -1,347.928***        -1.841***       
##                                   (0.044)     (498.049)           (0.667)        
##                                                                                  
## JOBHome Maker                     -0.013      -254.283            -0.221         
##                                   (0.026)     (300.733)           (0.403)        
##                                                                                  
## JOBLawyer                         -0.022      -225.825            -0.343         
##                                   (0.030)     (341.000)           (0.457)        
##                                                                                  
## JOBManager                       -0.140***  -1,115.487***        -2.147***       
##                                   (0.023)     (256.217)           (0.343)        
##                                                                                  
## JOBProfessional                   -0.022       -27.588            -0.310         
##                                   (0.021)     (233.689)           (0.313)        
##                                                                                  
## JOBStudent                        -0.013      -433.902            -0.241         
##                                   (0.023)     (264.783)           (0.355)        
##                                                                                  
## TRAVTIME                         0.002***     10.872***          0.033***        
##                                  (0.0003)      (3.616)            (0.005)        
##                                                                                  
## CAR_USEPrivate                   -0.131***   -755.101***         -1.987***       
##                                   (0.016)     (183.177)           (0.245)        
##                                                                                  
## BLUEBOOK                        -0.00000***     0.013           -0.00004***      
##                                  (0.00000)     (0.010)           (0.00001)       
##                                                                                  
## TIF                              -0.008***   -44.585***          -0.115***       
##                                   (0.001)     (13.680)            (0.018)        
##                                                                                  
## CAR_TYPEPanel Truck               0.072**      476.472            1.119**        
##                                   (0.029)     (328.799)           (0.440)        
##                                                                                  
## CAR_TYPEPickup                   0.069***     406.858**          1.066***        
##                                   (0.017)     (188.044)           (0.252)        
##                                                                                  
## CAR_TYPESports Car               0.155***   1,263.493***         2.379***        
##                                   (0.021)     (236.703)           (0.317)        
##                                                                                  
## CAR_TYPESUV                      0.111***    909.604***          1.718***        
##                                   (0.017)     (194.832)           (0.261)        
##                                                                                  
## CAR_TYPEVan                      0.064***     489.637**          0.986***        
##                                   (0.021)     (241.866)           (0.324)        
##                                                                                  
## RED_CARyes                       -0.035**     -169.227           -0.511**        
##                                   (0.015)     (171.095)           (0.229)        
##                                                                                  
## OLDCLAIM                        -0.00000***    -0.005           -0.00003***      
##                                  (0.00000)     (0.008)           (0.00001)       
##                                                                                  
## CLM_FREQ                         0.034***      79.674            0.494***        
##                                   (0.005)     (62.040)            (0.083)        
##                                                                                  
## REVOKEDYes                       0.144***     497.806**          2.161***        
##                                   (0.017)     (195.419)           (0.262)        
##                                                                                  
## MVR_PTS                          0.022***    172.716***          0.339***        
##                                   (0.003)     (29.095)            (0.039)        
##                                                                                  
## CAR_AGE                           -0.001      -25.433*            -0.010         
##                                   (0.001)     (14.474)            (0.019)        
##                                                                                  
## URBANICITYHighly Urban/ Urban    0.294***   1,640.830***         4.472***        
##                                   (0.014)     (153.696)           (0.206)        
##                                                                                  
## Constant                         0.148***      718.227           -4.695***       
##                                   (0.045)     (516.388)           (0.691)        
##                                                                                  
## ---------------------------------------------------------------------------------
## Observations                       6,044        6,044              6,044         
## R2                                 0.237        0.077              0.237         
## Adjusted R2                        0.233        0.071              0.232         
## Residual Std. Error (df = 6007)    0.387      4,388.962            5.877         
## F Statistic (df = 36; 6007)      51.944***    13.826***          51.827***       
## =================================================================================
## Note:                                                 *p<0.1; **p<0.05; ***p<0.01

3.1.2 Model 2

Keep the relevant variables.

First, begin by removing job variables as they have the highest multicollinearity, which fixes the issue. This also increase model fit substantially.

Second, remove variables that are insignificant - both in magnitude and potentially theoretical effect. For instance,

  • kids at home (and we already keep kids of driving age).
  • years on job.
  • car age.
lm_model_2a <-lm(data    = df_train_clean,
                 formula = TARGET_FLAG ~ .- TARGET_AMT - JOB - HOMEKIDS - YOJ - CAR_AGE)

lm_model_2b <-lm(data    = df_train_clean,
                formula  = TARGET_AMT  ~ .- TARGET_FLAG - JOB - HOMEKIDS - YOJ - CAR_AGE)

lm_model_2b_log <-lm(data    = df_train_clean,
                     formula = log(TARGET_AMT + .001)  ~ .- TARGET_FLAG - JOB - HOMEKIDS - YOJ - CAR_AGE)
vif(lm_model_2b_log)
##                GVIF Df GVIF^(1/(2*Df))
## KIDSDRIV   1.075088  1        1.036864
## AGE        1.248689  1        1.117448
## INCOME     2.564212  1        1.601316
## PARENT1    1.533708  1        1.238430
## HOME_VAL   2.352907  1        1.533919
## MSTATUS    1.917462  1        1.384725
## SEX        3.159437  1        1.777480
## EDUCATION  1.915983  4        1.084673
## TRAVTIME   1.032804  1        1.016270
## CAR_USE    1.522138  1        1.233749
## BLUEBOOK   1.866322  1        1.366134
## TIF        1.005249  1        1.002621
## CAR_TYPE   4.043306  5        1.149936
## RED_CAR    1.848084  1        1.359443
## OLDCLAIM   1.699405  1        1.303612
## CLM_FREQ   1.605332  1        1.267017
## REVOKED    1.290697  1        1.136088
## MVR_PTS    1.229020  1        1.108612
## URBANICITY 1.199977  1        1.095435
# plot(lm_model_2b_log)

stargazer(lm_model_2a, lm_model_2b, # lm_model_1a_fac, lm_model_1b_fac,
          lm_model_2b_log,
          type='text',
          column.labels=c("Crash Dummy","Crash Damage"#,"Crash Dummy","Crash Damage"
                          ,"Crash Damage"
                          ) #,covariate.labels=c()
)
## 
## ================================================================================
##                                               Dependent variable:               
##                                 ------------------------------------------------
##                                 TARGET_FLAG  TARGET_AMT  log(TARGET_AMT + 0.001)
##                                 Crash Dummy Crash Damage      Crash Damage      
##                                     (1)         (2)                (3)          
## --------------------------------------------------------------------------------
## KIDSDRIV                         0.049***    224.399**          0.733***        
##                                   (0.010)    (113.837)           (0.153)        
##                                                                                 
## AGE                               -0.001*      -6.417            -0.017*        
##                                   (0.001)     (7.262)            (0.010)        
##                                                                                 
## INCOME                          -0.00000***   -0.004*          -0.00001***      
##                                  (0.00000)    (0.002)           (0.00000)       
##                                                                                 
## PARENT1Yes                       0.086***    485.810**          1.309***        
##                                   (0.018)    (204.439)           (0.275)        
##                                                                                 
## HOME_VAL                        -0.00000**     -0.001          -0.00000**       
##                                  (0.00000)    (0.001)           (0.00000)       
##                                                                                 
## MSTATUSYes                       -0.061***  -569.813***         -0.933***       
##                                   (0.014)    (159.633)           (0.214)        
##                                                                                 
## SEXMale                            0.029     460.400**           0.457*         
##                                   (0.018)    (202.423)           (0.272)        
##                                                                                 
## EDUCATIONBachelors               -0.095***  -640.800***         -1.459***       
##                                   (0.017)    (191.637)           (0.257)        
##                                                                                 
## EDUCATIONHigh School              -0.013     -323.833*           -0.198         
##                                   (0.016)    (181.267)           (0.243)        
##                                                                                 
## EDUCATIONMasters                 -0.098***  -750.512***         -1.486***       
##                                   (0.020)    (223.181)           (0.300)        
##                                                                                 
## EDUCATIONPhD                     -0.071***   -572.329*          -1.072***       
##                                   (0.027)    (304.843)           (0.409)        
##                                                                                 
## TRAVTIME                         0.002***    11.601***          0.034***        
##                                  (0.0003)     (3.619)            (0.005)        
##                                                                                 
## CAR_USEPrivate                   -0.144***  -921.134***         -2.182***       
##                                   (0.013)    (147.643)           (0.198)        
##                                                                                 
## BLUEBOOK                        -0.00000***    0.012           -0.00004***      
##                                  (0.00000)    (0.010)           (0.00001)       
##                                                                                 
## TIF                              -0.008***   -44.180***         -0.115***       
##                                   (0.001)     (13.689)           (0.018)        
##                                                                                 
## CAR_TYPEPanel Truck               0.057**     333.756            0.890**        
##                                   (0.028)    (316.296)           (0.425)        
##                                                                                 
## CAR_TYPEPickup                   0.063***     331.721*          0.963***        
##                                   (0.016)    (185.183)           (0.249)        
##                                                                                 
## CAR_TYPESports Car               0.151***   1,239.269***        2.329***        
##                                   (0.021)    (236.861)           (0.318)        
##                                                                                 
## CAR_TYPESUV                      0.110***    898.172***         1.700***        
##                                   (0.017)    (194.917)           (0.262)        
##                                                                                 
## CAR_TYPEVan                      0.063***     466.371*          0.976***        
##                                   (0.021)    (239.785)           (0.322)        
##                                                                                 
## RED_CARyes                       -0.037**     -190.372          -0.541**        
##                                   (0.015)    (171.373)           (0.230)        
##                                                                                 
## OLDCLAIM                        -0.00000***    -0.005          -0.00004***      
##                                  (0.00000)    (0.008)           (0.00001)       
##                                                                                 
## CLM_FREQ                         0.034***      84.685           0.507***        
##                                   (0.005)     (62.155)           (0.083)        
##                                                                                 
## REVOKEDYes                       0.146***    515.142***         2.201***        
##                                   (0.017)    (195.562)           (0.263)        
##                                                                                 
## MVR_PTS                          0.023***    180.110***         0.355***        
##                                   (0.003)     (29.072)           (0.039)        
##                                                                                 
## URBANICITYHighly Urban/ Urban    0.274***   1,524.455***        4.168***        
##                                   (0.013)    (150.866)           (0.203)        
##                                                                                 
## Constant                         0.188***     752.485           -4.103***       
##                                   (0.041)    (464.974)           (0.624)        
##                                                                                 
## --------------------------------------------------------------------------------
## Observations                       6,044       6,044              6,044         
## R2                                 0.228       0.071              0.228         
## Adjusted R2                        0.225       0.067              0.224         
## Residual Std. Error (df = 6017)    0.389     4,399.319            5.908         
## F Statistic (df = 26; 6017)      68.403***   17.582***          68.212***       
## ================================================================================
## Note:                                                *p<0.1; **p<0.05; ***p<0.01

3.2 Coefficient Interpretation

I will focus on Model 2a (column 1) and Model 2b log (column 3).

3.2.1 Target Flag Regression

Model 2a (column 1):

This type of regression is also called a linear probability model (LPM) i.e.a regression model where the outcome variable is a binary variable, and one or more explanatory variables are used to predict the outcome. Explanatory variables can themselves be binary, or be continuous.

It is in level-level format.

3.2.1.1 Quantitative Variable

  • If the age of the driver increases by one year, then the car crash probability reduces by .001, cetrius paribus. This is statistically significant at the 10% significance level but the economic magnitude seems small.

3.2.1.2 Quanlitative/Dummy Variable

  • If you have a bachelors degree, then the car crash probability reduces by 0.061 compared to those who have less than high school education (baseline category), cetrius paribus. This is statistically significant at the 1% significance level and the the economic magnitude seems fair.

  • If you are living in an urban area, then the car crash probability increases by 0.294 compared to if one is living in rural areas. This is not only significant at the 1% level but also large in economic magnitude.

3.2.2 Target Amount Regression

Model 2b log (column 3):

It is in log-level format, as crash damages are positive skewed.

3.2.2.1 Quantitative Variable

  • If the age of the driver increases by one year, then the car crash probability reduces by approximately 1.7 percent (different from percentage points), cetrius paribus. This is statistically significant at the 10% significance level but the economic magnitude seems small.

3.2.2.2 Quanlitative/Dummy Variable

  • If you have a bachelors degree, then the car crash probability reduces by 145.9 percent (not percentage points) compared to those who have less than high school education (baseline category), cetrius paribus. This is statistically significant at the 1% significance level and the the economic magnitude seems fair.

  • If you are living in an urban area, then the car crash probability increases by 416.80 percent (not percentage points) compared to if one is living in rural areas. This is not only significant at the 1% level but also large in economic magnitude.

3.3 Coefficients make sense ???

As theory suggests, our regression provides evidence for -

  • Very young people tend to be risky. Can include a quadratic term in age for testing if very old people also risky.

  • Commercial vehicles are more likely to crash, probably because they are driven more.

  • The more claims you filed in the past, the more you are likely to file in the future.

  • More educated people tend to drive more safely.

  • Home owners tend to drive more responsibly.

  • Rich people tend to get into fewer crashes.

  • Married people drive more safely.

  • If you get lots of traffic tickets(Motor Vehicle Records), you tend to get into more crashes.

  • Red cars are riskier.

  • If your license was revoked in the past 7 years, you probably are a riskier driver.

  • Women have less crashes than men.

  • Long drives to work usually suggest greater risk.

The raw data summary charts showed unconditional means. In the regression, we are controlling for many more factors / risk profiles and thus we have conditional effects here - these regression estimates are more causal estimates.

So the models looks fine.

3.4 Logistic & PROBIT Model

3.4.1 Model 1, Model 2 and Model 3

I will begin by modelling the binary dependent variable.

  • Model 1 is the kitchen sink approach. Logit regression.

  • Model 2 is the kitchen sink approach with relevant variables. Logit regression.

  • Model 3 is the kitchen sink approach with relevant variables. Probit regression

Probit regression implementation in R. Probit analysis will produce results similar logistic regression. The choice of probit versus logit depends largely on individual preferences.

logit_model_1a <-glm(data    =  df_train_clean,
                     formula =  TARGET_FLAG ~ .- TARGET_AMT,
                     family  =  "binomial"
                     )
 summary(logit_model_1a)
## 
## Call:
## glm(formula = TARGET_FLAG ~ . - TARGET_AMT, family = "binomial", 
##     data = df_train_clean)
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.431e+00  3.206e-01  -7.583 3.38e-14 ***
## KIDSDRIV                       3.196e-01  7.059e-02   4.527 5.99e-06 ***
## AGE                           -3.933e-03  4.681e-03  -0.840 0.400784    
## HOMEKIDS                       2.295e-02  4.275e-02   0.537 0.591322    
## YOJ                           -9.407e-03  9.774e-03  -0.962 0.335863    
## INCOME                        -3.384e-06  1.433e-06  -2.362 0.018185 *  
## PARENT1Yes                     4.135e-01  1.266e-01   3.266 0.001091 ** 
## HOME_VAL                      -1.433e-06  4.289e-07  -3.342 0.000833 ***
## MSTATUSYes                    -4.225e-01  1.007e-01  -4.195 2.73e-05 ***
## SEXMale                        2.131e-01  1.288e-01   1.655 0.097986 .  
## EDUCATIONBachelors            -3.815e-01  1.325e-01  -2.878 0.003998 ** 
## EDUCATIONHigh School          -2.509e-03  1.068e-01  -0.023 0.981260    
## EDUCATIONMasters              -4.487e-01  2.141e-01  -2.096 0.036081 *  
## EDUCATIONPhD                   8.444e-02  2.635e-01   0.320 0.748671    
## JOBClerical                    1.951e-01  1.205e-01   1.619 0.105370    
## JOBDoctor                     -6.914e-01  3.288e-01  -2.103 0.035506 *  
## JOBHome Maker                 -1.273e-01  1.784e-01  -0.714 0.475536    
## JOBLawyer                      4.411e-02  2.151e-01   0.205 0.837531    
## JOBManager                    -8.811e-01  1.604e-01  -5.493 3.95e-08 ***
## JOBProfessional               -9.442e-02  1.369e-01  -0.690 0.490347    
## JOBStudent                    -1.678e-01  1.532e-01  -1.095 0.273334    
## TRAVTIME                       1.565e-02  2.192e-03   7.140 9.35e-13 ***
## CAR_USEPrivate                -8.323e-01  1.061e-01  -7.848 4.24e-15 ***
## BLUEBOOK                      -2.250e-05  6.102e-06  -3.688 0.000226 ***
## TIF                           -5.222e-02  8.545e-03  -6.112 9.86e-10 ***
## CAR_TYPEPanel Truck            6.915e-01  1.949e-01   3.549 0.000387 ***
## CAR_TYPEPickup                 5.505e-01  1.155e-01   4.767 1.87e-06 ***
## CAR_TYPESports Car             1.111e+00  1.467e-01   7.575 3.58e-14 ***
## CAR_TYPESUV                    8.301e-01  1.258e-01   6.599 4.13e-11 ***
## CAR_TYPEVan                    5.655e-01  1.497e-01   3.778 0.000158 ***
## RED_CARyes                    -2.273e-01  1.032e-01  -2.203 0.027600 *  
## OLDCLAIM                      -1.326e-05  4.579e-06  -2.895 0.003787 ** 
## CLM_FREQ                       1.997e-01  3.321e-02   6.014 1.81e-09 ***
## REVOKEDYes                     8.496e-01  1.075e-01   7.903 2.72e-15 ***
## MVR_PTS                        1.165e-01  1.588e-02   7.336 2.20e-13 ***
## CAR_AGE                       -3.386e-03  8.911e-03  -0.380 0.703990    
## URBANICITYHighly Urban/ Urban  2.306e+00  1.244e-01  18.537  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6988.2  on 6043  degrees of freedom
## Residual deviance: 5360.5  on 6007  degrees of freedom
## AIC: 5434.5
## 
## Number of Fisher Scoring iterations: 5
logit_model_2a <-glm(data    = df_train_clean,
                     formula = TARGET_FLAG ~ .- TARGET_AMT - JOB - HOMEKIDS - YOJ - CAR_AGE,
                     family  = "binomial"
                     )
#vif(logit_model_2b)

probit_model_2a <-glm(data    = df_train_clean,
                     formula = TARGET_FLAG ~ .- TARGET_AMT - JOB - HOMEKIDS - YOJ - CAR_AGE,
                     family  = binomial(link = "probit")
                     )


stargazer(logit_model_1a, logit_model_2a, probit_model_2a,  
          type='text',
          column.labels=c("Crash Dummy","Crash Dummy","Crash Dummy") 
)
## 
## =================================================================
##                                       Dependent variable:        
##                               -----------------------------------
##                                           TARGET_FLAG            
##                                      logistic           probit   
##                               Crash Dummy Crash Dummy Crash Dummy
##                                   (1)         (2)         (3)    
## -----------------------------------------------------------------
## KIDSDRIV                       0.320***    0.327***    0.190***  
##                                 (0.071)     (0.063)     (0.037)  
##                                                                  
## AGE                             -0.004      -0.007*     -0.004   
##                                 (0.005)     (0.004)     (0.002)  
##                                                                  
## HOMEKIDS                         0.023                           
##                                 (0.043)                          
##                                                                  
## YOJ                             -0.009                           
##                                 (0.010)                          
##                                                                  
## INCOME                        -0.00000**  -0.00000*** -0.00000***
##                                (0.00000)   (0.00000)   (0.00000) 
##                                                                  
## PARENT1Yes                     0.414***    0.431***    0.252***  
##                                 (0.127)     (0.113)     (0.066)  
##                                                                  
## HOME_VAL                      -0.00000*** -0.00000*** -0.00000***
##                                (0.00000)   (0.00000)   (0.00000) 
##                                                                  
## MSTATUSYes                     -0.422***   -0.429***   -0.250*** 
##                                 (0.101)     (0.094)     (0.054)  
##                                                                  
## SEXMale                         0.213*       0.203      0.125*   
##                                 (0.129)     (0.127)     (0.072)  
##                                                                  
## EDUCATIONBachelors             -0.381***   -0.573***   -0.329*** 
##                                 (0.133)     (0.112)     (0.065)  
##                                                                  
## EDUCATIONHigh School            -0.003      -0.083      -0.041   
##                                 (0.107)     (0.103)     (0.060)  
##                                                                  
## EDUCATIONMasters               -0.449**    -0.620***   -0.348*** 
##                                 (0.214)     (0.134)     (0.077)  
##                                                                  
## EDUCATIONPhD                     0.084     -0.398**    -0.234**  
##                                 (0.264)     (0.186)     (0.106)  
##                                                                  
## JOBClerical                      0.195                           
##                                 (0.120)                          
##                                                                  
## JOBDoctor                      -0.691**                          
##                                 (0.329)                          
##                                                                  
## JOBHome Maker                   -0.127                           
##                                 (0.178)                          
##                                                                  
## JOBLawyer                        0.044                           
##                                 (0.215)                          
##                                                                  
## JOBManager                     -0.881***                         
##                                 (0.160)                          
##                                                                  
## JOBProfessional                 -0.094                           
##                                 (0.137)                          
##                                                                  
## JOBStudent                      -0.168                           
##                                 (0.153)                          
##                                                                  
## TRAVTIME                       0.016***    0.016***    0.009***  
##                                 (0.002)     (0.002)     (0.001)  
##                                                                  
## CAR_USEPrivate                 -0.832***   -0.878***   -0.501*** 
##                                 (0.106)     (0.086)     (0.050)  
##                                                                  
## BLUEBOOK                      -0.00002*** -0.00002*** -0.00001***
##                                (0.00001)   (0.00001)   (0.00000) 
##                                                                  
## TIF                            -0.052***   -0.051***   -0.030*** 
##                                 (0.009)     (0.008)     (0.005)  
##                                                                  
## CAR_TYPEPanel Truck            0.691***    0.579***    0.328***  
##                                 (0.195)     (0.186)     (0.107)  
##                                                                  
## CAR_TYPEPickup                 0.551***    0.493***    0.278***  
##                                 (0.115)     (0.113)     (0.064)  
##                                                                  
## CAR_TYPESports Car             1.111***    1.072***    0.620***  
##                                 (0.147)     (0.145)     (0.083)  
##                                                                  
## CAR_TYPESUV                    0.830***    0.814***    0.467***  
##                                 (0.126)     (0.125)     (0.071)  
##                                                                  
## CAR_TYPEVan                    0.565***    0.539***    0.302***  
##                                 (0.150)     (0.148)     (0.085)  
##                                                                  
## RED_CARyes                     -0.227**    -0.234**    -0.130**  
##                                 (0.103)     (0.102)     (0.059)  
##                                                                  
## OLDCLAIM                      -0.00001*** -0.00001*** -0.00001***
##                                (0.00000)   (0.00000)   (0.00000) 
##                                                                  
## CLM_FREQ                       0.200***    0.197***    0.119***  
##                                 (0.033)     (0.033)     (0.019)  
##                                                                  
## REVOKEDYes                     0.850***    0.860***    0.495***  
##                                 (0.108)     (0.106)     (0.062)  
##                                                                  
## MVR_PTS                        0.116***    0.122***    0.071***  
##                                 (0.016)     (0.016)     (0.009)  
##                                                                  
## CAR_AGE                         -0.003                           
##                                 (0.009)                          
##                                                                  
## URBANICITYHighly Urban/ Urban  2.306***    2.248***    1.221***  
##                                 (0.124)     (0.124)     (0.065)  
##                                                                  
## Constant                       -2.431***   -2.223***   -1.263*** 
##                                 (0.321)     (0.287)     (0.163)  
##                                                                  
## -----------------------------------------------------------------
## Observations                     6,044       6,044       6,044   
## Log Likelihood                -2,680.261  -2,711.779  -2,717.147 
## Akaike Inf. Crit.              5,434.522   5,477.557   5,488.294 
## =================================================================
## Note:                                 *p<0.1; **p<0.05; ***p<0.01

3.5 Coefficient Interpretation

I will focus on Logit Model 2a (column 2) as it is my preferred model.

3.5.1 Target Flag Regression

3.5.1.1 Quantitative Variable

Find your preferred interpretations.

  • If the age of the driver increases by one year, then the log odds of the car crash reduces by 0.007, cetrius paribus.
  • Alternatively, if the age of the driver increases by one year, then the odds of the car crash reduces by \(exp(-0.007)\) or approximately 0.9930244.
  • Equivalently, if the age of the driver increases by one year, then the odds of the car crash reduces by a factor of \(exp(-0.007)\) or approximately 0.9930244.
  • Alternatively, if the age of the driver increases by one year, then the odds of the car crash reduces by \((1 - 0.9930244) * 100\%\) or 0.69756 percent (not percentage points).

This is statistically significant at the 10% significance level but the economic magnitude seems small.

3.5.1.2 Quanlitative/Dummy Variable

  • If you have a bachelors degree, compared to those who have less than high school education (baseline category), the log odds of car crash reduces by 0.573, cetrius paribus. This is statistically significant at the 1% significance level and the the economic magnitude seems large i.e. \((1 - exp(-.573) ) * 100\%\) or 43 (percent, not percentage points) decrease.

  • If you are living in an urban area, compared to if one is living in rural areas (baseline caretory), the log odds of car crash increases by 1.561, cetrius paribus. This is not only significant at the 1% level but also very large in economic magnitude i.e. \((exp(1.561) -1 ) * 100\%\) or 376% (percent, not percentage points) increase.

4 Evaluation

4.1 Prediction

# use the best model to fit the original data / predict in the new==original data 
logit_probabilities   <- predict(object = logit_model_2a, 
                         newdata = df_train_clean, 
                         type = "response"
                         )

# between 0 and 1 as expected
psych::describe(logit_probabilities)
##    vars    n mean   sd median trimmed  mad min  max range skew kurtosis se
## X1    1 6044 0.26 0.22    0.2    0.24 0.21   0 0.96  0.95 0.92    -0.05  0
# convert the logit probabilities to a binary variable 
logit_probability_2_binary_variable <- ifelse(test = logit_probabilities > 0.5,
                                              yes  = 1, 
                                              no   = 0
                                              )

psych::describe(logit_probability_2_binary_variable)
##    vars    n mean   sd median trimmed mad min max range skew kurtosis se
## X1    1 6044 0.17 0.38      0    0.09   0   0   1     1 1.76     1.11  0
df_train_clean$PREDICTED_FLAG <- logit_probability_2_binary_variable

table("Actual"      = df_train_clean$TARGET_FLAG,
      "Predictions" = df_train_clean$PREDICTED_FLAG
      )
##       Predictions
## Actual    0    1
##      0 4101  342
##      1  920  681

4.2 Confusion Matrix

You can construct measures by hand, or use the caret package. Be sure to specify what is positive (event of interest) in your data (crash here)

  • I get about 80% accuracy, which is not great, but at least better than randomly guessing (50% chance of right guess). These are just the sum of the diagnol terms in the confusion matrix divided by total observations. By hand, \(\dfrac{4101+681}{4101+681+920+342} \approx 79.12\%\).

  • Classification Error Rate is simply the complement of accuracy (1-accuracy). It is the sum of the off-diagnol terms in the confusion matrix divided by total observations. By hand, \(\dfrac{920+342}{920+342+4101+681} \approx 20.88\%\).

  • Sensitivity/True Positive Rate/ probability of detection: Detecting true positives (who will get into a crash) incorrectly most of the time. By hand, \(\dfrac{681}{681+920} \approx 42\%\). If you are a car insurance company, you can run into serious losses by selling more policies to these risky people.

  • Specificity/True Negative Rate: Detecting true negatives (who will NOT get into a crash) correctly most of the time. By hand, \(\dfrac{4101}{4101+342} \approx 92\%\). If you are a car insurance company, you can create profits by selling policies to these less risky people.

df_train_clean$fac_TARGET_FLAG    <- as_factor(df_train_clean$TARGET_FLAG)
df_train_clean$fac_PREDICTED_FLAG <- as_factor(df_train_clean$PREDICTED_FLAG)

cm <- caret::confusionMatrix(data      = df_train_clean$fac_PREDICTED_FLAG, # predicted
                             reference = df_train_clean$fac_TARGET_FLAG,    # truth
                             positive  = "1"           #    an optional character string for the factor level that corresponds to a "positive" result. If there are only two factor levels, the first level will be used as the "positive" result
                             )
print(cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4101  920
##          1  342  681
##                                           
##                Accuracy : 0.7912          
##                  95% CI : (0.7807, 0.8014)
##     No Information Rate : 0.7351          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3939          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4254          
##             Specificity : 0.9230          
##          Pos Pred Value : 0.6657          
##          Neg Pred Value : 0.8168          
##              Prevalence : 0.2649          
##          Detection Rate : 0.1127          
##    Detection Prevalence : 0.1693          
##       Balanced Accuracy : 0.6742          
##                                           
##        'Positive' Class : 1               
## 
# define Classification Error Rate function
cm_cer <- function(dataset){
cm <- table("Predictions" = df_train_clean$fac_PREDICTED_FLAG, 
            "Actual"      = df_train_clean$fac_TARGET_FLAG  )
TP <- cm[2,2]
TN <- cm[1,1]
FP <- cm[2,1]
FN <- cm[1,2]
return((FP + FN)/(TP + FP + TN + FN))
}

# apply Classification Error Rate (CER) function on confusion matrix
cm_cer(cm)
## [1] 0.2088021
# CER is 1-Accuracy
1 - 0.7912  
## [1] 0.2088

5 Session Info

https://www.rostrum.blog/2018/10/13/sessioninfo/

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] skimr_2.2.1       cowplot_1.2.0     patchwork_1.3.2   visdat_0.6.0     
##  [5] caret_7.0-1       lattice_0.22-7    MASS_7.3-65       car_3.1-3        
##  [9] carData_3.0-5     data.table_1.17.8 corrplot_0.95     lubridate_1.9.4  
## [13] forcats_1.0.0     stringr_1.5.2     dplyr_1.1.4       purrr_1.1.0      
## [17] readr_2.1.5       tidyr_1.3.1       tibble_3.3.0      ggplot2_3.5.2    
## [21] tidyverse_2.0.0   stargazer_5.2.3   psych_2.5.6      
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1     timeDate_4041.110    farver_2.1.2        
##  [4] fastmap_1.2.0        pROC_1.19.0.1        digest_0.6.37       
##  [7] rpart_4.1.24         timechange_0.3.0     lifecycle_1.0.4     
## [10] survival_3.8-3       magrittr_2.0.3       compiler_4.4.2      
## [13] rlang_1.1.6          sass_0.4.10          tools_4.4.2         
## [16] yaml_2.3.10          knitr_1.50           labeling_0.4.3      
## [19] mnormt_2.1.1         repr_1.1.7           plyr_1.8.9          
## [22] RColorBrewer_1.1-3   abind_1.4-8          withr_3.0.2         
## [25] nnet_7.3-20          grid_4.4.2           stats4_4.4.2        
## [28] e1071_1.7-16         future_1.67.0        globals_0.18.0      
## [31] scales_1.4.0         iterators_1.0.14     cli_3.6.5           
## [34] crayon_1.5.3         rmarkdown_2.29       generics_0.1.4      
## [37] rstudioapi_0.17.1    future.apply_1.20.0  reshape2_1.4.4      
## [40] tzdb_0.5.0           proxy_0.4-27         cachem_1.1.0        
## [43] splines_4.4.2        parallel_4.4.2       base64enc_0.1-3     
## [46] vctrs_0.6.5          hardhat_1.4.2        Matrix_1.7-4        
## [49] jsonlite_2.0.0       hms_1.1.3            Formula_1.2-5       
## [52] listenv_0.9.1        foreach_1.5.2        gower_1.0.2         
## [55] jquerylib_0.1.4      recipes_1.3.1        glue_1.8.0          
## [58] parallelly_1.45.1    codetools_0.2-20     stringi_1.8.7       
## [61] gtable_0.3.6         pillar_1.11.0        htmltools_0.5.8.1   
## [64] ipred_0.9-15         lava_1.8.1           R6_2.6.1            
## [67] conflicted_1.2.0     evaluate_1.0.5       memoise_2.0.1       
## [70] bslib_0.9.0          class_7.3-23         Rcpp_1.1.0          
## [73] nlme_3.1-168         prodlim_2025.04.28   xfun_0.53           
## [76] pkgconfig_2.0.3      ModelMetrics_1.2.2.2
# devtools::session_info()

Controlling output - https://sebastiansauer.github.io/figure_sizing_knitr/