This file is still written for a coder/analyst, and you will have to polish it (change text and hide chunks) to make it an effective client ready report. Play around with code chunk options like suppressing warnings or output.

1 Setup

1.1 Working Directory and Data

Empty variables and functions in the environment tab/window, set working directory and load the training/testing data.

# Clear the workspace
rm(list = ls()) # Clear environment
gc()            # Clear unused memory
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 531483 28.4    1183890 63.3         NA   669265 35.8
## Vcells 978152  7.5    8388608 64.0      16384  1840436 14.1
cat("\f")       # Clear the console
graphics.off()  # Clear all graphs


# Set working directory and path to data
setwd("/Users/arvindsharma/Dropbox/WCAS/Econometrics/")
  

df_train    <- read.csv("insurance-training-data.csv")
df_eval     <- read.csv("insurance-evaluation-data.csv")

1.2 Load packages

Now, I will load the packages.

# Prepare needed libraries

packages <- c("psych",       # quick summary stats for data exploration,
              "mice",        # for imputation of missing values and vis of missing data,
              "stargazer",   # summary stats,
              "vtable",      # summary stats,
              "summarytools",# summary stats,
              "naniar",      # for visualisation of missing data,
              "visdat",      # for visualisation of missing data,
              "VIM",         # for visualisation of missing data,
              "DataExplorer",# for visualisation of missing data,
              "tidyverse",   # data manipulation like selecting variables,
              "fastDummies", # Create dummy variables using fastDummies,
              "corrplot",    # correlation plots,
              "ggplot2",     # graphing,
              "data.table",  # reshape for graphing, 
              "car"          # vif for multicollinearity
              )

for (i in 1:length(packages)) {
  if (!packages[i] %in% rownames(installed.packages())) {
    install.packages(packages[i]
                     , repos = "http://cran.rstudio.com/"
                     , dependencies = TRUE
                     )
  }
  library(packages[i], character.only = TRUE)
}

rm(packages)

2 Exploratory Data Analysis (EDA)

Be sure to talk about some insights from your summary statistics tables and graphic visualizations.

A good practice is to tell the reader what you are planning to do in the section right at the beginning/top.

  • I will first check for missing values and impute them, as there are only 2 variables with less about 6% missing values. While I can impute with median/mean or even just drop all rows with any missing observation, I will impute the mean with the mice package.

  • After I have imputed the mean, I will create my summary statistics table and visualize my data through charts.

Throughout the analysis, I will be describing what I am doing with test and also be commenting my code. Also, make sure to align your code i.e. = or " below each other - use space if your have to.

2.1 Missing Data

YOJ and CAR_AGE have about \(6\%\) missing values. AGE has some missing values too.

You can use any of the packages below. I personally prefer Amelia and naniar as they works on big data too, but like vis

?naniar
## No documentation for 'naniar' in specified packages and libraries:
## you could try '??naniar'
naniar::gg_miss_var(df_train)

naniar::gg_miss_upset(df_train)

naniar::vis_miss(df_train) 

# visdat::vis_miss(df_train)       ## same output, but works on small datasets only

?Amelia::missmap
Amelia::missmap(obj = df_train)

# VIM::aggr(df_train)
VIM::matrixplot(df_train)
## 
## Click in a column to sort by the corresponding variable.
## To regain use of the VIM GUI and the R console, click outside the plot region.
DataExplorer::plot_missing(df_train)

2.2 Impute Missing Values

The mice package (Multivariate Imputations by Chained Equations) implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation. Many diagnostic plots are implemented to inspect the quality of the imputations.

First, we will create an imputation_model using the mice() function, and then complete the missing values using the complete() function to obtain the imputed_dataset (with the missing values completed). The imputed data frame will contain the original data frame df_train with missing values replaced with imputed values.

?mice
## No documentation for 'mice' in specified packages and libraries:
## you could try '??mice'
imputed_model <-
mice::mice(data   = df_train, 
     m      = 1,         # default value is 5
     method = "mean",    # univariate imputation method - can play with
     seed   = 7          # integer argument for offsetting the random number generator
     )
## 
##  iter imp variable
##   1   1  AGE  YOJ  CAR_AGE
##   2   1  AGE  YOJ  CAR_AGE
##   3   1  AGE  YOJ  CAR_AGE
##   4   1  AGE  YOJ  CAR_AGE
##   5   1  AGE  YOJ  CAR_AGE
## Warning: Number of logged events: 14
?complete
## Help on topic 'complete' was found in the following packages:
## 
##   Package               Library
##   tidyr                 /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
##   mice                  /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
## 
## 
## Using the first match ...
train <- mice::complete(data = imputed_model)

Keep in mind that the mice package uses a multivariate imputation method that takes into account relationships between variables to impute missing values. It’s essential to consider the assumptions and limitations of the imputation method and perform appropriate validation and analysis to ensure the imputed values are reasonable and suitable for your analysis.

As you can see now, there are no missing values in the training data now.

?vis_miss
## Help on topic 'vis_miss' was found in the following packages:
## 
##   Package               Library
##   visdat                /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
##   naniar                /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
## 
## 
## Using the first match ...
naniar::vis_miss(train)

3 Summary Statistics

3.1 sumtable

We will use the sumtable function from vtable function to explore the raw data. The nice feature of the command is that if we have categorical data, we will quickly be able to see the different categories in them.

3.1.1 Re-coding data values

There are some spelling mistakes that I would like to fix immediately - as they will have implications in default graph labels later on.

# Change spellings
?recode
## Help on topic 'recode' was found in the following packages:
## 
##   Package               Library
##   car                   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
##   dplyr                 /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
## 
## 
## Using the first match ...
train$SEX        <- dplyr::recode(train$SEX,
                                  "M"   = "Male",
                                  "z_F"  = "Female"
                                  )

train$CAR_TYPE   <- dplyr::recode(train$CAR_TYPE,
                                  "z_SUV"   = "SUV"
                                 )

train$EDUCATION  <- dplyr::recode(train$EDUCATION,
                                  "z_High School"   = "High School"
                                 ) 

train$MSTATUS    <- dplyr::recode(train$MSTATUS,
                                  "z_No" = "No"
                                  )

train$RED_CAR    <- dplyr::recode(train$RED_CAR,
                                  "yes"  = "Yes",
                                  "no"   = "No"
                                  )

train$URBANICITY <- dplyr::recode(train$URBANICITY,
                                  "Highly Urban/ Urban"   = "Highly Urban / Urban",
                                  "z_Highly Rural/ Rural" = "Highly Rural / Rural"
                                  )

?dplyr::case_when
train$JOB <- dplyr::case_when(
  train$JOB == "z_Blue Collar"          ~ "Blue Collar",
  is.na(train$JOB) | train$JOB == ""    ~ "Missing Values",
  TRUE ~ as.character(train$JOB)     # Keep other values as they are
)

table(train$JOB)
## 
##    Blue Collar       Clerical         Doctor     Home Maker         Lawyer 
##           1825           1271            246            641            835 
##        Manager Missing Values   Professional        Student 
##            988            526           1117            712

3.1.2 Renaming variables

There are some variables that I would like to rename immediately to something more informative.

?dplyr::rename

train <- train |>
  dplyr::rename(Parent_Single = PARENT1, 
                Urban_City    = URBANICITY
         )

3.1.3 Income, Home Value, Blue Book

In this code, gsub("[\\$,]", "", train$INCOME) is used to remove the dollar sign ($) and commas (,) from the INCOME variable using the gsub() function with a regular expression. The regular expression [\\$,] matches both the dollar sign and the comma, and they are replaced with an empty string "". Then, as.numeric() is used to convert the character values to numeric.

# Remove dollar sign and convert to numeric

train$INCOME <- as.numeric(gsub(pattern     = "[\\$,]",
                                replacement = "", 
                                x           = train$INCOME)
                           )

# Instead of copy-pasting the ocde again for another or more varaibles, you can simply try -

train <- train |>
  dplyr::mutate(
    INCOME      = as.numeric( gsub("[\\$,]",  "",  INCOME))  ,
    HOME_VAL    = as.numeric( gsub("[\\$,]",  "",  HOME_VAL)),
    BLUEBOOK    = as.numeric( gsub("[\\$,]",  "",  BLUEBOOK)),
    OLDCLAIM    = as.numeric( gsub("[\\$,]",  "",  OLDCLAIM))
  )

3.2 Stargazer

We will use stargazer again to create summary statistics table. However, note that stargazer requires integer or numeric data. If we have categorical data, we will have to convert them into either integer or numeric data type. There are 10 variables we will have to treat. Read up on fastDummies syntax.

  • CASE I: If there are binary categorical variables, we can rename them and convert them into dummies and summarize them without any issues. Parent_Single, MSTATUS, SEX, RED_CAR, REVOKED, Urban_City

  • CASE II: If we have categorical variables summing up to many different values, we can still summarize them in with a stargazer table. EDUCATION, JOB, CAR.

clean_train_stargazer_table <- dplyr::select(.data = train,
                                       -c("INDEX")
                                       ) # remove index variable

# Reorder the variables
clean_train_stargazer_table <- dplyr::select(clean_train_stargazer_table, 
                                             -"JOB",
                                             -"CAR_TYPE",
                                             everything()
                                             )

?fastDummies::dummy_cols()
clean_train_stargazer_table <- fastDummies::dummy_cols(.data = clean_train_stargazer_table,
                                     remove_selected_columns = TRUE
                                    )
  

clean_train_stargazer_table <- dplyr::select(.data = clean_train_stargazer_table ,
                                       -c("Parent_Single_No", 
                                          "MSTATUS_No",
                                          "SEX_Male",
                                          "RED_CAR_No",
                                          "REVOKED_No",
                                          "Urban_City_Highly Rural / Rural"
                                          )
                                       ) # remove index variable

  

# Assuming your dataframe is named 'your_data'
clean_train_stargazer_table <- clean_train_stargazer_table |>
  dplyr::select(-starts_with("JOB")) |>  
  dplyr::select(-starts_with("CAR")) 


# drop missing values

naniar::gg_miss_upset(clean_train_stargazer_table)

  clean_train_stargazer_table$JOB       <- train$JOB
  clean_train_stargazer_table$CAR_TYPE  <- train$CAR_TYPE

clean_train_stargazer_table <- na.omit(clean_train_stargazer_table)


stargazer::stargazer(clean_train_stargazer_table[,1:25], type = "text")
## 
## ================================================================================
## Statistic                         N      Mean      St. Dev.    Min       Max    
## --------------------------------------------------------------------------------
## TARGET_FLAG                     7,285    0.265       0.441      0         1     
## TARGET_AMT                      7,285  1,504.017   4,720.400  0.000  107,586.100
## KIDSDRIV                        7,285    0.172       0.515      0         4     
## AGE                             7,285   44.757       8.658    16.000   81.000   
## HOMEKIDS                        7,285    0.726       1.121      0         5     
## YOJ                             7,285   10.509       3.962    0.000    23.000   
## INCOME                          7,285 61,894.360  47,406.000    0      367,030  
## HOME_VAL                        7,285 155,195.200 129,411.500   0      885,282  
## TRAVTIME                        7,285   33.513      15.901      5        142    
## BLUEBOOK                        7,285 15,770.290   8,429.700  1,500    69,740   
## TIF                             7,285    5.361       4.145      1        25     
## OLDCLAIM                        7,285  4,031.242   8,788.665    0      57,037   
## CLM_FREQ                        7,285    0.796       1.162      0         5     
## MVR_PTS                         7,285    1.701       2.154      0        13     
## Parent_Single_Yes               7,285    0.134       0.341      0         1     
## MSTATUS_Yes                     7,285    0.596       0.491      0         1     
## SEX_Female                      7,285    0.536       0.499      0         1     
## EDUCATIONHigh School            7,285    0.146       0.353      0         1     
## EDUCATION_Bachelors             7,285    0.273       0.446      0         1     
## EDUCATION_High School           7,285    0.286       0.452      0         1     
## EDUCATION_Masters               7,285    0.206       0.404      0         1     
## EDUCATION_PhD                   7,285    0.089       0.284      0         1     
## RED_CAR_Yes                     7,285    0.291       0.454      0         1     
## REVOKED_Yes                     7,285    0.122       0.328      0         1     
## Urban_City_Highly Urban / Urban 7,285    0.798       0.402      0         1     
## --------------------------------------------------------------------------------
summarytools::descr(clean_train_stargazer_table)
## Non-numerical variable(s) ignored: JOB, CAR_TYPE
## Descriptive Statistics  
## clean_train_stargazer_table  
## N: 7285  
## 
##                         AGE   BLUEBOOK   CLM_FREQ   EDUCATION_<High School   EDUCATION_Bachelors
## ----------------- --------- ---------- ---------- ------------------------ ---------------------
##              Mean     44.76   15770.29       0.80                     0.15                  0.27
##           Std.Dev      8.66    8429.70       1.16                     0.35                  0.45
##               Min     16.00    1500.00       0.00                     0.00                  0.00
##                Q1     39.00    9400.00       0.00                     0.00                  0.00
##            Median     45.00   14500.00       0.00                     0.00                  0.00
##                Q3     51.00   20890.00       2.00                     0.00                  1.00
##               Max     81.00   69740.00       5.00                     1.00                  1.00
##               MAD      8.90    8421.17       0.00                     0.00                  0.00
##               IQR     12.00   11490.00       2.00                     0.00                  1.00
##                CV      0.19       0.53       1.46                     2.42                  1.63
##          Skewness     -0.03       0.80       1.22                     2.00                  1.02
##       SE.Skewness      0.03       0.03       0.03                     0.03                  0.03
##          Kurtosis     -0.06       0.84       0.31                     2.01                 -0.97
##           N.Valid   7285.00    7285.00    7285.00                  7285.00               7285.00
##         Pct.Valid    100.00     100.00     100.00                   100.00                100.00
## 
## Table: Table continues below
## 
##  
## 
##                     EDUCATION_High School   EDUCATION_Masters   EDUCATION_PhD    HOME_VAL   HOMEKIDS
## ----------------- ----------------------- ------------------- --------------- ----------- ----------
##              Mean                    0.29                0.21            0.09   155195.25       0.73
##           Std.Dev                    0.45                0.40            0.28   129411.54       1.12
##               Min                    0.00                0.00            0.00        0.00       0.00
##                Q1                    0.00                0.00            0.00        0.00       0.00
##            Median                    0.00                0.00            0.00   161774.00       0.00
##                Q3                    1.00                0.00            0.00   239235.00       1.00
##               Max                    1.00                1.00            1.00   885282.00       5.00
##               MAD                    0.00                0.00            0.00   148380.09       0.00
##               IQR                    1.00                0.00            0.00   239235.00       1.00
##                CV                    1.58                1.97            3.21        0.83       1.54
##          Skewness                    0.95                1.46            2.89        0.49       1.33
##       SE.Skewness                    0.03                0.03            0.03        0.03       0.03
##          Kurtosis                   -1.10                0.12            6.37       -0.02       0.61
##           N.Valid                 7285.00             7285.00         7285.00     7285.00    7285.00
##         Pct.Valid                  100.00              100.00          100.00      100.00     100.00
## 
## Table: Table continues below
## 
##  
## 
##                        INCOME   KIDSDRIV   MSTATUS_Yes   MVR_PTS   OLDCLAIM   Parent_Single_Yes
## ----------------- ----------- ---------- ------------- --------- ---------- -------------------
##              Mean    61894.36       0.17          0.60      1.70    4031.24                0.13
##           Std.Dev    47406.00       0.52          0.49      2.15    8788.67                0.34
##               Min        0.00       0.00          0.00      0.00       0.00                0.00
##                Q1    28123.00       0.00          0.00      0.00       0.00                0.00
##            Median    54109.00       0.00          1.00      1.00       0.00                0.00
##                Q3    86211.00       0.00          1.00      3.00    4604.00                0.00
##               Max   367030.00       4.00          1.00     13.00   57037.00                1.00
##               MAD    41865.66       0.00          0.00      1.48       0.00                0.00
##               IQR    58088.00       0.00          1.00      3.00    4604.00                0.00
##                CV        0.77       2.99          0.82      1.27       2.18                2.54
##          Skewness        1.17       3.35         -0.39      1.34       3.12                2.15
##       SE.Skewness        0.03       0.03          0.03      0.03       0.03                0.03
##          Kurtosis        2.09      11.79         -1.85      1.35       9.84                2.62
##           N.Valid     7285.00    7285.00       7285.00   7285.00    7285.00             7285.00
##         Pct.Valid      100.00     100.00        100.00    100.00     100.00              100.00
## 
## Table: Table continues below
## 
##  
## 
##                     RED_CAR_Yes   REVOKED_Yes   SEX_Female   TARGET_AMT   TARGET_FLAG       TIF
## ----------------- ------------- ------------- ------------ ------------ ------------- ---------
##              Mean          0.29          0.12         0.54      1504.02          0.26      5.36
##           Std.Dev          0.45          0.33         0.50      4720.40          0.44      4.14
##               Min          0.00          0.00         0.00         0.00          0.00      1.00
##                Q1          0.00          0.00         0.00         0.00          0.00      1.00
##            Median          0.00          0.00         1.00         0.00          0.00      4.00
##                Q3          1.00          0.00         1.00      1069.00          1.00      7.00
##               Max          1.00          1.00         1.00    107586.14          1.00     25.00
##               MAD          0.00          0.00         0.00         0.00          0.00      4.45
##               IQR          1.00          0.00         1.00      1069.00          1.00      6.00
##                CV          1.56          2.68         0.93         3.14          1.67      0.77
##          Skewness          0.92          2.30        -0.14         8.93          1.07      0.89
##       SE.Skewness          0.03          0.03         0.03         0.03          0.03      0.03
##          Kurtosis         -1.15          3.30        -1.98       118.01         -0.87      0.42
##           N.Valid       7285.00       7285.00      7285.00      7285.00       7285.00   7285.00
##         Pct.Valid        100.00        100.00       100.00       100.00        100.00    100.00
## 
## Table: Table continues below
## 
##  
## 
##                     TRAVTIME   Urban_City_Highly Urban / Urban       YOJ
## ----------------- ---------- --------------------------------- ---------
##              Mean      33.51                              0.80     10.51
##           Std.Dev      15.90                              0.40      3.96
##               Min       5.00                              0.00      0.00
##                Q1      23.00                              1.00      9.00
##            Median      33.00                              1.00     11.00
##                Q3      44.00                              1.00     13.00
##               Max     142.00                              1.00     23.00
##               MAD      16.31                              0.00      2.97
##               IQR      21.00                              0.00      4.00
##                CV       0.47                              0.50      0.38
##          Skewness       0.44                             -1.48     -1.23
##       SE.Skewness       0.03                              0.03      0.03
##          Kurtosis       0.63                              0.20      1.44
##           N.Valid    7285.00                           7285.00   7285.00
##         Pct.Valid     100.00                            100.00    100.00

Of course, I am expecting you to clean up the table more - like label the variables, draw some inferences. You can hide some commands like summarytools::descr(clean_train_stargazer_table) but use the skewness/kurtosis/IQR calculations to describe certain variables. In fact, I would visualize the raw data too - might be easier to see a few things there!

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
df7 <- reshape2::melt(data = clean_train_stargazer_table)
## Using JOB, CAR_TYPE as id variables
ggplot(data = df7, mapping = aes(x = ))

ggplot(df7, aes(x = value)) + 
  geom_histogram() + 
  facet_wrap(~variable, scales = "free_x")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3.3 Graphing Categorical Variables

Car Types and Jobs are broken down into too many dummies - so I do not want to put then in the summary statistics tables. Instead, I would prefer to have them in dedicated charts below.

3.3.1 Car Types

You can try to create conditional charts (if the car crash happened).

# Create a table of frequencies
freq_table_CAR_TYPE <- table(clean_train_stargazer_table$CAR_TYPE)

freq_table_CAR_TYPE
## 
##     Minivan Panel Truck      Pickup  Sports Car         SUV         Van 
##        1936         603        1240         813        2038         655
# Convert the table to a data frame
df_CAR_TYPE <- data.frame(Value     = names(freq_table_CAR_TYPE),
                          Frequency = freq_table_CAR_TYPE
                          )

df_CAR_TYPE
##         Value Frequency.Var1 Frequency.Freq
## 1     Minivan        Minivan           1936
## 2 Panel Truck    Panel Truck            603
## 3      Pickup         Pickup           1240
## 4  Sports Car     Sports Car            813
## 5         SUV            SUV           2038
## 6         Van            Van            655
# Create a vector of 6 distinct colors
num_colors <- length(df_CAR_TYPE$Value)
my_colors <- rainbow(num_colors)


# Assign the colors to the 'ColorVar' variable
ColorVar <- factor(x      = 1:num_colors, 
                   levels = 1:num_colors,
                   labels = my_colors
                   )



# ACTUAL PLOT - I want to sort the columns by height and color them differently
ggplot(data = df_CAR_TYPE, aes(x    = reorder(x  =  Value, 
                                         X  =  -Frequency.Freq), 
                          y    = Frequency.Freq, 
                          fill = ColorVar
                          )
       ) +
  geom_bar(stat         = "identity",
           show.legend  = FALSE
           ) +
  labs(title = "Bar Chart of Occupations", 
           x = "",
           y = "Frequency"
       ) +
  theme(axis.text.x  = element_text(angle = 45,   # fit the labels 
                                    hjust = 1
                                    )
        ) 

3.3.2 Jobs

You can try to create conditional charts (if the car crash happened).

# Create a table of frequencies
freq_table_JOB <- table(clean_train_stargazer_table$JOB)

freq_table_JOB
## 
##    Blue Collar       Clerical         Doctor     Home Maker         Lawyer 
##           1630           1135            222            565            763 
##        Manager Missing Values   Professional        Student 
##            887            466           1000            617
# Convert the table to a data frame
df_JOB <- data.frame(Value     = names(freq_table_JOB),
                     Frequency = freq_table_JOB
                     )

df_JOB
##            Value Frequency.Var1 Frequency.Freq
## 1    Blue Collar    Blue Collar           1630
## 2       Clerical       Clerical           1135
## 3         Doctor         Doctor            222
## 4     Home Maker     Home Maker            565
## 5         Lawyer         Lawyer            763
## 6        Manager        Manager            887
## 7 Missing Values Missing Values            466
## 8   Professional   Professional           1000
## 9        Student        Student            617
# Create a vector of 9 distinct colors
num_colors <- 9
my_colors <- rainbow(num_colors)


# Assign the colors to the 'ColorVar' variable
ColorVar <- factor(x      = 1:num_colors, 
                   levels = 1:num_colors,
                   labels = my_colors
                   )



# ACTUAL PLOT - I want to sort the columns by height and color them differently
ggplot(data = df_JOB, aes(x    = reorder(x  =  Value, 
                                         X  =  -Frequency.Freq), 
                          y    = Frequency.Freq, 
                          fill = ColorVar
                          )
       ) +
  geom_bar(stat         = "identity",
           show.legend  = FALSE
           ) +
  labs(title = "Bar Chart of Occupations", 
           x = "",
           y = "Frequency"
       ) +
  theme(axis.text.x  = element_text(angle = 45,   # fit the labels 
                                    hjust = 1
                                    )
        ) 

4 Visualization

It would be a good idea to plot the distribution of all the variables like you did in HW1 - melt the data into long format and use ggplot2 to see the raw data. In this assignment, it would be infomrative to split the charts above by crashed or not crashed too. Run basic regression models and quickly see what trends you are getting - then you can create the charts to support the story from the regressions by decomposing the raw data and verifying why you find the results.

5 MODELS

  1. See basic multivariate regression. https://rpubs.com/sharmaar/HW1_solution

  2. Then run some logstic and/or probit regression in R. https://rpubs.com/sharmaar/Logistic_regression_implementation

    • Make sure to explain the coefficients - else you will not get full marks.
  3. Try some lasso/ridge for bonus points. https://rpubs.com/sharmaar/Lasso_Ridge

  • Present regressions in a table with stargazer package - make sure the table readable and nicely edited.

6 MODEL SELECTION

  1. Talk about the basic metrics like adjusted R squared, RMSE, and tell us which is your preferred model..

  2. Predict the outcome variable.

  3. Create a confusion matrix through caret package. Set the event of interest as a car crash.

Note that you should get the same answers for metrics like sensitivity, specificity, accuracy if you construct them by hand or by using the package.

https://rpubs.com/sharmaar/ConfusionMatrix