Useful code for Homework 2

This file is still written for a coder/analyst, and you will have to polish it (change text and hide chunks) to make it an effective client-ready report. Play around with code chunk options like suppressing warnings or output.

Setup

Working Directory and Data

Clear variables and functions from the environment tab/window, set the working directory, and load the training/testing data.

# Clear the workspace
rm(list = ls()) # Clear environment
gc()            # Clear unused memory

          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  597615 32.0    1357351 72.5         NA   700240 37.4
Vcells 1106754  8.5    8388608 64.0      49152  1963592 15.0

cat("\f")       # Clear the console

graphics.off()  # Clear all graphs


# Set working directory and path to data
setwd("/Users/arvindsharma/Dropbox/WCAS/Econometrics/")

Load packages

Now, I will load the packages.

# Prepare needed libraries

packages <- c("readr",        # importing data, 
              "psych",        # quick summary stats for data exploration,
              "mice",         # for imputation of missing values and vis of missing data,
              "stargazer",    # summary stats,
              "vtable",       # summary stats,
              "summarytools", # summary stats,
              "naniar",       # for visualisation of missing data,
              "visdat",       # for visualisation of missing data,
              "VIM",          # for visualisation of missing data,
              "DataExplorer", # for visualisation of missing data,
              "tidyverse",    # data manipulation like selecting variables,
              "fastDummies",  # Create dummy variables using fastDummies,
              "corrplot",     # correlation plots,
              "ggplot2",      # graphing,
              "data.table",   # reshape for graphing, 
              "car"           # vif for multicollinearity
              )

for (pkg in packages) {
  # Install the package only if it is not already available
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg,
                     repos        = "https://cran.rstudio.com/",
                     dependencies = TRUE
                     )
  }
  library(pkg, character.only = TRUE)
}

rm(packages)

df_train    <- read.csv("insurance-training-data2.csv")
df_test     <- read.csv("insurance-testing-data2.csv")

Exploratory Data Analysis (EDA)

Be sure to talk about some insights from your summary statistics tables and graphic visualizations.

A good practice is to tell the reader what you are planning to do in the section right at the beginning/top.

I will first check for missing values and impute them, as only 2 variables have about 6% missing values each. While I could impute with the median, or even just drop all rows with any missing observation, I will impute the mean with the mice package.
After I have imputed the mean, I will create my summary statistics table and visualize my data through charts.

Throughout the analysis, I will be describing what I am doing with text and also be commenting my code. Also, make sure to align your code, i.e. place = or " below each other; use spaces if you have to.

Missing Data

YOJ and CAR_AGE have about $6\%$ missing values. AGE has some missing values too.
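Before plotting, it can help to put numbers on the missingness. The sketch below uses a small made-up data frame (`toy` is a stand-in, not the homework data) and base R only; in the report you would pass df_train instead.

```r
# Toy data frame standing in for df_train (values are made up)
toy <- data.frame(
  AGE     = c(45, NA, 38, 51, 29),
  YOJ     = c(11,  9, NA, NA, 14),
  CAR_AGE = c( 1,  6, 10, NA,  3)
)

# is.na() gives a TRUE/FALSE matrix; column means are the missing shares
pct_missing <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)

# Percentage of missing values per variable, largest first
round(pct_missing, 1)
```

Variables with a share of zero can be filtered out with pct_missing[pct_missing > 0] to keep the table short.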

You can use any of the packages below. I personally prefer Amelia and naniar as they work on big data too, but I also like visdat.

?naniar

naniar::gg_miss_var(df_train)

naniar::gg_miss_upset(df_train)

naniar::vis_miss(df_train)

# visdat::vis_miss(df_train)       ## same output, but works on small datasets only

?Amelia::missmap
Amelia::missmap(obj = df_train)

# VIM::aggr(df_train)
VIM::matrixplot(df_train)


Click in a column to sort by the corresponding variable.
To regain use of the VIM GUI and the R console, click outside the plot region.

DataExplorer::plot_missing(df_train)

Impute Missing Values

The mice package (Multivariate Imputations by Chained Equations) implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation. Many diagnostic plots are implemented to inspect the quality of the imputations.

First, we will create an imputed_model using the mice() function, and then fill in the missing values using the complete() function to obtain the imputed data frame train. This data frame will contain the original data frame df_train with missing values replaced by imputed values.

?mice

imputed_model <-
mice::mice(data   = df_train, 
     m      = 1,         # default value is 5
     method = "mean",    # univariate imputation method - can play with
     seed   = 7          # integer argument for offsetting the random number generator
     )


 iter imp variable
  1   1  AGE  YOJ  CAR_AGE
  2   1  AGE  YOJ  CAR_AGE
  3   1  AGE  YOJ  CAR_AGE
  4   1  AGE  YOJ  CAR_AGE
  5   1  AGE  YOJ  CAR_AGE

Warning: Number of logged events: 14

?complete

train <- mice::complete(data = imputed_model)

Keep in mind that the mice package is designed for multivariate imputation that takes into account relationships between variables, but method = "mean" as used above simply fills each gap with the observed mean of that variable and ignores the other variables. It’s essential to consider the assumptions and limitations of the imputation method and perform appropriate validation and analysis to ensure the imputed values are reasonable and suitable for your analysis.
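To see what mean imputation does to a single variable, here is a base-R sketch on made-up values. It illustrates the idea only and is not how mice is implemented internally; `toy` and `YOJ_imputed` are hypothetical names.

```r
# Made-up years-on-job values with two gaps
toy <- data.frame(YOJ = c(11, 9, NA, 14, NA))

# Mean of the observed values only
observed_mean <- mean(toy$YOJ, na.rm = TRUE)

# Replace each NA with that mean, keep observed values as they are
toy$YOJ_imputed <- ifelse(is.na(toy$YOJ), observed_mean, toy$YOJ)

# The mean is preserved, but the spread shrinks after imputation
sd_before <- sd(toy$YOJ, na.rm = TRUE)
sd_after  <- sd(toy$YOJ_imputed)
c(before = sd_before, after = sd_after)
```

The drop in standard deviation is one reason to compare the distribution before and after imputation when a variable has many gaps.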

As the plot below shows, there are no missing (NA) values left in the training data.

?vis_miss

naniar::vis_miss(train)

Summary Statistics

`sumtable`

We will use the sumtable function from the vtable package to explore the raw data. The nice feature of the command is that if we have categorical data, we will quickly be able to see the different categories in them.
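A minimal sketch of such a call is shown here on the built-in iris data so that it runs on its own; in the report you would pass train instead. The choice of out = "return" is an assumption made for illustration: it hands the table back as a data frame rather than opening it in the viewer.

```r
library(vtable)

# Summary table for a built-in data set; swap iris for train in the report
summary_tbl <- vtable::sumtable(iris, out = "return")

# Numeric variables get N/mean/sd/etc.; the factor Species is listed by category
summary_tbl
```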

Re-coding data values

There are some spelling mistakes that I would like to fix immediately - as they will have implications in default graph labels later on.

# Change spellings
?recode

train$SEX        <- dplyr::recode(train$SEX,
                                  "M"   = "Male",
                                  "z_F"  = "Female"
                                  )

train$CAR_TYPE   <- dplyr::recode(train$CAR_TYPE,
                                  "z_SUV"   = "SUV"
                                 )

train$EDUCATION  <- dplyr::recode(train$EDUCATION,
                                  "z_High School"   = "High School"
                                 ) 

train$MSTATUS    <- dplyr::recode(train$MSTATUS,
                                  "z_No" = "No"
                                  )

train$RED_CAR    <- dplyr::recode(train$RED_CAR,
                                  "yes"  = "Yes",
                                  "no"   = "No"
                                  )

train$URBANICITY <- dplyr::recode(train$URBANICITY,
                                  "Highly Urban/ Urban"   = "Highly Urban / Urban",
                                  "z_Highly Rural/ Rural" = "Highly Rural / Rural"
                                  )

?dplyr::case_when
train$JOB <- dplyr::case_when(
  train$JOB == "z_Blue Collar"          ~ "Blue Collar",
  is.na(train$JOB) | train$JOB == ""    ~ "Missing Values",
  TRUE ~ as.character(train$JOB)     # Keep other values as they are
)

table(train$JOB)


   Blue Collar       Clerical         Doctor     Home Maker         Lawyer 
          1452           1054            190            512            688 
       Manager Missing Values   Professional        Student 
           758            419            887            568

Renaming variables

There are some variables that I would like to rename immediately to something more informative.

?dplyr::rename

train <- train |>
  dplyr::rename(Parent_Single = PARENT1, 
                Urban_City    = URBANICITY
         )

Income, Home Value, Blue Book

In this code, gsub("[\\$,]", "", train$INCOME) is used to remove the dollar sign ($) and commas (,) from the INCOME variable using the gsub() function with a regular expression. The regular expression [\\$,] matches both the dollar sign and the comma, and they are replaced with an empty string "". Then, as.numeric() is used to convert the character values to numeric.

# Remove dollar sign and convert to numeric

train$INCOME <- as.numeric(gsub(pattern     = "[\\$,]",
                                replacement = "", 
                                x           = train$INCOME)
                           )

# Instead of copy-pasting the code again for one or more variables, you can simply try -

train <- train |>
  dplyr::mutate(
    INCOME      = as.numeric( gsub("[\\$,]",  "",  INCOME))  ,
    HOME_VAL    = as.numeric( gsub("[\\$,]",  "",  HOME_VAL)),
    BLUEBOOK    = as.numeric( gsub("[\\$,]",  "",  BLUEBOOK)),
    OLDCLAIM    = as.numeric( gsub("[\\$,]",  "",  OLDCLAIM))
  )
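A small worked example on made-up strings shows what the pattern does to individual values. Note the last element: a blank entry becomes NA after as.numeric(), which may be one reason rows are later lost to na.omit() even after imputation.

```r
# Made-up currency strings, including a blank entry
raw_values <- c("$67,349", "$0", "")

# Strip dollar signs and commas, then convert to numeric
clean_values <- as.numeric(gsub(pattern     = "[\\$,]",
                                replacement = "",
                                x           = raw_values)
                           )

clean_values        # the blank string is converted to NA
is.na(clean_values)
```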

Stargazer

We will use stargazer again to create summary statistics table. However, note that stargazer requires integer or numeric data. If we have categorical data, we will have to convert them into either integer or numeric data type. There are 10 variables we will have to treat. Read up on fastDummies syntax.

CASE I: If there are binary categorical variables, we can rename them, convert them into dummies, and summarize them without any issues: Parent_Single, MSTATUS, SEX, RED_CAR, REVOKED, Urban_City.
CASE II: If we have categorical variables with many different categories, we can still summarize them in a stargazer table: EDUCATION, JOB, CAR_TYPE.

clean_train_stargazer_table <- dplyr::select(.data = train,
                                       -c("INDEX")
                                       ) # remove index variable

# Reorder the variables
clean_train_stargazer_table <- dplyr::select(clean_train_stargazer_table, 
                                             -"JOB",
                                             -"CAR_TYPE",
                                             everything()
                                             )

?fastDummies::dummy_cols()
clean_train_stargazer_table <- fastDummies::dummy_cols(.data = clean_train_stargazer_table,
                                     remove_selected_columns = TRUE
                                    )
  

clean_train_stargazer_table <- dplyr::select(.data = clean_train_stargazer_table ,
                                       -c("Parent_Single_No", 
                                          "MSTATUS_No",
                                          "SEX_Male",
                                          "RED_CAR_No",
                                          "REVOKED_No",
                                          "Urban_City_Highly Rural / Rural"
                                          )
                                       ) # drop the reference-category dummies

  

# Drop the JOB and CAR_TYPE dummies; note that starts_with("CAR") also matches CAR_AGE and any other column beginning with CAR
clean_train_stargazer_table <- clean_train_stargazer_table |>
  dplyr::select(-starts_with("JOB")) |>  
  dplyr::select(-starts_with("CAR")) 


# drop missing values

naniar::gg_miss_upset(clean_train_stargazer_table)

  clean_train_stargazer_table$JOB       <- train$JOB
  clean_train_stargazer_table$CAR_TYPE  <- train$CAR_TYPE

clean_train_stargazer_table <- na.omit(clean_train_stargazer_table)


stargazer::stargazer(clean_train_stargazer_table[,1:25], type = "text")


================================================================================
Statistic                         N      Mean      St. Dev.    Min       Max    
--------------------------------------------------------------------------------
TARGET_FLAG                     5,833    0.265       0.441      0         1     
TARGET_AMT                      5,833  1,462.450   4,540.275  0.000  107,586.100
KIDSDRIV                        5,833    0.172       0.512      0         4     
AGE                             5,833   44.796       8.675    16.000   81.000   
HOMEKIDS                        5,833    0.723       1.109      0         5     
YOJ                             5,833   10.521       3.966    0.000    23.000   
INCOME                          5,833 61,741.410  47,584.860    0      367,030  
HOME_VAL                        5,833 154,565.100 129,163.300   0      885,282  
TRAVTIME                        5,833   33.445      15.933      5        142    
BLUEBOOK                        5,833 15,694.750   8,398.463  1,500    69,740   
TIF                             5,833    5.371       4.147      1        25     
OLDCLAIM                        5,833  4,136.410   8,951.677    0      57,037   
CLM_FREQ                        5,833    0.801       1.165      0         5     
MVR_PTS                         5,833    1.712       2.162      0        13     
Parent_Single_Yes               5,833    0.133       0.340      0         1     
MSTATUS_Yes                     5,833    0.596       0.491      0         1     
SEX_Female                      5,833    0.534       0.499      0         1     
EDUCATION_<High School          5,833    0.147       0.354      0         1     
EDUCATION_Bachelors             5,833    0.272       0.445      0         1     
EDUCATION_High School           5,833    0.285       0.451      0         1     
EDUCATION_Masters               5,833    0.207       0.405      0         1     
EDUCATION_PhD                   5,833    0.089       0.285      0         1     
RED_CAR_Yes                     5,833    0.293       0.455      0         1     
REVOKED_Yes                     5,833    0.127       0.333      0         1     
Urban_City_Highly Urban / Urban 5,833    0.797       0.403      0         1     
--------------------------------------------------------------------------------

Of course, I am expecting you to clean up the table more, e.g. label the variables and draw some inferences.
You can hide some commands like summarytools::descr(clean_train_stargazer_table) but use the skewness/kurtosis/IQR calculations to describe certain variables. You might have to download X11 from https://www.xquartz.org/ if using a Mac.
- Worry about skewness and/or kurtosis only if it is a major issue. For skewness, if the value is:
  1. between -0.5 and 0.5, the distribution is almost symmetrical.
  2. between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the skewness is moderate.
  3. lower than -1 (negatively skewed) or greater than 1 (positively skewed), the data is highly skewed. I might consider taking the log of the variable in this case and checking the skewness of the new distribution.
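The rule of thumb above can be sketched in base R as follows. The `skewness()` and `skew_label()` helpers are hypothetical names written for this illustration; the formula is the simple moment-based one, so its values may differ slightly from what summarytools reports. The simulated `income` vector is made up.

```r
# Moment-based sample skewness (assumption: no small-sample correction)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Translate a skewness value into the three bands described above
skew_label <- function(s) {
  if (abs(s) <= 0.5) {
    "approximately symmetric"
  } else if (abs(s) <= 1) {
    "moderately skewed"
  } else {
    "highly skewed"
  }
}

# Simulated right-skewed variable (strictly positive, so log() is safe)
set.seed(7)
income <- rlnorm(1000, meanlog = 10, sdlog = 1)

skew_label(skewness(income))       # raw variable
skew_label(skewness(log(income)))  # after the log transform
```

For variables that contain zeros, such as INCOME or HOME_VAL in the table above, log() returns -Inf, so log1p() is a common alternative.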

summarytools::descr(clean_train_stargazer_table)

Non-numerical variable(s) ignored: JOB, CAR_TYPE

Descriptive Statistics  
clean_train_stargazer_table  
N: 5833  

                        AGE   BLUEBOOK   CLM_FREQ   EDUCATION_<High School   EDUCATION_Bachelors
----------------- --------- ---------- ---------- ------------------------ ---------------------
             Mean     44.80   15694.75       0.80                     0.15                  0.27
          Std.Dev      8.68    8398.46       1.17                     0.35                  0.44
              Min     16.00    1500.00       0.00                     0.00                  0.00
               Q1     39.00    9360.00       0.00                     0.00                  0.00
           Median     45.00   14410.00       0.00                     0.00                  0.00
               Q3     51.00   20770.00       2.00                     0.00                  1.00
              Max     81.00   69740.00       5.00                     1.00                  1.00
              MAD      8.90    8347.04       0.00                     0.00                  0.00
              IQR     12.00   11410.00       2.00                     0.00                  1.00
               CV      0.19       0.54       1.45                     2.41                  1.64
         Skewness     -0.04       0.83       1.22                     1.99                  1.03
      SE.Skewness      0.03       0.03       0.03                     0.03                  0.03
         Kurtosis     -0.04       0.99       0.31                     1.96                 -0.95
          N.Valid   5833.00    5833.00    5833.00                  5833.00               5833.00
        Pct.Valid    100.00     100.00     100.00                   100.00                100.00

Table: Table continues below

 

                    EDUCATION_High School   EDUCATION_Masters   EDUCATION_PhD    HOME_VAL   HOMEKIDS
----------------- ----------------------- ------------------- --------------- ----------- ----------
             Mean                    0.28                0.21            0.09   154565.13       0.72
          Std.Dev                    0.45                0.41            0.28   129163.32       1.11
              Min                    0.00                0.00            0.00        0.00       0.00
               Q1                    0.00                0.00            0.00        0.00       0.00
           Median                    0.00                0.00            0.00   161160.00       0.00
               Q3                    1.00                0.00            0.00   238724.00       1.00
              Max                    1.00                1.00            1.00   885282.00       5.00
              MAD                    0.00                0.00            0.00   148580.24       0.00
              IQR                    1.00                0.00            0.00   238724.00       1.00
               CV                    1.58                1.96            3.20        0.84       1.53
         Skewness                    0.95                1.45            2.88        0.50       1.30
      SE.Skewness                    0.03                0.03            0.03        0.03       0.03
         Kurtosis                   -1.09                0.09            6.31        0.03       0.50
          N.Valid                 5833.00             5833.00         5833.00     5833.00    5833.00
        Pct.Valid                  100.00              100.00          100.00      100.00     100.00

Table: Table continues below

 

                       INCOME   KIDSDRIV   MSTATUS_Yes   MVR_PTS   OLDCLAIM   Parent_Single_Yes
----------------- ----------- ---------- ------------- --------- ---------- -------------------
             Mean    61741.41       0.17          0.60      1.71    4136.41                0.13
          Std.Dev    47584.86       0.51          0.49      2.16    8951.68                0.34
              Min        0.00       0.00          0.00      0.00       0.00                0.00
               Q1    27803.00       0.00          0.00      0.00       0.00                0.00
           Median    53358.00       0.00          1.00      1.00       0.00                0.00
               Q3    85837.00       0.00          1.00      3.00    4676.00                0.00
              Max   367030.00       4.00          1.00     13.00   57037.00                1.00
              MAD    41996.13       0.00          0.00      1.48       0.00                0.00
              IQR    58034.00       0.00          1.00      3.00    4676.00                0.00
               CV        0.77       2.98          0.82      1.26       2.16                2.55
         Skewness        1.21       3.30         -0.39      1.33       3.05                2.16
      SE.Skewness        0.03       0.03          0.03      0.03       0.03                0.03
         Kurtosis        2.26      11.25         -1.85      1.29       9.33                2.65
          N.Valid     5833.00    5833.00       5833.00   5833.00    5833.00             5833.00
        Pct.Valid      100.00     100.00        100.00    100.00     100.00              100.00

Table: Table continues below

 

                    RED_CAR_Yes   REVOKED_Yes   SEX_Female   TARGET_AMT   TARGET_FLAG       TIF
----------------- ------------- ------------- ------------ ------------ ------------- ---------
             Mean          0.29          0.13         0.53      1462.45          0.27      5.37
          Std.Dev          0.46          0.33         0.50      4540.28          0.44      4.15
              Min          0.00          0.00         0.00         0.00          0.00      1.00
               Q1          0.00          0.00         0.00         0.00          0.00      1.00
           Median          0.00          0.00         1.00         0.00          0.00      4.00
               Q3          1.00          0.00         1.00      1101.93          1.00      7.00
              Max          1.00          1.00         1.00    107586.14          1.00     25.00
              MAD          0.00          0.00         0.00         0.00          0.00      4.45
              IQR          1.00          0.00         1.00      1101.93          1.00      6.00
               CV          1.55          2.62         0.93         3.10          1.66      0.77
         Skewness          0.91          2.24        -0.14         9.41          1.06      0.89
      SE.Skewness          0.03          0.03         0.03         0.03          0.03      0.03
         Kurtosis         -1.17          3.00        -1.98       136.68         -0.87      0.44
          N.Valid       5833.00       5833.00      5833.00      5833.00       5833.00   5833.00
        Pct.Valid        100.00        100.00       100.00       100.00        100.00    100.00

Table: Table continues below

 

                    TRAVTIME   Urban_City_Highly Urban / Urban       YOJ
----------------- ---------- --------------------------------- ---------
             Mean      33.45                              0.80     10.52
          Std.Dev      15.93                              0.40      3.97
              Min       5.00                              0.00      0.00
               Q1      23.00                              1.00      9.00
           Median      33.00                              1.00     11.00
               Q3      44.00                              1.00     13.00
              Max     142.00                              1.00     23.00
              MAD      16.31                              0.00      2.97
              IQR      21.00                              0.00      4.00
               CV       0.48                              0.51      0.38
         Skewness       0.46                             -1.47     -1.23
      SE.Skewness       0.03                              0.03      0.03
         Kurtosis       0.70                              0.17      1.43
          N.Valid    5833.00                           5833.00   5833.00
        Pct.Valid     100.00                            100.00    100.00

In fact, I would visualize the raw data too, as it might be easier to see a few things there! Make sure your labels are not cut off.

library(ggplot2)

df7 <- reshape2::melt(data = clean_train_stargazer_table)

Using JOB, CAR_TYPE as id variables

ggplot(df7, aes(x = value)) + 
  geom_histogram() + 
  facet_wrap(~variable, scales = "free_x")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Graphing Categorical Variables

Car Types and Jobs are broken down into too many dummies, so I do not want to put them in the summary statistics tables. Instead, I would prefer to have them in dedicated charts below.

Car Types

You can try to create conditional charts (if the car crash happened).

# Create a table of frequencies
freq_table_CAR_TYPE <- table(clean_train_stargazer_table$CAR_TYPE)

freq_table_CAR_TYPE


    Minivan Panel Truck      Pickup  Sports Car         SUV         Van 
       1582         468         993         634        1634         522

# Convert the table to a data frame
df_CAR_TYPE <- data.frame(Value     = names(freq_table_CAR_TYPE),
                          Frequency = freq_table_CAR_TYPE
                          )

df_CAR_TYPE

        Value Frequency.Var1 Frequency.Freq
1     Minivan        Minivan           1582
2 Panel Truck    Panel Truck            468
3      Pickup         Pickup            993
4  Sports Car     Sports Car            634
5         SUV            SUV           1634
6         Van            Van            522

# Create a vector of 6 distinct colors
num_colors <- length(df_CAR_TYPE$Value)
my_colors <- rainbow(num_colors)


# Assign the colors to the 'ColorVar' variable
ColorVar <- factor(x      = 1:num_colors, 
                   levels = 1:num_colors,
                   labels = my_colors
                   )



# ACTUAL PLOT - I want to sort the columns by height and color them differently
ggplot(data = df_CAR_TYPE, aes(x    = reorder(x  =  Value, 
                                         X  =  -Frequency.Freq), 
                          y    = Frequency.Freq, 
                          fill = ColorVar
                          )
       ) +
  geom_bar(stat         = "identity",
           show.legend  = FALSE
           ) +
  labs(title = "Bar Chart of Car Types", 
           x = "",
           y = "Frequency"
       ) +
  theme(axis.text.x  = element_text(angle = 45,   # fit the labels 
                                    hjust = 1
                                    )
        )

Jobs

You can try to create conditional charts (if the car crash happened).

# Create a table of frequencies
freq_table_JOB <- table(clean_train_stargazer_table$JOB)

freq_table_JOB


   Blue Collar       Clerical         Doctor     Home Maker         Lawyer 
          1302            933            172            452            635 
       Manager Missing Values   Professional        Student 
           674            378            798            489

# Convert the table to a data frame
df_JOB <- data.frame(Value     = names(freq_table_JOB),
                     Frequency = freq_table_JOB
                     )

df_JOB

           Value Frequency.Var1 Frequency.Freq
1    Blue Collar    Blue Collar           1302
2       Clerical       Clerical            933
3         Doctor         Doctor            172
4     Home Maker     Home Maker            452
5         Lawyer         Lawyer            635
6        Manager        Manager            674
7 Missing Values Missing Values            378
8   Professional   Professional            798
9        Student        Student            489

# Create a vector of 9 distinct colors
num_colors <- 9
my_colors <- rainbow(num_colors)


# Assign the colors to the 'ColorVar' variable
ColorVar <- factor(x      = 1:num_colors, 
                   levels = 1:num_colors,
                   labels = my_colors
                   )



# ACTUAL PLOT - I want to sort the columns by height and color them differently
ggplot(data = df_JOB, aes(x    = reorder(x  =  Value, 
                                         X  =  -Frequency.Freq), 
                          y    = Frequency.Freq, 
                          fill = ColorVar
                          )
       ) +
  geom_bar(stat         = "identity",
           show.legend  = FALSE
           ) +
  labs(title = "Bar Chart of Occupations", 
           x = "",
           y = "Frequency"
       ) +
  theme(axis.text.x  = element_text(angle = 45,   # fit the labels 
                                    hjust = 1
                                    )
        )

Visualization

It would be a good idea to plot the distribution of all the variables like you did in HW1 - melt the data into long format and use ggplot2 to see the raw data. In this assignment, it would be informative to split the charts above by crashed or not crashed too. Run basic regression models and quickly see what trends you are getting - then you can create the charts to support the story from the regressions by decomposing the raw data and verifying why you find the results.
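One way to split a chart by crash status is to turn TARGET_FLAG into a labelled factor and facet on it. The sketch below runs on simulated data (`toy` and its values are made up, and `Crash` is a hypothetical helper column); with the real data you would start from train.

```r
library(ggplot2)

# Simulated stand-in for the training data
set.seed(7)
toy <- data.frame(
  TARGET_FLAG = rbinom(200, size = 1, prob = 0.27),
  TRAVTIME    = pmax(5, rnorm(200, mean = 33, sd = 16))
)

# Readable facet labels instead of 0/1
toy$Crash <- factor(toy$TARGET_FLAG,
                    levels = c(0, 1),
                    labels = c("No crash", "Crash")
                    )

# One histogram panel per crash status
crash_plot <- ggplot(toy, aes(x = TRAVTIME)) +
  geom_histogram(bins = 30) +
  facet_wrap(~Crash) +
  labs(title = "Travel Time by Crash Status",
           x = "Travel time",
           y = "Frequency"
       )

crash_plot
```

Setting bins explicitly also avoids the stat_bin() message about the default of 30 bins.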

MODELS

See basic multivariate regression from Homework 1 solution.
Then run some logistic and/or probit regression in R. https://rpubs.com/sharmaar/Logistic_regression_implementation
- Make sure to explain the coefficients - else you will not get full marks.
Try some lasso/ridge for bonus points. https://rpubs.com/sharmaar/Lasso_Ridge
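As a starting point for the logistic model and its interpretation, here is a hedged sketch on simulated data. The data-generating coefficients are invented, and only two predictors are used; the point is the mechanics of fitting with glm() and reading coefficients as odds ratios.

```r
# Simulated stand-in for the training data (coefficients are made up)
set.seed(7)
n   <- 500
toy <- data.frame(
  MVR_PTS  = rpois(n, lambda = 1.7),
  TRAVTIME = rnorm(n, mean = 33, sd = 16)
)
p_crash         <- plogis(-2 + 0.3 * toy$MVR_PTS + 0.01 * toy$TRAVTIME)
toy$TARGET_FLAG <- rbinom(n, size = 1, prob = p_crash)

# Logistic regression; use binomial(link = "probit") for a probit model
logit_fit <- glm(TARGET_FLAG ~ MVR_PTS + TRAVTIME,
                 data   = toy,
                 family = binomial(link = "logit")
                 )

# Coefficients are changes in log-odds; exponentiate to get odds ratios
odds_ratios <- exp(coef(logit_fit))
round(odds_ratios, 3)
```

An odds ratio above 1 for a predictor means that a one-unit increase multiplies the odds of a crash by that factor, holding the other predictors fixed.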

Present regressions in a table with the stargazer package, and make sure the table is readable and nicely edited.

MODEL SELECTION

Talk about the basic metrics like adjusted R squared and RMSE, and tell us which is your preferred model.
Predict the outcome variable.
Create a confusion matrix through caret package. Set the event of interest as a car crash.
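For the linear models, adjusted R squared and RMSE can be collected side by side as sketched below. The data are simulated and the `rmse()` helper is a hypothetical name; the RMSE here is in-sample, so a held-out comparison would be a stricter check.

```r
# Simulated data: y depends on x1 only, x2 is noise
set.seed(7)
n     <- 200
toy   <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)

fit_small <- lm(y ~ x1,      data = toy)
fit_large <- lm(y ~ x1 + x2, data = toy)

# In-sample root mean squared error of a fitted model
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

# One row per model for easy comparison
metrics <- data.frame(
  model  = c("small", "large"),
  adj_r2 = c(summary(fit_small)$adj.r.squared,
             summary(fit_large)$adj.r.squared),
  rmse   = c(rmse(fit_small), rmse(fit_large))
)
metrics
```

In-sample RMSE never gets worse when a variable is added, which is why adjusted R squared, with its penalty for extra terms, is the more useful of the two for comparing nested models.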

Note that you should get the same answers for metrics like sensitivity, specificity, accuracy if you construct them by hand or by using the package.
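The by-hand calculation can be sketched in base R as follows, with a crash (1) treated as the event of interest. The two vectors are made up for illustration; with caret, the corresponding choice is made through the positive argument of confusionMatrix().

```r
# Made-up actual outcomes and predicted classes (1 = crash)
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0, 0, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 0)

# Rows are predictions, columns are actual outcomes
cm <- table(Predicted = predicted, Actual = actual)
cm

TP <- cm["1", "1"]  # predicted crash, actual crash
FN <- cm["0", "1"]  # predicted no crash, actual crash
FP <- cm["1", "0"]  # predicted crash, actual no crash
TN <- cm["0", "0"]  # predicted no crash, actual no crash

sensitivity <- TP / (TP + FN)      # share of actual crashes that were caught
specificity <- TN / (TN + FP)      # share of non-crashes correctly identified
accuracy    <- (TP + TN) / sum(cm) # share of all predictions that were correct

c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```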

https://rpubs.com/sharmaar/ConfusionMatrix