Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records, each representing a customer at an auto insurance company.

Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT.

This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.
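
The relationship between the two response variables can be checked directly. The following is a minimal sketch on a made-up data frame: the name `toy` and its values are illustrative assumptions, and only the column names come from the assignment.

```r
# Toy records; column names follow the assignment, values are illustrative.
toy <- data.frame(
  TARGET_FLAG = c(0, 1, 0, 1),
  TARGET_AMT  = c(0, 2946, 0, 2501)
)

# TARGET_AMT should be positive exactly when TARGET_FLAG is 1.
flag_from_amt <- as.integer(toy$TARGET_AMT > 0)
consistent <- all(flag_from_amt == toy$TARGET_FLAG)
print(consistent)
```

Running the same comparison on the real training data would be a quick way to confirm that the two responses agree before modeling.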

Objective

Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car.

You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

library(data.table)  # provides fread()
train_df <- fread("C:/Users/Nick Climaco/Documents/DataScience/DATA_621_F23/Homework-4/insurance_training_data.csv")
test_df <- fread("C:/Users/Nick Climaco/Documents/DataScience/DATA_621_F23/Homework-4/insurance-evaluation-data.csv")

Data Cleaning

This data set has 8161 rows and 26 columns: an INDEX identifier, two response variables and 23 explanatory variables. Several columns contain missing entries, which we address after the initial cleaning below.

# removing 'z_' and '$'
train_df <- as.data.frame(lapply(train_df, function(x) gsub("^z_", "", x)))


train_df$INCOME <- gsub("\\$", "", train_df$INCOME)
train_df$HOME_VAL <- gsub("\\$", "", train_df$HOME_VAL)
train_df$BLUEBOOK <- gsub("\\$", "", train_df$BLUEBOOK)
train_df$OLDCLAIM <- gsub("\\$", "", train_df$OLDCLAIM)

columns_to_convert = c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
convert_with_commas_to_numeric <- function(x) as.numeric(gsub(",", "", x))
train_df[columns_to_convert] <- lapply(train_df[columns_to_convert], convert_with_commas_to_numeric)

train_df <- type.convert(train_df, as.is = TRUE)

train_df[train_df == ""] <- NA
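
To make the effect of the cleaning steps concrete, here is a small sketch on made-up strings written in the style the cleaning code implies (dollar signs, thousands separators, a `z_` prefix, empty strings). The vectors `raw_income` and `raw_job` are illustrative assumptions, not rows from the data set.

```r
# Made-up raw values in the style of the INCOME and JOB columns.
raw_income <- c("$67,349", "$91,449", "", "$0")
raw_job    <- c("z_Blue Collar", "Professional", "", "Clerical")

# Strip the dollar sign and thousands separators, then convert to numeric.
# The empty string has no digits and becomes NA.
income <- as.numeric(gsub("[$,]", "", raw_income))

# Drop the "z_" prefix and mark empty strings as missing.
job <- sub("^z_", "", raw_job)
job[job == ""] <- NA

print(income)
print(job)
```

Handling both characters in one character class, `[$,]`, is a compact alternative to separate `gsub()` calls per symbol.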

show_summary <- function(df) {
    cat(rep("+", 50), "\n")
    cat(paste("DIMENSIONS : (", nrow(df), ", ", ncol(df), ")\n", sep = ""), "\n")
    cat(rep("+", 50), "\n")
    cat("COLUMNS:\n", "\n")
    col_names <- names(df)
    cat(paste(col_names, ", "))
    cat(rep("+", 50), "\n")
    cat("DATA INFO:\n", "\n")
    cat(sapply(df, class), "\n")
    cat(rep("+", 50), "\n")
    cat("MISSING VALUES:\n", "\n")
    missing_values <- colSums(is.na(df))
    cat(paste(col_names, ": ", missing_values, "\n"))
}

show_summary(train_df)
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## DIMENSIONS : (8161, 26)
##  
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## COLUMNS:
##  
## INDEX ,  TARGET_FLAG ,  TARGET_AMT ,  KIDSDRIV ,  AGE ,  HOMEKIDS ,  YOJ ,  INCOME ,  PARENT1 ,  HOME_VAL ,  MSTATUS ,  SEX ,  EDUCATION ,  JOB ,  TRAVTIME ,  CAR_USE ,  BLUEBOOK ,  TIF ,  CAR_TYPE ,  RED_CAR ,  OLDCLAIM ,  CLM_FREQ ,  REVOKED ,  MVR_PTS ,  CAR_AGE ,  URBANICITY , + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## DATA INFO:
##  
## integer integer numeric integer integer integer integer integer character integer character character character character integer character integer integer character character integer integer character integer integer character 
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## MISSING VALUES:
##  
## INDEX :  0 
##  TARGET_FLAG :  0 
##  TARGET_AMT :  0 
##  KIDSDRIV :  0 
##  AGE :  6 
##  HOMEKIDS :  0 
##  YOJ :  454 
##  INCOME :  445 
##  PARENT1 :  0 
##  HOME_VAL :  464 
##  MSTATUS :  0 
##  SEX :  0 
##  EDUCATION :  0 
##  JOB :  526 
##  TRAVTIME :  0 
##  CAR_USE :  0 
##  BLUEBOOK :  0 
##  TIF :  0 
##  CAR_TYPE :  0 
##  RED_CAR :  0 
##  OLDCLAIM :  0 
##  CLM_FREQ :  0 
##  REVOKED :  0 
##  MVR_PTS :  0 
##  CAR_AGE :  510 
##  URBANICITY :  0
train_df <- na.omit(train_df)

Data Exploration

The following are some of the summary statistics of the numerical variables.

summary(train_df)
##      INDEX        TARGET_FLAG      TARGET_AMT       KIDSDRIV     
##  Min.   :    1   Min.   :0.000   Min.   :    0   Min.   :0.0000  
##  1st Qu.: 2470   1st Qu.:0.000   1st Qu.:    0   1st Qu.:0.0000  
##  Median : 5060   Median :0.000   Median :    0   Median :0.0000  
##  Mean   : 5090   Mean   :0.265   Mean   : 1480   Mean   :0.1732  
##  3rd Qu.: 7667   3rd Qu.:1.000   3rd Qu.: 1037   3rd Qu.:0.0000  
##  Max.   :10302   Max.   :1.000   Max.   :85524   Max.   :4.0000  
##       AGE           HOMEKIDS           YOJ            INCOME      
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.00   Min.   :     0  
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.00   1st Qu.: 26748  
##  Median :45.00   Median :0.0000   Median :11.00   Median : 51624  
##  Mean   :44.63   Mean   :0.7434   Mean   :10.49   Mean   : 58177  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.00   3rd Qu.: 81287  
##  Max.   :81.00   Max.   :5.0000   Max.   :23.00   Max.   :367030  
##    PARENT1             HOME_VAL        MSTATUS              SEX           
##  Length:6045        Min.   :     0   Length:6045        Length:6045       
##  Class :character   1st Qu.:     0   Class :character   Class :character  
##  Mode  :character   Median :159152   Mode  :character   Mode  :character  
##                     Mean   :150102                                        
##                     3rd Qu.:233053                                        
##                     Max.   :885282                                        
##   EDUCATION             JOB               TRAVTIME        CAR_USE         
##  Length:6045        Length:6045        Min.   :  5.00   Length:6045       
##  Class :character   Class :character   1st Qu.: 23.00   Class :character  
##  Mode  :character   Mode  :character   Median : 33.00   Mode  :character  
##                                        Mean   : 33.69                     
##                                        3rd Qu.: 44.00                     
##                                        Max.   :142.00                     
##     BLUEBOOK          TIF          CAR_TYPE           RED_CAR         
##  Min.   : 1500   Min.   : 1.00   Length:6045        Length:6045       
##  1st Qu.: 9170   1st Qu.: 1.00   Class :character   Class :character  
##  Median :14080   Median : 4.00   Mode  :character   Mode  :character  
##  Mean   :15236   Mean   : 5.36                                        
##  3rd Qu.:20120   3rd Qu.: 7.00                                        
##  Max.   :65970   Max.   :25.00                                        
##     OLDCLAIM        CLM_FREQ        REVOKED             MVR_PTS    
##  Min.   :    0   Min.   :0.0000   Length:6045        Min.   : 0.0  
##  1st Qu.:    0   1st Qu.:0.0000   Class :character   1st Qu.: 0.0  
##  Median :    0   Median :0.0000   Mode  :character   Median : 1.0  
##  Mean   : 4005   Mean   :0.7841                      Mean   : 1.7  
##  3rd Qu.: 4546   3rd Qu.:2.0000                      3rd Qu.: 3.0  
##  Max.   :57037   Max.   :5.0000                      Max.   :13.0  
##     CAR_AGE        URBANICITY       
##  Min.   :-3.000   Length:6045       
##  1st Qu.: 1.000   Class :character  
##  Median : 8.000   Mode  :character  
##  Mean   : 7.921                     
##  3rd Qu.:12.000                     
##  Max.   :28.000

We chose to simply remove the records with missing values, because we believed that imputing them from the non-missing data could introduce unwanted bias. For instance, estimating a customer's missing income from other customers' data would carry a wide margin of error relative to that customer's actual income. This removed 2116 of the 8161 rows, leaving 6045 complete records.
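
Before dropping rows, it can help to measure how much data listwise deletion costs. The sketch below does this on an invented data frame; `toy` and its values are assumptions for illustration only.

```r
# Toy data with a few missing entries; values are invented.
toy <- data.frame(
  AGE    = c(60, 43, NA, 34, 50),
  INCOME = c(67349, NA, 16039, 125301, 106952),
  JOB    = c("Professional", "Blue Collar", "Clerical", NA, "Manager")
)

# Rows with at least one missing value are the ones na.omit() drops.
n_dropped <- sum(!complete.cases(toy))
share_dropped <- n_dropped / nrow(toy)

cleaned <- na.omit(toy)
cat("dropped", n_dropped, "of", nrow(toy), "rows\n")
```

When the dropped share is large, comparing summary statistics of the kept and dropped rows is one way to judge whether the deletion itself biases the sample.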

show_summary(train_df)
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## DIMENSIONS : (6045, 26)
##  
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## COLUMNS:
##  
## INDEX ,  TARGET_FLAG ,  TARGET_AMT ,  KIDSDRIV ,  AGE ,  HOMEKIDS ,  YOJ ,  INCOME ,  PARENT1 ,  HOME_VAL ,  MSTATUS ,  SEX ,  EDUCATION ,  JOB ,  TRAVTIME ,  CAR_USE ,  BLUEBOOK ,  TIF ,  CAR_TYPE ,  RED_CAR ,  OLDCLAIM ,  CLM_FREQ ,  REVOKED ,  MVR_PTS ,  CAR_AGE ,  URBANICITY , + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## DATA INFO:
##  
## integer integer numeric integer integer integer integer integer character integer character character character character integer character integer integer character character integer integer character integer integer character 
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## MISSING VALUES:
##  
## INDEX :  0 
##  TARGET_FLAG :  0 
##  TARGET_AMT :  0 
##  KIDSDRIV :  0 
##  AGE :  0 
##  HOMEKIDS :  0 
##  YOJ :  0 
##  INCOME :  0 
##  PARENT1 :  0 
##  HOME_VAL :  0 
##  MSTATUS :  0 
##  SEX :  0 
##  EDUCATION :  0 
##  JOB :  0 
##  TRAVTIME :  0 
##  CAR_USE :  0 
##  BLUEBOOK :  0 
##  TIF :  0 
##  CAR_TYPE :  0 
##  RED_CAR :  0 
##  OLDCLAIM :  0 
##  CLM_FREQ :  0 
##  REVOKED :  0 
##  MVR_PTS :  0 
##  CAR_AGE :  0 
##  URBANICITY :  0
str(train_df)
## 'data.frame':    6045 obs. of  26 variables:
##  $ INDEX      : int  1 2 4 7 12 13 14 15 16 19 ...
##  $ TARGET_FLAG: int  0 0 0 1 1 0 1 0 0 1 ...
##  $ TARGET_AMT : num  0 0 0 2946 2501 ...
##  $ KIDSDRIV   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AGE        : int  60 43 35 34 34 50 53 43 55 45 ...
##  $ HOMEKIDS   : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ YOJ        : int  11 11 10 12 10 7 14 5 11 0 ...
##  $ INCOME     : int  67349 91449 16039 125301 62978 106952 77100 52642 59162 0 ...
##  $ PARENT1    : chr  "No" "No" "No" "Yes" ...
##  $ HOME_VAL   : int  0 257252 124191 0 0 0 0 209970 180232 106859 ...
##  $ MSTATUS    : chr  "No" "No" "Yes" "No" ...
##  $ SEX        : chr  "M" "M" "F" "F" ...
##  $ EDUCATION  : chr  "PhD" "High School" "High School" "Bachelors" ...
##  $ JOB        : chr  "Professional" "Blue Collar" "Clerical" "Blue Collar" ...
##  $ TRAVTIME   : int  14 22 5 46 34 48 15 36 25 48 ...
##  $ CAR_USE    : chr  "Private" "Commercial" "Private" "Commercial" ...
##  $ BLUEBOOK   : int  14230 14940 4010 17430 11200 18510 18300 22420 17600 6000 ...
##  $ TIF        : int  11 1 4 1 1 7 1 7 7 1 ...
##  $ CAR_TYPE   : chr  "Minivan" "Minivan" "SUV" "Sports Car" ...
##  $ RED_CAR    : chr  "yes" "yes" "no" "no" ...
##  $ OLDCLAIM   : int  4461 0 38690 0 0 0 0 0 5028 0 ...
##  $ CLM_FREQ   : int  2 0 2 0 0 0 0 0 2 0 ...
##  $ REVOKED    : chr  "No" "No" "No" "No" ...
##  $ MVR_PTS    : int  3 0 3 0 0 1 0 0 3 3 ...
##  $ CAR_AGE    : int  18 1 10 7 1 17 11 1 9 5 ...
##  $ URBANICITY : chr  "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...
##  - attr(*, "na.action")= 'omit' Named int [1:2116] 4 5 7 8 14 21 29 32 37 45 ...
##   ..- attr(*, "names")= chr [1:2116] "4" "5" "7" "8" ...

Observe that TARGET_FLAG is imbalanced: fewer customers crashed their cars than did not, which could skew the models towards the customers who didn't crash. For the skewed numeric variables, we employed Box-Cox, log and square-root transformations so that their distributions more closely resemble a normal distribution.

par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))

for (col_name in names(train_df)[-1]) {
    if (is.numeric(train_df[[col_name]])) {
        hist(train_df[[col_name]], main = paste(col_name), xlab = "Value")
    }
}

par(mfrow = c(1, 1))

par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))

for (col_name in names(train_df)[-1]) {
    if (is.numeric(train_df[[col_name]])) {
        boxplot(train_df[[col_name]], main = paste(col_name), horizontal = TRUE,
            ylab = "Value")
    }
}

par(mfrow = c(1, 1))

The bar plot below shows the class imbalance: most of the records (about 73.5%, given the TARGET_FLAG mean of 0.265) are customers who have not crashed their cars.

train_df$TARGET_FLAG <- as.factor(train_df$TARGET_FLAG)
count <- table(train_df$TARGET_FLAG)
barplot(count, main = "Bar Plot of TARGET_FLAG", xlab = "Categories", ylab = "Count",
    names.arg = c("No Crash", "Crashed"))
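
The imbalance shown in the bar plot can also be expressed as proportions. This is a minimal sketch on an invented vector with roughly the imbalance seen in the report; `flag` is an assumption, not the real column.

```r
# Invented flags with roughly the imbalance seen in the report.
flag <- factor(c(rep(0, 73), rep(1, 27)), levels = c(0, 1))

counts <- table(flag)          # rows per class
shares <- prop.table(counts)   # share of each class
print(round(shares, 2))
```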

Data Preparation

To address the class imbalance, we upsampled the customers who reported a crash so that the two classes are more evenly represented.

# up sample crashed count
df_0 <- train_df[train_df$TARGET_FLAG == 0, ]
df_1 <- train_df[train_df$TARGET_FLAG == 1, ]

# Set the seed before the first call to sample() so that both the upsampling
# and the shuffle are reproducible
set.seed(123)

# Calculate the difference in the number of rows between the two classes
diff_rows <- nrow(df_0) - nrow(df_1)

# Randomly sample rows from the minority class with replacement. df_1 has
# about 1600 rows, so diff_rows + 1600 is approximately nrow(df_0)
df_1_upsampled <- df_1[sample(nrow(df_1), diff_rows + 1600, replace = TRUE), ]

# Combine the upsampled minority class with the majority class
upsampled_data <- rbind(df_0, df_1_upsampled)

# Shuffle the rows to randomize the order
upsampled_data <- upsampled_data[sample(nrow(upsampled_data)), ]
count <- table(upsampled_data$TARGET_FLAG)
barplot(count, main = "Bar Plot of TARGET_FLAG", xlab = "Categories", ylab = "Count",
    names.arg = c("No Crash", "Crashed"))
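
The upsampling step can be wrapped in a small function so that the target size is derived from the data rather than typed in. The sketch below is one possible way to do this, not the code used for the results in this report; `upsample_minority` and the toy data frame are hypothetical.

```r
# Hypothetical helper: resample the minority class with replacement until
# both classes have the same number of rows.
upsample_minority <- function(df, target, seed = 123) {
  set.seed(seed)  # seed before any sampling so the draw is repeatable
  counts <- table(df[[target]])
  minority <- names(counts)[which.min(counts)]
  n_major <- max(counts)

  minority_rows <- which(df[[target]] == minority)
  majority_rows <- which(df[[target]] != minority)

  # Draw positions with sample.int() to avoid sample()'s special handling
  # of a single number.
  drawn <- minority_rows[sample.int(length(minority_rows), n_major, replace = TRUE)]
  out <- df[c(majority_rows, drawn), , drop = FALSE]
  out[sample.int(nrow(out)), , drop = FALSE]  # shuffle the row order
}

toy <- data.frame(TARGET_FLAG = c(rep(0, 8), rep(1, 2)), x = 1:10)
balanced <- upsample_minority(toy, "TARGET_FLAG")
print(table(balanced$TARGET_FLAG))
```

Because the seed is set inside the function, calling it twice on the same input returns the same rows in the same order.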

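# TARGET_FLAG is a factor here, so as.numeric() returns its level codes:
# 1 for "0" (no crash) and 2 for "1" (crash), not 0 and 1
upsampled_data$TARGET_FLAG <- as.numeric(upsampled_data$TARGET_FLAG)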
upsampled_data$TARGET_AMT <- ifelse(upsampled_data$TARGET_AMT > 0, log(upsampled_data$TARGET_AMT),
    0)
upsampled_data$INCOME <- sqrt(upsampled_data$INCOME)
upsampled_data$HOME_VAL <- ifelse(upsampled_data$HOME_VAL > 0, log(upsampled_data$HOME_VAL),
    0)
upsampled_data$BLUEBOOK <- sqrt(upsampled_data$BLUEBOOK)
upsampled_data$OLDCLAIM <- ifelse(upsampled_data$OLDCLAIM > 0, log(upsampled_data$OLDCLAIM),
    0)
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))

for (col_name in names(upsampled_data)[-1]) {
    if (is.numeric(upsampled_data[[col_name]])) {
        hist(upsampled_data[[col_name]], main = paste(col_name), xlab = "Value")
    }
}

par(mfrow = c(1, 1))

TARGET_AMT, INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM were transformed to more closely follow a normal distribution, in the hope of improving the performance of our models.
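
One way to check whether a transformation actually reduces skew is to compare sample skewness before and after. The sketch below uses simulated right-skewed values rather than the insurance data; the `skewness` helper and the simulated `income` vector are assumptions for illustration.

```r
# Sample skewness (third standardized moment); helper name is illustrative.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

set.seed(1)
income <- rlnorm(5000, meanlog = 10, sdlog = 1)  # simulated right-skewed amounts

skew_raw  <- skewness(income)
skew_sqrt <- skewness(sqrt(income))
skew_log  <- skewness(log(income))
print(round(c(raw = skew_raw, sqrt = skew_sqrt, log = skew_log), 2))
```

A value near zero indicates a roughly symmetric distribution, so the transformation whose skewness is closest to zero is the more effective one for that variable.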

Model Building

Linear Models

Our initial linear model creates a baseline that we can build upon.

LM 1

lm_df <- upsampled_data |>
    dplyr::select(-INDEX, -TARGET_FLAG)
lm_df2 <- train_df |>
    dplyr::select(-INDEX, -TARGET_FLAG)

base_model <- lm(TARGET_AMT ~ ., data = lm_df2)
summary(base_model)
## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = lm_df2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5219  -1672   -727    384  82915 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    7.176e+02  5.163e+02   1.390  0.16463    
## KIDSDRIV                       2.066e+02  1.255e+02   1.646  0.09985 .  
## AGE                           -4.372e+00  7.920e+00  -0.552  0.58096    
## HOMEKIDS                       1.562e+01  7.242e+01   0.216  0.82920    
## YOJ                           -2.232e+00  1.652e+01  -0.135  0.89253    
## INCOME                        -2.927e-03  2.286e-03  -1.280  0.20055    
## PARENT1Yes                     4.689e+02  2.244e+02   2.090  0.03666 *  
## HOME_VAL                      -1.049e-03  7.147e-04  -1.468  0.14217    
## MSTATUSYes                    -5.410e+02  1.670e+02  -3.239  0.00121 ** 
## SEXM                           4.715e+02  2.039e+02   2.312  0.02079 *  
## EDUCATIONBachelors            -3.630e+02  2.266e+02  -1.602  0.10923    
## EDUCATIONHigh School          -2.213e+02  1.870e+02  -1.184  0.23657    
## EDUCATIONMasters              -3.036e+02  3.396e+02  -0.894  0.37139    
## EDUCATIONPhD                   4.376e+02  4.262e+02   1.027  0.30452    
## JOBClerical                   -1.569e+02  2.099e+02  -0.747  0.45493    
## JOBDoctor                     -1.348e+03  4.980e+02  -2.707  0.00682 ** 
## JOBHome Maker                 -2.536e+02  3.007e+02  -0.843  0.39909    
## JOBLawyer                     -2.258e+02  3.410e+02  -0.662  0.50782    
## JOBManager                    -1.115e+03  2.562e+02  -4.353 1.36e-05 ***
## JOBProfessional               -2.822e+01  2.336e+02  -0.121  0.90388    
## JOBStudent                    -4.338e+02  2.648e+02  -1.638  0.10138    
## TRAVTIME                       1.086e+01  3.616e+00   3.005  0.00267 ** 
## CAR_USEPrivate                -7.558e+02  1.831e+02  -4.127 3.72e-05 ***
## BLUEBOOK                       1.260e-02  9.608e-03   1.311  0.18992    
## TIF                           -4.458e+01  1.368e+01  -3.259  0.00112 ** 
## CAR_TYPEPanel Truck            4.756e+02  3.287e+02   1.447  0.14807    
## CAR_TYPEPickup                 4.057e+02  1.879e+02   2.159  0.03091 *  
## CAR_TYPESports Car             1.264e+03  2.367e+02   5.342 9.52e-08 ***
## CAR_TYPESUV                    9.103e+02  1.948e+02   4.673 3.03e-06 ***
## CAR_TYPEVan                    4.892e+02  2.418e+02   2.023  0.04314 *  
## RED_CARyes                    -1.691e+02  1.711e+02  -0.989  0.32288    
## OLDCLAIM                      -4.870e-03  8.355e-03  -0.583  0.55997    
## CLM_FREQ                       7.950e+01  6.203e+01   1.282  0.19999    
## REVOKEDYes                     4.970e+02  1.954e+02   2.544  0.01098 *  
## MVR_PTS                        1.728e+02  2.909e+01   5.942 2.98e-09 ***
## CAR_AGE                       -2.532e+01  1.446e+01  -1.751  0.08004 .  
## URBANICITYHighly Urban/ Urban  1.641e+03  1.537e+02  10.677  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4389 on 6008 degrees of freedom
## Multiple R-squared:  0.07651,    Adjusted R-squared:  0.07098 
## F-statistic: 13.83 on 36 and 6008 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(base_model)

We expected this model to perform poorly, and it does: its adjusted R-squared of 0.07 means it explains only about 7% of the variability in TARGET_AMT. The data used for this model were not transformed or upsampled, and no feature selection was applied to pick out the best predictors.
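
The adjusted R-squared can be recomputed from the other figures in the summary output, which shows how the penalty for the number of predictors enters. The inputs below are read from the LM 1 output; the variable names are illustrative.

```r
# Values read from the LM 1 summary output.
r_squared <- 0.07651
n <- 6045   # rows after na.omit()
p <- 36     # estimated coefficients, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 5))
```

With 6045 rows and only 36 coefficients the penalty is small, which is why the adjusted value stays close to the unadjusted one.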

LM 2 by Stepwise

library(leaps)  # provides regsubsets()
models <- regsubsets(TARGET_AMT ~ ., data = lm_df, nvmax = 5, method = "seqrep")
summary(models)
## Subset selection object
## Call: regsubsets.formula(TARGET_AMT ~ ., data = lm_df, nvmax = 5, method = "seqrep")
## 36 Variables  (and intercept)
##                               Forced in Forced out
## KIDSDRIV                          FALSE      FALSE
## AGE                               FALSE      FALSE
## HOMEKIDS                          FALSE      FALSE
## YOJ                               FALSE      FALSE
## INCOME                            FALSE      FALSE
## PARENT1Yes                        FALSE      FALSE
## HOME_VAL                          FALSE      FALSE
## MSTATUSYes                        FALSE      FALSE
## SEXM                              FALSE      FALSE
## EDUCATIONBachelors                FALSE      FALSE
## EDUCATIONHigh School              FALSE      FALSE
## EDUCATIONMasters                  FALSE      FALSE
## EDUCATIONPhD                      FALSE      FALSE
## JOBClerical                       FALSE      FALSE
## JOBDoctor                         FALSE      FALSE
## JOBHome Maker                     FALSE      FALSE
## JOBLawyer                         FALSE      FALSE
## JOBManager                        FALSE      FALSE
## JOBProfessional                   FALSE      FALSE
## JOBStudent                        FALSE      FALSE
## TRAVTIME                          FALSE      FALSE
## CAR_USEPrivate                    FALSE      FALSE
## BLUEBOOK                          FALSE      FALSE
## TIF                               FALSE      FALSE
## CAR_TYPEPanel Truck               FALSE      FALSE
## CAR_TYPEPickup                    FALSE      FALSE
## CAR_TYPESports Car                FALSE      FALSE
## CAR_TYPESUV                       FALSE      FALSE
## CAR_TYPEVan                       FALSE      FALSE
## RED_CARyes                        FALSE      FALSE
## OLDCLAIM                          FALSE      FALSE
## CLM_FREQ                          FALSE      FALSE
## REVOKEDYes                        FALSE      FALSE
## MVR_PTS                           FALSE      FALSE
## CAR_AGE                           FALSE      FALSE
## URBANICITYHighly Urban/ Urban     FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: 'sequential replacement'
##          KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1Yes HOME_VAL MSTATUSYes SEXM
## 1  ( 1 ) " "      " " " "      " " " "    " "        " "      " "        " " 
## 2  ( 1 ) " "      " " " "      " " " "    " "        " "      " "        " " 
## 3  ( 1 ) " "      " " " "      " " "*"    " "        " "      " "        " " 
## 4  ( 1 ) " "      " " " "      " " "*"    " "        " "      " "        " " 
## 5  ( 1 ) " "      " " " "      " " "*"    "*"        " "      " "        " " 
##          EDUCATIONBachelors EDUCATIONHigh School EDUCATIONMasters EDUCATIONPhD
## 1  ( 1 ) " "                " "                  " "              " "         
## 2  ( 1 ) " "                " "                  " "              " "         
## 3  ( 1 ) " "                " "                  " "              " "         
## 4  ( 1 ) " "                " "                  " "              " "         
## 5  ( 1 ) " "                " "                  " "              " "         
##          JOBClerical JOBDoctor JOBHome Maker JOBLawyer JOBManager
## 1  ( 1 ) " "         " "       " "           " "       " "       
## 2  ( 1 ) " "         " "       " "           " "       " "       
## 3  ( 1 ) " "         " "       " "           " "       " "       
## 4  ( 1 ) " "         " "       " "           " "       " "       
## 5  ( 1 ) " "         " "       " "           " "       " "       
##          JOBProfessional JOBStudent TRAVTIME CAR_USEPrivate BLUEBOOK TIF
## 1  ( 1 ) " "             " "        " "      " "            " "      " "
## 2  ( 1 ) " "             " "        " "      " "            " "      " "
## 3  ( 1 ) " "             " "        " "      " "            " "      " "
## 4  ( 1 ) " "             " "        " "      "*"            " "      " "
## 5  ( 1 ) " "             " "        " "      "*"            " "      " "
##          CAR_TYPEPanel Truck CAR_TYPEPickup CAR_TYPESports Car CAR_TYPESUV
## 1  ( 1 ) " "                 " "            " "                " "        
## 2  ( 1 ) " "                 " "            " "                " "        
## 3  ( 1 ) " "                 " "            " "                " "        
## 4  ( 1 ) " "                 " "            " "                " "        
## 5  ( 1 ) " "                 " "            " "                " "        
##          CAR_TYPEVan RED_CARyes OLDCLAIM CLM_FREQ REVOKEDYes MVR_PTS CAR_AGE
## 1  ( 1 ) " "         " "        "*"      " "      " "        " "     " "    
## 2  ( 1 ) " "         " "        "*"      " "      " "        " "     " "    
## 3  ( 1 ) " "         " "        "*"      " "      " "        " "     " "    
## 4  ( 1 ) " "         " "        "*"      " "      " "        " "     " "    
## 5  ( 1 ) " "         " "        "*"      " "      " "        " "     " "    
##          URBANICITYHighly Urban/ Urban
## 1  ( 1 ) " "                          
## 2  ( 1 ) "*"                          
## 3  ( 1 ) "*"                          
## 4  ( 1 ) "*"                          
## 5  ( 1 ) "*"
model_2 <- lm(TARGET_AMT ~ PARENT1 + JOB + CAR_USE + OLDCLAIM + URBANICITY + INCOME,
    data = lm_df)
summary(model_2)
## 
## Call:
## lm(formula = TARGET_AMT ~ PARENT1 + JOB + CAR_USE + OLDCLAIM + 
##     URBANICITY + INCOME, data = lm_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9274 -3.0506  0.0433  3.0414 10.6843 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    3.1234810  0.1938597  16.112  < 2e-16 ***
## PARENT1Yes                     1.6289755  0.1049323  15.524  < 2e-16 ***
## JOBClerical                    0.1411034  0.1375306   1.026 0.304930    
## JOBDoctor                     -1.0625292  0.2695931  -3.941 8.17e-05 ***
## JOBHome Maker                 -0.6962028  0.1990792  -3.497 0.000473 ***
## JOBLawyer                     -0.5482664  0.1705529  -3.215 0.001311 ** 
## JOBManager                    -1.9858514  0.1584395 -12.534  < 2e-16 ***
## JOBProfessional               -0.5482570  0.1423500  -3.851 0.000118 ***
## JOBStudent                    -0.6100499  0.1884222  -3.238 0.001210 ** 
## CAR_USEPrivate                -1.2077736  0.1015496 -11.893  < 2e-16 ***
## OLDCLAIM                       0.1621598  0.0092964  17.443  < 2e-16 ***
## URBANICITYHighly Urban/ Urban  3.1100614  0.1119786  27.774  < 2e-16 ***
## INCOME                        -0.0060144  0.0006353  -9.467  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.68 on 8871 degrees of freedom
## Multiple R-squared:  0.2257, Adjusted R-squared:  0.2247 
## F-statistic: 215.5 on 12 and 8871 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model_2)

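Using sequential replacement selection (regsubsets with method = "seqrep") on the transformed variables, we greatly improved the fit of this second model relative to the first: the adjusted R-squared went from 0.07 to 0.225, which suggests that the transformations and the removal of excess features helped. The two values are not strictly comparable, however, because the second model is fit to the upsampled data with a log-transformed TARGET_AMT. Note also that model_2 includes JOB in addition to the five selected predictors.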
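
The selection output lists 36 candidate columns because each categorical predictor is expanded into one indicator column per non-reference level. The sketch below shows that expansion with base R on invented rows; `toy` is an assumption, and only the column names mirror the data set.

```r
# Invented rows; the column names mirror two categorical predictors.
toy <- data.frame(
  JOB     = factor(c("Blue Collar", "Clerical", "Doctor", "Manager")),
  CAR_USE = factor(c("Commercial", "Private", "Private", "Commercial"))
)

# model.matrix() builds the design matrix that lm() uses for a formula:
# one 0/1 column per non-reference factor level, plus the intercept.
X <- model.matrix(~ JOB + CAR_USE, data = toy)
print(colnames(X))
```

A search capped at five columns therefore picks individual indicator columns, whereas naming a factor in an `lm()` formula brings in all of its level columns at once.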

LM 3 by 10-fold cross-validation

library(caret)  # provides trainControl() and train()
train_control <- trainControl(method = "cv", number = 10)

step_model2 <- train(TARGET_AMT ~ ., data = lm_df, method = "leapBackward", tuneGrid = data.frame(nvmax = 1:5),
    trControl = train_control)
step_model2$results
##   nvmax     RMSE   Rsquared      MAE     RMSESD RsquaredSD      MAESD
## 1     1 4.008379 0.08162822 3.810545 0.04840724 0.02007521 0.03988688
## 2     2 3.916301 0.12241911 3.632547 0.04775842 0.01918493 0.04411160
## 3     3 3.845643 0.15381406 3.501919 0.04624626 0.01986846 0.04099640
## 4     4 3.778778 0.18286651 3.388962 0.05334672 0.02297831 0.05192853
## 5     5 3.739504 0.19980490 3.322746 0.05841986 0.02451064 0.05812010
step_model2$bestTune
##   nvmax
## 5     5
summary(step_model2$finalModel)
## Subset selection object
## 36 Variables  (and intercept)
##                               Forced in Forced out
## KIDSDRIV                          FALSE      FALSE
## AGE                               FALSE      FALSE
## HOMEKIDS                          FALSE      FALSE
## YOJ                               FALSE      FALSE
## INCOME                            FALSE      FALSE
## PARENT1Yes                        FALSE      FALSE
## HOME_VAL                          FALSE      FALSE
## MSTATUSYes                        FALSE      FALSE
## SEXM                              FALSE      FALSE
## EDUCATIONBachelors                FALSE      FALSE
## EDUCATIONHigh School              FALSE      FALSE
## EDUCATIONMasters                  FALSE      FALSE
## EDUCATIONPhD                      FALSE      FALSE
## JOBClerical                       FALSE      FALSE
## JOBDoctor                         FALSE      FALSE
## JOBHome Maker                     FALSE      FALSE
## JOBLawyer                         FALSE      FALSE
## JOBManager                        FALSE      FALSE
## JOBProfessional                   FALSE      FALSE
## JOBStudent                        FALSE      FALSE
## TRAVTIME                          FALSE      FALSE
## CAR_USEPrivate                    FALSE      FALSE
## BLUEBOOK                          FALSE      FALSE
## TIF                               FALSE      FALSE
## CAR_TYPEPanel Truck               FALSE      FALSE
## CAR_TYPEPickup                    FALSE      FALSE
## CAR_TYPESports Car                FALSE      FALSE
## CAR_TYPESUV                       FALSE      FALSE
## CAR_TYPEVan                       FALSE      FALSE
## RED_CARyes                        FALSE      FALSE
## OLDCLAIM                          FALSE      FALSE
## CLM_FREQ                          FALSE      FALSE
## REVOKEDYes                        FALSE      FALSE
## MVR_PTS                           FALSE      FALSE
## CAR_AGE                           FALSE      FALSE
## URBANICITYHighly Urban/ Urban     FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: backward
##          KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1Yes HOME_VAL MSTATUSYes SEXM
## 1  ( 1 ) " "      " " " "      " " " "    " "        " "      " "        " " 
## 2  ( 1 ) " "      " " " "      " " " "    " "        " "      " "        " " 
## 3  ( 1 ) " "      " " " "      " " "*"    " "        " "      " "        " " 
## 4  ( 1 ) " "      " " " "      " " "*"    " "        " "      " "        " " 
## 5  ( 1 ) " "      " " " "      " " "*"    " "        " "      "*"        " " 
##          EDUCATIONBachelors EDUCATIONHigh School EDUCATIONMasters EDUCATIONPhD
## 1  ( 1 ) " "                " "                  " "              " "         
## 2  ( 1 ) " "                " "                  " "              " "         
## 3  ( 1 ) " "                " "                  " "              " "         
## 4  ( 1 ) " "                " "                  " "              " "         
## 5  ( 1 ) " "                " "                  " "              " "         
##          JOBClerical JOBDoctor JOBHome Maker JOBLawyer JOBManager
## 1  ( 1 ) " "         " "       " "           " "       " "       
## 2  ( 1 ) " "         " "       " "           " "       " "       
## 3  ( 1 ) " "         " "       " "           " "       " "       
## 4  ( 1 ) " "         " "       " "           " "       " "       
## 5  ( 1 ) " "         " "       " "           " "       " "       
##          JOBProfessional JOBStudent TRAVTIME CAR_USEPrivate BLUEBOOK TIF
## 1  ( 1 ) " "             " "        " "      " "            " "      " "
## 2  ( 1 ) " "             " "        " "      " "            " "      " "
## 3  ( 1 ) " "             " "        " "      " "            " "      " "
## 4  ( 1 ) " "             " "        " "      "*"            " "      " "
## 5  ( 1 ) " "             " "        " "      "*"            " "      " "
##          CAR_TYPEPanel Truck CAR_TYPEPickup CAR_TYPESports Car CAR_TYPESUV
## 1  ( 1 ) " "                 " "            " "                " "        
## 2  ( 1 ) " "                 " "            " "                " "        
## 3  ( 1 ) " "                 " "            " "                " "        
## 4  ( 1 ) " "                 " "            " "                " "        
## 5  ( 1 ) " "                 " "            " "                " "        
##          CAR_TYPEVan RED_CARyes OLDCLAIM CLM_FREQ REVOKEDYes MVR_PTS CAR_AGE
## 1  ( 1 ) " "         " "        "*"      " "      " "        " "     " "    
## 2  ( 1 ) " "         " "        "*"      " "      " "        " "     " "    
## 3  ( 1 ) " "         " "        "*"      " "      " "        " "     " "    
## 4  ( 1 ) " "         " "        "*"      " "      " "        " "     " "    
## 5  ( 1 ) " "         " "        "*"      " "      " "        " "     " "    
##          URBANICITYHighly Urban/ Urban
## 1  ( 1 ) " "                          
## 2  ( 1 ) "*"                          
## 3  ( 1 ) "*"                          
## 4  ( 1 ) "*"                          
## 5  ( 1 ) "*"
model_3 <- lm(TARGET_AMT ~ INCOME + JOB + CAR_USE + OLDCLAIM + URBANICITY, data = upsampled_data)
summary(model_3)
## 
## Call:
## lm(formula = TARGET_AMT ~ INCOME + JOB + CAR_USE + OLDCLAIM + 
##     URBANICITY, data = upsampled_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7814 -3.1217  0.2167  3.0889 10.5353 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    3.3562472  0.1958758  17.135  < 2e-16 ***
## INCOME                        -0.0062332  0.0006437  -9.684  < 2e-16 ***
## JOBClerical                    0.1852756  0.1393486   1.330 0.183690    
## JOBDoctor                     -1.1883865  0.2730916  -4.352 1.37e-05 ***
## JOBHome Maker                 -0.6735017  0.2017485  -3.338 0.000846 ***
## JOBLawyer                     -0.5993864  0.1728121  -3.468 0.000526 ***
## JOBManager                    -2.0284367  0.1605441 -12.635  < 2e-16 ***
## JOBProfessional               -0.5474932  0.1442625  -3.795 0.000149 ***
## JOBStudent                    -0.5102379  0.1908425  -2.674 0.007518 ** 
## CAR_USEPrivate                -1.1797461  0.1028977 -11.465  < 2e-16 ***
## OLDCLAIM                       0.1697077  0.0094085  18.038  < 2e-16 ***
## URBANICITYHighly Urban/ Urban  3.1530863  0.1134483  27.793  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.729 on 8872 degrees of freedom
## Multiple R-squared:  0.2047, Adjusted R-squared:  0.2037 
## F-statistic: 207.6 on 11 and 8872 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model_3)

For Model 3, we used 10-fold cross-validation to select the best features. We chose this method to push our knowledge of feature selection further. It returned the best 5 features to use in the model. Model 3 yields an adjusted R-squared of 0.20, which is worse than Model 2. In addition, we observed multiple t-values beyond the suggested threshold, which could be related to the poor fit to the data.
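As an illustration of the idea behind cross-validated feature selection, the following is a minimal base-R sketch on simulated data. The data frame `sim_df`, the helper `cv_rmse()`, and the candidate formulas are all hypothetical and are not the code used to produce Model 3; the sketch only shows how held-out error can be compared across feature sets.

```r
# Minimal sketch: 10-fold cross-validation to compare candidate linear models.
# sim_df and the formulas are simulated stand-ins, not the insurance data.
set.seed(621)
n <- 500
sim_df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim_df$y <- 2 + 1.5 * sim_df$x1 - 0.8 * sim_df$x2 + rnorm(n)

# Assign each row to one of k folds at random
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

# Cross-validated RMSE for one model formula (the response is assumed to be `y`)
cv_rmse <- function(model_formula, data, fold_id) {
  fold_mse <- vapply(sort(unique(fold_id)), function(f) {
    fit <- lm(model_formula, data = data[fold_id != f, ])  # train on k - 1 folds
    held_out <- data[fold_id == f, ]                       # score on the held-out fold
    mean((held_out$y - predict(fit, newdata = held_out))^2)
  }, numeric(1))
  sqrt(mean(fold_mse))
}

candidates <- list(
  one_feature    = y ~ x1,
  two_features   = y ~ x1 + x2,
  three_features = y ~ x1 + x2 + x3
)
cv_results <- vapply(candidates, cv_rmse, numeric(1),
                     data = sim_df, fold_id = fold_id)
print(cv_results)
```

The feature set with the lowest cross-validated RMSE would be the one carried forward. Using the same `fold_id` for every candidate keeps the comparison fair.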

Logistic Models

Next, we build our logistic models on w

Logit Model 1

logit_df <- upsampled_data |>
    select(-INDEX, -TARGET_AMT) |>
    mutate(TARGET_FLAG = if_else(TARGET_FLAG == 1, 0, 1))
logit_1 <- glm(TARGET_FLAG ~ ., data = logit_df, family = "binomial")
summary(logit_1)
## 
## Call:
## glm(formula = TARGET_FLAG ~ ., family = "binomial", data = logit_df)
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -0.9638368  0.2585018  -3.729 0.000193 ***
## KIDSDRIV                       0.3055413  0.0562807   5.429 5.67e-08 ***
## AGE                           -0.0015876  0.0034265  -0.463 0.643135    
## HOMEKIDS                       0.0172184  0.0325903   0.528 0.597270    
## YOJ                            0.0040759  0.0077878   0.523 0.600719    
## INCOME                        -0.0028675  0.0004753  -6.033 1.61e-09 ***
## PARENT1Yes                     0.4903936  0.0971820   5.046 4.51e-07 ***
## HOME_VAL                      -0.0293439  0.0063151  -4.647 3.37e-06 ***
## MSTATUSYes                    -0.4079803  0.0780663  -5.226 1.73e-07 ***
## SEXM                           0.2411218  0.0935003   2.579 0.009913 ** 
## EDUCATIONBachelors            -0.4138299  0.1004116  -4.121 3.77e-05 ***
## EDUCATIONHigh School           0.0329515  0.0818227   0.403 0.687156    
## EDUCATIONMasters              -0.4006630  0.1581591  -2.533 0.011300 *  
## EDUCATIONPhD                   0.0846562  0.1928038   0.439 0.660604    
## JOBClerical                    0.1476141  0.0932740   1.583 0.113516    
## JOBDoctor                     -0.7243040  0.2318448  -3.124 0.001784 ** 
## JOBHome Maker                 -0.3600040  0.1456888  -2.471 0.013472 *  
## JOBLawyer                      0.0560669  0.1582066   0.354 0.723046    
## JOBManager                    -0.8382743  0.1179871  -7.105 1.21e-12 ***
## JOBProfessional               -0.0801400  0.1032523  -0.776 0.437656    
## JOBStudent                    -0.4393807  0.1360568  -3.229 0.001241 ** 
## TRAVTIME                       0.0179825  0.0016598  10.834  < 2e-16 ***
## CAR_USEPrivate                -0.8887629  0.0811197 -10.956  < 2e-16 ***
## BLUEBOOK                      -0.0054147  0.0010418  -5.197 2.02e-07 ***
## TIF                           -0.0540281  0.0063019  -8.573  < 2e-16 ***
## CAR_TYPEPanel Truck            0.6099367  0.1434501   4.252 2.12e-05 ***
## CAR_TYPEPickup                 0.6813316  0.0857897   7.942 1.99e-15 ***
## CAR_TYPESports Car             1.1852233  0.1084332  10.930  < 2e-16 ***
## CAR_TYPESUV                    0.9821315  0.0915877  10.723  < 2e-16 ***
## CAR_TYPEVan                    0.6453665  0.1098281   5.876 4.20e-09 ***
## RED_CARyes                    -0.2126795  0.0774405  -2.746 0.006026 ** 
## OLDCLAIM                       0.0256060  0.0114139   2.243 0.024871 *  
## CLM_FREQ                       0.1193069  0.0399503   2.986 0.002823 ** 
## REVOKEDYes                     0.6234602  0.0753886   8.270  < 2e-16 ***
## MVR_PTS                        0.1070035  0.0128461   8.330  < 2e-16 ***
## CAR_AGE                        0.0023938  0.0065873   0.363 0.716303    
## URBANICITYHighly Urban/ Urban  2.1672489  0.0832258  26.041  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12316  on 8883  degrees of freedom
## Residual deviance:  9262  on 8847  degrees of freedom
## AIC: 9336
## 
## Number of Fisher Scoring iterations: 4

Our first logit model, with all the features including the transformed variables, yields an AIC of 9336. The model flags several features as non-significant predictors, for example AGE, HOMEKIDS, YOJ and CAR_AGE. We can remove those redundant features using stepwise regression in the next model. Reducing the number of features may help this model perform better.
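The AIC in the summary output can be reproduced from the residual deviance and the number of estimated coefficients. The sketch below copies the figures from the output above and then checks the same identity on a small simulated fit; `sim_x`, `sim_y` and `sim_fit` are hypothetical names used only for that check.

```r
# Values copied from the summary(logit_1) output above
residual_deviance <- 9262
null_df <- 8883
residual_df <- 8847

# Number of estimated coefficients, including the intercept
n_params <- (null_df - residual_df) + 1

# For a 0/1 response, AIC = residual deviance + 2 * (number of coefficients)
aic_from_deviance <- residual_deviance + 2 * n_params
print(aic_from_deviance)

# Same identity checked on a small simulated logistic fit (hypothetical data)
set.seed(1)
sim_x <- rnorm(200)
sim_y <- rbinom(200, size = 1, prob = plogis(0.5 * sim_x))
sim_fit <- glm(sim_y ~ sim_x, family = "binomial")
identity_holds <- isTRUE(all.equal(AIC(sim_fit),
                                   deviance(sim_fit) + 2 * length(coef(sim_fit))))
print(identity_holds)
```

This identity relies on the response being binary, where the saturated model has a log-likelihood of zero; it would not hold in the same form for grouped binomial counts.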

Logit Model 2 Stepwise

logit_2 <- step(logit_1, direction = "both", scope = formula(logit_1))
## Start:  AIC=9335.98
## TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE + 
##     BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + 
##     REVOKED + MVR_PTS + CAR_AGE + URBANICITY
## 
##              Df Deviance     AIC
## - CAR_AGE     1   9262.1  9334.1
## - AGE         1   9262.2  9334.2
## - YOJ         1   9262.3  9334.3
## - HOMEKIDS    1   9262.3  9334.3
## <none>            9262.0  9336.0
## - OLDCLAIM    1   9267.0  9339.0
## - SEX         1   9268.6  9340.6
## - RED_CAR     1   9269.5  9341.5
## - CLM_FREQ    1   9270.9  9342.9
## - HOME_VAL    1   9283.6  9355.6
## - PARENT1     1   9287.7  9359.7
## - BLUEBOOK    1   9289.1  9361.1
## - MSTATUS     1   9289.3  9361.3
## - KIDSDRIV    1   9292.1  9364.1
## - INCOME      1   9298.7  9370.7
## - EDUCATION   4   9306.4  9372.4
## - REVOKED     1   9332.1  9404.1
## - MVR_PTS     1   9333.3  9405.3
## - TIF         1   9336.6  9408.6
## - JOB         7   9380.0  9440.0
## - TRAVTIME    1   9381.5  9453.5
## - CAR_USE     1   9384.3  9456.3
## - CAR_TYPE    5   9439.4  9503.4
## - URBANICITY  1  10083.7 10155.7
## 
## Step:  AIC=9334.11
## TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE + 
##     BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + 
##     REVOKED + MVR_PTS + URBANICITY
## 
##              Df Deviance     AIC
## - AGE         1   9262.3  9332.3
## - YOJ         1   9262.4  9332.4
## - HOMEKIDS    1   9262.4  9332.4
## <none>            9262.1  9334.1
## + CAR_AGE     1   9262.0  9336.0
## - OLDCLAIM    1   9267.2  9337.2
## - SEX         1   9268.8  9338.8
## - RED_CAR     1   9269.7  9339.7
## - CLM_FREQ    1   9271.1  9341.1
## - HOME_VAL    1   9283.8  9353.8
## - PARENT1     1   9287.9  9357.9
## - BLUEBOOK    1   9289.2  9359.2
## - MSTATUS     1   9289.4  9359.4
## - KIDSDRIV    1   9292.2  9362.2
## - INCOME      1   9298.7  9368.7
## - EDUCATION   4   9310.0  9374.0
## - REVOKED     1   9332.3  9402.3
## - MVR_PTS     1   9333.4  9403.4
## - TIF         1   9336.6  9406.6
## - JOB         7   9380.2  9438.2
## - TRAVTIME    1   9381.7  9451.7
## - CAR_USE     1   9384.7  9454.7
## - CAR_TYPE    5   9439.5  9501.5
## - URBANICITY  1  10083.7 10153.7
## 
## Step:  AIC=9332.32
## TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE + 
##     BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + 
##     REVOKED + MVR_PTS + URBANICITY
## 
##              Df Deviance     AIC
## - YOJ         1   9262.5  9330.5
## - HOMEKIDS    1   9262.9  9330.9
## <none>            9262.3  9332.3
## + AGE         1   9262.1  9334.1
## + CAR_AGE     1   9262.2  9334.2
## - OLDCLAIM    1   9267.4  9335.4
## - SEX         1   9268.8  9336.8
## - RED_CAR     1   9269.9  9337.9
## - CLM_FREQ    1   9271.3  9339.3
## - HOME_VAL    1   9284.2  9352.2
## - PARENT1     1   9288.8  9356.8
## - MSTATUS     1   9289.7  9357.7
## - BLUEBOOK    1   9290.8  9358.8
## - KIDSDRIV    1   9292.5  9360.5
## - INCOME      1   9298.8  9366.8
## - EDUCATION   4   9310.2  9372.2
## - REVOKED     1   9332.4  9400.4
## - MVR_PTS     1   9333.8  9401.8
## - TIF         1   9336.8  9404.8
## - JOB         7   9381.1  9437.1
## - TRAVTIME    1   9381.8  9449.8
## - CAR_USE     1   9384.8  9452.8
## - CAR_TYPE    5   9440.2  9500.2
## - URBANICITY  1  10085.0 10153.0
## 
## Step:  AIC=9330.51
## TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 + HOME_VAL + 
##     MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE + BLUEBOOK + 
##     TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + REVOKED + 
##     MVR_PTS + URBANICITY
## 
##              Df Deviance     AIC
## - HOMEKIDS    1   9263.2  9329.2
## <none>            9262.5  9330.5
## + YOJ         1   9262.3  9332.3
## + AGE         1   9262.4  9332.4
## + CAR_AGE     1   9262.4  9332.4
## - OLDCLAIM    1   9267.6  9333.6
## - SEX         1   9269.0  9335.0
## - RED_CAR     1   9270.1  9336.1
## - CLM_FREQ    1   9271.4  9337.4
## - HOME_VAL    1   9284.3  9350.3
## - PARENT1     1   9288.9  9354.9
## - MSTATUS     1   9289.7  9355.7
## - BLUEBOOK    1   9290.9  9356.9
## - KIDSDRIV    1   9292.6  9358.6
## - INCOME      1   9301.0  9367.0
## - EDUCATION   4   9310.8  9370.8
## - REVOKED     1   9332.7  9398.7
## - MVR_PTS     1   9334.0  9400.0
## - TIF         1   9336.9  9402.9
## - JOB         7   9384.1  9438.1
## - TRAVTIME    1   9382.1  9448.1
## - CAR_USE     1   9384.8  9450.8
## - CAR_TYPE    5   9440.4  9498.4
## - URBANICITY  1  10085.3 10151.3
## 
## Step:  AIC=9329.21
## TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL + MSTATUS + 
##     SEX + EDUCATION + JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + 
##     CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + 
##     URBANICITY
## 
##              Df Deviance     AIC
## <none>            9263.2  9329.2
## + HOMEKIDS    1   9262.5  9330.5
## + AGE         1   9262.8  9330.8
## + YOJ         1   9262.9  9330.9
## + CAR_AGE     1   9263.1  9331.1
## - OLDCLAIM    1   9268.4  9332.4
## - SEX         1   9269.5  9333.5
## - RED_CAR     1   9270.7  9334.7
## - CLM_FREQ    1   9272.0  9336.0
## - HOME_VAL    1   9285.1  9349.1
## - MSTATUS     1   9290.1  9354.1
## - BLUEBOOK    1   9292.2  9356.2
## - INCOME      1   9301.4  9365.4
## - KIDSDRIV    1   9302.8  9366.8
## - PARENT1     1   9303.5  9367.5
## - EDUCATION   4   9312.1  9370.1
## - REVOKED     1   9334.0  9398.0
## - MVR_PTS     1   9334.9  9398.9
## - TIF         1   9337.6  9401.6
## - JOB         7   9385.5  9437.5
## - TRAVTIME    1   9382.4  9446.4
## - CAR_USE     1   9385.7  9449.7
## - CAR_TYPE    5   9440.8  9496.8
## - URBANICITY  1  10086.1 10150.1
summary(logit_2)
## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL + 
##     MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE + BLUEBOOK + 
##     TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + REVOKED + 
##     MVR_PTS + URBANICITY, family = "binomial", data = logit_df)
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -0.9802374  0.2250160  -4.356 1.32e-05 ***
## KIDSDRIV                       0.3175643  0.0511867   6.204 5.50e-10 ***
## INCOME                        -0.0027776  0.0004509  -6.159 7.30e-10 ***
## PARENT1Yes                     0.5334438  0.0845816   6.307 2.85e-10 ***
## HOME_VAL                      -0.0295154  0.0063064  -4.680 2.87e-06 ***
## MSTATUSYes                    -0.3878864  0.0749170  -5.178 2.25e-07 ***
## SEXM                           0.2315180  0.0925715   2.501  0.01239 *  
## EDUCATIONBachelors            -0.4099657  0.0937765  -4.372 1.23e-05 ***
## EDUCATIONHigh School           0.0316676  0.0815008   0.389  0.69760    
## EDUCATIONMasters              -0.3883805  0.1405542  -2.763  0.00572 ** 
## EDUCATIONPhD                   0.0899442  0.1793282   0.502  0.61598    
## JOBClerical                    0.1562900  0.0929121   1.682  0.09254 .  
## JOBDoctor                     -0.7313044  0.2315349  -3.159  0.00159 ** 
## JOBHome Maker                 -0.3702797  0.1438399  -2.574  0.01005 *  
## JOBLawyer                      0.0494213  0.1579254   0.313  0.75433    
## JOBManager                    -0.8449111  0.1177685  -7.174 7.27e-13 ***
## JOBProfessional               -0.0861123  0.1030640  -0.836  0.40342    
## JOBStudent                    -0.4403109  0.1346563  -3.270  0.00108 ** 
## TRAVTIME                       0.0179437  0.0016589  10.817  < 2e-16 ***
## CAR_USEPrivate                -0.8879993  0.0809929 -10.964  < 2e-16 ***
## BLUEBOOK                      -0.0055264  0.0010282  -5.375 7.67e-08 ***
## TIF                           -0.0539370  0.0062996  -8.562  < 2e-16 ***
## CAR_TYPEPanel Truck            0.6139653  0.1429750   4.294 1.75e-05 ***
## CAR_TYPEPickup                 0.6811846  0.0857378   7.945 1.94e-15 ***
## CAR_TYPESports Car             1.1756992  0.1074073  10.946  < 2e-16 ***
## CAR_TYPESUV                    0.9757538  0.0906959  10.759  < 2e-16 ***
## CAR_TYPEVan                    0.6467599  0.1096442   5.899 3.66e-09 ***
## RED_CARyes                    -0.2119145  0.0774130  -2.737  0.00619 ** 
## OLDCLAIM                       0.0258938  0.0114097   2.269  0.02324 *  
## CLM_FREQ                       0.1183324  0.0399421   2.963  0.00305 ** 
## REVOKEDYes                     0.6255949  0.0753110   8.307  < 2e-16 ***
## MVR_PTS                        0.1072780  0.0128434   8.353  < 2e-16 ***
## URBANICITYHighly Urban/ Urban  2.1683332  0.0832228  26.055  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12315.8  on 8883  degrees of freedom
## Residual deviance:  9263.2  on 8851  degrees of freedom
## AIC: 9329.2
## 
## Number of Fisher Scoring iterations: 4

Even with the step() function, we were only able to improve the fit of the model minimally: the AIC fell from 9336.0 to 9329.2. With direction = "both", step() combines forward selection and backward elimination to reach the formula with the lowest AIC. We had hoped that this method would greatly improve our performance.
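One way to judge whether the dropped terms mattered is a likelihood-ratio test between the full and the stepwise model. The sketch below uses only the deviances and degrees of freedom printed in the two summaries above, so it is subject to their rounding; with the fitted objects available, `anova(logit_2, logit_1, test = "Chisq")` would be the usual route.

```r
# Deviances and residual degrees of freedom copied from the summaries above
dev_full <- 9262.0   # logit_1 (all features)
df_full  <- 8847
dev_step <- 9263.2   # logit_2 (after stepwise selection)
df_step  <- 8851

# Likelihood-ratio statistic: increase in deviance from dropping the terms
lr_stat <- dev_step - dev_full
lr_df   <- df_step - df_full

# Upper-tail chi-squared probability; a large value suggests the dropped
# terms (AGE, HOMEKIDS, YOJ, CAR_AGE) did not contribute significantly
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = p_value))
```

A p-value well above 0.05 here is consistent with the small change in AIC: the two models fit almost equally well, and the stepwise model is simply more parsimonious.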

Logit Model 3: Manual Selection of Significant Features

logit_3 <- glm(data = logit_df, TARGET_FLAG ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME +
    BLUEBOOK + CAR_TYPE + MVR_PTS + URBANICITY + TIF + AGE)
summary(logit_3)
## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME + 
##     BLUEBOOK + CAR_TYPE + MVR_PTS + URBANICITY + TIF + AGE, data = logit_df)
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    3.637e-01  3.335e-02  10.906  < 2e-16 ***
## KIDSDRIV                       7.173e-02  8.822e-03   8.131 4.84e-16 ***
## INCOME                        -5.410e-04  5.083e-05 -10.644  < 2e-16 ***
## HOME_VAL                      -1.034e-02  8.456e-04 -12.231  < 2e-16 ***
## TRAVTIME                       3.178e-03  2.985e-04  10.647  < 2e-16 ***
## BLUEBOOK                      -1.174e-03  1.742e-04  -6.739 1.69e-11 ***
## CAR_TYPEPanel Truck            2.402e-01  2.329e-02  10.314  < 2e-16 ***
## CAR_TYPEPickup                 1.915e-01  1.473e-02  12.997  < 2e-16 ***
## CAR_TYPESports Car             1.937e-01  1.656e-02  11.697  < 2e-16 ***
## CAR_TYPESUV                    1.683e-01  1.301e-02  12.936  < 2e-16 ***
## CAR_TYPEVan                    1.833e-01  1.946e-02   9.421  < 2e-16 ***
## MVR_PTS                        3.300e-02  2.050e-03  16.096  < 2e-16 ***
## URBANICITYHighly Urban/ Urban  3.866e-01  1.296e-02  29.836  < 2e-16 ***
## TIF                           -9.643e-03  1.150e-03  -8.388  < 2e-16 ***
## AGE                           -2.919e-03  5.297e-04  -5.510 3.70e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1948906)
## 
##     Null deviance: 2221.0  on 8883  degrees of freedom
## Residual deviance: 1728.5  on 8869  degrees of freedom
## AIC: 10701
## 
## Number of Fisher Scoring iterations: 2

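We wanted to test whether manually selecting the features from the first model would improve or worsen the model's performance. In this case the AIC (10701) is higher, which points to a worse model. Note, however, that the call above omits family = "binomial", so glm() fit a Gaussian model (see the dispersion line in the output), and its AIC is not directly comparable with the binomial AICs of the other logit models. For the next model, we will use the rfeControl function to select the features.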
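When no family argument is given, glm() defaults to a Gaussian family, which is what the "Dispersion parameter for gaussian family" line in the output above reflects. The sketch below shows the difference on simulated data; `sim_df`, `fit_default` and `fit_binomial` are hypothetical names, and the numbers it produces say nothing about the insurance data.

```r
# Simulated binary outcome (hypothetical data, not the insurance set)
set.seed(42)
n <- 1000
sim_df <- data.frame(x = rnorm(n))
sim_df$flag <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim_df$x))

# Without `family`, glm() fits an ordinary linear (gaussian) model
fit_default <- glm(flag ~ x, data = sim_df)

# With family = "binomial", glm() fits a logistic regression
fit_binomial <- glm(flag ~ x, data = sim_df, family = "binomial")

print(family(fit_default)$family)
print(family(fit_binomial)$family)

# Logistic fitted values are probabilities strictly between 0 and 1;
# the gaussian fit has no such constraint
print(range(fitted(fit_default)))
print(range(fitted(fit_binomial)))
```

Because the two fits use different likelihoods, their AIC values are on different scales and should not be ranked against each other.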

Logit Model 4: rfeControl

control <- rfeControl(functions = rfFuncs, # random forest
                      method = "repeatedcv", # repeated cv
                      repeats = 2, # number of repeats
                      number = 5) # number of folds
x <- logit_df %>%
    select(-TARGET_FLAG) %>%
    as.data.frame()

# Target variable
y <- logit_df$TARGET_FLAG

# Training: 80%; Test: 20%
set.seed(2021)
inTrain <- createDataPartition(y, p = 0.8, list = FALSE)[, 1]

x_train <- x[inTrain, ]
x_test <- x[-inTrain, ]

y_train <- y[inTrain]
y_test <- y[-inTrain]
result_rfe1 <- rfe(x = x_train, y = y_train, sizes = c(1:4), rfeControl = control)

# Print the results
result_rfe1
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (5 fold, repeated 2 times) 
## 
## Resampling performance over subset size:
## 
##  Variables   RMSE Rsquared    MAE   RMSESD RsquaredSD    MAESD Selected
##          1 0.4814  0.07398 0.4633 0.002359   0.009475 0.002309         
##          2 0.4777  0.08839 0.4583 0.002835   0.011716 0.002608         
##          3 0.4592  0.16488 0.4353 0.003360   0.014769 0.004615         
##          4 0.4384  0.25552 0.4114 0.003688   0.017496 0.003971         
##         23 0.3034  0.66634 0.2330 0.003646   0.009488 0.004031        *
## 
## The top 5 variables (out of 23):
##    URBANICITY, TRAVTIME, BLUEBOOK, HOME_VAL, AGE
logit_4 <- glm(TARGET_FLAG ~ URBANICITY + TRAVTIME + BLUEBOOK + HOME_VAL + INCOME,
    data = logit_df, family = "binomial")
summary(logit_4)
## 
## Call:
## glm(formula = TARGET_FLAG ~ URBANICITY + TRAVTIME + BLUEBOOK + 
##     HOME_VAL + INCOME, family = "binomial", data = logit_df)
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -0.4248050  0.1177954  -3.606 0.000311 ***
## URBANICITYHighly Urban/ Urban  2.1098042  0.0760672  27.736  < 2e-16 ***
## TRAVTIME                       0.0152592  0.0014881  10.254  < 2e-16 ***
## BLUEBOOK                      -0.0063802  0.0007542  -8.460  < 2e-16 ***
## HOME_VAL                      -0.0561934  0.0041394 -13.575  < 2e-16 ***
## INCOME                        -0.0032572  0.0002538 -12.831  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12316  on 8883  degrees of freedom
## Residual deviance: 10828  on 8878  degrees of freedom
## AIC: 10840
## 
## Number of Fisher Scoring iterations: 4

The rfe() run reported that using all 23 variables would give the best-performing model; however, our hardware limited how far we could run that function. Moreover, the five-feature model built from the rfe results (with INCOME used in place of AGE from the reported top five) still yielded a relatively high AIC of 10840.

Model Selection

After building and testing multiple models and trying out new feature selection techniques, we propose for this assignment Linear Model 2, with an adjusted R-squared of 0.2253, and Logistic Model 2, with an AIC of 9329.2, the lowest among our logistic models. Many other models did not make the cut for this report.