In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records, each representing a customer at an auto insurance company.
Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT.
This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.
Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car.
You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
This data set has 8161 rows and 26 columns: an INDEX identifier, two response variables, and 23 explanatory variables. Several columns contain missing entries; we address how to handle them in a later section.
# Strip the 'z_' prefix from every column (this coerces all columns to character)
train_df <- as.data.frame(lapply(train_df, function(x) gsub("^z_", "", x)))
# Remove '$' and ',' from the currency columns and convert them to numeric
columns_to_convert <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
convert_currency_to_numeric <- function(x) as.numeric(gsub("[$,]", "", x))
train_df[columns_to_convert] <- lapply(train_df[columns_to_convert], convert_currency_to_numeric)
# Restore integer/numeric types for the remaining columns
train_df <- type.convert(train_df, as.is = TRUE)
# Treat empty strings in character columns as missing
train_df[train_df == ""] <- NA
show_summary <- function(df) {
  cat(rep("+", 50), "\n")
  cat(paste("DIMENSIONS : (", nrow(df), ", ", ncol(df), ")\n", sep = ""), "\n")
  cat(rep("+", 50), "\n")
  cat("COLUMNS:\n", "\n")
  col_names <- names(df)
  cat(paste(col_names, ", "))
  cat(rep("+", 50), "\n")
  cat("DATA INFO:\n", "\n")
  cat(sapply(df, class), "\n")
  cat(rep("+", 50), "\n")
  cat("MISSING VALUES:\n", "\n")
  missing_values <- colSums(is.na(df))
  cat(paste(col_names, ": ", missing_values, "\n"))
}
show_summary(train_df)
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## DIMENSIONS : (8161, 26)
##
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## COLUMNS:
##
## INDEX , TARGET_FLAG , TARGET_AMT , KIDSDRIV , AGE , HOMEKIDS , YOJ , INCOME , PARENT1 , HOME_VAL , MSTATUS , SEX , EDUCATION , JOB , TRAVTIME , CAR_USE , BLUEBOOK , TIF , CAR_TYPE , RED_CAR , OLDCLAIM , CLM_FREQ , REVOKED , MVR_PTS , CAR_AGE , URBANICITY , + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## DATA INFO:
##
## integer integer numeric integer integer integer integer integer character integer character character character character integer character integer integer character character integer integer character integer integer character
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## MISSING VALUES:
##
## INDEX : 0
## TARGET_FLAG : 0
## TARGET_AMT : 0
## KIDSDRIV : 0
## AGE : 6
## HOMEKIDS : 0
## YOJ : 454
## INCOME : 445
## PARENT1 : 0
## HOME_VAL : 464
## MSTATUS : 0
## SEX : 0
## EDUCATION : 0
## JOB : 526
## TRAVTIME : 0
## CAR_USE : 0
## BLUEBOOK : 0
## TIF : 0
## CAR_TYPE : 0
## RED_CAR : 0
## OLDCLAIM : 0
## CLM_FREQ : 0
## REVOKED : 0
## MVR_PTS : 0
## CAR_AGE : 510
## URBANICITY : 0
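The following are summary statistics for the variables, computed on the 6045 complete rows. Numeric columns show their quartiles and mean, while character columns show only their length and class.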
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 Min. :0.000 Min. : 0 Min. :0.0000
## 1st Qu.: 2470 1st Qu.:0.000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5060 Median :0.000 Median : 0 Median :0.0000
## Mean : 5090 Mean :0.265 Mean : 1480 Mean :0.1732
## 3rd Qu.: 7667 3rd Qu.:1.000 3rd Qu.: 1037 3rd Qu.:0.0000
## Max. :10302 Max. :1.000 Max. :85524 Max. :4.0000
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.00 Min. : 0
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.: 26748
## Median :45.00 Median :0.0000 Median :11.00 Median : 51624
## Mean :44.63 Mean :0.7434 Mean :10.49 Mean : 58177
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.00 3rd Qu.: 81287
## Max. :81.00 Max. :5.0000 Max. :23.00 Max. :367030
## PARENT1 HOME_VAL MSTATUS SEX
## Length:6045 Min. : 0 Length:6045 Length:6045
## Class :character 1st Qu.: 0 Class :character Class :character
## Mode :character Median :159152 Mode :character Mode :character
## Mean :150102
## 3rd Qu.:233053
## Max. :885282
## EDUCATION JOB TRAVTIME CAR_USE
## Length:6045 Length:6045 Min. : 5.00 Length:6045
## Class :character Class :character 1st Qu.: 23.00 Class :character
## Mode :character Mode :character Median : 33.00 Mode :character
## Mean : 33.69
## 3rd Qu.: 44.00
## Max. :142.00
## BLUEBOOK TIF CAR_TYPE RED_CAR
## Min. : 1500 Min. : 1.00 Length:6045 Length:6045
## 1st Qu.: 9170 1st Qu.: 1.00 Class :character Class :character
## Median :14080 Median : 4.00 Mode :character Mode :character
## Mean :15236 Mean : 5.36
## 3rd Qu.:20120 3rd Qu.: 7.00
## Max. :65970 Max. :25.00
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## Min. : 0 Min. :0.0000 Length:6045 Min. : 0.0
## 1st Qu.: 0 1st Qu.:0.0000 Class :character 1st Qu.: 0.0
## Median : 0 Median :0.0000 Mode :character Median : 1.0
## Mean : 4005 Mean :0.7841 Mean : 1.7
## 3rd Qu.: 4546 3rd Qu.:2.0000 3rd Qu.: 3.0
## Max. :57037 Max. :5.0000 Max. :13.0
## CAR_AGE URBANICITY
## Min. :-3.000 Length:6045
## 1st Qu.: 1.000 Class :character
## Median : 8.000 Mode :character
## Mean : 7.921
## 3rd Qu.:12.000
## Max. :28.000
For the missing data, we decided to simply remove the affected rows, since we believed that imputing values from the non-missing data might introduce unwanted bias. For instance, estimating a customer's missing income from other customers' data could carry a wide margin of error relative to that customer's actual income.
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## DIMENSIONS : (6045, 26)
##
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## COLUMNS:
##
## INDEX , TARGET_FLAG , TARGET_AMT , KIDSDRIV , AGE , HOMEKIDS , YOJ , INCOME , PARENT1 , HOME_VAL , MSTATUS , SEX , EDUCATION , JOB , TRAVTIME , CAR_USE , BLUEBOOK , TIF , CAR_TYPE , RED_CAR , OLDCLAIM , CLM_FREQ , REVOKED , MVR_PTS , CAR_AGE , URBANICITY , + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## DATA INFO:
##
## integer integer numeric integer integer integer integer integer character integer character character character character integer character integer integer character character integer integer character integer integer character
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## MISSING VALUES:
##
## INDEX : 0
## TARGET_FLAG : 0
## TARGET_AMT : 0
## KIDSDRIV : 0
## AGE : 0
## HOMEKIDS : 0
## YOJ : 0
## INCOME : 0
## PARENT1 : 0
## HOME_VAL : 0
## MSTATUS : 0
## SEX : 0
## EDUCATION : 0
## JOB : 0
## TRAVTIME : 0
## CAR_USE : 0
## BLUEBOOK : 0
## TIF : 0
## CAR_TYPE : 0
## RED_CAR : 0
## OLDCLAIM : 0
## CLM_FREQ : 0
## REVOKED : 0
## MVR_PTS : 0
## CAR_AGE : 0
## URBANICITY : 0
## 'data.frame': 6045 obs. of 26 variables:
## $ INDEX : int 1 2 4 7 12 13 14 15 16 19 ...
## $ TARGET_FLAG: int 0 0 0 1 1 0 1 0 0 1 ...
## $ TARGET_AMT : num 0 0 0 2946 2501 ...
## $ KIDSDRIV : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AGE : int 60 43 35 34 34 50 53 43 55 45 ...
## $ HOMEKIDS : int 0 0 1 1 0 0 0 0 0 0 ...
## $ YOJ : int 11 11 10 12 10 7 14 5 11 0 ...
## $ INCOME : int 67349 91449 16039 125301 62978 106952 77100 52642 59162 0 ...
## $ PARENT1 : chr "No" "No" "No" "Yes" ...
## $ HOME_VAL : int 0 257252 124191 0 0 0 0 209970 180232 106859 ...
## $ MSTATUS : chr "No" "No" "Yes" "No" ...
## $ SEX : chr "M" "M" "F" "F" ...
## $ EDUCATION : chr "PhD" "High School" "High School" "Bachelors" ...
## $ JOB : chr "Professional" "Blue Collar" "Clerical" "Blue Collar" ...
## $ TRAVTIME : int 14 22 5 46 34 48 15 36 25 48 ...
## $ CAR_USE : chr "Private" "Commercial" "Private" "Commercial" ...
## $ BLUEBOOK : int 14230 14940 4010 17430 11200 18510 18300 22420 17600 6000 ...
## $ TIF : int 11 1 4 1 1 7 1 7 7 1 ...
## $ CAR_TYPE : chr "Minivan" "Minivan" "SUV" "Sports Car" ...
## $ RED_CAR : chr "yes" "yes" "no" "no" ...
## $ OLDCLAIM : int 4461 0 38690 0 0 0 0 0 5028 0 ...
## $ CLM_FREQ : int 2 0 2 0 0 0 0 0 2 0 ...
## $ REVOKED : chr "No" "No" "No" "No" ...
## $ MVR_PTS : int 3 0 3 0 0 1 0 0 3 3 ...
## $ CAR_AGE : int 18 1 10 7 1 17 11 1 9 5 ...
## $ URBANICITY : chr "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...
## - attr(*, "na.action")= 'omit' Named int [1:2116] 4 5 7 8 14 21 29 32 37 45 ...
## ..- attr(*, "names")= chr [1:2116] "4" "5" "7" "8" ...
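The row removal described above (8161 rows reduced to 6045, with 2116 omitted) can be sketched as follows. This is a minimal illustration on a made-up data frame, assuming the removal was done with `na.omit()`, which the `na.action` attribute in the output suggests; `toy_df` and its values are hypothetical.

```r
# Toy data frame standing in for train_df (values are made up for illustration).
toy_df <- data.frame(
  AGE    = c(60, 43, NA, 34, 50),
  INCOME = c(67349, NA, 16039, 125301, 106952),
  JOB    = c("Professional", "Blue Collar", "Clerical", NA, "Manager"),
  stringsAsFactors = FALSE
)

# Keep only the rows with no missing value in any column.
complete_df <- na.omit(toy_df)

# Report how many rows were dropped and which ones.
rows_dropped <- nrow(toy_df) - nrow(complete_df)
dropped_idx  <- as.integer(attr(complete_df, "na.action"))
cat("Rows before:", nrow(toy_df), "after:", nrow(complete_df),
    "dropped:", rows_dropped, "\n")
```

Checking `rows_dropped` before modeling is a cheap way to see how much data complete-case analysis costs; here roughly a quarter of the records are lost.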
Observe that TARGET_FLAG is imbalanced: far fewer customers crashed their cars than did not, which could skew the models toward the customers who did not crash. For the remaining skewed variables, we applied Box-Cox, log and square-root transformations so that their distributions more closely resemble a normal distribution.
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))
for (col_name in names(train_df)[-1]) {
  if (is.numeric(train_df[[col_name]])) {
    hist(train_df[[col_name]], main = paste(col_name), xlab = "Value")
  }
}
par(mfrow = c(1, 1))
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))
for (col_name in names(train_df)[-1]) {
  if (is.numeric(train_df[[col_name]])) {
    # Horizontal boxplot: the values run along the x-axis
    boxplot(train_df[[col_name]], main = paste(col_name), horizontal = TRUE,
      xlab = "Value")
  }
}
par(mfrow = c(1, 1))
Here is the pronounced class imbalance in the data: the majority of records are customers who have not crashed their cars.
train_df$TARGET_FLAG <- as.factor(train_df$TARGET_FLAG)
count <- table(train_df$TARGET_FLAG)
barplot(count, main = "Bar Plot of TARGET_FLAG", xlab = "Categories", ylab = "Count",
  names.arg = c("No Crash", "Crashed"))
To address the class imbalance, we upsampled the minority class (customers who reported a crash) so that the two classes are more evenly represented.
# Upsample the minority (crashed) class
set.seed(123) # Set the seed before any sampling so the result is reproducible
df_0 <- train_df[train_df$TARGET_FLAG == 0, ]
df_1 <- train_df[train_df$TARGET_FLAG == 1, ]
# Calculate the difference in the number of rows between the two classes
diff_rows <- nrow(df_0) - nrow(df_1)
# Randomly sample rows (with replacement) from the minority class; diff_rows + 1600
# is roughly nrow(df_0) because the minority class has about 1,600 rows
df_1_upsampled <- df_1[sample(nrow(df_1), diff_rows + 1600, replace = TRUE), ]
# Combine the upsampled minority class with the majority class
upsampled_data <- rbind(df_0, df_1_upsampled)
# Shuffle the rows to randomize the order
upsampled_data <- upsampled_data[sample(nrow(upsampled_data)), ]
count <- table(upsampled_data$TARGET_FLAG)
barplot(count, main = "Bar Plot of TARGET_FLAG", xlab = "Categories", ylab = "Count",
  names.arg = c("No Crash", "Crashed"))
upsampled_data$TARGET_AMT <- ifelse(upsampled_data$TARGET_AMT > 0, log(upsampled_data$TARGET_AMT),
  0)
upsampled_data$INCOME <- sqrt(upsampled_data$INCOME)
upsampled_data$HOME_VAL <- ifelse(upsampled_data$HOME_VAL > 0, log(upsampled_data$HOME_VAL),
  0)
upsampled_data$BLUEBOOK <- sqrt(upsampled_data$BLUEBOOK)
upsampled_data$OLDCLAIM <- ifelse(upsampled_data$OLDCLAIM > 0, log(upsampled_data$OLDCLAIM),
  0)
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))
for (col_name in names(upsampled_data)[-1]) {
  if (is.numeric(upsampled_data[[col_name]])) {
    hist(upsampled_data[[col_name]], main = paste(col_name), xlab = "Value")
  }
}
par(mfrow = c(1, 1))
TARGET_AMT, INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM were transformed to more closely resemble a normal distribution, in the hope of improving the performance of our models.
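To illustrate why log and square-root transformations help with right-skewed variables, the sketch below compares sample skewness before and after transforming simulated data. It is only an illustration: `log_positive` and `skewness` are hypothetical helper names, and the simulated `amount` vector is not the insurance data.

```r
# Guarded log: log(x) for positive values, 0 otherwise
# (the same ifelse() pattern used for TARGET_AMT, HOME_VAL and OLDCLAIM).
log_positive <- function(x) ifelse(x > 0, log(x), 0)

# Sample skewness (third standardized moment), written out in base R.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

set.seed(1)
# Simulated right-skewed, strictly positive amounts (not the real data).
amount <- rlnorm(1000, meanlog = 8, sdlog = 1)

cat("skewness raw :", round(skewness(amount), 2), "\n")
cat("skewness log :", round(skewness(log_positive(amount)), 2), "\n")
cat("skewness sqrt:", round(skewness(sqrt(amount)), 2), "\n")
```

One caveat of the guarded log is that exact zeros stay at 0, so a variable with many zeros (such as TARGET_AMT for customers without a crash) keeps a spike at 0 even after the transformation.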
Our initial linear model will serve as a baseline that we can build upon.
lm_df <- upsampled_data |>
dplyr::select(-INDEX, -TARGET_FLAG)
lm_df2 <- train_df |>
dplyr::select(-INDEX, -TARGET_FLAG)
base_model <- lm(TARGET_AMT ~ ., data = lm_df2)
summary(base_model)
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = lm_df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5219 -1672 -727 384 82915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.176e+02 5.163e+02 1.390 0.16463
## KIDSDRIV 2.066e+02 1.255e+02 1.646 0.09985 .
## AGE -4.372e+00 7.920e+00 -0.552 0.58096
## HOMEKIDS 1.562e+01 7.242e+01 0.216 0.82920
## YOJ -2.232e+00 1.652e+01 -0.135 0.89253
## INCOME -2.927e-03 2.286e-03 -1.280 0.20055
## PARENT1Yes 4.689e+02 2.244e+02 2.090 0.03666 *
## HOME_VAL -1.049e-03 7.147e-04 -1.468 0.14217
## MSTATUSYes -5.410e+02 1.670e+02 -3.239 0.00121 **
## SEXM 4.715e+02 2.039e+02 2.312 0.02079 *
## EDUCATIONBachelors -3.630e+02 2.266e+02 -1.602 0.10923
## EDUCATIONHigh School -2.213e+02 1.870e+02 -1.184 0.23657
## EDUCATIONMasters -3.036e+02 3.396e+02 -0.894 0.37139
## EDUCATIONPhD 4.376e+02 4.262e+02 1.027 0.30452
## JOBClerical -1.569e+02 2.099e+02 -0.747 0.45493
## JOBDoctor -1.348e+03 4.980e+02 -2.707 0.00682 **
## JOBHome Maker -2.536e+02 3.007e+02 -0.843 0.39909
## JOBLawyer -2.258e+02 3.410e+02 -0.662 0.50782
## JOBManager -1.115e+03 2.562e+02 -4.353 1.36e-05 ***
## JOBProfessional -2.822e+01 2.336e+02 -0.121 0.90388
## JOBStudent -4.338e+02 2.648e+02 -1.638 0.10138
## TRAVTIME 1.086e+01 3.616e+00 3.005 0.00267 **
## CAR_USEPrivate -7.558e+02 1.831e+02 -4.127 3.72e-05 ***
## BLUEBOOK 1.260e-02 9.608e-03 1.311 0.18992
## TIF -4.458e+01 1.368e+01 -3.259 0.00112 **
## CAR_TYPEPanel Truck 4.756e+02 3.287e+02 1.447 0.14807
## CAR_TYPEPickup 4.057e+02 1.879e+02 2.159 0.03091 *
## CAR_TYPESports Car 1.264e+03 2.367e+02 5.342 9.52e-08 ***
## CAR_TYPESUV 9.103e+02 1.948e+02 4.673 3.03e-06 ***
## CAR_TYPEVan 4.892e+02 2.418e+02 2.023 0.04314 *
## RED_CARyes -1.691e+02 1.711e+02 -0.989 0.32288
## OLDCLAIM -4.870e-03 8.355e-03 -0.583 0.55997
## CLM_FREQ 7.950e+01 6.203e+01 1.282 0.19999
## REVOKEDYes 4.970e+02 1.954e+02 2.544 0.01098 *
## MVR_PTS 1.728e+02 2.909e+01 5.942 2.98e-09 ***
## CAR_AGE -2.532e+01 1.446e+01 -1.751 0.08004 .
## URBANICITYHighly Urban/ Urban 1.641e+03 1.537e+02 10.677 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4389 on 6008 degrees of freedom
## Multiple R-squared: 0.07651, Adjusted R-squared: 0.07098
## F-statistic: 13.83 on 36 and 6008 DF, p-value: < 2.2e-16
As expected, this model performs poorly: its adjusted R-squared of 0.07 means it explains only about 7% of the variability in TARGET_AMT. The data used for this model were not transformed, and no feature selection was applied to pick out the best features.
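As a quick check on the figures in the summary above, the adjusted R-squared can be recomputed from the multiple R-squared, the sample size and the number of predictors. The helper name `adjusted_r2` is hypothetical; the inputs are read directly from the printed output.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
# (p excludes the intercept).
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Values read from the summary above: R^2 = 0.07651, 36 predictors,
# 6008 residual degrees of freedom, so n = 6008 + 36 + 1 = 6045.
adj <- adjusted_r2(r2 = 0.07651, n = 6045, p = 36)
cat("Adjusted R-squared:", round(adj, 5), "\n")
```

The penalty term `(n - 1) / (n - p - 1)` is close to 1 here because n is large relative to p, which is why the adjusted value sits only slightly below the unadjusted one.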
## Subset selection object
## Call: regsubsets.formula(TARGET_AMT ~ ., data = lm_df, nvmax = 5, method = "seqrep")
## 36 Variables (and intercept)
## Forced in Forced out
## KIDSDRIV FALSE FALSE
## AGE FALSE FALSE
## HOMEKIDS FALSE FALSE
## YOJ FALSE FALSE
## INCOME FALSE FALSE
## PARENT1Yes FALSE FALSE
## HOME_VAL FALSE FALSE
## MSTATUSYes FALSE FALSE
## SEXM FALSE FALSE
## EDUCATIONBachelors FALSE FALSE
## EDUCATIONHigh School FALSE FALSE
## EDUCATIONMasters FALSE FALSE
## EDUCATIONPhD FALSE FALSE
## JOBClerical FALSE FALSE
## JOBDoctor FALSE FALSE
## JOBHome Maker FALSE FALSE
## JOBLawyer FALSE FALSE
## JOBManager FALSE FALSE
## JOBProfessional FALSE FALSE
## JOBStudent FALSE FALSE
## TRAVTIME FALSE FALSE
## CAR_USEPrivate FALSE FALSE
## BLUEBOOK FALSE FALSE
## TIF FALSE FALSE
## CAR_TYPEPanel Truck FALSE FALSE
## CAR_TYPEPickup FALSE FALSE
## CAR_TYPESports Car FALSE FALSE
## CAR_TYPESUV FALSE FALSE
## CAR_TYPEVan FALSE FALSE
## RED_CARyes FALSE FALSE
## OLDCLAIM FALSE FALSE
## CLM_FREQ FALSE FALSE
## REVOKEDYes FALSE FALSE
## MVR_PTS FALSE FALSE
## CAR_AGE FALSE FALSE
## URBANICITYHighly Urban/ Urban FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: 'sequential replacement'
## KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1Yes HOME_VAL MSTATUSYes SEXM
## 1 ( 1 ) " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " "*" " " " " " " " "
## 4 ( 1 ) " " " " " " " " "*" " " " " " " " "
## 5 ( 1 ) " " " " " " " " "*" "*" " " " " " "
## EDUCATIONBachelors EDUCATIONHigh School EDUCATIONMasters EDUCATIONPhD
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## JOBClerical JOBDoctor JOBHome Maker JOBLawyer JOBManager
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## JOBProfessional JOBStudent TRAVTIME CAR_USEPrivate BLUEBOOK TIF
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " "*" " " " "
## 5 ( 1 ) " " " " " " "*" " " " "
## CAR_TYPEPanel Truck CAR_TYPEPickup CAR_TYPESports Car CAR_TYPESUV
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## CAR_TYPEVan RED_CARyes OLDCLAIM CLM_FREQ REVOKEDYes MVR_PTS CAR_AGE
## 1 ( 1 ) " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " "*" " " " " " " " "
## 3 ( 1 ) " " " " "*" " " " " " " " "
## 4 ( 1 ) " " " " "*" " " " " " " " "
## 5 ( 1 ) " " " " "*" " " " " " " " "
## URBANICITYHighly Urban/ Urban
## 1 ( 1 ) " "
## 2 ( 1 ) "*"
## 3 ( 1 ) "*"
## 4 ( 1 ) "*"
## 5 ( 1 ) "*"
model_2 <- lm(TARGET_AMT ~ PARENT1 + JOB + CAR_USE + OLDCLAIM + URBANICITY + INCOME,
  data = lm_df)
summary(model_2)
##
## Call:
## lm(formula = TARGET_AMT ~ PARENT1 + JOB + CAR_USE + OLDCLAIM +
## URBANICITY + INCOME, data = lm_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9274 -3.0506 0.0433 3.0414 10.6843
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1234810 0.1938597 16.112 < 2e-16 ***
## PARENT1Yes 1.6289755 0.1049323 15.524 < 2e-16 ***
## JOBClerical 0.1411034 0.1375306 1.026 0.304930
## JOBDoctor -1.0625292 0.2695931 -3.941 8.17e-05 ***
## JOBHome Maker -0.6962028 0.1990792 -3.497 0.000473 ***
## JOBLawyer -0.5482664 0.1705529 -3.215 0.001311 **
## JOBManager -1.9858514 0.1584395 -12.534 < 2e-16 ***
## JOBProfessional -0.5482570 0.1423500 -3.851 0.000118 ***
## JOBStudent -0.6100499 0.1884222 -3.238 0.001210 **
## CAR_USEPrivate -1.2077736 0.1015496 -11.893 < 2e-16 ***
## OLDCLAIM 0.1621598 0.0092964 17.443 < 2e-16 ***
## URBANICITYHighly Urban/ Urban 3.1100614 0.1119786 27.774 < 2e-16 ***
## INCOME -0.0060144 0.0006353 -9.467 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.68 on 8871 degrees of freedom
## Multiple R-squared: 0.2257, Adjusted R-squared: 0.2247
## F-statistic: 215.5 on 12 and 8871 DF, p-value: < 2.2e-16
Using the best-subsets function (regsubsets, here with sequential replacement) for feature selection on the transformed variables, we greatly improved the fit of this second model relative to the first: the adjusted R-squared went from 0.07 to 0.225. This suggests that the transformations and the removal of excess features helped, although the two values are not directly comparable because the second model is fit to the transformed, upsampled data.
train_control <- trainControl(method = "cv", number = 10)
step_model2 <- train(TARGET_AMT ~ ., data = lm_df, method = "leapBackward",
  tuneGrid = data.frame(nvmax = 1:5), trControl = train_control)
step_model2$results
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 1 4.008379 0.08162822 3.810545 0.04840724 0.02007521 0.03988688
## 2 2 3.916301 0.12241911 3.632547 0.04775842 0.01918493 0.04411160
## 3 3 3.845643 0.15381406 3.501919 0.04624626 0.01986846 0.04099640
## 4 4 3.778778 0.18286651 3.388962 0.05334672 0.02297831 0.05192853
## 5 5 3.739504 0.19980490 3.322746 0.05841986 0.02451064 0.05812010
## nvmax
## 5 5
## Subset selection object
## 36 Variables (and intercept)
## Forced in Forced out
## KIDSDRIV FALSE FALSE
## AGE FALSE FALSE
## HOMEKIDS FALSE FALSE
## YOJ FALSE FALSE
## INCOME FALSE FALSE
## PARENT1Yes FALSE FALSE
## HOME_VAL FALSE FALSE
## MSTATUSYes FALSE FALSE
## SEXM FALSE FALSE
## EDUCATIONBachelors FALSE FALSE
## EDUCATIONHigh School FALSE FALSE
## EDUCATIONMasters FALSE FALSE
## EDUCATIONPhD FALSE FALSE
## JOBClerical FALSE FALSE
## JOBDoctor FALSE FALSE
## JOBHome Maker FALSE FALSE
## JOBLawyer FALSE FALSE
## JOBManager FALSE FALSE
## JOBProfessional FALSE FALSE
## JOBStudent FALSE FALSE
## TRAVTIME FALSE FALSE
## CAR_USEPrivate FALSE FALSE
## BLUEBOOK FALSE FALSE
## TIF FALSE FALSE
## CAR_TYPEPanel Truck FALSE FALSE
## CAR_TYPEPickup FALSE FALSE
## CAR_TYPESports Car FALSE FALSE
## CAR_TYPESUV FALSE FALSE
## CAR_TYPEVan FALSE FALSE
## RED_CARyes FALSE FALSE
## OLDCLAIM FALSE FALSE
## CLM_FREQ FALSE FALSE
## REVOKEDYes FALSE FALSE
## MVR_PTS FALSE FALSE
## CAR_AGE FALSE FALSE
## URBANICITYHighly Urban/ Urban FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: backward
## KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1Yes HOME_VAL MSTATUSYes SEXM
## 1 ( 1 ) " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " "*" " " " " " " " "
## 4 ( 1 ) " " " " " " " " "*" " " " " " " " "
## 5 ( 1 ) " " " " " " " " "*" " " " " "*" " "
## EDUCATIONBachelors EDUCATIONHigh School EDUCATIONMasters EDUCATIONPhD
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## JOBClerical JOBDoctor JOBHome Maker JOBLawyer JOBManager
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## JOBProfessional JOBStudent TRAVTIME CAR_USEPrivate BLUEBOOK TIF
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " "*" " " " "
## 5 ( 1 ) " " " " " " "*" " " " "
## CAR_TYPEPanel Truck CAR_TYPEPickup CAR_TYPESports Car CAR_TYPESUV
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## CAR_TYPEVan RED_CARyes OLDCLAIM CLM_FREQ REVOKEDYes MVR_PTS CAR_AGE
## 1 ( 1 ) " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " "*" " " " " " " " "
## 3 ( 1 ) " " " " "*" " " " " " " " "
## 4 ( 1 ) " " " " "*" " " " " " " " "
## 5 ( 1 ) " " " " "*" " " " " " " " "
## URBANICITYHighly Urban/ Urban
## 1 ( 1 ) " "
## 2 ( 1 ) "*"
## 3 ( 1 ) "*"
## 4 ( 1 ) "*"
## 5 ( 1 ) "*"
model_3 <- lm(TARGET_AMT ~ INCOME + JOB + CAR_USE + OLDCLAIM + URBANICITY, data = upsampled_data)
summary(model_3)
##
## Call:
## lm(formula = TARGET_AMT ~ INCOME + JOB + CAR_USE + OLDCLAIM +
## URBANICITY, data = upsampled_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7814 -3.1217 0.2167 3.0889 10.5353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3562472 0.1958758 17.135 < 2e-16 ***
## INCOME -0.0062332 0.0006437 -9.684 < 2e-16 ***
## JOBClerical 0.1852756 0.1393486 1.330 0.183690
## JOBDoctor -1.1883865 0.2730916 -4.352 1.37e-05 ***
## JOBHome Maker -0.6735017 0.2017485 -3.338 0.000846 ***
## JOBLawyer -0.5993864 0.1728121 -3.468 0.000526 ***
## JOBManager -2.0284367 0.1605441 -12.635 < 2e-16 ***
## JOBProfessional -0.5474932 0.1442625 -3.795 0.000149 ***
## JOBStudent -0.5102379 0.1908425 -2.674 0.007518 **
## CAR_USEPrivate -1.1797461 0.1028977 -11.465 < 2e-16 ***
## OLDCLAIM 0.1697077 0.0094085 18.038 < 2e-16 ***
## URBANICITYHighly Urban/ Urban 3.1530863 0.1134483 27.793 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.729 on 8872 degrees of freedom
## Multiple R-squared: 0.2047, Adjusted R-squared: 0.2037
## F-statistic: 207.6 on 11 and 8872 DF, p-value: < 2.2e-16
For model 3, we used 10-fold cross-validation with backward selection to choose the features, partly to extend our own knowledge of feature selection. It returned the best 5 features to use in the model. Model 3 yields an adjusted R-squared of 0.20, which is worse than model 2. In addition, at least one coefficient (JOBClerical) has a p-value above the conventional 0.05 threshold, which may contribute to the weaker fit.
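The idea behind 10-fold cross-validation can be sketched in base R without caret. This is not the procedure used for model 3; it is a simplified illustration on simulated data, and `sim_df`, `fold_id` and `cv_rmse` are hypothetical names.

```r
set.seed(42)

# Simulated data standing in for the modeling frame (not the insurance data).
n <- 200
sim_df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim_df$y <- 1 + 2 * sim_df$x1 - 0.5 * sim_df$x2 + rnorm(n)

# Assign each row to one of k folds at random.
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold.
cv_rmse <- function(formula, data, fold_id) {
  response <- all.vars(formula)[1]
  sapply(sort(unique(fold_id)), function(f) {
    fit  <- lm(formula, data = data[fold_id != f, ])
    pred <- predict(fit, newdata = data[fold_id == f, ])
    sqrt(mean((data[[response]][fold_id == f] - pred)^2))
  })
}

rmse_small <- cv_rmse(y ~ x1, sim_df, fold_id)
rmse_full  <- cv_rmse(y ~ x1 + x2, sim_df, fold_id)
cat("Mean CV RMSE, y ~ x1     :", round(mean(rmse_small), 3), "\n")
cat("Mean CV RMSE, y ~ x1 + x2:", round(mean(rmse_full), 3), "\n")
```

Comparing candidate formulas on the same fold assignment, as done here, keeps the comparison paired, which is the same principle caret applies when it tunes `nvmax` across folds.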
Next, we build our logistic regression models.
logit_df <- upsampled_data |>
select(-INDEX, -TARGET_AMT) |>
mutate(TARGET_FLAG = if_else(TARGET_FLAG == 1, 0, 1))
logit_1 <- glm(TARGET_FLAG ~ ., data = logit_df, family = "binomial")
summary(logit_1)
##
## Call:
## glm(formula = TARGET_FLAG ~ ., family = "binomial", data = logit_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9638368 0.2585018 -3.729 0.000193 ***
## KIDSDRIV 0.3055413 0.0562807 5.429 5.67e-08 ***
## AGE -0.0015876 0.0034265 -0.463 0.643135
## HOMEKIDS 0.0172184 0.0325903 0.528 0.597270
## YOJ 0.0040759 0.0077878 0.523 0.600719
## INCOME -0.0028675 0.0004753 -6.033 1.61e-09 ***
## PARENT1Yes 0.4903936 0.0971820 5.046 4.51e-07 ***
## HOME_VAL -0.0293439 0.0063151 -4.647 3.37e-06 ***
## MSTATUSYes -0.4079803 0.0780663 -5.226 1.73e-07 ***
## SEXM 0.2411218 0.0935003 2.579 0.009913 **
## EDUCATIONBachelors -0.4138299 0.1004116 -4.121 3.77e-05 ***
## EDUCATIONHigh School 0.0329515 0.0818227 0.403 0.687156
## EDUCATIONMasters -0.4006630 0.1581591 -2.533 0.011300 *
## EDUCATIONPhD 0.0846562 0.1928038 0.439 0.660604
## JOBClerical 0.1476141 0.0932740 1.583 0.113516
## JOBDoctor -0.7243040 0.2318448 -3.124 0.001784 **
## JOBHome Maker -0.3600040 0.1456888 -2.471 0.013472 *
## JOBLawyer 0.0560669 0.1582066 0.354 0.723046
## JOBManager -0.8382743 0.1179871 -7.105 1.21e-12 ***
## JOBProfessional -0.0801400 0.1032523 -0.776 0.437656
## JOBStudent -0.4393807 0.1360568 -3.229 0.001241 **
## TRAVTIME 0.0179825 0.0016598 10.834 < 2e-16 ***
## CAR_USEPrivate -0.8887629 0.0811197 -10.956 < 2e-16 ***
## BLUEBOOK -0.0054147 0.0010418 -5.197 2.02e-07 ***
## TIF -0.0540281 0.0063019 -8.573 < 2e-16 ***
## CAR_TYPEPanel Truck 0.6099367 0.1434501 4.252 2.12e-05 ***
## CAR_TYPEPickup 0.6813316 0.0857897 7.942 1.99e-15 ***
## CAR_TYPESports Car 1.1852233 0.1084332 10.930 < 2e-16 ***
## CAR_TYPESUV 0.9821315 0.0915877 10.723 < 2e-16 ***
## CAR_TYPEVan 0.6453665 0.1098281 5.876 4.20e-09 ***
## RED_CARyes -0.2126795 0.0774405 -2.746 0.006026 **
## OLDCLAIM 0.0256060 0.0114139 2.243 0.024871 *
## CLM_FREQ 0.1193069 0.0399503 2.986 0.002823 **
## REVOKEDYes 0.6234602 0.0753886 8.270 < 2e-16 ***
## MVR_PTS 0.1070035 0.0128461 8.330 < 2e-16 ***
## CAR_AGE 0.0023938 0.0065873 0.363 0.716303
## URBANICITYHighly Urban/ Urban 2.1672489 0.0832258 26.041 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12316 on 8883 degrees of freedom
## Residual deviance: 9262 on 8847 degrees of freedom
## AIC: 9336
##
## Number of Fisher Scoring iterations: 4
Our first logit model, which uses all the features including the transformed variables, yields an AIC of 9336. The model indicates that several features are not significant predictors. We can remove those redundant features using stepwise regression in the next model. Reducing the number of features should help the model perform better.
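For a logistic regression on a 0/1 response, the AIC printed by `glm` equals the residual deviance plus twice the number of estimated coefficients, which is the quantity stepwise selection tries to reduce. The sketch below checks this identity on simulated data (not the insurance data) and then repeats the arithmetic with the figures printed in the summary above.

```r
set.seed(7)

# Simulated 0/1 outcome (not the insurance data).
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

# For a 0/1 response, AIC = residual deviance + 2 * (number of coefficients).
k <- length(coef(fit))
aic_by_hand <- deviance(fit) + 2 * k
cat("AIC(fit):", AIC(fit), " by hand:", aic_by_hand, "\n")

# Same arithmetic with the figures reported above:
# residual deviance 9262 and 37 estimated coefficients (36 terms + intercept).
reported_aic <- 9262 + 2 * 37
cat("Reported model:", reported_aic, "\n")
```

Dropping a term lowers the AIC only when the increase in deviance is smaller than twice the number of coefficients removed, which is why terms such as CAR_AGE and AGE are candidates for removal.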
## Start: AIC=9335.98
## TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 +
## HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE +
## BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ +
## REVOKED + MVR_PTS + CAR_AGE + URBANICITY
##
## Df Deviance AIC
## - CAR_AGE 1 9262.1 9334.1
## - AGE 1 9262.2 9334.2
## - YOJ 1 9262.3 9334.3
## - HOMEKIDS 1 9262.3 9334.3
## <none> 9262.0 9336.0
## - OLDCLAIM 1 9267.0 9339.0
## - SEX 1 9268.6 9340.6
## - RED_CAR 1 9269.5 9341.5
## - CLM_FREQ 1 9270.9 9342.9
## - HOME_VAL 1 9283.6 9355.6
## - PARENT1 1 9287.7 9359.7
## - BLUEBOOK 1 9289.1 9361.1
## - MSTATUS 1 9289.3 9361.3
## - KIDSDRIV 1 9292.1 9364.1
## - INCOME 1 9298.7 9370.7
## - EDUCATION 4 9306.4 9372.4
## - REVOKED 1 9332.1 9404.1
## - MVR_PTS 1 9333.3 9405.3
## - TIF 1 9336.6 9408.6
## - JOB 7 9380.0 9440.0
## - TRAVTIME 1 9381.5 9453.5
## - CAR_USE 1 9384.3 9456.3
## - CAR_TYPE 5 9439.4 9503.4
## - URBANICITY 1 10083.7 10155.7
##
## Step: AIC=9334.11
## TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 +
## HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE +
## BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ +
## REVOKED + MVR_PTS + URBANICITY
##
## Df Deviance AIC
## - AGE 1 9262.3 9332.3
## - YOJ 1 9262.4 9332.4
## - HOMEKIDS 1 9262.4 9332.4
## <none> 9262.1 9334.1
## + CAR_AGE 1 9262.0 9336.0
## - OLDCLAIM 1 9267.2 9337.2
## - SEX 1 9268.8 9338.8
## - RED_CAR 1 9269.7 9339.7
## - CLM_FREQ 1 9271.1 9341.1
## - HOME_VAL 1 9283.8 9353.8
## - PARENT1 1 9287.9 9357.9
## - BLUEBOOK 1 9289.2 9359.2
## - MSTATUS 1 9289.4 9359.4
## - KIDSDRIV 1 9292.2 9362.2
## - INCOME 1 9298.7 9368.7
## - EDUCATION 4 9310.0 9374.0
## - REVOKED 1 9332.3 9402.3
## - MVR_PTS 1 9333.4 9403.4
## - TIF 1 9336.6 9406.6
## - JOB 7 9380.2 9438.2
## - TRAVTIME 1 9381.7 9451.7
## - CAR_USE 1 9384.7 9454.7
## - CAR_TYPE 5 9439.5 9501.5
## - URBANICITY 1 10083.7 10153.7
##
## Step: AIC=9332.32
## TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + INCOME + PARENT1 +
## HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE +
## BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ +
## REVOKED + MVR_PTS + URBANICITY
##
## Df Deviance AIC
## - YOJ 1 9262.5 9330.5
## - HOMEKIDS 1 9262.9 9330.9
## <none> 9262.3 9332.3
## + AGE 1 9262.1 9334.1
## + CAR_AGE 1 9262.2 9334.2
## - OLDCLAIM 1 9267.4 9335.4
## - SEX 1 9268.8 9336.8
## - RED_CAR 1 9269.9 9337.9
## - CLM_FREQ 1 9271.3 9339.3
## - HOME_VAL 1 9284.2 9352.2
## - PARENT1 1 9288.8 9356.8
## - MSTATUS 1 9289.7 9357.7
## - BLUEBOOK 1 9290.8 9358.8
## - KIDSDRIV 1 9292.5 9360.5
## - INCOME 1 9298.8 9366.8
## - EDUCATION 4 9310.2 9372.2
## - REVOKED 1 9332.4 9400.4
## - MVR_PTS 1 9333.8 9401.8
## - TIF 1 9336.8 9404.8
## - JOB 7 9381.1 9437.1
## - TRAVTIME 1 9381.8 9449.8
## - CAR_USE 1 9384.8 9452.8
## - CAR_TYPE 5 9440.2 9500.2
## - URBANICITY 1 10085.0 10153.0
##
## Step: AIC=9330.51
## TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 + HOME_VAL +
## MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE + BLUEBOOK +
## TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + REVOKED +
## MVR_PTS + URBANICITY
##
## Df Deviance AIC
## - HOMEKIDS 1 9263.2 9329.2
## <none> 9262.5 9330.5
## + YOJ 1 9262.3 9332.3
## + AGE 1 9262.4 9332.4
## + CAR_AGE 1 9262.4 9332.4
## - OLDCLAIM 1 9267.6 9333.6
## - SEX 1 9269.0 9335.0
## - RED_CAR 1 9270.1 9336.1
## - CLM_FREQ 1 9271.4 9337.4
## - HOME_VAL 1 9284.3 9350.3
## - PARENT1 1 9288.9 9354.9
## - MSTATUS 1 9289.7 9355.7
## - BLUEBOOK 1 9290.9 9356.9
## - KIDSDRIV 1 9292.6 9358.6
## - INCOME 1 9301.0 9367.0
## - EDUCATION 4 9310.8 9370.8
## - REVOKED 1 9332.7 9398.7
## - MVR_PTS 1 9334.0 9400.0
## - TIF 1 9336.9 9402.9
## - JOB 7 9384.1 9438.1
## - TRAVTIME 1 9382.1 9448.1
## - CAR_USE 1 9384.8 9450.8
## - CAR_TYPE 5 9440.4 9498.4
## - URBANICITY 1 10085.3 10151.3
##
## Step: AIC=9329.21
## TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL + MSTATUS +
## SEX + EDUCATION + JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF +
## CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS +
## URBANICITY
##
## Df Deviance AIC
## <none> 9263.2 9329.2
## + HOMEKIDS 1 9262.5 9330.5
## + AGE 1 9262.8 9330.8
## + YOJ 1 9262.9 9330.9
## + CAR_AGE 1 9263.1 9331.1
## - OLDCLAIM 1 9268.4 9332.4
## - SEX 1 9269.5 9333.5
## - RED_CAR 1 9270.7 9334.7
## - CLM_FREQ 1 9272.0 9336.0
## - HOME_VAL 1 9285.1 9349.1
## - MSTATUS 1 9290.1 9354.1
## - BLUEBOOK 1 9292.2 9356.2
## - INCOME 1 9301.4 9365.4
## - KIDSDRIV 1 9302.8 9366.8
## - PARENT1 1 9303.5 9367.5
## - EDUCATION 4 9312.1 9370.1
## - REVOKED 1 9334.0 9398.0
## - MVR_PTS 1 9334.9 9398.9
## - TIF 1 9337.6 9401.6
## - JOB 7 9385.5 9437.5
## - TRAVTIME 1 9382.4 9446.4
## - CAR_USE 1 9385.7 9449.7
## - CAR_TYPE 5 9440.8 9496.8
## - URBANICITY 1 10086.1 10150.1
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL +
## MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE + BLUEBOOK +
## TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + REVOKED +
## MVR_PTS + URBANICITY, family = "binomial", data = logit_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9802374 0.2250160 -4.356 1.32e-05 ***
## KIDSDRIV 0.3175643 0.0511867 6.204 5.50e-10 ***
## INCOME -0.0027776 0.0004509 -6.159 7.30e-10 ***
## PARENT1Yes 0.5334438 0.0845816 6.307 2.85e-10 ***
## HOME_VAL -0.0295154 0.0063064 -4.680 2.87e-06 ***
## MSTATUSYes -0.3878864 0.0749170 -5.178 2.25e-07 ***
## SEXM 0.2315180 0.0925715 2.501 0.01239 *
## EDUCATIONBachelors -0.4099657 0.0937765 -4.372 1.23e-05 ***
## EDUCATIONHigh School 0.0316676 0.0815008 0.389 0.69760
## EDUCATIONMasters -0.3883805 0.1405542 -2.763 0.00572 **
## EDUCATIONPhD 0.0899442 0.1793282 0.502 0.61598
## JOBClerical 0.1562900 0.0929121 1.682 0.09254 .
## JOBDoctor -0.7313044 0.2315349 -3.159 0.00159 **
## JOBHome Maker -0.3702797 0.1438399 -2.574 0.01005 *
## JOBLawyer 0.0494213 0.1579254 0.313 0.75433
## JOBManager -0.8449111 0.1177685 -7.174 7.27e-13 ***
## JOBProfessional -0.0861123 0.1030640 -0.836 0.40342
## JOBStudent -0.4403109 0.1346563 -3.270 0.00108 **
## TRAVTIME 0.0179437 0.0016589 10.817 < 2e-16 ***
## CAR_USEPrivate -0.8879993 0.0809929 -10.964 < 2e-16 ***
## BLUEBOOK -0.0055264 0.0010282 -5.375 7.67e-08 ***
## TIF -0.0539370 0.0062996 -8.562 < 2e-16 ***
## CAR_TYPEPanel Truck 0.6139653 0.1429750 4.294 1.75e-05 ***
## CAR_TYPEPickup 0.6811846 0.0857378 7.945 1.94e-15 ***
## CAR_TYPESports Car 1.1756992 0.1074073 10.946 < 2e-16 ***
## CAR_TYPESUV 0.9757538 0.0906959 10.759 < 2e-16 ***
## CAR_TYPEVan 0.6467599 0.1096442 5.899 3.66e-09 ***
## RED_CARyes -0.2119145 0.0774130 -2.737 0.00619 **
## OLDCLAIM 0.0258938 0.0114097 2.269 0.02324 *
## CLM_FREQ 0.1183324 0.0399421 2.963 0.00305 **
## REVOKEDYes 0.6255949 0.0753110 8.307 < 2e-16 ***
## MVR_PTS 0.1072780 0.0128434 8.353 < 2e-16 ***
## URBANICITYHighly Urban/ Urban 2.1683332 0.0832228 26.055 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12315.8 on 8883 degrees of freedom
## Residual deviance: 9263.2 on 8851 degrees of freedom
## AIC: 9329.2
##
## Number of Fisher Scoring iterations: 4
Even with the step() function, we were able to improve the fit of the model only minimally. This stepwise method combines forward selection and backward elimination to arrive at the formula with the lowest AIC. We had hoped that this method would improve our performance substantially.
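As a minimal sketch of how this kind of stepwise search behaves, the example below runs step() on simulated data rather than on the insurance data. The data frame `sim_df`, its columns, and the coefficients used to simulate it are assumptions made purely for illustration.

```r
# Simulated data (hypothetical): two informative predictors and one noise column
set.seed(1)
n <- 500
sim_df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim_df$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * sim_df$x1 - 1.0 * sim_df$x2))

# Start from the full logistic model and let step() drop or re-add terms by AIC
full_fit <- glm(y ~ x1 + x2 + noise, data = sim_df, family = "binomial")
step_fit <- step(full_fit, direction = "both", trace = 0)

formula(step_fit)                                   # terms kept by the search
c(full = AIC(full_fit), stepwise = AIC(step_fit))   # stepwise AIC is never higher
```

Because step() only moves to a new formula when the AIC decreases, the selected model can match the starting model but never has a higher AIC, which is consistent with the small improvement seen above.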
logit_3 <- glm(data = logit_df, TARGET_FLAG ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME +
BLUEBOOK + CAR_TYPE + MVR_PTS + URBANICITY + TIF + AGE)
summary(logit_3)
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + HOME_VAL + TRAVTIME +
## BLUEBOOK + CAR_TYPE + MVR_PTS + URBANICITY + TIF + AGE, data = logit_df)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.637e-01 3.335e-02 10.906 < 2e-16 ***
## KIDSDRIV 7.173e-02 8.822e-03 8.131 4.84e-16 ***
## INCOME -5.410e-04 5.083e-05 -10.644 < 2e-16 ***
## HOME_VAL -1.034e-02 8.456e-04 -12.231 < 2e-16 ***
## TRAVTIME 3.178e-03 2.985e-04 10.647 < 2e-16 ***
## BLUEBOOK -1.174e-03 1.742e-04 -6.739 1.69e-11 ***
## CAR_TYPEPanel Truck 2.402e-01 2.329e-02 10.314 < 2e-16 ***
## CAR_TYPEPickup 1.915e-01 1.473e-02 12.997 < 2e-16 ***
## CAR_TYPESports Car 1.937e-01 1.656e-02 11.697 < 2e-16 ***
## CAR_TYPESUV 1.683e-01 1.301e-02 12.936 < 2e-16 ***
## CAR_TYPEVan 1.833e-01 1.946e-02 9.421 < 2e-16 ***
## MVR_PTS 3.300e-02 2.050e-03 16.096 < 2e-16 ***
## URBANICITYHighly Urban/ Urban 3.866e-01 1.296e-02 29.836 < 2e-16 ***
## TIF -9.643e-03 1.150e-03 -8.388 < 2e-16 ***
## AGE -2.919e-03 5.297e-04 -5.510 3.70e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1948906)
##
## Null deviance: 2221.0 on 8883 degrees of freedom
## Residual deviance: 1728.5 on 8869 degrees of freedom
## AIC: 10701
##
## Number of Fisher Scoring iterations: 2
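We wanted to test whether manually selecting features from the first model would improve or worsen the model's performance. In this case the AIC (10701) is higher, which suggests a worse fit. Note, however, that the call above omits family = "binomial", so glm() fitted a Gaussian model (see the dispersion line in the summary), and its AIC is not directly comparable with the AICs of the binomial models. For the next model, we use caret's rfe() function, configured through rfeControl(), to select the features.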
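The summary of logit_3 reports a dispersion parameter "for gaussian family", which is what glm() produces when the family argument is left out. The sketch below illustrates that default on simulated data; the data frame `toy` and both model names are hypothetical and not part of the original analysis.

```r
# Simulated binary outcome (hypothetical data, for illustration only)
set.seed(2)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, 1, plogis(toy$x))

fit_default <- glm(y ~ x, data = toy)                       # no family given
fit_logit   <- glm(y ~ x, data = toy, family = "binomial")  # logistic regression

family(fit_default)$family  # "gaussian"
family(fit_logit)$family    # "binomial"
```

The two fits use different likelihoods, so their AIC values sit on different scales and should not be ranked against each other.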
control <- rfeControl(functions = rfFuncs, # random forest
method = "repeatedcv", # repeated cv
repeats = 2, # number of repeats
number = 5) # number of folds
x <- logit_df %>%
select(-TARGET_FLAG) %>%
as.data.frame()
# Target variable
y <- logit_df$TARGET_FLAG
# Training: 80%; Test: 20%
set.seed(2021)
inTrain <- createDataPartition(y, p = 0.8, list = FALSE)[, 1]
x_train <- x[inTrain, ]
x_test <- x[-inTrain, ]
y_train <- y[inTrain]
y_test <- y[-inTrain]
result_rfe1 <- rfe(x = x_train, y = y_train, sizes = c(1:4), rfeControl = control)
# Print the results
result_rfe1
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (5 fold, repeated 2 times)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 1 0.4814 0.07398 0.4633 0.002359 0.009475 0.002309
## 2 0.4777 0.08839 0.4583 0.002835 0.011716 0.002608
## 3 0.4592 0.16488 0.4353 0.003360 0.014769 0.004615
## 4 0.4384 0.25552 0.4114 0.003688 0.017496 0.003971
## 23 0.3034 0.66634 0.2330 0.003646 0.009488 0.004031 *
##
## The top 5 variables (out of 23):
## URBANICITY, TRAVTIME, BLUEBOOK, HOME_VAL, AGE
logit_4 <- glm(TARGET_FLAG ~ URBANICITY + TRAVTIME + BLUEBOOK + HOME_VAL + INCOME,
data = logit_df, family = "binomial")
summary(logit_4)
##
## Call:
## glm(formula = TARGET_FLAG ~ URBANICITY + TRAVTIME + BLUEBOOK +
## HOME_VAL + INCOME, family = "binomial", data = logit_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4248050 0.1177954 -3.606 0.000311 ***
## URBANICITYHighly Urban/ Urban 2.1098042 0.0760672 27.736 < 2e-16 ***
## TRAVTIME 0.0152592 0.0014881 10.254 < 2e-16 ***
## BLUEBOOK -0.0063802 0.0007542 -8.460 < 2e-16 ***
## HOME_VAL -0.0561934 0.0041394 -13.575 < 2e-16 ***
## INCOME -0.0032572 0.0002538 -12.831 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12316 on 8883 degrees of freedom
## Residual deviance: 10828 on 8878 degrees of freedom
## AIC: 10840
##
## Number of Fisher Scoring iterations: 4
The rfe() run reported that using all 23 variables would give the best-performing model; however, our hardware limited how far we could run that function. Moreover, even a model built on the top-ranked features (with INCOME used in place of AGE) yielded a relatively high AIC of 10840.
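The AIC values being compared can be checked by hand. For a 0/1 response the saturated model has zero log-likelihood, so the AIC of a binomial glm() equals the residual deviance plus twice the number of estimated coefficients. The sketch below applies this to the figures printed in the summaries above; the helper name `aic_from_deviance` is hypothetical.

```r
# AIC for a binary-response logistic model: residual deviance + 2 * (number of coefficients)
aic_from_deviance <- function(deviance, n_coef) deviance + 2 * n_coef

# Number of coefficients = null df - residual df + 1 (the intercept)
stepwise_aic <- aic_from_deviance(9263.2, 8883 - 8851 + 1)  # 33 coefficients
logit_4_aic  <- aic_from_deviance(10828,  8883 - 8878 + 1)  # 6 coefficients

stepwise_aic  # 9329.2
logit_4_aic   # 10840
```

The gap between the two models comes almost entirely from the residual deviance; the penalty term differs by only 2 * (33 - 6) = 54.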
After building and testing multiple models and trying out new feature selection techniques, we propose for this assignment Linear Model 2, with an adjusted R-squared of 0.2253, and Logistic Model 2, with an AIC of 9396.9, our lowest among the models. Many other models did not make the cut for this report.