training <- read.csv('/Users/jordanglendrange/Documents/Data 621/insurance_training_data.csv')
head(training)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1
## 1 1 0 0 0 60 0 11 $67,349 No
## 2 2 0 0 0 43 0 11 $91,449 No
## 3 4 0 0 0 35 1 10 $16,039 No
## 4 5 0 0 0 51 0 14 No
## 5 6 0 0 0 50 0 NA $114,986 No
## 6 7 1 2946 0 34 1 12 $125,301 Yes
## HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK
## 1 $0 z_No M PhD Professional 14 Private $14,230
## 2 $257,252 z_No M z_High School z_Blue Collar 22 Commercial $14,940
## 3 $124,191 Yes z_F z_High School Clerical 5 Private $4,010
## 4 $306,251 Yes M <High School z_Blue Collar 32 Private $15,440
## 5 $243,925 Yes z_F PhD Doctor 36 Private $18,000
## 6 $0 z_No z_F Bachelors z_Blue Collar 46 Commercial $17,430
## TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1 11 Minivan yes $4,461 2 No 3 18
## 2 1 Minivan yes $0 0 No 0 1
## 3 4 z_SUV no $38,690 2 No 3 10
## 4 7 Minivan yes $0 0 No 0 6
## 5 1 z_SUV no $19,217 2 Yes 3 17
## 6 1 Sports Car no $0 0 No 0 7
## URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 Highly Urban/ Urban
## 4 Highly Urban/ Urban
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban
test <- read.csv('/Users/jordanglendrange/Documents/Data 621/insurance-evaluation-data.csv')
head(test)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1
## 1 3 NA NA 0 48 0 11 $52,881 No
## 2 9 NA NA 1 40 1 11 $50,815 Yes
## 3 10 NA NA 0 44 2 12 $43,486 Yes
## 4 18 NA NA 0 35 2 NA $21,204 Yes
## 5 21 NA NA 0 59 0 12 $87,460 No
## 6 30 NA NA 0 46 0 14 No
## HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK
## 1 $0 z_No M Bachelors Manager 26 Private $21,970
## 2 $0 z_No M z_High School Manager 21 Private $18,930
## 3 $0 z_No z_F z_High School z_Blue Collar 30 Commercial $5,900
## 4 $0 z_No M z_High School Clerical 74 Private $9,230
## 5 $0 z_No M z_High School Manager 45 Private $15,420
## 6 $207,519 Yes M Bachelors Professional 7 Commercial $25,660
## TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1 1 Van yes $0 0 No 2 10
## 2 6 Minivan no $3,295 1 No 2 1
## 3 10 z_SUV no $0 0 No 0 10
## 4 6 Pickup no $0 0 Yes 0 4
## 5 1 Minivan yes $44,857 2 No 4 1
## 6 1 Panel Truck no $2,119 1 No 2 12
## URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 z_Highly Rural/ Rural
## 4 z_Highly Rural/ Rural
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban
First, let’s check the dimensions of the table.
dim(training)
## [1] 8161 26
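Next, let’s check the summary of the data set. Interestingly, the median of TARGET_AMT is 0, which means at least half of the values are 0. We may need to filter out those rows, which we will confirm later. Note also that INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM are read as character columns because of their currency formatting, and that AGE, YOJ and CAR_AGE contain missing values.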
summary(training)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median :0.0000 Median : 0 Median :0.0000
## Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Length:8161
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 Class :character
## Median :45.00 Median :0.0000 Median :11.0 Mode :character
## Mean :44.79 Mean :0.7212 Mean :10.5
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0
## Max. :81.00 Max. :5.0000 Max. :23.0
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX
## Length:8161 Length:8161 Length:8161 Length:8161
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## EDUCATION JOB TRAVTIME CAR_USE
## Length:8161 Length:8161 Min. : 5.00 Length:8161
## Class :character Class :character 1st Qu.: 22.00 Class :character
## Mode :character Mode :character Median : 33.00 Mode :character
## Mean : 33.49
## 3rd Qu.: 44.00
## Max. :142.00
##
## BLUEBOOK TIF CAR_TYPE RED_CAR
## Length:8161 Min. : 1.000 Length:8161 Length:8161
## Class :character 1st Qu.: 1.000 Class :character Class :character
## Mode :character Median : 4.000 Mode :character Mode :character
## Mean : 5.351
## 3rd Qu.: 7.000
## Max. :25.000
##
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## Length:8161 Min. :0.0000 Length:8161 Min. : 0.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.: 0.000
## Mode :character Median :0.0000 Mode :character Median : 1.000
## Mean :0.7986 Mean : 1.696
## 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :5.0000 Max. :13.000
##
## CAR_AGE URBANICITY
## Min. :-3.000 Length:8161
## 1st Qu.: 1.000 Class :character
## Median : 8.000 Mode :character
## Mean : 8.328
## 3rd Qu.:12.000
## Max. :28.000
## NA's :510
So roughly 74% of the rows have a target amount of 0, since the mean of TARGET_FLAG shows that only about 26% of rows involve a crash. Interesting.
hist(training$TARGET_AMT)
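The share of zero-valued rows can also be computed directly instead of being read off the summary. This is a minimal sketch on a made-up vector; on the real data the same expression would be applied to training$TARGET_AMT.

```r
# Toy stand-in for training$TARGET_AMT (values are made up for illustration).
target_amt <- c(0, 0, 0, 2946, 0, 0, 1200, 0)

# A logical vector averages to the proportion of TRUE values.
share_zero <- mean(target_amt == 0)
share_zero
```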
Based on the correlation plot produced below, TARGET_FLAG is most strongly correlated with CLM_FREQ and MVR_PTS.
library("Hmisc")
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
library("corrplot")
## corrplot 0.92 loaded
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
training_cor <- cor(select_if(training, is.numeric))
corrplot(training_cor)
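Because AGE, YOJ and CAR_AGE contain missing values, cor() with its default settings returns NA for every correlation involving those columns. The sketch below, on a small made-up data frame, shows how pairwise deletion avoids that; whether pairwise deletion is appropriate for this data set is an assumption, not something the analysis above establishes.

```r
# Toy numeric data with one missing value (made up for illustration).
toy <- data.frame(
  TARGET_FLAG = c(0, 1, 0, 1, 0, 1),
  CLM_FREQ    = c(0, 2, 0, 3, 1, 2),
  CAR_AGE     = c(18, 1, NA, 6, 17, 7)
)

# Default behaviour: any column containing NA yields NA correlations.
cor_default <- cor(toy)

# Pairwise deletion uses every pair of rows where both values are present.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
cor_pairwise
```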
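Now we are going to start creating dummy variables. Each level is encoded with ifelse and the original column is then dropped. INDEX is dropped at the same time because it is only a row identifier.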
unique(training$PARENT1)
## [1] "No" "Yes"
training$PARENT_Yes <- ifelse(training$PARENT1 == "Yes", 1, 0)
training <- select(training,-c('INDEX', 'PARENT1'))
The same is done for MSTATUS and SEX.
training$MSTATUS_Yes <- ifelse(training$MSTATUS == "Yes", 1, 0)
training <- select(training,-c('MSTATUS'))
training$SEX_m <- ifelse(training$SEX == "M", 1, 0)
training <- select(training,-c('SEX'))
head(training)
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME HOME_VAL
## 1 0 0 0 60 0 11 $67,349 $0
## 2 0 0 0 43 0 11 $91,449 $257,252
## 3 0 0 0 35 1 10 $16,039 $124,191
## 4 0 0 0 51 0 14 $306,251
## 5 0 0 0 50 0 NA $114,986 $243,925
## 6 1 2946 0 34 1 12 $125,301 $0
## EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE
## 1 PhD Professional 14 Private $14,230 11 Minivan
## 2 z_High School z_Blue Collar 22 Commercial $14,940 1 Minivan
## 3 z_High School Clerical 5 Private $4,010 4 z_SUV
## 4 <High School z_Blue Collar 32 Private $15,440 7 Minivan
## 5 PhD Doctor 36 Private $18,000 1 z_SUV
## 6 Bachelors z_Blue Collar 46 Commercial $17,430 1 Sports Car
## RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE URBANICITY
## 1 yes $4,461 2 No 3 18 Highly Urban/ Urban
## 2 yes $0 0 No 0 1 Highly Urban/ Urban
## 3 no $38,690 2 No 3 10 Highly Urban/ Urban
## 4 yes $0 0 No 0 6 Highly Urban/ Urban
## 5 no $19,217 2 Yes 3 17 Highly Urban/ Urban
## 6 no $0 0 No 0 7 Highly Urban/ Urban
## PARENT_Yes MSTATUS_Yes SEX_m
## 1 0 0 1
## 2 0 0 1
## 3 0 1 0
## 4 0 1 1
## 5 0 1 0
## 6 1 0 0
training$EDUCTATION_phd <- ifelse(training$EDUCATION == "PhD", 1, 0)
training$EDUCTATION_hs <- ifelse(training$EDUCATION == "z_High School", 1, 0)
training$EDUCTATION_b <- ifelse(training$EDUCATION == "Bachelors", 1, 0)
training$EDUCTATION_m <- ifelse(training$EDUCATION == "Masters", 1, 0)
training <- select(training,-c('EDUCATION'))
training$JOB_pro <- ifelse(training$JOB == "Professional", 1, 0)
training$JOB_bc <- ifelse(training$JOB == "z_Blue Collar", 1, 0)
training$JOB_cler <- ifelse(training$JOB == "Clerical", 1, 0)
training$JOB_doc <- ifelse(training$JOB == "Doctor", 1, 0)
training$JOB_law <- ifelse(training$JOB == "Lawyer", 1, 0)
training$JOB_mg <- ifelse(training$JOB == "Manager", 1, 0)
training$JOB_hm <- ifelse(training$JOB == "Home Maker", 1, 0)
training$JOB_st <- ifelse(training$JOB == "Student", 1, 0)
training <- select(training,-c('JOB'))
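The repeated ifelse calls above leave one level out of each variable as the baseline (for EDUCATION, "<High School"). As an alternative sketch, not the approach used in this analysis, base R's model.matrix() can generate the same kind of 0/1 columns in one step. The toy data frame and the level order below are assumptions chosen to mirror the values seen in the head() output.

```r
# Toy stand-in for the EDUCATION column (values taken from the head() output).
toy <- data.frame(
  EDUCATION = c("PhD", "z_High School", "<High School", "Bachelors", "Masters"),
  stringsAsFactors = FALSE
)

# Set the level order explicitly so "<High School" is the baseline level.
toy$EDUCATION <- factor(
  toy$EDUCATION,
  levels = c("<High School", "z_High School", "Bachelors", "Masters", "PhD")
)

# model.matrix() expands the factor into 0/1 columns and omits the baseline;
# the intercept column is removed with [, -1].
dummies <- model.matrix(~ EDUCATION, data = toy)[, -1, drop = FALSE]
dummies
```

The baseline row ("<High School") is all zeros, which matches what the hand-written dummies produce for that level.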
head(training)
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME HOME_VAL TRAVTIME
## 1 0 0 0 60 0 11 $67,349 $0 14
## 2 0 0 0 43 0 11 $91,449 $257,252 22
## 3 0 0 0 35 1 10 $16,039 $124,191 5
## 4 0 0 0 51 0 14 $306,251 32
## 5 0 0 0 50 0 NA $114,986 $243,925 36
## 6 1 2946 0 34 1 12 $125,301 $0 46
## CAR_USE BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1 Private $14,230 11 Minivan yes $4,461 2 No 3
## 2 Commercial $14,940 1 Minivan yes $0 0 No 0
## 3 Private $4,010 4 z_SUV no $38,690 2 No 3
## 4 Private $15,440 7 Minivan yes $0 0 No 0
## 5 Private $18,000 1 z_SUV no $19,217 2 Yes 3
## 6 Commercial $17,430 1 Sports Car no $0 0 No 0
## CAR_AGE URBANICITY PARENT_Yes MSTATUS_Yes SEX_m EDUCTATION_phd
## 1 18 Highly Urban/ Urban 0 0 1 1
## 2 1 Highly Urban/ Urban 0 0 1 0
## 3 10 Highly Urban/ Urban 0 1 0 0
## 4 6 Highly Urban/ Urban 0 1 1 0
## 5 17 Highly Urban/ Urban 0 1 0 1
## 6 7 Highly Urban/ Urban 1 0 0 0
## EDUCTATION_hs EDUCTATION_b EDUCTATION_m JOB_pro JOB_bc JOB_cler JOB_doc
## 1 0 0 0 1 0 0 0
## 2 1 0 0 0 1 0 0
## 3 1 0 0 0 0 1 0
## 4 0 0 0 0 1 0 0
## 5 0 0 0 0 0 0 1
## 6 0 1 0 0 1 0 0
## JOB_law JOB_mg JOB_hm JOB_st
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
unique(training$CAR_USE)
## [1] "Private" "Commercial"
training$CAR_priv <- ifelse(training$CAR_USE == "Private", 1, 0)
training <- select(training,-c('CAR_USE'))
training$CAR_TYPE_mini <- ifelse(training$CAR_TYPE == "Minivan", 1, 0)
training$CAR_TYPE_suv <- ifelse(training$CAR_TYPE == "z_SUV", 1, 0)
training$CAR_TYPE_sc <- ifelse(training$CAR_TYPE == "Sports Car", 1, 0)
training$CAR_TYPE_van <- ifelse(training$CAR_TYPE == "Van", 1, 0)
training$CAR_TYPE_pt <- ifelse(training$CAR_TYPE == "Panel Truck", 1, 0)
training <- select(training,-c('CAR_TYPE'))
training$RED_CAR_y <- ifelse(training$RED_CAR == "Yes", 1, 0)
training <- select(training,-c('RED_CAR'))
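The head() output shows RED_CAR stored in lowercase ("yes"/"no"), so a comparison against "Yes" never matches and the dummy is 0 for every row. That is consistent with the NA coefficient reported for RED_CAR_y in the model summaries. A minimal sketch of the check, using the first six values from head():

```r
# First six RED_CAR values as shown in head(training).
red_car <- c("yes", "yes", "no", "yes", "no", "no")

# Inspect the actual levels before writing the comparison.
unique(red_car)

# Comparing against "Yes" matches nothing, so the dummy is constant.
dummy_wrong <- ifelse(red_car == "Yes", 1, 0)

# Normalising the case first makes the comparison robust.
dummy_fixed <- ifelse(tolower(red_car) == "yes", 1, 0)

rbind(dummy_wrong, dummy_fixed)
```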
training$REVOKED <- ifelse(training$REVOKED == "Yes", 1, 0)
training$URBANICITY <- ifelse(training$URBANICITY == "Highly Urban/ Urban", 1, 0)
library(readr)
training$INCOME <- parse_number(training$INCOME)
training$HOME_VAL <- parse_number(training$HOME_VAL)
training$BLUEBOOK <- parse_number(training$BLUEBOOK)
training$OLDCLAIM <- parse_number(training$OLDCLAIM)
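For reference, the same currency clean-up can be sketched in base R. This is an illustrative equivalent for the "$67,349" format only, not a general replacement for parse_number. It also makes visible that blank entries, such as INCOME in row 4, become NA, and rows with NA predictors are later dropped by lm.

```r
# First five INCOME values from head(training); row 4 is blank.
income <- c("$67,349", "$91,449", "$16,039", "", "$114,986")

# Strip the dollar sign and thousands separators, then convert.
income_num <- as.numeric(gsub("[$,]", "", income))
income_num

# Blank strings have nothing to convert and come back as NA.
sum(is.na(income_num))
```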
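The first model uses a single predictor, AGE, and is fit only on the rows where a crash occurred (TARGET_FLAG == 1). As we would expect, the R^2 is extremely low (0.0008) and AGE is not significant (p = 0.196), so there isn’t much we can do with this model.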
training_crash <-training %>% filter(TARGET_FLAG == 1)
model_age <- lm(TARGET_AMT~AGE,data=training_crash)
summary(model_age)
##
## Call:
## lm(formula = TARGET_AMT ~ AGE, data = training_crash)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5834 -3090 -1552 74 101706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4727.30 775.49 6.096 1.29e-09 ***
## AGE 22.60 17.49 1.292 0.196
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7750 on 2146 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.0007774, Adjusted R-squared: 0.0003118
## F-statistic: 1.67 on 1 and 2146 DF, p-value: 0.1965
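Here I included all the variables. R^2 rises to 0.0425 (adjusted 0.0218), which is higher than the age-only model, as expected, but still low. The RED_CAR_y coefficient is NA because that column is 0 for every row: RED_CAR is stored as lowercase "yes"/"no", so the comparison against "Yes" never matches.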
model_all <- lm (TARGET_AMT~KIDSDRIV+AGE+HOMEKIDS+YOJ+INCOME+HOME_VAL+TRAVTIME+BLUEBOOK+TIF+OLDCLAIM+CLM_FREQ+REVOKED+MVR_PTS+CAR_AGE+URBANICITY+PARENT_Yes+MSTATUS_Yes+SEX_m+EDUCTATION_phd+EDUCTATION_hs+EDUCTATION_b+EDUCTATION_m+JOB_pro+JOB_bc+JOB_cler+JOB_doc+JOB_law+JOB_mg+JOB_hm+JOB_st+CAR_priv+CAR_TYPE_mini+CAR_TYPE_suv+CAR_TYPE_sc+CAR_TYPE_van+CAR_TYPE_pt+RED_CAR_y, data=training_crash)
summary(model_all)
##
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME +
## HOME_VAL + TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ +
## REVOKED + MVR_PTS + CAR_AGE + URBANICITY + PARENT_Yes + MSTATUS_Yes +
## SEX_m + EDUCTATION_phd + EDUCTATION_hs + EDUCTATION_b + EDUCTATION_m +
## JOB_pro + JOB_bc + JOB_cler + JOB_doc + JOB_law + JOB_mg +
## JOB_hm + JOB_st + CAR_priv + CAR_TYPE_mini + CAR_TYPE_suv +
## CAR_TYPE_sc + CAR_TYPE_van + CAR_TYPE_pt + RED_CAR_y, data = training_crash)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9689 -3151 -1457 551 76305
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.651e+03 2.140e+03 1.706 0.0882 .
## KIDSDRIV -1.762e+02 3.555e+02 -0.496 0.6203
## AGE 9.854e-01 2.349e+01 0.042 0.9665
## HOMEKIDS 2.758e+02 2.294e+02 1.202 0.2295
## YOJ 2.020e+01 5.459e+01 0.370 0.7113
## INCOME -1.507e-02 7.820e-03 -1.927 0.0542 .
## HOME_VAL 2.225e-03 2.267e-03 0.981 0.3267
## TRAVTIME 3.955e+00 1.234e+01 0.321 0.7486
## BLUEBOOK 1.492e-01 3.373e-02 4.425 1.03e-05 ***
## TIF -5.918e+00 4.694e+01 -0.126 0.8997
## OLDCLAIM 4.994e-02 2.527e-02 1.976 0.0483 *
## CLM_FREQ -2.052e+02 1.748e+02 -1.174 0.2407
## REVOKED -1.256e+03 5.849e+02 -2.147 0.0319 *
## MVR_PTS 8.821e+01 7.558e+01 1.167 0.2434
## CAR_AGE -9.675e+01 4.869e+01 -1.987 0.0471 *
## URBANICITY 5.887e+01 8.181e+02 0.072 0.9426
## PARENT_Yes -9.808e+01 6.467e+02 -0.152 0.8795
## MSTATUS_Yes -1.385e+03 5.661e+02 -2.446 0.0145 *
## SEX_m 1.649e+03 6.518e+02 2.530 0.0115 *
## EDUCTATION_phd 3.115e+03 1.478e+03 2.108 0.0352 *
## EDUCTATION_hs -6.890e+02 5.649e+02 -1.220 0.2228
## EDUCTATION_b 1.574e+02 7.143e+02 0.220 0.8257
## EDUCTATION_m 8.165e+02 1.250e+03 0.653 0.5138
## JOB_pro 1.271e+03 1.279e+03 0.994 0.3204
## JOB_bc 5.032e+02 1.302e+03 0.387 0.6992
## JOB_cler -6.284e+02 1.362e+03 -0.462 0.6445
## JOB_doc -3.305e+03 1.867e+03 -1.770 0.0769 .
## JOB_law -1.396e+02 1.146e+03 -0.122 0.9030
## JOB_mg -1.300e+03 1.221e+03 -1.065 0.2871
## JOB_hm -5.308e+02 1.445e+03 -0.367 0.7134
## JOB_st -5.548e+02 1.472e+03 -0.377 0.7062
## CAR_priv -2.774e+02 5.848e+02 -0.474 0.6353
## CAR_TYPE_mini -2.989e+02 6.625e+02 -0.451 0.6520
## CAR_TYPE_suv 1.378e+03 7.506e+02 1.836 0.0665 .
## CAR_TYPE_sc 1.664e+03 8.436e+02 1.972 0.0488 *
## CAR_TYPE_van -5.303e+02 8.228e+02 -0.644 0.5194
## CAR_TYPE_pt -5.811e+02 1.006e+03 -0.578 0.5637
## RED_CAR_y NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7585 on 1666 degrees of freedom
## (450 observations deleted due to missingness)
## Multiple R-squared: 0.0425, Adjusted R-squared: 0.0218
## F-statistic: 2.054 on 36 and 1666 DF, p-value: 0.0002527
Now we are going to create a binary logistic model for TARGET_FLAG using glm, with the same variables as before.
# family = binomial is required for logistic regression; glm defaults to gaussian
model_age_g <- glm(TARGET_FLAG~AGE, family = binomial, data=training)
# summarise the glm fit (not the earlier lm object model_age)
summary(model_age_g)
model_all_g <- glm (TARGET_FLAG~KIDSDRIV+AGE+HOMEKIDS+YOJ+INCOME+HOME_VAL+TRAVTIME+BLUEBOOK+TIF+OLDCLAIM+CLM_FREQ+REVOKED+MVR_PTS+CAR_AGE+URBANICITY+PARENT_Yes+MSTATUS_Yes+SEX_m+EDUCTATION_phd+EDUCTATION_hs+EDUCTATION_b+EDUCTATION_m+JOB_pro+JOB_bc+JOB_cler+JOB_doc+JOB_law+JOB_mg+JOB_hm+JOB_st+CAR_priv+CAR_TYPE_mini+CAR_TYPE_suv+CAR_TYPE_sc+CAR_TYPE_van+CAR_TYPE_pt+RED_CAR_y, data=training_crash)
summary(model_all_g)
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ +
## INCOME + HOME_VAL + TRAVTIME + BLUEBOOK + TIF + OLDCLAIM +
## CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY + PARENT_Yes +
## MSTATUS_Yes + SEX_m + EDUCTATION_phd + EDUCTATION_hs + EDUCTATION_b +
## EDUCTATION_m + JOB_pro + JOB_bc + JOB_cler + JOB_doc + JOB_law +
## JOB_mg + JOB_hm + JOB_st + CAR_priv + CAR_TYPE_mini + CAR_TYPE_suv +
## CAR_TYPE_sc + CAR_TYPE_van + CAR_TYPE_pt + RED_CAR_y, data = training_crash)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## 1.565e-14 2.443e-14 2.687e-14 2.931e-14 4.896e-14
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+00 7.804e-15 1.281e+14 < 2e-16 ***
## KIDSDRIV 2.485e-16 1.296e-15 1.920e-01 0.84804
## AGE 6.243e-17 8.567e-17 7.290e-01 0.46627
## HOMEKIDS 8.012e-16 8.367e-16 9.580e-01 0.33840
## YOJ -7.480e-17 1.991e-16 -3.760e-01 0.70717
## INCOME -7.754e-20 2.852e-20 -2.719e+00 0.00661 **
## HOME_VAL 1.551e-20 8.269e-21 1.876e+00 0.06081 .
## TRAVTIME -3.393e-17 4.499e-17 -7.540e-01 0.45095
## BLUEBOOK -5.310e-20 1.230e-19 -4.320e-01 0.66605
## TIF 1.706e-16 1.712e-16 9.970e-01 0.31911
## OLDCLAIM -1.094e-20 9.218e-20 -1.190e-01 0.90551
## CLM_FREQ 4.597e-16 6.376e-16 7.210e-01 0.47104
## REVOKED 1.085e-15 2.133e-15 5.090e-01 0.61096
## MVR_PTS 2.605e-16 2.756e-16 9.450e-01 0.34469
## CAR_AGE 1.118e-16 1.775e-16 6.300e-01 0.52884
## URBANICITY -8.136e-16 2.983e-15 -2.730e-01 0.78512
## PARENT_Yes -4.330e-15 2.359e-15 -1.836e+00 0.06656 .
## MSTATUS_Yes -2.734e-15 2.064e-15 -1.324e+00 0.18561
## SEX_m -9.966e-16 2.377e-15 -4.190e-01 0.67508
## EDUCTATION_phd 1.368e-15 5.389e-15 2.540e-01 0.79962
## EDUCTATION_hs 1.261e-15 2.060e-15 6.120e-01 0.54058
## EDUCTATION_b -2.014e-15 2.605e-15 -7.730e-01 0.43956
## EDUCTATION_m -9.558e-16 4.559e-15 -2.100e-01 0.83397
## JOB_pro -5.933e-16 4.665e-15 -1.270e-01 0.89880
## JOB_bc -3.598e-15 4.747e-15 -7.580e-01 0.44860
## JOB_cler -3.747e-15 4.965e-15 -7.550e-01 0.45052
## JOB_doc -3.634e-16 6.809e-15 -5.300e-02 0.95744
## JOB_law -8.323e-16 4.178e-15 -1.990e-01 0.84211
## JOB_mg 5.302e-16 4.452e-15 1.190e-01 0.90523
## JOB_hm -4.896e-15 5.269e-15 -9.290e-01 0.35293
## JOB_st -4.230e-15 5.367e-15 -7.880e-01 0.43072
## CAR_priv 2.373e-15 2.133e-15 1.113e+00 0.26592
## CAR_TYPE_mini -1.293e-16 2.416e-15 -5.400e-02 0.95733
## CAR_TYPE_suv -1.601e-15 2.737e-15 -5.850e-01 0.55875
## CAR_TYPE_sc -6.518e-15 3.076e-15 -2.119e+00 0.03425 *
## CAR_TYPE_van 1.575e-15 3.001e-15 5.250e-01 0.59977
## CAR_TYPE_pt 2.905e-15 3.669e-15 7.920e-01 0.42866
## RED_CAR_y NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 7.650813e-28)
##
## Null deviance: 0.0000e+00 on 1702 degrees of freedom
## Residual deviance: 1.2746e-24 on 1666 degrees of freedom
## (450 observations deleted due to missingness)
## AIC: -101460
##
## Number of Fisher Scoring iterations: 1
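The output above reports a gaussian family and was fit on training_crash, where TARGET_FLAG is always 1, so it is not a logistic regression. Below is a minimal sketch of what a binomial fit looks like, on simulated data. The predictors, coefficients and sample size are made up for illustration; on the real data the call would use the full training set rather than the crash subset.

```r
set.seed(1)

# Simulated stand-in data: both outcome classes must be present.
n <- 200
toy <- data.frame(
  MVR_PTS = rpois(n, 2),
  AGE     = round(runif(n, 18, 80))
)
p <- plogis(-1 + 0.4 * toy$MVR_PTS - 0.01 * toy$AGE)
toy$TARGET_FLAG <- rbinom(n, 1, p)

# family = binomial gives logistic regression (logit link by default).
fit <- glm(TARGET_FLAG ~ MVR_PTS + AGE, family = binomial, data = toy)
summary(fit)$coefficients

# Predicted crash probabilities lie strictly between 0 and 1.
prob <- predict(fit, type = "response")
range(prob)
```

With a binomial family the summary reports z values and deviances on the logit scale, and AIC becomes a meaningful basis for comparing candidate models.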
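For the purposes of the assignment we are going to pick the models where we used all of the variables. The full linear model has a higher R^2 than the age-only model (0.0425 versus 0.0008), although both are low. The full glm fit (model_all_g) is not yet a usable logistic model: it was fit on training_crash, where TARGET_FLAG is always 1, and without family = binomial, so glm defaulted to the gaussian family. That is why the intercept is exactly 1 and every other coefficient is on the order of 1e-15. It needs to be refit on the full training set with family = binomial before it can be evaluated. Now we need to validate the linear model by checking its residuals.
Checking the residuals shows that the model in its current state should not be used. The residuals are strongly right-skewed (median -1457, maximum 76305), so they are far from normal. We need to go back and adjust the variables we use.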
plot(model_all$residuals)
qqnorm(model_all$residuals)
qqline(model_all$residuals)
hist(model_all$residuals)
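The plots can be complemented with a numeric measure of skew. The sketch below uses simulated, strictly positive, right-skewed data to show how residual skewness changes when the response is log-transformed. The data, the skewness helper and the choice of a log transform are all assumptions for illustration; this is one possible adjustment, not a result from this data set.

```r
set.seed(1)

# Simulated stand-ins: x plays the role of a predictor such as BLUEBOOK,
# y is a strictly positive, right-skewed response.
n <- 500
x <- runif(n, 1000, 40000)
y <- exp(7 + 0.00002 * x + rnorm(n))

# Simple moment-based sample skewness (hypothetical helper).
skewness <- function(r) mean((r - mean(r))^3) / sd(r)^3

fit_raw <- lm(y ~ x)       # response on its original scale
fit_log <- lm(log(y) ~ x)  # response on the log scale

skew_raw <- skewness(residuals(fit_raw))
skew_log <- skewness(residuals(fit_log))
c(raw = skew_raw, log = skew_log)
```

A skewness near 0 indicates roughly symmetric residuals, while a large positive value corresponds to the long right tail seen in the histogram.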