Data Exploration

training <- read.csv('/Users/jordanglendrange/Documents/Data 621/insurance_training_data.csv')
head(training)
##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ   INCOME PARENT1
## 1     1           0          0        0  60        0  11  $67,349      No
## 2     2           0          0        0  43        0  11  $91,449      No
## 3     4           0          0        0  35        1  10  $16,039      No
## 4     5           0          0        0  51        0  14               No
## 5     6           0          0        0  50        0  NA $114,986      No
## 6     7           1       2946        0  34        1  12 $125,301     Yes
##   HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE BLUEBOOK
## 1       $0    z_No   M           PhD  Professional       14    Private  $14,230
## 2 $257,252    z_No   M z_High School z_Blue Collar       22 Commercial  $14,940
## 3 $124,191     Yes z_F z_High School      Clerical        5    Private   $4,010
## 4 $306,251     Yes   M  <High School z_Blue Collar       32    Private  $15,440
## 5 $243,925     Yes z_F           PhD        Doctor       36    Private  $18,000
## 6       $0    z_No z_F     Bachelors z_Blue Collar       46 Commercial  $17,430
##   TIF   CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1  11    Minivan     yes   $4,461        2      No       3      18
## 2   1    Minivan     yes       $0        0      No       0       1
## 3   4      z_SUV      no  $38,690        2      No       3      10
## 4   7    Minivan     yes       $0        0      No       0       6
## 5   1      z_SUV      no  $19,217        2     Yes       3      17
## 6   1 Sports Car      no       $0        0      No       0       7
##            URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 Highly Urban/ Urban
## 4 Highly Urban/ Urban
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban
test <- read.csv('/Users/jordanglendrange/Documents/Data 621/insurance-evaluation-data.csv')
head(test)
##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ  INCOME PARENT1
## 1     3          NA         NA        0  48        0  11 $52,881      No
## 2     9          NA         NA        1  40        1  11 $50,815     Yes
## 3    10          NA         NA        0  44        2  12 $43,486     Yes
## 4    18          NA         NA        0  35        2  NA $21,204     Yes
## 5    21          NA         NA        0  59        0  12 $87,460      No
## 6    30          NA         NA        0  46        0  14              No
##   HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE BLUEBOOK
## 1       $0    z_No   M     Bachelors       Manager       26    Private  $21,970
## 2       $0    z_No   M z_High School       Manager       21    Private  $18,930
## 3       $0    z_No z_F z_High School z_Blue Collar       30 Commercial   $5,900
## 4       $0    z_No   M z_High School      Clerical       74    Private   $9,230
## 5       $0    z_No   M z_High School       Manager       45    Private  $15,420
## 6 $207,519     Yes   M     Bachelors  Professional        7 Commercial  $25,660
##   TIF    CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1   1         Van     yes       $0        0      No       2      10
## 2   6     Minivan      no   $3,295        1      No       2       1
## 3  10       z_SUV      no       $0        0      No       0      10
## 4   6      Pickup      no       $0        0     Yes       0       4
## 5   1     Minivan     yes  $44,857        2      No       4       1
## 6   1 Panel Truck      no   $2,119        1      No       2      12
##              URBANICITY
## 1   Highly Urban/ Urban
## 2   Highly Urban/ Urban
## 3 z_Highly Rural/ Rural
## 4 z_Highly Rural/ Rural
## 5   Highly Urban/ Urban
## 6   Highly Urban/ Urban

First, let’s check the dimensions of the table.

dim(training)
## [1] 8161   26

Next, let’s check the summary of the data set. The median of TARGET_AMT is 0, which means at least half of the rows have no claim amount. We may need to filter out the zero-amount rows before modelling TARGET_AMT; we will confirm this later.

summary(training)
##      INDEX        TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Min.   :    1   Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  1st Qu.: 2559   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Median : 5133   Median :0.0000   Median :     0   Median :0.0000  
##  Mean   : 5152   Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##  3rd Qu.: 7745   3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##  Max.   :10302   Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                    
##       AGE           HOMEKIDS           YOJ          INCOME         
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Length:8161       
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   Class :character  
##  Median :45.00   Median :0.0000   Median :11.0   Mode  :character  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5                     
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0                     
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0                     
##  NA's   :6                        NA's   :454                      
##    PARENT1            HOME_VAL           MSTATUS              SEX           
##  Length:8161        Length:8161        Length:8161        Length:8161       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   EDUCATION             JOB               TRAVTIME        CAR_USE         
##  Length:8161        Length:8161        Min.   :  5.00   Length:8161       
##  Class :character   Class :character   1st Qu.: 22.00   Class :character  
##  Mode  :character   Mode  :character   Median : 33.00   Mode  :character  
##                                        Mean   : 33.49                     
##                                        3rd Qu.: 44.00                     
##                                        Max.   :142.00                     
##                                                                           
##    BLUEBOOK              TIF           CAR_TYPE           RED_CAR         
##  Length:8161        Min.   : 1.000   Length:8161        Length:8161       
##  Class :character   1st Qu.: 1.000   Class :character   Class :character  
##  Mode  :character   Median : 4.000   Mode  :character   Mode  :character  
##                     Mean   : 5.351                                        
##                     3rd Qu.: 7.000                                        
##                     Max.   :25.000                                        
##                                                                           
##    OLDCLAIM            CLM_FREQ        REVOKED             MVR_PTS      
##  Length:8161        Min.   :0.0000   Length:8161        Min.   : 0.000  
##  Class :character   1st Qu.:0.0000   Class :character   1st Qu.: 0.000  
##  Mode  :character   Median :0.0000   Mode  :character   Median : 1.000  
##                     Mean   :0.7986                      Mean   : 1.696  
##                     3rd Qu.:2.0000                      3rd Qu.: 3.000  
##                     Max.   :5.0000                      Max.   :13.000  
##                                                                         
##     CAR_AGE        URBANICITY       
##  Min.   :-3.000   Length:8161       
##  1st Qu.: 1.000   Class :character  
##  Median : 8.000   Mode  :character  
##  Mean   : 8.328                     
##  3rd Qu.:12.000                     
##  Max.   :28.000                     
##  NA's   :510

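About 74% of the rows have a target amount of 0, consistent with the TARGET_FLAG mean of 0.2638 and with a third quartile of TARGET_AMT (1036) that is already above zero. The histogram shows how heavily the values pile up at zero.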

hist(training$TARGET_AMT)
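
The share of zero-valued rows can also be computed directly rather than read off the summary. A minimal sketch on a made-up vector (the values and the name target_amt are illustrative, not taken from the data set); on the real data the same expression would presumably be applied to training$TARGET_AMT:

```r
# Toy stand-in for training$TARGET_AMT (illustrative values only)
target_amt <- c(0, 0, 0, 2946, 0, 0, 1036, 0)

# A logical vector averages to the proportion of TRUE values
share_zero <- mean(target_amt == 0)
share_zero
```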

Based on the correlation plot of the numeric columns, TARGET_FLAG is most strongly correlated with CLM_FREQ and MVR_PTS.

library("Hmisc")
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library("corrplot")
## corrplot 0.92 loaded
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
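# AGE, YOJ and CAR_AGE contain NAs, so use pairwise deletion to avoid NA correlations
training_cor <- cor(select_if(training, is.numeric), use = "pairwise.complete.obs")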
corrplot(training_cor)
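
Reading the strongest correlations off a plot can be cross-checked numerically. The sketch below uses made-up data (the data frame toy and its columns are hypothetical) and assumes pairwise deletion is acceptable for columns with missing values, since cor returns NA for such columns by default:

```r
# Toy data with a missing value (illustrative only)
toy <- data.frame(
  flag = c(0, 1, 0, 1, 1, 0),
  pts  = c(0, 3, 1, 4, 2, 0),
  age  = c(60, 43, NA, 51, 50, 34)
)

# Pairwise deletion keeps every column usable despite the NA in age
toy_cor <- cor(toy, use = "pairwise.complete.obs")

# Correlations with the target, strongest first, excluding the target itself
target_cor <- toy_cor["flag", colnames(toy_cor) != "flag"]
ranked <- sort(abs(target_cor), decreasing = TRUE)
ranked
```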

Data Preparation

head(training)
##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ   INCOME PARENT1
## 1     1           0          0        0  60        0  11  $67,349      No
## 2     2           0          0        0  43        0  11  $91,449      No
## 3     4           0          0        0  35        1  10  $16,039      No
## 4     5           0          0        0  51        0  14               No
## 5     6           0          0        0  50        0  NA $114,986      No
## 6     7           1       2946        0  34        1  12 $125,301     Yes
##   HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE BLUEBOOK
## 1       $0    z_No   M           PhD  Professional       14    Private  $14,230
## 2 $257,252    z_No   M z_High School z_Blue Collar       22 Commercial  $14,940
## 3 $124,191     Yes z_F z_High School      Clerical        5    Private   $4,010
## 4 $306,251     Yes   M  <High School z_Blue Collar       32    Private  $15,440
## 5 $243,925     Yes z_F           PhD        Doctor       36    Private  $18,000
## 6       $0    z_No z_F     Bachelors z_Blue Collar       46 Commercial  $17,430
##   TIF   CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1  11    Minivan     yes   $4,461        2      No       3      18
## 2   1    Minivan     yes       $0        0      No       0       1
## 3   4      z_SUV      no  $38,690        2      No       3      10
## 4   7    Minivan     yes       $0        0      No       0       6
## 5   1      z_SUV      no  $19,217        2     Yes       3      17
## 6   1 Sports Car      no       $0        0      No       0       7
##            URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 Highly Urban/ Urban
## 4 Highly Urban/ Urban
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban

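Now we are going to start creating dummy variables. For each categorical column we build 0/1 indicators with ifelse and then drop the original column. INDEX is dropped as well, since it is only a row identifier.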

unique(training$PARENT1)
## [1] "No"  "Yes"
training$PARENT_Yes <- ifelse(training$PARENT1 == "Yes", 1, 0)
training <- select(training,-c('INDEX', 'PARENT1'))

Next, MSTATUS and SEX.

training$MSTATUS_Yes <- ifelse(training$MSTATUS == "Yes", 1, 0)
training <- select(training,-c('MSTATUS'))
training$SEX_m <- ifelse(training$SEX == "M", 1, 0)
training <- select(training,-c('SEX'))
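
As an alternative to writing one ifelse per level, base R's model.matrix can build the indicators in one call. This is only a sketch on a made-up column (the data frame toy is hypothetical), not the approach used in this report:

```r
# Toy character column using the same labels as SEX (illustrative rows only)
toy <- data.frame(SEX = c("M", "z_F", "M", "z_F"))

# model.matrix() expands the column into 0/1 indicators and treats the
# first level ("M") as the reference; [, -1] removes the intercept column
dummies <- model.matrix(~ SEX, data = toy)[, -1, drop = FALSE]
dummies
```

Note that the reference level here is "M", so the resulting indicator marks "z_F" rows, the opposite orientation to SEX_m.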

head(training)
##   TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ   INCOME HOME_VAL
## 1           0          0        0  60        0  11  $67,349       $0
## 2           0          0        0  43        0  11  $91,449 $257,252
## 3           0          0        0  35        1  10  $16,039 $124,191
## 4           0          0        0  51        0  14          $306,251
## 5           0          0        0  50        0  NA $114,986 $243,925
## 6           1       2946        0  34        1  12 $125,301       $0
##       EDUCATION           JOB TRAVTIME    CAR_USE BLUEBOOK TIF   CAR_TYPE
## 1           PhD  Professional       14    Private  $14,230  11    Minivan
## 2 z_High School z_Blue Collar       22 Commercial  $14,940   1    Minivan
## 3 z_High School      Clerical        5    Private   $4,010   4      z_SUV
## 4  <High School z_Blue Collar       32    Private  $15,440   7    Minivan
## 5           PhD        Doctor       36    Private  $18,000   1      z_SUV
## 6     Bachelors z_Blue Collar       46 Commercial  $17,430   1 Sports Car
##   RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE          URBANICITY
## 1     yes   $4,461        2      No       3      18 Highly Urban/ Urban
## 2     yes       $0        0      No       0       1 Highly Urban/ Urban
## 3      no  $38,690        2      No       3      10 Highly Urban/ Urban
## 4     yes       $0        0      No       0       6 Highly Urban/ Urban
## 5      no  $19,217        2     Yes       3      17 Highly Urban/ Urban
## 6      no       $0        0      No       0       7 Highly Urban/ Urban
##   PARENT_Yes MSTATUS_Yes SEX_m
## 1          0           0     1
## 2          0           0     1
## 3          0           1     0
## 4          0           1     1
## 5          0           1     0
## 6          1           0     0
training$EDUCTATION_phd <- ifelse(training$EDUCATION == "PhD", 1, 0)
training$EDUCTATION_hs <- ifelse(training$EDUCATION == "z_High School", 1, 0)
training$EDUCTATION_b <- ifelse(training$EDUCATION == "Bachelors", 1, 0)
training$EDUCTATION_m <- ifelse(training$EDUCATION == "Masters", 1, 0)
training <- select(training,-c('EDUCATION'))
training$JOB_pro <- ifelse(training$JOB == "Professional", 1, 0)
training$JOB_bc <- ifelse(training$JOB == "z_Blue Collar", 1, 0)
training$JOB_cler <- ifelse(training$JOB == "Clerical", 1, 0)
training$JOB_doc <- ifelse(training$JOB == "Doctor", 1, 0)
training$JOB_law <- ifelse(training$JOB == "Lawyer", 1, 0)
training$JOB_mg <- ifelse(training$JOB == "Manager", 1, 0)
training$JOB_hm <- ifelse(training$JOB == "Home Maker", 1, 0)
training$JOB_st <- ifelse(training$JOB == "Student", 1, 0)

training <- select(training,-c('JOB'))

head(training)
##   TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ   INCOME HOME_VAL TRAVTIME
## 1           0          0        0  60        0  11  $67,349       $0       14
## 2           0          0        0  43        0  11  $91,449 $257,252       22
## 3           0          0        0  35        1  10  $16,039 $124,191        5
## 4           0          0        0  51        0  14          $306,251       32
## 5           0          0        0  50        0  NA $114,986 $243,925       36
## 6           1       2946        0  34        1  12 $125,301       $0       46
##      CAR_USE BLUEBOOK TIF   CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1    Private  $14,230  11    Minivan     yes   $4,461        2      No       3
## 2 Commercial  $14,940   1    Minivan     yes       $0        0      No       0
## 3    Private   $4,010   4      z_SUV      no  $38,690        2      No       3
## 4    Private  $15,440   7    Minivan     yes       $0        0      No       0
## 5    Private  $18,000   1      z_SUV      no  $19,217        2     Yes       3
## 6 Commercial  $17,430   1 Sports Car      no       $0        0      No       0
##   CAR_AGE          URBANICITY PARENT_Yes MSTATUS_Yes SEX_m EDUCTATION_phd
## 1      18 Highly Urban/ Urban          0           0     1              1
## 2       1 Highly Urban/ Urban          0           0     1              0
## 3      10 Highly Urban/ Urban          0           1     0              0
## 4       6 Highly Urban/ Urban          0           1     1              0
## 5      17 Highly Urban/ Urban          0           1     0              1
## 6       7 Highly Urban/ Urban          1           0     0              0
##   EDUCTATION_hs EDUCTATION_b EDUCTATION_m JOB_pro JOB_bc JOB_cler JOB_doc
## 1             0            0            0       1      0        0       0
## 2             1            0            0       0      1        0       0
## 3             1            0            0       0      0        1       0
## 4             0            0            0       0      1        0       0
## 5             0            0            0       0      0        0       1
## 6             0            1            0       0      1        0       0
##   JOB_law JOB_mg JOB_hm JOB_st
## 1       0      0      0      0
## 2       0      0      0      0
## 3       0      0      0      0
## 4       0      0      0      0
## 5       0      0      0      0
## 6       0      0      0      0
unique(training$CAR_USE)
## [1] "Private"    "Commercial"
training$CAR_priv <- ifelse(training$CAR_USE == "Private", 1, 0)
training <- select(training,-c('CAR_USE'))
training$CAR_TYPE_mini <- ifelse(training$CAR_TYPE == "Minivan", 1, 0)
training$CAR_TYPE_suv <- ifelse(training$CAR_TYPE == "z_SUV", 1, 0)
training$CAR_TYPE_sc <- ifelse(training$CAR_TYPE == "Sports Car", 1, 0)
training$CAR_TYPE_van <- ifelse(training$CAR_TYPE == "Van", 1, 0)
training$CAR_TYPE_pt <- ifelse(training$CAR_TYPE == "Panel Truck", 1, 0)

training <- select(training,-c('CAR_TYPE'))
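# Note: RED_CAR is coded in lowercase ("yes"/"no", see the head output above), so this
# comparison never matches and RED_CAR_y is 0 in every row; compare against "yes" to fix.
training$RED_CAR_y <- ifelse(training$RED_CAR == "Yes", 1, 0)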
training <- select(training,-c('RED_CAR'))
training$REVOKED <- ifelse(training$REVOKED == "Yes", 1, 0)
training$URBANICITY <- ifelse(training$URBANICITY == "Highly Urban/ Urban", 1, 0)
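
Hand-written dummies are sensitive to the exact spelling and case of each label. The head output shows RED_CAR coded as lowercase yes/no, so a comparison against "Yes" never matches. A small sketch of the problem and a check for it, on made-up values (the helper is_constant is hypothetical):

```r
# Toy column mimicking RED_CAR's lowercase coding (illustrative only)
red_car <- c("yes", "no", "no", "yes")

# Comparing against a label that never occurs yields an all-zero dummy
dummy_wrong <- ifelse(red_car == "Yes", 1, 0)
dummy_right <- ifelse(red_car == "yes", 1, 0)

# A constant column cannot be estimated alongside an intercept in lm();
# flag it before modelling
is_constant <- function(x) length(unique(x)) == 1
is_constant(dummy_wrong)  # TRUE
is_constant(dummy_right)  # FALSE
```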

Models

library(readr)
training$INCOME <- parse_number(training$INCOME)
training$HOME_VAL <- parse_number(training$HOME_VAL)
training$BLUEBOOK <- parse_number(training$BLUEBOOK)
training$OLDCLAIM <- parse_number(training$OLDCLAIM)
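
For readers without readr, the same currency clean-up can be sketched in base R. This is an assumed equivalent for strings in the format shown in the head output, not a drop-in replacement for parse_number in general:

```r
# Sample strings in the same format as INCOME (two values from the head output
# plus a blank entry)
income_raw <- c("$67,349", "$91,449", "")

# Strip "$" and "," and convert; the empty string becomes NA
income_num <- as.numeric(gsub("[$,]", "", income_raw))
income_num
```

Blank entries turning into NA matters later: rows with missing values are dropped when the models are fitted.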

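The first model uses a single predictor, AGE, and is fitted only on rows where a crash occurred (TARGET_FLAG == 1). As we would expect, the R^2 is extremely low (0.0008) and AGE is not significant (p = 0.196), so there isn’t much we can do with this model.

training_crash <- training %>% filter(TARGET_FLAG == 1)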

model_age <- lm(TARGET_AMT~AGE,data=training_crash)

summary(model_age)
## 
## Call:
## lm(formula = TARGET_AMT ~ AGE, data = training_crash)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5834  -3090  -1552     74 101706 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4727.30     775.49   6.096 1.29e-09 ***
## AGE            22.60      17.49   1.292    0.196    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7750 on 2146 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.0007774,  Adjusted R-squared:  0.0003118 
## F-statistic:  1.67 on 1 and 2146 DF,  p-value: 0.1965

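Here I included all the variables. R^2 is higher, as expected, but still very low (0.0425, adjusted 0.0218). The RED_CAR_y coefficient is NA because that column is 0 in every row, and 450 rows were dropped because of missing values.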

model_all <- lm (TARGET_AMT~KIDSDRIV+AGE+HOMEKIDS+YOJ+INCOME+HOME_VAL+TRAVTIME+BLUEBOOK+TIF+OLDCLAIM+CLM_FREQ+REVOKED+MVR_PTS+CAR_AGE+URBANICITY+PARENT_Yes+MSTATUS_Yes+SEX_m+EDUCTATION_phd+EDUCTATION_hs+EDUCTATION_b+EDUCTATION_m+JOB_pro+JOB_bc+JOB_cler+JOB_doc+JOB_law+JOB_mg+JOB_hm+JOB_st+CAR_priv+CAR_TYPE_mini+CAR_TYPE_suv+CAR_TYPE_sc+CAR_TYPE_van+CAR_TYPE_pt+RED_CAR_y, data=training_crash)

summary(model_all)
## 
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + 
##     HOME_VAL + TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + 
##     REVOKED + MVR_PTS + CAR_AGE + URBANICITY + PARENT_Yes + MSTATUS_Yes + 
##     SEX_m + EDUCTATION_phd + EDUCTATION_hs + EDUCTATION_b + EDUCTATION_m + 
##     JOB_pro + JOB_bc + JOB_cler + JOB_doc + JOB_law + JOB_mg + 
##     JOB_hm + JOB_st + CAR_priv + CAR_TYPE_mini + CAR_TYPE_suv + 
##     CAR_TYPE_sc + CAR_TYPE_van + CAR_TYPE_pt + RED_CAR_y, data = training_crash)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9689  -3151  -1457    551  76305 
## 
## Coefficients: (1 not defined because of singularities)
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.651e+03  2.140e+03   1.706   0.0882 .  
## KIDSDRIV       -1.762e+02  3.555e+02  -0.496   0.6203    
## AGE             9.854e-01  2.349e+01   0.042   0.9665    
## HOMEKIDS        2.758e+02  2.294e+02   1.202   0.2295    
## YOJ             2.020e+01  5.459e+01   0.370   0.7113    
## INCOME         -1.507e-02  7.820e-03  -1.927   0.0542 .  
## HOME_VAL        2.225e-03  2.267e-03   0.981   0.3267    
## TRAVTIME        3.955e+00  1.234e+01   0.321   0.7486    
## BLUEBOOK        1.492e-01  3.373e-02   4.425 1.03e-05 ***
## TIF            -5.918e+00  4.694e+01  -0.126   0.8997    
## OLDCLAIM        4.994e-02  2.527e-02   1.976   0.0483 *  
## CLM_FREQ       -2.052e+02  1.748e+02  -1.174   0.2407    
## REVOKED        -1.256e+03  5.849e+02  -2.147   0.0319 *  
## MVR_PTS         8.821e+01  7.558e+01   1.167   0.2434    
## CAR_AGE        -9.675e+01  4.869e+01  -1.987   0.0471 *  
## URBANICITY      5.887e+01  8.181e+02   0.072   0.9426    
## PARENT_Yes     -9.808e+01  6.467e+02  -0.152   0.8795    
## MSTATUS_Yes    -1.385e+03  5.661e+02  -2.446   0.0145 *  
## SEX_m           1.649e+03  6.518e+02   2.530   0.0115 *  
## EDUCTATION_phd  3.115e+03  1.478e+03   2.108   0.0352 *  
## EDUCTATION_hs  -6.890e+02  5.649e+02  -1.220   0.2228    
## EDUCTATION_b    1.574e+02  7.143e+02   0.220   0.8257    
## EDUCTATION_m    8.165e+02  1.250e+03   0.653   0.5138    
## JOB_pro         1.271e+03  1.279e+03   0.994   0.3204    
## JOB_bc          5.032e+02  1.302e+03   0.387   0.6992    
## JOB_cler       -6.284e+02  1.362e+03  -0.462   0.6445    
## JOB_doc        -3.305e+03  1.867e+03  -1.770   0.0769 .  
## JOB_law        -1.396e+02  1.146e+03  -0.122   0.9030    
## JOB_mg         -1.300e+03  1.221e+03  -1.065   0.2871    
## JOB_hm         -5.308e+02  1.445e+03  -0.367   0.7134    
## JOB_st         -5.548e+02  1.472e+03  -0.377   0.7062    
## CAR_priv       -2.774e+02  5.848e+02  -0.474   0.6353    
## CAR_TYPE_mini  -2.989e+02  6.625e+02  -0.451   0.6520    
## CAR_TYPE_suv    1.378e+03  7.506e+02   1.836   0.0665 .  
## CAR_TYPE_sc     1.664e+03  8.436e+02   1.972   0.0488 *  
## CAR_TYPE_van   -5.303e+02  8.228e+02  -0.644   0.5194    
## CAR_TYPE_pt    -5.811e+02  1.006e+03  -0.578   0.5637    
## RED_CAR_y              NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7585 on 1666 degrees of freedom
##   (450 observations deleted due to missingness)
## Multiple R-squared:  0.0425, Adjusted R-squared:  0.0218 
## F-statistic: 2.054 on 36 and 1666 DF,  p-value: 0.0002527
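
The adjusted R^2 in the summary can be reproduced from the reported figures, which shows how much of the small R^2 is absorbed by the 36 estimated predictors. A short worked check (the variable names are just for illustration):

```r
r2 <- 0.0425   # Multiple R-squared from the summary above
n  <- 1703     # rows used: 1666 residual df + 36 predictors + 1 intercept
p  <- 36       # estimated slope coefficients (RED_CAR_y was not estimated)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.0218
```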

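Now we are going to model TARGET_FLAG using glm, with the same variables as before. For these to be binary logistic models, glm needs family = binomial (its default is gaussian) and the full training data rather than training_crash, where TARGET_FLAG is always 1.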

# family = binomial requests a logistic (logit-link) fit
model_age_g <- glm(TARGET_FLAG ~ AGE, family = binomial, data = training)

summary(model_age_g)
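# Caution: this call omits family = binomial and uses training_crash, where
# TARGET_FLAG is always 1, so the output below is a degenerate gaussian fit
# (intercept 1, slopes near 0). A logistic fit needs family = binomial and data = training.
model_all_g <- glm(TARGET_FLAG~KIDSDRIV+AGE+HOMEKIDS+YOJ+INCOME+HOME_VAL+TRAVTIME+BLUEBOOK+TIF+OLDCLAIM+CLM_FREQ+REVOKED+MVR_PTS+CAR_AGE+URBANICITY+PARENT_Yes+MSTATUS_Yes+SEX_m+EDUCTATION_phd+EDUCTATION_hs+EDUCTATION_b+EDUCTATION_m+JOB_pro+JOB_bc+JOB_cler+JOB_doc+JOB_law+JOB_mg+JOB_hm+JOB_st+CAR_priv+CAR_TYPE_mini+CAR_TYPE_suv+CAR_TYPE_sc+CAR_TYPE_van+CAR_TYPE_pt+RED_CAR_y, data=training_crash)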

summary(model_all_g)
## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + 
##     INCOME + HOME_VAL + TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + 
##     CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY + PARENT_Yes + 
##     MSTATUS_Yes + SEX_m + EDUCTATION_phd + EDUCTATION_hs + EDUCTATION_b + 
##     EDUCTATION_m + JOB_pro + JOB_bc + JOB_cler + JOB_doc + JOB_law + 
##     JOB_mg + JOB_hm + JOB_st + CAR_priv + CAR_TYPE_mini + CAR_TYPE_suv + 
##     CAR_TYPE_sc + CAR_TYPE_van + CAR_TYPE_pt + RED_CAR_y, data = training_crash)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## 1.565e-14  2.443e-14  2.687e-14  2.931e-14  4.896e-14  
## 
## Coefficients: (1 not defined because of singularities)
##                  Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)     1.000e+00  7.804e-15  1.281e+14  < 2e-16 ***
## KIDSDRIV        2.485e-16  1.296e-15  1.920e-01  0.84804    
## AGE             6.243e-17  8.567e-17  7.290e-01  0.46627    
## HOMEKIDS        8.012e-16  8.367e-16  9.580e-01  0.33840    
## YOJ            -7.480e-17  1.991e-16 -3.760e-01  0.70717    
## INCOME         -7.754e-20  2.852e-20 -2.719e+00  0.00661 ** 
## HOME_VAL        1.551e-20  8.269e-21  1.876e+00  0.06081 .  
## TRAVTIME       -3.393e-17  4.499e-17 -7.540e-01  0.45095    
## BLUEBOOK       -5.310e-20  1.230e-19 -4.320e-01  0.66605    
## TIF             1.706e-16  1.712e-16  9.970e-01  0.31911    
## OLDCLAIM       -1.094e-20  9.218e-20 -1.190e-01  0.90551    
## CLM_FREQ        4.597e-16  6.376e-16  7.210e-01  0.47104    
## REVOKED         1.085e-15  2.133e-15  5.090e-01  0.61096    
## MVR_PTS         2.605e-16  2.756e-16  9.450e-01  0.34469    
## CAR_AGE         1.118e-16  1.775e-16  6.300e-01  0.52884    
## URBANICITY     -8.136e-16  2.983e-15 -2.730e-01  0.78512    
## PARENT_Yes     -4.330e-15  2.359e-15 -1.836e+00  0.06656 .  
## MSTATUS_Yes    -2.734e-15  2.064e-15 -1.324e+00  0.18561    
## SEX_m          -9.966e-16  2.377e-15 -4.190e-01  0.67508    
## EDUCTATION_phd  1.368e-15  5.389e-15  2.540e-01  0.79962    
## EDUCTATION_hs   1.261e-15  2.060e-15  6.120e-01  0.54058    
## EDUCTATION_b   -2.014e-15  2.605e-15 -7.730e-01  0.43956    
## EDUCTATION_m   -9.558e-16  4.559e-15 -2.100e-01  0.83397    
## JOB_pro        -5.933e-16  4.665e-15 -1.270e-01  0.89880    
## JOB_bc         -3.598e-15  4.747e-15 -7.580e-01  0.44860    
## JOB_cler       -3.747e-15  4.965e-15 -7.550e-01  0.45052    
## JOB_doc        -3.634e-16  6.809e-15 -5.300e-02  0.95744    
## JOB_law        -8.323e-16  4.178e-15 -1.990e-01  0.84211    
## JOB_mg          5.302e-16  4.452e-15  1.190e-01  0.90523    
## JOB_hm         -4.896e-15  5.269e-15 -9.290e-01  0.35293    
## JOB_st         -4.230e-15  5.367e-15 -7.880e-01  0.43072    
## CAR_priv        2.373e-15  2.133e-15  1.113e+00  0.26592    
## CAR_TYPE_mini  -1.293e-16  2.416e-15 -5.400e-02  0.95733    
## CAR_TYPE_suv   -1.601e-15  2.737e-15 -5.850e-01  0.55875    
## CAR_TYPE_sc    -6.518e-15  3.076e-15 -2.119e+00  0.03425 *  
## CAR_TYPE_van    1.575e-15  3.001e-15  5.250e-01  0.59977    
## CAR_TYPE_pt     2.905e-15  3.669e-15  7.920e-01  0.42866    
## RED_CAR_y              NA         NA         NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 7.650813e-28)
## 
##     Null deviance: 0.0000e+00  on 1702  degrees of freedom
## Residual deviance: 1.2746e-24  on 1666  degrees of freedom
##   (450 observations deleted due to missingness)
## AIC: -101460
## 
## Number of Fisher Scoring iterations: 1
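
The output above reports a gaussian family and coefficients on the order of 1e-16, which is what happens when the response never varies. For contrast, here is a minimal sketch of a logistic fit on simulated data. Everything in it (the data frame sim, the variable names, the true coefficients) is made up for illustration and says nothing about the insurance data:

```r
set.seed(621)

# Simulated stand-in data: crash probability rises with motor-vehicle points
n <- 1000
mvr_pts <- rpois(n, lambda = 2)
crash <- rbinom(n, size = 1, prob = plogis(-1.5 + 0.4 * mvr_pts))
sim <- data.frame(crash, mvr_pts)

# family = binomial gives a logit link; the response contains both 0s and 1s
fit <- glm(crash ~ mvr_pts, family = binomial, data = sim)
coef(fit)

# Predicted crash probabilities lie strictly between 0 and 1
p_hat <- predict(fit, type = "response")
range(p_hat)
```

With a binomial family the summary reports z values and a dispersion parameter fixed at 1, which is a quick way to confirm that a logistic model was actually fitted.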

Select Model

For the purposes of the assignment we are going to pick the models that use all of the variables. The full linear model had a higher R^2 (0.0425) than the model using just AGE (0.0008). Now we need to validate the model by checking the residuals.

Checking the residuals shows that the model in its current state should not be used: they are strongly right-skewed (median -1457, maximum 76305). We need to go back and adjust the variables we use.

plot(model_all$residuals)

qqnorm(model_all$residuals)
qqline(model_all$residuals)

hist(model_all$residuals)
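
One adjustment that is often tried for right-skewed amounts is modelling the log of the response. The sketch below uses simulated data only (x, y and the helper skew are hypothetical) and is meant to show how residual skewness can be compared between two fits, not to claim that a log transform is the right fix here:

```r
set.seed(621)

# Simulated right-skewed response, loosely resembling claim amounts
n <- 500
x <- rnorm(n)
y <- exp(8 + 0.3 * x + rnorm(n))

# Sample skewness: near 0 for symmetric residuals
skew <- function(r) mean((r - mean(r))^3) / sd(r)^3

fit_raw <- lm(y ~ x)
fit_log <- lm(log(y) ~ x)

skew_raw <- skew(residuals(fit_raw))  # large and positive
skew_log <- skew(residuals(fit_log))  # close to zero
c(raw = skew_raw, log = skew_log)
```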