DATA 621 Week 4

Our goal is to build a multiple regression model and a logistic regression model that will predict the probability of someone crashing their car and the cost if the car crashes. We are given a dataset with two response variables TARGET_FLAG and TARGET_AMT. TARGET_FLAG is a binary field taking the values 1=crash or 0=no crash while TARGET_AMT is the amount of time spent on repairs given there was a crash. The type of data contained in each response variable is connected to the type of model. TARGET_FLAG is the response variable for logistic regression and TARGET_AMT is the response variable for multiple regression.

The other variables are defined as follows: AGE Age of Driver Very young people tend to be risky. Maybe very old people also. BLUEBOOK Value of Vehicle Unknown effect on probability of collision, but probably effect the payout if there is a crash CAR_AGE Vehicle Age Unknown effect on probability of collision, but probably effect the payout if there is a crash CAR_TYPE Type of Car Unknown effect on probability of collision, but probably effect the payout if there is a crash CAR_USE Vehicle Use Commercial vehicles are driven more, so might increase probability of collision CLM_FREQ # Claims (Past 5 Years) The more claims you filed in the past, the more you are likely to file in the future EDUCATION Max Education Level Unknown effect, but in theory more educated people tend to drive more safely HOMEKIDS # Children at Home Unknown effect HOME_VAL Home Value In theory, home owners tend to drive more responsibly INCOME Income In theory, rich people tend to get into fewer crashes JOB Job Category In theory, white collar jobs tend to be safer KIDSDRIV # Driving Children When teenagers drive your car, you are more likely to get into crashes MSTATUS Marital Status In theory, married people drive more safely MVR_PTS Motor Vehicle Record Points If you get lots of traffic tickets, you tend to get into more crashes OLDCLAIM Total Claims (Past 5 Years) If your total payout over the past five years was high, this suggests future payouts will be high PARENT1 Single Parent Unknown effect RED_CAR A Red Car Urban legend says that red cars (especially red sports cars) are more risky. Is that true? REVOKED License Revoked (Past 7 Years) If your license was revoked in the past 7 years, you probably are a more risky driver. SEX Gender Urban legend says that women have less crashes then men. Is that true? TIF Time in Force People who have been customers for a long time are usually more safe. TRAVTIME Distance to Work Long drives to work usually suggest greater risk URBANICITY Home/Work Area Unknown YOJ Years on Job People who stay at a job for a long time are usually more safe

Our plan is the pick the best multiple regression and logistic regression model that best satisfies our goal. This study will be partitioned into four sections starting with EDA, Data preperation, Model Building, and Model Selection.

We need to partition the data into two subsets. The first subset only takes TARGET_FLAG response and the other variables. The second subset takes TARGET_AMT where TARGET_FLAG=1 in addition to the target variables.

Read in the data

##    INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ   INCOME PARENT1
## 1      1           0          0        0  60        0  11  $67,349      No
## 2      2           0          0        0  43        0  11  $91,449      No
## 3      4           0          0        0  35        1  10  $16,039      No
## 4      5           0          0        0  51        0  14               No
## 5      6           0          0        0  50        0  NA $114,986      No
## 6      7           1       2946        0  34        1  12 $125,301     Yes
## 7      8           0          0        0  54        0  NA  $18,755      No
## 8     11           1       4021        1  37        2  NA $107,961      No
## 9     12           1       2501        0  34        0  10  $62,978      No
## 10    13           0          0        0  50        0   7 $106,952      No
##    HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE
## 1        $0    z_No   M           PhD  Professional       14    Private
## 2  $257,252    z_No   M z_High School z_Blue Collar       22 Commercial
## 3  $124,191     Yes z_F z_High School      Clerical        5    Private
## 4  $306,251     Yes   M  <High School z_Blue Collar       32    Private
## 5  $243,925     Yes z_F           PhD        Doctor       36    Private
## 6        $0    z_No z_F     Bachelors z_Blue Collar       46 Commercial
## 7               Yes z_F  <High School z_Blue Collar       33    Private
## 8  $333,680     Yes   M     Bachelors z_Blue Collar       44 Commercial
## 9        $0    z_No z_F     Bachelors      Clerical       34    Private
## 10       $0    z_No   M     Bachelors  Professional       48 Commercial
##    BLUEBOOK TIF   CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1   $14,230  11    Minivan     yes   $4,461        2      No       3
## 2   $14,940   1    Minivan     yes       $0        0      No       0
## 3    $4,010   4      z_SUV      no  $38,690        2      No       3
## 4   $15,440   7    Minivan     yes       $0        0      No       0
## 5   $18,000   1      z_SUV      no  $19,217        2     Yes       3
## 6   $17,430   1 Sports Car      no       $0        0      No       0
## 7    $8,780   1      z_SUV      no       $0        0      No       0
## 8   $16,970   1        Van     yes   $2,374        1     Yes      10
## 9   $11,200   1      z_SUV      no       $0        0      No       0
## 10  $18,510   7        Van      no       $0        0      No       1
##    CAR_AGE            URBANICITY
## 1       18   Highly Urban/ Urban
## 2        1   Highly Urban/ Urban
## 3       10   Highly Urban/ Urban
## 4        6   Highly Urban/ Urban
## 5       17   Highly Urban/ Urban
## 6        7   Highly Urban/ Urban
## 7        1   Highly Urban/ Urban
## 8        7   Highly Urban/ Urban
## 9        1   Highly Urban/ Urban
## 10      17 z_Highly Rural/ Rural

How many rows and variables does our data have overall?

## [1] 8161

## [1] 26

##  [1] "INDEX"       "TARGET_FLAG" "TARGET_AMT"  "KIDSDRIV"    "AGE"        
##  [6] "HOMEKIDS"    "YOJ"         "INCOME"      "PARENT1"     "HOME_VAL"   
## [11] "MSTATUS"     "SEX"         "EDUCATION"   "JOB"         "TRAVTIME"   
## [16] "CAR_USE"     "BLUEBOOK"    "TIF"         "CAR_TYPE"    "RED_CAR"    
## [21] "OLDCLAIM"    "CLM_FREQ"    "REVOKED"     "MVR_PTS"     "CAR_AGE"    
## [26] "URBANICITY"

There are 8,161 rows and 26 variables. The INDEX variable is simply a row number and will not be included in any portion of the analysis from here on out. Each record represents a car owner.

Lets examine our response variables. TARGET_FLAG

table(crash_training$TARGET_FLAG);

## 
##    0    1 
## 6008 2153

barplot(table(crash_training$TARGET_FLAG), ylim=c(0, 7000), xlab="Result", ylab="N", col="black")

The data contains information for roughly 73% non-crashes and 27% crashes.

TARGET_AMT

x <- crash_training$TARGET_AMT
h<-hist(x, breaks=10, col="red", xlab="Target AMT", 
    main="Histogram with Normal Curve", ylim=range(0:1200)) 
xfit<-seq(min(x),max(x),length=40) 
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
yfit <- yfit*diff(h$mids[1:2])*length(x) 
lines(xfit, yfit, col="blue", lwd=2);

x <- log(crash_training$TARGET_AMT)
h<-hist(x, breaks=10, col="red", xlab="Target AMT", 
    main="Histogram with Normal Curve After Log Transform", ylim=range(0:1200))

#xfit<-seq(min(x),max(x),length=40) 
#yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
#yfit <- yfit*diff(h$mids[1:2])*length(x) 
#lines(x, yfit, col="blue", lwd=2)

Almost immediatly, we notice that there is no association with normality however if we add a log transform, the spread of the response variable gets closer to normal.

General Data Summaries

##      INDEX        TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Min.   :    1   Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  1st Qu.: 2559   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Median : 5133   Median :0.0000   Median :     0   Median :0.0000  
##  Mean   : 5152   Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##  3rd Qu.: 7745   3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##  Max.   :10302   Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                    
##       AGE           HOMEKIDS           YOJ            INCOME    
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   $0      : 615  
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0           : 445  
##  Median :45.00   Median :0.0000   Median :11.0   $26,840 :   4  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5   $48,509 :   4  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0   $61,790 :   4  
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0   $107,375:   3  
##  NA's   :6                        NA's   :454    (Other) :7086  
##  PARENT1        HOME_VAL    MSTATUS      SEX               EDUCATION   
##  No :7084   $0      :2294   Yes :4894   M  :3786   <High School :1203  
##  Yes:1077           : 464   z_No:3267   z_F:4375   Bachelors    :2242  
##             $111,129:   3                          Masters      :1658  
##             $115,249:   3                          PhD          : 728  
##             $123,109:   3                          z_High School:2330  
##             $153,061:   3                                              
##             (Other) :5391                                              
##             JOB          TRAVTIME            CAR_USE        BLUEBOOK   
##  z_Blue Collar:1825   Min.   :  5.00   Commercial:3029   $1,500 : 157  
##  Clerical     :1271   1st Qu.: 22.00   Private   :5132   $6,000 :  34  
##  Professional :1117   Median : 33.00                     $5,800 :  33  
##  Manager      : 988   Mean   : 33.49                     $6,200 :  33  
##  Lawyer       : 835   3rd Qu.: 44.00                     $6,400 :  31  
##  Student      : 712   Max.   :142.00                     $5,900 :  30  
##  (Other)      :1413                                      (Other):7843  
##       TIF                CAR_TYPE    RED_CAR       OLDCLAIM   
##  Min.   : 1.000   Minivan    :2145   no :5783   $0     :5009  
##  1st Qu.: 1.000   Panel Truck: 676   yes:2378   $1,310 :   4  
##  Median : 4.000   Pickup     :1389              $1,391 :   4  
##  Mean   : 5.351   Sports Car : 907              $4,263 :   4  
##  3rd Qu.: 7.000   Van        : 750              $1,105 :   3  
##  Max.   :25.000   z_SUV      :2294              $1,332 :   3  
##                                                 (Other):3134  
##     CLM_FREQ      REVOKED       MVR_PTS          CAR_AGE      
##  Min.   :0.0000   No :7161   Min.   : 0.000   Min.   :-3.000  
##  1st Qu.:0.0000   Yes:1000   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000              Median : 1.000   Median : 8.000  
##  Mean   :0.7986              Mean   : 1.696   Mean   : 8.328  
##  3rd Qu.:2.0000              3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000              Max.   :13.000   Max.   :28.000  
##                                               NA's   :510     
##                  URBANICITY  
##  Highly Urban/ Urban  :6492  
##  z_Highly Rural/ Rural:1669  
##                              
##                              
##                              
##                              
##

Lets select the variables that can be reasonably visualized and plot their spreads. We are unable to do a facet plot of all the remaining variables at once since attributes are not identical across measure variables. We produce plots detailing the distribution of the discrete value variables.

From our visuals, car age is the variable that is most alarming to observe. It seems to be that there are negative values but that is not possible for car age. This variable might have outliers caused by errors within the data collection. Travel time also has a very significant outlier but it is related to travele time. It does not seem reasonable that many users would have a travel time of just 5 minutes. YOJ also stands out since this variable is defined as the number of years people stay on the job. There is a visible peak at 0 years but does that group in people who stayed in their jobs for x amunt of months?

Some of these variables might be good candidates for dummy coding especially when doing the linear regression model on them. Lets look at the density plot of variables relating to income. We need to convert from factor to numeric in order to visualize.

##     BLUEBOOK       HOME_VAL        INCOME        OLDCLAIM     
##  Min.   :   1   Min.   :   1   Min.   :   1   Min.   :   1.0  
##  1st Qu.: 478   1st Qu.:   2   1st Qu.: 926   1st Qu.:   1.0  
##  Median :1124   Median :1245   Median :2817   Median :   1.0  
##  Mean   :1284   Mean   :1685   Mean   :2876   Mean   : 552.3  
##  3rd Qu.:2234   3rd Qu.:3164   3rd Qu.:4701   3rd Qu.:1015.0  
##  Max.   :2789   Max.   :5107   Max.   :6613   Max.   :2857.0

Out of these four predictors, home value and income have their largest value of zero. Having zero as income could be due to unemployment and having zero as home value could be due to not being a homeowner.

In order to perform correlation analysis using standard pearson, we should only consider variables with numeric entries. We will make such a subset and then consider some alternatives to see if we can quantify the correlation of categorical variables.

How does each variable correlate with TARGET_AMT

## TARGET_FLAG  TARGET_AMT    KIDSDRIV         AGE    HOMEKIDS         YOJ 
##  0.53424606  1.00000000  0.05539418          NA  0.06198804          NA 
##    HOME_VAL    TRAVTIME    BLUEBOOK         TIF    OLDCLAIM    CLM_FREQ 
## -0.07682458  0.02798702  0.02359552 -0.04648083  0.09714780  0.11641916 
##     MVR_PTS     CAR_AGE 
##  0.13786551          NA

The two responses have a strong positive correlation. This does not seem out of the ordinary since a car crash needs to occur in order to record the cost of the crash. The response variable has a strong correlation with KIDSDRIV which is quite interesting. Could it be that kid drivers are more accident prone?

How does each variable correlate with TARGET_FLAG?

## TARGET_FLAG  TARGET_AMT    KIDSDRIV         AGE    HOMEKIDS         YOJ 
##  1.00000000  0.53424606  0.10366830          NA  0.11562101          NA 
##    HOME_VAL    TRAVTIME    BLUEBOOK         TIF    OLDCLAIM    CLM_FREQ 
## -0.14857148  0.04836831  0.05044526 -0.08237005  0.19028750  0.21619606 
##     MVR_PTS     CAR_AGE 
##  0.21919705          NA

The highest correlation with target flag occurs with mvr_pts closely followed by claim_frequency. A value of 1 in target flag is an indicator for a crash. If a car has crashed then that must be put into an insurance claim and contribute to your mvr points.

How do variables correlate with each other?

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

There are several variables that have minimal correlation with one another. Some variables are not correlated at all such as TIF.

Check missing data

## TARGET_FLAG  TARGET_AMT    KIDSDRIV         AGE    HOMEKIDS         YOJ 
##           0           0           0           6           0         454 
##      INCOME     PARENT1    HOME_VAL     MSTATUS         SEX   EDUCATION 
##           0           0           0           0           0           0 
##         JOB    TRAVTIME     CAR_USE    BLUEBOOK         TIF    CAR_TYPE 
##           0           0           0           0           0           0 
##     RED_CAR    OLDCLAIM    CLM_FREQ     REVOKED     MVR_PTS     CAR_AGE 
##           0           0           0           0           0         510 
##  URBANICITY 
##           0

Age is missing 0.07% data YOJ is missing 5.56% data CAR_AGE is missing 6.24% of data

5% is generally accepted as the threshhold to remove data but I hesitate in removing YOJ due to its theoretical effect on car saftey. I do not believe that YOJ is suitable for imputing methods because it is possible that someone does not have employment history, hence there is nothing to record. How many zero entries appear in the data?

colSums(crash_training2 == 0)

## TARGET_FLAG  TARGET_AMT    KIDSDRIV         AGE    HOMEKIDS         YOJ 
##        6008        6008        7180          NA        5289          NA 
##      INCOME     PARENT1    HOME_VAL     MSTATUS         SEX   EDUCATION 
##           0           0           0           0           0           0 
##         JOB    TRAVTIME     CAR_USE    BLUEBOOK         TIF    CAR_TYPE 
##           0           0           0           0           0           0 
##     RED_CAR    OLDCLAIM    CLM_FREQ     REVOKED     MVR_PTS     CAR_AGE 
##           0           0        5009           0        3712          NA 
##  URBANICITY 
##           0

The amount of zero entires in these variables are not suprising given that it is very reasonable for users not have children or any insurance claims. In most of these variables, a zero value implies a safe driver especially considering both the response variables.

We skip outlier analysis due to several thought points. While we observed some variables with extreme values in the histograms, the extreme values do not seem out of the ordinary. A prime example of an extreme value is travel time being 5 minutes. I believe 5 minutes by car is a reasonable commute time for a good majoirty of people who live close to where they work. Car age is the only variable that does not seem right. Lets run an outlier analysis on car age.

## Outliers identified: 0 nPropotion (%) of outliers: 0 nMean of the outliers: NaN nMean without removing outliers: 8.33 nMean if we remove outliers: 8.33 nDo you want to remove outliers and to replace with NA? [yes/no]: 
## Nothing changed n

There does not seem to be a major change in car age with or without outliers.

II Data Preperation)

There are several variables that are prime candidates for dummy coding such as sex or m status however this may prove to be a long process. We can convert the variables to factors and the lm function will dummy code these categorical variables automatically.

Recall the missing values

colSums(is.na(crash_training2)) #;colSums(is.na(crash_evaluation))

## TARGET_FLAG  TARGET_AMT    KIDSDRIV         AGE    HOMEKIDS         YOJ 
##           0           0           0           6           0         454 
##      INCOME     PARENT1    HOME_VAL     MSTATUS         SEX   EDUCATION 
##           0           0           0           0           0           0 
##         JOB    TRAVTIME     CAR_USE    BLUEBOOK         TIF    CAR_TYPE 
##           0           0           0           0           0           0 
##     RED_CAR    OLDCLAIM    CLM_FREQ     REVOKED     MVR_PTS     CAR_AGE 
##           0           0           0           0           0         510 
##  URBANICITY 
##           0

After giving it more thought, we need to take into consideration that the data represents policy holders. Age of the policy holder should be imputed using the median age of a policy holder. YOJ is the number of years on a job. It is possible that a policy holder just started a job, or they are not employed. If they are not employed, then there are certain insurance companies that still provide low to no cost healthcare such as Fidelis. The blank entries could be replaced with zero. Car age had the most number of missing values and a strange negative number. The negative number could be due to an erorr with data collection. As for the missing values, we can impute with the median since a car age should not be blank for a policy holder’s record. We apply the same transformations to the evaluation data set.

Impute missing values

Check if there has been any major change in summary after imputing

##   TARGET_FLAG       TARGET_AMT        KIDSDRIV           AGE       
##  Min.   :0.0000   Min.   :     0   Min.   :0.0000   Min.   : 0.00  
##  1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000   1st Qu.:39.00  
##  Median :0.0000   Median :     0   Median :0.0000   Median :45.00  
##  Mean   :0.2638   Mean   :  1504   Mean   :0.1711   Mean   :44.76  
##  3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000   3rd Qu.:51.00  
##  Max.   :1.0000   Max.   :107586   Max.   :4.0000   Max.   :81.00  
##                                                                    
##     HOMEKIDS           YOJ              INCOME     PARENT1   
##  Min.   :0.0000   Min.   : 0.000   $0      : 615   No :7084  
##  1st Qu.:0.0000   1st Qu.: 8.000           : 445   Yes:1077  
##  Median :0.0000   Median :11.000   $26,840 :   4             
##  Mean   :0.7212   Mean   : 9.915   $48,509 :   4             
##  3rd Qu.:1.0000   3rd Qu.:13.000   $61,790 :   4             
##  Max.   :5.0000   Max.   :23.000   $107,375:   3             
##                                    (Other) :7086             
##      HOME_VAL    MSTATUS      SEX               EDUCATION   
##  $0      :2294   Yes :4894   M  :3786   <High School :1203  
##          : 464   z_No:3267   z_F:4375   Bachelors    :2242  
##  $111,129:   3                          Masters      :1658  
##  $115,249:   3                          PhD          : 728  
##  $123,109:   3                          z_High School:2330  
##  $153,061:   3                                              
##  (Other) :5391                                              
##             JOB          TRAVTIME            CAR_USE        BLUEBOOK   
##  z_Blue Collar:1825   Min.   :  5.00   Commercial:3029   $1,500 : 157  
##  Clerical     :1271   1st Qu.: 22.00   Private   :5132   $6,000 :  34  
##  Professional :1117   Median : 33.00                     $5,800 :  33  
##  Manager      : 988   Mean   : 33.49                     $6,200 :  33  
##  Lawyer       : 835   3rd Qu.: 44.00                     $6,400 :  31  
##  Student      : 712   Max.   :142.00                     $5,900 :  30  
##  (Other)      :1413                                      (Other):7843  
##       TIF                CAR_TYPE    RED_CAR       OLDCLAIM   
##  Min.   : 1.000   Minivan    :2145   no :5783   $0     :5009  
##  1st Qu.: 1.000   Panel Truck: 676   yes:2378   $1,310 :   4  
##  Median : 4.000   Pickup     :1389              $1,391 :   4  
##  Mean   : 5.351   Sports Car : 907              $4,263 :   4  
##  3rd Qu.: 7.000   Van        : 750              $1,105 :   3  
##  Max.   :25.000   z_SUV      :2294              $1,332 :   3  
##                                                 (Other):3134  
##     CLM_FREQ      REVOKED       MVR_PTS          CAR_AGE      
##  Min.   :0.0000   No :7161   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:0.0000   Yes:1000   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000              Median : 1.000   Median : 8.000  
##  Mean   :0.7986              Mean   : 1.696   Mean   : 7.809  
##  3rd Qu.:2.0000              3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000              Max.   :13.000   Max.   :28.000  
##                                                               
##                  URBANICITY  
##  Highly Urban/ Urban  :6492  
##  z_Highly Rural/ Rural:1669  
##                              
##                              
##                              
##                              
##

##   TARGET_FLAG       TARGET_AMT        KIDSDRIV           AGE       
##  Min.   :0.0000   Min.   :     0   Min.   :0.0000   Min.   :16.00  
##  1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000   1st Qu.:39.00  
##  Median :0.0000   Median :     0   Median :0.0000   Median :45.00  
##  Mean   :0.2638   Mean   :  1504   Mean   :0.1711   Mean   :44.79  
##  3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000   3rd Qu.:51.00  
##  Max.   :1.0000   Max.   :107586   Max.   :4.0000   Max.   :81.00  
##                                                     NA's   :6      
##     HOMEKIDS           YOJ            INCOME     PARENT1   
##  Min.   :0.0000   Min.   : 0.0   $0      : 615   No :7084  
##  1st Qu.:0.0000   1st Qu.: 9.0           : 445   Yes:1077  
##  Median :0.0000   Median :11.0   $26,840 :   4             
##  Mean   :0.7212   Mean   :10.5   $48,509 :   4             
##  3rd Qu.:1.0000   3rd Qu.:13.0   $61,790 :   4             
##  Max.   :5.0000   Max.   :23.0   $107,375:   3             
##                   NA's   :454    (Other) :7086             
##      HOME_VAL    MSTATUS      SEX               EDUCATION   
##  $0      :2294   Yes :4894   M  :3786   <High School :1203  
##          : 464   z_No:3267   z_F:4375   Bachelors    :2242  
##  $111,129:   3                          Masters      :1658  
##  $115,249:   3                          PhD          : 728  
##  $123,109:   3                          z_High School:2330  
##  $153,061:   3                                              
##  (Other) :5391                                              
##             JOB          TRAVTIME            CAR_USE        BLUEBOOK   
##  z_Blue Collar:1825   Min.   :  5.00   Commercial:3029   $1,500 : 157  
##  Clerical     :1271   1st Qu.: 22.00   Private   :5132   $6,000 :  34  
##  Professional :1117   Median : 33.00                     $5,800 :  33  
##  Manager      : 988   Mean   : 33.49                     $6,200 :  33  
##  Lawyer       : 835   3rd Qu.: 44.00                     $6,400 :  31  
##  Student      : 712   Max.   :142.00                     $5,900 :  30  
##  (Other)      :1413                                      (Other):7843  
##       TIF                CAR_TYPE    RED_CAR       OLDCLAIM   
##  Min.   : 1.000   Minivan    :2145   no :5783   $0     :5009  
##  1st Qu.: 1.000   Panel Truck: 676   yes:2378   $1,310 :   4  
##  Median : 4.000   Pickup     :1389              $1,391 :   4  
##  Mean   : 5.351   Sports Car : 907              $4,263 :   4  
##  3rd Qu.: 7.000   Van        : 750              $1,105 :   3  
##  Max.   :25.000   z_SUV      :2294              $1,332 :   3  
##                                                 (Other):3134  
##     CLM_FREQ      REVOKED       MVR_PTS          CAR_AGE      
##  Min.   :0.0000   No :7161   Min.   : 0.000   Min.   :-3.000  
##  1st Qu.:0.0000   Yes:1000   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000              Median : 1.000   Median : 8.000  
##  Mean   :0.7986              Mean   : 1.696   Mean   : 8.328  
##  3rd Qu.:2.0000              3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000              Max.   :13.000   Max.   :28.000  
##                                               NA's   :510     
##                  URBANICITY  
##  Highly Urban/ Urban  :6492  
##  z_Highly Rural/ Rural:1669  
##                              
##                              
##                              
##                              
##

Imputing missing data does not seem to change the stats of the imputed variables. We also took the absolute value of car age to fix the negative entry.

We need to do some additional modifications to our data otherwise it will complicate modeling. We need to be sure that income and simialr variables are read in as numeric. We also need to recode categorical variables into dummies. Transforming our data is going to play a key role in optimizing our model building.

Count the number of levels per variable after we treat the income realted variables as numeric

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## $TARGET_FLAG
## NULL
## 
## $TARGET_AMT
## NULL
## 
## $KIDSDRIV
## NULL
## 
## $AGE
## NULL
## 
## $HOMEKIDS
## NULL
## 
## $YOJ
## NULL
## 
## $INCOME
## NULL
## 
## $PARENT1
## [1] "No"  "Yes"
## 
## $HOME_VAL
## NULL
## 
## $MSTATUS
## [1] "Yes"  "z_No"
## 
## $SEX
## [1] "M"   "z_F"
## 
## $EDUCATION
## [1] "<High School"  "Bachelors"     "Masters"       "PhD"          
## [5] "z_High School"
## 
## $JOB
## [1] ""              "Clerical"      "Doctor"        "Home Maker"   
## [5] "Lawyer"        "Manager"       "Professional"  "Student"      
## [9] "z_Blue Collar"
## 
## $TRAVTIME
## NULL
## 
## $CAR_USE
## [1] "Commercial" "Private"   
## 
## $BLUEBOOK
## NULL
## 
## $TIF
## NULL
## 
## $CAR_TYPE
## [1] "Minivan"     "Panel Truck" "Pickup"      "Sports Car"  "Van"        
## [6] "z_SUV"      
## 
## $RED_CAR
## [1] "no"  "yes"
## 
## $OLDCLAIM
## NULL
## 
## $CLM_FREQ
## NULL
## 
## $REVOKED
## [1] "No"  "Yes"
## 
## $MVR_PTS
## NULL
## 
## $CAR_AGE
## NULL
## 
## $URBANICITY
## [1] "Highly Urban/ Urban"   "z_Highly Rural/ Rural"

The variables JOB and EDUCATION have more than 1 factor. These variables are good candidates for recoding as dummy variables in addition to flattening. This should optimize the computation time of the glm function. We can then remove those variables with many factors.

Model Building

We seek to build two different models (Linear regression and binary logisitc regression). In the EDA portion we determined that it is best to apply a log(x+1) transformation on the response variable. The log makes the spread appear closer to normal and the +1 makes sure the values in the field are within the domain of the log function.

Full Logisitic Regression without target_amt

## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──

## ✔ tibble  1.4.2     ✔ purrr   0.2.4
## ✔ readr   1.1.1     ✔ stringr 1.3.0
## ✔ tibble  1.4.2     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::src()       masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()

## Warning: package 'broom' was built under R version 3.4.4

## Warning: package 'car' was built under R version 3.4.4

## Loading required package: carData

## Warning: package 'carData' was built under R version 3.4.4

## 
## Attaching package: 'car'

## The following object is masked from 'package:purrr':
## 
##     some

## The following object is masked from 'package:dplyr':
## 
##     recode

## 
## Call:
## glm(formula = TARGET_FLAG ~ AGE + BLUEBOOK + CAR_AGE + CAR_TYPE + 
##     CAR_USE + CLM_FREQ + EDUCATION + HOMEKIDS + HOME_VAL + INCOME + 
##     JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 + 
##     RED_CAR + REVOKED + SEX + TIF + TRAVTIME + URBANICITY + YOJ, 
##     family = "binomial", data = crash_training3a)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5911  -0.7179  -0.4046   0.6397   3.1539  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -1.281e+00  3.157e-01  -4.058 4.95e-05 ***
## AGE                             -5.488e-03  3.903e-03  -1.406 0.159646    
## BLUEBOOK                         1.252e-05  3.368e-05   0.372 0.710099    
## CAR_AGE                         -4.202e-03  6.670e-03  -0.630 0.528731    
## CAR_TYPEPanel Truck              1.923e-01  1.432e-01   1.342 0.179473    
## CAR_TYPEPickup                   6.011e-01  1.021e-01   5.888 3.91e-09 ***
## CAR_TYPESports Car               1.228e+00  1.224e-01  10.034  < 2e-16 ***
## CAR_TYPEVan                      4.425e-01  1.213e-01   3.649 0.000263 ***
## CAR_TYPEz_SUV                    9.648e-01  1.030e-01   9.367  < 2e-16 ***
## CAR_USEPrivate                  -7.538e-01  9.154e-02  -8.234  < 2e-16 ***
## CLM_FREQ                         1.169e-01  3.206e-02   3.645 0.000268 ***
## EDUCATIONBachelors              -4.576e-01  1.127e-01  -4.059 4.92e-05 ***
## EDUCATIONMasters                -3.846e-01  1.728e-01  -2.226 0.026046 *  
## EDUCATIONPhD                    -4.087e-01  2.023e-01  -2.020 0.043355 *  
## EDUCATIONz_High School          -1.389e-02  9.476e-02  -0.147 0.883494    
## HOMEKIDS                         4.702e-02  3.676e-02   1.279 0.200857    
## HOME_VAL                        -8.131e-05  2.023e-05  -4.018 5.86e-05 ***
## INCOME                          -1.083e-05  1.598e-05  -0.677 0.498091    
## JOBClerical                      5.463e-01  1.933e-01   2.826 0.004720 ** 
## JOBDoctor                       -3.910e-01  2.633e-01  -1.485 0.137519    
## JOBHome Maker                    6.125e-01  1.987e-01   3.082 0.002057 ** 
## JOBLawyer                        1.825e-01  1.682e-01   1.085 0.277869    
## JOBManager                      -5.113e-01  1.698e-01  -3.011 0.002606 ** 
## JOBProfessional                  2.297e-01  1.769e-01   1.299 0.194069    
## JOBStudent                       5.219e-01  2.062e-01   2.531 0.011368 *  
## JOBz_Blue Collar                 3.841e-01  1.841e-01   2.086 0.036977 *  
## KIDSDRIV                         3.893e-01  6.097e-02   6.386 1.71e-10 ***
## MSTATUSz_No                      5.245e-01  7.722e-02   6.793 1.10e-11 ***
## MVR_PTS                          1.063e-01  1.365e-02   7.786 6.89e-15 ***
## OLDCLAIM                         8.863e-05  4.235e-05   2.093 0.036358 *  
## PARENT1Yes                       3.715e-01  1.090e-01   3.409 0.000653 ***
## RED_CARyes                       1.105e-02  8.618e-02   0.128 0.897954    
## REVOKEDYes                       7.432e-01  8.030e-02   9.255  < 2e-16 ***
## SEXz_F                          -2.808e-01  1.030e-01  -2.725 0.006425 ** 
## TIF                             -5.484e-02  7.315e-03  -7.496 6.58e-14 ***
## TRAVTIME                         1.463e-02  1.878e-03   7.789 6.74e-15 ***
## URBANICITYz_Highly Rural/ Rural -2.355e+00  1.124e-01 -20.940  < 2e-16 ***
## YOJ                             -1.355e-02  7.088e-03  -1.912 0.055891 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7348.1  on 8123  degrees of freedom
## AIC: 7424.1
## 
## Number of Fisher Scoring iterations: 5

##                 GVIF Df GVIF^(1/(2*Df))
## AGE         1.426799  1        1.194487
## BLUEBOOK    1.124090  1        1.060231
## CAR_AGE     1.785593  1        1.336261
## CAR_TYPE    4.053666  5        1.150230
## CAR_USE     2.458818  1        1.568062
## CLM_FREQ    1.849502  1        1.359964
## EDUCATION   9.531535  4        1.325548
## HOMEKIDS    2.146924  1        1.465238
## HOME_VAL    1.321127  1        1.149403
## INCOME      1.284895  1        1.133532
## JOB        19.406626  8        1.203641
## KIDSDRIV    1.347663  1        1.160889
## MSTATUS     1.768427  1        1.329822
## MVR_PTS     1.182979  1        1.087649
## OLDCLAIM    1.839574  1        1.356309
## PARENT1     1.930994  1        1.389602
## RED_CAR     1.840066  1        1.356490
## REVOKED     1.016893  1        1.008411
## SEX         3.154551  1        1.776106
## TIF         1.010589  1        1.005281
## TRAVTIME    1.038718  1        1.019175
## URBANICITY  1.138952  1        1.067217
## YOJ         1.327868  1        1.152332

##   TARGET_FLAG AGE BLUEBOOK CAR_AGE   CAR_TYPE CAR_USE CLM_FREQ EDUCATION
## 1           1  51      350      17 Sports Car Private        0       PhD
## 2           1  67     1671       1      z_SUV Private        4       PhD
## 3           1  59     1431      14     Pickup Private        1       PhD
##   HOMEKIDS HOME_VAL INCOME          JOB KIDSDRIV MSTATUS MVR_PTS OLDCLAIM
## 1        0     3508    124 Professional        0     Yes       0        1
## 2        0     3994   6509       Doctor        0     Yes       5      131
## 3        0     3570   6045       Doctor        0     Yes       2     2585
##   PARENT1 RED_CAR REVOKED SEX TIF TRAVTIME            URBANICITY YOJ
## 1      No      no     Yes z_F   4       48 z_Highly Rural/ Rural  12
## 2      No      no      No   M   6       17   Highly Urban/ Urban  13
## 3      No      no      No z_F   4       53   Highly Urban/ Urban  17
##     .fitted   .se.fit   .resid        .hat    .sigma     .cooksd
## 1 -3.191324 0.2431200 2.542290 0.002242393 0.9507453 0.001441625
## 2 -1.861628 0.2908489 2.003043 0.009848317 0.9509024 0.001700865
## 3 -2.350243 0.2620566 2.209667 0.005457469 0.9508469 0.001522856
##   .std.resid index
## 1   2.545145  2338
## 2   2.012980  2768
## 3   2.215721  7741

##   TARGET_FLAG AGE BLUEBOOK CAR_AGE CAR_TYPE CAR_USE CLM_FREQ EDUCATION
## 1           1  60      856      10  Minivan Private        0 Bachelors
##   HOMEKIDS HOME_VAL INCOME          JOB KIDSDRIV MSTATUS MVR_PTS OLDCLAIM
## 1        0     2843   4605 Professional        0     Yes       0        1
##   PARENT1 RED_CAR REVOKED SEX TIF TRAVTIME            URBANICITY YOJ
## 1      No     yes      No   M   1       35 z_Highly Rural/ Rural  13
##     .fitted   .se.fit   .resid         .hat    .sigma      .cooksd
## 1 -4.966646 0.1769071 3.153915 0.0002150431 0.9505205 0.0008126726
##   .std.resid index
## 1   3.154254  5101

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

##           llh       llhNull            G2      McFadden          r2ML 
## -3674.0444376 -4708.9811460  2069.8734168     0.2197793     0.2240210 
##          r2CU 
##     0.3272130

There are several variables that can be identified as not being significant. These variables are age, bluebook, car age, homekids, income, red car, and yoj. Our next model will eliminate these predictors but retain the levels. These variables were shown to have minimal correlation in our heatmap. Job can also be eliminated due to the fact that it has a high VIF number. The full model yields a McFadden of 0.2.

## 
## Call:
## glm(formula = TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + 
##     HOMEKIDS + HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + 
##     PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY, family = "binomial", 
##     data = crash_training3a)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5886  -0.7290  -0.4235   0.6570   3.0983  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -1.197e+00  1.502e-01  -7.969 1.60e-15 ***
## CAR_TYPEPanel Truck              1.193e-01  1.313e-01   0.908 0.363860    
## CAR_TYPEPickup                   5.796e-01  9.668e-02   5.995 2.03e-09 ***
## CAR_TYPESports Car               1.239e+00  1.196e-01  10.362  < 2e-16 ***
## CAR_TYPEVan                      3.988e-01  1.183e-01   3.372 0.000747 ***
## CAR_TYPEz_SUV                    9.756e-01  1.014e-01   9.622  < 2e-16 ***
## CAR_USEPrivate                  -8.009e-01  7.245e-02 -11.055  < 2e-16 ***
## CLM_FREQ                         1.175e-01  3.182e-02   3.692 0.000223 ***
## EDUCATIONBachelors              -7.543e-01  9.464e-02  -7.970 1.58e-15 ***
## EDUCATIONMasters                -8.615e-01  1.026e-01  -8.399  < 2e-16 ***
## EDUCATIONPhD                    -1.055e+00  1.331e-01  -7.926 2.27e-15 ***
## EDUCATIONz_High School          -1.280e-01  9.085e-02  -1.409 0.158823    
## HOMEKIDS                         7.975e-02  3.339e-02   2.389 0.016916 *  
## HOME_VAL                        -9.488e-05  1.978e-05  -4.797 1.61e-06 ***
## KIDSDRIV                         3.484e-01  5.934e-02   5.872 4.31e-09 ***
## MSTATUSz_No                      5.204e-01  7.625e-02   6.826 8.74e-12 ***
## MVR_PTS                          1.121e-01  1.354e-02   8.280  < 2e-16 ***
## OLDCLAIM                         8.477e-05  4.205e-05   2.016 0.043804 *  
## PARENT1Yes                       3.721e-01  1.074e-01   3.466 0.000528 ***
## REVOKEDYes                       7.487e-01  7.949e-02   9.419  < 2e-16 ***
## SEXz_F                          -2.382e-01  8.752e-02  -2.722 0.006485 ** 
## TIF                             -5.478e-02  7.253e-03  -7.553 4.26e-14 ***
## TRAVTIME                         1.498e-02  1.862e-03   8.042 8.80e-16 ***
## URBANICITYz_Highly Rural/ Rural -2.239e+00  1.114e-01 -20.095  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7451.7  on 8137  degrees of freedom
## AIC: 7499.7
## 
## Number of Fisher Scoring iterations: 5

##                GVIF Df GVIF^(1/(2*Df))
## CAR_TYPE   3.188769  5        1.122955
## CAR_USE    1.562896  1        1.250158
## CLM_FREQ   1.849893  1        1.360108
## EDUCATION  1.345521  4        1.037794
## HOMEKIDS   1.801076  1        1.342042
## HOME_VAL   1.261352  1        1.123099
## KIDSDRIV   1.290858  1        1.136159
## MSTATUS    1.753165  1        1.324071
## MVR_PTS    1.178627  1        1.085646
## OLDCLAIM   1.839824  1        1.356401
## PARENT1    1.912960  1        1.383098
## REVOKED    1.014748  1        1.007347
## SEX        2.311561  1        1.520382
## TIF        1.007045  1        1.003516
## TRAVTIME   1.034820  1        1.017261
## URBANICITY 1.118487  1        1.057586

##   TARGET_FLAG CAR_TYPE CAR_USE CLM_FREQ     EDUCATION HOMEKIDS HOME_VAL
## 1           1      Van Private        3 z_High School        3      264
## 2           0    z_SUV Private        1     Bachelors        4     1333
## 3           0    z_SUV Private        3  <High School        3        2
##   KIDSDRIV MSTATUS MVR_PTS OLDCLAIM PARENT1 REVOKED SEX TIF TRAVTIME
## 1        0     Yes       0     2331      No      No   M   1       40
## 2        4     Yes       9      717      No     Yes z_F   1       23
## 3        3    z_No       4     2703     Yes      No z_F   1       95
##              URBANICITY   .fitted   .se.fit    .resid        .hat
## 1 z_Highly Rural/ Rural -2.657602 0.2005136  2.334681 0.002461993
## 2   Highly Urban/ Urban  1.797394 0.2473381 -1.975217 0.007460934
## 3   Highly Urban/ Urban  3.314761 0.2276243 -2.588613 0.001753313
##      .sigma     .cooksd .std.resid index
## 1 0.9566739 0.001470274   2.337561   807
## 2 0.9567724 0.001904079  -1.982627  3773
## 3 0.9565937 0.002017227  -2.590885  5652

##   TARGET_FLAG CAR_TYPE CAR_USE CLM_FREQ EDUCATION HOMEKIDS HOME_VAL
## 1           1  Minivan Private        0 Bachelors        0     2843
##   KIDSDRIV MSTATUS MVR_PTS OLDCLAIM PARENT1 REVOKED SEX TIF TRAVTIME
## 1        0     Yes       0        1      No      No   M   1       35
##              URBANICITY   .fitted   .se.fit   .resid         .hat
## 1 z_Highly Rural/ Rural -4.791511 0.1518637 3.098315 0.0001883017
##      .sigma      .cooksd .std.resid index
## 1 0.9564081 0.0009456568   3.098606  5101

##           llh       llhNull            G2      McFadden          r2ML 
## -3725.8668999 -4708.9811460  1966.2284922     0.2087743     0.2141032 
##          r2CU 
##     0.3127267

The second model has the same McFadden as the first but it is less complicated. The AIC is hovering around the same value.

Build a model using backstepping

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

step<-stepAIC(mod3, trace=FALSE)
step$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + HOMEKIDS + 
##     HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 + 
##     REVOKED + SEX + TIF + TRAVTIME + URBANICITY
## 
## Final Model:
## TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + HOMEKIDS + 
##     HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 + 
##     REVOKED + SEX + TIF + TRAVTIME + URBANICITY
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                       8137   7451.734 7499.734

Formulate the model built by backstepping

mod4<-glm(TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + HOMEKIDS + 
    HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 + 
    REVOKED + SEX + TIF + TRAVTIME + URBANICITY, family="binomial",  data=crash_training3a)
summary(mod4);

## 
## Call:
## glm(formula = TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + 
##     HOMEKIDS + HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + 
##     PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY, family = "binomial", 
##     data = crash_training3a)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5886  -0.7290  -0.4235   0.6570   3.0983  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -1.197e+00  1.502e-01  -7.969 1.60e-15 ***
## CAR_TYPEPanel Truck              1.193e-01  1.313e-01   0.908 0.363860    
## CAR_TYPEPickup                   5.796e-01  9.668e-02   5.995 2.03e-09 ***
## CAR_TYPESports Car               1.239e+00  1.196e-01  10.362  < 2e-16 ***
## CAR_TYPEVan                      3.988e-01  1.183e-01   3.372 0.000747 ***
## CAR_TYPEz_SUV                    9.756e-01  1.014e-01   9.622  < 2e-16 ***
## CAR_USEPrivate                  -8.009e-01  7.245e-02 -11.055  < 2e-16 ***
## CLM_FREQ                         1.175e-01  3.182e-02   3.692 0.000223 ***
## EDUCATIONBachelors              -7.543e-01  9.464e-02  -7.970 1.58e-15 ***
## EDUCATIONMasters                -8.615e-01  1.026e-01  -8.399  < 2e-16 ***
## EDUCATIONPhD                    -1.055e+00  1.331e-01  -7.926 2.27e-15 ***
## EDUCATIONz_High School          -1.280e-01  9.085e-02  -1.409 0.158823    
## HOMEKIDS                         7.975e-02  3.339e-02   2.389 0.016916 *  
## HOME_VAL                        -9.488e-05  1.978e-05  -4.797 1.61e-06 ***
## KIDSDRIV                         3.484e-01  5.934e-02   5.872 4.31e-09 ***
## MSTATUSz_No                      5.204e-01  7.625e-02   6.826 8.74e-12 ***
## MVR_PTS                          1.121e-01  1.354e-02   8.280  < 2e-16 ***
## OLDCLAIM                         8.477e-05  4.205e-05   2.016 0.043804 *  
## PARENT1Yes                       3.721e-01  1.074e-01   3.466 0.000528 ***
## REVOKEDYes                       7.487e-01  7.949e-02   9.419  < 2e-16 ***
## SEXz_F                          -2.382e-01  8.752e-02  -2.722 0.006485 ** 
## TIF                             -5.478e-02  7.253e-03  -7.553 4.26e-14 ***
## TRAVTIME                         1.498e-02  1.862e-03   8.042 8.80e-16 ***
## URBANICITYz_Highly Rural/ Rural -2.239e+00  1.114e-01 -20.095  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7451.7  on 8137  degrees of freedom
## AIC: 7499.7
## 
## Number of Fisher Scoring iterations: 5

plot(mod4, which = 4, id.n = 3);

mod4.data <- augment(mod4) %>% 
  mutate(index = 1:n()) ;

car::vif(mod4);

##                GVIF Df GVIF^(1/(2*Df))
## CAR_TYPE   3.188769  5        1.122955
## CAR_USE    1.562896  1        1.250158
## CLM_FREQ   1.849893  1        1.360108
## EDUCATION  1.345521  4        1.037794
## HOMEKIDS   1.801076  1        1.342042
## HOME_VAL   1.261352  1        1.123099
## KIDSDRIV   1.290858  1        1.136159
## MSTATUS    1.753165  1        1.324071
## MVR_PTS    1.178627  1        1.085646
## OLDCLAIM   1.839824  1        1.356401
## PARENT1    1.912960  1        1.383098
## REVOKED    1.014748  1        1.007347
## SEX        2.311561  1        1.520382
## TIF        1.007045  1        1.003516
## TRAVTIME   1.034820  1        1.017261
## URBANICITY 1.118487  1        1.057586

mod4.data %>% top_n(3, .cooksd);

##   TARGET_FLAG CAR_TYPE CAR_USE CLM_FREQ     EDUCATION HOMEKIDS HOME_VAL
## 1           1      Van Private        3 z_High School        3      264
## 2           0    z_SUV Private        1     Bachelors        4     1333
## 3           0    z_SUV Private        3  <High School        3        2
##   KIDSDRIV MSTATUS MVR_PTS OLDCLAIM PARENT1 REVOKED SEX TIF TRAVTIME
## 1        0     Yes       0     2331      No      No   M   1       40
## 2        4     Yes       9      717      No     Yes z_F   1       23
## 3        3    z_No       4     2703     Yes      No z_F   1       95
##              URBANICITY   .fitted   .se.fit    .resid        .hat
## 1 z_Highly Rural/ Rural -2.657602 0.2005136  2.334681 0.002461993
## 2   Highly Urban/ Urban  1.797394 0.2473381 -1.975217 0.007460934
## 3   Highly Urban/ Urban  3.314761 0.2276243 -2.588613 0.001753313
##      .sigma     .cooksd .std.resid index
## 1 0.9566739 0.001470274   2.337561   807
## 2 0.9567724 0.001904079  -1.982627  3773
## 3 0.9565937 0.002017227  -2.590885  5652

ggplot(mod4.data, aes(index, .std.resid)) + 
  geom_point(aes(color = TARGET_FLAG), alpha = .5) +
  theme_bw();

mod4.data %>% 
  filter(abs(.std.resid) > 3);

##   TARGET_FLAG CAR_TYPE CAR_USE CLM_FREQ EDUCATION HOMEKIDS HOME_VAL
## 1           1  Minivan Private        0 Bachelors        0     2843
##   KIDSDRIV MSTATUS MVR_PTS OLDCLAIM PARENT1 REVOKED SEX TIF TRAVTIME
## 1        0     Yes       0        1      No      No   M   1       35
##              URBANICITY   .fitted   .se.fit   .resid         .hat
## 1 z_Highly Rural/ Rural -4.791511 0.1518637 3.098315 0.0001883017
##      .sigma      .cooksd .std.resid index
## 1 0.9564081 0.0009456568   3.098606  5101

pR2(mod4)

##           llh       llhNull            G2      McFadden          r2ML 
## -3725.8668999 -4708.9811460  1966.2284922     0.2087743     0.2141032 
##          r2CU 
##     0.3127267

This model allows us to dive deeper into the levels. Having a car type of a truck or having a high school education does not seem to be significant due to high p values. It is commanly known to be a bad statistical practice to remove levels that are not significant because it will dichotomize the predictor.

We need to find a linear regression model to predict the amount of money a car crash costs. Full Model with log transform response variable

## 
## Attaching package: 'olsrr'

## The following object is masked from 'package:MASS':
## 
##     cement

## The following object is masked from 'package:datasets':
## 
##     rivers

## 
## Call:
## lm(formula = log(TARGET_AMT + 1) ~ AGE + BLUEBOOK + CAR_AGE + 
##     CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + HOMEKIDS + HOME_VAL + 
##     INCOME + JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + 
##     PARENT1 + RED_CAR + REVOKED + SEX + TIF + TRAVTIME + URBANICITY, 
##     data = crash_training3a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0977 -2.3385 -0.9203  2.1755 10.8203 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      1.912e+00  3.902e-01   4.900 9.77e-07 ***
## AGE                             -6.950e-03  4.868e-03  -1.428 0.153404    
## BLUEBOOK                         1.044e-05  4.260e-05   0.245 0.806508    
## CAR_AGE                         -5.493e-03  8.077e-03  -0.680 0.496473    
## CAR_TYPEPanel Truck              1.230e-01  1.814e-01   0.678 0.497673    
## CAR_TYPEPickup                   6.420e-01  1.236e-01   5.194 2.11e-07 ***
## CAR_TYPESports Car               1.387e+00  1.468e-01   9.449  < 2e-16 ***
## CAR_TYPEVan                      4.509e-01  1.482e-01   3.043 0.002352 ** 
## CAR_TYPEz_SUV                    1.066e+00  1.182e-01   9.024  < 2e-16 ***
## CAR_USEPrivate                  -9.995e-01  1.174e-01  -8.514  < 2e-16 ***
## CLM_FREQ                         1.500e-01  4.498e-02   3.334 0.000860 ***
## EDUCATIONBachelors              -5.570e-01  1.433e-01  -3.886 0.000103 ***
## EDUCATIONMasters                -3.751e-01  2.072e-01  -1.811 0.070239 .  
## EDUCATIONPhD                    -4.606e-01  2.397e-01  -1.921 0.054718 .  
## EDUCATIONz_High School           2.078e-02  1.227e-01   0.169 0.865534    
## HOMEKIDS                         3.674e-02  4.573e-02   0.803 0.421776    
## HOME_VAL                        -9.463e-05  2.505e-05  -3.778 0.000159 ***
## INCOME                          -2.007e-05  1.900e-05  -1.056 0.290902    
## JOBClerical                      9.008e-01  2.411e-01   3.736 0.000188 ***
## JOBDoctor                       -2.893e-01  2.917e-01  -0.992 0.321330    
## JOBHome Maker                    1.068e+00  2.448e-01   4.362 1.30e-05 ***
## JOBLawyer                        3.035e-01  2.106e-01   1.441 0.149588    
## JOBManager                      -4.683e-01  2.056e-01  -2.277 0.022800 *  
## JOBProfessional                  4.847e-01  2.199e-01   2.204 0.027559 *  
## JOBStudent                       9.691e-01  2.566e-01   3.777 0.000160 ***
## JOBz_Blue Collar                 7.243e-01  2.292e-01   3.160 0.001581 ** 
## KIDSDRIV                         5.067e-01  8.077e-02   6.274 3.70e-10 ***
## MSTATUSz_No                      6.418e-01  9.536e-02   6.730 1.81e-11 ***
## MVR_PTS                          1.762e-01  1.866e-02   9.444  < 2e-16 ***
## OLDCLAIM                         1.089e-04  5.971e-05   1.824 0.068156 .  
## PARENT1Yes                       6.455e-01  1.441e-01   4.480 7.56e-06 ***
## RED_CARyes                      -3.549e-03  1.064e-01  -0.033 0.973389    
## REVOKEDYes                       1.073e+00  1.111e-01   9.656  < 2e-16 ***
## SEXz_F                          -3.211e-01  1.223e-01  -2.625 0.008688 ** 
## TIF                             -6.530e-02  8.696e-03  -7.510 6.56e-14 ***
## TRAVTIME                         1.674e-02  2.300e-03   7.277 3.73e-13 ***
## URBANICITYz_Highly Rural/ Rural -2.437e+00  9.983e-02 -24.414  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.245 on 8124 degrees of freedom
## Multiple R-squared:  0.2222, Adjusted R-squared:  0.2188 
## F-statistic: 64.48 on 36 and 8124 DF,  p-value: < 2.2e-16

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                      Data                       
##  -----------------------------------------------
##  Response : log(TARGET_AMT + 1) 
##  Variables: fitted values of log(TARGET_AMT + 1) 
## 
##           Test Summary           
##  --------------------------------
##  DF            =    1 
##  Chi2          =    825.5908 
##  Prob > Chi2   =    1.473243e-181

##                 GVIF Df GVIF^(1/(2*Df))
## AGE         1.392944  1        1.180230
## BLUEBOOK    1.122873  1        1.059657
## CAR_AGE     1.745095  1        1.321021
## CAR_TYPE    3.709046  5        1.140056
## CAR_USE     2.492971  1        1.578914
## CLM_FREQ    2.104164  1        1.450574
## EDUCATION   9.294596  4        1.321383
## HOMEKIDS    2.019551  1        1.421109
## HOME_VAL    1.400510  1        1.183431
## INCOME      1.223359  1        1.106055
## JOB        17.528235  8        1.196007
## KIDSDRIV    1.322743  1        1.150105
## MSTATUS     1.692053  1        1.300789
## MVR_PTS     1.243635  1        1.115184
## OLDCLAIM    2.053553  1        1.433022
## PARENT1     1.842964  1        1.357558
## RED_CAR     1.811537  1        1.345933
## REVOKED     1.028725  1        1.014261
## SEX         2.885243  1        1.698600
## TIF         1.007572  1        1.003779
## TRAVTIME    1.037493  1        1.018574
## URBANICITY  1.256678  1        1.121017

The full linear model suggests that we eliminate job due to its high VIF number. We can aslo remove age, bluebook, car age, education, homekids, and income due to high p values. Education has multiple levels, most with high p values.

## 
## Call:
## lm(formula = log(TARGET_AMT + 1) ~ CAR_TYPE + CAR_USE + CLM_FREQ + 
##     HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 + 
##     REVOKED + SEX + TIF + TRAVTIME + URBANICITY + YOJ, data = crash_training3a)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.947 -2.340 -1.029  2.241 10.813 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      2.361e+00  1.811e-01  13.038  < 2e-16 ***
## CAR_TYPEPanel Truck             -2.468e-01  1.655e-01  -1.491 0.135961    
## CAR_TYPEPickup                   6.424e-01  1.193e-01   5.385 7.43e-08 ***
## CAR_TYPESports Car               1.479e+00  1.472e-01  10.050  < 2e-16 ***
## CAR_TYPEVan                      2.733e-01  1.462e-01   1.869 0.061699 .  
## CAR_TYPEz_SUV                    1.162e+00  1.190e-01   9.769  < 2e-16 ***
## CAR_USEPrivate                  -1.198e+00  9.036e-02 -13.262  < 2e-16 ***
## CLM_FREQ                         1.593e-01  4.560e-02   3.495 0.000477 ***
## HOME_VAL                        -2.025e-04  2.363e-05  -8.569  < 2e-16 ***
## KIDSDRIV                         5.684e-01  7.391e-02   7.690 1.64e-14 ***
## MSTATUSz_No                      3.312e-01  9.132e-02   3.627 0.000289 ***
## MVR_PTS                          1.876e-01  1.888e-02   9.941  < 2e-16 ***
## OLDCLAIM                         1.256e-04  6.055e-05   2.075 0.038040 *  
## PARENT1Yes                       9.147e-01  1.272e-01   7.191 7.00e-13 ***
## REVOKEDYes                       1.122e+00  1.126e-01   9.968  < 2e-16 ***
## SEXz_F                          -3.179e-01  1.047e-01  -3.036 0.002405 ** 
## TIF                             -6.501e-02  8.809e-03  -7.381 1.73e-13 ***
## TRAVTIME                         1.789e-02  2.329e-03   7.681 1.76e-14 ***
## URBANICITYz_Highly Rural/ Rural -2.010e+00  9.674e-02 -20.774  < 2e-16 ***
## YOJ                             -3.637e-02  7.961e-03  -4.568 4.99e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.293 on 8141 degrees of freedom
## Multiple R-squared:  0.1975, Adjusted R-squared:  0.1956 
## F-statistic: 105.5 on 19 and 8141 DF,  p-value: < 2.2e-16

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                      Data                       
##  -----------------------------------------------
##  Response : log(TARGET_AMT + 1) 
##  Variables: fitted values of log(TARGET_AMT + 1) 
## 
##           Test Summary           
##  --------------------------------
##  DF            =    1 
##  Chi2          =    734.0186 
##  Prob > Chi2   =    1.197798e-161

##                GVIF Df GVIF^(1/(2*Df))
## CAR_TYPE   2.760221  5        1.106864
## CAR_USE    1.434298  1        1.197622
## CLM_FREQ   2.100171  1        1.449197
## HOME_VAL   1.210598  1        1.100272
## KIDSDRIV   1.075910  1        1.037261
## MSTATUS    1.506874  1        1.227548
## MVR_PTS    1.236127  1        1.111813
## OLDCLAIM   2.051267  1        1.432224
## PARENT1    1.394992  1        1.181098
## REVOKED    1.026012  1        1.012922
## SEX        2.051994  1        1.432478
## TIF        1.004093  1        1.002045
## TRAVTIME   1.033058  1        1.016394
## URBANICITY 1.146057  1        1.070540
## YOJ        1.030601  1        1.015185

The reduced model has a much lower R squared value, however according to the Breusch Pagan test, the low p-value means there is non-constant variance present.

How about we build a model with backstepping?

#library(MASS)
lstep<-stepAIC(lmod2, trace=FALSE)
lstep$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## log(TARGET_AMT + 1) ~ AGE + BLUEBOOK + CAR_AGE + CAR_TYPE + CAR_USE + 
##     CLM_FREQ + EDUCATION + HOMEKIDS + HOME_VAL + INCOME + JOB + 
##     KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 + RED_CAR + 
##     REVOKED + SEX + TIF + TRAVTIME + URBANICITY
## 
## Final Model:
## log(TARGET_AMT + 1) ~ AGE + CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + 
##     HOME_VAL + JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + 
##     PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY
## 
## 
##         Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1                               8124   85545.12 19249.72
## 2  - RED_CAR  1  0.0117177      8125   85545.13 19247.72
## 3 - BLUEBOOK  1  0.6338732      8126   85545.77 19245.78
## 4  - CAR_AGE  1  4.9548030      8127   85550.72 19244.25
## 5 - HOMEKIDS  1  6.8761756      8128   85557.60 19242.91
## 6   - INCOME  1 11.7191331      8129   85569.32 19242.02

Formulate the model obtained through backstepping

lmod4<-lm(log(TARGET_AMT + 1) ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + 
    HOMEKIDS + HOME_VAL + JOB + KIDSDRIV + MSTATUS + MVR_PTS + 
    OLDCLAIM + PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY + 
    YOJ,data=crash_training3a)
#summary(lmod2);

summary(lmod4);

## 
## Call:
## lm(formula = log(TARGET_AMT + 1) ~ CAR_TYPE + CAR_USE + CLM_FREQ + 
##     EDUCATION + HOMEKIDS + HOME_VAL + JOB + KIDSDRIV + MSTATUS + 
##     MVR_PTS + OLDCLAIM + PARENT1 + REVOKED + SEX + TIF + TRAVTIME + 
##     URBANICITY + YOJ, data = crash_training3a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0992 -2.3399 -0.9167  2.1626 10.8591 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      1.764e+00  3.129e-01   5.635 1.80e-08 ***
## CAR_TYPEPanel Truck              1.368e-01  1.782e-01   0.768 0.442546    
## CAR_TYPEPickup                   6.554e-01  1.209e-01   5.422 6.07e-08 ***
## CAR_TYPESports Car               1.378e+00  1.455e-01   9.471  < 2e-16 ***
## CAR_TYPEVan                      4.587e-01  1.480e-01   3.099 0.001946 ** 
## CAR_TYPEz_SUV                    1.062e+00  1.176e-01   9.024  < 2e-16 ***
## CAR_USEPrivate                  -9.942e-01  1.173e-01  -8.474  < 2e-16 ***
## CLM_FREQ                         1.473e-01  4.495e-02   3.277 0.001054 ** 
## EDUCATIONBachelors              -6.025e-01  1.362e-01  -4.423 9.87e-06 ***
## EDUCATIONMasters                -4.434e-01  1.909e-01  -2.323 0.020219 *  
## EDUCATIONPhD                    -5.160e-01  2.266e-01  -2.277 0.022790 *  
## EDUCATIONz_High School           7.435e-03  1.220e-01   0.061 0.951389    
## HOMEKIDS                         7.263e-02  4.302e-02   1.688 0.091400 .  
## HOME_VAL                        -9.539e-05  2.501e-05  -3.814 0.000138 ***
## JOBClerical                      9.215e-01  2.406e-01   3.829 0.000129 ***
## JOBDoctor                       -3.047e-01  2.912e-01  -1.046 0.295553    
## JOBHome Maker                    9.855e-01  2.461e-01   4.004 6.29e-05 ***
## JOBLawyer                        2.854e-01  2.102e-01   1.358 0.174611    
## JOBManager                      -4.817e-01  2.054e-01  -2.345 0.019031 *  
## JOBProfessional                  4.701e-01  2.197e-01   2.139 0.032427 *  
## JOBStudent                       9.008e-01  2.576e-01   3.497 0.000473 ***
## JOBz_Blue Collar                 7.221e-01  2.290e-01   3.153 0.001620 ** 
## KIDSDRIV                         4.867e-01  7.958e-02   6.115 1.01e-09 ***
## MSTATUSz_No                      6.291e-01  9.549e-02   6.589 4.72e-11 ***
## MVR_PTS                          1.757e-01  1.863e-02   9.427  < 2e-16 ***
## OLDCLAIM                         1.095e-04  5.967e-05   1.834 0.066649 .  
## PARENT1Yes                       6.552e-01  1.434e-01   4.568 4.99e-06 ***
## REVOKEDYes                       1.075e+00  1.110e-01   9.677  < 2e-16 ***
## SEXz_F                          -3.129e-01  1.045e-01  -2.995 0.002751 ** 
## TIF                             -6.537e-02  8.691e-03  -7.522 5.98e-14 ***
## TRAVTIME                         1.677e-02  2.299e-03   7.297 3.23e-13 ***
## URBANICITYz_Highly Rural/ Rural -2.445e+00  9.973e-02 -24.518  < 2e-16 ***
## YOJ                             -2.258e-02  8.617e-03  -2.621 0.008788 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.244 on 8128 degrees of freedom
## Multiple R-squared:  0.2225, Adjusted R-squared:  0.2195 
## F-statistic:  72.7 on 32 and 8128 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(lmod4)

hist(resid(lmod4), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod4);

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                      Data                       
##  -----------------------------------------------
##  Response : log(TARGET_AMT + 1) 
##  Variables: fitted values of log(TARGET_AMT + 1) 
## 
##           Test Summary           
##  --------------------------------
##  DF            =    1 
##  Chi2          =    827.5224 
##  Prob > Chi2   =    5.601701e-182

vif(lmod4)

##                 GVIF Df GVIF^(1/(2*Df))
## CAR_TYPE    3.440799  5        1.131530
## CAR_USE     2.491683  1        1.578507
## CLM_FREQ    2.103030  1        1.450183
## EDUCATION   7.243179  4        1.280829
## HOMEKIDS    1.788846  1        1.337477
## HOME_VAL    1.397842  1        1.182304
## JOB        18.103217  8        1.198422
## KIDSDRIV    1.285407  1        1.133758
## MSTATUS     1.697876  1        1.303026
## MVR_PTS     1.241590  1        1.114267
## OLDCLAIM    2.053236  1        1.432912
## PARENT1     1.828223  1        1.352118
## REVOKED     1.028404  1        1.014103
## SEX         2.105523  1        1.451042
## TIF         1.007337  1        1.003662
## TRAVTIME    1.037289  1        1.018474
## URBANICITY  1.255074  1        1.120301
## YOJ         1.244493  1        1.115569

Based on the VIF numbers, we can remove job.

lmod5<-lm(log(TARGET_AMT+1) ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + 
    HOMEKIDS + HOME_VAL  + KIDSDRIV + MSTATUS + MVR_PTS + 
    OLDCLAIM + PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY + 
    YOJ,data=crash_training3a)
#summary(lmod2);

summary(lmod5);

## 
## Call:
## lm(formula = log(TARGET_AMT + 1) ~ CAR_TYPE + CAR_USE + CLM_FREQ + 
##     EDUCATION + HOMEKIDS + HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + 
##     OLDCLAIM + PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY + 
##     YOJ, data = crash_training3a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0420 -2.3491 -0.9675  2.2529 11.0907 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      2.663e+00  2.060e-01  12.925  < 2e-16 ***
## CAR_TYPEPanel Truck             -4.865e-03  1.673e-01  -0.029  0.97681    
## CAR_TYPEPickup                   6.124e-01  1.188e-01   5.154 2.61e-07 ***
## CAR_TYPESports Car               1.392e+00  1.459e-01   9.538  < 2e-16 ***
## CAR_TYPEVan                      3.938e-01  1.458e-01   2.701  0.00693 ** 
## CAR_TYPEz_SUV                    1.075e+00  1.181e-01   9.108  < 2e-16 ***
## CAR_USEPrivate                  -1.085e+00  9.365e-02 -11.583  < 2e-16 ***
## CLM_FREQ                         1.553e-01  4.517e-02   3.437  0.00059 ***
## EDUCATIONBachelors              -9.482e-01  1.212e-01  -7.824 5.75e-15 ***
## EDUCATIONMasters                -1.032e+00  1.309e-01  -7.881 3.67e-15 ***
## EDUCATIONPhD                    -1.309e+00  1.629e-01  -8.038 1.04e-15 ***
## EDUCATIONz_High School          -1.123e-01  1.191e-01  -0.943  0.34588    
## HOMEKIDS                         1.048e-01  4.282e-02   2.447  0.01443 *  
## HOME_VAL                        -1.143e-04  2.445e-05  -4.676 2.97e-06 ***
## KIDSDRIV                         4.652e-01  7.988e-02   5.823 5.99e-09 ***
## MSTATUSz_No                      5.982e-01  9.547e-02   6.266 3.90e-10 ***
## MVR_PTS                          1.826e-01  1.870e-02   9.766  < 2e-16 ***
## OLDCLAIM                         1.099e-04  5.998e-05   1.832  0.06703 .  
## PARENT1Yes                       6.271e-01  1.441e-01   4.353 1.36e-05 ***
## REVOKEDYes                       1.093e+00  1.116e-01   9.794  < 2e-16 ***
## SEXz_F                          -2.636e-01  1.040e-01  -2.534  0.01131 *  
## TIF                             -6.595e-02  8.728e-03  -7.557 4.57e-14 ***
## TRAVTIME                         1.780e-02  2.308e-03   7.711 1.40e-14 ***
## URBANICITYz_Highly Rural/ Rural -2.244e+00  9.786e-02 -22.926  < 2e-16 ***
## YOJ                             -3.340e-02  7.922e-03  -4.217 2.51e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.261 on 8136 degrees of freedom
## Multiple R-squared:  0.2133, Adjusted R-squared:  0.211 
## F-statistic: 91.92 on 24 and 8136 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(lmod5)

hist(resid(lmod5), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod5);

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                      Data                       
##  -----------------------------------------------
##  Response : log(TARGET_AMT + 1) 
##  Variables: fitted values of log(TARGET_AMT + 1) 
## 
##           Test Summary           
##  --------------------------------
##  DF            =    1 
##  Chi2          =    788.6912 
##  Prob > Chi2   =    1.551769e-173

vif(lmod5)

##                GVIF Df GVIF^(1/(2*Df))
## CAR_TYPE   2.928573  5        1.113437
## CAR_USE    1.570921  1        1.253364
## CLM_FREQ   2.100759  1        1.449399
## EDUCATION  1.389562  4        1.041981
## HOMEKIDS   1.753133  1        1.324059
## HOME_VAL   1.321200  1        1.149434
## KIDSDRIV   1.281054  1        1.131836
## MSTATUS    1.679124  1        1.295810
## MVR_PTS    1.237080  1        1.112241
## OLDCLAIM   2.052316  1        1.432591
## PARENT1    1.824590  1        1.350774
## REVOKED    1.027230  1        1.013524
## SEX        2.066004  1        1.437360
## TIF        1.004898  1        1.002446
## TRAVTIME   1.034302  1        1.017006
## URBANICITY 1.195641  1        1.093454
## YOJ        1.040417  1        1.020008

This model has VIF numbers less than 4 for all predictors. The adjusted r square is still hovering 2. As for the coefficients, it appears that having a truck has a negative effect on getting into a crash where as sports cars, suvs, and vans have a positive effect. Education also has a negative effect on crashing. Many of these coefficients seem to make sense with our pre-existing knowledge on what elements might contribute to a car crash. It should also be noted that not having a log transform greatly reduces the adjusted r square.

Model Selection

We are going to validate mod4 and lmod5. These two models seemed to be the best performing out of all the linear and logisitc regression models built in this study.

In order to test the performance of our logitic model, we will partition the training data into a test and validation subset before deploying the model on the evaluation data.

## Warning: package 'caret' was built under R version 3.4.4

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

## The following object is masked from 'package:survival':
## 
##     cluster

## 'data.frame':    2448 obs. of  3 variables:
##  $ class                  : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ as.integer.pred_filter.: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ as.double.pred.        : num  0.0934 0.0609 0.4835 0.2695 0.2503 ...

## [1] "class"                   "as.integer.pred_filter."
## [3] "as.double.pred."

Get a confusion matrix

library(pROC)

## Warning: package 'pROC' was built under R version 3.4.4

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(caret)

names(classes.df)<-c("class", "scored.class", "scored.probability")

matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])
matrix.df2

##             class
## scored.class    1    0
##            1  259  130
##            0  404 1655

#confusion matrix
caret_matrix <- confusionMatrix(matrix.df2)
#Information from Confusion matrix
caret_matrix;

## Confusion Matrix and Statistics
## 
##             class
## scored.class    1    0
##            1  259  130
##            0  404 1655
##                                          
##                Accuracy : 0.7819         
##                  95% CI : (0.765, 0.7981)
##     No Information Rate : 0.7292         
##     P-Value [Acc > NIR] : 1.161e-09      
##                                          
##                   Kappa : 0.3653         
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.3906         
##             Specificity : 0.9272         
##          Pos Pred Value : 0.6658         
##          Neg Pred Value : 0.8038         
##              Prevalence : 0.2708         
##          Detection Rate : 0.1058         
##    Detection Prevalence : 0.1589         
##       Balanced Accuracy : 0.6589         
##                                          
##        'Positive' Class : 1              
##

f1_reduced<- function(df)
  {
  TP <- sum(classes.df$class == 1 & classes.df$scored.class == 1) 
  Fn <- sum(classes.df$class == 1 & classes.df$scored.class == 0)
  FP <- sum(classes.df$class == 0 & classes.df$scored.class == 1)
  (2*TP)/(2*TP + Fn + FP)
}
f1_reduced(classes.df);

## [1] 0.4923954

plot(roc(classes.df$class, classes.df$scored.probability), main="ROC Curve");

auc(roc(classes.df$class, classes.df$scored.probability))

## Area under the curve: 0.8066

## 
## Call:
## glm(formula = TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + 
##     HOMEKIDS + HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + 
##     PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY, family = "binomial", 
##     data = crash_training3a)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5886  -0.7290  -0.4235   0.6570   3.0983  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -1.197e+00  1.502e-01  -7.969 1.60e-15 ***
## CAR_TYPEPanel Truck              1.193e-01  1.313e-01   0.908 0.363860    
## CAR_TYPEPickup                   5.796e-01  9.668e-02   5.995 2.03e-09 ***
## CAR_TYPESports Car               1.239e+00  1.196e-01  10.362  < 2e-16 ***
## CAR_TYPEVan                      3.988e-01  1.183e-01   3.372 0.000747 ***
## CAR_TYPEz_SUV                    9.756e-01  1.014e-01   9.622  < 2e-16 ***
## CAR_USEPrivate                  -8.009e-01  7.245e-02 -11.055  < 2e-16 ***
## CLM_FREQ                         1.175e-01  3.182e-02   3.692 0.000223 ***
## EDUCATIONBachelors              -7.543e-01  9.464e-02  -7.970 1.58e-15 ***
## EDUCATIONMasters                -8.615e-01  1.026e-01  -8.399  < 2e-16 ***
## EDUCATIONPhD                    -1.055e+00  1.331e-01  -7.926 2.27e-15 ***
## EDUCATIONz_High School          -1.280e-01  9.085e-02  -1.409 0.158823    
## HOMEKIDS                         7.975e-02  3.339e-02   2.389 0.016916 *  
## HOME_VAL                        -9.488e-05  1.978e-05  -4.797 1.61e-06 ***
## KIDSDRIV                         3.484e-01  5.934e-02   5.872 4.31e-09 ***
## MSTATUSz_No                      5.204e-01  7.625e-02   6.826 8.74e-12 ***
## MVR_PTS                          1.121e-01  1.354e-02   8.280  < 2e-16 ***
## OLDCLAIM                         8.477e-05  4.205e-05   2.016 0.043804 *  
## PARENT1Yes                       3.721e-01  1.074e-01   3.466 0.000528 ***
## REVOKEDYes                       7.487e-01  7.949e-02   9.419  < 2e-16 ***
## SEXz_F                          -2.382e-01  8.752e-02  -2.722 0.006485 ** 
## TIF                             -5.478e-02  7.253e-03  -7.553 4.26e-14 ***
## TRAVTIME                         1.498e-02  1.862e-03   8.042 8.80e-16 ***
## URBANICITYz_Highly Rural/ Rural -2.239e+00  1.114e-01 -20.095  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7451.7  on 8137  degrees of freedom
## AIC: 7499.7
## 
## Number of Fisher Scoring iterations: 5

We can convert the coefficients of our model into odds ratios and generate a confidence interval at the 95% confidence level.

exp(cbind("Odds ratio" = coef(mod4), confint.default(mod4, level = 0.95)))

##                                 Odds ratio      2.5 %    97.5 %
## (Intercept)                      0.3021883 0.22514200 0.4056008
## CAR_TYPEPanel Truck              1.1266622 0.87096071 1.4574339
## CAR_TYPEPickup                   1.7852863 1.47712552 2.1577362
## CAR_TYPESports Car               3.4538033 2.73196292 4.3663686
## CAR_TYPEVan                      1.4901070 1.18174106 1.8789386
## CAR_TYPEz_SUV                    2.6528289 2.17472761 3.2360379
## CAR_USEPrivate                   0.4489373 0.38951076 0.5174304
## CLM_FREQ                         1.1246531 1.05665567 1.1970263
## EDUCATIONBachelors               0.4703459 0.39071583 0.5662051
## EDUCATIONMasters                 0.4225210 0.34557013 0.5166072
## EDUCATIONPhD                     0.3482503 0.26829161 0.4520390
## EDUCATIONz_High School           0.8798394 0.73632411 1.0513270
## HOMEKIDS                         1.0830169 1.01441227 1.1562614
## HOME_VAL                         0.9999051 0.99986636 0.9999439
## KIDSDRIV                         1.4168694 1.26129310 1.5916356
## MSTATUSz_No                      1.6827732 1.44918932 1.9540067
## MVR_PTS                          1.1186252 1.08933226 1.1487058
## OLDCLAIM                         1.0000848 1.00000235 1.0001672
## PARENT1Yes                       1.4507509 1.17548032 1.7904836
## REVOKEDYes                       2.1142314 1.80921271 2.4706737
## SEXz_F                           0.7880152 0.66380530 0.9354670
## TIF                              0.9466951 0.93333296 0.9602485
## TRAVTIME                         1.0150898 1.01139153 1.0188016
## URBANICITYz_Highly Rural/ Rural  0.1065228 0.08562206 0.1325254

From our odds ratios, we can conclude that given a unit increase in a car being a sports car, a user is 3 times more likely to be involved in a crash. Having an education makes someone less likely to be involved in a crash. Not being married makes someone 1.6 times more likely to be in a crash.

Our logistic regression model yields an accuracy of 0.7872. That is the best performing model we can pick. We need to validate our linear regression model before we deploy on the evaluation data.

Get predicted values vs actual values and compute root mean square error.

## [1] 4751.411

Find relative square error.

## [1] 1.109594

The chosen linear model best minimizes RSME.

We are now in a position to deploy our models on the evaluation data. Deploy our linear model

##          1          2          3          4          5          6 
##  2.0164873  3.4219367  2.0270339         NA  3.4933309  1.7637496 
##          7          8          9         10         11         12 
##  3.4434531  3.3533618  0.8643448  2.4036975         NA  4.3959821 
##         13         14         15         16         17         18 
##  5.9841331  1.0004187  0.4147456  4.5898764  5.2030353  1.7597256 
##         19         20         21         22         23         24 
##  4.5924821  2.6904951  2.0228655  3.0288925  1.5836624  2.7983778 
##         25         26         27         28         29         30 
##  2.3394212  3.5325827  3.1600252  4.5128966  1.8515062  1.8302445 
##         31         32         33         34         35         36 
##  2.0753815  3.1451669  1.0976435  2.1666514  1.4640885         NA 
##         37         38         39         40         41         42 
##  1.9992239  2.1946236         NA  4.3450274  2.7032690  4.3297055 
##         43         44         45         46         47         48 
## -0.4526448  4.1708667 -2.1090651  2.6186578  1.4956471  3.2150948 
##         49         50         51         52         53         54 
## -0.7632353  4.3670731  1.4922485  3.2093304  5.5354579  1.0466093 
##         55         56         57         58         59         60 
##  3.7604082  2.5710424  2.9156530  3.2170713  1.0385135  4.0356455 
##         61         62         63         64         65         66 
## -1.2557860  1.2077085  2.6415229  2.0675694  0.4293705         NA 
##         67         68         69         70         71         72 
##  5.5118835  4.5515622  1.8418161  1.4485161 -0.1099099  1.9211665 
##         73         74         75         76         77         78 
##  4.8471779  2.2589293  5.1283445  3.0938316  3.4779705  2.1981171 
##         79         80         81         82         83         84 
##  1.6231959  0.6705303  3.8846945  2.8691952  2.9366777  0.8310662 
##         85         86         87         88         89         90 
##  3.5597005  4.2613771  2.7136331  2.9232136 -0.1523340  5.6311124 
##         91         92         93         94         95         96 
##  1.0259128  1.2735551  0.8046697  2.3227922  0.7327980  2.6377325 
##         97         98         99        100 
## -1.5968943  3.2152100  3.0867555  2.6955790

Deploy the logistic regression model

##          1          2          3          4          5          6 
## 0.21859864 0.38829741 0.13316284 0.23055757 0.43513746 0.18411279 
##          7          8          9         10         11         12 
## 0.45438997 0.41101491 0.05049800 0.25272133 0.04414184 0.53631040 
##         13         14         15         16         17         18 
## 0.85647953 0.09356372 0.03851994 0.64799931 0.74097390 0.16656608 
##         19         20 
## 0.66786199 0.32327326

Consolidate and visualize predicted probabilities and predicted amount

## Error in eval(varsRHS[[1]], data, env): object 'test_results2b' not found

Our logisitc regression model had reasonable performance but our linear model had poor predictive power. Rather than using a linear model, I would consider using a different algorithm such as robust regression or smoothing splines. Addative regression might perform better in this scenerio.

Appendix)

crash_training <- read.csv(url("https://raw.githubusercontent.com/vindication09/DATA-621-Week-4/master/insurance_training_data.csv"), header =TRUE)

crash_evaluation <- read.csv(url("https://raw.githubusercontent.com/vindication09/DATA-621-Week-4/master/insurance-evaluation-data.csv"), header =TRUE)

head(crash_training, 10)
nrow(crash_training);ncol(crash_training);names(crash_training)
table(crash_training$TARGET_FLAG);

barplot(table(crash_training$TARGET_FLAG), ylim=c(0, 7000), xlab="Result", ylab="N", col="black")
x <- crash_training$TARGET_AMT
h<-hist(x, breaks=10, col="red", xlab="Target AMT", 
    main="Histogram with Normal Curve", ylim=range(0:1200)) 
xfit<-seq(min(x),max(x),length=40) 
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
yfit <- yfit*diff(h$mids[1:2])*length(x) 
lines(xfit, yfit, col="blue", lwd=2);

x <- log(crash_training$TARGET_AMT)
h<-hist(x, breaks=10, col="red", xlab="Target AMT", 
    main="Histogram with Normal Curve After Log Transform", ylim=range(0:1200)) 
#xfit<-seq(min(x),max(x),length=40) 
#yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
#yfit <- yfit*diff(h$mids[1:2])*length(x) 
#lines(x, yfit, col="blue", lwd=2)
summary(crash_training)
barplot(table(crash_training$AGE), ylim=c(0, 500), xlab="AGE", ylab="N", col="black");
#barplot(table(crash_training$BLUEBOOK), ylim=c(0, 100), xlab="BLUEBOOK", ylab="N", col="black");
barplot(table(crash_training$CAR_AGE), ylim=c(0, 1000), xlab="CAR_AGE", ylab="N", col="black");
barplot(table(crash_training$CAR_TYPE), ylim=c(0, 5000), xlab="CAR_TYPE", ylab="N", col="black");
barplot(table(crash_training$CAR_USE), ylim=c(0, 5000), xlab="CAR_USE", ylab="N", col="black");
barplot(table(crash_training$CLM_FREQ), ylim=c(0, 5000), xlab="CLM_FREQ", ylab="N", col="black");
barplot(table(crash_training$EDUCATION), ylim=c(0, 5000), xlab="EDUCATION", ylab="N", col="black");
barplot(table(crash_training$HOMEKIDS), ylim=c(0, 7000), xlab="HOMEKIDS", ylab="N", col="black");
#barplot(table(crash_training$HOME_VAL), ylim=c(0, 500), xlab="HOME_VAL", ylab="N", col="black");
#barplot(table(crash_training$INCOME), ylim=c(0, 5000), xlab="INCOME", ylab="N", col="black");
barplot(table(crash_training$JOB), ylim=c(0, 5000), xlab="JOB", ylab="N", col="black");
barplot(table(crash_training$KIDSDRIV), ylim=c(0, 7000), xlab="KIDSDRIV", ylab="N", col="black");
barplot(table(crash_training$MSTATUS), ylim=c(0, 5000), xlab="MSTATUS", ylab="N", col="black");
barplot(table(crash_training$MVR_PTS), ylim=c(0, 4000), xlab="MVR_PTS", ylab="N", col="black");
#barplot(table(crash_training$OLDCLAIM), ylim=c(0, 5), xlab="OLDCLAIM", ylab="N", col="black");
barplot(table(crash_training$PARENT1), ylim=c(0, 5000), xlab="PARENT 1", ylab="N", col="black");
barplot(table(crash_training$RED_CAR), ylim=c(0, 5000), xlab="RED_CAR", ylab="N", col="black");
barplot(table(crash_training$REVOKED), ylim=c(0, 5000), xlab="REVOKED", ylab="N", col="black");
barplot(table(crash_training$SEX), ylim=c(0, 5000), xlab="SEX", ylab="N", col="black");
barplot(table(crash_training$TIF), ylim=c(0, 5000), xlab="TIF", ylab="N", col="black");
barplot(table(crash_training$TRAVTIME), ylim=c(0, 400), xlab="TRAV TIME", ylab="N", col="black");
barplot(table(crash_training$URBANICITY), ylim=c(0, 8000), xlab="URBAN CITY", ylab="N", col="black");
barplot(table(crash_training$YOJ), ylim=c(0, 2000), xlab="YOJ", ylab="N", col="black")


library(ggplot2)
library(tidyr)

crash_training_plot_subset <- subset(crash_training, select = c(BLUEBOOK, HOME_VAL,INCOME, OLDCLAIM))
crash_training_plot_subset$BLUEBOOK<-as.numeric(crash_training_plot_subset$BLUEBOOK)
crash_training_plot_subset$HOME_VAL<-as.numeric(crash_training_plot_subset$HOME_VAL)
crash_training_plot_subset$INCOME<-as.numeric(crash_training_plot_subset$INCOME)
crash_training_plot_subset$OLDCLAIM<-as.numeric(crash_training_plot_subset$OLDCLAIM)


ggplot(gather(crash_training_plot_subset), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x') ;summary(crash_training_plot_subset)
crash_training2<-crash_training
crash_training2<-subset(crash_training, select=-c(INDEX))
crash_evaluation2<-subset(crash_evaluation, select=-c(INDEX))

crash_train_cor<-subset(crash_training2, select=c(TARGET_FLAG, TARGET_AMT, KIDSDRIV, AGE, HOMEKIDS, YOJ, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, OLDCLAIM, CLM_FREQ, MVR_PTS, CAR_AGE))

crash_train_cor$TARGET_AMT<-as.numeric(crash_train_cor$TARGET_AMT)
crash_train_cor$TARGET_FLAG<-as.numeric(crash_train_cor$TARGET_FLAG)
crash_train_cor$KIDSDRIV<-as.numeric(crash_train_cor$KIDSDRIV)
crash_train_cor$AGE<-as.numeric(crash_train_cor$AGE)
crash_train_cor$HOMEKIDS<-as.numeric(crash_train_cor$HOMEKIDS)
crash_train_cor$YOJ<-as.numeric(crash_train_cor$YOJ)
crash_train_cor$HOME_VAL<-as.numeric(crash_train_cor$HOME_VAL)
crash_train_cor$TRAVTIME<-as.numeric(crash_train_cor$TRAVTIME)
crash_train_cor$BLUEBOOK<-as.numeric(crash_train_cor$BLUEBOOK)
crash_train_cor$TIF<-as.numeric(crash_train_cor$TIF)
crash_train_cor$OLDCLAIM<-as.numeric(crash_train_cor$OLDCLAIM)
crash_train_cor$CLM_FREQ<-as.numeric(crash_train_cor$CLM_FREQ)
crash_train_cor$MVR_PTS<-as.numeric(crash_train_cor$MVR_PTS)
crash_train_cor$CAR_AGE<-as.numeric(crash_train_cor$CAR_AGE)


apply(crash_train_cor,2,  function(col)cor(col, crash_training$TARGET_AMT))
apply(crash_train_cor,2,  function(col)cor(col, crash_training$TARGET_FLAG))
#correlation matrix and visualization 
correlation_matrix <- round(cor(crash_train_cor),2)

# Get lower triangle of the correlation matrix
  get_lower_tri<-function(correlation_matrix){
    correlation_matrix[upper.tri(correlation_matrix)] <- NA
    return(correlation_matrix)
  }
  # Get upper triangle of the correlation matrix
  get_upper_tri <- function(correlation_matrix){
    correlation_matrix[lower.tri(correlation_matrix)]<- NA
    return(correlation_matrix)
  }
  
  upper_tri <- get_upper_tri(correlation_matrix)



library(reshape2)

# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)

# Heatmap
library(ggplot2)

ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 15, hjust = 1))+
 coord_fixed()


#add nice labels 
ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  axis.text.x=element_text(size=rel(0.8), angle=90),
  axis.text.y=element_text(size=rel(0.8)),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwicrash_training2h = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))

#missing data 
colSums(is.na(crash_training2)) #;colSums(is.na(crash_evaluation))
colSums(crash_training2 == 0)
outlierKD<-function(crash_training2, var) {
     var_name <- eval(substitute(var),eval(crash_training2))
     na1 <- sum(is.na(var_name))
     m1 <- mean(var_name, na.rm = T)
     par(mfrow=c(2, 2), oma=c(0,0,3,0))
     boxplot(var_name, main="With outliers")
     hist(var_name, main="With outliers", xlab=NA, ylab=NA)
     outlier <- boxplot.stats(var_name)$out
     mo <- mean(outlier)
     var_name <- ifelse(var_name %in% outlier, NA, var_name)
     boxplot(var_name, main="Without outliers")
     hist(var_name, main="Without outliers", xlab=NA, ylab=NA)
     title("Outlier Check", outer=TRUE)
     na2 <- sum(is.na(var_name))
     cat("Outliers identified:", na2 - na1, "n")
     cat("Propotion (%) of outliers:", round((na2 - na1) / sum(!is.na(var_name))*100, 1), "n")
     cat("Mean of the outliers:", round(mo, 2), "n")
     m2 <- mean(var_name, na.rm = T)
     cat("Mean without removing outliers:", round(m1, 2), "n")
     cat("Mean if we remove outliers:", round(m2, 2), "n")
     response <- readline(prompt="Do you want to remove outliers and to replace with NA? [yes/no]: ")
     if(response == "y" | response == "yes"){
          crash_training2[as.character(substitute(var))] <- invisible(var_name)
          assign(as.character(as.list(match.call())$crash_training2), crash_training2, envir = .GlobalEnv)
          cat("Outliers successfully removed", "n")
          return(invisible(crash_training2))
     } else{
          cat("Nothing changed", "n")
          return(invisible(var_name))
     }
}


outlierKD(crash_training2, CAR_AGE)
colSums(is.na(crash_training2)) #;colSums(is.na(crash_evaluation))
library(Hmisc)

crash_training3<-crash_training2

impute(crash_training3$AGE, median)
impute(crash_training3$CAR_AGE, median)

crash_training3[is.na(crash_training3)] <- 0

colSums(is.na(crash_training3)) 

crash_evaluation3<-crash_evaluation2
impute(crash_evaluation3$AGE, median)
impute(crash_evaluation3$CAR_AGE, median)

crash_evaluation3[is.na(crash_evaluation3)] <- 0

crash_training3$CAR_AGE<-abs(crash_training3$CAR_AGE)


summary(crash_training3);summary(crash_training2)
crash_training3$TARGET_AMT<-as.numeric(crash_training3$TARGET_AMT)
crash_training3$TARGET_FLAG<-as.numeric(crash_training3$TARGET_FLAG)
crash_training3$KIDSDRIV<-as.numeric(crash_training3$KIDSDRIV)
crash_training3$AGE<-as.numeric(crash_training3$AGE)
crash_training3$HOMEKIDS<-as.numeric(crash_training3$HOMEKIDS)
crash_training3$YOJ<-as.numeric(crash_training3$YOJ)
crash_training3$HOME_VAL<-as.numeric(crash_training3$HOME_VAL)
crash_training3$TRAVTIME<-as.numeric(crash_training3$TRAVTIME)
crash_training3$BLUEBOOK<-as.numeric(crash_training3$BLUEBOOK)
crash_training3$TIF<-as.numeric(crash_training3$TIF)
crash_training3$OLDCLAIM<-as.numeric(crash_training3$OLDCLAIM)
crash_training3$CLM_FREQ<-as.numeric(crash_training3$CLM_FREQ)
crash_training3$MVR_PTS<-as.numeric(crash_training3$MVR_PTS)
crash_training3$CAR_AGE<-as.numeric(crash_training3$CAR_AGE)
crash_training3$INCOME<-as.numeric(crash_training3$INCOME)

crash_training3a<-crash_training3

crash_evaluation2$TARGET_AMT<-as.numeric(crash_evaluation2$TARGET_AMT)
crash_evaluation2$TARGET_FLAG<-as.numeric(crash_evaluation2$TARGET_FLAG)
crash_evaluation2$KIDSDRIV<-as.numeric(crash_evaluation2$KIDSDRIV)
crash_evaluation2$AGE<-as.numeric(crash_evaluation2$AGE)
crash_evaluation2$HOMEKIDS<-as.numeric(crash_evaluation2$HOMEKIDS)
crash_evaluation2$YOJ<-as.numeric(crash_evaluation2$YOJ)
crash_evaluation2$HOME_VAL<-as.numeric(crash_evaluation2$HOME_VAL)
crash_evaluation2$TRAVTIME<-as.numeric(crash_evaluation2$TRAVTIME)
crash_evaluation2$BLUEBOOK<-as.numeric(crash_evaluation2$BLUEBOOK)
crash_evaluation2$TIF<-as.numeric(crash_evaluation2$TIF)
crash_evaluation2$OLDCLAIM<-as.numeric(crash_evaluation2$OLDCLAIM)
crash_evaluation2$CLM_FREQ<-as.numeric(crash_evaluation2$CLM_FREQ)
crash_evaluation2$MVR_PTS<-as.numeric(crash_evaluation2$MVR_PTS)
crash_evaluation2$CAR_AGE<-as.numeric(crash_evaluation2$CAR_AGE)
crash_evaluation2$INCOME<-as.numeric(crash_evaluation2$INCOME)

crash_evaluation2a<-crash_evaluation2


library(dplyr)
crash_training3 %>% 
     sapply(levels)


crash_training3$CLERICAL <- ifelse(crash_training3$JOB == "Clerical", 1, 0)
crash_training3$DOCTOR <- ifelse(crash_training3$JOB == "Doctor", 1, 0)
crash_training3$HOME_MAKER <- ifelse(crash_training3$JOB == "Home Maker", 1, 0)
crash_training3$LAWYER <- ifelse(crash_training3$JOB == "Lawyer", 1, 0)
crash_training3$MANAGER <- ifelse(crash_training3$JOB == "manager", 1, 0)
crash_training3$PROF <- ifelse(crash_training3$JOB == "Professional", 1, 0)
crash_training3$STUDENT <- ifelse(crash_training3$JOB == "Student", 1, 0)
crash_training3$BLUE_COLLAR <- ifelse(crash_training3$JOB == "z_Blue Collar", 1, 0)

crash_training3$MINIVAN <- ifelse(crash_training3$CAR_TYPE == "minivan", 1, 0)
crash_training3$TRUCK <- ifelse(crash_training3$CAR_TYPE == "Panel Truck", 1, 0)
crash_training3$PICKUP <- ifelse(crash_training3$CAR_TYPE == "Pickup", 1, 0)
crash_training3$SPORTS <- ifelse(crash_training3$CAR_TYPE == "Sports Car", 1, 0)
crash_training3$VAN <- ifelse(crash_training3$CAR_TYPE == "Van", 1, 0)
crash_training3$SUV <- ifelse(crash_training3$CAR_TYPE == "z_SUV", 1, 0)

#do it for evaluation data as well 
crash_evaluation3$CLERICAL <- ifelse(crash_evaluation3$JOB == "Clerical", 1, 0)
crash_evaluation3$DOCTOR <- ifelse(crash_evaluation3$JOB == "Doctor", 1, 0)
crash_evaluation3$HOME_MAKER <- ifelse(crash_evaluation3$JOB == "Home Maker", 1, 0)
crash_evaluation3$LAWYER <- ifelse(crash_evaluation3$JOB == "Lawyer", 1, 0)
crash_evaluation3$MANAGER <- ifelse(crash_evaluation3$JOB == "Manager", 1, 0)
crash_evaluation3$PROF <- ifelse(crash_evaluation3$JOB == "Professional", 1, 0)
crash_evaluation3$STUDENT <- ifelse(crash_evaluation3$JOB == "Student", 1, 0)
crash_evaluation3$BLUE_COLLAR <- ifelse(crash_evaluation3$JOB == "z_Blue Collar", 1, 0)

crash_evaluation3$MINIVAN <- ifelse(crash_evaluation3$CAR_TYPE == "Minivan", 1, 0)
crash_evaluation3$TRUCK <- ifelse(crash_evaluation3$CAR_TYPE == "Panel Truck", 1, 0)
crash_evaluation3$PICKUP <- ifelse(crash_evaluation3$CAR_TYPE == "Pickup", 1, 0)
crash_evaluation3$SPORTS <- ifelse(crash_evaluation3$CAR_TYPE == "Sports Car", 1, 0)
crash_evaluation3$VAN <- ifelse(crash_evaluation3$CAR_TYPE == "Van", 1, 0)
crash_evaluation3$SUV <- ifelse(crash_evaluation3$CAR_TYPE == "z_SUV", 1, 0)

crash_training4<-subset(crash_training3, select=-c(CAR_TYPE, JOB))
crash_evaluation4<-subset(crash_evaluation3, select=-c(CAR_TYPE, JOB))
library(tidyverse)
library(dplyr)
library(broom)
library(car)


mod2<-glm(TARGET_FLAG~AGE+BLUEBOOK+CAR_AGE+CAR_TYPE+CAR_USE+CLM_FREQ+EDUCATION+HOMEKIDS+HOME_VAL+INCOME+JOB+KIDSDRIV+MSTATUS+MVR_PTS+OLDCLAIM+PARENT1+RED_CAR+REVOKED+SEX+TIF+TRAVTIME+URBANICITY+YOJ, family="binomial",  data=crash_training3a)
summary(mod2);


plot(mod2, which = 4, id.n = 3);

mod2.data <- augment(mod2) %>% 
  mutate(index = 1:n()) ;

car::vif(mod2);

mod2.data %>% top_n(3, .cooksd);

ggplot(mod2.data, aes(index, .std.resid)) + 
  geom_point(aes(color = TARGET_FLAG), alpha = .5) +
  theme_bw();

mod2.data %>% 
  filter(abs(.std.resid) > 3);

library(pscl)
pR2(mod2)
mod3<-glm(TARGET_FLAG~CAR_TYPE+CAR_USE+CLM_FREQ+EDUCATION+HOMEKIDS+HOME_VAL+KIDSDRIV+MSTATUS+MVR_PTS+OLDCLAIM+PARENT1+REVOKED+SEX+TIF+TRAVTIME+URBANICITY, family="binomial",  data=crash_training3a)
summary(mod3);


plot(mod3, which = 4, id.n = 3);

mod3.data <- augment(mod3) %>% 
  mutate(index = 1:n()) ;

car::vif(mod3);

mod3.data %>% top_n(3, .cooksd);

ggplot(mod3.data, aes(index, .std.resid)) + 
  geom_point(aes(color = TARGET_FLAG), alpha = .5) +
  theme_bw();

mod3.data %>% 
  filter(abs(.std.resid) > 3);

pR2(mod3)
library(MASS)
step<-stepAIC(mod3, trace=FALSE)
step$anova
mod4<-glm(TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + HOMEKIDS + 
    HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 + 
    REVOKED + SEX + TIF + TRAVTIME + URBANICITY, family="binomial",  data=crash_training3a)
summary(mod4);


plot(mod4, which = 4, id.n = 3);

mod4.data <- augment(mod4) %>% 
  mutate(index = 1:n()) ;

car::vif(mod4);

mod4.data %>% top_n(3, .cooksd);

ggplot(mod4.data, aes(index, .std.resid)) + 
  geom_point(aes(color = TARGET_FLAG), alpha = .5) +
  theme_bw();

mod4.data %>% 
  filter(abs(.std.resid) > 3);

pR2(mod4)
library(olsrr)
lmod2<-lm(log(TARGET_AMT+1)~AGE+BLUEBOOK+CAR_AGE+CAR_TYPE+CAR_USE+CLM_FREQ+EDUCATION+HOMEKIDS+HOME_VAL+INCOME+JOB+KIDSDRIV+MSTATUS+MVR_PTS+OLDCLAIM+PARENT1+RED_CAR+REVOKED+SEX+TIF+TRAVTIME+URBANICITY,data=crash_training3a)
#summary(lmod2);

summary(lmod2);
par(mfrow=c(2,2))
plot(lmod2)
hist(resid(lmod2), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod2);
vif(lmod2)
lmod3<-lm(log(TARGET_AMT+1)~CAR_TYPE+CAR_USE+CLM_FREQ+HOME_VAL+KIDSDRIV+MSTATUS+MVR_PTS+OLDCLAIM+PARENT1+REVOKED+SEX+TIF+TRAVTIME+URBANICITY+YOJ,data=crash_training3a)
#summary(lmod2);

summary(lmod3);
par(mfrow=c(2,2))
plot(lmod3)
hist(resid(lmod3), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod3);
vif(lmod3)
#library(MASS)
lstep<-stepAIC(lmod2, trace=FALSE)
lstep$anova
lmod4<-lm(log(TARGET_AMT + 1) ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + 
    HOMEKIDS + HOME_VAL + JOB + KIDSDRIV + MSTATUS + MVR_PTS + 
    OLDCLAIM + PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY + 
    YOJ,data=crash_training3a)
#summary(lmod2);

summary(lmod4);
par(mfrow=c(2,2))
plot(lmod4)
hist(resid(lmod4), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod4);
vif(lmod4)
lmod5<-lm(log(TARGET_AMT+1) ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + 
    HOMEKIDS + HOME_VAL  + KIDSDRIV + MSTATUS + MVR_PTS + 
    OLDCLAIM + PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY + 
    YOJ,data=crash_training3a)
#summary(lmod2);

summary(lmod5);
par(mfrow=c(2,2))
plot(lmod5)
hist(resid(lmod5), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod5);
vif(lmod5)
#The evaluation data does not contain the target response variable. To test predictive power, we need to partition our training data into a test and evaluation 
library(caret)
Train <- createDataPartition(crash_training3a$TARGET_FLAG, p=0.7, list=FALSE)
train <- crash_training3a[Train, ]
test <- crash_training3a[-Train, ]


#probabilities of prediction 
pred <- predict(mod4, newdata=test, type="response", na.action = na.pass)
scored.probability<-data.frame(as.double(pred))


#probability threshold 
pred_filter <- ifelse(pred > 0.5, 1, 0)
scored.class<-data.frame(as.integer(pred_filter))

#subset actuals
class<-data.frame(subset(test, select=c(TARGET_FLAG)))
class<-as.integer(class$TARGET_FLAG)

#library(plyr)
classes.df<-data.frame(class, scored.class, scored.probability)
str(classes.df)
names(classes.df)
library(pROC)
library(caret)

names(classes.df)<-c("class", "scored.class", "scored.probability")

matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])
matrix.df2
#confusion matrix
caret_matrix <- confusionMatrix(matrix.df2)
#Information from Confusion matrix
caret_matrix;

f1_reduced<- function(df)
  {
  TP <- sum(classes.df$class == 1 & classes.df$scored.class == 1) 
  Fn <- sum(classes.df$class == 1 & classes.df$scored.class == 0)
  FP <- sum(classes.df$class == 0 & classes.df$scored.class == 1)
  (2*TP)/(2*TP + Fn + FP)
}
f1_reduced(classes.df);

plot(roc(classes.df$class, classes.df$scored.probability), main="ROC Curve");

auc(roc(classes.df$class, classes.df$scored.probability))
summary(mod4)
exp(cbind("Odds ratio" = coef(mod4), confint.default(mod4, level = 0.95)))
lpred <- data.frame(predict(lmod5, newdata=test, type = "response"))
actual <- subset(test, select=c(TARGET_AMT))
l_testing<-data.frame(lpred, actual)
l_testing$rmse <- (mean((l_testing$lpred - l_testing$actual)^2))^0.5
rmse<-(mean((lpred - actual)^2))^0.5
rmse
mu <- mean(actual$TARGET_AMT)
rse <- mean((lpred - actual)^2) / mean((mu - actual)^2)
rse
test_results <- predict(lmod5, newdata = crash_evaluation2)
head(test_results, 100);
test_results2<-predict(lmod5, newdata=crash_evaluation2, type = "response")
#table(test_results)
#predict(mod5, newdata=moneyball_evaluation3)
#plot(test_results);
#predict(lmod5,newdata=crash_evaluation2, interval='prediction')
test_resultsb <- predict(mod4, newdata = crash_evaluation2, type="response")
head(test_resultsb, 20)
predicted_amt<-data.frame(test_results2)
predicted_prb<-data.frame(test_resultsb)
production<-data.frame(predicted_amt,predicted_prb )


xyplot(test_results2 ~ test_results2b, data = production,
  xlab = "Predicted Probabilities",
  ylab = "Predicted Amount",
  main = "Predicted Cost vs. Predicted Probability of Getting Into a Car Crash")

crash_training <- read.csv(url("https://raw.githubusercontent.com/vindication09/DATA-621-Week-4/master/insurance_training_data.csv"), header =TRUE)

crash_evaluation <- read.csv(url("https://raw.githubusercontent.com/vindication09/DATA-621-Week-4/master/insurance-evaluation-data.csv"), header =TRUE)

head(crash_training, 10)



nrow(crash_training);ncol(crash_training);names(crash_training)


table(crash_training$TARGET_FLAG);

barplot(table(crash_training$TARGET_FLAG), ylim=c(0, 7000), xlab="Result", ylab="N", col="black")




x <- crash_training$TARGET_AMT
h<-hist(x, breaks=10, col="red", xlab="Target AMT", 
    main="Histogram with Normal Curve", ylim=range(0:1200)) 
xfit<-seq(min(x),max(x),length=40) 
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
yfit <- yfit*diff(h$mids[1:2])*length(x) 
lines(xfit, yfit, col="blue", lwd=2);

x <- log(crash_training$TARGET_AMT)
h<-hist(x, breaks=10, col="red", xlab="Target AMT", 
    main="Histogram with Normal Curve After Log Transform", ylim=range(0:1200)) 
#xfit<-seq(min(x),max(x),length=40) 
#yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
#yfit <- yfit*diff(h$mids[1:2])*length(x) 
#lines(x, yfit, col="blue", lwd=2)


summary(crash_training)

barplot(table(crash_training$AGE), ylim=c(0, 500), xlab="AGE", ylab="N", col="black");
#barplot(table(crash_training$BLUEBOOK), ylim=c(0, 100), xlab="BLUEBOOK", ylab="N", col="black");
barplot(table(crash_training$CAR_AGE), ylim=c(0, 1000), xlab="CAR_AGE", ylab="N", col="black");
barplot(table(crash_training$CAR_TYPE), ylim=c(0, 5000), xlab="CAR_TYPE", ylab="N", col="black");
barplot(table(crash_training$CAR_USE), ylim=c(0, 5000), xlab="CAR_USE", ylab="N", col="black");
barplot(table(crash_training$CLM_FREQ), ylim=c(0, 5000), xlab="CLM_FREQ", ylab="N", col="black");
barplot(table(crash_training$EDUCATION), ylim=c(0, 5000), xlab="EDUCATION", ylab="N", col="black");
barplot(table(crash_training$HOMEKIDS), ylim=c(0, 7000), xlab="HOMEKIDS", ylab="N", col="black");
#barplot(table(crash_training$HOME_VAL), ylim=c(0, 500), xlab="HOME_VAL", ylab="N", col="black");
#barplot(table(crash_training$INCOME), ylim=c(0, 5000), xlab="INCOME", ylab="N", col="black");
barplot(table(crash_training$JOB), ylim=c(0, 5000), xlab="JOB", ylab="N", col="black");
barplot(table(crash_training$KIDSDRIV), ylim=c(0, 7000), xlab="KIDSDRIV", ylab="N", col="black");
barplot(table(crash_training$MSTATUS), ylim=c(0, 5000), xlab="MSTATUS", ylab="N", col="black");
barplot(table(crash_training$MVR_PTS), ylim=c(0, 4000), xlab="MVR_PTS", ylab="N", col="black");
#barplot(table(crash_training$OLDCLAIM), ylim=c(0, 5), xlab="OLDCLAIM", ylab="N", col="black");
barplot(table(crash_training$PARENT1), ylim=c(0, 5000), xlab="PARENT 1", ylab="N", col="black");
barplot(table(crash_training$RED_CAR), ylim=c(0, 5000), xlab="RED_CAR", ylab="N", col="black");
barplot(table(crash_training$REVOKED), ylim=c(0, 5000), xlab="REVOKED", ylab="N", col="black");
barplot(table(crash_training$SEX), ylim=c(0, 5000), xlab="SEX", ylab="N", col="black");
barplot(table(crash_training$TIF), ylim=c(0, 5000), xlab="TIF", ylab="N", col="black");
barplot(table(crash_training$TRAVTIME), ylim=c(0, 400), xlab="TRAV TIME", ylab="N", col="black");
barplot(table(crash_training$URBANICITY), ylim=c(0, 8000), xlab="URBAN CITY", ylab="N", col="black");
barplot(table(crash_training$YOJ), ylim=c(0, 2000), xlab="YOJ", ylab="N", col="black")



library(ggplot2)
library(tidyr)

crash_training_plot_subset <- subset(crash_training, select = c(BLUEBOOK, HOME_VAL,INCOME, OLDCLAIM))
crash_training_plot_subset$BLUEBOOK<-as.numeric(crash_training_plot_subset$BLUEBOOK)
crash_training_plot_subset$HOME_VAL<-as.numeric(crash_training_plot_subset$HOME_VAL)
crash_training_plot_subset$INCOME<-as.numeric(crash_training_plot_subset$INCOME)
crash_training_plot_subset$OLDCLAIM<-as.numeric(crash_training_plot_subset$OLDCLAIM)


ggplot(gather(crash_training_plot_subset), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x') ;summary(crash_training_plot_subset)

    crash_training2<-crash_training
crash_training2<-subset(crash_training, select=-c(INDEX))
crash_evaluation2<-subset(crash_evaluation, select=-c(INDEX))

crash_train_cor<-subset(crash_training2, select=c(TARGET_FLAG, TARGET_AMT, KIDSDRIV, AGE, HOMEKIDS, YOJ, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, OLDCLAIM, CLM_FREQ, MVR_PTS, CAR_AGE))

crash_train_cor$TARGET_AMT<-as.numeric(crash_train_cor$TARGET_AMT)
crash_train_cor$TARGET_FLAG<-as.numeric(crash_train_cor$TARGET_FLAG)
crash_train_cor$KIDSDRIV<-as.numeric(crash_train_cor$KIDSDRIV)
crash_train_cor$AGE<-as.numeric(crash_train_cor$AGE)
crash_train_cor$HOMEKIDS<-as.numeric(crash_train_cor$HOMEKIDS)
crash_train_cor$YOJ<-as.numeric(crash_train_cor$YOJ)
crash_train_cor$HOME_VAL<-as.numeric(crash_train_cor$HOME_VAL)
crash_train_cor$TRAVTIME<-as.numeric(crash_train_cor$TRAVTIME)
crash_train_cor$BLUEBOOK<-as.numeric(crash_train_cor$BLUEBOOK)
crash_train_cor$TIF<-as.numeric(crash_train_cor$TIF)
crash_train_cor$OLDCLAIM<-as.numeric(crash_train_cor$OLDCLAIM)
crash_train_cor$CLM_FREQ<-as.numeric(crash_train_cor$CLM_FREQ)
crash_train_cor$MVR_PTS<-as.numeric(crash_train_cor$MVR_PTS)
crash_train_cor$CAR_AGE<-as.numeric(crash_train_cor$CAR_AGE)


apply(crash_train_cor,2,  function(col)cor(col, crash_training$TARGET_AMT))



apply(crash_train_cor,2,  function(col)cor(col, crash_training$TARGET_FLAG))



#correlation matrix and visualization 
correlation_matrix <- round(cor(crash_train_cor),2)

# Get lower triangle of the correlation matrix
  get_lower_tri<-function(correlation_matrix){
    correlation_matrix[upper.tri(correlation_matrix)] <- NA
    return(correlation_matrix)
  }
  # Get upper triangle of the correlation matrix
  get_upper_tri <- function(correlation_matrix){
    correlation_matrix[lower.tri(correlation_matrix)]<- NA
    return(correlation_matrix)
  }
  
  upper_tri <- get_upper_tri(correlation_matrix)



library(reshape2)

# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)

# Heatmap
library(ggplot2)

ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 15, hjust = 1))+
 coord_fixed()


#add nice labels 
ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  axis.text.x=element_text(size=rel(0.8), angle=90),
  axis.text.y=element_text(size=rel(0.8)),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwicrash_training2h = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))



  #missing data 
colSums(is.na(crash_training2)) #;colSums(is.na(crash_evaluation))
colSums(crash_training2 == 0)


outlierKD<-function(crash_training2, var) {
     var_name <- eval(substitute(var),eval(crash_training2))
     na1 <- sum(is.na(var_name))
     m1 <- mean(var_name, na.rm = T)
     par(mfrow=c(2, 2), oma=c(0,0,3,0))
     boxplot(var_name, main="With outliers")
     hist(var_name, main="With outliers", xlab=NA, ylab=NA)
     outlier <- boxplot.stats(var_name)$out
     mo <- mean(outlier)
     var_name <- ifelse(var_name %in% outlier, NA, var_name)
     boxplot(var_name, main="Without outliers")
     hist(var_name, main="Without outliers", xlab=NA, ylab=NA)
     title("Outlier Check", outer=TRUE)
     na2 <- sum(is.na(var_name))
     cat("Outliers identified:", na2 - na1, "n")
     cat("Propotion (%) of outliers:", round((na2 - na1) / sum(!is.na(var_name))*100, 1), "n")
     cat("Mean of the outliers:", round(mo, 2), "n")
     m2 <- mean(var_name, na.rm = T)
     cat("Mean without removing outliers:", round(m1, 2), "n")
     cat("Mean if we remove outliers:", round(m2, 2), "n")
     response <- readline(prompt="Do you want to remove outliers and to replace with NA? [yes/no]: ")
     if(response == "y" | response == "yes"){
          crash_training2[as.character(substitute(var))] <- invisible(var_name)
          assign(as.character(as.list(match.call())$crash_training2), crash_training2, envir = .GlobalEnv)
          cat("Outliers successfully removed", "n")
          return(invisible(crash_training2))
     } else{
          cat("Nothing changed", "n")
          return(invisible(var_name))
     }
}



outlierKD(crash_training2, CAR_AGE)


colSums(is.na(crash_training2)) #;colSums(is.na(crash_evaluation))



library(Hmisc)

crash_training3<-crash_training2

impute(crash_training3$AGE, median)
impute(crash_training3$CAR_AGE, median)

crash_training3[is.na(crash_training3)] <- 0

colSums(is.na(crash_training3)) 

crash_evaluation3<-crash_evaluation2
impute(crash_evaluation3$AGE, median)
impute(crash_evaluation3$CAR_AGE, median)

crash_evaluation3[is.na(crash_evaluation3)] <- 0

crash_training3$CAR_AGE<-abs(crash_training3$CAR_AGE)



summary(crash_training3);summary(crash_training2)



crash_training3$TARGET_AMT<-as.numeric(crash_training3$TARGET_AMT)
crash_training3$TARGET_FLAG<-as.numeric(crash_training3$TARGET_FLAG)
crash_training3$KIDSDRIV<-as.numeric(crash_training3$KIDSDRIV)
crash_training3$AGE<-as.numeric(crash_training3$AGE)
crash_training3$HOMEKIDS<-as.numeric(crash_training3$HOMEKIDS)
crash_training3$YOJ<-as.numeric(crash_training3$YOJ)
crash_training3$HOME_VAL<-as.numeric(crash_training3$HOME_VAL)
crash_training3$TRAVTIME<-as.numeric(crash_training3$TRAVTIME)
crash_training3$BLUEBOOK<-as.numeric(crash_training3$BLUEBOOK)
crash_training3$TIF<-as.numeric(crash_training3$TIF)
crash_training3$OLDCLAIM<-as.numeric(crash_training3$OLDCLAIM)
crash_training3$CLM_FREQ<-as.numeric(crash_training3$CLM_FREQ)
crash_training3$MVR_PTS<-as.numeric(crash_training3$MVR_PTS)
crash_training3$CAR_AGE<-as.numeric(crash_training3$CAR_AGE)
crash_training3$INCOME<-as.numeric(crash_training3$INCOME)

crash_training3a<-crash_training3

crash_evaluation2$TARGET_AMT<-as.numeric(crash_evaluation2$TARGET_AMT)
crash_evaluation2$TARGET_FLAG<-as.numeric(crash_evaluation2$TARGET_FLAG)
crash_evaluation2$KIDSDRIV<-as.numeric(crash_evaluation2$KIDSDRIV)
crash_evaluation2$AGE<-as.numeric(crash_evaluation2$AGE)
crash_evaluation2$HOMEKIDS<-as.numeric(crash_evaluation2$HOMEKIDS)
crash_evaluation2$YOJ<-as.numeric(crash_evaluation2$YOJ)
crash_evaluation2$HOME_VAL<-as.numeric(crash_evaluation2$HOME_VAL)
crash_evaluation2$TRAVTIME<-as.numeric(crash_evaluation2$TRAVTIME)
crash_evaluation2$BLUEBOOK<-as.numeric(crash_evaluation2$BLUEBOOK)
crash_evaluation2$TIF<-as.numeric(crash_evaluation2$TIF)
crash_evaluation2$OLDCLAIM<-as.numeric(crash_evaluation2$OLDCLAIM)
crash_evaluation2$CLM_FREQ<-as.numeric(crash_evaluation2$CLM_FREQ)
crash_evaluation2$MVR_PTS<-as.numeric(crash_evaluation2$MVR_PTS)
crash_evaluation2$CAR_AGE<-as.numeric(crash_evaluation2$CAR_AGE)
crash_evaluation2$INCOME<-as.numeric(crash_evaluation2$INCOME)

crash_evaluation2a<-crash_evaluation2


library(dplyr)
crash_training3 %>% 
     sapply(levels)



  #modeling
  mod2<-glm(TARGET_FLAG~AGE+BLUEBOOK+CAR_AGE+CAR_TYPE+CAR_USE+CLM_FREQ+EDUCATION+HOMEKIDS+HOME_VAL+INCOME+JOB+KIDSDRIV+MSTATUS+MVR_PTS+OLDCLAIM+PARENT1+RED_CAR+REVOKED+SEX+TIF+TRAVTIME+URBANICITY+YOJ, family="binomial",  data=crash_training3a)
summary(mod2);


plot(mod2, which = 4, id.n = 3);

mod2.data <- augment(mod2) %>% 
  mutate(index = 1:n()) ;

car::vif(mod2);

mod2.data %>% top_n(3, .cooksd);

ggplot(mod2.data, aes(index, .std.resid)) + 
  geom_point(aes(color = TARGET_FLAG), alpha = .5) +
  theme_bw();

mod2.data %>% 
  filter(abs(.std.resid) > 3);

library(pscl)
pR2(mod2)



mod3<-glm(TARGET_FLAG~CAR_TYPE+CAR_USE+CLM_FREQ+EDUCATION+HOMEKIDS+HOME_VAL+KIDSDRIV+MSTATUS+MVR_PTS+OLDCLAIM+PARENT1+REVOKED+SEX+TIF+TRAVTIME+URBANICITY, family="binomial",  data=crash_training3a)
summary(mod3);


plot(mod3, which = 4, id.n = 3);

mod3.data <- augment(mod3) %>% 
  mutate(index = 1:n()) ;

car::vif(mod3);

mod3.data %>% top_n(3, .cooksd);

ggplot(mod3.data, aes(index, .std.resid)) + 
  geom_point(aes(color = TARGET_FLAG), alpha = .5) +
  theme_bw();

mod3.data %>% 
  filter(abs(.std.resid) > 3);

pR2(mod3)


library(MASS)
step<-stepAIC(mod3, trace=FALSE)
step$anova


mod4<-glm(TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + HOMEKIDS + 
    HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 + 
    REVOKED + SEX + TIF + TRAVTIME + URBANICITY, family="binomial",  data=crash_training3a)
summary(mod4);


plot(mod4, which = 4, id.n = 3);

mod4.data <- augment(mod4) %>% 
  mutate(index = 1:n()) ;

car::vif(mod4);

mod4.data %>% top_n(3, .cooksd);

ggplot(mod4.data, aes(index, .std.resid)) + 
  geom_point(aes(color = TARGET_FLAG), alpha = .5) +
  theme_bw();

mod4.data %>% 
  filter(abs(.std.resid) > 3);

pR2(mod4)



library(olsrr)
lmod2<-lm(log(TARGET_AMT+1)~AGE+BLUEBOOK+CAR_AGE+CAR_TYPE+CAR_USE+CLM_FREQ+EDUCATION+HOMEKIDS+HOME_VAL+INCOME+JOB+KIDSDRIV+MSTATUS+MVR_PTS+OLDCLAIM+PARENT1+RED_CAR+REVOKED+SEX+TIF+TRAVTIME+URBANICITY,data=crash_training3a)
#summary(lmod2);

summary(lmod2);
par(mfrow=c(2,2))
plot(lmod2)
hist(resid(lmod2), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod2);
vif(lmod2)



lmod3<-lm(log(TARGET_AMT+1)~CAR_TYPE+CAR_USE+CLM_FREQ+HOME_VAL+KIDSDRIV+MSTATUS+MVR_PTS+OLDCLAIM+PARENT1+REVOKED+SEX+TIF+TRAVTIME+URBANICITY+YOJ,data=crash_training3a)
#summary(lmod2);

summary(lmod3);
par(mfrow=c(2,2))
plot(lmod3)
hist(resid(lmod3), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod3);
vif(lmod3)



#library(MASS)
lstep<-stepAIC(lmod2, trace=FALSE)
lstep$anova


lmod4<-lm(log(TARGET_AMT + 1) ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + 
    HOMEKIDS + HOME_VAL + JOB + KIDSDRIV + MSTATUS + MVR_PTS + 
    OLDCLAIM + PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY + 
    YOJ,data=crash_training3a)
#summary(lmod2);

summary(lmod4);
par(mfrow=c(2,2))
plot(lmod4)
hist(resid(lmod4), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod4);
vif(lmod4)



lmod5<-lm(log(TARGET_AMT+1) ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + 
    HOMEKIDS + HOME_VAL  + KIDSDRIV + MSTATUS + MVR_PTS + 
    OLDCLAIM + PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY + 
    YOJ,data=crash_training3a)
#summary(lmod2);

summary(lmod5);
par(mfrow=c(2,2))
plot(lmod5)
hist(resid(lmod5), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod5);
vif(lmod5)


#model selection 
#The evaluation data does not contain the target response variable. To test predictive power, we need to partition our training data into a test and evaluation 
library(caret)
Train <- createDataPartition(crash_training3a$TARGET_FLAG, p=0.7, list=FALSE)
train <- crash_training3a[Train, ]
test <- crash_training3a[-Train, ]


#probabilities of prediction 
pred <- predict(mod4, newdata=test, type="response", na.action = na.pass)
scored.probability<-data.frame(as.double(pred))


#probability threshold 
pred_filter <- ifelse(pred > 0.5, 1, 0)
scored.class<-data.frame(as.integer(pred_filter))

#subset actuals
class<-data.frame(subset(test, select=c(TARGET_FLAG)))
class<-as.integer(class$TARGET_FLAG)

#library(plyr)
classes.df<-data.frame(class, scored.class, scored.probability)
str(classes.df)
names(classes.df)


library(pROC)
library(caret)

names(classes.df)<-c("class", "scored.class", "scored.probability")

matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])
matrix.df2


#confusion matrix
caret_matrix <- confusionMatrix(matrix.df2)
#Information from Confusion matrix
caret_matrix;

f1_reduced<- function(df)
  {
  TP <- sum(classes.df$class == 1 & classes.df$scored.class == 1) 
  Fn <- sum(classes.df$class == 1 & classes.df$scored.class == 0)
  FP <- sum(classes.df$class == 0 & classes.df$scored.class == 1)
  (2*TP)/(2*TP + Fn + FP)
}
f1_reduced(classes.df);

plot(roc(classes.df$class, classes.df$scored.probability), main="ROC Curve");

auc(roc(classes.df$class, classes.df$scored.probability))


exp(cbind("Odds ratio" = coef(mod4), confint.default(mod4, level = 0.95)))


lpred <- data.frame(predict(lmod5, newdata=test, type = "response"))
actual <- subset(test, select=c(TARGET_AMT))
l_testing<-data.frame(lpred, actual)
l_testing$rmse <- (mean((l_testing$lpred - l_testing$actual)^2))^0.5
rmse


mu <- mean(actual$TARGET_AMT)
rse <- mean((lpred - actual)^2) / mean((mu - actual)^2)
rse


#prediction
test_results <- predict(lmod5, newdata = crash_evaluation2)
head(test_results, 100);
test_results2<-predict(lmod5, newdata=crash_evaluation2, type = "response")
#table(test_results)
#predict(mod5, newdata=moneyball_evaluation3)
#plot(test_results);
predict(lmod5,newdata=crash_evaluation2, interval='prediction')

test_resultsb <- predict(mod4, newdata = crash_evaluation2, type="response")
head(test_resultsb, 20)



predicted_amt<-data.frame(test_results2)
predicted_prb<-data.frame(test_resultsb)
production<-data.frame(predicted_amt,predicted_prb )


xyplot(test_results2 ~ test_results2b, data = production,
  xlab = "Predicted Probabilities",
  ylab = "Predicted Amount",
  main = "Predicted Cost of Car Crash vs. Predicted Probability of Getting Into a Car Crash")

DATA 621 Week 4

Vinicio Haro

7/7/2018