Our goal is to build a multiple regression model and a logistic regression model that predict the probability of someone crashing their car and the cost if the car crashes. We are given a dataset with two response variables, TARGET_FLAG and TARGET_AMT. TARGET_FLAG is a binary field taking the values 1 = crash or 0 = no crash, while TARGET_AMT is the cost of the crash, given that one occurred. The type of data in each response variable determines the type of model: TARGET_FLAG is the response variable for the logistic regression and TARGET_AMT is the response variable for the multiple regression.
The other variables are defined as follows:
AGE: Age of driver. Very young drivers tend to be risky; perhaps very old drivers as well.
BLUEBOOK: Value of vehicle. Unknown effect on the probability of collision, but likely affects the payout if there is a crash.
CAR_AGE: Vehicle age. Unknown effect on the probability of collision, but likely affects the payout if there is a crash.
CAR_TYPE: Type of car. Unknown effect on the probability of collision, but likely affects the payout if there is a crash.
CAR_USE: Vehicle use. Commercial vehicles are driven more, so this might increase the probability of collision.
CLM_FREQ: Number of claims in the past 5 years. The more claims you filed in the past, the more likely you are to file in the future.
EDUCATION: Maximum education level. Unknown effect, but in theory more educated people tend to drive more safely.
HOMEKIDS: Number of children at home. Unknown effect.
HOME_VAL: Home value. In theory, homeowners tend to drive more responsibly.
INCOME: Income. In theory, rich people tend to get into fewer crashes.
JOB: Job category. In theory, white-collar workers tend to be safer.
KIDSDRIV: Number of driving children. When teenagers drive your car, you are more likely to get into crashes.
MSTATUS: Marital status. In theory, married people drive more safely.
MVR_PTS: Motor vehicle record points. If you get lots of traffic tickets, you tend to get into more crashes.
OLDCLAIM: Total claims in the past 5 years. If your total payout over the past five years was high, future payouts are likely to be high.
PARENT1: Single parent. Unknown effect.
RED_CAR: A red car. Urban legend says that red cars (especially red sports cars) are riskier. Is that true?
REVOKED: License revoked in the past 7 years. If your license was revoked in the past 7 years, you are probably a riskier driver.
SEX: Gender. Urban legend says that women have fewer crashes than men. Is that true?
TIF: Time in force. People who have been customers for a long time are usually safer.
TRAVTIME: Distance to work. Long drives to work usually suggest greater risk.
URBANICITY: Home/work area. Unknown effect.
YOJ: Years on job. People who stay at a job for a long time are usually safer.
Our plan is to pick the multiple regression and logistic regression models that best satisfy this goal. The study is partitioned into four sections: EDA, data preparation, model building, and model selection.
We need to partition the data into two subsets. The first subset contains the TARGET_FLAG response and the predictor variables. The second subset contains TARGET_AMT, restricted to records where TARGET_FLAG = 1, along with the predictor variables.
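For concreteness, a minimal base-R sketch of the two subsets (object names are illustrative; the data frame crash_training is read in below):
flag_data <- subset(crash_training, select = -TARGET_AMT)   # TARGET_FLAG + predictors
amt_data  <- subset(crash_training, TARGET_FLAG == 1,       # crashes only
                    select = -TARGET_FLAG)                  # TARGET_AMT + predictors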
Read in the data
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1
## 1 1 0 0 0 60 0 11 $67,349 No
## 2 2 0 0 0 43 0 11 $91,449 No
## 3 4 0 0 0 35 1 10 $16,039 No
## 4 5 0 0 0 51 0 14 No
## 5 6 0 0 0 50 0 NA $114,986 No
## 6 7 1 2946 0 34 1 12 $125,301 Yes
## 7 8 0 0 0 54 0 NA $18,755 No
## 8 11 1 4021 1 37 2 NA $107,961 No
## 9 12 1 2501 0 34 0 10 $62,978 No
## 10 13 0 0 0 50 0 7 $106,952 No
## HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE
## 1 $0 z_No M PhD Professional 14 Private
## 2 $257,252 z_No M z_High School z_Blue Collar 22 Commercial
## 3 $124,191 Yes z_F z_High School Clerical 5 Private
## 4 $306,251 Yes M <High School z_Blue Collar 32 Private
## 5 $243,925 Yes z_F PhD Doctor 36 Private
## 6 $0 z_No z_F Bachelors z_Blue Collar 46 Commercial
## 7 Yes z_F <High School z_Blue Collar 33 Private
## 8 $333,680 Yes M Bachelors z_Blue Collar 44 Commercial
## 9 $0 z_No z_F Bachelors Clerical 34 Private
## 10 $0 z_No M Bachelors Professional 48 Commercial
## BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1 $14,230 11 Minivan yes $4,461 2 No 3
## 2 $14,940 1 Minivan yes $0 0 No 0
## 3 $4,010 4 z_SUV no $38,690 2 No 3
## 4 $15,440 7 Minivan yes $0 0 No 0
## 5 $18,000 1 z_SUV no $19,217 2 Yes 3
## 6 $17,430 1 Sports Car no $0 0 No 0
## 7 $8,780 1 z_SUV no $0 0 No 0
## 8 $16,970 1 Van yes $2,374 1 Yes 10
## 9 $11,200 1 z_SUV no $0 0 No 0
## 10 $18,510 7 Van no $0 0 No 1
## CAR_AGE URBANICITY
## 1 18 Highly Urban/ Urban
## 2 1 Highly Urban/ Urban
## 3 10 Highly Urban/ Urban
## 4 6 Highly Urban/ Urban
## 5 17 Highly Urban/ Urban
## 6 7 Highly Urban/ Urban
## 7 1 Highly Urban/ Urban
## 8 7 Highly Urban/ Urban
## 9 1 Highly Urban/ Urban
## 10 17 z_Highly Rural/ Rural
How many rows and variables does our data have overall?
## [1] 8161
## [1] 26
## [1] "INDEX" "TARGET_FLAG" "TARGET_AMT" "KIDSDRIV" "AGE"
## [6] "HOMEKIDS" "YOJ" "INCOME" "PARENT1" "HOME_VAL"
## [11] "MSTATUS" "SEX" "EDUCATION" "JOB" "TRAVTIME"
## [16] "CAR_USE" "BLUEBOOK" "TIF" "CAR_TYPE" "RED_CAR"
## [21] "OLDCLAIM" "CLM_FREQ" "REVOKED" "MVR_PTS" "CAR_AGE"
## [26] "URBANICITY"
There are 8,161 rows and 26 variables. The INDEX variable is simply a row number and will not be included in any portion of the analysis from here on out. Each record represents a car owner.
Let's examine our response variables.
TARGET_FLAG
table(crash_training$TARGET_FLAG);
##
## 0 1
## 6008 2153
barplot(table(crash_training$TARGET_FLAG), ylim=c(0, 7000), xlab="Result", ylab="N", col="black")
The data contains roughly 74% non-crashes and 26% crashes.
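The class balance can be verified directly from the table; the proportions below follow from the counts 6,008 and 2,153:
round(prop.table(table(crash_training$TARGET_FLAG)), 3)
##     0     1
## 0.736 0.264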
TARGET_AMT
x <- crash_training$TARGET_AMT
h<-hist(x, breaks=10, col="red", xlab="Target AMT",
main="Histogram with Normal Curve", ylim=range(0:1200))
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2);
x <- log(crash_training$TARGET_AMT + 1)   # +1 avoids log(0) = -Inf for the non-crash records
h<-hist(x, breaks=10, col="red", xlab="Target AMT",
main="Histogram with Normal Curve After Log Transform", ylim=range(0:1200))
#xfit<-seq(min(x),max(x),length=40)
#yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
#yfit <- yfit*diff(h$mids[1:2])*length(x)
#lines(xfit, yfit, col="blue", lwd=2)
Almost immediately, we notice that the raw response shows no resemblance to normality; after a log transform, however, the spread of the response variable gets much closer to normal.
General Data Summaries
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median :0.0000 Median : 0 Median :0.0000
## Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 $0 : 615
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 : 445
## Median :45.00 Median :0.0000 Median :11.0 $26,840 : 4
## Mean :44.79 Mean :0.7212 Mean :10.5 $48,509 : 4
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0 $61,790 : 4
## Max. :81.00 Max. :5.0000 Max. :23.0 $107,375: 3
## NA's :6 NA's :454 (Other) :7086
## PARENT1 HOME_VAL MSTATUS SEX EDUCATION
## No :7084 $0 :2294 Yes :4894 M :3786 <High School :1203
## Yes:1077 : 464 z_No:3267 z_F:4375 Bachelors :2242
## $111,129: 3 Masters :1658
## $115,249: 3 PhD : 728
## $123,109: 3 z_High School:2330
## $153,061: 3
## (Other) :5391
## JOB TRAVTIME CAR_USE BLUEBOOK
## z_Blue Collar:1825 Min. : 5.00 Commercial:3029 $1,500 : 157
## Clerical :1271 1st Qu.: 22.00 Private :5132 $6,000 : 34
## Professional :1117 Median : 33.00 $5,800 : 33
## Manager : 988 Mean : 33.49 $6,200 : 33
## Lawyer : 835 3rd Qu.: 44.00 $6,400 : 31
## Student : 712 Max. :142.00 $5,900 : 30
## (Other) :1413 (Other):7843
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Minivan :2145 no :5783 $0 :5009
## 1st Qu.: 1.000 Panel Truck: 676 yes:2378 $1,310 : 4
## Median : 4.000 Pickup :1389 $1,391 : 4
## Mean : 5.351 Sports Car : 907 $4,263 : 4
## 3rd Qu.: 7.000 Van : 750 $1,105 : 3
## Max. :25.000 z_SUV :2294 $1,332 : 3
## (Other):3134
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 No :7161 Min. : 0.000 Min. :-3.000
## 1st Qu.:0.0000 Yes:1000 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.0000 Median : 1.000 Median : 8.000
## Mean :0.7986 Mean : 1.696 Mean : 8.328
## 3rd Qu.:2.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :5.0000 Max. :13.000 Max. :28.000
## NA's :510
## URBANICITY
## Highly Urban/ Urban :6492
## z_Highly Rural/ Rural:1669
##
##
##
##
##
Let's select the variables that can be reasonably visualized and plot their spreads. We cannot facet-plot all of the remaining variables at once, since attributes are not identical across the measured variables, so we instead produce plots detailing the distributions of the discrete-valued variables.
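A hedged sketch of one way to produce such plots, assuming ggplot2 and tidyr and an illustrative selection of the discrete predictors (the author's plotting chunk is not shown):
library(ggplot2)
library(tidyr)
discrete_vars <- crash_training[, c("KIDSDRIV", "HOMEKIDS", "CLM_FREQ", "MVR_PTS")]
long <- gather(discrete_vars, variable, value)   # long format for faceting
ggplot(long, aes(value)) +
  geom_bar() +
  facet_wrap(~ variable, scales = "free")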
From our visuals, car age is the most alarming variable to observe: it appears to contain negative values, which are impossible for a car's age. These entries may be outliers caused by errors in data collection. Travel time also has a very pronounced outlier, and it does not seem plausible that many drivers would have a commute of just 5 minutes. YOJ stands out as well: it is defined as the number of years a person stays on a job, and there is a visible peak at 0 years. Does that group include people who stayed at their jobs for only a few months?
Some of these variables might be good candidates for dummy coding, especially in the linear regression model. Let's look at the density plots of the variables relating to money. We need to convert them from factor to numeric in order to visualize them.
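A sketch of the coercion implied by the summary below; note that calling as.numeric() on a factor returns its level codes rather than dollar amounts (object names are assumptions):
money_vars  <- c("BLUEBOOK", "HOME_VAL", "INCOME", "OLDCLAIM")
crash_money <- as.data.frame(lapply(crash_training[money_vars], as.numeric))
summary(crash_money)
# to recover the actual dollar values instead, strip the formatting first:
# as.numeric(gsub("[$,]", "", as.character(x)))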
## BLUEBOOK HOME_VAL INCOME OLDCLAIM
## Min. : 1 Min. : 1 Min. : 1 Min. : 1.0
## 1st Qu.: 478 1st Qu.: 2 1st Qu.: 926 1st Qu.: 1.0
## Median :1124 Median :1245 Median :2817 Median : 1.0
## Mean :1284 Mean :1685 Mean :2876 Mean : 552.3
## 3rd Qu.:2234 3rd Qu.:3164 3rd Qu.:4701 3rd Qu.:1015.0
## Max. :2789 Max. :5107 Max. :6613 Max. :2857.0
Out of these four predictors, home value and income have their most frequent value at zero. A zero income could be due to unemployment, and a zero home value could mean the person is not a homeowner.
In order to perform standard Pearson correlation analysis, we should only consider variables with numeric entries. We will make such a subset and then consider some alternatives to see if we can quantify the correlation of the categorical variables.
How does each variable correlate with TARGET_AMT?
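A sketch of the computation (crash_training2 as the working frame is an assumption; with cor()'s default use = "everything", any column containing NAs yields NA, as seen below):
num_vars <- crash_training2[sapply(crash_training2, is.numeric)]
cor(num_vars)[, "TARGET_AMT"]   # correlation of each numeric variable with TARGET_AMT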
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ
## 0.53424606 1.00000000 0.05539418 NA 0.06198804 NA
## HOME_VAL TRAVTIME BLUEBOOK TIF OLDCLAIM CLM_FREQ
## -0.07682458 0.02798702 0.02359552 -0.04648083 0.09714780 0.11641916
## MVR_PTS CAR_AGE
## 0.13786551 NA
The two responses have a strong positive correlation, which is not out of the ordinary: a crash needs to occur before its cost can be recorded. The response also has a positive, though weak, correlation with KIDSDRIV, which is quite interesting. Could it be that teenage drivers are more accident prone?
How does each variable correlate with TARGET_FLAG?
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ
## 1.00000000 0.53424606 0.10366830 NA 0.11562101 NA
## HOME_VAL TRAVTIME BLUEBOOK TIF OLDCLAIM CLM_FREQ
## -0.14857148 0.04836831 0.05044526 -0.08237005 0.19028750 0.21619606
## MVR_PTS CAR_AGE
## 0.21919705 NA
The highest correlation with TARGET_FLAG occurs with MVR_PTS, closely followed by CLM_FREQ. A value of 1 in TARGET_FLAG indicates a crash, and a crash typically results in an insurance claim and adds points to the motor vehicle record.
How do variables correlate with each other?
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
Several variables have minimal correlation with one another, and some, such as TIF, are barely correlated with anything.
Check missing data
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ
## 0 0 0 6 0 454
## INCOME PARENT1 HOME_VAL MSTATUS SEX EDUCATION
## 0 0 0 0 0 0
## JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE
## 0 0 0 0 0 0
## RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 0 0 0 0 0 510
## URBANICITY
## 0
AGE is missing 0.07% of its data, YOJ is missing 5.56%, and CAR_AGE is missing 6.25%.
A missing-data share of 5% is generally accepted as the threshold for removing a variable, but I hesitate to remove YOJ because of its theoretical effect on driving safety. I also do not believe YOJ is suitable for imputation, since it is possible that someone simply has no employment history, in which case there is nothing to record. How many zero entries appear in the data?
colSums(crash_training2 == 0)
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ
## 6008 6008 7180 NA 5289 NA
## INCOME PARENT1 HOME_VAL MSTATUS SEX EDUCATION
## 0 0 0 0 0 0
## JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE
## 0 0 0 0 0 0
## RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 0 0 5009 0 3712 NA
## URBANICITY
## 0
The number of zero entries in these variables is not surprising, given that it is very reasonable for policyholders to have no children or no insurance claims. In most of these variables, a zero value suggests a safe driver, especially considering both response variables.
We skip a broad outlier analysis for several reasons. While some variables showed extreme values in the histograms, those values do not seem out of the ordinary. A prime example is a travel time of 5 minutes: I believe a 5-minute drive is a reasonable commute for the good majority of people who live close to where they work. Car age is the only variable that does not look right, so let's run an outlier analysis on it.
## Outliers identified: 0
## Propotion (%) of outliers: 0
## Mean of the outliers: NaN
## Mean without removing outliers: 8.33
## Mean if we remove outliers: 8.33
## Do you want to remove outliers and to replace with NA? [yes/no]:
## Nothing changed
There does not seem to be a major change in car age with or without outliers.
II) Data Preparation
Several variables, such as SEX or MSTATUS, are prime candidates for dummy coding, but coding them by hand could prove to be a long process. We can instead convert these variables to factors, and the lm and glm functions will dummy-code the categorical variables automatically.
Recall the missing values
colSums(is.na(crash_training2)) #;colSums(is.na(crash_evaluation))
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ
## 0 0 0 6 0 454
## INCOME PARENT1 HOME_VAL MSTATUS SEX EDUCATION
## 0 0 0 0 0 0
## JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE
## 0 0 0 0 0 0
## RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 0 0 0 0 0 510
## URBANICITY
## 0
After giving it more thought, we need to take into consideration that the data represents policyholders. The age of a policyholder should be imputed using the median policyholder age. YOJ is the number of years on a job; it is possible that a policyholder just started a job, or that they are not employed, and certain insurers (Fidelis, for example) still provide low- or no-cost coverage to the unemployed, so the blank YOJ entries can be replaced with zero. Car age had the largest number of missing values as well as a strange negative entry; the negative number could be due to a data-collection error. As for the missing values, we can impute them with the median, since car age should not be blank on a policyholder's record. We apply the same transformations to the evaluation dataset.
Impute missing values
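A minimal sketch of the strategy just described (the author's imputation chunk is hidden, and crash_training2 as the working data frame is an assumption):
crash_training2$AGE[is.na(crash_training2$AGE)] <- median(crash_training2$AGE, na.rm = TRUE)
crash_training2$YOJ[is.na(crash_training2$YOJ)] <- 0        # no employment history recorded
crash_training2$CAR_AGE <- abs(crash_training2$CAR_AGE)     # fix the negative entry
crash_training2$CAR_AGE[is.na(crash_training2$CAR_AGE)] <-
  median(crash_training2$CAR_AGE, na.rm = TRUE)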
Check if there has been any major change in summary after imputing
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE
## Min. :0.0000 Min. : 0 Min. :0.0000 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:39.00
## Median :0.0000 Median : 0 Median :0.0000 Median :45.00
## Mean :0.2638 Mean : 1504 Mean :0.1711 Mean :44.76
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000 3rd Qu.:51.00
## Max. :1.0000 Max. :107586 Max. :4.0000 Max. :81.00
##
## HOMEKIDS YOJ INCOME PARENT1
## Min. :0.0000 Min. : 0.000 $0 : 615 No :7084
## 1st Qu.:0.0000 1st Qu.: 8.000 : 445 Yes:1077
## Median :0.0000 Median :11.000 $26,840 : 4
## Mean :0.7212 Mean : 9.915 $48,509 : 4
## 3rd Qu.:1.0000 3rd Qu.:13.000 $61,790 : 4
## Max. :5.0000 Max. :23.000 $107,375: 3
## (Other) :7086
## HOME_VAL MSTATUS SEX EDUCATION
## $0 :2294 Yes :4894 M :3786 <High School :1203
## : 464 z_No:3267 z_F:4375 Bachelors :2242
## $111,129: 3 Masters :1658
## $115,249: 3 PhD : 728
## $123,109: 3 z_High School:2330
## $153,061: 3
## (Other) :5391
## JOB TRAVTIME CAR_USE BLUEBOOK
## z_Blue Collar:1825 Min. : 5.00 Commercial:3029 $1,500 : 157
## Clerical :1271 1st Qu.: 22.00 Private :5132 $6,000 : 34
## Professional :1117 Median : 33.00 $5,800 : 33
## Manager : 988 Mean : 33.49 $6,200 : 33
## Lawyer : 835 3rd Qu.: 44.00 $6,400 : 31
## Student : 712 Max. :142.00 $5,900 : 30
## (Other) :1413 (Other):7843
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Minivan :2145 no :5783 $0 :5009
## 1st Qu.: 1.000 Panel Truck: 676 yes:2378 $1,310 : 4
## Median : 4.000 Pickup :1389 $1,391 : 4
## Mean : 5.351 Sports Car : 907 $4,263 : 4
## 3rd Qu.: 7.000 Van : 750 $1,105 : 3
## Max. :25.000 z_SUV :2294 $1,332 : 3
## (Other):3134
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 No :7161 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.0000 Yes:1000 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.0000 Median : 1.000 Median : 8.000
## Mean :0.7986 Mean : 1.696 Mean : 7.809
## 3rd Qu.:2.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :5.0000 Max. :13.000 Max. :28.000
##
## URBANICITY
## Highly Urban/ Urban :6492
## z_Highly Rural/ Rural:1669
##
##
##
##
##
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE
## Min. :0.0000 Min. : 0 Min. :0.0000 Min. :16.00
## 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:39.00
## Median :0.0000 Median : 0 Median :0.0000 Median :45.00
## Mean :0.2638 Mean : 1504 Mean :0.1711 Mean :44.79
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000 3rd Qu.:51.00
## Max. :1.0000 Max. :107586 Max. :4.0000 Max. :81.00
## NA's :6
## HOMEKIDS YOJ INCOME PARENT1
## Min. :0.0000 Min. : 0.0 $0 : 615 No :7084
## 1st Qu.:0.0000 1st Qu.: 9.0 : 445 Yes:1077
## Median :0.0000 Median :11.0 $26,840 : 4
## Mean :0.7212 Mean :10.5 $48,509 : 4
## 3rd Qu.:1.0000 3rd Qu.:13.0 $61,790 : 4
## Max. :5.0000 Max. :23.0 $107,375: 3
## NA's :454 (Other) :7086
## HOME_VAL MSTATUS SEX EDUCATION
## $0 :2294 Yes :4894 M :3786 <High School :1203
## : 464 z_No:3267 z_F:4375 Bachelors :2242
## $111,129: 3 Masters :1658
## $115,249: 3 PhD : 728
## $123,109: 3 z_High School:2330
## $153,061: 3
## (Other) :5391
## JOB TRAVTIME CAR_USE BLUEBOOK
## z_Blue Collar:1825 Min. : 5.00 Commercial:3029 $1,500 : 157
## Clerical :1271 1st Qu.: 22.00 Private :5132 $6,000 : 34
## Professional :1117 Median : 33.00 $5,800 : 33
## Manager : 988 Mean : 33.49 $6,200 : 33
## Lawyer : 835 3rd Qu.: 44.00 $6,400 : 31
## Student : 712 Max. :142.00 $5,900 : 30
## (Other) :1413 (Other):7843
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Minivan :2145 no :5783 $0 :5009
## 1st Qu.: 1.000 Panel Truck: 676 yes:2378 $1,310 : 4
## Median : 4.000 Pickup :1389 $1,391 : 4
## Mean : 5.351 Sports Car : 907 $4,263 : 4
## 3rd Qu.: 7.000 Van : 750 $1,105 : 3
## Max. :25.000 z_SUV :2294 $1,332 : 3
## (Other):3134
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 No :7161 Min. : 0.000 Min. :-3.000
## 1st Qu.:0.0000 Yes:1000 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.0000 Median : 1.000 Median : 8.000
## Mean :0.7986 Mean : 1.696 Mean : 8.328
## 3rd Qu.:2.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :5.0000 Max. :13.000 Max. :28.000
## NA's :510
## URBANICITY
## Highly Urban/ Urban :6492
## z_Highly Rural/ Rural:1669
##
##
##
##
##
Imputing the missing data does not appear to change the summary statistics of the imputed variables. We also took the absolute value of car age to fix the negative entry.
We need to make some additional modifications to our data; otherwise modeling will be complicated. We need to be sure that income and similar variables are read in as numeric, and we need to recode categorical variables into dummies. Transforming our data is going to play a key role in optimizing our model building.
Count the number of levels per variable after we treat the income-related variables as numeric.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
##
## src, summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## $TARGET_FLAG
## NULL
##
## $TARGET_AMT
## NULL
##
## $KIDSDRIV
## NULL
##
## $AGE
## NULL
##
## $HOMEKIDS
## NULL
##
## $YOJ
## NULL
##
## $INCOME
## NULL
##
## $PARENT1
## [1] "No" "Yes"
##
## $HOME_VAL
## NULL
##
## $MSTATUS
## [1] "Yes" "z_No"
##
## $SEX
## [1] "M" "z_F"
##
## $EDUCATION
## [1] "<High School" "Bachelors" "Masters" "PhD"
## [5] "z_High School"
##
## $JOB
## [1] "" "Clerical" "Doctor" "Home Maker"
## [5] "Lawyer" "Manager" "Professional" "Student"
## [9] "z_Blue Collar"
##
## $TRAVTIME
## NULL
##
## $CAR_USE
## [1] "Commercial" "Private"
##
## $BLUEBOOK
## NULL
##
## $TIF
## NULL
##
## $CAR_TYPE
## [1] "Minivan" "Panel Truck" "Pickup" "Sports Car" "Van"
## [6] "z_SUV"
##
## $RED_CAR
## [1] "no" "yes"
##
## $OLDCLAIM
## NULL
##
## $CLM_FREQ
## NULL
##
## $REVOKED
## [1] "No" "Yes"
##
## $MVR_PTS
## NULL
##
## $CAR_AGE
## NULL
##
## $URBANICITY
## [1] "Highly Urban/ Urban" "z_Highly Rural/ Rural"
The variables JOB and EDUCATION have more than two levels. These variables are good candidates for recoding as dummy variables, that is, flattening each level into its own 0/1 column; this should optimize the computation time of the glm function, and we can then remove the original many-level factors.
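For reference, a one-line sketch of what the flattening produces: model.matrix() is the routine lm() and glm() use internally to expand a factor into dummy columns (JOB as the example; crash_training2 as the working frame is an assumption):
head(model.matrix(~ JOB, data = crash_training2))   # the first level becomes the baseline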
We seek to build two different models: a linear regression and a binary logistic regression. In the EDA portion we determined that it is best to apply a log(x + 1) transformation to the TARGET_AMT response: the log brings the spread closer to normal, and the +1 keeps the zero values in the field within the domain of the log function.
Full logistic regression, without TARGET_AMT
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 1.4.2 ✔ purrr 0.2.4
## ✔ readr 1.1.1 ✔ stringr 1.3.0
## ✔ tibble 1.4.2 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::src() masks Hmisc::src()
## ✖ dplyr::summarize() masks Hmisc::summarize()
## Warning: package 'broom' was built under R version 3.4.4
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
##
## Attaching package: 'car'
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:dplyr':
##
## recode
##
## Call:
## glm(formula = TARGET_FLAG ~ AGE + BLUEBOOK + CAR_AGE + CAR_TYPE +
## CAR_USE + CLM_FREQ + EDUCATION + HOMEKIDS + HOME_VAL + INCOME +
## JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 +
## RED_CAR + REVOKED + SEX + TIF + TRAVTIME + URBANICITY + YOJ,
## family = "binomial", data = crash_training3a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5911 -0.7179 -0.4046 0.6397 3.1539
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.281e+00 3.157e-01 -4.058 4.95e-05 ***
## AGE -5.488e-03 3.903e-03 -1.406 0.159646
## BLUEBOOK 1.252e-05 3.368e-05 0.372 0.710099
## CAR_AGE -4.202e-03 6.670e-03 -0.630 0.528731
## CAR_TYPEPanel Truck 1.923e-01 1.432e-01 1.342 0.179473
## CAR_TYPEPickup 6.011e-01 1.021e-01 5.888 3.91e-09 ***
## CAR_TYPESports Car 1.228e+00 1.224e-01 10.034 < 2e-16 ***
## CAR_TYPEVan 4.425e-01 1.213e-01 3.649 0.000263 ***
## CAR_TYPEz_SUV 9.648e-01 1.030e-01 9.367 < 2e-16 ***
## CAR_USEPrivate -7.538e-01 9.154e-02 -8.234 < 2e-16 ***
## CLM_FREQ 1.169e-01 3.206e-02 3.645 0.000268 ***
## EDUCATIONBachelors -4.576e-01 1.127e-01 -4.059 4.92e-05 ***
## EDUCATIONMasters -3.846e-01 1.728e-01 -2.226 0.026046 *
## EDUCATIONPhD -4.087e-01 2.023e-01 -2.020 0.043355 *
## EDUCATIONz_High School -1.389e-02 9.476e-02 -0.147 0.883494
## HOMEKIDS 4.702e-02 3.676e-02 1.279 0.200857
## HOME_VAL -8.131e-05 2.023e-05 -4.018 5.86e-05 ***
## INCOME -1.083e-05 1.598e-05 -0.677 0.498091
## JOBClerical 5.463e-01 1.933e-01 2.826 0.004720 **
## JOBDoctor -3.910e-01 2.633e-01 -1.485 0.137519
## JOBHome Maker 6.125e-01 1.987e-01 3.082 0.002057 **
## JOBLawyer 1.825e-01 1.682e-01 1.085 0.277869
## JOBManager -5.113e-01 1.698e-01 -3.011 0.002606 **
## JOBProfessional 2.297e-01 1.769e-01 1.299 0.194069
## JOBStudent 5.219e-01 2.062e-01 2.531 0.011368 *
## JOBz_Blue Collar 3.841e-01 1.841e-01 2.086 0.036977 *
## KIDSDRIV 3.893e-01 6.097e-02 6.386 1.71e-10 ***
## MSTATUSz_No 5.245e-01 7.722e-02 6.793 1.10e-11 ***
## MVR_PTS 1.063e-01 1.365e-02 7.786 6.89e-15 ***
## OLDCLAIM 8.863e-05 4.235e-05 2.093 0.036358 *
## PARENT1Yes 3.715e-01 1.090e-01 3.409 0.000653 ***
## RED_CARyes 1.105e-02 8.618e-02 0.128 0.897954
## REVOKEDYes 7.432e-01 8.030e-02 9.255 < 2e-16 ***
## SEXz_F -2.808e-01 1.030e-01 -2.725 0.006425 **
## TIF -5.484e-02 7.315e-03 -7.496 6.58e-14 ***
## TRAVTIME 1.463e-02 1.878e-03 7.789 6.74e-15 ***
## URBANICITYz_Highly Rural/ Rural -2.355e+00 1.124e-01 -20.940 < 2e-16 ***
## YOJ -1.355e-02 7.088e-03 -1.912 0.055891 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7348.1 on 8123 degrees of freedom
## AIC: 7424.1
##
## Number of Fisher Scoring iterations: 5
## GVIF Df GVIF^(1/(2*Df))
## AGE 1.426799 1 1.194487
## BLUEBOOK 1.124090 1 1.060231
## CAR_AGE 1.785593 1 1.336261
## CAR_TYPE 4.053666 5 1.150230
## CAR_USE 2.458818 1 1.568062
## CLM_FREQ 1.849502 1 1.359964
## EDUCATION 9.531535 4 1.325548
## HOMEKIDS 2.146924 1 1.465238
## HOME_VAL 1.321127 1 1.149403
## INCOME 1.284895 1 1.133532
## JOB 19.406626 8 1.203641
## KIDSDRIV 1.347663 1 1.160889
## MSTATUS 1.768427 1 1.329822
## MVR_PTS 1.182979 1 1.087649
## OLDCLAIM 1.839574 1 1.356309
## PARENT1 1.930994 1 1.389602
## RED_CAR 1.840066 1 1.356490
## REVOKED 1.016893 1 1.008411
## SEX 3.154551 1 1.776106
## TIF 1.010589 1 1.005281
## TRAVTIME 1.038718 1 1.019175
## URBANICITY 1.138952 1 1.067217
## YOJ 1.327868 1 1.152332
## TARGET_FLAG AGE BLUEBOOK CAR_AGE CAR_TYPE CAR_USE CLM_FREQ EDUCATION
## 1 1 51 350 17 Sports Car Private 0 PhD
## 2 1 67 1671 1 z_SUV Private 4 PhD
## 3 1 59 1431 14 Pickup Private 1 PhD
## HOMEKIDS HOME_VAL INCOME JOB KIDSDRIV MSTATUS MVR_PTS OLDCLAIM
## 1 0 3508 124 Professional 0 Yes 0 1
## 2 0 3994 6509 Doctor 0 Yes 5 131
## 3 0 3570 6045 Doctor 0 Yes 2 2585
## PARENT1 RED_CAR REVOKED SEX TIF TRAVTIME URBANICITY YOJ
## 1 No no Yes z_F 4 48 z_Highly Rural/ Rural 12
## 2 No no No M 6 17 Highly Urban/ Urban 13
## 3 No no No z_F 4 53 Highly Urban/ Urban 17
## .fitted .se.fit .resid .hat .sigma .cooksd
## 1 -3.191324 0.2431200 2.542290 0.002242393 0.9507453 0.001441625
## 2 -1.861628 0.2908489 2.003043 0.009848317 0.9509024 0.001700865
## 3 -2.350243 0.2620566 2.209667 0.005457469 0.9508469 0.001522856
## .std.resid index
## 1 2.545145 2338
## 2 2.012980 2768
## 3 2.215721 7741
## TARGET_FLAG AGE BLUEBOOK CAR_AGE CAR_TYPE CAR_USE CLM_FREQ EDUCATION
## 1 1 60 856 10 Minivan Private 0 Bachelors
## HOMEKIDS HOME_VAL INCOME JOB KIDSDRIV MSTATUS MVR_PTS OLDCLAIM
## 1 0 2843 4605 Professional 0 Yes 0 1
## PARENT1 RED_CAR REVOKED SEX TIF TRAVTIME URBANICITY YOJ
## 1 No yes No M 1 35 z_Highly Rural/ Rural 13
## .fitted .se.fit .resid .hat .sigma .cooksd
## 1 -4.966646 0.1769071 3.153915 0.0002150431 0.9505205 0.0008126726
## .std.resid index
## 1 3.154254 5101
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
## llh llhNull G2 McFadden r2ML
## -3674.0444376 -4708.9811460 2069.8734168 0.2197793 0.2240210
## r2CU
## 0.3272130
Several variables can be identified as not significant: AGE, BLUEBOOK, CAR_AGE, HOMEKIDS, INCOME, RED_CAR, and YOJ. Our next model eliminates these predictors but retains the factor levels of the variables we keep. The same variables showed minimal correlation in our heatmap. JOB can also be eliminated because of its high VIF. The full model yields a McFadden pseudo-R-squared of about 0.22.
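The fitting chunk for this reduced model is not shown; judging from the Call printed below and the stepAIC(mod3) reference later on, it was presumably:
mod3 <- glm(TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION +
              HOMEKIDS + HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM +
              PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY,
            family = "binomial", data = crash_training3a)
summary(mod3)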
##
## Call:
## glm(formula = TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION +
## HOMEKIDS + HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM +
## PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY, family = "binomial",
## data = crash_training3a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5886 -0.7290 -0.4235 0.6570 3.0983
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.197e+00 1.502e-01 -7.969 1.60e-15 ***
## CAR_TYPEPanel Truck 1.193e-01 1.313e-01 0.908 0.363860
## CAR_TYPEPickup 5.796e-01 9.668e-02 5.995 2.03e-09 ***
## CAR_TYPESports Car 1.239e+00 1.196e-01 10.362 < 2e-16 ***
## CAR_TYPEVan 3.988e-01 1.183e-01 3.372 0.000747 ***
## CAR_TYPEz_SUV 9.756e-01 1.014e-01 9.622 < 2e-16 ***
## CAR_USEPrivate -8.009e-01 7.245e-02 -11.055 < 2e-16 ***
## CLM_FREQ 1.175e-01 3.182e-02 3.692 0.000223 ***
## EDUCATIONBachelors -7.543e-01 9.464e-02 -7.970 1.58e-15 ***
## EDUCATIONMasters -8.615e-01 1.026e-01 -8.399 < 2e-16 ***
## EDUCATIONPhD -1.055e+00 1.331e-01 -7.926 2.27e-15 ***
## EDUCATIONz_High School -1.280e-01 9.085e-02 -1.409 0.158823
## HOMEKIDS 7.975e-02 3.339e-02 2.389 0.016916 *
## HOME_VAL -9.488e-05 1.978e-05 -4.797 1.61e-06 ***
## KIDSDRIV 3.484e-01 5.934e-02 5.872 4.31e-09 ***
## MSTATUSz_No 5.204e-01 7.625e-02 6.826 8.74e-12 ***
## MVR_PTS 1.121e-01 1.354e-02 8.280 < 2e-16 ***
## OLDCLAIM 8.477e-05 4.205e-05 2.016 0.043804 *
## PARENT1Yes 3.721e-01 1.074e-01 3.466 0.000528 ***
## REVOKEDYes 7.487e-01 7.949e-02 9.419 < 2e-16 ***
## SEXz_F -2.382e-01 8.752e-02 -2.722 0.006485 **
## TIF -5.478e-02 7.253e-03 -7.553 4.26e-14 ***
## TRAVTIME 1.498e-02 1.862e-03 8.042 8.80e-16 ***
## URBANICITYz_Highly Rural/ Rural -2.239e+00 1.114e-01 -20.095 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7451.7 on 8137 degrees of freedom
## AIC: 7499.7
##
## Number of Fisher Scoring iterations: 5
## GVIF Df GVIF^(1/(2*Df))
## CAR_TYPE 3.188769 5 1.122955
## CAR_USE 1.562896 1 1.250158
## CLM_FREQ 1.849893 1 1.360108
## EDUCATION 1.345521 4 1.037794
## HOMEKIDS 1.801076 1 1.342042
## HOME_VAL 1.261352 1 1.123099
## KIDSDRIV 1.290858 1 1.136159
## MSTATUS 1.753165 1 1.324071
## MVR_PTS 1.178627 1 1.085646
## OLDCLAIM 1.839824 1 1.356401
## PARENT1 1.912960 1 1.383098
## REVOKED 1.014748 1 1.007347
## SEX 2.311561 1 1.520382
## TIF 1.007045 1 1.003516
## TRAVTIME 1.034820 1 1.017261
## URBANICITY 1.118487 1 1.057586
## TARGET_FLAG CAR_TYPE CAR_USE CLM_FREQ EDUCATION HOMEKIDS HOME_VAL
## 1 1 Van Private 3 z_High School 3 264
## 2 0 z_SUV Private 1 Bachelors 4 1333
## 3 0 z_SUV Private 3 <High School 3 2
## KIDSDRIV MSTATUS MVR_PTS OLDCLAIM PARENT1 REVOKED SEX TIF TRAVTIME
## 1 0 Yes 0 2331 No No M 1 40
## 2 4 Yes 9 717 No Yes z_F 1 23
## 3 3 z_No 4 2703 Yes No z_F 1 95
## URBANICITY .fitted .se.fit .resid .hat
## 1 z_Highly Rural/ Rural -2.657602 0.2005136 2.334681 0.002461993
## 2 Highly Urban/ Urban 1.797394 0.2473381 -1.975217 0.007460934
## 3 Highly Urban/ Urban 3.314761 0.2276243 -2.588613 0.001753313
## .sigma .cooksd .std.resid index
## 1 0.9566739 0.001470274 2.337561 807
## 2 0.9567724 0.001904079 -1.982627 3773
## 3 0.9565937 0.002017227 -2.590885 5652
## TARGET_FLAG CAR_TYPE CAR_USE CLM_FREQ EDUCATION HOMEKIDS HOME_VAL
## 1 1 Minivan Private 0 Bachelors 0 2843
## KIDSDRIV MSTATUS MVR_PTS OLDCLAIM PARENT1 REVOKED SEX TIF TRAVTIME
## 1 0 Yes 0 1 No No M 1 35
## URBANICITY .fitted .se.fit .resid .hat
## 1 z_Highly Rural/ Rural -4.791511 0.1518637 3.098315 0.0001883017
## .sigma .cooksd .std.resid index
## 1 0.9564081 0.0009456568 3.098606 5101
## llh llhNull G2 McFadden r2ML
## -3725.8668999 -4708.9811460 1966.2284922 0.2087743 0.2141032
## r2CU
## 0.3127267
The second model has nearly the same McFadden pseudo-R-squared as the first (0.209 vs. 0.220) but is less complicated, and its AIC hovers around the same value (7,499.7 vs. 7,424.1).
Build a model using backward stepwise selection.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
step<-stepAIC(mod3, trace=FALSE)
step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + HOMEKIDS +
## HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 +
## REVOKED + SEX + TIF + TRAVTIME + URBANICITY
##
## Final Model:
## TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + HOMEKIDS +
## HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 +
## REVOKED + SEX + TIF + TRAVTIME + URBANICITY
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 8137 7451.734 7499.734
Formulate the model selected by the backward step.
mod4<-glm(TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + HOMEKIDS +
HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 +
REVOKED + SEX + TIF + TRAVTIME + URBANICITY, family="binomial", data=crash_training3a)
summary(mod4);
##
## Call:
## glm(formula = TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION +
## HOMEKIDS + HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM +
## PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY, family = "binomial",
## data = crash_training3a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5886 -0.7290 -0.4235 0.6570 3.0983
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.197e+00 1.502e-01 -7.969 1.60e-15 ***
## CAR_TYPEPanel Truck 1.193e-01 1.313e-01 0.908 0.363860
## CAR_TYPEPickup 5.796e-01 9.668e-02 5.995 2.03e-09 ***
## CAR_TYPESports Car 1.239e+00 1.196e-01 10.362 < 2e-16 ***
## CAR_TYPEVan 3.988e-01 1.183e-01 3.372 0.000747 ***
## CAR_TYPEz_SUV 9.756e-01 1.014e-01 9.622 < 2e-16 ***
## CAR_USEPrivate -8.009e-01 7.245e-02 -11.055 < 2e-16 ***
## CLM_FREQ 1.175e-01 3.182e-02 3.692 0.000223 ***
## EDUCATIONBachelors -7.543e-01 9.464e-02 -7.970 1.58e-15 ***
## EDUCATIONMasters -8.615e-01 1.026e-01 -8.399 < 2e-16 ***
## EDUCATIONPhD -1.055e+00 1.331e-01 -7.926 2.27e-15 ***
## EDUCATIONz_High School -1.280e-01 9.085e-02 -1.409 0.158823
## HOMEKIDS 7.975e-02 3.339e-02 2.389 0.016916 *
## HOME_VAL -9.488e-05 1.978e-05 -4.797 1.61e-06 ***
## KIDSDRIV 3.484e-01 5.934e-02 5.872 4.31e-09 ***
## MSTATUSz_No 5.204e-01 7.625e-02 6.826 8.74e-12 ***
## MVR_PTS 1.121e-01 1.354e-02 8.280 < 2e-16 ***
## OLDCLAIM 8.477e-05 4.205e-05 2.016 0.043804 *
## PARENT1Yes 3.721e-01 1.074e-01 3.466 0.000528 ***
## REVOKEDYes 7.487e-01 7.949e-02 9.419 < 2e-16 ***
## SEXz_F -2.382e-01 8.752e-02 -2.722 0.006485 **
## TIF -5.478e-02 7.253e-03 -7.553 4.26e-14 ***
## TRAVTIME 1.498e-02 1.862e-03 8.042 8.80e-16 ***
## URBANICITYz_Highly Rural/ Rural -2.239e+00 1.114e-01 -20.095 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7451.7 on 8137 degrees of freedom
## AIC: 7499.7
##
## Number of Fisher Scoring iterations: 5
plot(mod4, which = 4, id.n = 3);
mod4.data <- augment(mod4) %>%
mutate(index = 1:n()) ;
car::vif(mod4);
## GVIF Df GVIF^(1/(2*Df))
## CAR_TYPE 3.188769 5 1.122955
## CAR_USE 1.562896 1 1.250158
## CLM_FREQ 1.849893 1 1.360108
## EDUCATION 1.345521 4 1.037794
## HOMEKIDS 1.801076 1 1.342042
## HOME_VAL 1.261352 1 1.123099
## KIDSDRIV 1.290858 1 1.136159
## MSTATUS 1.753165 1 1.324071
## MVR_PTS 1.178627 1 1.085646
## OLDCLAIM 1.839824 1 1.356401
## PARENT1 1.912960 1 1.383098
## REVOKED 1.014748 1 1.007347
## SEX 2.311561 1 1.520382
## TIF 1.007045 1 1.003516
## TRAVTIME 1.034820 1 1.017261
## URBANICITY 1.118487 1 1.057586
mod4.data %>% top_n(3, .cooksd);
## TARGET_FLAG CAR_TYPE CAR_USE CLM_FREQ EDUCATION HOMEKIDS HOME_VAL
## 1 1 Van Private 3 z_High School 3 264
## 2 0 z_SUV Private 1 Bachelors 4 1333
## 3 0 z_SUV Private 3 <High School 3 2
## KIDSDRIV MSTATUS MVR_PTS OLDCLAIM PARENT1 REVOKED SEX TIF TRAVTIME
## 1 0 Yes 0 2331 No No M 1 40
## 2 4 Yes 9 717 No Yes z_F 1 23
## 3 3 z_No 4 2703 Yes No z_F 1 95
## URBANICITY .fitted .se.fit .resid .hat
## 1 z_Highly Rural/ Rural -2.657602 0.2005136 2.334681 0.002461993
## 2 Highly Urban/ Urban 1.797394 0.2473381 -1.975217 0.007460934
## 3 Highly Urban/ Urban 3.314761 0.2276243 -2.588613 0.001753313
## .sigma .cooksd .std.resid index
## 1 0.9566739 0.001470274 2.337561 807
## 2 0.9567724 0.001904079 -1.982627 3773
## 3 0.9565937 0.002017227 -2.590885 5652
ggplot(mod4.data, aes(index, .std.resid)) +
geom_point(aes(color = TARGET_FLAG), alpha = .5) +
theme_bw();
mod4.data %>%
filter(abs(.std.resid) > 3);
## TARGET_FLAG CAR_TYPE CAR_USE CLM_FREQ EDUCATION HOMEKIDS HOME_VAL
## 1 1 Minivan Private 0 Bachelors 0 2843
## KIDSDRIV MSTATUS MVR_PTS OLDCLAIM PARENT1 REVOKED SEX TIF TRAVTIME
## 1 0 Yes 0 1 No No M 1 35
## URBANICITY .fitted .se.fit .resid .hat
## 1 z_Highly Rural/ Rural -4.791511 0.1518637 3.098315 0.0001883017
## .sigma .cooksd .std.resid index
## 1 0.9564081 0.0009456568 3.098606 5101
pR2(mod4)
## llh llhNull G2 McFadden r2ML
## -3725.8668999 -4708.9811460 1966.2284922 0.2087743 0.2141032
## r2CU
## 0.3127267
This model lets us dig deeper into the individual factor levels. Driving a panel truck or having a high-school education does not appear significant, judging by the high p-values. However, it is commonly considered bad statistical practice to remove individual non-significant levels, because doing so dichotomizes the predictor.
We now need a linear regression model to predict the amount of money a car crash costs. Full model, with a log-transformed response variable:
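The fitting chunk is hidden here as well; judging from the Call printed below and the stepAIC(lmod2) reference later on, it was presumably:
lmod2 <- lm(log(TARGET_AMT + 1) ~ AGE + BLUEBOOK + CAR_AGE + CAR_TYPE + CAR_USE +
              CLM_FREQ + EDUCATION + HOMEKIDS + HOME_VAL + INCOME + JOB +
              KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 + RED_CAR +
              REVOKED + SEX + TIF + TRAVTIME + URBANICITY,
            data = crash_training3a)
summary(lmod2)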
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:datasets':
##
## rivers
##
## Call:
## lm(formula = log(TARGET_AMT + 1) ~ AGE + BLUEBOOK + CAR_AGE +
## CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + HOMEKIDS + HOME_VAL +
## INCOME + JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM +
## PARENT1 + RED_CAR + REVOKED + SEX + TIF + TRAVTIME + URBANICITY,
## data = crash_training3a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0977 -2.3385 -0.9203 2.1755 10.8203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.912e+00 3.902e-01 4.900 9.77e-07 ***
## AGE -6.950e-03 4.868e-03 -1.428 0.153404
## BLUEBOOK 1.044e-05 4.260e-05 0.245 0.806508
## CAR_AGE -5.493e-03 8.077e-03 -0.680 0.496473
## CAR_TYPEPanel Truck 1.230e-01 1.814e-01 0.678 0.497673
## CAR_TYPEPickup 6.420e-01 1.236e-01 5.194 2.11e-07 ***
## CAR_TYPESports Car 1.387e+00 1.468e-01 9.449 < 2e-16 ***
## CAR_TYPEVan 4.509e-01 1.482e-01 3.043 0.002352 **
## CAR_TYPEz_SUV 1.066e+00 1.182e-01 9.024 < 2e-16 ***
## CAR_USEPrivate -9.995e-01 1.174e-01 -8.514 < 2e-16 ***
## CLM_FREQ 1.500e-01 4.498e-02 3.334 0.000860 ***
## EDUCATIONBachelors -5.570e-01 1.433e-01 -3.886 0.000103 ***
## EDUCATIONMasters -3.751e-01 2.072e-01 -1.811 0.070239 .
## EDUCATIONPhD -4.606e-01 2.397e-01 -1.921 0.054718 .
## EDUCATIONz_High School 2.078e-02 1.227e-01 0.169 0.865534
## HOMEKIDS 3.674e-02 4.573e-02 0.803 0.421776
## HOME_VAL -9.463e-05 2.505e-05 -3.778 0.000159 ***
## INCOME -2.007e-05 1.900e-05 -1.056 0.290902
## JOBClerical 9.008e-01 2.411e-01 3.736 0.000188 ***
## JOBDoctor -2.893e-01 2.917e-01 -0.992 0.321330
## JOBHome Maker 1.068e+00 2.448e-01 4.362 1.30e-05 ***
## JOBLawyer 3.035e-01 2.106e-01 1.441 0.149588
## JOBManager -4.683e-01 2.056e-01 -2.277 0.022800 *
## JOBProfessional 4.847e-01 2.199e-01 2.204 0.027559 *
## JOBStudent 9.691e-01 2.566e-01 3.777 0.000160 ***
## JOBz_Blue Collar 7.243e-01 2.292e-01 3.160 0.001581 **
## KIDSDRIV 5.067e-01 8.077e-02 6.274 3.70e-10 ***
## MSTATUSz_No 6.418e-01 9.536e-02 6.730 1.81e-11 ***
## MVR_PTS 1.762e-01 1.866e-02 9.444 < 2e-16 ***
## OLDCLAIM 1.089e-04 5.971e-05 1.824 0.068156 .
## PARENT1Yes 6.455e-01 1.441e-01 4.480 7.56e-06 ***
## RED_CARyes -3.549e-03 1.064e-01 -0.033 0.973389
## REVOKEDYes 1.073e+00 1.111e-01 9.656 < 2e-16 ***
## SEXz_F -3.211e-01 1.223e-01 -2.625 0.008688 **
## TIF -6.530e-02 8.696e-03 -7.510 6.56e-14 ***
## TRAVTIME 1.674e-02 2.300e-03 7.277 3.73e-13 ***
## URBANICITYz_Highly Rural/ Rural -2.437e+00 9.983e-02 -24.414 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.245 on 8124 degrees of freedom
## Multiple R-squared: 0.2222, Adjusted R-squared: 0.2188
## F-statistic: 64.48 on 36 and 8124 DF, p-value: < 2.2e-16
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------------------------
## Response : log(TARGET_AMT + 1)
## Variables: fitted values of log(TARGET_AMT + 1)
##
## Test Summary
## --------------------------------
## DF = 1
## Chi2 = 825.5908
## Prob > Chi2 = 1.473243e-181
## GVIF Df GVIF^(1/(2*Df))
## AGE 1.392944 1 1.180230
## BLUEBOOK 1.122873 1 1.059657
## CAR_AGE 1.745095 1 1.321021
## CAR_TYPE 3.709046 5 1.140056
## CAR_USE 2.492971 1 1.578914
## CLM_FREQ 2.104164 1 1.450574
## EDUCATION 9.294596 4 1.321383
## HOMEKIDS 2.019551 1 1.421109
## HOME_VAL 1.400510 1 1.183431
## INCOME 1.223359 1 1.106055
## JOB 17.528235 8 1.196007
## KIDSDRIV 1.322743 1 1.150105
## MSTATUS 1.692053 1 1.300789
## MVR_PTS 1.243635 1 1.115184
## OLDCLAIM 2.053553 1 1.433022
## PARENT1 1.842964 1 1.357558
## RED_CAR 1.811537 1 1.345933
## REVOKED 1.028725 1 1.014261
## SEX 2.885243 1 1.698600
## TIF 1.007572 1 1.003779
## TRAVTIME 1.037493 1 1.018574
## URBANICITY 1.256678 1 1.121017
The full linear model suggests eliminating JOB due to its high VIF. We can also remove AGE, BLUEBOOK, CAR_AGE, EDUCATION, HOMEKIDS, and INCOME due to their high p-values; EDUCATION has multiple levels, most of them with high p-values.
##
## Call:
## lm(formula = log(TARGET_AMT + 1) ~ CAR_TYPE + CAR_USE + CLM_FREQ +
## HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 +
## REVOKED + SEX + TIF + TRAVTIME + URBANICITY + YOJ, data = crash_training3a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.947 -2.340 -1.029 2.241 10.813
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.361e+00 1.811e-01 13.038 < 2e-16 ***
## CAR_TYPEPanel Truck -2.468e-01 1.655e-01 -1.491 0.135961
## CAR_TYPEPickup 6.424e-01 1.193e-01 5.385 7.43e-08 ***
## CAR_TYPESports Car 1.479e+00 1.472e-01 10.050 < 2e-16 ***
## CAR_TYPEVan 2.733e-01 1.462e-01 1.869 0.061699 .
## CAR_TYPEz_SUV 1.162e+00 1.190e-01 9.769 < 2e-16 ***
## CAR_USEPrivate -1.198e+00 9.036e-02 -13.262 < 2e-16 ***
## CLM_FREQ 1.593e-01 4.560e-02 3.495 0.000477 ***
## HOME_VAL -2.025e-04 2.363e-05 -8.569 < 2e-16 ***
## KIDSDRIV 5.684e-01 7.391e-02 7.690 1.64e-14 ***
## MSTATUSz_No 3.312e-01 9.132e-02 3.627 0.000289 ***
## MVR_PTS 1.876e-01 1.888e-02 9.941 < 2e-16 ***
## OLDCLAIM 1.256e-04 6.055e-05 2.075 0.038040 *
## PARENT1Yes 9.147e-01 1.272e-01 7.191 7.00e-13 ***
## REVOKEDYes 1.122e+00 1.126e-01 9.968 < 2e-16 ***
## SEXz_F -3.179e-01 1.047e-01 -3.036 0.002405 **
## TIF -6.501e-02 8.809e-03 -7.381 1.73e-13 ***
## TRAVTIME 1.789e-02 2.329e-03 7.681 1.76e-14 ***
## URBANICITYz_Highly Rural/ Rural -2.010e+00 9.674e-02 -20.774 < 2e-16 ***
## YOJ -3.637e-02 7.961e-03 -4.568 4.99e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.293 on 8141 degrees of freedom
## Multiple R-squared: 0.1975, Adjusted R-squared: 0.1956
## F-statistic: 105.5 on 19 and 8141 DF, p-value: < 2.2e-16
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------------------------
## Response : log(TARGET_AMT + 1)
## Variables: fitted values of log(TARGET_AMT + 1)
##
## Test Summary
## --------------------------------
## DF = 1
## Chi2 = 734.0186
## Prob > Chi2 = 1.197798e-161
## GVIF Df GVIF^(1/(2*Df))
## CAR_TYPE 2.760221 5 1.106864
## CAR_USE 1.434298 1 1.197622
## CLM_FREQ 2.100171 1 1.449197
## HOME_VAL 1.210598 1 1.100272
## KIDSDRIV 1.075910 1 1.037261
## MSTATUS 1.506874 1 1.227548
## MVR_PTS 1.236127 1 1.111813
## OLDCLAIM 2.051267 1 1.432224
## PARENT1 1.394992 1 1.181098
## REVOKED 1.026012 1 1.012922
## SEX 2.051994 1 1.432478
## TIF 1.004093 1 1.002045
## TRAVTIME 1.033058 1 1.016394
## URBANICITY 1.146057 1 1.070540
## YOJ 1.030601 1 1.015185
The reduced model has a noticeably lower R-squared (0.198 vs. 0.222). Moreover, the very low p-value of the Breusch-Pagan test indicates that non-constant variance is present.
How about we build a model with backward stepwise selection?
#library(MASS)
lstep<-stepAIC(lmod2, trace=FALSE)
lstep$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## log(TARGET_AMT + 1) ~ AGE + BLUEBOOK + CAR_AGE + CAR_TYPE + CAR_USE +
## CLM_FREQ + EDUCATION + HOMEKIDS + HOME_VAL + INCOME + JOB +
## KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 + RED_CAR +
## REVOKED + SEX + TIF + TRAVTIME + URBANICITY
##
## Final Model:
## log(TARGET_AMT + 1) ~ AGE + CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION +
## HOME_VAL + JOB + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM +
## PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 8124 85545.12 19249.72
## 2 - RED_CAR 1 0.0117177 8125 85545.13 19247.72
## 3 - BLUEBOOK 1 0.6338732 8126 85545.77 19245.78
## 4 - CAR_AGE 1 4.9548030 8127 85550.72 19244.25
## 5 - HOMEKIDS 1 6.8761756 8128 85557.60 19242.91
## 6 - INCOME 1 11.7191331 8129 85569.32 19242.02
Formulate the model obtained through the backward step.
lmod4<-lm(log(TARGET_AMT + 1) ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION +
HOMEKIDS + HOME_VAL + JOB + KIDSDRIV + MSTATUS + MVR_PTS +
OLDCLAIM + PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY +
YOJ,data=crash_training3a)
#summary(lmod2);
summary(lmod4);
##
## Call:
## lm(formula = log(TARGET_AMT + 1) ~ CAR_TYPE + CAR_USE + CLM_FREQ +
## EDUCATION + HOMEKIDS + HOME_VAL + JOB + KIDSDRIV + MSTATUS +
## MVR_PTS + OLDCLAIM + PARENT1 + REVOKED + SEX + TIF + TRAVTIME +
## URBANICITY + YOJ, data = crash_training3a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0992 -2.3399 -0.9167 2.1626 10.8591
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.764e+00 3.129e-01 5.635 1.80e-08 ***
## CAR_TYPEPanel Truck 1.368e-01 1.782e-01 0.768 0.442546
## CAR_TYPEPickup 6.554e-01 1.209e-01 5.422 6.07e-08 ***
## CAR_TYPESports Car 1.378e+00 1.455e-01 9.471 < 2e-16 ***
## CAR_TYPEVan 4.587e-01 1.480e-01 3.099 0.001946 **
## CAR_TYPEz_SUV 1.062e+00 1.176e-01 9.024 < 2e-16 ***
## CAR_USEPrivate -9.942e-01 1.173e-01 -8.474 < 2e-16 ***
## CLM_FREQ 1.473e-01 4.495e-02 3.277 0.001054 **
## EDUCATIONBachelors -6.025e-01 1.362e-01 -4.423 9.87e-06 ***
## EDUCATIONMasters -4.434e-01 1.909e-01 -2.323 0.020219 *
## EDUCATIONPhD -5.160e-01 2.266e-01 -2.277 0.022790 *
## EDUCATIONz_High School 7.435e-03 1.220e-01 0.061 0.951389
## HOMEKIDS 7.263e-02 4.302e-02 1.688 0.091400 .
## HOME_VAL -9.539e-05 2.501e-05 -3.814 0.000138 ***
## JOBClerical 9.215e-01 2.406e-01 3.829 0.000129 ***
## JOBDoctor -3.047e-01 2.912e-01 -1.046 0.295553
## JOBHome Maker 9.855e-01 2.461e-01 4.004 6.29e-05 ***
## JOBLawyer 2.854e-01 2.102e-01 1.358 0.174611
## JOBManager -4.817e-01 2.054e-01 -2.345 0.019031 *
## JOBProfessional 4.701e-01 2.197e-01 2.139 0.032427 *
## JOBStudent 9.008e-01 2.576e-01 3.497 0.000473 ***
## JOBz_Blue Collar 7.221e-01 2.290e-01 3.153 0.001620 **
## KIDSDRIV 4.867e-01 7.958e-02 6.115 1.01e-09 ***
## MSTATUSz_No 6.291e-01 9.549e-02 6.589 4.72e-11 ***
## MVR_PTS 1.757e-01 1.863e-02 9.427 < 2e-16 ***
## OLDCLAIM 1.095e-04 5.967e-05 1.834 0.066649 .
## PARENT1Yes 6.552e-01 1.434e-01 4.568 4.99e-06 ***
## REVOKEDYes 1.075e+00 1.110e-01 9.677 < 2e-16 ***
## SEXz_F -3.129e-01 1.045e-01 -2.995 0.002751 **
## TIF -6.537e-02 8.691e-03 -7.522 5.98e-14 ***
## TRAVTIME 1.677e-02 2.299e-03 7.297 3.23e-13 ***
## URBANICITYz_Highly Rural/ Rural -2.445e+00 9.973e-02 -24.518 < 2e-16 ***
## YOJ -2.258e-02 8.617e-03 -2.621 0.008788 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.244 on 8128 degrees of freedom
## Multiple R-squared: 0.2225, Adjusted R-squared: 0.2195
## F-statistic: 72.7 on 32 and 8128 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lmod4)
hist(resid(lmod4), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod4);
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------------------------
## Response : log(TARGET_AMT + 1)
## Variables: fitted values of log(TARGET_AMT + 1)
##
## Test Summary
## --------------------------------
## DF = 1
## Chi2 = 827.5224
## Prob > Chi2 = 5.601701e-182
vif(lmod4)
## GVIF Df GVIF^(1/(2*Df))
## CAR_TYPE 3.440799 5 1.131530
## CAR_USE 2.491683 1 1.578507
## CLM_FREQ 2.103030 1 1.450183
## EDUCATION 7.243179 4 1.280829
## HOMEKIDS 1.788846 1 1.337477
## HOME_VAL 1.397842 1 1.182304
## JOB 18.103217 8 1.198422
## KIDSDRIV 1.285407 1 1.133758
## MSTATUS 1.697876 1 1.303026
## MVR_PTS 1.241590 1 1.114267
## OLDCLAIM 2.053236 1 1.432912
## PARENT1 1.828223 1 1.352118
## REVOKED 1.028404 1 1.014103
## SEX 2.105523 1 1.451042
## TIF 1.007337 1 1.003662
## TRAVTIME 1.037289 1 1.018474
## URBANICITY 1.255074 1 1.120301
## YOJ 1.244493 1 1.115569
Based on the VIF numbers, we can remove JOB.
lmod5<-lm(log(TARGET_AMT+1) ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION +
HOMEKIDS + HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS +
OLDCLAIM + PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY +
YOJ,data=crash_training3a)
#summary(lmod2);
summary(lmod5);
##
## Call:
## lm(formula = log(TARGET_AMT + 1) ~ CAR_TYPE + CAR_USE + CLM_FREQ +
## EDUCATION + HOMEKIDS + HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS +
## OLDCLAIM + PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY +
## YOJ, data = crash_training3a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0420 -2.3491 -0.9675 2.2529 11.0907
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.663e+00 2.060e-01 12.925 < 2e-16 ***
## CAR_TYPEPanel Truck -4.865e-03 1.673e-01 -0.029 0.97681
## CAR_TYPEPickup 6.124e-01 1.188e-01 5.154 2.61e-07 ***
## CAR_TYPESports Car 1.392e+00 1.459e-01 9.538 < 2e-16 ***
## CAR_TYPEVan 3.938e-01 1.458e-01 2.701 0.00693 **
## CAR_TYPEz_SUV 1.075e+00 1.181e-01 9.108 < 2e-16 ***
## CAR_USEPrivate -1.085e+00 9.365e-02 -11.583 < 2e-16 ***
## CLM_FREQ 1.553e-01 4.517e-02 3.437 0.00059 ***
## EDUCATIONBachelors -9.482e-01 1.212e-01 -7.824 5.75e-15 ***
## EDUCATIONMasters -1.032e+00 1.309e-01 -7.881 3.67e-15 ***
## EDUCATIONPhD -1.309e+00 1.629e-01 -8.038 1.04e-15 ***
## EDUCATIONz_High School -1.123e-01 1.191e-01 -0.943 0.34588
## HOMEKIDS 1.048e-01 4.282e-02 2.447 0.01443 *
## HOME_VAL -1.143e-04 2.445e-05 -4.676 2.97e-06 ***
## KIDSDRIV 4.652e-01 7.988e-02 5.823 5.99e-09 ***
## MSTATUSz_No 5.982e-01 9.547e-02 6.266 3.90e-10 ***
## MVR_PTS 1.826e-01 1.870e-02 9.766 < 2e-16 ***
## OLDCLAIM 1.099e-04 5.998e-05 1.832 0.06703 .
## PARENT1Yes 6.271e-01 1.441e-01 4.353 1.36e-05 ***
## REVOKEDYes 1.093e+00 1.116e-01 9.794 < 2e-16 ***
## SEXz_F -2.636e-01 1.040e-01 -2.534 0.01131 *
## TIF -6.595e-02 8.728e-03 -7.557 4.57e-14 ***
## TRAVTIME 1.780e-02 2.308e-03 7.711 1.40e-14 ***
## URBANICITYz_Highly Rural/ Rural -2.244e+00 9.786e-02 -22.926 < 2e-16 ***
## YOJ -3.340e-02 7.922e-03 -4.217 2.51e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.261 on 8136 degrees of freedom
## Multiple R-squared: 0.2133, Adjusted R-squared: 0.211
## F-statistic: 91.92 on 24 and 8136 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lmod5)
hist(resid(lmod5), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod5);
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------------------------
## Response : log(TARGET_AMT + 1)
## Variables: fitted values of log(TARGET_AMT + 1)
##
## Test Summary
## --------------------------------
## DF = 1
## Chi2 = 788.6912
## Prob > Chi2 = 1.551769e-173
vif(lmod5)
## GVIF Df GVIF^(1/(2*Df))
## CAR_TYPE 2.928573 5 1.113437
## CAR_USE 1.570921 1 1.253364
## CLM_FREQ 2.100759 1 1.449399
## EDUCATION 1.389562 4 1.041981
## HOMEKIDS 1.753133 1 1.324059
## HOME_VAL 1.321200 1 1.149434
## KIDSDRIV 1.281054 1 1.131836
## MSTATUS 1.679124 1 1.295810
## MVR_PTS 1.237080 1 1.112241
## OLDCLAIM 2.052316 1 1.432591
## PARENT1 1.824590 1 1.350774
## REVOKED 1.027230 1 1.013524
## SEX 2.066004 1 1.437360
## TIF 1.004898 1 1.002446
## TRAVTIME 1.034302 1 1.017006
## URBANICITY 1.195641 1 1.093454
## YOJ 1.040417 1 1.020008
This model has VIF values below 4 for all predictors, and the adjusted R-squared is still hovering around 0.21. As for the coefficients, a panel truck appears to have a negative association with crash cost, whereas sports cars, SUVs, and vans have positive ones; higher education levels are also associated with lower cost. Many of these coefficients agree with our pre-existing knowledge of what might contribute to a car crash. It should also be noted that removing the log transform greatly reduces the adjusted R-squared.
We are going to validate mod4 and lmod5; these appear to be the best performing of the logistic and linear regression models built in this study.
In order to test the performance of our logistic model, we will partition the training data into a fitting subset and a validation subset before deploying the model on the evaluation data.
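A minimal sketch of the hold-out split implied by the 2,448-row (30%) validation set shown below; the seed, threshold, and object names are assumptions:
library(caret)
set.seed(123)                                              # assumed seed
idx   <- createDataPartition(crash_training3a$TARGET_FLAG, p = 0.7, list = FALSE)
train <- crash_training3a[idx, ]
valid <- crash_training3a[-idx, ]
fit   <- glm(formula(mod4), family = "binomial", data = train)
pred  <- predict(fit, newdata = valid, type = "response")  # predicted probabilities
pred_filter <- pred > 0.5                                  # classify at the 0.5 cutoff
classes.df  <- data.frame(class = valid$TARGET_FLAG,
                          as.integer(pred_filter), as.double(pred))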
## Warning: package 'caret' was built under R version 3.4.4
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## The following object is masked from 'package:survival':
##
## cluster
## 'data.frame': 2448 obs. of 3 variables:
## $ class : int 0 0 1 1 0 0 0 0 0 0 ...
## $ as.integer.pred_filter.: int 0 0 0 0 0 0 0 0 0 0 ...
## $ as.double.pred. : num 0.0934 0.0609 0.4835 0.2695 0.2503 ...
## [1] "class" "as.integer.pred_filter."
## [3] "as.double.pred."
Get a confusion matrix
library(pROC)
## Warning: package 'pROC' was built under R version 3.4.4
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(caret)
names(classes.df)<-c("class", "scored.class", "scored.probability")
matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])
matrix.df2
## class
## scored.class 1 0
## 1 259 130
## 0 404 1655
#confusion matrix
caret_matrix <- confusionMatrix(matrix.df2)
#Information from Confusion matrix
caret_matrix;
## Confusion Matrix and Statistics
##
## class
## scored.class 1 0
## 1 259 130
## 0 404 1655
##
## Accuracy : 0.7819
## 95% CI : (0.765, 0.7981)
## No Information Rate : 0.7292
## P-Value [Acc > NIR] : 1.161e-09
##
## Kappa : 0.3653
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.3906
## Specificity : 0.9272
## Pos Pred Value : 0.6658
## Neg Pred Value : 0.8038
## Prevalence : 0.2708
## Detection Rate : 0.1058
## Detection Prevalence : 0.1589
## Balanced Accuracy : 0.6589
##
## 'Positive' Class : 1
##
f1_reduced <- function(df)
{
  # F1 = 2TP / (2TP + FN + FP), computed from the class / scored.class columns
  TP <- sum(df$class == 1 & df$scored.class == 1)
  FN <- sum(df$class == 1 & df$scored.class == 0)
  FP <- sum(df$class == 0 & df$scored.class == 1)
  (2 * TP) / (2 * TP + FN + FP)
}
f1_reduced(classes.df);
## [1] 0.4923954
plot(roc(classes.df$class, classes.df$scored.probability), main="ROC Curve");
auc(roc(classes.df$class, classes.df$scored.probability))
## Area under the curve: 0.8066
##
## Call:
## glm(formula = TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION +
## HOMEKIDS + HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM +
## PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY, family = "binomial",
## data = crash_training3a)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5886 -0.7290 -0.4235 0.6570 3.0983
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.197e+00 1.502e-01 -7.969 1.60e-15 ***
## CAR_TYPEPanel Truck 1.193e-01 1.313e-01 0.908 0.363860
## CAR_TYPEPickup 5.796e-01 9.668e-02 5.995 2.03e-09 ***
## CAR_TYPESports Car 1.239e+00 1.196e-01 10.362 < 2e-16 ***
## CAR_TYPEVan 3.988e-01 1.183e-01 3.372 0.000747 ***
## CAR_TYPEz_SUV 9.756e-01 1.014e-01 9.622 < 2e-16 ***
## CAR_USEPrivate -8.009e-01 7.245e-02 -11.055 < 2e-16 ***
## CLM_FREQ 1.175e-01 3.182e-02 3.692 0.000223 ***
## EDUCATIONBachelors -7.543e-01 9.464e-02 -7.970 1.58e-15 ***
## EDUCATIONMasters -8.615e-01 1.026e-01 -8.399 < 2e-16 ***
## EDUCATIONPhD -1.055e+00 1.331e-01 -7.926 2.27e-15 ***
## EDUCATIONz_High School -1.280e-01 9.085e-02 -1.409 0.158823
## HOMEKIDS 7.975e-02 3.339e-02 2.389 0.016916 *
## HOME_VAL -9.488e-05 1.978e-05 -4.797 1.61e-06 ***
## KIDSDRIV 3.484e-01 5.934e-02 5.872 4.31e-09 ***
## MSTATUSz_No 5.204e-01 7.625e-02 6.826 8.74e-12 ***
## MVR_PTS 1.121e-01 1.354e-02 8.280 < 2e-16 ***
## OLDCLAIM 8.477e-05 4.205e-05 2.016 0.043804 *
## PARENT1Yes 3.721e-01 1.074e-01 3.466 0.000528 ***
## REVOKEDYes 7.487e-01 7.949e-02 9.419 < 2e-16 ***
## SEXz_F -2.382e-01 8.752e-02 -2.722 0.006485 **
## TIF -5.478e-02 7.253e-03 -7.553 4.26e-14 ***
## TRAVTIME 1.498e-02 1.862e-03 8.042 8.80e-16 ***
## URBANICITYz_Highly Rural/ Rural -2.239e+00 1.114e-01 -20.095 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7451.7 on 8137 degrees of freedom
## AIC: 7499.7
##
## Number of Fisher Scoring iterations: 5
We can convert the coefficients of our model into odds ratios with 95% confidence intervals.
exp(cbind("Odds ratio" = coef(mod4), confint.default(mod4, level = 0.95)))
## Odds ratio 2.5 % 97.5 %
## (Intercept) 0.3021883 0.22514200 0.4056008
## CAR_TYPEPanel Truck 1.1266622 0.87096071 1.4574339
## CAR_TYPEPickup 1.7852863 1.47712552 2.1577362
## CAR_TYPESports Car 3.4538033 2.73196292 4.3663686
## CAR_TYPEVan 1.4901070 1.18174106 1.8789386
## CAR_TYPEz_SUV 2.6528289 2.17472761 3.2360379
## CAR_USEPrivate 0.4489373 0.38951076 0.5174304
## CLM_FREQ 1.1246531 1.05665567 1.1970263
## EDUCATIONBachelors 0.4703459 0.39071583 0.5662051
## EDUCATIONMasters 0.4225210 0.34557013 0.5166072
## EDUCATIONPhD 0.3482503 0.26829161 0.4520390
## EDUCATIONz_High School 0.8798394 0.73632411 1.0513270
## HOMEKIDS 1.0830169 1.01441227 1.1562614
## HOME_VAL 0.9999051 0.99986636 0.9999439
## KIDSDRIV 1.4168694 1.26129310 1.5916356
## MSTATUSz_No 1.6827732 1.44918932 1.9540067
## MVR_PTS 1.1186252 1.08933226 1.1487058
## OLDCLAIM 1.0000848 1.00000235 1.0001672
## PARENT1Yes 1.4507509 1.17548032 1.7904836
## REVOKEDYes 2.1142314 1.80921271 2.4706737
## SEXz_F 0.7880152 0.66380530 0.9354670
## TIF 0.9466951 0.93333296 0.9602485
## TRAVTIME 1.0150898 1.01139153 1.0188016
## URBANICITYz_Highly Rural/ Rural 0.1065228 0.08562206 0.1325254
From the odds ratios we can conclude that, relative to the baseline car type, driving a sports car multiplies the odds of being involved in a crash by roughly 3.5. Higher education is associated with lower odds of a crash. Not being married multiplies the odds by about 1.7.
Our logistic regression model yields an accuracy of 0.7819 on the held-out data, the best of the logistic models considered here. We still need to validate our linear regression model before we deploy on the evaluation data.
Get predicted values vs actual values and compute root mean square error.
## [1] 4751.411
Compute the relative squared error.
## [1] 1.109594
Among the linear models we built, the chosen model yields the lowest RMSE. Note, however, that a relative squared error above 1 means the model predicts no better than simply using the mean of the target, and that these predictions are on the log(TARGET_AMT + 1) scale while the actuals are raw dollar amounts, which inflates the error.
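Because lmod5 models log(TARGET_AMT + 1), its predictions should be back-transformed before being compared with raw dollar amounts. A sketch of that correction (an assumption on our part, not a step performed in the original study), using the lmod5 fit and held-out test set from above:
pred_log <- predict(lmod5, newdata = test)   # log-scale predictions
pred_amt <- exp(pred_log) - 1                # invert log(TARGET_AMT + 1)
sqrt(mean((pred_amt - test$TARGET_AMT)^2))   # RMSE in dollars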
We are now in a position to deploy our models on the evaluation data. Deploy the linear model (its predictions are on the log scale, which is why some values are negative).
## 1 2 3 4 5 6
## 2.0164873 3.4219367 2.0270339 NA 3.4933309 1.7637496
## 7 8 9 10 11 12
## 3.4434531 3.3533618 0.8643448 2.4036975 NA 4.3959821
## 13 14 15 16 17 18
## 5.9841331 1.0004187 0.4147456 4.5898764 5.2030353 1.7597256
## 19 20 21 22 23 24
## 4.5924821 2.6904951 2.0228655 3.0288925 1.5836624 2.7983778
## 25 26 27 28 29 30
## 2.3394212 3.5325827 3.1600252 4.5128966 1.8515062 1.8302445
## 31 32 33 34 35 36
## 2.0753815 3.1451669 1.0976435 2.1666514 1.4640885 NA
## 37 38 39 40 41 42
## 1.9992239 2.1946236 NA 4.3450274 2.7032690 4.3297055
## 43 44 45 46 47 48
## -0.4526448 4.1708667 -2.1090651 2.6186578 1.4956471 3.2150948
## 49 50 51 52 53 54
## -0.7632353 4.3670731 1.4922485 3.2093304 5.5354579 1.0466093
## 55 56 57 58 59 60
## 3.7604082 2.5710424 2.9156530 3.2170713 1.0385135 4.0356455
## 61 62 63 64 65 66
## -1.2557860 1.2077085 2.6415229 2.0675694 0.4293705 NA
## 67 68 69 70 71 72
## 5.5118835 4.5515622 1.8418161 1.4485161 -0.1099099 1.9211665
## 73 74 75 76 77 78
## 4.8471779 2.2589293 5.1283445 3.0938316 3.4779705 2.1981171
## 79 80 81 82 83 84
## 1.6231959 0.6705303 3.8846945 2.8691952 2.9366777 0.8310662
## 85 86 87 88 89 90
## 3.5597005 4.2613771 2.7136331 2.9232136 -0.1523340 5.6311124
## 91 92 93 94 95 96
## 1.0259128 1.2735551 0.8046697 2.3227922 0.7327980 2.6377325
## 97 98 99 100
## -1.5968943 3.2152100 3.0867555 2.6955790
Deploy the logistic regression model
## 1 2 3 4 5 6
## 0.21859864 0.38829741 0.13316284 0.23055757 0.43513746 0.18411279
## 7 8 9 10 11 12
## 0.45438997 0.41101491 0.05049800 0.25272133 0.04414184 0.53631040
## 13 14 15 16 17 18
## 0.85647953 0.09356372 0.03851994 0.64799931 0.74097390 0.16656608
## 19 20
## 0.66786199 0.32327326
Consolidate and visualize the predicted probabilities and predicted amounts.
## Error in eval(varsRHS[[1]], data, env): object 'test_results2b' not found
This step failed in the original run because the xyplot call referenced test_results2b, an object that was never created; the formula should use test_resultsb, the logistic predictions (corrected in the appendix).
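For reference, a corrected version of that call (a sketch; lattice is available because caret depends on it, and the column names follow the production data frame built in the appendix):
library(lattice)
xyplot(test_results2 ~ test_resultsb, data = production,
xlab = "Predicted Probability",
ylab = "Predicted Amount (log scale)",
main = "Predicted Cost vs. Predicted Probability of a Crash")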
Our logistic regression model had reasonable performance, but our linear model had poor predictive power. Rather than using a linear model, I would consider a different algorithm such as robust regression or smoothing splines. Additive regression might perform better in this scenario.
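To illustrate the robust-regression suggestion, here is a minimal sketch using MASS::rlm with the lmod5 formula (this model was not fit in the original study, so treat it as an assumption rather than a result):
library(MASS)
#Huber M-estimation downweights extreme claim amounts rather than
#letting them dominate a least-squares fit
rmod <- rlm(log(TARGET_AMT + 1) ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION +
HOMEKIDS + HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM +
PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY + YOJ,
data = crash_training3a)
summary(rmod)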
Appendix
crash_training <- read.csv(url("https://raw.githubusercontent.com/vindication09/DATA-621-Week-4/master/insurance_training_data.csv"), header =TRUE)
crash_evaluation <- read.csv(url("https://raw.githubusercontent.com/vindication09/DATA-621-Week-4/master/insurance-evaluation-data.csv"), header =TRUE)
head(crash_training, 10)
nrow(crash_training);ncol(crash_training);names(crash_training)
table(crash_training$TARGET_FLAG);
barplot(table(crash_training$TARGET_FLAG), ylim=c(0, 7000), xlab="Result", ylab="N", col="black")
x <- crash_training$TARGET_AMT
h<-hist(x, breaks=10, col="red", xlab="Target AMT",
main="Histogram with Normal Curve", ylim=range(0:1200))
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2);
x <- log(crash_training$TARGET_AMT)
h<-hist(x, breaks=10, col="red", xlab="Target AMT",
main="Histogram with Normal Curve After Log Transform", ylim=range(0:1200))
#xfit<-seq(min(x),max(x),length=40)
#yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
#yfit <- yfit*diff(h$mids[1:2])*length(x)
#lines(x, yfit, col="blue", lwd=2)
summary(crash_training)
barplot(table(crash_training$AGE), ylim=c(0, 500), xlab="AGE", ylab="N", col="black");
#barplot(table(crash_training$BLUEBOOK), ylim=c(0, 100), xlab="BLUEBOOK", ylab="N", col="black");
barplot(table(crash_training$CAR_AGE), ylim=c(0, 1000), xlab="CAR_AGE", ylab="N", col="black");
barplot(table(crash_training$CAR_TYPE), ylim=c(0, 5000), xlab="CAR_TYPE", ylab="N", col="black");
barplot(table(crash_training$CAR_USE), ylim=c(0, 5000), xlab="CAR_USE", ylab="N", col="black");
barplot(table(crash_training$CLM_FREQ), ylim=c(0, 5000), xlab="CLM_FREQ", ylab="N", col="black");
barplot(table(crash_training$EDUCATION), ylim=c(0, 5000), xlab="EDUCATION", ylab="N", col="black");
barplot(table(crash_training$HOMEKIDS), ylim=c(0, 7000), xlab="HOMEKIDS", ylab="N", col="black");
#barplot(table(crash_training$HOME_VAL), ylim=c(0, 500), xlab="HOME_VAL", ylab="N", col="black");
#barplot(table(crash_training$INCOME), ylim=c(0, 5000), xlab="INCOME", ylab="N", col="black");
barplot(table(crash_training$JOB), ylim=c(0, 5000), xlab="JOB", ylab="N", col="black");
barplot(table(crash_training$KIDSDRIV), ylim=c(0, 7000), xlab="KIDSDRIV", ylab="N", col="black");
barplot(table(crash_training$MSTATUS), ylim=c(0, 5000), xlab="MSTATUS", ylab="N", col="black");
barplot(table(crash_training$MVR_PTS), ylim=c(0, 4000), xlab="MVR_PTS", ylab="N", col="black");
#barplot(table(crash_training$OLDCLAIM), ylim=c(0, 5), xlab="OLDCLAIM", ylab="N", col="black");
barplot(table(crash_training$PARENT1), ylim=c(0, 5000), xlab="PARENT 1", ylab="N", col="black");
barplot(table(crash_training$RED_CAR), ylim=c(0, 5000), xlab="RED_CAR", ylab="N", col="black");
barplot(table(crash_training$REVOKED), ylim=c(0, 5000), xlab="REVOKED", ylab="N", col="black");
barplot(table(crash_training$SEX), ylim=c(0, 5000), xlab="SEX", ylab="N", col="black");
barplot(table(crash_training$TIF), ylim=c(0, 5000), xlab="TIF", ylab="N", col="black");
barplot(table(crash_training$TRAVTIME), ylim=c(0, 400), xlab="TRAV TIME", ylab="N", col="black");
barplot(table(crash_training$URBANICITY), ylim=c(0, 8000), xlab="URBAN CITY", ylab="N", col="black");
barplot(table(crash_training$YOJ), ylim=c(0, 2000), xlab="YOJ", ylab="N", col="black")
library(ggplot2)
library(tidyr)
crash_training_plot_subset <- subset(crash_training, select = c(BLUEBOOK, HOME_VAL,INCOME, OLDCLAIM))
crash_training_plot_subset$BLUEBOOK<-as.numeric(crash_training_plot_subset$BLUEBOOK)
crash_training_plot_subset$HOME_VAL<-as.numeric(crash_training_plot_subset$HOME_VAL)
crash_training_plot_subset$INCOME<-as.numeric(crash_training_plot_subset$INCOME)
crash_training_plot_subset$OLDCLAIM<-as.numeric(crash_training_plot_subset$OLDCLAIM)
ggplot(gather(crash_training_plot_subset), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x') ;summary(crash_training_plot_subset)
crash_training2<-crash_training
crash_training2<-subset(crash_training, select=-c(INDEX))
crash_evaluation2<-subset(crash_evaluation, select=-c(INDEX))
crash_train_cor<-subset(crash_training2, select=c(TARGET_FLAG, TARGET_AMT, KIDSDRIV, AGE, HOMEKIDS, YOJ, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, OLDCLAIM, CLM_FREQ, MVR_PTS, CAR_AGE))
crash_train_cor$TARGET_AMT<-as.numeric(crash_train_cor$TARGET_AMT)
crash_train_cor$TARGET_FLAG<-as.numeric(crash_train_cor$TARGET_FLAG)
crash_train_cor$KIDSDRIV<-as.numeric(crash_train_cor$KIDSDRIV)
crash_train_cor$AGE<-as.numeric(crash_train_cor$AGE)
crash_train_cor$HOMEKIDS<-as.numeric(crash_train_cor$HOMEKIDS)
crash_train_cor$YOJ<-as.numeric(crash_train_cor$YOJ)
crash_train_cor$HOME_VAL<-as.numeric(crash_train_cor$HOME_VAL)
crash_train_cor$TRAVTIME<-as.numeric(crash_train_cor$TRAVTIME)
crash_train_cor$BLUEBOOK<-as.numeric(crash_train_cor$BLUEBOOK)
crash_train_cor$TIF<-as.numeric(crash_train_cor$TIF)
crash_train_cor$OLDCLAIM<-as.numeric(crash_train_cor$OLDCLAIM)
crash_train_cor$CLM_FREQ<-as.numeric(crash_train_cor$CLM_FREQ)
crash_train_cor$MVR_PTS<-as.numeric(crash_train_cor$MVR_PTS)
crash_train_cor$CAR_AGE<-as.numeric(crash_train_cor$CAR_AGE)
apply(crash_train_cor,2, function(col)cor(col, crash_training$TARGET_AMT))
apply(crash_train_cor,2, function(col)cor(col, crash_training$TARGET_FLAG))
#correlation matrix and visualization
correlation_matrix <- round(cor(crash_train_cor),2)
# Get lower triangle of the correlation matrix
get_lower_tri<-function(correlation_matrix){
correlation_matrix[upper.tri(correlation_matrix)] <- NA
return(correlation_matrix)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(correlation_matrix){
correlation_matrix[lower.tri(correlation_matrix)]<- NA
return(correlation_matrix)
}
upper_tri <- get_upper_tri(correlation_matrix)
library(reshape2)
# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 15, hjust = 1))+
coord_fixed()
#add nice labels
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x=element_text(size=rel(0.8), angle=90),
axis.text.y=element_text(size=rel(0.8)),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwidth = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
#missing data
colSums(is.na(crash_training2)) #;colSums(is.na(crash_evaluation))
colSums(crash_training2 == 0)
outlierKD<-function(crash_training2, var) {
var_name <- eval(substitute(var),eval(crash_training2))
na1 <- sum(is.na(var_name))
m1 <- mean(var_name, na.rm = T)
par(mfrow=c(2, 2), oma=c(0,0,3,0))
boxplot(var_name, main="With outliers")
hist(var_name, main="With outliers", xlab=NA, ylab=NA)
outlier <- boxplot.stats(var_name)$out
mo <- mean(outlier)
var_name <- ifelse(var_name %in% outlier, NA, var_name)
boxplot(var_name, main="Without outliers")
hist(var_name, main="Without outliers", xlab=NA, ylab=NA)
title("Outlier Check", outer=TRUE)
na2 <- sum(is.na(var_name))
cat("Outliers identified:", na2 - na1, "n")
cat("Propotion (%) of outliers:", round((na2 - na1) / sum(!is.na(var_name))*100, 1), "n")
cat("Mean of the outliers:", round(mo, 2), "n")
m2 <- mean(var_name, na.rm = T)
cat("Mean without removing outliers:", round(m1, 2), "n")
cat("Mean if we remove outliers:", round(m2, 2), "n")
response <- readline(prompt="Do you want to remove outliers and to replace with NA? [yes/no]: ")
if(response == "y" | response == "yes"){
crash_training2[as.character(substitute(var))] <- invisible(var_name)
assign(as.character(as.list(match.call())$crash_training2), crash_training2, envir = .GlobalEnv)
cat("Outliers successfully removed", "n")
return(invisible(crash_training2))
} else{
cat("Nothing changed", "n")
return(invisible(var_name))
}
}
outlierKD(crash_training2, CAR_AGE)
colSums(is.na(crash_training2)) #;colSums(is.na(crash_evaluation))
library(Hmisc)
crash_training3<-crash_training2
#assign the imputed values back; the original calls discarded the result
crash_training3$AGE <- impute(crash_training3$AGE, median)
crash_training3$CAR_AGE <- impute(crash_training3$CAR_AGE, median)
crash_training3[is.na(crash_training3)] <- 0
colSums(is.na(crash_training3))
crash_evaluation3<-crash_evaluation2
crash_evaluation3$AGE <- impute(crash_evaluation3$AGE, median)
crash_evaluation3$CAR_AGE <- impute(crash_evaluation3$CAR_AGE, median)
crash_evaluation3[is.na(crash_evaluation3)] <- 0
crash_training3$CAR_AGE<-abs(crash_training3$CAR_AGE)
summary(crash_training3);summary(crash_training2)
crash_training3$TARGET_AMT<-as.numeric(crash_training3$TARGET_AMT)
crash_training3$TARGET_FLAG<-as.numeric(crash_training3$TARGET_FLAG)
crash_training3$KIDSDRIV<-as.numeric(crash_training3$KIDSDRIV)
crash_training3$AGE<-as.numeric(crash_training3$AGE)
crash_training3$HOMEKIDS<-as.numeric(crash_training3$HOMEKIDS)
crash_training3$YOJ<-as.numeric(crash_training3$YOJ)
crash_training3$HOME_VAL<-as.numeric(crash_training3$HOME_VAL)
crash_training3$TRAVTIME<-as.numeric(crash_training3$TRAVTIME)
crash_training3$BLUEBOOK<-as.numeric(crash_training3$BLUEBOOK)
crash_training3$TIF<-as.numeric(crash_training3$TIF)
crash_training3$OLDCLAIM<-as.numeric(crash_training3$OLDCLAIM)
crash_training3$CLM_FREQ<-as.numeric(crash_training3$CLM_FREQ)
crash_training3$MVR_PTS<-as.numeric(crash_training3$MVR_PTS)
crash_training3$CAR_AGE<-as.numeric(crash_training3$CAR_AGE)
crash_training3$INCOME<-as.numeric(crash_training3$INCOME)
crash_training3a<-crash_training3
crash_evaluation2$TARGET_AMT<-as.numeric(crash_evaluation2$TARGET_AMT)
crash_evaluation2$TARGET_FLAG<-as.numeric(crash_evaluation2$TARGET_FLAG)
crash_evaluation2$KIDSDRIV<-as.numeric(crash_evaluation2$KIDSDRIV)
crash_evaluation2$AGE<-as.numeric(crash_evaluation2$AGE)
crash_evaluation2$HOMEKIDS<-as.numeric(crash_evaluation2$HOMEKIDS)
crash_evaluation2$YOJ<-as.numeric(crash_evaluation2$YOJ)
crash_evaluation2$HOME_VAL<-as.numeric(crash_evaluation2$HOME_VAL)
crash_evaluation2$TRAVTIME<-as.numeric(crash_evaluation2$TRAVTIME)
crash_evaluation2$BLUEBOOK<-as.numeric(crash_evaluation2$BLUEBOOK)
crash_evaluation2$TIF<-as.numeric(crash_evaluation2$TIF)
crash_evaluation2$OLDCLAIM<-as.numeric(crash_evaluation2$OLDCLAIM)
crash_evaluation2$CLM_FREQ<-as.numeric(crash_evaluation2$CLM_FREQ)
crash_evaluation2$MVR_PTS<-as.numeric(crash_evaluation2$MVR_PTS)
crash_evaluation2$CAR_AGE<-as.numeric(crash_evaluation2$CAR_AGE)
crash_evaluation2$INCOME<-as.numeric(crash_evaluation2$INCOME)
crash_evaluation2a<-crash_evaluation2
library(dplyr)
crash_training3 %>%
sapply(levels)
crash_training3$CLERICAL <- ifelse(crash_training3$JOB == "Clerical", 1, 0)
crash_training3$DOCTOR <- ifelse(crash_training3$JOB == "Doctor", 1, 0)
crash_training3$HOME_MAKER <- ifelse(crash_training3$JOB == "Home Maker", 1, 0)
crash_training3$LAWYER <- ifelse(crash_training3$JOB == "Lawyer", 1, 0)
crash_training3$MANAGER <- ifelse(crash_training3$JOB == "Manager", 1, 0)
crash_training3$PROF <- ifelse(crash_training3$JOB == "Professional", 1, 0)
crash_training3$STUDENT <- ifelse(crash_training3$JOB == "Student", 1, 0)
crash_training3$BLUE_COLLAR <- ifelse(crash_training3$JOB == "z_Blue Collar", 1, 0)
crash_training3$MINIVAN <- ifelse(crash_training3$CAR_TYPE == "Minivan", 1, 0)
crash_training3$TRUCK <- ifelse(crash_training3$CAR_TYPE == "Panel Truck", 1, 0)
crash_training3$PICKUP <- ifelse(crash_training3$CAR_TYPE == "Pickup", 1, 0)
crash_training3$SPORTS <- ifelse(crash_training3$CAR_TYPE == "Sports Car", 1, 0)
crash_training3$VAN <- ifelse(crash_training3$CAR_TYPE == "Van", 1, 0)
crash_training3$SUV <- ifelse(crash_training3$CAR_TYPE == "z_SUV", 1, 0)
#do it for evaluation data as well
crash_evaluation3$CLERICAL <- ifelse(crash_evaluation3$JOB == "Clerical", 1, 0)
crash_evaluation3$DOCTOR <- ifelse(crash_evaluation3$JOB == "Doctor", 1, 0)
crash_evaluation3$HOME_MAKER <- ifelse(crash_evaluation3$JOB == "Home Maker", 1, 0)
crash_evaluation3$LAWYER <- ifelse(crash_evaluation3$JOB == "Lawyer", 1, 0)
crash_evaluation3$MANAGER <- ifelse(crash_evaluation3$JOB == "Manager", 1, 0)
crash_evaluation3$PROF <- ifelse(crash_evaluation3$JOB == "Professional", 1, 0)
crash_evaluation3$STUDENT <- ifelse(crash_evaluation3$JOB == "Student", 1, 0)
crash_evaluation3$BLUE_COLLAR <- ifelse(crash_evaluation3$JOB == "z_Blue Collar", 1, 0)
crash_evaluation3$MINIVAN <- ifelse(crash_evaluation3$CAR_TYPE == "Minivan", 1, 0)
crash_evaluation3$TRUCK <- ifelse(crash_evaluation3$CAR_TYPE == "Panel Truck", 1, 0)
crash_evaluation3$PICKUP <- ifelse(crash_evaluation3$CAR_TYPE == "Pickup", 1, 0)
crash_evaluation3$SPORTS <- ifelse(crash_evaluation3$CAR_TYPE == "Sports Car", 1, 0)
crash_evaluation3$VAN <- ifelse(crash_evaluation3$CAR_TYPE == "Van", 1, 0)
crash_evaluation3$SUV <- ifelse(crash_evaluation3$CAR_TYPE == "z_SUV", 1, 0)
crash_training4<-subset(crash_training3, select=-c(CAR_TYPE, JOB))
crash_evaluation4<-subset(crash_evaluation3, select=-c(CAR_TYPE, JOB))
library(tidyverse)
library(dplyr)
library(broom)
library(car)
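#modeling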
mod2<-glm(TARGET_FLAG~AGE+BLUEBOOK+CAR_AGE+CAR_TYPE+CAR_USE+CLM_FREQ+EDUCATION+HOMEKIDS+HOME_VAL+INCOME+JOB+KIDSDRIV+MSTATUS+MVR_PTS+OLDCLAIM+PARENT1+RED_CAR+REVOKED+SEX+TIF+TRAVTIME+URBANICITY+YOJ, family="binomial", data=crash_training3a)
summary(mod2);
plot(mod2, which = 4, id.n = 3);
mod2.data <- augment(mod2) %>%
mutate(index = 1:n()) ;
car::vif(mod2);
mod2.data %>% top_n(3, .cooksd);
ggplot(mod2.data, aes(index, .std.resid)) +
geom_point(aes(color = TARGET_FLAG), alpha = .5) +
theme_bw();
mod2.data %>%
filter(abs(.std.resid) > 3);
library(pscl)
pR2(mod2)
mod3<-glm(TARGET_FLAG~CAR_TYPE+CAR_USE+CLM_FREQ+EDUCATION+HOMEKIDS+HOME_VAL+KIDSDRIV+MSTATUS+MVR_PTS+OLDCLAIM+PARENT1+REVOKED+SEX+TIF+TRAVTIME+URBANICITY, family="binomial", data=crash_training3a)
summary(mod3);
plot(mod3, which = 4, id.n = 3);
mod3.data <- augment(mod3) %>%
mutate(index = 1:n()) ;
car::vif(mod3);
mod3.data %>% top_n(3, .cooksd);
ggplot(mod3.data, aes(index, .std.resid)) +
geom_point(aes(color = TARGET_FLAG), alpha = .5) +
theme_bw();
mod3.data %>%
filter(abs(.std.resid) > 3);
pR2(mod3)
library(MASS)
step<-stepAIC(mod3, trace=FALSE)
step$anova
mod4<-glm(TARGET_FLAG ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION + HOMEKIDS +
HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS + OLDCLAIM + PARENT1 +
REVOKED + SEX + TIF + TRAVTIME + URBANICITY, family="binomial", data=crash_training3a)
summary(mod4);
plot(mod4, which = 4, id.n = 3);
mod4.data <- augment(mod4) %>%
mutate(index = 1:n()) ;
car::vif(mod4);
mod4.data %>% top_n(3, .cooksd);
ggplot(mod4.data, aes(index, .std.resid)) +
geom_point(aes(color = TARGET_FLAG), alpha = .5) +
theme_bw();
mod4.data %>%
filter(abs(.std.resid) > 3);
pR2(mod4)
library(olsrr)
lmod2<-lm(log(TARGET_AMT+1)~AGE+BLUEBOOK+CAR_AGE+CAR_TYPE+CAR_USE+CLM_FREQ+EDUCATION+HOMEKIDS+HOME_VAL+INCOME+JOB+KIDSDRIV+MSTATUS+MVR_PTS+OLDCLAIM+PARENT1+RED_CAR+REVOKED+SEX+TIF+TRAVTIME+URBANICITY,data=crash_training3a)
#summary(lmod2);
summary(lmod2);
par(mfrow=c(2,2))
plot(lmod2)
hist(resid(lmod2), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod2);
vif(lmod2)
lmod3<-lm(log(TARGET_AMT+1)~CAR_TYPE+CAR_USE+CLM_FREQ+HOME_VAL+KIDSDRIV+MSTATUS+MVR_PTS+OLDCLAIM+PARENT1+REVOKED+SEX+TIF+TRAVTIME+URBANICITY+YOJ,data=crash_training3a)
#summary(lmod2);
summary(lmod3);
par(mfrow=c(2,2))
plot(lmod3)
hist(resid(lmod3), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod3);
vif(lmod3)
#library(MASS)
lstep<-stepAIC(lmod2, trace=FALSE)
lstep$anova
lmod4<-lm(log(TARGET_AMT + 1) ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION +
HOMEKIDS + HOME_VAL + JOB + KIDSDRIV + MSTATUS + MVR_PTS +
OLDCLAIM + PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY +
YOJ,data=crash_training3a)
#summary(lmod2);
summary(lmod4);
par(mfrow=c(2,2))
plot(lmod4)
hist(resid(lmod4), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod4);
vif(lmod4)
lmod5<-lm(log(TARGET_AMT+1) ~ CAR_TYPE + CAR_USE + CLM_FREQ + EDUCATION +
HOMEKIDS + HOME_VAL + KIDSDRIV + MSTATUS + MVR_PTS +
OLDCLAIM + PARENT1 + REVOKED + SEX + TIF + TRAVTIME + URBANICITY +
YOJ,data=crash_training3a)
#summary(lmod2);
summary(lmod5);
par(mfrow=c(2,2))
plot(lmod5)
hist(resid(lmod5), main="Histogram of Residuals");
ols_test_breusch_pagan(lmod5);
vif(lmod5)
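#model selection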
#The evaluation data does not contain the target response variable. To test predictive power, we need to partition our training data into training and test subsets
library(caret)
Train <- createDataPartition(crash_training3a$TARGET_FLAG, p=0.7, list=FALSE)
train <- crash_training3a[Train, ]
test <- crash_training3a[-Train, ]
#probabilities of prediction
pred <- predict(mod4, newdata=test, type="response", na.action = na.pass)
scored.probability<-data.frame(as.double(pred))
#probability threshold
pred_filter <- ifelse(pred > 0.5, 1, 0)
scored.class<-data.frame(as.integer(pred_filter))
#subset actuals
class<-data.frame(subset(test, select=c(TARGET_FLAG)))
class<-as.integer(class$TARGET_FLAG)
#library(plyr)
classes.df<-data.frame(class, scored.class, scored.probability)
str(classes.df)
names(classes.df)
library(pROC)
library(caret)
names(classes.df)<-c("class", "scored.class", "scored.probability")
matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])
matrix.df2
#confusion matrix
caret_matrix <- confusionMatrix(matrix.df2)
#Information from Confusion matrix
caret_matrix;
f1_reduced <- function(df)
{
#compute the F1 score from the class/scored.class columns of df
TP <- sum(df$class == 1 & df$scored.class == 1)
FN <- sum(df$class == 1 & df$scored.class == 0)
FP <- sum(df$class == 0 & df$scored.class == 1)
(2*TP)/(2*TP + FN + FP)
}
f1_reduced(classes.df);
plot(roc(classes.df$class, classes.df$scored.probability), main="ROC Curve");
auc(roc(classes.df$class, classes.df$scored.probability))
summary(mod4)
exp(cbind("Odds ratio" = coef(mod4), confint.default(mod4, level = 0.95)))
#predicted vs actual for the linear model; name the columns explicitly so
#the rmse computation below actually references them
lpred <- predict(lmod5, newdata = test)
actual <- test$TARGET_AMT
l_testing <- data.frame(lpred, actual)
l_testing$rmse <- sqrt(mean((l_testing$lpred - l_testing$actual)^2))
rmse <- sqrt(mean((lpred - actual)^2))
rmse
mu <- mean(actual)
rse <- mean((lpred - actual)^2) / mean((mu - actual)^2)
rse
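#prediction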
test_results <- predict(lmod5, newdata = crash_evaluation2)
head(test_results, 100);
test_results2<-predict(lmod5, newdata=crash_evaluation2, type = "response")
#table(test_results)
#predict(mod5, newdata=moneyball_evaluation3)
#plot(test_results);
#predict(lmod5,newdata=crash_evaluation2, interval='prediction')
test_resultsb <- predict(mod4, newdata = crash_evaluation2, type="response")
head(test_resultsb, 20)
predicted_amt<-data.frame(test_results2)
predicted_prb<-data.frame(test_resultsb)
production<-data.frame(predicted_amt,predicted_prb )
xyplot(test_results2 ~ test_resultsb, data = production,
xlab = "Predicted Probabilities",
ylab = "Predicted Amount",
main = "Predicted Cost vs. Predicted Probability of Getting Into a Car Crash")