The data set contains records representing a customer at an auto insurance company.
#read data set
data <- read.csv(file=
"https://raw.githubusercontent.com/olga0503/DATA-621/master/insurance_training_data.csv",
stringsAsFactors=T, header=T)
head(data)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1
## 1 1 0 0 0 60 0 11 $67,349 No
## 2 2 0 0 0 43 0 11 $91,449 No
## 3 4 0 0 0 35 1 10 $16,039 No
## 4 5 0 0 0 51 0 14 No
## 5 6 0 0 0 50 0 NA $114,986 No
## 6 7 1 2946 0 34 1 12 $125,301 Yes
## HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE
## 1 $0 z_No M PhD Professional 14 Private
## 2 $257,252 z_No M z_High School z_Blue Collar 22 Commercial
## 3 $124,191 Yes z_F z_High School Clerical 5 Private
## 4 $306,251 Yes M <High School z_Blue Collar 32 Private
## 5 $243,925 Yes z_F PhD Doctor 36 Private
## 6 $0 z_No z_F Bachelors z_Blue Collar 46 Commercial
## BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1 $14,230 11 Minivan yes $4,461 2 No 3
## 2 $14,940 1 Minivan yes $0 0 No 0
## 3 $4,010 4 z_SUV no $38,690 2 No 3
## 4 $15,440 7 Minivan yes $0 0 No 0
## 5 $18,000 1 z_SUV no $19,217 2 Yes 3
## 6 $17,430 1 Sports Car no $0 0 No 0
## CAR_AGE URBANICITY
## 1 18 Highly Urban/ Urban
## 2 1 Highly Urban/ Urban
## 3 10 Highly Urban/ Urban
## 4 6 Highly Urban/ Urban
## 5 17 Highly Urban/ Urban
## 6 7 Highly Urban/ Urban
#dimentions of thedataset
dim(data)
## [1] 8161 26
The data set contains 26 variables and 8161 records.
#select records of customers who got into car crash
data_crash <- subset(data, data$TARGET_FLAG == 1)
dim(data_crash)
## [1] 2153 26
In order to build a regression that predicts cost of auto accident we selected only records of customers who got into car accidents. The new data set contains 2153 records.
#display all variables
colnames(data_crash)
## [1] "INDEX" "TARGET_FLAG" "TARGET_AMT" "KIDSDRIV" "AGE"
## [6] "HOMEKIDS" "YOJ" "INCOME" "PARENT1" "HOME_VAL"
## [11] "MSTATUS" "SEX" "EDUCATION" "JOB" "TRAVTIME"
## [16] "CAR_USE" "BLUEBOOK" "TIF" "CAR_TYPE" "RED_CAR"
## [21] "OLDCLAIM" "CLM_FREQ" "REVOKED" "MVR_PTS" "CAR_AGE"
## [26] "URBANICITY"
The response variable ‘TARGET_AMT’ and explanatory variables ‘INCOME’, ‘YOJ’, ‘TRAVTIME’, ‘OLDCLAIM’ belong to continuous data type while the variables ‘AGE’, ‘CAR_AGE’, ‘MVR_PTS’ belong to discrete data type. The variables ‘JOB’ ,‘EDUCATION’,‘MS_STATUS’ and ‘SEX’ belong to categorical data type.
#clean data
#remove "$" and "z_"
data_crash <- data_crash %>% mutate(INCOME = str_replace(INCOME, "[^[:alnum:]]", ""), HOME_VAL = str_replace(HOME_VAL, "[^[:alnum:]]", ""),SEX = as.factor(str_replace(SEX, "z_", "")), OLDCLAIM = as.factor(str_replace(OLDCLAIM, "[^[:alnum:]]", "")), MSTATUS = as.factor(str_replace(MSTATUS, "z_", "")),EDUCATION = as.factor(str_replace(EDUCATION, "z_", "")), BLUEBOOK = as.factor(str_replace(BLUEBOOK, "[^[:alnum:]]", "")),EDUCATION = as.factor(str_replace(EDUCATION, "<", "")),JOB = as.factor(str_replace(JOB, "z_", "")),CAR_TYPE = as.factor(str_replace(CAR_TYPE, "z_", "")),URBANICITY = as.factor(str_replace(URBANICITY, "z_", "")))
#convert INCOME, HOME_VAL and OLDCLAIM to numeric
data_crash <- data_crash %>% mutate(INCOME = as.numeric(as.factor(INCOME)), HOME_VAL = as.numeric(as.factor(HOME_VAL)), OLDCLAIM = as.numeric(as.factor(OLDCLAIM)),BLUEBOOK = as.numeric(as.factor(BLUEBOOK)))
#data_testing <- data_testing %>% mutate(INCOME = as.numeric(as.factor(INCOME)), HOME_VAL = as.numeric(as.factor(HOME_VAL)), OLDCLAIM = as.numeric(as.factor(OLDCLAIM)),BLUEBOOK = as.numeric(as.factor(BLUEBOOK)))
#data_testing <- data_testing %>% mutate(INCOME = str_replace(INCOME, "[^[:alnum:]]", ""), HOME_VAL = str_replace(HOME_VAL, "[^[:alnum:]]", ""),SEX = as.factor(str_replace(SEX, "z_", "")), OLDCLAIM = as.factor(str_replace(OLDCLAIM, "[^[:alnum:]]", "")), MSTATUS = as.factor(str_replace(MSTATUS, "z_", "")),EDUCATION = as.factor(str_replace(EDUCATION, "z_", "")), BLUEBOOK = as.factor(str_replace(BLUEBOOK, "[^[:alnum:]]", "")),EDUCATION = as.factor(str_replace(EDUCATION, "<", "")),JOB = as.factor(str_replace(JOB, "z_", "")),CAR_TYPE = as.factor(str_replace(CAR_TYPE, "z_", "")),URBANICITY = as.factor(str_replace(URBANICITY, "z_", "")))
#omit NAs
data_crash <- na.omit(data_crash)
head(data_crash)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1
## 1 7 1 2946.000 0 34 1 12 160 Yes
## 3 12 1 2501.000 0 34 0 10 1315 No
## 4 14 1 6077.000 0 53 0 14 1536 No
## 5 17 1 1267.000 0 53 0 11 192 No
## 6 19 1 2920.167 0 45 0 0 2 No
## 7 25 1 6857.000 0 28 1 13 915 No
## HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE
## 1 2 No F Bachelors Blue Collar 46 Commercial
## 3 2 No F Bachelors Clerical 34 Private
## 4 2 No F Masters Lawyer 15 Private
## 5 2 No M PhD 64 Commercial
## 6 19 Yes F High School Home Maker 48 Private
## 7 370 Yes F High School Blue Collar 29 Commercial
## BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1 471 1 Sports Car no 1 0 No 0
## 3 96 1 SUV no 1 0 No 0
## 4 521 1 Sports Car no 1 0 No 0
## 5 919 6 Panel Truck yes 1 0 No 3
## 6 1147 1 SUV no 1 0 No 3
## 7 1314 6 SUV no 1141 2 No 0
## CAR_AGE URBANICITY
## 1 7 Highly Urban/ Urban
## 3 1 Highly Urban/ Urban
## 4 11 Highly Urban/ Urban
## 5 10 Highly Urban/ Urban
## 6 5 Highly Urban/ Urban
## 7 1 Highly Urban/ Urban
Since the response variable is a discrete variable and predictor variables are either discrete,continuous or categorical variables a linear regression was selected to build the model that predicts auto accident amount.
#build lm model using stepwise approach
linear_model.null = lm(TARGET_AMT ~ 1, data = data_crash)
linear_model.full = lm(TARGET_AMT ~ ., data = data_crash)
step(linear_model.null,
scope = list(upper=linear_model.full),
direction = "both",
data = data_crash)
## Start: AIC=33841.23
## TARGET_AMT ~ 1
##
## Df Sum of Sq RSS AIC
## + CAR_TYPE 5 881040861 1.0892e+11 33836
## + MSTATUS 1 277922853 1.0952e+11 33838
## + CAR_USE 1 220700204 1.0958e+11 33839
## + SEX 1 192617922 1.0960e+11 33840
## + REVOKED 1 118967437 1.0968e+11 33841
## <none> 1.0980e+11 33841
## + PARENT1 1 109443748 1.0969e+11 33841
## + MVR_PTS 1 105520144 1.0969e+11 33841
## + YOJ 1 103068817 1.0969e+11 33841
## + INCOME 1 42442311 1.0975e+11 33842
## + AGE 1 40333734 1.0976e+11 33843
## + CAR_AGE 1 26556088 1.0977e+11 33843
## + HOMEKIDS 1 21854988 1.0978e+11 33843
## + TRAVTIME 1 15490316 1.0978e+11 33843
## + CLM_FREQ 1 15181150 1.0978e+11 33843
## + RED_CAR 1 12635915 1.0978e+11 33843
## + TIF 1 8389260 1.0979e+11 33843
## + KIDSDRIV 1 7522012 1.0979e+11 33843
## + URBANICITY 1 4219827 1.0979e+11 33843
## + OLDCLAIM 1 4138664 1.0979e+11 33843
## + INDEX 1 2664273 1.0979e+11 33843
## + BLUEBOOK 1 2000888 1.0980e+11 33843
## + HOME_VAL 1 358143 1.0980e+11 33843
## + EDUCATION 3 172507522 1.0962e+11 33844
## + JOB 8 703216490 1.0909e+11 33845
##
## Step: AIC=33835.98
## TARGET_AMT ~ CAR_TYPE
##
## Df Sum of Sq RSS AIC
## + MSTATUS 1 251703068 1.0866e+11 33834
## + PARENT1 1 142021530 1.0877e+11 33836
## + MVR_PTS 1 127222706 1.0879e+11 33836
## <none> 1.0892e+11 33836
## + REVOKED 1 93890746 1.0882e+11 33836
## + YOJ 1 73539414 1.0884e+11 33837
## + CAR_AGE 1 66685043 1.0885e+11 33837
## + HOMEKIDS 1 45918691 1.0887e+11 33837
## + RED_CAR 1 42747717 1.0887e+11 33837
## + CAR_USE 1 32396980 1.0888e+11 33837
## + AGE 1 22473139 1.0889e+11 33838
## + INCOME 1 19628951 1.0890e+11 33838
## + HOME_VAL 1 17043881 1.0890e+11 33838
## + TRAVTIME 1 14533771 1.0890e+11 33838
## + CLM_FREQ 1 13619726 1.0890e+11 33838
## + BLUEBOOK 1 13308119 1.0890e+11 33838
## + TIF 1 8774827 1.0891e+11 33838
## + INDEX 1 7902835 1.0891e+11 33838
## + KIDSDRIV 1 7645685 1.0891e+11 33838
## + SEX 1 5274732 1.0891e+11 33838
## + OLDCLAIM 1 1073054 1.0891e+11 33838
## + URBANICITY 1 138093 1.0892e+11 33838
## + EDUCATION 3 88562536 1.0883e+11 33840
## - CAR_TYPE 5 881040861 1.0980e+11 33841
## + JOB 8 416607293 1.0850e+11 33845
##
## Step: AIC=33833.6
## TARGET_AMT ~ CAR_TYPE + MSTATUS
##
## Df Sum of Sq RSS AIC
## + MVR_PTS 1 118525976 1.0855e+11 33834
## + YOJ 1 117276981 1.0855e+11 33834
## <none> 1.0866e+11 33834
## + REVOKED 1 99751169 1.0856e+11 33834
## + CAR_AGE 1 83560584 1.0858e+11 33834
## + HOMEKIDS 1 64077606 1.0860e+11 33834
## + AGE 1 41863959 1.0862e+11 33835
## + CAR_USE 1 38465090 1.0863e+11 33835
## + RED_CAR 1 36858180 1.0863e+11 33835
## + INCOME 1 21228401 1.0864e+11 33835
## + PARENT1 1 18336946 1.0865e+11 33835
## + TRAVTIME 1 17957391 1.0865e+11 33835
## + BLUEBOOK 1 15786952 1.0865e+11 33835
## + CLM_FREQ 1 14920962 1.0865e+11 33835
## + TIF 1 14311099 1.0865e+11 33835
## + SEX 1 12175700 1.0865e+11 33835
## + KIDSDRIV 1 12032899 1.0865e+11 33835
## + INDEX 1 6305611 1.0866e+11 33835
## + HOME_VAL 1 6108348 1.0866e+11 33835
## + URBANICITY 1 1007899 1.0866e+11 33836
## + OLDCLAIM 1 546989 1.0866e+11 33836
## - MSTATUS 1 251703068 1.0892e+11 33836
## + EDUCATION 3 72614014 1.0859e+11 33838
## - CAR_TYPE 5 854821076 1.0952e+11 33838
## + JOB 8 411288798 1.0825e+11 33842
##
## Step: AIC=33833.53
## TARGET_AMT ~ CAR_TYPE + MSTATUS + MVR_PTS
##
## Df Sum of Sq RSS AIC
## + YOJ 1 125081593 1.0842e+11 33833
## <none> 1.0855e+11 33834
## - MVR_PTS 1 118525976 1.0866e+11 33834
## + REVOKED 1 94467874 1.0845e+11 33834
## + CAR_AGE 1 73876351 1.0847e+11 33834
## + HOMEKIDS 1 57153156 1.0849e+11 33835
## + CLM_FREQ 1 56020772 1.0849e+11 33835
## + AGE 1 45800105 1.0850e+11 33835
## + RED_CAR 1 40678583 1.0851e+11 33835
## + CAR_USE 1 31202382 1.0851e+11 33835
## + INCOME 1 22955745 1.0852e+11 33835
## + TRAVTIME 1 17361415 1.0853e+11 33835
## + OLDCLAIM 1 16829916 1.0853e+11 33835
## + BLUEBOOK 1 15224120 1.0853e+11 33835
## + PARENT1 1 14181990 1.0853e+11 33835
## + SEX 1 12232883 1.0853e+11 33835
## + TIF 1 11660306 1.0853e+11 33835
## + KIDSDRIV 1 10929573 1.0853e+11 33835
## + HOME_VAL 1 9557377 1.0854e+11 33835
## + INDEX 1 5860014 1.0854e+11 33835
## + URBANICITY 1 257193 1.0855e+11 33836
## - MSTATUS 1 243006338 1.0879e+11 33836
## + EDUCATION 3 74606002 1.0847e+11 33838
## - CAR_TYPE 5 874882524 1.0942e+11 33839
## + JOB 8 394337535 1.0815e+11 33843
##
## Step: AIC=33833.35
## TARGET_AMT ~ CAR_TYPE + MSTATUS + MVR_PTS + YOJ
##
## Df Sum of Sq RSS AIC
## <none> 1.0842e+11 33833
## + REVOKED 1 105936347 1.0831e+11 33834
## - YOJ 1 125081593 1.0855e+11 33834
## - MVR_PTS 1 126330588 1.0855e+11 33834
## + CAR_AGE 1 84989277 1.0834e+11 33834
## + CLM_FREQ 1 53926426 1.0837e+11 33834
## + HOMEKIDS 1 45961587 1.0837e+11 33835
## + RED_CAR 1 36371947 1.0838e+11 33835
## + CAR_USE 1 34715713 1.0839e+11 33835
## + AGE 1 29453290 1.0839e+11 33835
## + OLDCLAIM 1 17427765 1.0840e+11 33835
## + TRAVTIME 1 14913152 1.0841e+11 33835
## + PARENT1 1 12241647 1.0841e+11 33835
## + TIF 1 12152739 1.0841e+11 33835
## + SEX 1 11976117 1.0841e+11 33835
## + BLUEBOOK 1 10287714 1.0841e+11 33835
## + INDEX 1 7303702 1.0841e+11 33835
## + KIDSDRIV 1 6088028 1.0841e+11 33835
## + HOME_VAL 1 3782089 1.0842e+11 33835
## + INCOME 1 248465 1.0842e+11 33835
## + URBANICITY 1 495 1.0842e+11 33835
## - MSTATUS 1 287498666 1.0871e+11 33836
## - CAR_TYPE 5 835158704 1.0926e+11 33838
## + EDUCATION 3 73898025 1.0835e+11 33838
## + JOB 8 325899692 1.0809e+11 33844
##
## Call:
## lm(formula = TARGET_AMT ~ CAR_TYPE + MSTATUS + MVR_PTS + YOJ,
## data = data_crash)
##
## Coefficients:
## (Intercept) CAR_TYPEPanel Truck CAR_TYPEPickup
## 5059.38 2164.76 -168.90
## CAR_TYPESports Car CAR_TYPESUV CAR_TYPEVan
## 13.55 -158.22 876.20
## MSTATUSYes MVR_PTS YOJ
## -788.88 100.24 58.25
#optimal linear regressionmodel
optimal_model <- lm(formula = TARGET_AMT ~ CAR_TYPE + MSTATUS + MVR_PTS + YOJ,
data = data_crash)
summary(optimal_model)
##
## Call:
## lm(formula = TARGET_AMT ~ CAR_TYPE + MSTATUS + MVR_PTS + YOJ,
## data = data_crash)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7251 -3085 -1595 276 79768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5059.38 624.28 8.104 9.44e-16 ***
## CAR_TYPEPanel Truck 2164.76 742.96 2.914 0.00361 **
## CAR_TYPEPickup -168.90 579.26 -0.292 0.77064
## CAR_TYPESports Car 13.55 635.51 0.021 0.98299
## CAR_TYPESUV -158.22 535.58 -0.295 0.76771
## CAR_TYPEVan 876.20 721.42 1.215 0.22469
## MSTATUSYes -788.88 352.94 -2.235 0.02553 *
## MVR_PTS 100.24 67.65 1.482 0.13861
## YOJ 58.25 39.51 1.474 0.14057
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7586 on 1884 degrees of freedom
## Multiple R-squared: 0.01254, Adjusted R-squared: 0.008342
## F-statistic: 2.99 on 8 and 1884 DF, p-value: 0.002472
According to summary statistics, only the variables ‘CAR_TYPEPanel_Truck’ and ‘MSTATUS_Yes’ are statistically significant as their variables are leass that the level of significance of 5%.
The linear model that predicts cost of car accidents is described by the folowing equasion:
TARGET_AMT = 5059.38 + 2164.76CAR_TYPEPanel_Truckm + -$788.88MSTATUS_Yes
The intercept shows that car crash amount equal to 5059.38 when all other explanatory variables are 0s. The coefficient of 2164.76 indicates that car crash amount increases by 2164.76 if car type is panel truck. Whereas the coefficient of -788.88 indicates that car crash payout decreases by 788.88 if driver is married.