Linear regression model that predicts auto accidenten cost

The data set contains records representing a customer at an auto insurance company.

#read data set
data <- read.csv(file=
"https://raw.githubusercontent.com/olga0503/DATA-621/master/insurance_training_data.csv",
stringsAsFactors=T, header=T)
head(data)
##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ   INCOME PARENT1
## 1     1           0          0        0  60        0  11  $67,349      No
## 2     2           0          0        0  43        0  11  $91,449      No
## 3     4           0          0        0  35        1  10  $16,039      No
## 4     5           0          0        0  51        0  14               No
## 5     6           0          0        0  50        0  NA $114,986      No
## 6     7           1       2946        0  34        1  12 $125,301     Yes
##   HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE
## 1       $0    z_No   M           PhD  Professional       14    Private
## 2 $257,252    z_No   M z_High School z_Blue Collar       22 Commercial
## 3 $124,191     Yes z_F z_High School      Clerical        5    Private
## 4 $306,251     Yes   M  <High School z_Blue Collar       32    Private
## 5 $243,925     Yes z_F           PhD        Doctor       36    Private
## 6       $0    z_No z_F     Bachelors z_Blue Collar       46 Commercial
##   BLUEBOOK TIF   CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1  $14,230  11    Minivan     yes   $4,461        2      No       3
## 2  $14,940   1    Minivan     yes       $0        0      No       0
## 3   $4,010   4      z_SUV      no  $38,690        2      No       3
## 4  $15,440   7    Minivan     yes       $0        0      No       0
## 5  $18,000   1      z_SUV      no  $19,217        2     Yes       3
## 6  $17,430   1 Sports Car      no       $0        0      No       0
##   CAR_AGE          URBANICITY
## 1      18 Highly Urban/ Urban
## 2       1 Highly Urban/ Urban
## 3      10 Highly Urban/ Urban
## 4       6 Highly Urban/ Urban
## 5      17 Highly Urban/ Urban
## 6       7 Highly Urban/ Urban
#dimentions of thedataset
dim(data)
## [1] 8161   26

The data set contains 26 variables and 8161 records.

#select records of customers who got into car crash
data_crash <- subset(data, data$TARGET_FLAG == 1)
dim(data_crash)
## [1] 2153   26

In order to build a regression that predicts cost of auto accident we selected only records of customers who got into car accidents. The new data set contains 2153 records.

#display all variables
colnames(data_crash)
##  [1] "INDEX"       "TARGET_FLAG" "TARGET_AMT"  "KIDSDRIV"    "AGE"        
##  [6] "HOMEKIDS"    "YOJ"         "INCOME"      "PARENT1"     "HOME_VAL"   
## [11] "MSTATUS"     "SEX"         "EDUCATION"   "JOB"         "TRAVTIME"   
## [16] "CAR_USE"     "BLUEBOOK"    "TIF"         "CAR_TYPE"    "RED_CAR"    
## [21] "OLDCLAIM"    "CLM_FREQ"    "REVOKED"     "MVR_PTS"     "CAR_AGE"    
## [26] "URBANICITY"

The response variable ‘TARGET_AMT’ and explanatory variables ‘INCOME’, ‘YOJ’, ‘TRAVTIME’, ‘OLDCLAIM’ belong to continuous data type while the variables ‘AGE’, ‘CAR_AGE’, ‘MVR_PTS’ belong to discrete data type. The variables ‘JOB’ ,‘EDUCATION’,‘MS_STATUS’ and ‘SEX’ belong to categorical data type.

#clean data
#remove "$" and "z_"
data_crash <- data_crash %>% mutate(INCOME = str_replace(INCOME, "[^[:alnum:]]", ""), HOME_VAL = str_replace(HOME_VAL, "[^[:alnum:]]", ""),SEX = as.factor(str_replace(SEX, "z_", "")), OLDCLAIM = as.factor(str_replace(OLDCLAIM, "[^[:alnum:]]", "")), MSTATUS = as.factor(str_replace(MSTATUS, "z_", "")),EDUCATION = as.factor(str_replace(EDUCATION, "z_", "")), BLUEBOOK = as.factor(str_replace(BLUEBOOK, "[^[:alnum:]]", "")),EDUCATION = as.factor(str_replace(EDUCATION, "<", "")),JOB = as.factor(str_replace(JOB, "z_", "")),CAR_TYPE = as.factor(str_replace(CAR_TYPE, "z_", "")),URBANICITY = as.factor(str_replace(URBANICITY, "z_", "")))
#convert INCOME, HOME_VAL and OLDCLAIM to numeric
data_crash <- data_crash %>% mutate(INCOME = as.numeric(as.factor(INCOME)), HOME_VAL = as.numeric(as.factor(HOME_VAL)), OLDCLAIM = as.numeric(as.factor(OLDCLAIM)),BLUEBOOK = as.numeric(as.factor(BLUEBOOK)))
#data_testing <- data_testing %>% mutate(INCOME = as.numeric(as.factor(INCOME)), HOME_VAL = as.numeric(as.factor(HOME_VAL)), OLDCLAIM = as.numeric(as.factor(OLDCLAIM)),BLUEBOOK = as.numeric(as.factor(BLUEBOOK))) 

#data_testing <- data_testing %>% mutate(INCOME = str_replace(INCOME, "[^[:alnum:]]", ""), HOME_VAL = str_replace(HOME_VAL, "[^[:alnum:]]", ""),SEX = as.factor(str_replace(SEX, "z_", "")), OLDCLAIM = as.factor(str_replace(OLDCLAIM, "[^[:alnum:]]", "")), MSTATUS = as.factor(str_replace(MSTATUS, "z_", "")),EDUCATION = as.factor(str_replace(EDUCATION, "z_", "")), BLUEBOOK = as.factor(str_replace(BLUEBOOK, "[^[:alnum:]]", "")),EDUCATION = as.factor(str_replace(EDUCATION, "<", "")),JOB = as.factor(str_replace(JOB, "z_", "")),CAR_TYPE = as.factor(str_replace(CAR_TYPE, "z_", "")),URBANICITY = as.factor(str_replace(URBANICITY, "z_", "")))

#omit NAs
data_crash <- na.omit(data_crash)
head(data_crash)
##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1
## 1     7           1   2946.000        0  34        1  12    160     Yes
## 3    12           1   2501.000        0  34        0  10   1315      No
## 4    14           1   6077.000        0  53        0  14   1536      No
## 5    17           1   1267.000        0  53        0  11    192      No
## 6    19           1   2920.167        0  45        0   0      2      No
## 7    25           1   6857.000        0  28        1  13    915      No
##   HOME_VAL MSTATUS SEX   EDUCATION         JOB TRAVTIME    CAR_USE
## 1        2      No   F   Bachelors Blue Collar       46 Commercial
## 3        2      No   F   Bachelors    Clerical       34    Private
## 4        2      No   F     Masters      Lawyer       15    Private
## 5        2      No   M         PhD                   64 Commercial
## 6       19     Yes   F High School  Home Maker       48    Private
## 7      370     Yes   F High School Blue Collar       29 Commercial
##   BLUEBOOK TIF    CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1      471   1  Sports Car      no        1        0      No       0
## 3       96   1         SUV      no        1        0      No       0
## 4      521   1  Sports Car      no        1        0      No       0
## 5      919   6 Panel Truck     yes        1        0      No       3
## 6     1147   1         SUV      no        1        0      No       3
## 7     1314   6         SUV      no     1141        2      No       0
##   CAR_AGE          URBANICITY
## 1       7 Highly Urban/ Urban
## 3       1 Highly Urban/ Urban
## 4      11 Highly Urban/ Urban
## 5      10 Highly Urban/ Urban
## 6       5 Highly Urban/ Urban
## 7       1 Highly Urban/ Urban

Since the response variable is a discrete variable and predictor variables are either discrete,continuous or categorical variables a linear regression was selected to build the model that predicts auto accident amount.

#build lm model using stepwise approach
linear_model.null = lm(TARGET_AMT ~ 1, data = data_crash)

linear_model.full = lm(TARGET_AMT ~ ., data = data_crash)
     
step(linear_model.null,
     scope = list(upper=linear_model.full),
             direction = "both",
             data = data_crash)
## Start:  AIC=33841.23
## TARGET_AMT ~ 1
## 
##              Df Sum of Sq        RSS   AIC
## + CAR_TYPE    5 881040861 1.0892e+11 33836
## + MSTATUS     1 277922853 1.0952e+11 33838
## + CAR_USE     1 220700204 1.0958e+11 33839
## + SEX         1 192617922 1.0960e+11 33840
## + REVOKED     1 118967437 1.0968e+11 33841
## <none>                    1.0980e+11 33841
## + PARENT1     1 109443748 1.0969e+11 33841
## + MVR_PTS     1 105520144 1.0969e+11 33841
## + YOJ         1 103068817 1.0969e+11 33841
## + INCOME      1  42442311 1.0975e+11 33842
## + AGE         1  40333734 1.0976e+11 33843
## + CAR_AGE     1  26556088 1.0977e+11 33843
## + HOMEKIDS    1  21854988 1.0978e+11 33843
## + TRAVTIME    1  15490316 1.0978e+11 33843
## + CLM_FREQ    1  15181150 1.0978e+11 33843
## + RED_CAR     1  12635915 1.0978e+11 33843
## + TIF         1   8389260 1.0979e+11 33843
## + KIDSDRIV    1   7522012 1.0979e+11 33843
## + URBANICITY  1   4219827 1.0979e+11 33843
## + OLDCLAIM    1   4138664 1.0979e+11 33843
## + INDEX       1   2664273 1.0979e+11 33843
## + BLUEBOOK    1   2000888 1.0980e+11 33843
## + HOME_VAL    1    358143 1.0980e+11 33843
## + EDUCATION   3 172507522 1.0962e+11 33844
## + JOB         8 703216490 1.0909e+11 33845
## 
## Step:  AIC=33835.98
## TARGET_AMT ~ CAR_TYPE
## 
##              Df Sum of Sq        RSS   AIC
## + MSTATUS     1 251703068 1.0866e+11 33834
## + PARENT1     1 142021530 1.0877e+11 33836
## + MVR_PTS     1 127222706 1.0879e+11 33836
## <none>                    1.0892e+11 33836
## + REVOKED     1  93890746 1.0882e+11 33836
## + YOJ         1  73539414 1.0884e+11 33837
## + CAR_AGE     1  66685043 1.0885e+11 33837
## + HOMEKIDS    1  45918691 1.0887e+11 33837
## + RED_CAR     1  42747717 1.0887e+11 33837
## + CAR_USE     1  32396980 1.0888e+11 33837
## + AGE         1  22473139 1.0889e+11 33838
## + INCOME      1  19628951 1.0890e+11 33838
## + HOME_VAL    1  17043881 1.0890e+11 33838
## + TRAVTIME    1  14533771 1.0890e+11 33838
## + CLM_FREQ    1  13619726 1.0890e+11 33838
## + BLUEBOOK    1  13308119 1.0890e+11 33838
## + TIF         1   8774827 1.0891e+11 33838
## + INDEX       1   7902835 1.0891e+11 33838
## + KIDSDRIV    1   7645685 1.0891e+11 33838
## + SEX         1   5274732 1.0891e+11 33838
## + OLDCLAIM    1   1073054 1.0891e+11 33838
## + URBANICITY  1    138093 1.0892e+11 33838
## + EDUCATION   3  88562536 1.0883e+11 33840
## - CAR_TYPE    5 881040861 1.0980e+11 33841
## + JOB         8 416607293 1.0850e+11 33845
## 
## Step:  AIC=33833.6
## TARGET_AMT ~ CAR_TYPE + MSTATUS
## 
##              Df Sum of Sq        RSS   AIC
## + MVR_PTS     1 118525976 1.0855e+11 33834
## + YOJ         1 117276981 1.0855e+11 33834
## <none>                    1.0866e+11 33834
## + REVOKED     1  99751169 1.0856e+11 33834
## + CAR_AGE     1  83560584 1.0858e+11 33834
## + HOMEKIDS    1  64077606 1.0860e+11 33834
## + AGE         1  41863959 1.0862e+11 33835
## + CAR_USE     1  38465090 1.0863e+11 33835
## + RED_CAR     1  36858180 1.0863e+11 33835
## + INCOME      1  21228401 1.0864e+11 33835
## + PARENT1     1  18336946 1.0865e+11 33835
## + TRAVTIME    1  17957391 1.0865e+11 33835
## + BLUEBOOK    1  15786952 1.0865e+11 33835
## + CLM_FREQ    1  14920962 1.0865e+11 33835
## + TIF         1  14311099 1.0865e+11 33835
## + SEX         1  12175700 1.0865e+11 33835
## + KIDSDRIV    1  12032899 1.0865e+11 33835
## + INDEX       1   6305611 1.0866e+11 33835
## + HOME_VAL    1   6108348 1.0866e+11 33835
## + URBANICITY  1   1007899 1.0866e+11 33836
## + OLDCLAIM    1    546989 1.0866e+11 33836
## - MSTATUS     1 251703068 1.0892e+11 33836
## + EDUCATION   3  72614014 1.0859e+11 33838
## - CAR_TYPE    5 854821076 1.0952e+11 33838
## + JOB         8 411288798 1.0825e+11 33842
## 
## Step:  AIC=33833.53
## TARGET_AMT ~ CAR_TYPE + MSTATUS + MVR_PTS
## 
##              Df Sum of Sq        RSS   AIC
## + YOJ         1 125081593 1.0842e+11 33833
## <none>                    1.0855e+11 33834
## - MVR_PTS     1 118525976 1.0866e+11 33834
## + REVOKED     1  94467874 1.0845e+11 33834
## + CAR_AGE     1  73876351 1.0847e+11 33834
## + HOMEKIDS    1  57153156 1.0849e+11 33835
## + CLM_FREQ    1  56020772 1.0849e+11 33835
## + AGE         1  45800105 1.0850e+11 33835
## + RED_CAR     1  40678583 1.0851e+11 33835
## + CAR_USE     1  31202382 1.0851e+11 33835
## + INCOME      1  22955745 1.0852e+11 33835
## + TRAVTIME    1  17361415 1.0853e+11 33835
## + OLDCLAIM    1  16829916 1.0853e+11 33835
## + BLUEBOOK    1  15224120 1.0853e+11 33835
## + PARENT1     1  14181990 1.0853e+11 33835
## + SEX         1  12232883 1.0853e+11 33835
## + TIF         1  11660306 1.0853e+11 33835
## + KIDSDRIV    1  10929573 1.0853e+11 33835
## + HOME_VAL    1   9557377 1.0854e+11 33835
## + INDEX       1   5860014 1.0854e+11 33835
## + URBANICITY  1    257193 1.0855e+11 33836
## - MSTATUS     1 243006338 1.0879e+11 33836
## + EDUCATION   3  74606002 1.0847e+11 33838
## - CAR_TYPE    5 874882524 1.0942e+11 33839
## + JOB         8 394337535 1.0815e+11 33843
## 
## Step:  AIC=33833.35
## TARGET_AMT ~ CAR_TYPE + MSTATUS + MVR_PTS + YOJ
## 
##              Df Sum of Sq        RSS   AIC
## <none>                    1.0842e+11 33833
## + REVOKED     1 105936347 1.0831e+11 33834
## - YOJ         1 125081593 1.0855e+11 33834
## - MVR_PTS     1 126330588 1.0855e+11 33834
## + CAR_AGE     1  84989277 1.0834e+11 33834
## + CLM_FREQ    1  53926426 1.0837e+11 33834
## + HOMEKIDS    1  45961587 1.0837e+11 33835
## + RED_CAR     1  36371947 1.0838e+11 33835
## + CAR_USE     1  34715713 1.0839e+11 33835
## + AGE         1  29453290 1.0839e+11 33835
## + OLDCLAIM    1  17427765 1.0840e+11 33835
## + TRAVTIME    1  14913152 1.0841e+11 33835
## + PARENT1     1  12241647 1.0841e+11 33835
## + TIF         1  12152739 1.0841e+11 33835
## + SEX         1  11976117 1.0841e+11 33835
## + BLUEBOOK    1  10287714 1.0841e+11 33835
## + INDEX       1   7303702 1.0841e+11 33835
## + KIDSDRIV    1   6088028 1.0841e+11 33835
## + HOME_VAL    1   3782089 1.0842e+11 33835
## + INCOME      1    248465 1.0842e+11 33835
## + URBANICITY  1       495 1.0842e+11 33835
## - MSTATUS     1 287498666 1.0871e+11 33836
## - CAR_TYPE    5 835158704 1.0926e+11 33838
## + EDUCATION   3  73898025 1.0835e+11 33838
## + JOB         8 325899692 1.0809e+11 33844
## 
## Call:
## lm(formula = TARGET_AMT ~ CAR_TYPE + MSTATUS + MVR_PTS + YOJ, 
##     data = data_crash)
## 
## Coefficients:
##         (Intercept)  CAR_TYPEPanel Truck       CAR_TYPEPickup  
##             5059.38              2164.76              -168.90  
##  CAR_TYPESports Car          CAR_TYPESUV          CAR_TYPEVan  
##               13.55              -158.22               876.20  
##          MSTATUSYes              MVR_PTS                  YOJ  
##             -788.88               100.24                58.25
#optimal linear regressionmodel
optimal_model <- lm(formula = TARGET_AMT ~ CAR_TYPE + MSTATUS + MVR_PTS + YOJ, 
    data = data_crash)
summary(optimal_model)
## 
## Call:
## lm(formula = TARGET_AMT ~ CAR_TYPE + MSTATUS + MVR_PTS + YOJ, 
##     data = data_crash)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7251  -3085  -1595    276  79768 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5059.38     624.28   8.104 9.44e-16 ***
## CAR_TYPEPanel Truck  2164.76     742.96   2.914  0.00361 ** 
## CAR_TYPEPickup       -168.90     579.26  -0.292  0.77064    
## CAR_TYPESports Car     13.55     635.51   0.021  0.98299    
## CAR_TYPESUV          -158.22     535.58  -0.295  0.76771    
## CAR_TYPEVan           876.20     721.42   1.215  0.22469    
## MSTATUSYes           -788.88     352.94  -2.235  0.02553 *  
## MVR_PTS               100.24      67.65   1.482  0.13861    
## YOJ                    58.25      39.51   1.474  0.14057    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7586 on 1884 degrees of freedom
## Multiple R-squared:  0.01254,    Adjusted R-squared:  0.008342 
## F-statistic:  2.99 on 8 and 1884 DF,  p-value: 0.002472

According to summary statistics, only the variables ‘CAR_TYPEPanel_Truck’ and ‘MSTATUS_Yes’ are statistically significant as their variables are leass that the level of significance of 5%.

The linear model that predicts cost of car accidents is described by the folowing equasion:

TARGET_AMT = 5059.38 + 2164.76CAR_TYPEPanel_Truckm + -$788.88MSTATUS_Yes

The intercept shows that car crash amount equal to 5059.38 when all other explanatory variables are 0s. The coefficient of 2164.76 indicates that car crash amount increases by 2164.76 if car type is panel truck. Whereas the coefficient of -788.88 indicates that car crash payout decreases by 788.88 if driver is married.