Prediction of Insurance Charges

The purpose of this project is to solve a prediction problem using linear regression and to explore various feature selection and model selection techniques. We predict insurance charges from details about the policyholder. The data used in this project can be found here.

The dataset contains information about various attributes such as age, bmi, smoker status, number of children, etc.

The goal is to come up with a robust regression model that can predict the insurance charges.

Packages Required

library(tidyverse)
library(dplyr)
library(gplots)
library(ggplot2)
library(readxl)
library(GGally)
library(leaps)
library(gridExtra)
library(plotly)
library(car)          # provides influencePlot() and infIndexPlot(), used below
library(mvinfluence)
library(ggthemes)

Loading and cleaning the data

The data is stored in an Excel workbook called insurance_web. We need to load it and analyse the data for completeness.
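A minimal loading sketch is shown below; the file name insurance_web.xlsx is an assumption, since the original path is not given.

# Assumed file name/path; adjust to wherever the workbook is stored
df <- read_excel("insurance_web.xlsx")
df <- as.data.frame(df)   # plain data.frame, matching the str() output below

# Completeness check: count missing values per column
colSums(is.na(df))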

head(df)
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

Change data types

df$sex <- as.factor(df$sex)
df$smoker <- as.factor(df$smoker)
df$region <- as.factor(df$region)
str(df)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

Observe the data

summary(df)
##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.66   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.13   Max.   :5.000             
##        region       charges     
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

We sample 80% of the observations for training the regression model and the remaining 20% for testing it:

set.seed(123)   # fix the random split so the results are reproducible
rows <- sample(nrow(df), round(0.8 * nrow(df)), replace = FALSE)
train <- df[rows, ]
test <- df[-rows, ]

# checking the correlation among variables
ggpairs(train) 

Since the dataset is small, we can use best subset selection for variable selection. The best subset selection algorithm searches over all possible subsets of predictors for the one that relates most closely to the response variable. Because the number of candidate subsets grows exponentially with the number of predictors, this method is not well suited to problems with many variables.

We will use the Adjusted R squared, Mallows' Cp and BIC as criteria for selecting the best subset of predictors for our linear model.

#Best subset variable selection
best.subset<-regsubsets(charges~.,data = train,nvmax = 7)
best.subset.summary <- summary(best.subset)

# Now, we will use different criterion to select the best subset of the predictors for our problem
# 1. Adjusted R squared
p1<-ggplot(data = NULL,aes(x=1:7,y = best.subset.summary$adjr2))+geom_line(color='white')+labs(x='Index',y='Adjusted R Squared',title='Adjusted R Squared')+
  theme_hc(bgcolor = "darkunica")

# 2. CP
p2<-ggplot(data = NULL,aes(x=1:7,y = best.subset.summary$cp))+geom_line(color='white')+labs(x='Index',y='CP',title='CP')+
  theme_hc(bgcolor = "darkunica")

# 3. BIC
p3<-ggplot(data = NULL,aes(x=1:7,y = best.subset.summary$bic))+geom_line(color='white')+labs(x='Index',y='BIC',title='BIC')+
  theme_hc(bgcolor = "darkunica")

grid.arrange(p1,p2,p3)

From the above graphs it is visible that the Adjusted R squared stops improving appreciably, and Cp and BIC reach their lowest values, around a subset size of three.

Thus, the subset with index 3 is the best subset, the one that leads to optimum results. We will use this subset for our linear model.
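As a quick sanity check, the optimal subset size under each criterion can also be read off programmatically from the summary object computed above:

# Subset size favoured by each criterion
which.max(best.subset.summary$adjr2)   # adjusted R squared: larger is better
which.min(best.subset.summary$cp)      # Mallows' Cp: smaller is better
which.min(best.subset.summary$bic)     # BIC: smaller is better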

# The best model is model with index 3.
best.subset.summary$outmat[3,]
##             age         sexmale             bmi        children       smokeryes 
##             "*"             " "             "*"             " "             "*" 
## regionnorthwest regionsoutheast regionsouthwest 
##             " "             " "             " "

The best subset contains the predictors age, bmi and smoker.
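The coefficients of this three-predictor fit can be extracted directly from the regsubsets object:

# Coefficients of the best three-variable model
coef(best.subset, 3)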

Before building the model, we visualise the relationship among the variables in a 3D graph, just to get a sense of the data:

p <- plot_ly(train, x = ~age, y = ~charges, z = ~bmi, color = ~smoker, colors = c('red', 'blue'),alpha=0.5) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'age'),
                      yaxis = list(title = 'charges'),
                      zaxis = list(title = 'bmi')))
p

Building the model

We go ahead and build the linear regression model with the selected predictors:

# Building linear regression model
model<-lm(data=train, charges~age+bmi+smoker)
summary(model)
## 
## Call:
## lm(formula = charges ~ age + bmi + smoker, data = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12644  -3031  -1030   1382  28903 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11182.91    1061.46  -10.54   <2e-16 ***
## age            256.25      13.47   19.02   <2e-16 ***
## bmi            310.28      30.89   10.04   <2e-16 ***
## smokeryes    23921.96     463.45   51.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6136 on 1066 degrees of freedom
## Multiple R-squared:  0.746,  Adjusted R-squared:  0.7453 
## F-statistic:  1043 on 3 and 1066 DF,  p-value: < 2.2e-16
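Confidence intervals for the coefficients complement the point estimates above:

# 95% confidence intervals for the model coefficients
confint(model)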

Model Diagnostics

par(mfrow=c(2,2))
plot(model)

par(mfrow=c(1,1))
infIndexPlot(model, var="cook", main="Index Influence Plot")

From the graphs we can see that:

* The residuals are normally distributed (Q-Q plot)
* The residuals have constant variance
* There are no obvious high leverage or high influence points
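These visual impressions can be cross-checked with formal tests. The sketch below assumes the lmtest package is installed; it is not part of the package list above.

# Breusch-Pagan test for heteroscedasticity (H0: constant variance)
library(lmtest)   # assumed installed; not loaded earlier
bptest(model)

# Shapiro-Wilk test on standardised residuals (H0: normality)
shapiro.test(rstandard(model))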

The observation with index 6 has a high Cook's distance of 0.038. Let's take a closer look at this observation:

# this point has a high Studentized residual (> 4 but < 5)
a <- as.numeric(train[6, "bmi"])       # bmi of the flagged observation
b <- as.numeric(train[6, "charges"])   # charges of the flagged observation
ggplot(data = train, aes(x = bmi, y = charges, color = smoker)) + geom_point() + geom_smooth(method = 'lm') +
  geom_point(aes(x = a, y = b), color = 'Purple') + annotate("text", x = a, y = b + 5400, label = "Possible Outlier", col = 'Purple') +
  theme_hc(bgcolor = "darkunica")

influencePlot(model)

##        StudRes         Hat       CookD
## 1318 -1.439235 0.017027088 0.008961194
## 1048  1.615232 0.018720591 0.012424580
## 1301  4.769467 0.004737299 0.026527834
## 578   4.296606 0.006285999 0.028724195
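To gauge how sensitive the fit is to these flagged points, one can refit without the large-residual observations and compare coefficients. This is a sketch for diagnostics only, not a change to the final model.

# Drop observations with |studentized residual| > 4 and refit
flagged <- which(abs(rstudent(model)) > 4)
model.refit <- lm(charges ~ age + bmi + smoker, data = train[-flagged, ])
round(cbind(original = coef(model), refit = coef(model.refit)), 2)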

Prediction on the test data

Now we predict the insurance charges for the observations in the test set.

test$predictedcharges<-predict(model,test)
cor.test(test$charges,test$predictedcharges)
## 
##  Pearson's product-moment correlation
## 
## data:  test$charges and test$predictedcharges
## t = 28.499, df = 266, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8349328 0.8947006
## sample estimates:
##       cor 
## 0.8679245
ggplot(data = test, aes(x = charges, y = predictedcharges, color = smoker)) + geom_point() +
  labs(x = 'Actual Charges', y = 'Predicted Charges', title = 'Prediction Graph') +
  theme_hc(bgcolor = "darkunica")

We can see that the correlation coefficient between the actual and predicted charges is 0.868, with a 95% confidence interval of (0.835, 0.895). The predictions track the actual charges closely: squaring the correlation implies the model explains roughly 75% of the variance in the test charges.
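Correlation measures how well the predictions track the actual charges but not the size of the errors; error metrics on the test set give that scale.

# Root mean squared error and mean absolute error on the test set
rmse <- sqrt(mean((test$charges - test$predictedcharges)^2))
mae <- mean(abs(test$charges - test$predictedcharges))
c(RMSE = rmse, MAE = mae)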

Thus, we can say that the linear regression model does a good job of predicting the insurance charges.