The purpose of this project is to solve a prediction problem using linear regression and, along the way, to explore various feature selection and model selection techniques. We predict insurance charges from details about the insured person. The data used in this project can be found here.
The dataset contains information about parameters such as age, sex, bmi, number of children, smoking status, and region.
Our goal is to build a robust regression model that can predict the insurance charges.
library(tidyverse)   # includes dplyr and ggplot2
library(gplots)
library(readxl)
library(GGally)
library(leaps)
library(gridExtra)
library(plotly)
library(car)         # infIndexPlot() and influencePlot()
library(mvinfluence)
library(ggthemes)
The data is stored in an Excel workbook (insurance_web). We need to load it and analyse the data for completeness.
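A minimal loading sketch, assuming the workbook is saved as insurance_web.xlsx with the data on the first sheet (the exact file name is an assumption):
df <- as.data.frame(read_excel("insurance_web.xlsx", sheet = 1)) # file name assumed; adjust to the actual workbook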
head(df)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
Convert the categorical variables to factors:
df$sex <- as.factor(df$sex)
df$smoker <- as.factor(df$smoker)
df$region <- as.factor(df$region)
str(df)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ charges : num 16885 1726 4449 21984 3867 ...
Observe the data
summary(df)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## northeast:324 Min. : 1122
## northwest:325 1st Qu.: 4740
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
We sample 80% of the observations for training the regression model and hold out the remaining 20% for testing it:
# split the data: 80% train, 20% test
# (calling set.seed() first would make the split, and all results below, reproducible)
rows <- sample(nrow(df), 0.8 * nrow(df), replace = FALSE)
train <- df[rows, ]
test <- df[-rows, ]
# checking the correlation among variables
ggpairs(train)
Since the dataset is small, we can use best subset selection for variable selection. The best subset selection algorithm searches over all subsets of predictors for the one that relates most closely to the response variable. Because it fits a model for every possible subset (2^p of them for p predictors), this method is not well suited to problems with a large number of variables.
We will use the following criteria for selecting the best subset of predictors for our linear model: adjusted R squared, Mallows' Cp, and BIC.
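For reference, the standard definitions (following ISLR) are Cp = (RSS + 2*d*sigma_hat^2)/n and BIC = (RSS + log(n)*d*sigma_hat^2)/n, where RSS is the residual sum of squares of the fitted subset, d is the number of predictors in it, and sigma_hat^2 estimates the error variance; lower is better for both. Adjusted R squared, 1 - (RSS/(n-d-1))/(TSS/(n-1)), is better when higher, since it penalises predictors that do not reduce RSS enough.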
#Best subset variable selection
best.subset<-regsubsets(charges~.,data = train,nvmax = 7)
best.subset.summary <- summary(best.subset)
# Now, we will use different criterion to select the best subset of the predictors for our problem
# 1. Adjused R squared
p1<-ggplot(data = NULL,aes(x=1:7,y = best.subset.summary$adjr2))+geom_line(color='white')+labs(x='Index',y='Adjusted R Squared',title='Adjusted R Squared')+
theme_hc(bgcolor = "darkunica")
# 2. CP
p2<-ggplot(data = NULL,aes(x=1:7,y = best.subset.summary$cp))+geom_line(color='white')+labs(x='Index',y='CP',title='CP')+
theme_hc(bgcolor = "darkunica")
# 3. BIC
p3<-ggplot(data = NULL,aes(x=1:7,y = best.subset.summary$bic))+geom_line(color='white')+labs(x='Index',y='BIC',title='BIC')+
theme_hc(bgcolor = "darkunica")
grid.arrange(p1,p2,p3)
From the above graphs, it is visible that the adjusted R squared stops improving appreciably after the third subset, while Cp and BIC reach (or come very close to) their minimum there.
Thus, the subset with index 3 is the best subset, the one that leads to optimum results. We will use this subset for our linear model.
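The visual choice can also be confirmed numerically; a quick sketch (with a different random split the winner could shift by one):
# index of the best subset under each criterion
which.max(best.subset.summary$adjr2)
which.min(best.subset.summary$cp)
which.min(best.subset.summary$bic)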
# The best model is model with index 3.
best.subset.summary$outmat[3,]
## age sexmale bmi children smokeryes
## "*" " " "*" " " "*"
## regionnorthwest regionsoutheast regionsouthwest
## " " " " " "
The best subset contains the predictors age, bmi, and smoker.
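As a cross-check, leaps can report the coefficient estimates for the chosen subset directly:
# coefficients of the 3-predictor model selected above
coef(best.subset, 3)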
Before building the model, we visualise the relationship among these variables in a 3D graph, just to get a sense of the data:
p <- plot_ly(train, x = ~age, y = ~charges, z = ~bmi, color = ~smoker, colors = c('red', 'blue'),alpha=0.5) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'age'),
yaxis = list(title = 'charges'),
zaxis = list(title = 'bmi')))
p
We go ahead and build the model:
# Building linear regression model
model<-lm(data=train, charges~age+bmi+smoker)
summary(model)
##
## Call:
## lm(formula = charges ~ age + bmi + smoker, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12644 -3031 -1030 1382 28903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11182.91 1061.46 -10.54 <2e-16 ***
## age 256.25 13.47 19.02 <2e-16 ***
## bmi 310.28 30.89 10.04 <2e-16 ***
## smokeryes 23921.96 463.45 51.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6136 on 1066 degrees of freedom
## Multiple R-squared: 0.746, Adjusted R-squared: 0.7453
## F-statistic: 1043 on 3 and 1066 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model)
par(mfrow=c(1,1))
infIndexPlot(model, var="cook", main="Index Influence Plot")
From the graphs we can see that:

* The residuals are approximately normally distributed (Q-Q plot).
* The residuals have roughly constant variance.
* There are no obvious high leverage or high influence points.
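As a numeric complement to the plots, one common rule of thumb flags observations whose Cook's distance exceeds 4/n (the cutoff is a convention, not something estimated from this data):
# observations above the 4/n Cook's distance rule of thumb
which(cooks.distance(model) > 4/nrow(train))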
The observation with index 6 has a high Cook's distance of 0.038. Let's take a closer look at this observation:
# this point has a high studentized residual (> 4 but < 5)
a <- as.numeric(train[4, "bmi"])      # bmi of the flagged observation
b <- as.numeric(train[4, "charges"])  # charges is column 7; column 6 is region
ggplot(data = train, aes(x = bmi, y = charges, color = smoker)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  geom_point(aes(x = a, y = b), color = 'Purple') +
  annotate("text", x = a, y = b + 2000, label = "Possible Outlier", col = 'Purple') + # label just above the point
  theme_hc(bgcolor = "darkunica")
influencePlot(model)
## StudRes Hat CookD
## 1318 -1.439235 0.017027088 0.008961194
## 1048 1.615232 0.018720591 0.012424580
## 1301 4.769467 0.004737299 0.026527834
## 578 4.296606 0.006285999 0.028724195
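The same information can be pulled out directly with rstudent() from base R's stats package (the row labels will differ for a different random split):
# four largest absolute studentized residuals, matching what influencePlot() flags
head(sort(abs(rstudent(model)), decreasing = TRUE), 4)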
Now we predict the insurance charges for the observations in the test data.
test$predictedcharges<-predict(model,test)
cor.test(test$charges,test$predictedcharges)
##
## Pearson's product-moment correlation
##
## data: test$charges and test$predictedcharges
## t = 28.499, df = 266, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8349328 0.8947006
## sample estimates:
## cor
## 0.8679245
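Correlation alone can flatter a model, so an error metric on the scale of the response is a useful complement; a small sketch in base R:
# RMSE and MAE of the predictions on the held-out test set
rmse <- sqrt(mean((test$charges - test$predictedcharges)^2))
mae <- mean(abs(test$charges - test$predictedcharges))
c(RMSE = rmse, MAE = mae)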
ggplot(data = test, aes(x = charges, y = predictedcharges, color = smoker)) +
  geom_point() +
  labs(x = 'Actual Charges', y = 'Predicted Charges', title = 'Prediction Graph') +
  theme_hc(bgcolor = "darkunica")
We can see that the correlation coefficient between the actual charges and the predicted charges is 0.8679, with a 95% confidence interval of (0.8349, 0.8947). Squaring this correlation, the model explains roughly 75% of the variance in the test-set charges, so the predictions are fairly accurate.
Thus, we can say that the linear regression model does a good job of predicting the insurance charges.