Multiple Linear Regression Analysis For Future Medical Expenses

Introduction:

The purpose of this analysis is to examine how several features of an individual, such as age, physical and family condition, and location, relate to their existing medical expenses, and to fit a multiple linear regression on those features. The fitted model can then be used to predict the future medical expenses of individuals, which helps medical insurers decide what premium to charge.

I have used the dataset called insurance.csv. It contains 1338 observations (rows) and 7 features (columns): 4 numerical features (age, bmi, children and expenses) and 3 nominal features (sex, smoker and region), which are read in as factors with a numeric code designated for each level. The file as imported also carries two empty columns (X and X.1, all NA), which is why str() reports 9 variables.

Analysis:

Import the data set and look at its structure:

# Read the CSV, converting text columns (sex, smoker, region) to factors
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
str(insurance)

## 'data.frame':    1338 obs. of  9 variables:
##  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ expenses: num  16885 1726 4449 21984 3867 ...
##  $ X       : logi  NA NA NA NA NA NA ...
##  $ X.1     : logi  NA NA NA NA NA NA ...
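The two trailing columns, X and X.1, hold only NA values. A minimal sketch of dropping such all-NA columns is shown here; the small data frame `demo` is made up for illustration (its first values are copied from the output above) and stands in for `insurance`.

```r
# Toy stand-in for the imported data; not the full insurance data set.
demo <- data.frame(
  age      = c(19L, 18L, 28L),
  expenses = c(16885, 1726, 4449),
  X        = c(NA, NA, NA),
  X.1      = c(NA, NA, NA)
)

# Keep only the columns that have at least one non-missing value.
keep <- colSums(!is.na(demo)) > 0
demo_clean <- demo[, keep, drop = FALSE]

names(demo_clean)
```

Applied to the real data, the same two lines with `insurance` in place of `demo` would presumably leave the 7 features described in the introduction.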

Let’s print the head (first 6 rows) of the data set:

suppressMessages(library(dplyr))
head(insurance)

Now look at the summary of the data set:

summary(insurance)

##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :16.00   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.67   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.70   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.10   Max.   :5.000             
##        region       expenses        X             X.1         
##  northeast:324   Min.   : 1122   Mode:logical   Mode:logical  
##  northwest:325   1st Qu.: 4740   NA's:1338      NA's:1338     
##  southeast:364   Median : 9382                                
##  southwest:325   Mean   :13270                                
##                  3rd Qu.:16640                                
##                  Max.   :63770

Histogram of the medical expenses:

hist(insurance$expenses, main="Histogram of medical expenses",breaks = 100,xlab = "Expenses",ylab = "Frequency",col="lightblue")

The histogram of the medical expenses, drawn with roughly 100 bins (breaks = 100), shows a right-skewed distribution. This agrees with the summary, where the mean (13270) is well above the median (9382).
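Note that `breaks = 100` asks `hist()` for about 100 bins; it does not restrict the plot to 100 observations. The sketch below illustrates this on simulated right-skewed values (the vector `x` is an assumption for illustration, not the insurance data).

```r
set.seed(1)
# Simulated right-skewed values; NOT the insurance data.
x <- rlnorm(1338, meanlog = 9, sdlog = 1)

# plot = FALSE returns the binning without drawing anything.
h <- hist(x, breaks = 100, plot = FALSE)

length(h$counts)     # number of bins actually used (breaks is only a suggestion)
sum(h$counts)        # every observation falls into some bin
mean(x) > median(x)  # a mean above the median is typical of right skew
```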

There are three nominal variables in the dataset.

Sex
Smoker
Region

It is useful to know how the observations are distributed across the levels of the nominal features. Let’s do that.

Table of sex (male/female):

table_sex<-table(insurance$sex)
table_sex

## 
## female   male 
##    662    676

Table of smokers:

table_smoke<-table(insurance$smoker)
table_smoke

## 
##   no  yes 
## 1064  274

Table of region:

table_region<-table(insurance$region)
table_region

## 
## northeast northwest southeast southwest 
##       324       325       364       325
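The tables above give counts; proportions can be easier to compare. As a small sketch, the counts are re-entered by hand below under new names (`counts_sex`, `counts_smoke`, `counts_region` are illustrative) so the block runs on its own. In the analysis itself, `prop.table(table_sex)` and so on would do the same.

```r
# Counts copied from the tables above.
counts_sex    <- c(female = 662, male = 676)
counts_smoke  <- c(no = 1064, yes = 274)
counts_region <- c(northeast = 324, northwest = 325,
                   southeast = 364, southwest = 325)

# prop.table() divides each count by the total.
prop_sex    <- prop.table(counts_sex)
prop_smoke  <- prop.table(counts_smoke)
prop_region <- prop.table(counts_region)

round(prop_sex, 3)
round(prop_smoke, 3)
round(prop_region, 3)
```

Roughly one person in five in the data is a smoker, while sex and region are close to evenly split.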

Let’s visualize the above results:

par(mfrow=c(1,3))
barplot(table_sex,main="Sex",ylab="Frequency(people)",col="lightblue")
barplot(table_smoke,main="Smoke",ylab="Frequency(people)",col="lightgreen")
barplot(table_region,main="Region",ylab="Frequency(people)",col="lightgray")

Let’s fit a multiple regression model to check whether expenses are related to the other variables. Such a model would be very useful for estimating people’s future medical expenses. That is,

Dependent variable = expenses
Independent variables = all the remaining variables

fit <- lm(expenses ~ (age+sex+bmi+children+smoker+region), data=insurance)
summary(fit)

## 
## Call:
## lm(formula = expenses ~ (age + sex + bmi + children + smoker + 
##     region), data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11302.7  -2850.9   -979.6   1383.9  29981.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11941.6      987.8 -12.089  < 2e-16 ***
## age                256.8       11.9  21.586  < 2e-16 ***
## sexmale           -131.3      332.9  -0.395 0.693255    
## bmi                339.3       28.6  11.864  < 2e-16 ***
## children           475.7      137.8   3.452 0.000574 ***
## smokeryes        23847.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -352.8      476.3  -0.741 0.458976    
## regionsoutheast  -1035.6      478.7  -2.163 0.030685 *  
## regionsouthwest   -959.3      477.9  -2.007 0.044921 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.9 on 8 and 1329 DF,  p-value: < 2.2e-16

A multiple linear regression is fitted using expenses as the dependent variable and the rest of the features as independent variables.

Therefore, by the analysis, holding the other variables constant,

Each additional year of age increases the expected expense by 256.8 dollars.
Being male decreases the expected expense by 131.3 dollars relative to being female.
A one-unit increase in bmi increases the expected expense by 339.3 dollars.
Each additional child increases the expected expense by 475.7 dollars.
Being a smoker increases the expected expense by 23847.5 dollars relative to a non-smoker.
Living in the southeast decreases the expected expense by 1035.6 dollars relative to the northeast.
Living in the southwest decreases the expected expense by 959.3 dollars relative to the northeast.
Living in the northwest decreases the expected expense by 352.8 dollars relative to the northeast.

Also,

age
bmi
children
smokeryes
regionsoutheast
regionsouthwest

are significant at the 5% level, while the rest (sexmale and regionnorthwest) are not and are candidates for removal to simplify the model. An R-squared of about 0.75 indicates that about 75% of the variation in expenses is explained by the model.
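The adjusted R-squared in the summary can be reproduced from the multiple R-squared, the number of observations and the number of estimated slopes. This is a quick arithmetic check using the figures reported above, not a new fit.

```r
r2 <- 0.7509  # Multiple R-squared from summary(fit)
n  <- 1338    # number of observations
p  <- 8       # number of coefficients excluding the intercept

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

The result rounds to the 0.7494 shown in the output. The penalty is small here because 8 predictors is few relative to 1338 observations.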

Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.

Diagnostic plots (four plots):

layout(matrix(c(1,2,3,4),2,2)) 
plot(fit)

From the four plots above we can conclude that,

+ By the **Residuals vs Fitted** graph -->> the variance is nearly constant.
+ By the **Normal QQ** graph -->> the residuals approximately satisfy the normality assumption.

So we can treat this as a multiple linear regression model and apply the normal-error assumptions, although a non-linear model might describe the data better.
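The four diagnostic panels are built from quantities that can also be inspected directly. The sketch below shows how to extract them; it uses a small simulated data set and a model called `toy_fit` (both are assumptions for illustration, not the insurance data or the model above).

```r
set.seed(42)
# Simulated data; NOT the insurance data.
n <- 200
age <- sample(18:64, n, replace = TRUE)
smoker <- factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
expenses <- 250 * age + 24000 * (smoker == "yes") + rnorm(n, sd = 6000)
toy_fit <- lm(expenses ~ age + smoker)

res  <- residuals(toy_fit)       # plotted in Residuals vs Fitted
fitv <- fitted(toy_fit)          # x-axis of Residuals vs Fitted
std  <- rstandard(toy_fit)       # standardized residuals, used in the Normal Q-Q plot
cook <- cooks.distance(toy_fit)  # influence measure behind Residuals vs Leverage

# The largest Cook's distances point to the most influential rows.
head(sort(cook, decreasing = TRUE), 3)
```

Replacing `toy_fit` with `fit` would give the same quantities for the insurance model.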

Conclusion:

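The regression model, keeping only the significant terms, is:

expenses = -11941.6 + 256.8(age) + 339.3(bmi) + 475.7(children) + 23847.5(smokeryes) - 1035.6(regionsoutheast) - 959.3(regionsouthwest)

The coefficients are taken from the full fit above; refitting without the dropped terms would generally change them slightly.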
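To show how the equation above would be used for prediction, here is a minimal sketch that wraps it in a function. The function name `predict_expenses` and the example profile are hypothetical; the coefficients are the rounded values from the equation.

```r
# Coefficients copied from the equation above (rounded as reported).
predict_expenses <- function(age, bmi, children, smoker, region) {
  -11941.6 +
    256.8 * age +
    339.3 * bmi +
    475.7 * children +
    23847.5 * (smoker == "yes") -          # 1 if smoker, else 0
    1035.6 * (region == "southeast") -     # northeast/northwest act as baseline
    959.3 * (region == "southwest")
}

# Hypothetical profile, for illustration only.
non_smoker <- predict_expenses(age = 40, bmi = 30, children = 2,
                               smoker = "no", region = "southeast")
smoker <- predict_expenses(age = 40, bmi = 30, children = 2,
                           smoker = "yes", region = "southeast")
c(non_smoker = non_smoker, smoker = smoker)
```

For this profile the non-smoker prediction comes to about 8425.2 dollars, and the smoker prediction is higher by exactly the smoker coefficient, 23847.5 dollars.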

Data Science Master

2018.10.18
