The purpose of this analysis is to examine how different features relate to one another, and to fit a multiple linear regression of existing medical expenses on several features of an individual, such as age, physical and family condition, and location. The model can then be used to predict the future medical expenses of individuals, which helps medical insurers decide what premium to charge.
I have used the dataset called insurance.csv. The insurance.csv dataset contains 1338 observations (rows) and 7 features (columns). The dataset contains 4 numerical features (age, bmi, children and expenses) and 3 nominal features (sex, smoker and region) that were converted into factors, with a numerical value designated for each level. Note that the file is read in with two extra columns, X and X.1, that contain only NA values, which is why str() reports 9 variables below.
Import the data set and look at the structure of the insurance data:
insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
str(insurance)
## 'data.frame': 1338 obs. of 9 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 25.7 33.4 27.7 29.8 25.8 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ expenses: num 16885 1726 4449 21984 3867 ...
## $ X : logi NA NA NA NA NA NA ...
## $ X.1 : logi NA NA NA NA NA NA ...
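The two all-NA columns (X and X.1) carry no information, so they could be dropped before going further. A minimal sketch of one way to do that, shown on a small stand-in data frame (`toy` and `toy_clean` are illustrative names, and only a few values are copied from the output above); the same idea would apply to `insurance` itself:

```r
# Stand-in for the imported data: a few values copied from the str() output,
# plus two columns that are entirely NA, like X and X.1.
toy <- data.frame(
  age      = c(19L, 18L, 28L),
  expenses = c(16885, 1726, 4449),
  X        = c(NA, NA, NA),
  X.1      = c(NA, NA, NA)
)

# Keep only the columns that contain at least one non-missing value.
keep <- colSums(!is.na(toy)) > 0
toy_clean <- toy[, keep, drop = FALSE]

names(toy_clean)
```

The models in this analysis name their predictors explicitly, so the empty columns do not affect the results; removing them mainly tidies the output of str() and summary().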
Let’s print the head (the first 6 rows) of the data set:
suppressMessages(library(dplyr))
head(insurance)
Now look at the summary of the data set:
summary(insurance)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :16.00 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.67 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.70 3rd Qu.:2.000
## Max. :64.00 Max. :53.10 Max. :5.000
## region expenses X X.1
## northeast:324 Min. : 1122 Mode:logical Mode:logical
## northwest:325 1st Qu.: 4740 NA's:1338 NA's:1338
## southeast:364 Median : 9382
## southwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
Histogram of the medical expenses:
hist(insurance$expenses, main="Histogram of medical expenses",breaks = 100,xlab = "Expenses",ylab = "Frequency",col="lightblue")
The histogram of the medical expenses, drawn with about 100 bins (breaks = 100), shows a right-skewed distribution. This agrees with the summary above, where the mean (13270) is well above the median (9382).
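Note that the `breaks` argument of hist() suggests the number of bins; it does not limit how many observations are drawn. The sketch below illustrates this on simulated right-skewed data (the log-normal sample `x` is an assumption for illustration, not the insurance data), and also shows the mean-above-median pattern typical of right skew:

```r
set.seed(1)

# Simulated right-skewed values (log-normal); NOT the insurance data.
x <- rlnorm(1000, meanlog = 9, sdlog = 1)

# Compute the histogram without drawing it; breaks = 100 is a suggested
# number of bins, and every observation is still counted.
h <- hist(x, breaks = 100, plot = FALSE)
sum(h$counts) == length(x)

# In a right-skewed sample the long upper tail pulls the mean above the median.
mean(x) > median(x)
```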
There are three nominal variables in the dataset.
It is useful to know how the observations are distributed across the levels of each nominal feature. Let’s do that.
table_sex<-table(insurance$sex)
table_sex
##
## female male
## 662 676
table_smoke<-table(insurance$smoker)
table_smoke
##
## no yes
## 1064 274
table_region<-table(insurance$region)
table_region
##
## northeast northwest southeast southwest
## 324 325 364 325
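The counts above can be turned into proportions with prop.table(). A small sketch using the counts copied from the tables (the `*_counts` vector names are illustrative); with the data loaded, `prop.table(table(insurance$smoker))` would give the same figures directly:

```r
# Counts copied from the tables above.
sex_counts    <- c(female = 662, male = 676)
smoker_counts <- c(no = 1064, yes = 274)
region_counts <- c(northeast = 324, northwest = 325,
                   southeast = 364, southwest = 325)

# Share of observations in each level, rounded to three decimals.
round(prop.table(sex_counts), 3)
round(prop.table(smoker_counts), 3)
round(prop.table(region_counts), 3)
```

Sex and region are close to evenly split, while only about one in five individuals is a smoker.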
Let’s visualize the above results:
par(mfrow=c(1,3))
barplot(table_sex,main="Sex",ylab="Frequency(people)",col="lightblue")
barplot(table_smoke,main="Smoke",ylab="Frequency(people)",col="lightgreen")
barplot(table_region,main="Region",ylab="Frequency(people)",col="lightgray")
Let’s fit a multiple regression model to check whether expenses are related to the other variables. Such a model is useful for predicting people’s future medical expenses. That is,
fit <- lm(expenses ~ (age+sex+bmi+children+smoker+region), data=insurance)
summary(fit)
##
## Call:
## lm(formula = expenses ~ (age + sex + bmi + children + smoker +
## region), data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11302.7 -2850.9 -979.6 1383.9 29981.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11941.6 987.8 -12.089 < 2e-16 ***
## age 256.8 11.9 21.586 < 2e-16 ***
## sexmale -131.3 332.9 -0.395 0.693255
## bmi 339.3 28.6 11.864 < 2e-16 ***
## children 475.7 137.8 3.452 0.000574 ***
## smokeryes 23847.5 413.1 57.723 < 2e-16 ***
## regionnorthwest -352.8 476.3 -0.741 0.458976
## regionsoutheast -1035.6 478.7 -2.163 0.030685 *
## regionsouthwest -959.3 477.9 -2.007 0.044921 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
## F-statistic: 500.9 on 8 and 1329 DF, p-value: < 2.2e-16
A multiple linear regression is fitted using expenses as the dependent variable and the rest of the features as independent variables.
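Therefore, by the analysis, age, bmi, children and smokeryes are highly significant (p < 0.001).
Also, regionsoutheast and regionsouthwest
are significant at the 5% level, while the rest (sexmale and regionnorthwest) are not and are candidates for dropping to improve the model. R-squared ≈ 0.75 indicates that about 75% of the variation in expenses is explained by the model.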
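As a check on the summary output, the adjusted R-squared can be reproduced from the reported values with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below uses only numbers printed above; the variable names are illustrative:

```r
n  <- 1338    # observations
p  <- 8       # estimated slopes (the "8 and 1329 DF" in the F-statistic)
r2 <- 0.7509  # Multiple R-squared from summary(fit)

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

This gives 0.7494, matching the summary. The adjustment is tiny here because the number of observations is large relative to the number of predictors.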
Diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
layout(matrix(c(1,2,3,4),2,2))
plot(fit)
From the four plots above we can conclude that:
+ **Residuals vs Fitted** graph: the variance is nearly constant.
+ **Normal QQ** graph: the residuals approximately satisfy the normality assumption.
So, we can treat this as a multiple linear regression model and apply the normality assumptions, although a non-linear regression model might fit the data better.
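The visual checks can be complemented with simple numeric ones from base R. The sketch below is an illustration on simulated data only (`sim`, `sim_fit` and the chosen coefficients are assumptions, not the insurance model); with the real model, `residuals(fit)` and `fitted(fit)` would take the place of the simulated ones:

```r
set.seed(42)

# Simulated data with normal, constant-variance errors; NOT the insurance data.
sim <- data.frame(x = runif(200, 18, 64))
sim$y <- 250 * sim$x + rnorm(200, sd = 6000)
sim_fit <- lm(y ~ x, data = sim)

res <- residuals(sim_fit)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
sw <- shapiro.test(res)

# Rough heteroscedasticity check: does the size of the residuals
# trend with the fitted values? A small p-value suggests it does.
spread <- cor.test(abs(res), fitted(sim_fit), method = "spearman")

c(shapiro_p = sw$p.value, spread_p = spread$p.value)
```

These tests are a supplement to the plots rather than a replacement; with more than a thousand observations, even small departures from normality can produce small p-values.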
The regression model, keeping only the significant terms, is: expenses = -11941.6 + 256.8(age) + 339.3(bmi) + 475.7(children) + 23847.5(smokeryes) - 1035.6(regionsoutheast) - 959.3(regionsouthwest)
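To show how the equation is used, here is a small sketch that evaluates it for one individual. The function name `predict_expenses` and the example person are hypothetical; the coefficients are copied from the equation above. With the fitted object available, `predict(fit, newdata = ...)` would be the usual route and would also include the sexmale and regionnorthwest terms.

```r
# Coefficients copied from the regression equation above.
predict_expenses <- function(age, bmi, children, smoker = FALSE,
                             region = "northeast") {
  -11941.6 +
    256.8 * age +
    339.3 * bmi +
    475.7 * children +
    23847.5 * as.numeric(smoker) +             # 1 if smoker, 0 otherwise
    -1035.6 * (region == "southeast") +        # dummy for the southeast
    -959.3 * (region == "southwest")           # dummy for the southwest
}

# Hypothetical individual: 40 years old, BMI 30, two children,
# non-smoker, living in the southeast.
predict_expenses(age = 40, bmi = 30, children = 2,
                 smoker = FALSE, region = "southeast")
```

For this individual the equation gives 8425.2; the same person as a smoker would be predicted 23847.5 higher, which shows how strongly smoking dominates the other terms.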