Dataset: Insurance dataset from kaggle: https://www.kaggle.com/mirichoi0218/insurance
Description: This dataset looks at medical insurance costs charges for various people based on several factors like number of children, region of residency, age etc.
Step 1) Load the dataset
Step 2) Display first few rows of insurance dataset
insurance <- read.csv("https://raw.githubusercontent.com/PriyaShaji/Data605/master/week%2012/insurance.csv")
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages ------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v readr 1.3.1
## v tibble 2.1.3 v purrr 0.3.2
## v tidyr 0.8.3 v stringr 1.3.1
## v ggplot2 3.1.0 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'tidyr' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.2
## Warning: package 'purrr' was built under R version 3.5.3
## Warning: package 'forcats' was built under R version 3.5.2
## -- Conflicts ---------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
insurance <- mutate(insurance,smoker_group = case_when(str_detect(smoker,"yes") ~ 1,
TRUE ~ 0))
insurance <- mutate(insurance,age_group = case_when(age > 60 ~ 4,
age > 50 ~ 3,
age > 40 ~ 2,
age > 30 ~ 1,
age < 30 ~ 0,
TRUE ~ 0))
insurance <- mutate(insurance,bmi_group = case_when(bmi > 30 ~ 2,
age > 25 ~ 1,
age < 25 ~ 0,
TRUE ~ 0))
head(insurance)
Step 3) Summarize the dataset and display the dimensions
summary(insurance)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges smoker_group age_group
## northeast:324 Min. : 1122 Min. :0.0000 Min. :0.000
## northwest:325 1st Qu.: 4740 1st Qu.:0.0000 1st Qu.:0.000
## southeast:364 Median : 9382 Median :0.0000 Median :1.000
## southwest:325 Mean :13270 Mean :0.2048 Mean :1.478
## 3rd Qu.:16640 3rd Qu.:0.0000 3rd Qu.:3.000
## Max. :63770 Max. :1.0000 Max. :4.000
## bmi_group
## Min. :0.000
## 1st Qu.:1.000
## Median :2.000
## Mean :1.413
## 3rd Qu.:2.000
## Max. :2.000
dim(insurance)
## [1] 1338 10
str(insurance)
## 'data.frame': 1338 obs. of 10 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children : int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
## $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
## $ charges : num 16885 1726 4449 21984 3867 ...
## $ smoker_group: num 1 0 0 0 0 0 0 0 0 0 ...
## $ age_group : num 0 0 0 1 1 1 2 1 1 3 ...
## $ bmi_group : num 0 2 2 1 1 1 2 1 1 1 ...
From the above steps we see that the dataset is tidy and clean.
Step 4) Now, let’s analyze it using graphs
par(mfrow=c(2,2))
hist(insurance$bmi_group, xlab = "BMI (Body Mass Index)",
main = "Histogram of BMI")
hist(insurance$charges, xlab = "Medical Charges",
main = "Histogram of Medical Charges")
hist(insurance$age_group, xlab = "Age_group",
main = "Histogram of Age group")
hist(insurance$smoker_group, xlab = "smoker_cond",
main = "Histogram of smoker_cond group")
Histogram of medical charges looks more right-skewed.
par(mfrow=c(1,3))
with(insurance, plot(charges ~ smoker + sex + region))
I have plotted above graphs using smoker, gender and region.
We see that BMI is nearly normally distributed, medical charges is right-skewed and there are many outliers for high medical charges against both genders and various regions.
We also see that the median is about the same for all regions, and genders.
Note that for smokers, medical charges are much higher than normal ones which we should expect.
Now, letâs fit a multiple regression model, let have the explanatory variables as
sex (categorical)
bmi (numerical, continous)
age (numerical, discrete)
smoker (categorical)
charges (numerical, continous)
Step 5) Letâs make a multiple regression model of the following equation:
\(charges = β0+β1âSex+β2âbmi+β3âage+β4âsmoker+β5(bmiâsex)\)
lm_insurance <- lm(charges ~ age*bmi_group+age*smoker_group +age*age_group + age*smoker_group*bmi_group, data = insurance)
summary(lm_insurance)
##
## Call:
## lm(formula = charges ~ age * bmi_group + age * smoker_group +
## age * age_group + age * smoker_group * bmi_group, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10148.7 -1981.3 -1398.6 -293.5 23776.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -161.282 1088.565 -0.148 0.882239
## age 192.129 41.509 4.629 4.04e-06 ***
## bmi_group 280.862 536.702 0.523 0.600845
## smoker_group 24020.686 1806.664 13.296 < 2e-16 ***
## age_group -1159.028 703.193 -1.648 0.099541 .
## age:bmi_group -7.385 15.693 -0.471 0.638018
## age:smoker_group -636.493 55.530 -11.462 < 2e-16 ***
## age:age_group 36.434 9.489 3.840 0.000129 ***
## bmi_group:smoker_group 3109.117 1157.505 2.686 0.007320 **
## age:bmi_group:smoker_group 350.018 34.287 10.208 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4663 on 1328 degrees of freedom
## Multiple R-squared: 0.8527, Adjusted R-squared: 0.8517
## F-statistic: 854.3 on 9 and 1328 DF, p-value: < 2.2e-16
Step 6) Plot Histogram of Residuals
par(mfrow=c(2,2))
hist(lm_insurance$residuals, main = "Histogram of Residuals", xlab= "")
plot(lm_insurance$residuals, fitted(lm_insurance))
qqnorm(lm_insurance$residuals)
qqline(lm_insurance$residuals)
We see that the residuals histogram is somewhat normal but the residuals vs fitted values doesnât show constant variance which is not good for a multiple regression model.
The equation of this multiple regression model is as follows:
\(charges = â11717.37+54.1âsex+325.78âbmi+259.47âage+23836.07âsmokerâ5.33(bmiâsex)\)
Note the variables
sex = 1 for male and 0 for female
smoker = 1 for male and 0 for female
What does this tell us? Letâs look at the details of the summary in more detail
Coefficients:
Intercept: This tells us that leaving all other terms constant, on average the estimated medical charge is about $-11717.36 which logically wonât make sense and is good there are other terms in the model.
Sex: If a person is male and leaving all other terms constant, he can expect to pay about $54.1 in medical costs.
BMI: Leaving all other terms constant, a person can be expected to pay about $325.78 in medical charges per BMI value.
Age: Leaving all other terms constant, a person can be expected to pay about $259.47 in medical expenses multiplied by their age (A 31 year old will pay about $8043.57)
Smoker: A person who smokes and leaving all variables constant can expect to pay $23836.07
Sex*BMI: A male can expect to pay holding all other variables constant can expect to pay $-5.326 which doesnât make sense logically.
P-values of coefficients:
The p-values of the intercept, bmi, age and male smokers are very low and we can reject the null hypothesis (H_0 = 0) and favor the alternative (H_A != 0) that is the true coefficients is not 0
For Males and Male*bmi, we fail to reject the null hypothesis and thus these coefficients are very close to 0 and can be excluded in our model.
Residual Standard Error: The residual standard error of 6097 is the standard deviation and is a bit far from the good fit of points.
R-squared/Adjusted R^2: values of 0.7475 and 0.7466 respectively, this means that about 75% of the data fall into the regression line.
F-statistic: value of 788.6 with a small p-value < 2.2e-16 means that the features selected are better than the intercept-only model which as described before makes sense as a intercept only model gives a negative medical cost which doesnât apply or make sense.
The above model does not have much efficiency, as there are coefficients that can be removed or probably added for better accuracy and properly modeling and predicting medical costs. The residual standard error as well as the Q-Q plots show that the model is not a good fit for the data. One good thing I can say about the model is that the BMI and Age coefficients make sense as the more your BMI is and older, you are more likley to have more health problems and have more medical costs to pay.
Future work can be done to add more coefficients, transforms and possibly use non-linear regression to better predict medical costs.