In this case study, I'll take a look at insurance data, looking for indicators we can use to predict higher insurance charges. The dataset contains a collection of customers from around the US, along with their demographic information, body mass index, smoking status, and the medical charges billed to insurance.
Workflow for Linear Regression Analysis:
- Import and clean data
- Explore data visually and numerically
- Identify significant correlations
- Build a linear regression model
- Evaluate the model
- Conclusion
insurance <- read.csv("data_input/insurance.csv", stringsAsFactors = T)
head(insurance)
As we can see, these are the features we have to work with.
Check data types:
# check data types
str(insurance)
#> 'data.frame': 1338 obs. of 7 variables:
#> $ age : int 19 18 28 33 32 31 46 37 37 60 ...
#> $ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
#> $ bmi : num 27.9 33.8 33 22.7 28.9 ...
#> $ children: int 0 1 3 0 0 0 1 3 2 0 ...
#> $ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
#> $ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
#> $ charges : num 16885 1726 4449 21984 3867 ...
Check the data summary:
summary(insurance)
#> age sex bmi children smoker
#> Min. :18.00 female:662 Min. :15.96 Min. :0.000 no :1064
#> 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 yes: 274
#> Median :39.00 Median :30.40 Median :1.000
#> Mean :39.21 Mean :30.66 Mean :1.095
#> 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
#> Max. :64.00 Max. :53.13 Max. :5.000
#> region charges
#> northeast:324 Min. : 1122
#> northwest:325 1st Qu.: 4740
#> southeast:364 Median : 9382
#> southwest:325 Mean :13270
#> 3rd Qu.:16640
#> Max. :63770
Check for missing values in each column:
colSums(is.na(insurance))
#> age sex bmi children smoker region charges
#> 0 0 0 0 0 0 0
Good! No missing values.
The insurance dataset is now ready to be processed and analyzed.
# Distribution Variable Charges
library(ggplot2)
library(dplyr)
insurance %>%
ggplot(aes(charges)) +
geom_histogram(color = "blue", fill = "blue",alpha = .5, bins = 25) +
scale_x_continuous(breaks = seq(0,64000,10000)) +
theme(axis.text.x = element_text(size = 14)) +
labs(title="Variable Charges", x="")As for medical charges amount, from the table, we can see that there is a big range between minimum and maximum amount of medical charges. We can see a huge positive skewness in the histogram, which means that there are patients in this sample who have a high medical charges expense.
# Distribution Variable Age
insurance %>%
ggplot(aes(age)) +
geom_histogram(color = "blue", fill = "blue",alpha = .5, bins = 25) +
scale_x_continuous(breaks = seq(18,64,5)) +
theme(axis.text.x = element_text(size = 14)) +
labs(title="Variable Age", x="")The minimum age of the patients in the sample is 18, whereas the maximum age is 64 (only adults are included), with mean of 39 years. From the histogram, we can see that the distribution is almost uniform (all bars - same height), but in the population, we can also expect that the patient’s age is very volatile when it comes to medical costs (or coming to the hospital). Although this distribution isn’t normal-looking (bell like), it isn’t a bell-like in real life either, so we shouldn’t bother with that prerequisite for a potential linear regression model.
insurance %>%
ggplot(aes(x=age, y = charges)) +
geom_point(aes(color = smoker)) +
ggtitle("Charges vs Age")# Distribution Variable BMI
insurance %>%
ggplot(aes(bmi)) +
geom_histogram(color = "blue", fill = "blue",alpha = .5, bins = 25) +
scale_x_continuous(breaks = seq(15,54,5)) +
theme(axis.text.x = element_text(size = 14)) +
labs(title="Variable Body Mass Index", x="")According to CDC, “BMI is a person’s weight in kilograms divided by the square of height in meters.”. Since BMI is correlated with body fatness, it is also correlated with various diseases which rise medical costs. There are certain categories of BMI:
< 18,5 is underweight
18,5 - 24,9 is normal or healthy weight
25 - 29,9 is overweight
30+ is obese
As seen in the histogram, there are no significant outliers.
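To make the categories concrete, we could bucket bmi with base R's cut(); a small sketch, where bmi_cat is a hypothetical helper name introduced here for illustration:
# Sketch: bucket bmi into the CDC categories listed above
bmi_cat <- cut(insurance$bmi,
breaks = c(-Inf, 18.5, 25, 30, Inf),
labels = c("underweight", "normal", "overweight", "obese"),
right = FALSE)
table(bmi_cat)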
# Distribution Variable Children
barplot(height = table(insurance$children), col = "blue", main = "Distribution of children", xlab = "No. of children")
On average, people have one child, and the maximum number of children is 5. There are no outliers, and most of the people in the sample don't have children.
# Barplot Variable Sex
insurance %>%
ggplot(aes(x=sex, y = prop.table(stat(count)),
fill = sex,
label = scales::percent(prop.table(stat(count))))) +
geom_bar(show.legend=F) +
geom_text(stat = 'count',
position = position_dodge(.9),
vjust = -0.5,
size = 3) +
scale_y_continuous(labels = scales::percent) +
theme(axis.text.x = element_text(size = 14)) +
labs(title="Variable Sex", x="", y="")Based on current samples of female and male genders with balanced proportions.
# Barplot Variable Smoker
insurance %>%
ggplot(aes(x=smoker, y = prop.table(stat(count)),
fill = smoker,
label = scales::percent(prop.table(stat(count))))) +
geom_bar(show.legend=F) +
geom_text(stat = 'count',
position = position_dodge(.9),
vjust = -0.5,
size = 3) +
scale_y_continuous(labels = scales::percent) +
theme(axis.text.x = element_text(size = 14)) +
labs(title="Variable Smoker", x="", y="")Variables from the sample who do not smoke more than those who smoker with a proportion of 80% -20%.
# Barplot Variable Region
insurance %>%
ggplot(aes(reorder(region, region,
function(x)-length(x)),
y = prop.table(stat(count)),
fill = region,
label = scales::percent(prop.table(stat(count))))) +
geom_bar(show.legend=F) +
geom_text(stat = 'count',
position = position_dodge(.9),
vjust = -0.5,
size = 3) +
scale_y_continuous(labels = scales::percent) +
theme(axis.text.x = element_text(size = 15)) +
labs(title="Variable Region", x="", y="")The sample on this insurance customer has a balanced region in the US country.
library(GGally)
ggcorr(insurance, label = T, hjust = 1, layout.exp = 2)
Among the numeric columns, the ones that most influence charges are age (0.3), followed by bmi (0.2) and children (0.1).
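Note that ggcorr() only covers numeric columns, so the categorical smoker variable is absent even though the earlier scatter plot suggests it matters a great deal. A quick sketch to visualize its effect (this boxplot is an addition, not part of the original analysis):
# Sketch: charges split by the categorical smoker variable
insurance %>%
ggplot(aes(x = smoker, y = charges, fill = smoker)) +
geom_boxplot(show.legend = FALSE) +
labs(title="Charges by Smoking Status", x="", y="")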
Assumption: the higher the age, the higher the insurance charges.
set.seed(100)
index_ins <- sample(x = nrow(insurance) , size = nrow(insurance)*0.70)
ins_train <- insurance[index_ins , ]
ins_test <- insurance[-index_ins, ]
A first linear regression model can be constructed using only the age predictor, which has the highest positive correlation with the target variable charges:
model_age <- lm(formula = charges ~ age , data = ins_train)
summary(model_age)
#>
#> Call:
#> lm(formula = charges ~ age, data = ins_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -7870 -6478 -5739 5101 47593
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2958.99 1095.24 2.702 0.00702 **
#> age 258.69 26.09 9.916 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 11290 on 934 degrees of freedom
#> Multiple R-squared: 0.09525, Adjusted R-squared: 0.09428
#> F-statistic: 98.33 on 1 and 934 DF, p-value: < 2.2e-16
As can be seen, the adjusted R-squared value is only 0.09428, so age alone explains less than 10% of the variance in charges.
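For a visual sense of how little this simple model captures, we can overlay its fitted line on the training data; a minimal sketch (geom_smooth with method = "lm" refits the same charges ~ age model):
# Sketch: the simple model's fitted line over the training data
ins_train %>%
ggplot(aes(x = age, y = charges)) +
geom_point(alpha = .4) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
ggtitle("model_age: charges ~ age")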
Then, using step-wise regression with backward elimination, I will select predictor variables automatically.
model_all <- lm(formula = charges ~. , data = ins_train)
model_back <- step(object = model_all, direction = "backward", trace = 0)
summary(model_back)
#>
#> Call:
#> lm(formula = charges ~ age + bmi + children + smoker + region,
#> data = ins_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -11099 -2929 -1018 1395 25387
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -11473.21 1118.55 -10.257 < 2e-16 ***
#> age 249.95 13.92 17.953 < 2e-16 ***
#> bmi 340.09 32.90 10.338 < 2e-16 ***
#> children 510.66 162.62 3.140 0.00174 **
#> smokeryes 23441.91 488.79 47.959 < 2e-16 ***
#> regionnorthwest -649.48 566.64 -1.146 0.25201
#> regionsoutheast -1617.36 559.61 -2.890 0.00394 **
#> regionsouthwest -1111.36 553.28 -2.009 0.04486 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5967 on 928 degrees of freedom
#> Multiple R-squared: 0.7491, Adjusted R-squared: 0.7472
#> F-statistic: 395.9 on 7 and 928 DF, p-value: < 2.2e-16
This step-wise regression method selects the formula with the lowest AIC (Akaike Information Criterion); the lower the AIC, the less information the model loses.
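We can also compare the candidate models' AIC values directly with base R's AIC() function; a quick sketch (output omitted):
# Sketch: compare AIC across models; lower is better
AIC(model_age)
AIC(model_all)
AIC(model_back)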
Compared to the initial model that only uses the age variable, the model selected by backward elimination (which kept every predictor except sex) has an adjusted R-squared of 0.7472, far higher than the previous model's 0.09428. We will therefore proceed with model_back.
Test the linearity assumption with cor.test():
H0: correlation is 0 (no linear relationship)
H1: correlation is not 0 (linear relationship)
cor.test(ins_train$charges, ins_train$age)
#>
#> Pearson's product-moment correlation
#>
#> data: ins_train$charges and ins_train$age
#> t = 9.9161, df = 934, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.2494816 0.3654768
#> sample estimates:
#> cor
#> 0.3086262
cor.test(ins_train$charges, ins_train$children)
#>
#> Pearson's product-moment correlation
#>
#> data: ins_train$charges and ins_train$children
#> t = 2.5541, df = 934, p-value = 0.0108
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.01930792 0.14657947
#> sample estimates:
#> cor
#> 0.0832833
cor.test(ins_train$charges, ins_train$bmi)
#>
#> Pearson's product-moment correlation
#>
#> data: ins_train$charges and ins_train$bmi
#> t = 6.28, df = 934, p-value = 5.181e-10
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.138997 0.261982
#> sample estimates:
#> cor
#> 0.2012826
The p-values for all numeric variables are < 0.05, so we reject H0; each has a significant linear relationship with charges.
Next, check the normality of the residuals, visually with the hist() function:
hist(model_back$residuals)
and formally with shapiro.test() (hopefully p-value > alpha, so that the decision taken is to fail to reject H0).
H0: error is normally distributed
H1: error is not normally distributed
Expected p-value > 0.05.
Shapiro-Wilk hypothesis test:
# Shapiro Test
shapiro.test(model_back$residuals)
#>
#> Shapiro-Wilk normality test
#>
#> data: model_back$residuals
#> W = 0.90181, p-value < 2.2e-16
The p-value is < 0.05, so we reject H0; the residual errors are not normally distributed.
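A common remedy for right-skewed, non-normal residuals is to model log(charges) instead; a hedged sketch, where model_log is a hypothetical name introduced here and not part of the original analysis:
# Sketch: refit on the log of charges and re-check residual normality
model_log <- lm(log(charges) ~ age + bmi + children + smoker + region,
data = ins_train)
shapiro.test(model_log$residuals)
Whether this actually helps would need to be checked against the other assumptions and the test-set error.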
# plot model.fitted with residuals
plot(x = model_back$fitted.values, y = model_back$residuals)
abline(h = 0, col = "red", lty = 2)
To test homoscedasticity formally, use the Breusch-Pagan test from the lmtest package:
library(lmtest)
bptest(model_back)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_back
#> BP = 88.999, df = 7, p-value < 2.2e-16
The p-value is < 0.05, which means the error variance does not spread constantly (heteroscedasticity).
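When heteroscedasticity is present, one common adjustment is to report heteroscedasticity-robust standard errors; a sketch assuming the sandwich package is available (this is an addition, not part of the original analysis):
# Sketch: HC-robust standard errors for model_back
# (coeftest() comes from lmtest, loaded above; vcovHC() from sandwich)
library(sandwich)
coeftest(model_back, vcov = vcovHC(model_back, type = "HC1"))
Next, check for multicollinearity with the variance inflation factor (VIF) from the car package: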
library(car)
vif(model_back)
#> GVIF Df GVIF^(1/(2*Df))
#> age 1.020579 1 1.010237
#> bmi 1.101757 1 1.049646
#> children 1.004098 1 1.002047
#> smoker 1.004237 1 1.002116
#> region 1.089173 3 1.014338
All VIF values are < 10, so the model does not exhibit multicollinearity.
Finally, evaluate the model on the test set with MAPE from the MLmetrics package:
pred_back <- predict(model_back, newdata = ins_test)
library(MLmetrics)
MAPE(y_pred = pred_back, y_true = ins_test$charges)
#> [1] 0.4261799
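MAPE can be complemented with error metrics on the dollar scale; a quick sketch using the MAE() and RMSE() helpers that MLmetrics also provides (output omitted):
# Sketch: mean absolute error and root mean squared error in dollars
MAE(y_pred = pred_back, y_true = ins_test$charges)
RMSE(y_pred = pred_back, y_true = ins_test$charges)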
Overall, the model performs reasonably well. It meets the linearity assumption and shows no multicollinearity, but the assumptions of homoscedasticity and normality of residuals are violated. The MAPE of 0.42 means that, on average, the model's predictions deviate from the actual charges by about 42 percent; this is fairly high, but since it stays below 50 percent the model can still be considered good enough.