Introduction

In this case study, I'll take a look at insurance data, searching for indicators we can use to predict higher insurance charges. I have a dataset featuring a collection of customers from around the US, with their demographic information, body mass index, smoking status, and the medical charges billed to their insurance.

Workflow for Linear Regression Analysis:
- Import and clean the data
- Explore the data visually and numerically
- Identify significant correlations
- Build a linear regression model
- Check the model assumptions
- Evaluate the model
- Draw conclusions

Import Data

# read the data; stringsAsFactors = T turns character columns into factors
insurance <- read.csv("data_input/insurance.csv", stringsAsFactors = T)
head(insurance)

The dataset has the following features:

  • Age: Age of the primary policyholder.
  • Sex: Sex of the policyholder.
  • BMI: Body Mass Index of the policyholder, defined as body mass divided by the square of body height (kg/m²).
  • Smoker status: Whether the policyholder is a smoker or a non-smoker.
  • Children: Number of children/dependents covered by the policy.
  • Region: Residential area of the policyholder (in the US) - northeast, southeast, southwest, northwest.
  • Charges: Yearly medical expenses billed by the medical insurance provider ($).

Data Cleansing

Check data types:

# check data types
str(insurance)
#> 'data.frame':    1338 obs. of  7 variables:
#>  $ age     : int  19 18 28 33 32 31 46 37 37 60 ...
#>  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
#>  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
#>  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
#>  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
#>  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
#>  $ charges : num  16885 1726 4449 21984 3867 ...

Check the data summary:

summary(insurance)
#>       age            sex           bmi           children     smoker    
#>  Min.   :18.00   female:662   Min.   :15.96   Min.   :0.000   no :1064  
#>  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
#>  Median :39.00                Median :30.40   Median :1.000             
#>  Mean   :39.21                Mean   :30.66   Mean   :1.095             
#>  3rd Qu.:51.00                3rd Qu.:34.69   3rd Qu.:2.000             
#>  Max.   :64.00                Max.   :53.13   Max.   :5.000             
#>        region       charges     
#>  northeast:324   Min.   : 1122  
#>  northwest:325   1st Qu.: 4740  
#>  southeast:364   Median : 9382  
#>  southwest:325   Mean   :13270  
#>                  3rd Qu.:16640  
#>                  Max.   :63770

Check for missing values in each column:

colSums(is.na(insurance))
#>      age      sex      bmi children   smoker   region  charges 
#>        0        0        0        0        0        0        0

Good! There are no missing values.
The insurance dataset is now ready to be processed and analyzed.

Explore data visually and numerically

# Distribution Variable Charges
library(ggplot2)
library(dplyr)

insurance %>%
  ggplot(aes(charges)) +
  geom_histogram(color = "blue", fill = "blue", alpha = .5, bins = 25) +
  scale_x_continuous(breaks = seq(0, 64000, 10000)) +
  theme(axis.text.x = element_text(size = 14)) +
  labs(title = "Variable Charges", x = "")

Looking at medical charges, the summary table shows a wide range between the minimum and maximum amounts. The histogram shows strong positive skewness, which means there are patients in this sample with very high medical expenses.

# Distribution Variable Age
insurance %>%
  ggplot(aes(age)) +
  geom_histogram(color = "blue", fill = "blue", alpha = .5, bins = 25) +
  scale_x_continuous(breaks = seq(18, 64, 5)) +
  theme(axis.text.x = element_text(size = 14)) +
  labs(title = "Variable Age", x = "")

The minimum age of the patients in the sample is 18 and the maximum is 64 (only adults are included), with a mean of 39 years. The histogram shows a nearly uniform distribution (all bars at roughly the same height), which is plausible for a general adult customer base. Although this distribution is not bell-shaped, predictor normality is not a prerequisite for a linear regression model (only the residuals will be tested for normality later), so this is not a concern.

insurance %>%
  ggplot(aes(x=age, y = charges)) + 
  geom_point(aes(color = smoker)) + 
  ggtitle("Charges vs Age")
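The scatter plot separates into two bands: within each group charges rise with age, but smokers sit in a markedly higher band than non-smokers of the same age (consistent with the large smoker coefficient we will see in the regression model later).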

# Distribution Variable BMI
insurance %>%
  ggplot(aes(bmi)) +
  geom_histogram(color = "blue", fill = "blue", alpha = .5, bins = 25) +
  scale_x_continuous(breaks = seq(15, 54, 5)) +
  theme(axis.text.x = element_text(size = 14)) +
  labs(title = "Variable Body Mass Index", x = "")

According to the CDC, "BMI is a person's weight in kilograms divided by the square of height in meters." Since BMI is correlated with body fatness, it is also correlated with various diseases that raise medical costs. The CDC defines these BMI categories:

  1. < 18.5 is underweight
  2. 18.5 - 24.9 is normal or healthy weight
  3. 25 - 29.9 is overweight
  4. 30+ is obese

As seen in the histogram, there are no significant outliers.
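To make these cut-offs concrete, here is a minimal sketch (my own addition, not part of the original workflow) that bins bmi into the CDC groups with cut():

# bin BMI into the CDC categories; right = FALSE makes the intervals
# left-closed, so e.g. [18.5, 25) counts as "normal"
bmi_cat <- cut(insurance$bmi,
               breaks = c(-Inf, 18.5, 25, 30, Inf),
               labels = c("underweight", "normal", "overweight", "obese"),
               right = FALSE)
table(bmi_cat)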

# Distribution Variable Children
barplot(height = table(insurance$children), col = "blue",
        main = "Distribution of children", xlab = "No. of children")

On average, people have one child, and the maximum number of children is 5. There are no outliers, and most of the people in the sample don't have children.

# Barplot Variable Sex
insurance %>%
  ggplot(aes(x = sex, y = prop.table(after_stat(count)),
             fill = sex,
             label = scales::percent(prop.table(after_stat(count))))) +
  geom_bar(show.legend = FALSE) +
  geom_text(stat = "count",
            position = position_dodge(.9),
            vjust = -0.5,
            size = 3) +
  scale_y_continuous(labels = scales::percent) +
  theme(axis.text.x = element_text(size = 14)) +
  labs(title = "Variable Sex", x = "", y = "")

The sample is balanced between female (49.5%) and male (50.5%) policyholders.

# Barplot Variable Smoker
insurance %>%
  ggplot(aes(x = smoker, y = prop.table(after_stat(count)),
             fill = smoker,
             label = scales::percent(prop.table(after_stat(count))))) +
  geom_bar(show.legend = FALSE) +
  geom_text(stat = "count",
            position = position_dodge(.9),
            vjust = -0.5,
            size = 3) +
  scale_y_continuous(labels = scales::percent) +
  theme(axis.text.x = element_text(size = 14)) +
  labs(title = "Variable Smoker", x = "", y = "")

Non-smokers outnumber smokers in the sample, with a proportion of roughly 80% to 20%.

# Barplot Variable Region
insurance %>%
  ggplot(aes(x = reorder(region, region, function(x) -length(x)),
             y = prop.table(after_stat(count)),
             fill = region,
             label = scales::percent(prop.table(after_stat(count))))) +
  geom_bar(show.legend = FALSE) +
  geom_text(stat = "count",
            position = position_dodge(.9),
            vjust = -0.5,
            size = 3) +
  scale_y_continuous(labels = scales::percent) +
  theme(axis.text.x = element_text(size = 15)) +
  labs(title = "Variable Region", x = "", y = "")

The four US regions are represented in roughly balanced proportions in the sample.

Identify significant correlations

library(GGally)
ggcorr(insurance, label = T, hjust = 1, layout.exp = 2)

The numeric columns most correlated with charges are age (0.3), followed by bmi (0.2) and children (0.1). Note that ggcorr() only includes numeric variables, so the factors sex, smoker, and region do not appear in this matrix.
Hypothesis: the older the policyholder, the higher the charges.
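As a quick numeric cross-check (my addition, not part of the original analysis), the same correlations can be computed directly with cor():

# correlations of the numeric columns with charges
num_cols <- sapply(insurance, is.numeric)
cor(insurance[, num_cols])["charges", ]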

Build a Linear Regression Model

set.seed(100)

# 70/30 train-test split: sample row indices for training
index_ins <- sample(x = nrow(insurance), size = nrow(insurance) * 0.70)
ins_train <- insurance[index_ins, ]
ins_test  <- insurance[-index_ins, ]
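A quick sanity check on the split (my addition): the residual degrees of freedom in the model summaries below (934 = 936 - 2 for the single-predictor model) confirm that the training set has 936 rows and the test set the remaining 402.

nrow(ins_train)
nrow(ins_test)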

As a baseline, a simple linear regression model can be built using only the age predictor, which has the highest positive correlation with the target variable charges.

model_age <- lm(formula = charges ~ age , data = ins_train)
summary(model_age)
#> 
#> Call:
#> lm(formula = charges ~ age, data = ins_train)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#>  -7870  -6478  -5739   5101  47593 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  2958.99    1095.24   2.702  0.00702 ** 
#> age           258.69      26.09   9.916  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 11290 on 934 degrees of freedom
#> Multiple R-squared:  0.09525,    Adjusted R-squared:  0.09428 
#> F-statistic: 98.33 on 1 and 934 DF,  p-value: < 2.2e-16

As can be seen, the adjusted R-squared value is only 0.09428: age alone explains less than 10% of the variance in charges.
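To visualize how weak this baseline fit is, here is a quick plot (my addition; geom_smooth(method = "lm") refits the same charges ~ age line on the training data):

ins_train %>%
  ggplot(aes(x = age, y = charges)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  ggtitle("Baseline model: charges ~ age")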

Then, using step-wise regression with backward elimination, I will select predictor variables automatically.

model_all <- lm(formula = charges ~. , data = ins_train)
model_back <- step(object = model_all, direction = "backward", trace = 0)
summary(model_back)
#> 
#> Call:
#> lm(formula = charges ~ age + bmi + children + smoker + region, 
#>     data = ins_train)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -11099  -2929  -1018   1395  25387 
#> 
#> Coefficients:
#>                  Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)     -11473.21    1118.55 -10.257  < 2e-16 ***
#> age                249.95      13.92  17.953  < 2e-16 ***
#> bmi                340.09      32.90  10.338  < 2e-16 ***
#> children           510.66     162.62   3.140  0.00174 ** 
#> smokeryes        23441.91     488.79  47.959  < 2e-16 ***
#> regionnorthwest   -649.48     566.64  -1.146  0.25201    
#> regionsoutheast  -1617.36     559.61  -2.890  0.00394 ** 
#> regionsouthwest  -1111.36     553.28  -2.009  0.04486 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 5967 on 928 degrees of freedom
#> Multiple R-squared:  0.7491, Adjusted R-squared:  0.7472 
#> F-statistic: 395.9 on 7 and 928 DF,  p-value: < 2.2e-16

This step-wise regression method selects the formula with the lowest AIC value; a lower AIC indicates a better trade-off between goodness of fit and model complexity (less information loss).
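To see the comparison concretely (my addition; stats::AIC accepts several fitted models at once and reports one row per model):

# lower AIC is better
AIC(model_all, model_back)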

Compared to the initial model that only uses the age variable, the backward-selected model has an adjusted R-squared of 0.7472, far higher than the previous model's 0.09428. We will therefore use model_back from here on.

Linearity

Test the linearity assumption with cor.test():

H0: no linear correlation (true correlation equals 0)
H1: linear correlation (true correlation does not equal 0)

cor.test(ins_train$charges, ins_train$age)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  ins_train$charges and ins_train$age
#> t = 9.9161, df = 934, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.2494816 0.3654768
#> sample estimates:
#>       cor 
#> 0.3086262
cor.test(ins_train$charges, ins_train$children)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  ins_train$charges and ins_train$children
#> t = 2.5541, df = 934, p-value = 0.0108
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.01930792 0.14657947
#> sample estimates:
#>       cor 
#> 0.0832833
cor.test(ins_train$charges, ins_train$bmi)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  ins_train$charges and ins_train$bmi
#> t = 6.28, df = 934, p-value = 5.181e-10
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.138997 0.261982
#> sample estimates:
#>       cor 
#> 0.2012826

The p-values of all numeric variables are < 0.05, so we reject H0; each of them has a linear relationship with the variable charges.

Normality of Residuals

  1. Visualize the residual histogram using the hist() function:

hist(model_back$residuals)

  2. Run a statistical test with shapiro.test() (ideally p-value > alpha, so that the decision is to fail to reject H0).

H0: errors are normally distributed
H1: errors are not normally distributed

Expected: p-value > 0.05

Shapiro-Wilk hypothesis test:

# Shapiro Test
shapiro.test(model_back$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_back$residuals
#> W = 0.90181, p-value < 2.2e-16

The p-value is < 0.05, so we reject H0; the residuals are not normally distributed.
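A normal Q-Q plot (my addition) makes the violation easy to see: the heavy right tail of the residuals pulls the upper quantiles far away from the reference line.

qqnorm(model_back$residuals)
qqline(model_back$residuals, col = "red")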

Homoscedasticity of Residuals

  1. Visualize with a scatter plot of the fitted values against the residuals:

# plot fitted values against residuals
plot(x = model_back$fitted.values, y = model_back$residuals)
abline(h = 0, col = "red", lty = 2)

  2. Run a statistical test with Breusch-Pagan from the lmtest package.

H0: the error variance is constant (homoscedasticity)
H1: the error variance is not constant (heteroscedasticity)

library(lmtest)
bptest(model_back)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_back
#> BP = 88.999, df = 7, p-value < 2.2e-16

The p-value is < 0.05, so we reject H0; the error variance is not constant (heteroscedasticity).

No Multicollinearity

library(car)
vif(model_back)
#>              GVIF Df GVIF^(1/(2*Df))
#> age      1.020579  1        1.010237
#> bmi      1.101757  1        1.049646
#> children 1.004098  1        1.002047
#> smoker   1.004237  1        1.002116
#> region   1.089173  3        1.014338

All VIF values are < 10, so the model does not exhibit multicollinearity.

Model Evaluation

# predict charges on the unseen test set
pred_back <- predict(model_back, newdata = ins_test)

library(MLmetrics)
MAPE(y_pred = pred_back, y_true = ins_test$charges)
#> [1] 0.4261799
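For additional context (my addition), MLmetrics also provides MAE and RMSE, which express the error in dollars rather than as a percentage:

# mean absolute error and root mean squared error, in the units of charges ($)
MAE(y_pred = pred_back, y_true = ins_test$charges)
RMSE(y_pred = pred_back, y_true = ins_test$charges)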

Conclusion

Based on the tests above, the model meets the linearity assumption and shows no multicollinearity, but the assumptions of residual normality and homoscedasticity are violated. The MAPE (Mean Absolute Percentage Error) of 0.42 means that the model's predictions deviate from the actual charges by 42% on average; this is fairly high, although it stays under the 50% level I consider acceptable, so the model's performance is good enough for a first pass.