Medical Expenses

In this analysis, I will be using the Insurance Premium Prediction dataset from Kaggle. I will determine which variables in the dataset are significant predictors of medical expenses by creating a multiple linear regression model.

Import Data

url <- "https://raw.githubusercontent.com/kristinlussi/DATA605/main/Week12/insurance.csv"
premium_data <- read_csv(url, show_col_types = FALSE) %>%
  as.data.frame()

head(premium_data)
##   age    sex  bmi children smoker    region expenses
## 1  19 female 27.9        0    yes southwest 16884.92
## 2  18   male 33.8        1     no southeast  1725.55
## 3  28   male 33.0        3     no southeast  4449.46
## 4  33   male 22.7        0     no northwest 21984.47
## 5  32   male 28.9        0     no northwest  3866.86
## 6  31 female 25.7        0     no southeast  3756.62

Clean Data

The data is already in an OK format. I will rename the “expenses” column to “medical_expenses” so it’s more descriptive. I will also create dummy variables for the region column.

# rename expenses to medical_expenses
names(premium_data)[7] <- "medical_expenses" 

# dummy variables for region column
premium_data$northeast <- ifelse(premium_data$region == "northeast", 1, 0)
premium_data$southeast <- ifelse(premium_data$region == "southeast", 1, 0)
premium_data$northwest <- ifelse(premium_data$region == "northwest", 1, 0)
premium_data$southwest <- ifelse(premium_data$region == "southwest", 1, 0)

# select all but region column
premium_data <- premium_data %>%
  select(age, sex, bmi, children, smoker, northeast, southeast, northwest, southwest, medical_expenses)

# glimpse of the data
head(premium_data)
##   age    sex  bmi children smoker northeast southeast northwest southwest
## 1  19 female 27.9        0    yes         0         0         0         1
## 2  18   male 33.8        1     no         0         1         0         0
## 3  28   male 33.0        3     no         0         1         0         0
## 4  33   male 22.7        0     no         0         0         1         0
## 5  32   male 28.9        0     no         0         0         1         0
## 6  31 female 25.7        0     no         0         1         0         0
##   medical_expenses
## 1         16884.92
## 2          1725.55
## 3          4449.46
## 4         21984.47
## 5          3866.86
## 6          3756.62

There are no missing values in the dataset:

# search for missing values
sum(is.na(premium_data))
## [1] 0

Here is a summary of the data:

summary(premium_data)
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :16.00   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.67   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.70   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.10   Max.   :5.000  
##     smoker            northeast        southeast       northwest     
##  Length:1338        Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :0.000   Median :0.0000  
##                     Mean   :0.2422   Mean   :0.272   Mean   :0.2429  
##                     3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:0.0000  
##                     Max.   :1.0000   Max.   :1.000   Max.   :1.0000  
##    southwest      medical_expenses
##  Min.   :0.0000   Min.   : 1122   
##  1st Qu.:0.0000   1st Qu.: 4740   
##  Median :0.0000   Median : 9382   
##  Mean   :0.2429   Mean   :13270   
##  3rd Qu.:0.0000   3rd Qu.:16640   
##  Max.   :1.0000   Max.   :63770

Selecting at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction ter

Quadratic Term

We want to select a variable that we expect to have a non-linear relationship with the dependent variable (medical expenses).

The impact of age on medical expenses might increase or decrease non-linearly as age increases, so we will choose age as the quadratic term. (Both BMI and age could be considered for the quadratic term).

premium_data$age_squared <- premium_data$age^2

Dichotomous Term

There are 6 dichotomous variables in the dataset: smoker, sex, and the four region terms. The regions are already encoded, but we will also encode smoker and sex as yes = 1 and no = 0 and male = 1 and female = 0, respectively.

premium_data$sex <- ifelse(premium_data$sex == "male", 1, 0) # male = 1 & female = 0
premium_data$smoker <-ifelse(premium_data$smoker == "yes", 1, 0) # yes = 1 & no = 0

Dichotomous vs. Quantitative Interaction Term

I think it would be interesting to evaluate the interaction between BMI and smoker.

premium_data$smoker_bmi_interaction <- premium_data$smoker * premium_data$bmi

Visualizing the Relationships in the Data

Correlation Plot

corrplot(cor(premium_data), method = "square")

Scatter Plots

pairs(premium_data)

Build Linear Model

premium.lm <- lm(medical_expenses ~ age + age_squared + sex + smoker + northeast + southeast + northwest + southwest + smoker_bmi_interaction, data = premium_data)

summary(premium.lm)
## 
## Call:
## lm(formula = medical_expenses ~ age + age_squared + sex + smoker + 
##     northeast + southeast + northwest + southwest + smoker_bmi_interaction, 
##     data = premium_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14768.2  -1945.4  -1412.0   -347.7  30258.1 
## 
## Coefficients: (1 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             9.361e+02  1.174e+03   0.797  0.42537    
## age                     8.124e+01  6.237e+01   1.302  0.19299    
## age_squared             2.340e+00  7.785e-01   3.006  0.00270 ** 
## sex                    -4.717e+02  2.677e+02  -1.762  0.07826 .  
## smoker                 -2.092e+04  1.484e+03 -14.090  < 2e-16 ***
## northeast               1.148e+03  3.831e+02   2.996  0.00279 ** 
## southeast               1.221e+01  3.733e+02   0.033  0.97391    
## northwest               6.150e+02  3.823e+02   1.609  0.10792    
## southwest                      NA         NA      NA       NA    
## smoker_bmi_interaction  1.460e+03  4.723e+01  30.907  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4869 on 1329 degrees of freedom
## Multiple R-squared:  0.8393, Adjusted R-squared:  0.8384 
## F-statistic: 867.8 on 8 and 1329 DF,  p-value: < 2.2e-16

The adjusted \(R^2\) is 0.8384, which means that the variables used in the model can account for 83.84% of the variability in medical expenses.

According to the p values, the following variables are statistically significant:

  • age_squared
  • smoker
  • northeast
  • smoker_bmi_interaction

The p-value for the model as a whole is very low, which is a good sign that the variables selected are significant predictors of medical expenses.

Let’s see if we can try to improve the adjusted \(R^2\) by removing some of the insignificant variables.

premium.lm.improved <- lm(medical_expenses ~ age + age_squared + sex + smoker + northeast + northwest + smoker_bmi_interaction, data = premium_data)

summary(premium.lm.improved)
## 
## Call:
## lm(formula = medical_expenses ~ age + age_squared + sex + smoker + 
##     northeast + northwest + smoker_bmi_interaction, data = premium_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14765.4  -1943.8  -1410.8   -348.2  30263.5 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             9.430e+02  1.155e+03   0.817  0.41434    
## age                     8.121e+01  6.234e+01   1.303  0.19294    
## age_squared             2.340e+00  7.782e-01   3.007  0.00269 ** 
## sex                    -4.716e+02  2.676e+02  -1.763  0.07816 .  
## smoker                 -2.092e+04  1.482e+03 -14.117  < 2e-16 ***
## northeast               1.141e+03  3.299e+02   3.459  0.00056 ***
## northwest               6.086e+02  3.291e+02   1.850  0.06459 .  
## smoker_bmi_interaction  1.460e+03  4.710e+01  30.995  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4867 on 1330 degrees of freedom
## Multiple R-squared:  0.8393, Adjusted R-squared:  0.8385 
## F-statistic: 992.5 on 7 and 1330 DF,  p-value: < 2.2e-16

Removing southeast and southwest slightly improves the \(R^2\) to 83.85%.

Residual Analysis

In the following graph, the residuals fall on the line between -1.5 and 1, but there are many outliers. This is a sign that the model may not be well-fit. We cannot assume normality based on this plot.

qqnorm(resid(premium.lm.improved))
qqline(resid(premium.lm.improved))

In the below plot, we can see that there is no obvious pattern and the residuals are scattered about 0. However, the points are densely populated towards the left side of the graph. This is a sign that the model may not be well-fit. We cannot assume homoscedasticity based on this plot.

plot(fitted(premium.lm.improved),resid(premium.lm.improved))
abline(h = 0, col = "red", lty = 2)

par(mfrow=c(2,2))
plot(premium.lm.improved)

Conclusion

In conclusion, we were able to determine the significant predictors of medical expenses, being:

  • age_squared
  • smoker
  • northeast
  • smoker_bmi_interaction

We were able to create a linear model that showed a low p-value and a adjusted \(R^2\) of 83.85%, which overall shows that the variables selected for the model are statistically significant.

However, we have shown that the linear model we created was not appropriate from the normal Q-Q and residuals vs fitted values plots. Due to the Q-Q plot, we cannot assume normality, which is essential for the validity of a linear regression model. Due to the residuals vs fitted values plot, we cannot assume homoscedasticity, which is also essesntial for the validity of a linear regression model.

Sources

Kaggle Dataset