In this analysis, I will use the Insurance Premium Prediction dataset from Kaggle and build a multiple linear regression model to determine which variables in the dataset are significant predictors of medical expenses.
library(readr)    # read_csv()
library(dplyr)    # %>% and select()
library(corrplot) # corrplot()
url <- "https://raw.githubusercontent.com/kristinlussi/DATA605/main/Week12/insurance.csv"
premium_data <- read_csv(url, show_col_types = FALSE) %>%
  as.data.frame()
head(premium_data)
## age sex bmi children smoker region expenses
## 1 19 female 27.9 0 yes southwest 16884.92
## 2 18 male 33.8 1 no southeast 1725.55
## 3 28 male 33.0 3 no southeast 4449.46
## 4 33 male 22.7 0 no northwest 21984.47
## 5 32 male 28.9 0 no northwest 3866.86
## 6 31 female 25.7 0 no southeast 3756.62
The data is already in a workable format. I will rename the “expenses” column to “medical_expenses” so it is more descriptive, and I will create dummy variables for the region column.
# rename expenses to medical_expenses
names(premium_data)[7] <- "medical_expenses"
# dummy variables for region column
premium_data$northeast <- ifelse(premium_data$region == "northeast", 1, 0)
premium_data$southeast <- ifelse(premium_data$region == "southeast", 1, 0)
premium_data$northwest <- ifelse(premium_data$region == "northwest", 1, 0)
premium_data$southwest <- ifelse(premium_data$region == "southwest", 1, 0)
# select all but region column
premium_data <- premium_data %>%
  select(age, sex, bmi, children, smoker, northeast, southeast, northwest, southwest, medical_expenses)
# glimpse of the data
head(premium_data)
## age sex bmi children smoker northeast southeast northwest southwest
## 1 19 female 27.9 0 yes 0 0 0 1
## 2 18 male 33.8 1 no 0 1 0 0
## 3 28 male 33.0 3 no 0 1 0 0
## 4 33 male 22.7 0 no 0 0 1 0
## 5 32 male 28.9 0 no 0 0 1 0
## 6 31 female 25.7 0 no 0 1 0 0
## medical_expenses
## 1 16884.92
## 2 1725.55
## 3 4449.46
## 4 21984.47
## 5 3866.86
## 6 3756.62
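As an aside, base R can generate these dummy columns automatically with model.matrix(); here is a minimal sketch on a fresh copy of the data that still contains the region column (raw is just an illustrative name):
# sketch: model.matrix() builds one dummy column per region level;
# the -1 removes the intercept so all four levels get their own column
raw <- read_csv(url, show_col_types = FALSE)
head(model.matrix(~ region - 1, data = raw))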
There are no missing values in the dataset:
# search for missing values
sum(is.na(premium_data))
## [1] 0
Here is a summary of the data:
summary(premium_data)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :16.00 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.67 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.70 3rd Qu.:2.000
## Max. :64.00 Max. :53.10 Max. :5.000
## smoker northeast southeast northwest
## Length:1338 Min. :0.0000 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :0.0000 Median :0.000 Median :0.0000
## Mean :0.2422 Mean :0.272 Mean :0.2429
## 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000
## southwest medical_expenses
## Min. :0.0000 Min. : 1122
## 1st Qu.:0.0000 1st Qu.: 4740
## Median :0.0000 Median : 9382
## Mean :0.2429 Mean :13270
## 3rd Qu.:0.0000 3rd Qu.:16640
## Max. :1.0000 Max. :63770
We want to select a variable that we expect to have a non-linear relationship with the dependent variable (medical expenses).
The effect of age on medical expenses may strengthen or weaken non-linearly as age increases, so we will choose age for the quadratic term. (Both BMI and age would be reasonable candidates.)
premium_data$age_squared <- premium_data$age^2
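As a quick informal check of this choice, a scatterplot with a lowess smoother can reveal curvature in the age/expenses relationship (an exploratory sketch using base R):
# exploratory sketch: look for curvature in the age/expenses relationship
plot(premium_data$age, premium_data$medical_expenses,
     xlab = "Age", ylab = "Medical Expenses")
lines(lowess(premium_data$age, premium_data$medical_expenses),
      col = "red", lwd = 2)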
There are six dichotomous variables in the dataset: smoker, sex, and the four region indicators. The regions are already encoded, so we will also encode smoker (yes = 1, no = 0) and sex (male = 1, female = 0).
premium_data$sex <- ifelse(premium_data$sex == "male", 1, 0) # male = 1 & female = 0
premium_data$smoker <- ifelse(premium_data$smoker == "yes", 1, 0) # yes = 1 & no = 0
I think it would be interesting to evaluate the interaction between BMI and smoker.
premium_data$smoker_bmi_interaction <- premium_data$smoker * premium_data$bmi
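As an aside, R can build the same product term directly inside a model formula, making the manual column optional; a minimal sketch (alt_fit is just an illustrative name):
# sketch: smoker:bmi requests the same product term inside the formula
# (smoker * bmi would additionally include bmi's own main effect)
alt_fit <- lm(medical_expenses ~ smoker + smoker:bmi, data = premium_data)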
corrplot(cor(premium_data), method = "square")
pairs(premium_data)
premium.lm <- lm(medical_expenses ~ age + age_squared + sex + smoker + northeast + southeast + northwest + southwest + smoker_bmi_interaction, data = premium_data)
summary(premium.lm)
##
## Call:
## lm(formula = medical_expenses ~ age + age_squared + sex + smoker +
## northeast + southeast + northwest + southwest + smoker_bmi_interaction,
## data = premium_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14768.2 -1945.4 -1412.0 -347.7 30258.1
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.361e+02 1.174e+03 0.797 0.42537
## age 8.124e+01 6.237e+01 1.302 0.19299
## age_squared 2.340e+00 7.785e-01 3.006 0.00270 **
## sex -4.717e+02 2.677e+02 -1.762 0.07826 .
## smoker -2.092e+04 1.484e+03 -14.090 < 2e-16 ***
## northeast 1.148e+03 3.831e+02 2.996 0.00279 **
## southeast 1.221e+01 3.733e+02 0.033 0.97391
## northwest 6.150e+02 3.823e+02 1.609 0.10792
## southwest NA NA NA NA
## smoker_bmi_interaction 1.460e+03 4.723e+01 30.907 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4869 on 1329 degrees of freedom
## Multiple R-squared: 0.8393, Adjusted R-squared: 0.8384
## F-statistic: 867.8 on 8 and 1329 DF, p-value: < 2.2e-16
The adjusted \(R^2\) is 0.8384, which means that the variables used in the model can account for 83.84% of the variability in medical expenses.
Note that the coefficient for southwest is reported as NA: the four region dummies always sum to 1, so together with the intercept they are perfectly collinear, and lm() drops one of them (the “dummy variable trap”). According to the p-values, the following variables are statistically significant at the 0.05 level: age_squared, smoker, northeast, and smoker_bmi_interaction.
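A quick check confirms the collinearity behind that NA row: every observation belongs to exactly one region, so the four dummy columns sum to 1 in every row and are linearly dependent with the intercept.
# each row belongs to exactly one region, so the dummies sum to 1 per row;
# together with the intercept this is perfect collinearity, and lm() drops one
all(rowSums(premium_data[, c("northeast", "southeast", "northwest", "southwest")]) == 1)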
The p-value for the model as a whole is very low, which is a good sign that the variables selected are significant predictors of medical expenses.
Let’s see if we can improve the adjusted \(R^2\) by removing some of the insignificant variables.
premium.lm.improved <- lm(medical_expenses ~ age + age_squared + sex + smoker + northeast + northwest + smoker_bmi_interaction, data = premium_data)
summary(premium.lm.improved)
##
## Call:
## lm(formula = medical_expenses ~ age + age_squared + sex + smoker +
## northeast + northwest + smoker_bmi_interaction, data = premium_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14765.4 -1943.8 -1410.8 -348.2 30263.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.430e+02 1.155e+03 0.817 0.41434
## age 8.121e+01 6.234e+01 1.303 0.19294
## age_squared 2.340e+00 7.782e-01 3.007 0.00269 **
## sex -4.716e+02 2.676e+02 -1.763 0.07816 .
## smoker -2.092e+04 1.482e+03 -14.117 < 2e-16 ***
## northeast 1.141e+03 3.299e+02 3.459 0.00056 ***
## northwest 6.086e+02 3.291e+02 1.850 0.06459 .
## smoker_bmi_interaction 1.460e+03 4.710e+01 30.995 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4867 on 1330 degrees of freedom
## Multiple R-squared: 0.8393, Adjusted R-squared: 0.8385
## F-statistic: 992.5 on 7 and 1330 DF, p-value: < 2.2e-16
Removing southeast and southwest slightly improves the adjusted \(R^2\) to 83.85%.
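The two adjusted \(R^2\) values can also be pulled directly from each fit for a side-by-side comparison:
# side-by-side comparison of the two fits
c(full = summary(premium.lm)$adj.r.squared,
  reduced = summary(premium.lm.improved)$adj.r.squared)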
In the following Q-Q plot, the residuals track the reference line roughly between theoretical quantiles -1.5 and 1, but the tails deviate from the line, with many large positive outliers. This suggests the model may not be well-fit; we cannot assume normality of the residuals based on this plot.
qqnorm(resid(premium.lm.improved))
qqline(resid(premium.lm.improved))
In the plot below, there is no obvious pattern and the residuals are scattered about 0. However, the points are densely clustered toward the left side of the graph, which makes the spread hard to assess. This suggests the model may not be well-fit; we cannot assume homoscedasticity based on this plot.
plot(fitted(premium.lm.improved),resid(premium.lm.improved))
abline(h = 0, col = "red", lty = 2)
par(mfrow=c(2,2))
plot(premium.lm.improved)
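To complement these visual impressions, formal tests can quantify both concerns; a sketch, assuming the lmtest package is installed (shapiro.test() is in base R):
# formal checks to complement the diagnostic plots
library(lmtest)
# Shapiro-Wilk test of residual normality (H0: residuals are normally distributed)
shapiro.test(resid(premium.lm.improved))
# Breusch-Pagan test for heteroscedasticity (H0: constant error variance)
bptest(premium.lm.improved)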
In conclusion, we were able to determine the significant predictors of medical expenses: age squared, smoker status, the northeast region, and the smoker-BMI interaction.
We were able to create a linear model with a very low overall p-value and an adjusted \(R^2\) of 83.85%, which indicates that the variables selected for the model are jointly significant predictors of medical expenses.
However, the normal Q-Q and residuals vs. fitted values plots show that the linear model we created is not entirely appropriate. Based on the Q-Q plot, we cannot assume normality of the residuals, which is essential for valid inference in a linear regression model. Based on the residuals vs. fitted values plot, we cannot assume homoscedasticity, which is also essential for the model's validity.