Using insurance data, I analyzed how different variables affect insurance costs. The data, which I downloaded from Kaggle, covers 1,338 anonymous individuals and includes their age, sex, BMI (body mass index), number of children, smoking status, and the U.S. region they live in, along with their medical charges. Using this data, I also set out to predict a person’s insurance costs from these variables.

library(sqldf)
library(ggplot2)
insurance <- read.csv("~/Downloads/insurance.csv") 
head(insurance)
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

To begin the EDA, I break down the data by region to see how the regions compare to each other.

region <- sqldf("SELECT AVG(bmi) AS avg_bmi,region,AVG(charges) AS avg_charges FROM insurance GROUP BY region")
r <- ggplot(region, aes(region,avg_charges)) + geom_bar(stat="identity",fill='maroon') + theme_minimal()
plot(r)

As we can see, the southeast of the U.S. has the highest average insurance costs of any region. Conversely, the southwest is the region with the lowest average insurance costs, just edging out the northwest. Next, I compare smokers and non-smokers to see how big the difference is between the two categories.

smoker <- sqldf("SELECT AVG(bmi) AS avg_bmi,smoker,AVG(charges) AS avg_charges FROM insurance GROUP BY smoker")
s <- ggplot(smoker, aes(smoker,avg_charges)) + geom_bar(stat="identity",fill='black') + ylab("Average Charge")+ theme_minimal()
plot(s)

Unsurprisingly, smokers have much higher medical costs than non-smokers, with average charges topping an extraordinary $30,000 per year. Insurance companies, knowing the person is a smoker, charge much higher premiums than average to offset the risk they take on by insuring a smoker.
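
The grouped averages behind both bar charts can also be computed without SQL. As a minimal sketch, the base R `aggregate()` call below mirrors the `GROUP BY` logic using only the six rows printed by `head(insurance)` above. The name `sample_rows` is just for illustration, and the averages it produces describe those six rows, not the full data set.

```r
# Six rows copied from the head(insurance) output; NOT the full data set.
sample_rows <- data.frame(
  bmi     = c(27.900, 33.770, 33.000, 22.705, 28.880, 25.740),
  smoker  = c("yes", "no", "no", "no", "no", "no"),
  region  = c("southwest", "southeast", "southeast",
              "northwest", "northwest", "southeast"),
  charges = c(16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622)
)

# Same idea as: SELECT region, AVG(bmi), AVG(charges) ... GROUP BY region
by_region <- aggregate(cbind(bmi, charges) ~ region,
                       data = sample_rows, FUN = mean)
names(by_region) <- c("region", "avg_bmi", "avg_charges")

# Same idea as the GROUP BY smoker query
by_smoker <- aggregate(cbind(bmi, charges) ~ smoker,
                       data = sample_rows, FUN = mean)
names(by_smoker) <- c("smoker", "avg_bmi", "avg_charges")

by_region
by_smoker
```

Swapping `sample_rows` for the full `insurance` data frame should give the same summaries as the `sqldf` queries, with one row per group.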

Interestingly, some of my hypotheses about certain groups’ medical costs turned out to be wrong. For instance, I figured that older people or people with higher BMIs would have noticeably higher medical costs than others.

bmi_cor <- cor(insurance$bmi,insurance$charges)
age_cor <- cor(insurance$age,insurance$charges)
bmi_cor
## [1] 0.198341
age_cor
## [1] 0.2990082

However, the correlation between a person’s BMI and their health costs is only 0.20, well below the mark necessary to indicate a strong correlation. The same goes for age, which has only a 0.30 correlation with medical charges, well below my expectations.
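
One way to read these correlations: in a one-variable linear fit, the squared correlation is the share of the variance in charges that the variable explains on its own. The short sketch below simply squares the two values printed above; the name `r_squared` is only for illustration.

```r
# Correlations copied from the output above.
bmi_cor <- 0.198341
age_cor <- 0.2990082

# Squared correlation = share of variance explained by a single predictor
# in a simple (one-variable) linear regression.
r_squared <- round(c(bmi = bmi_cor^2, age = age_cor^2), 3)
r_squared
```

On its own, BMI accounts for roughly 4% of the variation in charges and age for roughly 9%.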

To predict someone’s medical costs, I fit a multiple linear regression model.

mod <- lm(charges ~ age + sex + bmi + children + smoker + region, data=insurance)
summary(mod)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## sexmale           -131.3      332.9  -0.394 0.693348    
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.0      477.9  -2.009 0.044765 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16

Using a female non-smoker living in the northeast as the baseline, the model predicts that medical costs rise with age, BMI, and number of children, and that being a smoker raises them by far the most (about $23,849). On the other hand, being male and living in any of the other regions lowers predicted costs. Some of this goes against intuition, as the southeast is a notably less healthy region than the northeast and women are thought to be healthier than men on average. The standard error for sex is large relative to its estimate, though (332.9 against -131.3, p = 0.69), so the variable is not statistically significant; the same is true of the northwest region (p = 0.46). We can also see that the multiple R-squared for this regression is 0.75, meaning that about 75% of the variation in charges is explained by the variables included.
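
To make the point about sex concrete, here is a small sketch that builds a 95% confidence interval for the `sexmale` coefficient by hand, using only the estimate, standard error, and residual degrees of freedom printed in the summary. The variable names are illustrative.

```r
# Values copied from the summary(mod) output above.
est <- -131.3    # sexmale estimate
se  <- 332.9     # sexmale standard error
df  <- 1329      # residual degrees of freedom

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
ci
```

The interval runs from roughly -$780 to +$520, so it comfortably includes zero, which is another way of saying the data cannot tell whether being male raises or lowers charges. With the fitted model in memory, `confint(mod)` returns intervals like this for every coefficient.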

Using this model, I enter my personal variables to determine what my medical costs would be.

pred_1 <- predict(mod, newdata=data.frame(age=21,sex='male',bmi=23.1,children=0,smoker='no',region='northeast'))
pred_1
##        1 
## 1159.499
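
The prediction above is just the fitted equation evaluated at my values. As a rough check, the sketch below redoes the arithmetic by hand with the rounded coefficients from the summary table; `coefs` and `manual_pred` are names made up for this illustration.

```r
# Coefficients copied (rounded) from the summary(mod) output above.
coefs <- c(intercept = -11938.5, age = 256.9, sexmale = -131.3, bmi = 339.2)

# 21-year-old male, BMI 23.1, no children, non-smoker, northeast.
# children = 0, smoker = 'no', and the baseline region (northeast)
# add nothing to the total, so their terms are left out.
manual_pred <- coefs[["intercept"]] +
  coefs[["age"]] * 21 +
  coefs[["sexmale"]] * 1 +
  coefs[["bmi"]] * 23.1
manual_pred
```

The hand calculation lands within a couple of dollars of the `predict()` result; the small gap comes from using coefficients rounded to one decimal place.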

The model predicts I would have an annual medical cost of $1,159, which sounds about right. Next, I predict the medical costs of two women: one of similar age and BMI to me, but who’s a smoker, and a 60-year-old non-smoker with a high BMI and three children.

pred_2 <- predict(mod, newdata=data.frame(age=25,sex='female',bmi=25,children=0,smoker='yes',region='southwest'))
pred_2
##        1 
## 25851.19
pred_3 <- predict(mod, newdata=data.frame(age=60,sex='female',bmi=45,children=3,smoker='no',region='southeast'))
pred_3
##        1 
## 19128.03

We see here just how much being a smoker raises someone’s annual medical costs. Even at a young age and with a relatively low BMI, the first woman has to pay $25,851, likely from extremely high premiums and deductibles. In comparison, the second woman, despite being 60 years old with a much higher BMI and three children, only has to pay a little over $19,000 per year.
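
To see where that gap comes from, the sketch below breaks the difference between the two predictions into one piece per variable, again using the rounded coefficients from the summary table. The names `b` and `diff_terms` are illustrative only.

```r
# Coefficients copied (rounded) from the summary(mod) output above.
b <- c(age = 256.9, bmi = 339.2, children = 475.5, smokeryes = 23848.5,
       regionsoutheast = -1035.0, regionsouthwest = -960.0)

# Contribution of each variable to (second woman - first woman),
# i.e. pred_3 minus pred_2. Both are female, so sex drops out.
diff_terms <- c(
  age      = b[["age"]] * (60 - 25),
  bmi      = b[["bmi"]] * (45 - 25),
  children = b[["children"]] * (3 - 0),
  region   = b[["regionsoutheast"]] - b[["regionsouthwest"]],
  smoker   = -b[["smokeryes"]]   # second woman does not smoke
)
diff_terms
sum(diff_terms)
```

Age, BMI, and children together add about $17,200 to the second woman’s predicted costs, but not smoking takes off about $23,850, which is why her total still comes out lower.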