Initial Analysis

What the data contains

This is an analysis of the data available at https://www.kaggle.com/mirichoi0218/insurance. It contains age, sex, BMI, children, smoker, region data as well as insurance charges for over 1000 example subjects.

Initial Reading

insurance <- read_csv("~/cRamp/insurance/insurance.csv")
## Parsed with column specification:
## cols(
##   age = col_double(),
##   sex = col_character(),
##   bmi = col_double(),
##   children = col_double(),
##   smoker = col_character(),
##   region = col_character(),
##   charges = col_double()
## )

Here is a closer look at some of the categories.

Sex

insurance %>% group_by(sex) %>% summarise("Average charges" = mean(charges)) -> avgChargeGender
ggplot(avgChargeGender, aes(sex, `Average charges`, fill = sex)) + geom_col()

From the bar graph we see men as a whole pay a bit more than women.

female <- filter(insurance, sex == "female")
male <- filter(insurance, sex == "male")
par(mfrow=c(1,2))
ggplot(data=female, aes(age)) + geom_histogram(binwidth=10) + labs(title = "Female")

ggplot(data=male, aes(age)) + geom_histogram(binwidth=10) + labs(title = "Male")

These histograms show that this sample has more younger people than older people.

Charges

summary(insurance$charges)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1122    4740    9382   13270   16640   63770
ggplot(data=insurance, aes(charges))+ geom_histogram(binwidth=1000) + labs(title = "Distribution of Charges")

It looks like most people are paying under $20,000 for insurance costs.

Age

youngAdult <- filter(insurance, age < 32)
Adult <- filter(insurance, (age >= 32) & (age < 46))
Senior <- filter(insurance, age >= 46)
ggplot(insurance, aes(age)) + geom_histogram(binwidth=2) + labs(title = "Age")

There are a lot of 18 year olds in this sample.

Smoking

smoker <- filter(insurance, smoker == "yes")
nonsmoker <- filter(insurance, smoker == "no")
ggplot(data=smoker, aes(age)) + geom_histogram(binwidth=2) + labs(title = "Smoker")

ggplot(data=nonsmoker, aes(age)) + geom_histogram(binwidth=2) + labs(title = "Non-smoker")

From the first histogram, it seems like there are a lot of smoking 18 year olds. After looking at the second one, we see that there are also a lot of non-smoking 18 year olds. So that difference we see compared to the other age groups is due to the fact that there are a lot of 18 year olds in the sample.

How some characteristics relate with charges

BMI

ggplot(insurance, aes(bmi, charges, color = sex)) + geom_point() + labs(title = "BMI vs Charges")

ggplot(insurance, aes(bmi, charges, color = sex)) + geom_boxplot() + labs(title = "BMI vs Charges") + coord_flip()

BMI doesn’t appear to have much of an impact on the charges.

Smoking

par(mfrow=c(1,2))
ggplot(insurance, aes(smoker, charges, color = smoker)) + geom_boxplot() + labs(title = "Overall Smoker vs Charges") + coord_flip()

Since we noticed there were a lot of 18 year olds, let’s take a closer look at just that age group.

youth <- filter(insurance, age == 18)
ggplot(youth, aes(smoker, charges, color = smoker)) + geom_boxplot() + labs(title = "18 year olds: Smoker vs Charges") + coord_flip()

Here we can clearly see that smoking plays a big role in insurance costs.

Region

insurance %>% group_by(region) %>% summarise("Average charges" = mean(charges)) -> avgChargeRegion
ggplot(avgChargeRegion, aes(region, `Average charges`, fill = region)) + geom_col()

People in the Southeast seem to pay a little more, but people in the other regions have about the same average cost.

Linear model

linear.model <- lm(charges ~ sex + bmi + smoker + region + children + age + age*smoker, data = insurance)
summary(linear.model)
## 
## Call:
## lm(formula = charges ~ sex + bmi + smoker + region + children + 
##     age + age * smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11296.5  -2832.8   -970.8   1420.9  29775.1 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11612.13    1010.15 -11.495  < 2e-16 ***
## sexmale           -135.16     332.79  -0.406 0.684706    
## bmi                340.62      28.60  11.910  < 2e-16 ***
## smokeryes        22105.00    1213.13  18.222  < 2e-16 ***
## regionnorthwest   -362.54     476.08  -0.762 0.446480    
## regionsoutheast  -1060.06     478.73  -2.214 0.026977 *  
## regionsouthwest   -943.31     477.82  -1.974 0.048566 *  
## children           471.41     137.76   3.422 0.000641 ***
## age                247.73      13.31  18.617  < 2e-16 ***
## smokeryes:age       45.13      29.53   1.529 0.126626    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6059 on 1328 degrees of freedom
## Multiple R-squared:  0.7514, Adjusted R-squared:  0.7497 
## F-statistic: 445.9 on 9 and 1328 DF,  p-value: < 2.2e-16
lm.results <- predict(linear.model, newdata = insurance)

test <- data.frame("Predicted" = lm.results, "Actual" = insurance$charges)
ggplot(test, aes(Actual, Predicted)) + geom_point() + geom_abline()

From the graph and the R-squared value we see that the linear model is not the best fit for this data. If it was a perfect fit, the data points would all be on the line. Instead, there are two groups that are near the line and many points that are not even close.