Insurance

Initial Analysis

What the data contains

This is an analysis of the data available at https://www.kaggle.com/mirichoi0218/insurance. It contains age, sex, BMI, children, smoker, region data as well as insurance charges for over 1000 example subjects.

Initial Reading

insurance <- read_csv("~/cRamp/insurance/insurance.csv")

## Parsed with column specification:
## cols(
##   age = col_double(),
##   sex = col_character(),
##   bmi = col_double(),
##   children = col_double(),
##   smoker = col_character(),
##   region = col_character(),
##   charges = col_double()
## )

Here is a closer look at some of the categories.

Sex

insurance %>% group_by(sex) %>% summarise("Average charges" = mean(charges)) -> avgChargeGender
ggplot(avgChargeGender, aes(sex, `Average charges`, fill = sex)) + geom_col()

From the bar graph we see men as a whole pay a bit more than women.

female <- filter(insurance, sex == "female")
male <- filter(insurance, sex == "male")
par(mfrow=c(1,2))
ggplot(data=female, aes(age)) + geom_histogram(binwidth=10) + labs(title = "Female")

ggplot(data=male, aes(age)) + geom_histogram(binwidth=10) + labs(title = "Male")

These histograms show that this sample has more younger people than older people.

Charges

summary(insurance$charges)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1122    4740    9382   13270   16640   63770

ggplot(data=insurance, aes(charges))+ geom_histogram(binwidth=1000) + labs(title = "Distribution of Charges")

It looks like most people are paying under $20,000 for insurance costs.

Age

youngAdult <- filter(insurance, age < 32)
Adult <- filter(insurance, (age >= 32) & (age < 46))
Senior <- filter(insurance, age >= 46)
ggplot(insurance, aes(age)) + geom_histogram(binwidth=2) + labs(title = "Age")

There are a lot of 18 year olds in this sample.

Smoking

smoker <- filter(insurance, smoker == "yes")
nonsmoker <- filter(insurance, smoker == "no")
ggplot(data=smoker, aes(age)) + geom_histogram(binwidth=2) + labs(title = "Smoker")

ggplot(data=nonsmoker, aes(age)) + geom_histogram(binwidth=2) + labs(title = "Non-smoker")

From the first histogram, it seems like there are a lot of smoking 18 year olds. After looking at the second one, we see that there are also a lot of non-smoking 18 year olds. So that difference we see compared to the other age groups is due to the fact that there are a lot of 18 year olds in the sample.

How some characteristics relate with charges

BMI

ggplot(insurance, aes(bmi, charges, color = sex)) + geom_point() + labs(title = "BMI vs Charges")

ggplot(insurance, aes(bmi, charges, color = sex)) + geom_boxplot() + labs(title = "BMI vs Charges") + coord_flip()

BMI doesn’t appear to have much of an impact on the charges.

Smoking

par(mfrow=c(1,2))
ggplot(insurance, aes(smoker, charges, color = smoker)) + geom_boxplot() + labs(title = "Overall Smoker vs Charges") + coord_flip()

Since we noticed there were a lot of 18 year olds, let’s take a closer look at just that age group.

youth <- filter(insurance, age == 18)
ggplot(youth, aes(smoker, charges, color = smoker)) + geom_boxplot() + labs(title = "18 year olds: Smoker vs Charges") + coord_flip()

Here we can clearly see that smoking plays a big role in insurance costs.

Region

insurance %>% group_by(region) %>% summarise("Average charges" = mean(charges)) -> avgChargeRegion
ggplot(avgChargeRegion, aes(region, `Average charges`, fill = region)) + geom_col()

People in the Southeast seem to pay a little more, but people in the other regions have about the same average cost.

Linear model

linear.model <- lm(charges ~ sex + bmi + smoker + region + children + age + age*smoker, data = insurance)
summary(linear.model)

## 
## Call:
## lm(formula = charges ~ sex + bmi + smoker + region + children + 
##     age + age * smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11296.5  -2832.8   -970.8   1420.9  29775.1 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11612.13    1010.15 -11.495  < 2e-16 ***
## sexmale           -135.16     332.79  -0.406 0.684706    
## bmi                340.62      28.60  11.910  < 2e-16 ***
## smokeryes        22105.00    1213.13  18.222  < 2e-16 ***
## regionnorthwest   -362.54     476.08  -0.762 0.446480    
## regionsoutheast  -1060.06     478.73  -2.214 0.026977 *  
## regionsouthwest   -943.31     477.82  -1.974 0.048566 *  
## children           471.41     137.76   3.422 0.000641 ***
## age                247.73      13.31  18.617  < 2e-16 ***
## smokeryes:age       45.13      29.53   1.529 0.126626    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6059 on 1328 degrees of freedom
## Multiple R-squared:  0.7514, Adjusted R-squared:  0.7497 
## F-statistic: 445.9 on 9 and 1328 DF,  p-value: < 2.2e-16

lm.results <- predict(linear.model, newdata = insurance)

test <- data.frame("Predicted" = lm.results, "Actual" = insurance$charges)
ggplot(test, aes(Actual, Predicted)) + geom_point() + geom_abline()

From the graph and the R-squared value we see that the linear model is not the best fit for this data. If it was a perfect fit, the data points would all be on the line. Instead, there are two groups that are near the line and many points that are not even close.

Insurance

Yung-Hsin Tung

March 8, 2019

Initial Analysis

What the data contains

Initial Reading

Sex

Charges

Age

Smoking

How some characteristics relate with charges

BMI

Smoking

Region

Linear model