This is an analysis of the data available at https://www.kaggle.com/mirichoi0218/insurance. It contains age, sex, BMI, children, smoker, region data as well as insurance charges for over 1000 example subjects.
insurance <- read_csv("~/cRamp/insurance/insurance.csv")
## Parsed with column specification:
## cols(
## age = col_double(),
## sex = col_character(),
## bmi = col_double(),
## children = col_double(),
## smoker = col_character(),
## region = col_character(),
## charges = col_double()
## )
Here is a closer look at some of the categories.
insurance %>% group_by(sex) %>% summarise("Average charges" = mean(charges)) -> avgChargeGender
ggplot(avgChargeGender, aes(sex, `Average charges`, fill = sex)) + geom_col()
From the bar graph we see men as a whole pay a bit more than women.
female <- filter(insurance, sex == "female")
male <- filter(insurance, sex == "male")
par(mfrow=c(1,2))
ggplot(data=female, aes(age)) + geom_histogram(binwidth=10) + labs(title = "Female")
ggplot(data=male, aes(age)) + geom_histogram(binwidth=10) + labs(title = "Male")
These histograms show that this sample has more younger people than older people.
summary(insurance$charges)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1122 4740 9382 13270 16640 63770
ggplot(data=insurance, aes(charges))+ geom_histogram(binwidth=1000) + labs(title = "Distribution of Charges")
It looks like most people are paying under $20,000 for insurance costs.
youngAdult <- filter(insurance, age < 32)
Adult <- filter(insurance, (age >= 32) & (age < 46))
Senior <- filter(insurance, age >= 46)
ggplot(insurance, aes(age)) + geom_histogram(binwidth=2) + labs(title = "Age")
There are a lot of 18 year olds in this sample.
smoker <- filter(insurance, smoker == "yes")
nonsmoker <- filter(insurance, smoker == "no")
ggplot(data=smoker, aes(age)) + geom_histogram(binwidth=2) + labs(title = "Smoker")
ggplot(data=nonsmoker, aes(age)) + geom_histogram(binwidth=2) + labs(title = "Non-smoker")
From the first histogram, it seems like there are a lot of smoking 18 year olds. After looking at the second one, we see that there are also a lot of non-smoking 18 year olds. So that difference we see compared to the other age groups is due to the fact that there are a lot of 18 year olds in the sample.
ggplot(insurance, aes(bmi, charges, color = sex)) + geom_point() + labs(title = "BMI vs Charges")
ggplot(insurance, aes(bmi, charges, color = sex)) + geom_boxplot() + labs(title = "BMI vs Charges") + coord_flip()
BMI doesn’t appear to have much of an impact on the charges.
par(mfrow=c(1,2))
ggplot(insurance, aes(smoker, charges, color = smoker)) + geom_boxplot() + labs(title = "Overall Smoker vs Charges") + coord_flip()
Since we noticed there were a lot of 18 year olds, let’s take a closer look at just that age group.
youth <- filter(insurance, age == 18)
ggplot(youth, aes(smoker, charges, color = smoker)) + geom_boxplot() + labs(title = "18 year olds: Smoker vs Charges") + coord_flip()
Here we can clearly see that smoking plays a big role in insurance costs.
insurance %>% group_by(region) %>% summarise("Average charges" = mean(charges)) -> avgChargeRegion
ggplot(avgChargeRegion, aes(region, `Average charges`, fill = region)) + geom_col()
People in the Southeast seem to pay a little more, but people in the other regions have about the same average cost.
linear.model <- lm(charges ~ sex + bmi + smoker + region + children + age + age*smoker, data = insurance)
summary(linear.model)
##
## Call:
## lm(formula = charges ~ sex + bmi + smoker + region + children +
## age + age * smoker, data = insurance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11296.5 -2832.8 -970.8 1420.9 29775.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11612.13 1010.15 -11.495 < 2e-16 ***
## sexmale -135.16 332.79 -0.406 0.684706
## bmi 340.62 28.60 11.910 < 2e-16 ***
## smokeryes 22105.00 1213.13 18.222 < 2e-16 ***
## regionnorthwest -362.54 476.08 -0.762 0.446480
## regionsoutheast -1060.06 478.73 -2.214 0.026977 *
## regionsouthwest -943.31 477.82 -1.974 0.048566 *
## children 471.41 137.76 3.422 0.000641 ***
## age 247.73 13.31 18.617 < 2e-16 ***
## smokeryes:age 45.13 29.53 1.529 0.126626
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6059 on 1328 degrees of freedom
## Multiple R-squared: 0.7514, Adjusted R-squared: 0.7497
## F-statistic: 445.9 on 9 and 1328 DF, p-value: < 2.2e-16
lm.results <- predict(linear.model, newdata = insurance)
test <- data.frame("Predicted" = lm.results, "Actual" = insurance$charges)
ggplot(test, aes(Actual, Predicted)) + geom_point() + geom_abline()
From the graph and the R-squared value we see that the linear model is not the best fit for this data. If it was a perfect fit, the data points would all be on the line. Instead, there are two groups that are near the line and many points that are not even close.