Is BMI a significant predictor of the cost of insurance? Is it a stronger predictor when one is a smoker?
There are 1,338 cases, each an American with basic health information such as age (18-64), sex, BMI, number of children, smoking status, region of residence within the US, and insurance charges.
The data come from a prepared dataset on Kaggle.com. A copy is stored in my GitHub repository so that anyone who wishes to replicate this study with my code can reproduce the results.
The data from this study is observational.
Original: https://www.kaggle.com/mirichoi0218/insurance
Project Source: https://raw.githubusercontent.com/jconno/Data-606-project/main/insurance.csv
The cost of insurance is the dependent/response variable, which is quantitative.
BMI (quantitative) and smoker (qualitative) are the independent variables.
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plots, boxplots, etc.). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
insurance <- read.csv("https://raw.githubusercontent.com/jconno/Data-606-project/main/insurance.csv")
summary(insurance)
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770  
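BMI appears to be approximately normally distributed: its mean (30.66) and median (30.40) are close, and the histogram below gives a visual check. Each individual is assumed to be independent of the other individuals in this study.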
hist(insurance$bmi)
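A histogram alone leaves the normality judgment somewhat subjective. As a rough supplementary check, a normal Q-Q plot and a Shapiro-Wilk test could be run on BMI. This is only a minimal sketch and not part of the original analysis; the helper name `check_normality` is hypothetical.

```r
# Hypothetical helper (not from the original analysis): draws a normal
# Q-Q plot for a numeric vector and returns the Shapiro-Wilk test result.
check_normality <- function(x) {
  x <- x[!is.na(x)]   # drop missing values before plotting and testing
  qqnorm(x)           # points close to the reference line suggest normality
  qqline(x)
  shapiro.test(x)     # accepts between 3 and 5000 observations
}
```

For example, `check_normality(insurance$bmi)` would apply it to the BMI column loaded above. With 1,338 observations the test can flag even minor departures from normality, so the Q-Q plot is usually the more useful of the two.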
Most participants are non-smokers (1,064 of 1,338):
table(insurance$smoker)
## 
##   no  yes 
## 1064  274
library(ggplot2)
library(tidyverse)
## -- Attaching packages -------------------------- tidyverse 1.3.0 --
## v tibble 3.0.3 v dplyr 1.0.0
## v tidyr 1.1.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ----------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Graphing the relationship between BMI and insurance cost:
insurance_plot <- data.frame(insurance)
# Refer to columns by name inside aes(); ggplot2 discourages using `$` there
ggplot(insurance_plot, aes(x = bmi, y = charges)) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
From this figure, we can see a positive relationship between BMI and insurance costs.
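A smoothed curve shows the shape of the trend but not its strength. One way to quantify the linear association, sketched here under the assumption that a Pearson correlation is appropriate for these data, is `cor.test()`. The helper name `bmi_charges_cor` is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper: Pearson correlation between BMI and charges,
# with a 95% confidence interval. Assumes numeric columns `bmi` and `charges`.
bmi_charges_cor <- function(df) {
  cor.test(df$bmi, df$charges, method = "pearson", conf.level = 0.95)
}
```

Calling `bmi_charges_cor(insurance)` would return the estimated correlation, its confidence interval, and a p-value for the null hypothesis of zero correlation.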
# Split the data by smoking status and plot each group separately
no_smoke <- insurance %>% filter(smoker == "no")
yes_smoke <- insurance %>% filter(smoker == "yes")
ggplot(no_smoke, aes(x = bmi, y = charges)) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
ggplot(yes_smoke, aes(x = bmi, y = charges)) + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The BMI curve for smokers is more dramatic than the curve for non-smokers. In addition, insurance costs are generally much higher among smokers than among non-smokers.
Both graphs show a positive relationship between BMI and insurance costs. I will perform a linear regression analysis to see whether BMI and smoking predict the cost of insurance. For the project, the following hypothesis test will be conducted:
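The comparison between the two groups can also be put in numbers. The sketch below, which is an illustration rather than part of the original analysis, summarises charges and the BMI-charges correlation within each smoking group using base R only. The helper name `summarise_by_smoker` is hypothetical, and it assumes the data frame has columns `smoker`, `bmi`, and `charges`.

```r
# Hypothetical helper: one row per smoking group with the group size,
# median charges, and the Pearson correlation between BMI and charges.
summarise_by_smoker <- function(df) {
  groups <- split(df, df$smoker)          # list of data frames, one per group
  rows <- lapply(groups, function(g) {
    data.frame(
      smoker = g$smoker[1],
      n = nrow(g),
      median_charges = median(g$charges),
      cor_bmi_charges = cor(g$bmi, g$charges)
    )
  })
  do.call(rbind, rows)                    # stack the per-group rows
}
```

Running `summarise_by_smoker(insurance)` would give a side-by-side view of the two groups to set against the plots.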
Assuming a 95% level of confidence (a significance level of 0.05), we hypothesize the following:
Null Hypothesis: There is no predictability of insurance costs with regard to BMI and smoking.
Alternative: There exists predictability of insurance costs with regard to BMI and smoking.
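One way the planned regression might be set up is sketched below. It assumes a linear model of charges on BMI, smoking status, and their interaction; the interaction term is an assumption made here to address whether BMI matters more for smokers, and the helper name `fit_bmi_smoker` is hypothetical.

```r
# Hypothetical helper: fits charges ~ bmi + smoker + bmi:smoker.
# Assumes columns `charges` (numeric), `bmi` (numeric), and
# `smoker` (character or factor with values "no" and "yes").
fit_bmi_smoker <- function(df) {
  lm(charges ~ bmi * smoker, data = df)
}
```

With `fit <- fit_bmi_smoker(insurance)`, calling `summary(fit)` would report the overall F-test, which bears on the null hypothesis of no predictability, and the `bmi:smokeryes` coefficient, which estimates how much the BMI slope differs for smokers. Each p-value would be compared against the 0.05 significance level.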