Data 606 Project Proposal

Research question

Is BMI a significant predictor of the cost of insurance? Is the relationship between BMI and cost stronger among smokers?

Cases

There are 1,338 cases, each an American with basic health information: age (18-64), sex, BMI, number of children, smoking status, region of residence within the US, and insurance charges.

Data collection

The data come from a prepared dataset on Kaggle.com. A copy is stored in my GitHub repository so that anyone who wishes to replicate this study can do so with my code.

Type of study

This is an observational study.

Data Source

Original: https://www.kaggle.com/mirichoi0218/insurance

Project Source: https://raw.githubusercontent.com/jconno/Data-606-project/main/insurance.csv

Dependent Variable

The cost of insurance is the dependent/response variable, which is quantitative.

Independent Variable

BMI (quantitative) and smoker (qualitative) are the independent variables.

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plots, boxplots, etc.). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

insurance <- read.csv("https://raw.githubusercontent.com/jconno/Data-606-project/main/insurance.csv")

summary(insurance)

##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770

BMI appears to be approximately normally distributed (mean 30.66, median 30.40), and each individual is assumed to be independent of the other individuals in this study.

hist(insurance$bmi)
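The histogram gives a visual impression only. As a complement, a normal Q-Q plot and a Shapiro-Wilk test could be used to check how close BMI is to a normal distribution. The following is a minimal sketch, not part of the original analysis: the helper name `check_normality` is hypothetical, and `insurance` is assumed to be the data frame loaded above.

```r
# Hypothetical helper (not part of the original analysis): summarises how
# close a numeric vector is to a normal distribution.
check_normality <- function(x) {
  x <- x[!is.na(x)]
  list(
    mean = mean(x),
    median = median(x),
    # Shapiro-Wilk test from base R's stats package; requires 3 <= n <= 5000
    shapiro_p = shapiro.test(x)$p.value
  )
}

# Assumes `insurance` was loaded as shown earlier in this proposal.
if (exists("insurance")) {
  print(check_normality(insurance$bmi))
  qqnorm(insurance$bmi)  # points close to the reference line suggest normality
  qqline(insurance$bmi)
}
```

With more than a thousand observations, the Shapiro-Wilk test can flag even small departures from normality, so the Q-Q plot is likely the more informative of the two checks.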

Most participants are non-smokers.

table(insurance$smoker)

## 
##   no  yes 
## 1064  274
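The counts can also be expressed as percentages, which makes the imbalance between the two groups easier to read. A minimal sketch follows; the helper name `smoker_share` is hypothetical, and `insurance` is assumed to be the data frame loaded above.

```r
# Hypothetical helper: percentage of observations in each smoking category,
# rounded to one decimal place.
smoker_share <- function(smoker) {
  round(100 * prop.table(table(smoker)), 1)
}

# Assumes `insurance` was loaded as shown earlier in this proposal.
if (exists("insurance")) {
  print(smoker_share(insurance$smoker))
}
```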

library(ggplot2)
library(tidyverse)

## -- Attaching packages -------------------------- tidyverse 1.3.0 --

## v tibble  3.0.3     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## v purrr   0.3.4

## -- Conflicts ----------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Graphing the relationship between BMI and Insurance cost

insurance_plot <- data.frame(insurance)

# Refer to columns by bare name inside aes(); using `$` there bypasses the data argument
ggplot(insurance_plot, aes(x = bmi, y = charges)) + geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

From this figure, we can see a positive association between BMI and insurance costs.

Filtering dataset for smokers and non-smokers.

no_smoke <- insurance %>% filter(smoker == "no")

yes_smoke <- insurance %>% filter(smoker == "yes")

Plotting relationship of BMI and cost for non-smokers

# Bare column names inside aes() are resolved against no_smoke
ggplot(no_smoke, aes(x = bmi, y = charges)) + geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Plotting relationship of BMI and cost for smokers

# Bare column names inside aes() are resolved against yes_smoke
ggplot(yes_smoke, aes(x = bmi, y = charges)) + geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Conclusion

In the graph for smokers, the curve of insurance cost against BMI is more dramatic than in the graph for non-smokers. In addition, insurance costs are generally much higher among smokers than among non-smokers.
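The visual comparison between the two groups could be backed by a number, for example the Pearson correlation between BMI and charges within each smoking group. The following is a minimal sketch rather than part of the original analysis; the helper name `bmi_cost_correlation` is hypothetical, and `insurance` is assumed to be the data frame loaded above.

```r
# Hypothetical helper: Pearson correlation between BMI and charges,
# computed separately for each level of the `smoker` column.
bmi_cost_correlation <- function(data) {
  sapply(
    split(data, data$smoker),
    function(group) cor(group$bmi, group$charges)
  )
}

# Assumes `insurance` was loaded as shown earlier in this proposal.
if (exists("insurance")) {
  print(bmi_cost_correlation(insurance))
}
```

Pearson correlation only measures linear association, so a curved relationship like the one suggested by the smoothed lines may be understated by this number.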

Both graphs show a positive association between BMI and insurance costs. I will perform a linear regression analysis to see whether BMI and smoking predict the cost of insurance. For the project, the following hypothesis test will be conducted:

Using a significance level of 0.05 (a 95% level of confidence), we hypothesize the following:

Null Hypothesis: There is no predictability of insurance costs with regard to BMI and smoking.

Alternative Hypothesis: There exists predictability of insurance costs with regard to BMI and smoking.
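One way such a test could be carried out is an overall F-test that compares an intercept-only model (no predictability) with a regression of charges on BMI, smoking status, and their interaction. The following is a minimal sketch under stated assumptions, not the final analysis: the helper names `fit_charges_model` and `overall_f_test` are hypothetical, the interaction term is an assumption chosen to let the BMI slope differ for smokers, and `insurance` is assumed to be the data frame loaded above.

```r
# Hypothetical helpers, not part of the original analysis.

# Fit charges on BMI, smoking status, and their interaction, so the BMI
# slope is allowed to differ between smokers and non-smokers.
fit_charges_model <- function(data) {
  lm(charges ~ bmi * smoker, data = data)
}

# Overall F-test: compare the full model against an intercept-only model
# and return the p-value of that comparison.
overall_f_test <- function(data) {
  null_model <- lm(charges ~ 1, data = data)
  full_model <- fit_charges_model(data)
  anova(null_model, full_model)[["Pr(>F)"]][2]
}

# Assumes `insurance` was loaded as shown earlier in this proposal.
if (exists("insurance")) {
  print(summary(fit_charges_model(insurance)))
  p_value <- overall_f_test(insurance)
  print(p_value < 0.05)  # TRUE would mean rejecting the null at the 0.05 level
}
```

In the model summary, the `bmi:smokeryes` coefficient is the term that speaks to the second research question, since it estimates how much the BMI slope changes for smokers.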