Medical costs have become an essential expense across regions of the United States, especially for people with obesity. According to the CDC, obesity is a serious and costly chronic disease for adults and children, and its prevalence is expected to keep increasing in the United States. This project analyzes and explores the relationship between obesity and medical costs.
The dataset has 1338 observations and 7 variables. For the proportion inference, the response variable is obesity. For the regression, the response variable is charges, with BMI, age, and smoking status as predictors.
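# load the packages the code below relies on, then the data
library(tidyverse)  # read_csv, dplyr verbs, ggplot2
library(infer)      # specify, generate, calculate, get_ci
library(reshape2)   # melt
raw_data <- data.frame(read_csv("Resources/insurance.csv"))
head(raw_data)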
medical_cost_data <- raw_data %>%
  select(age, sex, bmi, smoker, region, charges) %>%
  mutate(smoker_ind = ifelse(smoker == "yes", 1, 0),
         weight_group = case_when(
           bmi < 18.5 ~ "underweight",
           bmi < 24.9 ~ "normal",
           bmi < 29.9 ~ "overweight",
           TRUE ~ "obese"),
         obesity = ifelse(weight_group %in% c("overweight", "obese"), "yes", "no"))
head(medical_cost_data)
summary(medical_cost_data)
## age sex bmi smoker
## Min. :18.00 Length:1338 Min. :15.96 Length:1338
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 Class :character
## Median :39.00 Mode :character Median :30.40 Mode :character
## Mean :39.21 Mean :30.66
## 3rd Qu.:51.00 3rd Qu.:34.69
## Max. :64.00 Max. :53.13
## region charges smoker_ind weight_group
## Length:1338 Min. : 1122 Min. :0.0000 Length:1338
## Class :character 1st Qu.: 4740 1st Qu.:0.0000 Class :character
## Mode :character Median : 9382 Median :0.0000 Mode :character
## Mean :13270 Mean :0.2048
## 3rd Qu.:16640 3rd Qu.:0.0000
## Max. :63770 Max. :1.0000
## obesity
## Length:1338
## Class :character
## Mode :character
##
##
##
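For comparison, the CDC's adult BMI categories place the boundaries at 25 and 30 rather than 24.9 and 29.9, so a BMI such as 24.95 falls in a different group under the two schemes. The sketch below shows one way to code the CDC boundaries with base R's `cut()`. The helper name `classify_bmi` and the example values are illustrative and not part of the original analysis.

```r
# Sketch: CDC adult BMI categories coded with cut().
# classify_bmi is a hypothetical helper, not part of the original analysis.
classify_bmi <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("underweight", "normal", "overweight", "obese"),
      right = FALSE)  # intervals are [lower, upper)
}

# Illustrative values chosen to sit on or near the boundaries
example_bmi <- c(17, 18.5, 24.95, 25, 29.95, 30)
data.frame(bmi = example_bmi, weight_group = classify_bmi(example_bmi))
```

Switching the analysis to these boundaries would move a small number of observations between groups, so the outputs reported in this report would need to be regenerated.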
# histogram of charges, filled by weight group and faceted by region
ggplot(medical_cost_data, aes(charges, fill = weight_group)) +
  geom_histogram(bins = 30) +
  xlab("Medical costs") +
  ylab("Count") +
  guides(fill = guide_legend(title = "Weight group")) +
  theme_minimal() +
  facet_wrap(~ region)
# stacked bar plot of total charges by sex, filled by weight group and faceted by region
ggplot(medical_cost_data, aes(x = sex, y = charges, fill = weight_group)) +
  geom_col() +
  xlab("Gender") +
  ylab("Medical costs") +
  theme_minimal() +
  coord_flip() +
  facet_wrap(~ region)
# scatter plot of charges against BMI, colored by weight group and faceted by region
ggplot(medical_cost_data, aes(x = bmi, y = charges, color = weight_group)) +
  geom_jitter() +
  xlab("BMI") +
  ylab("Medical costs") +
  labs(color = "Weight group") +
  theme_minimal() +
  facet_wrap(~ region)
H0 Hypothesis: Obesity has no correlation with medical cost charges.
H1 Hypothesis: Obesity does have a correlation with medical cost charges.
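One way to test these hypotheses directly is to compare mean charges between the two obesity groups. The sketch below uses a Welch two-sample t-test from base R, which is a swapped-in technique rather than the method used in this report, and it compares group means rather than computing a correlation. It runs on simulated data so that it is self-contained: the `toy` data frame, its group sizes, and its distribution parameters are made up for illustration.

```r
# Sketch: Welch two-sample t-test of charges by obesity group.
# The toy data are simulated for illustration only; they are NOT the insurance data.
set.seed(1)
toy <- data.frame(
  obesity = rep(c("no", "yes"), times = c(50, 200)),
  charges = c(rlnorm(50, meanlog = 9.0, sdlog = 0.8),
              rlnorm(200, meanlog = 9.3, sdlog = 0.8))
)

# t.test() defaults to the Welch version (unequal variances)
welch <- t.test(charges ~ obesity, data = toy)
welch$estimate  # group means
welch$p.value   # small values count as evidence against H0
```

With the real data, `medical_cost_data` would take the place of `toy`. Because charges are right-skewed, a rank-based test or a test on log charges may be a reasonable alternative.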
# bootstrap 95% confidence interval for the proportion with obesity == "yes"
medical_cost_data %>%
  specify(response = obesity, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
# calculate the obesity percentage
obesity_percentage <- medical_cost_data %>%
  count(obesity, sort = TRUE) %>%
  mutate(freq = n/sum(n)*100)
# calculate the margin of error based on obesity
obesity_n <- nrow(medical_cost_data)
z_score <- 1.96
obesity_p <- obesity_percentage$n[obesity_percentage$obesity == "yes"]/sum(obesity_percentage$n)
obesity_moe <- round(z_score * sqrt((obesity_p*(1-obesity_p))/obesity_n), 4)
# display margin of error
obesity_moe
## [1] 0.0206
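The 95% bootstrap confidence interval for the proportion of people with obesity == "yes" (classified as overweight or obese) has a lower bound of 0.7982 and an upper bound of 0.8408. The margin of error from the normal approximation is 0.0206. The interval is narrow, so the proportion is estimated precisely.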
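As a cross-check, the margin of error can be turned into a normal-approximation interval and compared with the bootstrap interval. The sketch below assumes that 1096 of the 1338 observations have `obesity == "yes"`. That count is inferred from the regression output later in this report (1092 residual degrees of freedom plus 4 coefficients) and is not stated directly.

```r
# Sketch: normal-approximation 95% CI for the proportion with obesity == "yes".
n_total <- 1338
n_yes <- 1096  # assumption: inferred from 1092 residual df + 4 coefficients

p_hat <- n_yes / n_total
moe <- 1.96 * sqrt(p_hat * (1 - p_hat) / n_total)
ci <- c(lower = p_hat - moe, upper = p_hat + moe)

round(c(p_hat = p_hat, moe = moe, ci), 4)
```

Under that assumption the interval comes out at roughly 0.80 to 0.84, close to the bootstrap bounds, which is what one would expect for a sample of this size.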
# create a dataframe based on obesity
obesity_data <- medical_cost_data %>%
  filter(obesity == "yes")
# create a correlation matrix and melt it into long format
obesity_data2 <- obesity_data %>%
  select(age, bmi, charges, smoker_ind)
obesity_cormat <- round(cor(obesity_data2), 2)
melted_obesity_cormat <- melt(obesity_cormat)
# create a heat map
ggplot(data = melted_obesity_cormat, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 1))) +
  theme_minimal()
# box plots to check for outliers; bmi and age are continuous, so they are binned first
ggplot(obesity_data, aes(x = bmi, y = charges, group = cut_width(bmi, 5))) +
  geom_boxplot() +
  xlab("BMI") +
  ylab("Medical costs") +
  theme_minimal()
ggplot(obesity_data, aes(x = age, y = charges, group = cut_width(age, 10))) +
  geom_boxplot() +
  xlab("Age") +
  ylab("Medical costs") +
  theme_minimal()
obesity_lm <- lm(charges ~ bmi + age + smoker, data = obesity_data)
summary(obesity_lm)
##
## Call:
## lm(formula = charges ~ bmi + age + smoker, data = obesity_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13089.8 -2437.6 -736.5 1077.8 26585.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12208.23 1257.35 -9.709 <2e-16 ***
## bmi 311.33 35.77 8.703 <2e-16 ***
## age 268.13 12.72 21.087 <2e-16 ***
## smokeryes 26697.09 446.49 59.793 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5904 on 1092 degrees of freedom
## Multiple R-squared: 0.7883, Adjusted R-squared: 0.7877
## F-statistic: 1355 on 3 and 1092 DF, p-value: < 2.2e-16
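To make the coefficients concrete, the sketch below plugs the estimates from the summary above into the fitted equation by hand. The example person (BMI 30, age 40) and the helper `predict_charges` are hypothetical and chosen only for illustration. Because the model was fitted on the obesity == "yes" group, it is only meant for BMI values in that range.

```r
# Sketch: hand-computed predictions from the reported coefficients.
# predict_charges is a hypothetical helper; the inputs are illustrative.
coefs <- c(intercept = -12208.23, bmi = 311.33, age = 268.13, smokeryes = 26697.09)

predict_charges <- function(bmi, age, smoker) {
  unname(coefs["intercept"] +
           coefs["bmi"] * bmi +
           coefs["age"] * age +
           coefs["smokeryes"] * as.numeric(smoker))
}

non_smoker <- predict_charges(bmi = 30, age = 40, smoker = FALSE)
smoker <- predict_charges(bmi = 30, age = 40, smoker = TRUE)
c(non_smoker = non_smoker, smoker = smoker, difference = smoker - non_smoker)
```

For this example the non-smoker prediction is 7856.87 and the smoker prediction is 34553.96. The gap equals the `smokeryes` coefficient, which shows how much larger the smoking effect is than the BMI or age effects per unit.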
ggplot(data = obesity_data, aes(x = bmi, y = charges)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  xlab("BMI") +
  ylab("Medical costs") +
  theme_minimal()
qqnorm(obesity_lm$residuals)
qqline(obesity_lm$residuals)
The outliers in charges may be tied to region (the southeast in particular) or to health problems that make weight difficult to control. Within the group classified as overweight or obese, the fitted model shows that BMI, age, and smoking status are each associated with higher charges (all p-values below 2e-16, adjusted R-squared 0.7877), with smoking having by far the largest coefficient.
In conclusion, health care and insurance deserve a closer look. Recent news reports describe insurers using AI unethically, letting it decide who receives money without a person double-checking those decisions. This is a widespread problem, and it is part of why health care is so difficult in some places and why some areas may be affected more than others. A related factor is the availability of doctors' offices and hospitals. In more rural areas, it may take up to an hour to reach the closest hospital or doctor. This lack of resources could make someone less likely to spend the time and money to get to a medical clinic, on top of the price of gas, the medical bill afterward, and medication that can cost thousands. Another factor could be the resources needed to stay healthy.
These factors should be considered when looking at this data and deciding whether weight gain is driven by environment, health problems, or lack of care. Moving forward, we should investigate these issues and try to help others find the right tools for care.
Choi, Miri. “Medical Cost Personal Datasets.” Kaggle, 21 Feb. 2018, www.kaggle.com/datasets/mirichoi0218/insurance.
Centers for Disease Control and Prevention. “About Overweight and Obesity.” CDC, 24 Feb. 2023, www.cdc.gov/obesity/about-obesity/index.html.