DATA 606 Data Project - Inference on Medical Costs

Statistical Inference on Medical Costs

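Load Packages

# load packages used throughout the report
library(tidyverse)  # readr (read_csv), dplyr (%>%, select, mutate), ggplot2
library(infer)      # specify(), generate(), calculate(), get_ci()
library(reshape2)   # melt() for the correlation heat map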

Part 1 - Introduction

Medical costs have become an essential expense across regions of the United States, especially for people with obesity. According to the CDC, obesity is a serious and costly chronic disease for adults and children, and its prevalence continues to increase in the United States. This project explores the relationship between obesity and medical costs.

Part 2 - Data

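The dataset has 1338 observations and 7 variables. For the proportion inference, the response variable is obesity, a yes/no flag derived from bmi. For the linear regression, charges is the response variable, and bmi, age, and smoker are the explanatory variables.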

# load data
raw_data <- data.frame(read_csv("Resources/insurance.csv"))

head(raw_data)

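# keep the modeling columns and derive the smoker and weight-group indicators
medical_cost_data <- raw_data %>%
  select(age, sex, bmi, smoker, region, charges) %>%
  mutate(smoker_ind = ifelse(smoker == "yes", 1, 0),
         # BMI cutoffs as used in this analysis; case_when() takes the first match
         weight_group = case_when(
           bmi < 18.5 ~ "underweight",
           bmi < 24.9 ~ "normal",
           bmi < 29.9 ~ "overweight",
           TRUE ~ "obese"),
         # obesity flags the overweight and obese groups together
         obesity = ifelse(weight_group %in% c("overweight", "obese"), "yes", "no"))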

head(medical_cost_data)
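As a rough illustration of how these cutoffs behave, the sketch below applies the same case_when() rules to a handful of made-up BMI values (they are not taken from insurance.csv). Because the conditions are checked in order, a BMI of 24.95 falls through to the overweight group under a 24.9 cutoff, and the obesity flag is "yes" for both the overweight and obese groups.

```r
library(dplyr)

# hypothetical BMI values chosen to sit near the cutoffs (not from insurance.csv)
toy_bmi <- data.frame(bmi = c(17.0, 22.0, 24.95, 27.5, 29.95, 35.0))

toy_groups <- toy_bmi %>%
  mutate(weight_group = case_when(
           bmi < 18.5 ~ "underweight",
           bmi < 24.9 ~ "normal",
           bmi < 29.9 ~ "overweight",
           TRUE ~ "obese"),
         # same yes/no flag as in the report: overweight and obese together
         obesity = ifelse(weight_group %in% c("overweight", "obese"), "yes", "no"))

toy_groups
```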

Part 3 - Exploratory data analysis

Summary

summary(medical_cost_data)

##       age            sex                 bmi           smoker         
##  Min.   :18.00   Length:1338        Min.   :15.96   Length:1338       
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   Class :character  
##  Median :39.00   Mode  :character   Median :30.40   Mode  :character  
##  Mean   :39.21                      Mean   :30.66                     
##  3rd Qu.:51.00                      3rd Qu.:34.69                     
##  Max.   :64.00                      Max.   :53.13                     
##     region             charges        smoker_ind     weight_group      
##  Length:1338        Min.   : 1122   Min.   :0.0000   Length:1338       
##  Class :character   1st Qu.: 4740   1st Qu.:0.0000   Class :character  
##  Mode  :character   Median : 9382   Median :0.0000   Mode  :character  
##                     Mean   :13270   Mean   :0.2048                     
##                     3rd Qu.:16640   3rd Qu.:0.0000                     
##                     Max.   :63770   Max.   :1.0000                     
##    obesity         
##  Length:1338       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Histogram on Regions

# histogram plot based on charges and weight group separated by regions
ggplot(medical_cost_data, aes(charges, fill = weight_group)) +
  geom_histogram() +
  xlab("Medical costs") +
  ylab("Count of medical billing") +
  guides(fill = guide_legend(title = "Weight group")) +
  theme_minimal() +
  facet_wrap(~ region)

Bar Plot on Genders

# bar plot based on charges and genders separated by regions and weight group
ggplot(medical_cost_data, aes(x = sex, y = charges, fill = weight_group)) +
  geom_bar(stat = "identity") +
  xlab("Gender") +
  ylab("Medical costs") +
  theme_minimal() +
  coord_flip() +
  facet_wrap(~ region)

Scatter Plot on BMI and Obesity

# scatter plot of charges against bmi, colored by weight group and faceted by region
ggplot(medical_cost_data, aes(x = bmi, y = charges, color = weight_group)) +
  geom_jitter() +
  xlab("BMI") +
  ylab("Medical costs") +
  labs(color = "Weight group") +
  theme_minimal() +
  facet_wrap(~ region)

Part 4 - Inference

Hypothesis

H0 (null hypothesis): Obesity has no correlation with medical cost charges.

H1 (alternative hypothesis): Obesity is correlated with medical cost charges.
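One way to put these hypotheses to a test is a correlation test between a 0/1 obesity indicator and charges. The sketch below uses base R's cor.test() on simulated stand-in data. The data frame toy, its group sizes, and its group means are all made up for illustration, and this is not an analysis run in this report. The same call could be pointed at medical_cost_data instead.

```r
set.seed(606)  # arbitrary seed so the simulated data are reproducible

# simulated stand-in data (NOT the insurance data): two groups with different mean charges
toy <- data.frame(
  obesity = rep(c("yes", "no"), times = c(80, 20)),
  charges = c(rnorm(80, mean = 14000, sd = 3000),
              rnorm(20, mean = 10000, sd = 3000))
)

# encode the yes/no flag as 0/1, mirroring how smoker_ind is built
toy$obesity_ind <- ifelse(toy$obesity == "yes", 1, 0)

# Pearson correlation test: H0 is that the true correlation is zero
test_result <- cor.test(toy$obesity_ind, toy$charges)

test_result$estimate  # sample correlation
test_result$p.value   # reject H0 at the 5% level if this is below 0.05
```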

Inference on Obesity

# 95% bootstrap confidence interval for the proportion with obesity == "yes"
medical_cost_data %>%
  specify(response = obesity, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)

# calculate the obesity percentage
obesity_percentage <- medical_cost_data %>%
  count(obesity, sort = TRUE) %>%
  mutate(freq = n/sum(n)*100)

# calculate the margin of error based on obesity
obesity_n <- nrow(medical_cost_data)
z_score <- 1.96
obesity_p <- mean(medical_cost_data$obesity == "yes")  # proportion flagged "yes", independent of sort order
obesity_moe <- round(z_score * sqrt((obesity_p*(1-obesity_p))/obesity_n), 4)

# display margin of error
obesity_moe

## [1] 0.0206

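The 95% bootstrap confidence interval for the proportion flagged as obese (the overweight and obese groups combined) runs from 0.7981876 to 0.8408072. The margin of error is 0.0206, about two percentage points on either side of the estimate, so the interval is narrow and the estimate is precise.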
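The interval and the margin of error can be cross-checked by hand. The sketch below assumes that 1096 of the 1338 rows have obesity == "yes", a count inferred from the regression output later in this report (1092 residual degrees of freedom plus 4 coefficients). It compares the normal-approximation interval with a percentile bootstrap written in base R; the seed is arbitrary, and the resampling is only a rough stand-in for the infer pipeline.

```r
set.seed(606)  # arbitrary seed so the resampling is reproducible

n_total <- 1338  # observations in the dataset
n_yes <- 1096    # assumed count with obesity == "yes" (1092 residual df + 4 coefficients)
p_hat <- n_yes / n_total

# normal-approximation margin of error and interval
moe <- 1.96 * sqrt(p_hat * (1 - p_hat) / n_total)
normal_ci <- c(lower = p_hat - moe, upper = p_hat + moe)

# percentile bootstrap: resample the yes/no flags with replacement
obesity_flag <- rep(c("yes", "no"), times = c(n_yes, n_total - n_yes))
boot_props <- replicate(1000, mean(sample(obesity_flag, replace = TRUE) == "yes"))
boot_ci <- quantile(boot_props, probs = c(0.025, 0.975))

round(moe, 4)
normal_ci
boot_ci
```

Under these assumptions, round(moe, 4) reproduces the 0.0206 margin of error, and the two intervals should land close to each other.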

Linear Regression

# create a dataframe based on obesity
obesity_data <- medical_cost_data %>%
  filter(obesity == "yes")

# build a correlation matrix and melt it to long format for plotting
obesity_data2 <- obesity_data %>%
  select(age, bmi, charges, smoker_ind)

obesity_cormat <- round(cor(obesity_data2), 2)

melted_obesity_cormat <- melt(obesity_cormat)

# create a heat map
ggplot(data = melted_obesity_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() +
  geom_text(aes(label = round(value, 1))) +
  theme_minimal()

# box plots to check for outliers; bmi and age are binned so each bin gets its own box
ggplot(obesity_data, aes(x = cut_width(bmi, 5), y = charges)) +
  geom_boxplot() +
  xlab("BMI (bins of width 5)") +
  ylab("Medical costs") +
  theme_minimal()

ggplot(obesity_data, aes(x = cut_width(age, 10), y = charges)) +
  geom_boxplot() +
  xlab("Age (bins of width 10)") +
  ylab("Medical costs") +
  theme_minimal()

obesity_lm <- lm(charges ~ bmi + age + smoker, data = obesity_data)
summary(obesity_lm)

## 
## Call:
## lm(formula = charges ~ bmi + age + smoker, data = obesity_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13089.8  -2437.6   -736.5   1077.8  26585.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12208.23    1257.35  -9.709   <2e-16 ***
## bmi            311.33      35.77   8.703   <2e-16 ***
## age            268.13      12.72  21.087   <2e-16 ***
## smokeryes    26697.09     446.49  59.793   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5904 on 1092 degrees of freedom
## Multiple R-squared:  0.7883, Adjusted R-squared:  0.7877 
## F-statistic:  1355 on 3 and 1092 DF,  p-value: < 2.2e-16
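To read the coefficients concretely, the sketch below plugs the rounded estimates from the summary into the fitted equation for a hypothetical 40-year-old with a BMI of 30, once as a non-smoker and once as a smoker. The helper predict_charges() is made up for this illustration. With the fitted model in hand, predict(obesity_lm, newdata = ...) would do the same job without rounding.

```r
# rounded coefficient estimates copied from the summary(obesity_lm) output
coefs <- c(intercept = -12208.23, bmi = 311.33, age = 268.13, smokeryes = 26697.09)

# hypothetical helper: evaluate the fitted linear equation for one person
predict_charges <- function(bmi, age, smoker) {
  unname(coefs["intercept"] +
           coefs["bmi"] * bmi +
           coefs["age"] * age +
           coefs["smokeryes"] * as.numeric(smoker == "yes"))
}

non_smoker <- predict_charges(bmi = 30, age = 40, smoker = "no")
smoker <- predict_charges(bmi = 30, age = 40, smoker = "yes")

non_smoker
smoker
smoker - non_smoker  # equals the smokeryes coefficient
```

The two predictions differ by exactly the smokeryes coefficient, which is how the model encodes smoking status. Since the model was fit only on rows with obesity == "yes", it should not be used for BMI values below the overweight cutoff.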

ggplot(data = obesity_data, aes(x = bmi, y = charges)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  xlab("BMI") +
  ylab("Medical costs") +
  theme_minimal()

qqnorm(obesity_lm$residuals)
qqline(obesity_lm$residuals)

Part 5 - Conclusion

Among people flagged as overweight or obese, the regression shows that BMI, age, and smoking status are all significant predictors of charges (adjusted R-squared of 0.7877), with smoking status carrying the largest coefficient. The outliers may be tied to the southeast region, or to health problems that can arise when a person cannot control their weight.

These results suggest taking a closer look at health care and insurance. Recent news reports describe insurers using AI unethically, letting it decide who receives money without people double-checking those decisions. This is a widespread problem, and it may help explain why health care is so difficult to obtain in some places and why some areas are affected more than others. A related factor is the ratio of doctors' offices and hospitals to residents. In more rural areas, it can take up to an hour to reach the closest hospital or doctor. On top of that come the price of gas to get there, the medical bill itself, and medication that can cost thousands, all of which could discourage someone from spending the time and money to visit a clinic. Access to the resources needed to stay healthy could be another factor.

These considerations should be kept in mind when looking at this data and asking whether weight gain is driven by environment, health problems, or lack of care. Moving forward, we should investigate these issues and try to help others find the correct tools for care.

References

Choi, Miri. “Medical Cost Personal Datasets.” Kaggle, 21 Feb. 2018, www.kaggle.com/datasets/mirichoi0218/insurance.

CDC. “About Overweight and Obesity.” Centers for Disease Control and Prevention, 24 Feb. 2023, www.cdc.gov/obesity/about-obesity/index.html.