Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Load Dependencies and Data
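
The loading step itself does not appear in the rendered output, so the following is only a sketch of what it might look like. It assumes the Kaggle CSV was saved locally as insurance.csv (the file name and location are assumptions) and that dplyr and ggplot2 are the packages behind the `%>%`, `case_when()`, and `ggplot()` calls used later.

```r
# Packages used later in this post (assumed, since the setup chunk is not shown)
library(dplyr)    # %>%, select(), mutate(), case_when(), filter()
library(ggplot2)  # ggplot(), geom_jitter(), geom_smooth()

# Hypothetical local path to the Kaggle "Medical Cost Personal Datasets" CSV
data_path <- "insurance.csv"

# Read the file only if it is present, so the chunk does not error elsewhere
if (file.exists(data_path)) {
  raw_data <- read.csv(data_path)
  print(head(raw_data))
}
```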

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

This week's discussion explores the relationship between obesity and medical costs. Medical costs are an essential expense in every region of America, especially for people with obesity.

Data Wrangling

medical_cost_obesity_data <- raw_data %>%
  select(age, sex, bmi, smoker, region, charges) %>%
  mutate(
    # 1/0 indicator for smoking status
    smoker_ind = ifelse(smoker == "yes", 1, 0),
    # BMI bands; cutoffs kept exactly as used for the results below
    weight_group = case_when(
      bmi < 18.5 ~ "underweight",
      bmi < 24.9 ~ "normal",
      bmi < 29.9 ~ "overweight",
      TRUE ~ "obese"),
    # Note: "yes" covers both the overweight and the obese groups
    obesity = ifelse(weight_group %in% c("overweight", "obese"), "yes", "no")) %>%
  # Keep only overweight and obese respondents
  filter(obesity == "yes")

head(medical_cost_obesity_data)
##   age    sex   bmi smoker    region   charges smoker_ind weight_group obesity
## 1  19 female 27.90    yes southwest 16884.924          1   overweight     yes
## 2  18   male 33.77     no southeast  1725.552          0        obese     yes
## 3  28   male 33.00     no southeast  4449.462          0        obese     yes
## 4  32   male 28.88     no northwest  3866.855          0   overweight     yes
## 5  31 female 25.74     no southeast  3756.622          0   overweight     yes
## 6  46 female 33.44     no southeast  8240.590          0        obese     yes

Summary and Linear Regression

obesity_lm <- lm(charges ~ bmi + age + smoker, data = medical_cost_obesity_data)
summary(obesity_lm)
## 
## Call:
## lm(formula = charges ~ bmi + age + smoker, data = medical_cost_obesity_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13089.8  -2437.6   -736.5   1077.8  26585.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12208.23    1257.35  -9.709   <2e-16 ***
## bmi            311.33      35.77   8.703   <2e-16 ***
## age            268.13      12.72  21.087   <2e-16 ***
## smokeryes    26697.09     446.49  59.793   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5904 on 1092 degrees of freedom
## Multiple R-squared:  0.7883, Adjusted R-squared:  0.7877 
## F-statistic:  1355 on 3 and 1092 DF,  p-value: < 2.2e-16
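
To make the coefficients concrete, here is a small worked example that plugs values into the fitted equation by hand. The coefficients are copied from the summary output above; the helper `predict_charges()` and the example person (BMI 30, age 40) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(obesity_lm) output
coefs <- c(intercept = -12208.23, bmi = 311.33, age = 268.13, smokeryes = 26697.09)

# Hypothetical helper: evaluates the fitted linear equation by hand
predict_charges <- function(bmi, age, smoker) {
  unname(coefs["intercept"] +
           coefs["bmi"] * bmi +
           coefs["age"] * age +
           coefs["smokeryes"] * as.numeric(smoker))
}

# Same illustrative person, with and without smoking
non_smoker <- predict_charges(bmi = 30, age = 40, smoker = FALSE)
smoker     <- predict_charges(bmi = 30, age = 40, smoker = TRUE)

print(c(non_smoker = non_smoker, smoker = smoker))
```

With the fitted model object available, `predict(obesity_lm, newdata = ...)` should give essentially the same numbers, up to the rounding of the printed coefficients.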

Data Visualization

# Scatter plot of charges against BMI with a fitted least-squares line
ggplot(data = medical_cost_obesity_data, aes(x = bmi, y = charges)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  xlab("BMI") +
  ylab("Medical costs") +
  theme_minimal()

par(mfrow=c(2,2))
plot(obesity_lm)
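
The diagnostic plots can be complemented with numeric checks. The sketch below is not run on the insurance data: it uses simulated data with deliberately right-skewed noise (all variable values and effect sizes are made up) to show how a Shapiro-Wilk test and a simple spread check behave, and how a log-transformed response might compare. The same calls could be applied to `residuals(obesity_lm)`.

```r
set.seed(42)  # reproducible SIMULATED data, not the insurance data

n <- 500
sim <- data.frame(
  bmi    = runif(n, 25, 45),
  age    = sample(18:64, n, replace = TRUE),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)
# Log-normal noise, so residuals on the raw scale are skewed on purpose
sim$charges <- exp(7 + 0.02 * sim$bmi + 0.03 * sim$age +
                     1.5 * (sim$smoker == "yes") + rnorm(n, sd = 0.4))

fit_raw <- lm(charges ~ bmi + age + smoker, data = sim)
fit_log <- lm(log(charges) ~ bmi + age + smoker, data = sim)

# Shapiro-Wilk: a small p-value suggests the residuals are not normal
sw_raw <- shapiro.test(residuals(fit_raw))
sw_log <- shapiro.test(residuals(fit_log))

# Rank correlation of |residual| with fitted value: a rough check for
# residual spread that grows with the prediction (heteroscedasticity)
spread_raw <- cor(abs(residuals(fit_raw)), fitted(fit_raw), method = "spearman")
spread_log <- cor(abs(residuals(fit_log)), fitted(fit_log), method = "spearman")

print(c(W_raw = unname(sw_raw$statistic), W_log = unname(sw_log$statistic)))
print(c(spread_raw = spread_raw, spread_log = spread_log))
```

Whether a log transformation would help the actual insurance model is not established here; the simulation only illustrates what the checks look like when the raw-scale assumptions fail.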

Residual Analysis and Conclusion

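Based on the model summary, the median residual is -736.5, while the residuals range from -13089.8 to 26585.9. A negative median combined with a much longer positive tail suggests that the residuals are right-skewed rather than normally distributed, so the linear model does not fully satisfy the normality assumption. The F-statistic is very high (1355 on 3 and 1092 degrees of freedom, p-value < 2.2e-16), so the null hypothesis that all slope coefficients are zero can be rejected. The coefficient for bmi is positive and significant (311.33, p < 2e-16), which points to a strong relationship between medical costs and obesity. In addition, the multiple R-squared shows that the model explains 78.83% of the variation in medical costs. There are a few outliers among the residuals.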
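
The F-statistic and adjusted R-squared can be reproduced from the reported R-squared and degrees of freedom, which is a quick way to see how these summary numbers relate to each other. The inputs below are copied from the summary output; the variable names are my own.

```r
# Values copied from the summary output
r2       <- 0.7883  # multiple R-squared
k        <- 3       # number of predictors (bmi, age, smoker)
df_resid <- 1092    # residual degrees of freedom

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Adjusted R-squared; n - 1 equals df_resid + k
adj_r2 <- 1 - (1 - r2) * (df_resid + k) / df_resid

# Upper-tail p-value of the F distribution
p_value <- pf(f_stat, k, df_resid, lower.tail = FALSE)

print(c(f_stat = f_stat, adj_r2 = adj_r2, p_value = p_value))
```

The results match the reported 1355 and 0.7877 up to the rounding of R-squared.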

In conclusion, the data and the regression model show a strong and significant relationship between medical costs and obesity in America.

References

Choi, Miri. “Medical Cost Personal Datasets.” Kaggle, 21 Feb. 2018, www.kaggle.com/datasets/mirichoi0218/insurance.

Centers for Disease Control and Prevention. “About Overweight and Obesity.” 24 Feb. 2023, www.cdc.gov/obesity/about-obesity/index.html.