Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622
This week's discussion analyzes and explores the relationship between obesity and medical costs. Medical costs have become an essential expense across the different regions of America, especially for people with obesity.
library(dplyr)

# Keep the modelling columns, flag smokers, and bin BMI into weight groups,
# then keep only the overweight and obese individuals.
medical_cost_obesity_data <- raw_data %>%
  select(age, sex, bmi, smoker, region, charges) %>%
  mutate(smoker_ind = ifelse(smoker == "yes", 1, 0),
         weight_group = case_when(
           bmi < 18.5 ~ "underweight",
           bmi < 24.9 ~ "normal",
           bmi < 29.9 ~ "overweight",
           TRUE ~ "obesed"),
         obesity = ifelse(weight_group %in% c("overweight", "obesed"), "yes", "no")) %>%
  filter(obesity == "yes")
head(medical_cost_obesity_data)
##   age    sex   bmi smoker    region   charges smoker_ind weight_group obesity
## 1  19 female 27.90    yes southwest 16884.924          1   overweight     yes
## 2  18   male 33.77     no southeast  1725.552          0       obesed     yes
## 3  28   male 33.00     no southeast  4449.462          0       obesed     yes
## 4  32   male 28.88     no northwest  3866.855          0   overweight     yes
## 5  31 female 25.74     no southeast  3756.622          0   overweight     yes
## 6  46 female 33.44     no southeast  8240.590          0       obesed     yes
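As a cross-check on the grouping step, here is a minimal sketch of the same binning done with base R's `cut()`. It is not the method used above: the 25 and 30 cut-offs follow the CDC adult BMI categories rather than the 24.9 and 29.9 thresholds in the pipeline, and the BMI values are hypothetical ones chosen to sit on or near the boundaries.

```r
# Hypothetical BMI values chosen to sit on or near the cut-offs.
bmi_examples <- c(17.0, 18.5, 24.9, 24.95, 27.9, 29.9, 33.77)

# right = FALSE builds left-closed intervals: [18.5, 25), [25, 30), [30, Inf).
# Assumption: the 25 and 30 cut-offs follow the CDC adult BMI categories,
# not the 24.9 / 29.9 thresholds used in the pipeline above.
weight_group <- cut(
  bmi_examples,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("underweight", "normal", "overweight", "obese"),
  right = FALSE
)

data.frame(bmi = bmi_examples, weight_group = weight_group)
```

Under the thresholds in the pipeline, a BMI of 24.9 or 24.95 is labelled overweight and a BMI of 29.9 is labelled obese. With the cut-offs in this sketch they fall into the normal and overweight groups respectively, so the choice of boundary can change which rows survive the filter.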
obesity_lm <- lm(charges ~ bmi + age + smoker, data = medical_cost_obesity_data)
summary(obesity_lm)
##
## Call:
## lm(formula = charges ~ bmi + age + smoker, data = medical_cost_obesity_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -13089.8  -2437.6   -736.5   1077.8  26585.9
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12208.23    1257.35  -9.709   <2e-16 ***
## bmi            311.33      35.77   8.703   <2e-16 ***
## age            268.13      12.72  21.087   <2e-16 ***
## smokeryes    26697.09     446.49  59.793   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5904 on 1092 degrees of freedom
## Multiple R-squared: 0.7883, Adjusted R-squared: 0.7877
## F-statistic: 1355 on 3 and 1092 DF, p-value: < 2.2e-16
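To make the coefficients concrete, the sketch below plugs the estimates from the summary output into the fitted equation for one hypothetical person. The helper `predict_charges()` and the profile (age 40, BMI 30) are illustrative assumptions, not part of the original analysis. With the fitted model in memory, `predict(obesity_lm, newdata = ...)` would give the same kind of result.

```r
# Coefficients copied from the summary(obesity_lm) output above.
coefs <- c(intercept = -12208.23, bmi = 311.33, age = 268.13, smokeryes = 26697.09)

# Hypothetical helper: fitted charges for one person under the linear model.
predict_charges <- function(bmi, age, smoker) {
  unname(coefs["intercept"] + coefs["bmi"] * bmi + coefs["age"] * age +
           coefs["smokeryes"] * as.numeric(smoker))
}

# Hypothetical profile: 40 years old with a BMI of 30.
non_smoker <- predict_charges(bmi = 30, age = 40, smoker = FALSE)
smoker     <- predict_charges(bmi = 30, age = 40, smoker = TRUE)

round(c(non_smoker = non_smoker, smoker = smoker), 2)
# non_smoker = 7856.87, smoker = 34553.96

# The gap between the two profiles equals the smokeryes coefficient.
smoker - non_smoker
```

Holding age and smoking status fixed, each additional BMI point adds about 311 to the predicted charges, while smoking status shifts the prediction by about 26,697.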
library(ggplot2)

# Scatter plot of charges against BMI with a fitted least-squares line.
ggplot(data = medical_cost_obesity_data, aes(x = bmi, y = charges)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  xlab("BMI") +
  ylab("Medical costs") +
  theme_minimal()
par(mfrow=c(2,2))
plot(obesity_lm)
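Based on the model summary, the median residual is -736.5 rather than close to zero, and the residuals range from -13089.8 to 26585.9. This asymmetry indicates that the residuals are skewed to the right and not normally distributed. The F-statistic is very high (1355 on 3 and 1092 degrees of freedom, p-value < 2.2e-16), so the null hypothesis that all slope coefficients are zero can be rejected, and BMI, age, and smoking status are each significant predictors, which points to a strong relationship between medical costs and obesity. In addition, the multiple R-squared shows that the model explains 78.83% of the variation in medical costs. There are a few outliers among the residuals. The linear model therefore captures most of the variation in charges, but the skewed residuals and outliers mean the normality assumption is not fully met.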
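The points above about skewness and outliers can also be checked numerically. The following is a minimal sketch, assuming a hypothetical helper `residual_checks()` and simulated stand-in data with right-skewed errors. It does not use the insurance data, so the numbers it prints say nothing about `obesity_lm`.

```r
# Hypothetical helper: numeric companions to the plot() diagnostic panels.
residual_checks <- function(model) {
  res     <- residuals(model)
  std_res <- rstandard(model)
  list(
    median_residual = median(res),
    # Moment-based skewness; values well above 0 indicate a long right tail.
    skewness = mean((res - mean(res))^3) / mean((res - mean(res))^2)^1.5,
    # Shapiro-Wilk normality test (shapiro.test accepts 3 to 5000 values).
    shapiro_p = shapiro.test(res)$p.value,
    # Observations whose standardized residual exceeds 3 in absolute value.
    n_large = sum(abs(std_res) > 3)
  )
}

# Simulated stand-in data (NOT the insurance data) with right-skewed errors.
set.seed(1)
n   <- 500
sim <- data.frame(bmi = runif(n, 25, 45),
                  age = sample(18:64, n, replace = TRUE))
sim$charges <- 300 * sim$bmi + 250 * sim$age + rexp(n, rate = 1 / 5000)

sim_lm <- lm(charges ~ bmi + age, data = sim)
checks <- residual_checks(sim_lm)
str(checks)
```

Calling `residual_checks(obesity_lm)` on the fitted model would report the same four quantities for the real data. A negative median residual together with positive skewness and a small Shapiro-Wilk p-value would support the reading that the residuals are not normally distributed.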
In conclusion, the data and the regression model show a strong and significant relationship between medical costs and obesity in America.
Choi, Miri. “Medical Cost Personal Datasets.” Kaggle, 21 Feb. 2018, www.kaggle.com/datasets/mirichoi0218/insurance.
Centers for Disease Control and Prevention. “About Overweight and Obesity.” 24 Feb. 2023, www.cdc.gov/obesity/about-obesity/index.html.