library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## Warning: package 'lubridate' was built under R version 4.1.3
## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr 1.1.2 v readr 2.1.4
## v forcats 1.0.0 v stringr 1.5.1
## v ggplot2 3.5.1 v tibble 3.2.1
## v lubridate 1.9.2 v tidyr 1.3.0
## v purrr 1.0.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the obesity dataset (file.choose() opens an interactive file picker)
obesity <- read.csv(file.choose())
# Compute BMI, assuming Weight is recorded in kilograms and Height in meters
obesity <- obesity |>
  mutate(BMI = Weight / (Height^2))
The variables I have selected for my GLM are family_history_with_overweight (binary), BMI, Gender, and MTRANS, which records the respondent's usual mode of transportation.
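As a quick sanity check before modeling (a minimal sketch, assuming the column names above and that dplyr is already loaded), the selected variables can be summarized directly:
obesity |>
  select(Gender, family_history_with_overweight, MTRANS, BMI) |>
  summary()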
# Transform family_history_with_overweight from yes/no to 1/0
obesity$family_history_binary <- ifelse(obesity$family_history_with_overweight == "yes", 1, 0)
# Gaussian GLM with an identity link, predicting BMI from gender and family history
model <- glm(BMI ~ Gender + family_history_binary, family = "gaussian", data = obesity)
Because the evaluation methods we covered were demonstrated with LMs rather than GLMs, the lindia library did not work for all of them. I chose the diagnostics for which I could find an easy GLM alternative, so not every graph we covered is represented here, but there are enough to get a good picture of how effective (or ineffective) the model is.
plot(model, which = 1)
Based on this residuals vs. fitted plot, the red line is mostly smooth and stays close to the dotted line, which suggests that the model does not violate the homoscedasticity assumption. However, the residuals are stacked in vertical bands rather than spread out, which reflects that only a few distinct fitted BMI values are possible from these explanatory variables (gender and the family history binary), since both are categorical. There is no clustering or other odd trend in the graph, but the banded residuals suggest that the model is not a good fit.
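To back up the visual check of homoscedasticity (a hedged sketch: it assumes the lmtest package is installed, and since this Gaussian GLM with an identity link is equivalent to an ordinary linear model, the test is run on the same formula):
library(lmtest)
# Breusch-Pagan test: a small p-value would suggest heteroscedasticity
bptest(BMI ~ Gender + family_history_binary, data = obesity)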
plot(model, which = 2)
Based on this Q-Q plot, the model violates the normality assumption, because the residuals stray from the reference line, especially in the right tail. This could indicate outliers, or simply a poorly fitting model, possibly because the relationships in the data are not linear. Ultimately, it suggests that the residuals are not normally distributed.
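To complement the Q-Q plot (a sketch using only base R; shapiro.test() handles up to 5,000 observations, which covers this dataset):
# Shapiro-Wilk test of the residuals: a small p-value suggests non-normality
shapiro.test(residuals(model))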
plot(model, which = 4)
As with the other two graphs, the Cook's distance plot suggests that there are some outliers in the data, although not many, so they are probably not a heavy influence on the model and are likely not the only reason the model violates assumptions and fits poorly.
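To put a rough number on this (a sketch; the 4/n cutoff is just one common rule of thumb, not a definitive threshold):
cd <- cooks.distance(model)
# How many observations exceed the 4/n rule-of-thumb cutoff
sum(cd > 4 / length(cd))
# The handful of most influential observations, for inspection
head(sort(cd, decreasing = TRUE))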
model$coefficients
## (Intercept) GenderMale family_history_binary
## 22.16063 -1.66112 10.24915
summary(model)
##
## Call:
## glm(formula = BMI ~ Gender + family_history_binary, family = "gaussian",
## data = obesity)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -17.457 -4.833 -0.300 5.616 18.724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.1606 0.3752 59.066 < 2e-16 ***
## GenderMale -1.6611 0.3049 -5.448 5.69e-08 ***
## family_history_binary 10.2491 0.3948 25.963 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 48.54034)
##
## Null deviance: 135423 on 2110 degrees of freedom
## Residual deviance: 102323 on 2108 degrees of freedom
## AIC: 14191
##
## Number of Fisher Scoring iterations: 2
Looking at these coefficients, men have an estimated BMI about 1.66 points lower than women, and the estimated BMI for women with no family history of being overweight (the intercept) is about 22.2. Additionally, individuals with a family history of overweight/obesity have an estimated BMI around 10.2 points higher than those without, holding gender constant.
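To see these interpretations concretely (a sketch; it assumes Gender is coded as "Female"/"Male", as the GenderMale coefficient name implies), the fitted BMI for each combination of the two predictors can be computed:
newdata <- expand.grid(Gender = c("Female", "Male"),
                       family_history_binary = c(0, 1))
cbind(newdata, predicted_BMI = predict(model, newdata = newdata, type = "response"))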
As for the summary of the model, the standard errors are relatively small, meaning the coefficient estimates appear precise, which is confusing given that the other analysis shows it is probably not a well-fitting model. Looking at the AIC and the deviance, which were metrics we discussed this week, the AIC seems high, which is not ideal, although it is hard to say anything conclusive because AIC is best used to compare models. The null and residual deviances are also both relatively large, suggesting there may be some outliers and that the model is not well fit.
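Since AIC is most useful as a comparison, one quick check (a hedged sketch; output not shown) is to compare the fitted model against an intercept-only baseline:
null_model <- glm(BMI ~ 1, family = "gaussian", data = obesity)
# Lower AIC indicates the better-supported model
AIC(null_model, model)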
I tried transforming my response variable (BMI) with log and quadratic transformations as we did in class, but I still ended up with graphs where the points were gathered at 0 or 1 rather than spread out in a roughly linear fashion like the examples we were shown. I think this may be one reason my model is so poorly fitted; it is also possible that, because this is health data, the relationships are too complicated to be modeled simply. In the future, I would like to make sure I am building my GLMs correctly, and if there really is no relationship in this model, then that is one thing, but if a good model can be built, that is something I would like to explore more before this class is over!
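For reference, this is one way the log transformation could be retried and re-checked with the same diagnostics (a sketch under the same assumptions as the original model; results are not shown here):
log_model <- glm(log(BMI) ~ Gender + family_history_binary,
                 family = "gaussian", data = obesity)
plot(log_model, which = 1)  # residuals vs. fitted
plot(log_model, which = 2)  # Q-Q plot
plot(log_model, which = 4)  # Cook's distance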