Week 11 Data Dive

Load Dataset

library(tidyverse)

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.5.1     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

obesity <- read.csv(file.choose())

# BMI = weight / height^2
obesity <- obesity |>
  mutate(BMI = Weight / Height^2)

Select variables/transform them as needed

The variables I have selected for my GLM are family_history_with_overweight (binary), BMI, Gender, and MTRANS (the mode of transportation the respondent usually uses).

#Transforming family_history_with_overweight from yes/no to 1/0

obesity$family_history_binary <- ifelse(obesity$family_history_with_overweight == "yes", 1, 0)
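An equivalent way to do this recoding is sketched below. The input vector is a made-up stand-in for the real column, used only to show that a logical comparison converted to integer gives the same 1/0 values as ifelse().

```r
# Hypothetical example values (NOT taken from the real dataset)
family_history <- c("yes", "no", "yes", "yes")

# TRUE/FALSE converted to integer gives 1/0
family_history_binary <- as.integer(family_history == "yes")

# Check that this matches the ifelse() version used above
same_as_ifelse <- all(family_history_binary == ifelse(family_history == "yes", 1, 0))

family_history_binary
same_as_ifelse
```

glm() would also dummy-code a yes/no column on its own if it were passed in directly, so the manual recoding mostly makes the coefficient name easier to read.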

Build the model

model <- glm(BMI ~ Gender + family_history_binary, family = 'gaussian', data = obesity)

Evaluate the model using what we’ve previously learned - graphs

We learned these evaluation methods with LMs rather than GLMs, and the lindia library did not work for all of them on a GLM. I kept the plots for which I could find an easy GLM alternative, so not every graph we covered is represented here, but there are enough to get a good picture of how effective (or ineffective) the model is.
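One thing that may help here: a GLM with the gaussian family and its default identity link fits the same model as lm(). The sketch below checks this on simulated data. The toy data frame and its numbers are invented for illustration and are not the obesity dataset.

```r
# Simulated stand-in data (assumption: invented values, not the real dataset)
set.seed(42)
grid <- expand.grid(Gender = c("Female", "Male"),
                    family_history_binary = c(0, 1))
toy <- grid[rep(seq_len(nrow(grid)), times = 50), ]
toy$BMI <- 22 - 1.5 * (toy$Gender == "Male") +
  10 * toy$family_history_binary + rnorm(nrow(toy), sd = 7)

# Same formula fitted both ways
toy_glm <- glm(BMI ~ Gender + family_history_binary, family = gaussian, data = toy)
toy_lm  <- lm(BMI ~ Gender + family_history_binary, data = toy)

# The coefficients agree up to numerical tolerance
coef_match <- isTRUE(all.equal(coef(toy_glm), coef(toy_lm)))
coef_match
```

If the two fits agree like this, diagnostic tools that only accept LMs could in principle be run on the lm() version of the same model.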

plot(model, which = 1)

In this residuals vs. fitted plot, the red line is mostly smooth and stays close to the dotted line, which suggests that the model does not violate the homoscedasticity assumption. However, the residuals are stacked in a few vertical columns rather than spread out. Because both explanatory variables (gender and family history binary) are binary, the model can only produce four distinct fitted values, one for each combination. There is no other clustering or odd trend in the graph, but the way the residuals are stacked suggests that the model is not a good fit.
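The stacking can be reproduced with a small sketch. It uses simulated data (invented values, not the obesity dataset) with the same two binary predictors and counts how many distinct fitted values the model produces.

```r
# Simulated stand-in data (assumption: invented values, not the real dataset)
set.seed(42)
grid <- expand.grid(Gender = c("Female", "Male"),
                    family_history_binary = c(0, 1))
toy <- grid[rep(seq_len(nrow(grid)), times = 50), ]
toy$BMI <- 22 - 1.5 * (toy$Gender == "Male") +
  10 * toy$family_history_binary + rnorm(nrow(toy), sd = 7)

toy_model <- glm(BMI ~ Gender + family_history_binary, family = gaussian, data = toy)

# Round to avoid counting floating-point noise as separate values
n_fitted <- length(unique(round(fitted(toy_model), 8)))
n_fitted
```

With 200 simulated rows there are still only as many fitted values as there are predictor combinations, so the residuals can only line up in that many columns.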

Check the Q-Q Plot of the model

plot(model, which = 2)

Based on this Q-Q plot, the model violates the normality assumption, because the points stray from the reference line, especially in the right tail. This could indicate outliers, or just a poorly fit model, possibly because the relationships in the data are not linear. Ultimately, this says that the residuals are not normally distributed.

Cook’s D Plot

plot(model, which = 4)

Like the other two graphs, the Cook’s D plot suggests that there are some outliers in the data, although not a large number, so they probably do not influence the model heavily and are likely not the only reason the model violates assumptions and fits poorly.
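To put a number on "some outliers", the Cook's distances can be pulled out directly with cooks.distance(). The sketch below uses simulated data (invented values, not the obesity dataset) and the common 4/n rule of thumb, which is only a rough screening cutoff and an assumption on my part, not something from class.

```r
# Simulated stand-in data (assumption: invented values, not the real dataset)
set.seed(42)
grid <- expand.grid(Gender = c("Female", "Male"),
                    family_history_binary = c(0, 1))
toy <- grid[rep(seq_len(nrow(grid)), times = 50), ]
toy$BMI <- 22 - 1.5 * (toy$Gender == "Male") +
  10 * toy$family_history_binary + rnorm(nrow(toy), sd = 7)

toy_model <- glm(BMI ~ Gender + family_history_binary, family = gaussian, data = toy)

# One Cook's distance per observation
cooks <- cooks.distance(toy_model)

# Rule-of-thumb cutoff (a rough screen, not a strict test)
threshold <- 4 / nrow(toy)
flagged <- which(cooks > threshold)

length(flagged)                        # how many points exceed the cutoff
head(sort(cooks, decreasing = TRUE))   # the largest distances
```

The same calls on the real model would show how many observations are flagged relative to the full sample size.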

model$coefficients

##           (Intercept)            GenderMale family_history_binary 
##              22.16063              -1.66112              10.24915

summary(model)

## 
## Call:
## glm(formula = BMI ~ Gender + family_history_binary, family = "gaussian", 
##     data = obesity)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -17.457   -4.833   -0.300    5.616   18.724  
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            22.1606     0.3752  59.066  < 2e-16 ***
## GenderMale             -1.6611     0.3049  -5.448 5.69e-08 ***
## family_history_binary  10.2491     0.3948  25.963  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 48.54034)
## 
##     Null deviance: 135423  on 2110  degrees of freedom
## Residual deviance: 102323  on 2108  degrees of freedom
## AIC: 14191
## 
## Number of Fisher Scoring iterations: 2

The coefficients show that, holding family history constant, males have an estimated BMI about 1.66 points lower than females, and the estimated BMI for females with no family history of being overweight is about 22.2. Additionally, individuals who do have a family history of obesity/overweight have an estimated BMI around 10.2 points higher than those without.
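The interpretation can be worked through as arithmetic. This sketch copies the three coefficients from the model$coefficients output and computes the estimated BMI for each of the four gender and family history combinations. The column names in the table are just labels chosen for illustration.

```r
# Coefficients copied from the model$coefficients output
b0     <- 22.16063   # (Intercept): female, no family history
b_male <- -1.66112   # GenderMale
b_fh   <- 10.24915   # family_history_binary

# All four combinations of the two binary predictors
groups <- expand.grid(male = c(0, 1), family_history = c(0, 1))
groups$predicted_BMI <- b0 + b_male * groups$male + b_fh * groups$family_history

groups
```

These four numbers are the only predictions the model can make, which lines up with the stacked columns in the residuals vs. fitted plot.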

As for the summary of the model, the standard errors are relatively small, which means the coefficients are estimated precisely. That is not the same as the model predicting well, so it does not contradict the other analysis showing that it is probably not a well-fit model. Looking at the AIC and the deviance, which were metrics we discussed this week, the AIC seems to be high, which is not ideal, although it’s hard to say anything conclusive because this is a metric best used to compare two models. Additionally, the null and residual deviances are both relatively large numbers, suggesting there may be some outliers and the model is not well fit.
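The deviances are easier to read as a ratio. For a gaussian GLM with the identity link, the deviance is the residual sum of squares, so one minus the ratio of residual to null deviance is the usual R-squared. The sketch below only does arithmetic on the numbers printed in summary(model).

```r
# Values copied from the summary(model) output
null_dev  <- 135423   # null deviance
resid_dev <- 102323   # residual deviance
resid_df  <- 2108     # residual degrees of freedom

# Share of the variation in BMI explained by the model
r_squared <- 1 - resid_dev / null_dev

# Dispersion estimate and the typical size of a residual, in BMI points
dispersion <- resid_dev / resid_df
resid_sd   <- sqrt(dispersion)

round(c(r_squared = r_squared, dispersion = dispersion, resid_sd = resid_sd), 3)
```

On these numbers the model explains roughly a quarter of the variation in BMI, and a typical residual is about 7 BMI points, which is consistent with the weak fit seen in the plots.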

I tried to transform my response variable (BMI) using log and quadratic transformations like we did in class, but I still ended up with graphs where the points were all gathered at 0 or 1 and not spread out in a somewhat linear fashion like we were shown in class. I think this may be one reason why my model is so poorly fitted. It’s also possible that, because this is health data, the relationships are too complex to be modeled this simply. In the future, I would like to make sure I’m building the GLMs correctly. If there really is no relationship in this model, then that is one thing, but if a good model can be built, that is something I would like to explore more before this class is over!

Week 11 Data Dive - GLMs

Kylie Heagy

2024-11-12