Data Dive 9 - Regression Diagnostics

Upload Data Set

library(tidyverse)

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.5.1     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

obesity <- read.csv(file.choose())

library(ggplot2)
library(ggthemes)
library(ggrepel)

library(lindia)

Calculate BMI

# BMI = weight / height^2 (assumes Weight is in kilograms and Height in metres)
obesity <- obesity |>
  mutate(BMI = Weight / Height^2)
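
As a quick sanity check on the formula, here is a minimal sketch using two made-up patients. The values are hypothetical, and the units (kilograms and metres) are an assumption about how the data set records Weight and Height.

```r
# Hypothetical patients; Weight in kilograms, Height in metres (assumed units).
demo <- data.frame(Weight = c(80, 64), Height = c(1.75, 1.60))

# Guard against heights recorded in centimetres, which would shrink BMI by 10^4.
stopifnot(all(demo$Height < 3))

demo$BMI <- demo$Weight / demo$Height^2
print(round(demo$BMI, 2))   # 26.12 25.00
```

If the heights were stored in centimetres, dividing them by 100 before squaring would give the same scale.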

Add 1-3 more variables into the regression model from last week

model <- lm(BMI ~ FAF + SCC + CH2O, data = obesity)

summary(model)

## 
## Call:
## lm(formula = BMI ~ FAF + SCC + CH2O, data = obesity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0616  -5.5595  -0.5268   6.2508  21.6615 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.1766     0.5803  46.830  < 2e-16 ***
## FAF          -1.8330     0.1989  -9.217  < 2e-16 ***
## SCCyes       -6.5843     0.8003  -8.227 3.33e-16 ***
## CH2O          2.3281     0.2752   8.459  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.64 on 2107 degrees of freedom
## Multiple R-squared:  0.0918, Adjusted R-squared:  0.09051 
## F-statistic: 70.99 on 3 and 2107 DF,  p-value: < 2.2e-16

The variable from my model last week was FAF (frequency of physical activity). As I addressed in last week's data dive, no pair of variables in this data set shows a clear linear relationship, so, continuing on the same track as last week, I chose variables that I think would tell a good story. I added the binary variable SCC, which records whether or not the patient monitors the calories they eat daily. There is no issue of multicollinearity between SCC and the other variables in the model, and it could help determine whether there is any relationship between BMI, frequency of physical activity (FAF), and whether or not a person counts calories (SCC).

The other variable I added was CH2O, a continuous variable recording the patient's daily water consumption. I thought this would be a good variable to add because, while water consumption might not be directly correlated with BMI, higher water consumption might be related to how often a person is physically active, or might indicate a healthier lifestyle, which in turn might indicate a lower BMI and therefore a healthier obesity category.
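
The multicollinearity claim can be checked numerically with variance inflation factors. The sketch below computes them in base R on simulated stand-in data, since the real `obesity` data frame is not reproduced here. The `demo` data frame and its value ranges are assumptions for illustration only, not results from the data set.

```r
# Hypothetical stand-in data; replace `demo` with the real data frame to use this.
set.seed(1)
n <- 200
demo <- data.frame(
  FAF  = runif(n, 0, 3),   # physical activity frequency (assumed range)
  SCC  = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.9, 0.1))),
  CH2O = runif(n, 1, 3)    # daily water consumption (assumed range)
)

# Predictor matrix without the intercept column; SCC becomes a 0/1 dummy (SCCyes).
X <- model.matrix(~ FAF + SCC + CH2O, data = demo)[, -1]

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the rest.
vif <- sapply(colnames(X), function(j) {
  others <- X[, colnames(X) != j, drop = FALSE]
  r2 <- summary(lm(X[, j] ~ others))$r.squared
  1 / (1 - r2)
})

print(round(vif, 3))
```

Values close to 1 mean a predictor is nearly unrelated to the others. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.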

Evaluate the model

1. Residuals vs. Fitted Values

gg_resfitted(model) +
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Based on this plot, the assumptions of a linear model are being violated. More data points sit above the zero line than below it, so the residuals are not evenly dispersed. The smoothed curve swings both high and low, which, from our notes, I conclude means the model is overestimating at higher fitted values and underestimating at lower ones.
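
To illustrate why a curved smoother in this plot points to a missed nonlinear pattern, here is a small sketch on simulated data (not the obesity data). A straight line is fitted to a deliberately curved relationship, and the residuals are then regressed on a quadratic in the fitted values. The variable names and the simulated relationship are assumptions made for the example.

```r
set.seed(2)
x <- runif(300, 0, 3)
y <- 20 + 3 * x + 2 * (x - 1.5)^2 + rnorm(300, sd = 1)  # truly curved relationship

fit <- lm(y ~ x)    # straight-line model, misspecified on purpose
res <- resid(fit)
fv  <- fitted(fit)

# Least squares makes the residuals linearly uncorrelated with the fitted values,
# so a plain correlation cannot reveal the problem.
lin_cor <- cor(res, fv)
print(round(lin_cor, 10))

# A quadratic in the fitted values picks up the leftover curvature.
curve_check <- lm(res ~ poly(fv, 2))
curve_r2 <- summary(curve_check)$r.squared
print(curve_r2)
```

A share of residual variance explained by the quadratic that is well above zero is the numeric counterpart of a bent smoother in the residuals vs. fitted plot.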

2. Residuals vs. x-values

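# plot.all = FALSE returns a list of plots, one per predictor, instead of one
# arranged grid; with plot.all = TRUE, plots$FAF and the others come back NULL
plots <- gg_resX(model, plot.all = FALSE)

plots$FAF

plots$SCC

plots$CH2O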

Looking at the residuals versus FAF and CH2O, it is difficult to tell whether the variance of the residuals is constant across all x-values. The dispersion looks decent overall, but there are regions where the residuals are densely packed and others where they are sparse. For these two plots, I would say that the assumptions are violated again. SCC is a binary variable, so its residuals fall into two columns and it is difficult to judge their dispersion across x-values. Far more patients do not count their daily calories than do, which makes the two spreads hard to compare, so I would tentatively conclude that the assumption is violated here too, although the imbalance in group sizes is not itself a violation.
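
Because the spread is hard to judge by eye, one option is to put numbers on it. The sketch below uses simulated stand-in data (the column names mirror the model, but every value is made up) and compares the standard deviation of the residuals within each SCC group and within quartile bins of FAF.

```r
# Hypothetical stand-in data with a heavily unbalanced binary predictor.
set.seed(3)
n <- 500
demo <- data.frame(
  FAF = runif(n, 0, 3),
  SCC = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.95, 0.05)))
)
demo$BMI <- 30 - 2 * demo$FAF - 6 * (demo$SCC == "yes") + rnorm(n, sd = 7)

fit <- lm(BMI ~ FAF + SCC, data = demo)
demo$res <- resid(fit)

# Group sizes and residual spread for each level of the binary predictor.
scc_n  <- table(demo$SCC)
scc_sd <- tapply(demo$res, demo$SCC, sd)
print(scc_n)
print(round(scc_sd, 2))

# Residual spread within quartile bins of a continuous predictor.
demo$FAF_bin <- cut(demo$FAF, breaks = quantile(demo$FAF, 0:4 / 4),
                    include.lowest = TRUE)
faf_sd <- tapply(demo$res, demo$FAF_bin, sd)
print(round(faf_sd, 2))
```

Roughly equal standard deviations across groups or bins are consistent with constant variance, while a clear trend or a large gap suggests a violation. With a small group such as the calorie counters, the estimate of spread is itself noisy.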

3. Residual histogram

gg_reshist(model)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This histogram also suggests that the assumptions are violated. It is not close to normally distributed: most residuals fall in the -5 to 0 and 0 to 5 ranges, and there is a spike near a residual of 10. I assume this has something to do with the lack of even a roughly linear relationship between the variables in the model, which makes it an extraordinarily bad model. Additionally, the tail on the left is much higher, while the tail on the right is lower and longer. The best fix would be to find different, better explanatory variables, but since none are available in this data set, I wonder if it would be possible to calculate other variables from the data (as I did with BMI) that have a more linear relationship and would give a model that does not violate the assumptions so much.

4. QQ-plot

# the normal QQ plot
gg_qqplot(model)

Based on this QQ-plot, I would say that the residuals are not normal, and the histogram and the other plots I have analyzed make me confident in saying so. There is a short stretch around the 0th theoretical quantile where the residuals roughly follow the reference line, but even there the difference is noticeable, and at both ends they depart from it drastically.
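
A formal test can back up the visual reading of the QQ-plot. The sketch below applies base R's `shapiro.test()` to two simulated sets of residuals, one normal and one right-skewed. Both vectors are made up for illustration, and on the real model the call would be `shapiro.test(resid(model))`.

```r
set.seed(4)
normal_res <- rnorm(500)       # what well-behaved residuals might look like
skewed_res <- rexp(500) - 1    # right-skewed alternative, centred near zero

# Shapiro-Wilk tests the null hypothesis that the sample is normally distributed.
p_normal <- shapiro.test(normal_res)$p.value
p_skewed <- shapiro.test(skewed_res)$p.value

print(p_normal)
print(p_skewed)
```

A very small p-value is evidence against normality. With a sample as large as the roughly 2,100 rows here, the test will flag even mild departures, so it is best read alongside the QQ-plot rather than instead of it.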

5. Cook’s D observation

gg_cooksd(model, threshold = 'matlab')

Based on this Cook’s D plot, rows 69 and 303 have the highest influence on the model, followed by rows such as 472, 628, 258, 656, 1809, and 1833. A large number of rows are highly influential, which might be one reason why the model violates all of the assumptions and does not describe the relationships between BMI and the other variables well.
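
To see how much the flagged rows actually matter, one approach is to refit the model without them and compare coefficients. The sketch below does this in base R on simulated data with one planted influential point. It assumes that the 'matlab' threshold corresponds to three times the mean Cook's distance, and all names and values are hypothetical.

```r
set.seed(5)
n <- 200
dat <- data.frame(x = runif(n, 0, 3))
dat$y <- 27 - 2 * dat$x + rnorm(n, sd = 2)
dat$x[1] <- 10
dat$y[1] <- 60    # plant one high-leverage, badly fitted point in row 1

fit <- lm(y ~ x, data = dat)
d <- cooks.distance(fit)

# Assumed equivalent of threshold = 'matlab': flag rows above 3 * mean(Cook's D).
cutoff <- 3 * mean(d)
flagged <- which(d > cutoff)
print(head(sort(d[flagged], decreasing = TRUE)))

# Refit without the flagged rows to see how far the coefficients move.
fit_trim <- lm(y ~ x, data = dat[-flagged, ])
print(rbind(all_rows = coef(fit), trimmed = coef(fit_trim)))
```

If the coefficients barely change after trimming, the flagged rows are not driving the fit, and the assumption violations are more likely due to the model's form than to a handful of observations.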