Research Question:
Do depression score and age predict weekly physical activity levels
among adults who reported doing moderate recreational activity?
This project uses data from the National Health and Nutrition Examination Survey, also known as NHANES, from 2017 to 2018. NHANES is a health survey from the CDC that collects information about health, lifestyle, and demographics from people in the United States.
For this project, I used three NHANES files. The depression screener file is DPQ_J, the physical activity file is PAQ_J, and the demographics file is DEMO_J. These files were merged using SEQN, which is the participant ID number.
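As a small illustration of what merging on SEQN does, the sketch below joins three made-up data frames with base R `merge()`, which behaves like an inner join when `all = FALSE` (the default). The toy data frames and their values are hypothetical and only stand in for DEMO_J, DPQ_J, and PAQ_J.

```r
# Toy stand-ins for DEMO_J, DPQ_J, and PAQ_J (all values are made up)
demo_toy <- data.frame(SEQN = c(1, 2, 3, 4), RIDAGEYR = c(25, 40, 61, 33))
dpq_toy  <- data.frame(SEQN = c(1, 2, 4),    DPQ010   = c(0, 1, 3))
paq_toy  <- data.frame(SEQN = c(2, 3, 4),    PAD615   = c(60, 120, 30))

# merge() with the default all = FALSE keeps only IDs found in both inputs
merged_toy <- merge(merge(demo_toy, dpq_toy, by = "SEQN"),
                    paq_toy, by = "SEQN")
merged_toy
```

Only participants 2 and 4 appear in all three toy files, so only those two rows survive. The same logic explains why the merged NHANES data has fewer rows than any single file.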
The main variables used in this project are PHQ 9 depression score, weekly moderate physical activity minutes, and age. PHQ 9 depression score is created by adding DPQ010 through DPQ090. Physical activity is measured using PAD615, which records moderate recreational physical activity minutes per week. Age is measured using RIDAGEYR.
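To make the scoring rule concrete, here is a minimal sketch that sums nine item responses for two made-up respondents. Each item is scored 0 to 3, so the total ranges from 0 to 27. The toy data frame is hypothetical, and this sketch marks an incomplete respondent as NA, whereas the project code drops such rows with a filter.

```r
# Item names DPQ010, DPQ020, ..., DPQ090
items <- paste0("DPQ0", 1:9, "0")

# Two hypothetical respondents: one fully valid, one with a refused item (code 7)
toy <- as.data.frame(matrix(c(0, 1, 2, 3, 0, 1, 2, 3, 0,
                              1, 1, 1, 1, 1, 1, 1, 1, 7),
                            nrow = 2, byrow = TRUE,
                            dimnames = list(NULL, items)))

# A respondent is valid only if every item is 0, 1, 2, or 3
valid <- apply(toy[items], 1, function(r) all(r %in% 0:3))

# Sum the nine items for valid respondents, NA otherwise
toy$PHQ9 <- ifelse(valid, rowSums(toy[items]), NA)
toy$PHQ9
```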
This topic was chosen because depression and physical activity are both important public health issues. I wanted to see whether depression score and age can help predict physical activity levels.
Source: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017
For this project, I merged the three files, created a PHQ 9 depression score, selected the needed variables, and removed missing or invalid values. Following the professor's feedback that activity minutes are often right skewed with many zeros, I kept only people with activity above zero and applied a log transformation to activity minutes. Dropping the zeros is also necessary because the log of zero is undefined.
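The short sketch below shows why zeros have to be removed before taking logs and how the log scale pulls in a long right tail. The `minutes` vector is made up for illustration and is not NHANES data.

```r
# Hypothetical activity minutes with zeros and a long right tail
minutes <- c(0, 0, 30, 60, 120, 150, 240, 600, 1200)

log(0)  # -Inf, so zeros cannot be log transformed

# Same idea as filtering PAD615 > 0 in the project code
active     <- minutes[minutes > 0]
log_active <- log(active)

# On the raw scale the mean sits far above the median (right skew)
c(mean = mean(active), median = median(active))

# On the log scale the mean and median are much closer together
c(mean = mean(log_active), median = median(log_active))
```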
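library(dplyr)    # pipes, joins, filtering
library(ggplot2)  # scatterplots
library(car)      # vif()
# demo, dpq, paq: the DEMO_J, DPQ_J, and PAQ_J data frames
merged <- demo %>%
  inner_join(dpq, by = "SEQN") %>%
  inner_join(paq, by = "SEQN")
dim(merged)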
## [1] 5533 72
project_data <- merged %>%
  select(SEQN, RIDAGEYR, DPQ010, DPQ020, DPQ030, DPQ040,
         DPQ050, DPQ060, DPQ070, DPQ080, DPQ090, PAD615) %>%
  # Keep rows where all nine PHQ 9 items have a valid response (0 to 3)
  filter(if_all(DPQ010:DPQ090, ~ . %in% 0:3)) %>%
  # Keep positive activity minutes below the 7777 and 9999 nonresponse codes
  filter(!is.na(PAD615), PAD615 > 0, PAD615 < 7777, !is.na(RIDAGEYR)) %>%
  mutate(
    PHQ9 = DPQ010 + DPQ020 + DPQ030 + DPQ040 + DPQ050 +
      DPQ060 + DPQ070 + DPQ080 + DPQ090,
    log_activity = log(PAD615)  # natural log
  )
dim(project_data)
## [1] 1243 14
project_data %>%
  summarise(
    count = n(),
    mean_activity = mean(PAD615),
    median_activity = median(PAD615),
    mean_log_activity = mean(log_activity),
    mean_PHQ9 = mean(PHQ9),
    mean_age = mean(RIDAGEYR)
  )
## # A tibble: 1 × 6
##   count mean_activity median_activity mean_log_activity mean_PHQ9 mean_age
##   <int>         <dbl>           <dbl>             <dbl>     <dbl>    <dbl>
## 1  1243          198.             150              4.87      3.40     45.1
par(mfrow = c(1, 2))  # two histograms side by side
hist(project_data$PAD615,
     main = "Raw Physical Activity Minutes",
     xlab = "Minutes Per Week")
hist(project_data$log_activity,
     main = "Log Physical Activity Minutes",
     xlab = "Log Minutes Per Week")
ggplot(project_data, aes(x = PHQ9, y = log_activity)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Depression Score vs Log Physical Activity",
    x = "PHQ 9 Depression Score",
    y = "Log Weekly Physical Activity Minutes"
  )
## `geom_smooth()` using formula = 'y ~ x'
ggplot(project_data, aes(x = RIDAGEYR, y = log_activity)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Age vs Log Physical Activity",
    x = "Age",
    y = "Log Weekly Physical Activity Minutes"
  )
## `geom_smooth()` using formula = 'y ~ x'
Multiple linear regression is appropriate because the outcome variable is quantitative and there is more than one predictor. The outcome variable is log transformed weekly physical activity minutes. The predictor variables are PHQ 9 depression score and age.
The final model is:
log(PAD615) = β0 + β1(PHQ9) + β2(Age) + ε
model <- lm(log_activity ~ PHQ9 + RIDAGEYR, data = project_data)
summary(model)
##
## Call:
## lm(formula = log_activity ~ PHQ9 + RIDAGEYR, data = project_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.7522 -0.7153  0.1527  0.8105  1.7762
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  5.181e+00  8.587e-02  60.335  < 2e-16 ***
## PHQ9        -2.079e-05  6.735e-03  -0.003    0.998
## RIDAGEYR    -7.001e-03  1.695e-03  -4.130 3.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.03 on 1240 degrees of freedom
## Multiple R-squared: 0.0136, Adjusted R-squared: 0.01201
## F-statistic: 8.547 on 2 and 1240 DF, p-value: 0.0002059
confint(model)
##                   2.5 %       97.5 %
## (Intercept)  5.01235075  5.349274454
## PHQ9        -0.01323382  0.013192250
## RIDAGEYR    -0.01032593 -0.003675097
model_summary <- summary(model)
r_squared <- model_summary$r.squared
adj_r_squared <- model_summary$adj.r.squared
phq_coef <- coef(model)["PHQ9"]
age_coef <- coef(model)["RIDAGEYR"]
phq_percent <- (exp(phq_coef) - 1) * 100
age_percent <- (exp(age_coef) - 1) * 100
r_squared
## [1] 0.0135974
adj_r_squared
## [1] 0.01200643
phq_percent
## PHQ9
## -0.002078732
age_percent
## RIDAGEYR
## -0.6976069
The R squared value is 0.014. This means the model explains about 1.4 percent of the variation in log weekly physical activity minutes.
The adjusted R squared value is 0.012. This tells us the model explains a small amount of the variation after accounting for the number of predictors.
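The adjusted R squared and the overall F statistic can both be recovered from R squared, the sample size, and the number of predictors. The sketch below redoes that arithmetic using the values reported in the output above (n = 1243, two predictors). It is a check on the reported numbers, not a new analysis.

```r
r2 <- 0.0135974  # multiple R squared from the model summary
n  <- 1243       # rows in project_data
p  <- 2          # predictors: PHQ9 and RIDAGEYR

# Adjusted R squared penalizes R squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic on p and n - p - 1 degrees of freedom
f_stat  <- (r2 / p) / ((1 - r2) / (n - p - 1))
f_pval  <- pf(f_stat, p, n - p - 1, lower.tail = FALSE)

c(adj_r2 = adj_r2, f_stat = f_stat, f_pval = f_pval)
```

These should agree with the adjusted R squared of 0.012, the F statistic of 8.547, and the p-value of 0.0002059 in the summary output, up to rounding.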
The PHQ 9 coefficient is about -0.00002, which is essentially zero. Since the outcome is log transformed, this means that for each one point increase in depression score, weekly activity minutes are predicted to change by about -0.002 percent, holding age constant.
The age coefficient is -0.007. This means that for each additional year of age, weekly activity minutes are predicted to decrease by about 0.7 percent, holding depression score constant.
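Because the outcome is on the log scale, a coefficient converts to a percent change through exp(coefficient) - 1, and effects over several units multiply rather than add. The sketch below applies this to the reported estimates. The helper `pct_change` is a hypothetical name introduced only for this illustration.

```r
age_coef <- -7.001e-03  # RIDAGEYR estimate from the summary output
phq_coef <- -2.079e-05  # PHQ9 estimate from the summary output

# Percent change in activity minutes for a `units` increase in the predictor
pct_change <- function(beta, units = 1) (exp(beta * units) - 1) * 100

pct_change(age_coef)      # one additional year of age
pct_change(age_coef, 10)  # ten additional years of age
pct_change(phq_coef)      # one additional PHQ 9 point
```

The ten year figure is slightly smaller in magnitude than ten times the one year figure, which reflects the compounding.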
The p-values show whether each predictor is statistically significant at the 0.05 level. The p-value for PHQ 9 is 0.998, which is greater than 0.05, so PHQ 9 is not statistically significant. The p-value for age is 0.0000387, which is less than 0.05, so age is statistically significant. This means age is a statistically significant predictor of physical activity in this model, but depression score is not. Statistical significance does not mean the effect is large: the age effect is only about 0.7 percent fewer minutes per year.
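The t value, p-value, and 95 percent confidence interval for a coefficient all come from its estimate and standard error. As a check, the sketch below recomputes them for age from the rounded values in the summary output, so small rounding differences from the printed results are expected.

```r
est <- -7.001e-03  # RIDAGEYR estimate
se  <-  1.695e-03  # RIDAGEYR standard error
df  <-  1240       # residual degrees of freedom

t_value <- est / se                     # test statistic
p_value <- 2 * pt(-abs(t_value), df)    # two sided p-value
ci      <- est + c(-1, 1) * qt(0.975, df) * se  # 95 percent interval

t_value
p_value
ci
```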
For multiple linear regression, the five main assumptions are linearity, independence, homoscedasticity, normality of residuals, and no strong multicollinearity.
Linearity means the predictors should have a roughly linear relationship with the outcome. Independence means each observation should come from a separate person. Homoscedasticity means the residuals should have a similar spread across fitted values. Normality means the residuals should be roughly normal. The multicollinearity assumption means the predictors should not be too strongly related to each other.
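par(mfrow = c(2, 2))  # show the four diagnostic plots in one grid
plot(model)
vif_values <- car::vif(model)  # variance inflation factors from the car package
vif_values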
##     PHQ9 RIDAGEYR
## 1.002171 1.002171
The Residuals vs Fitted plot does not show a strong curved pattern, so the linearity assumption appears reasonable. The Normal Q Q plot shows some deviation at the tails, so the residuals are not perfectly normal. The Scale Location plot shows some uneven spread, meaning homoscedasticity is not perfect. The Residuals vs Leverage plot does not show many extreme influential points, so there does not appear to be a major leverage problem.
Independence is reasonable because each row represents a separate NHANES participant. Both VIF values are about 1.00, well below the common cutoff of 5. This means multicollinearity does not appear to be a serious issue.
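With only two predictors, the VIF for each one equals 1 / (1 - r^2), where r is the correlation between the two predictors. That is why both reported values are identical. The sketch below shows this on simulated data (the variables are made up, not NHANES values) and then backs out the correlation implied by the reported VIF of 1.002171. The helper `vif_by_hand` is a hypothetical name used only here.

```r
set.seed(1)
n    <- 500
age  <- runif(n, 18, 80)  # simulated ages
phq9 <- rpois(n, 3)       # simulated depression scores

# VIF for one predictor: regress it on the other and use 1 / (1 - R^2)
vif_by_hand <- function(target, other) {
  r2 <- summary(lm(target ~ other))$r.squared
  1 / (1 - r2)
}

vifs <- c(PHQ9 = vif_by_hand(phq9, age), RIDAGEYR = vif_by_hand(age, phq9))
vifs

# Same value from the correlation between the two predictors
vif_from_cor <- 1 / (1 - cor(phq9, age)^2)

# Absolute correlation implied by the VIF reported in the project output
implied_r <- sqrt(1 - 1 / 1.002171)
implied_r
```

The implied correlation between PHQ 9 score and age in the project data is therefore under 0.05 in absolute value, which supports the conclusion that the predictors are nearly unrelated.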
The log transformation helps reduce the extreme right skew in physical activity minutes, but the model is still not perfect because activity behavior is complicated and can be affected by many other factors.
This model looked at whether depression score and age predict weekly physical activity levels. The model explains about 1.4 percent of the variation in log weekly physical activity minutes.
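The coefficient for depression score is essentially zero (95 percent confidence interval -0.013 to 0.013), so a one point increase in PHQ 9 score is associated with almost no change in activity while holding age constant. The coefficient for age is -0.007 (95 percent confidence interval -0.010 to -0.004), so each additional year of age is associated with about 0.7 percent fewer activity minutes while holding depression score constant.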
Although age is statistically significant, the overall R squared value shows that depression score and age alone do not explain most of the variation in weekly activity.
This project examined whether depression score and age predict weekly physical activity levels using NHANES 2017 to 2018 data. Because activity minutes were heavily right skewed and had many zeros, I filtered the data to active respondents and used a log transformation for the outcome variable.
The multiple linear regression model gives a way to measure the relationship between depression score, age, and weekly activity. However, the model fit suggests that these two predictors alone do not explain physical activity very strongly.
A limitation is that this project only used depression score and age as predictors. Other factors like income, BMI, health status, sleep, chronic illness, and access to exercise spaces may also affect physical activity. Future research could add more predictors, use interaction terms, or compare models to see which factors best explain activity levels.
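As a rough sketch of what adding an interaction term and comparing models could look like, the code below fits a main effects model and an interaction model and compares them with a nested F test and AIC. The data frame `sim` is simulated purely for illustration, so the output says nothing about NHANES. Only the workflow is the point.

```r
set.seed(42)
n   <- 300
sim <- data.frame(PHQ9 = rpois(n, 3),
                  RIDAGEYR = round(runif(n, 18, 80)))
# Simulated outcome with a small age effect and no depression effect
sim$log_activity <- 5.2 - 0.007 * sim$RIDAGEYR + rnorm(n)

# Main effects only, as in this project
main_fit <- lm(log_activity ~ PHQ9 + RIDAGEYR, data = sim)

# PHQ9 * RIDAGEYR expands to both main effects plus their interaction
int_fit <- lm(log_activity ~ PHQ9 * RIDAGEYR, data = sim)

# Nested F test: does the interaction improve fit beyond the main effects?
comparison <- anova(main_fit, int_fit)
comparison

# AIC offers another way to compare the two models (lower is better)
AIC(main_fit, int_fit)
```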
National Center for Health Statistics. NHANES 2017 to 2018.
https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017
Fox, J. and Weisberg, S. An R Companion to Applied Regression.
https://cran.r-project.org/package=car