Background

This script is designed to analyze and visualize the effect of smoking on pulse rate across different age groups.

Effect modification is an important concept in epidemiology and health research and in other domains(!), as it can reveal more nuanced relationships between variables than can be understood through simple main effects models.

Loading libraries

We will use the NHANES, dplyr, and ggplot2 packages. These are essential for handling the dataset, performing data manipulation, and plotting the results.

if (!requireNamespace("NHANES", quietly = TRUE)) install.packages("NHANES")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")

library(NHANES)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2

Data Preparation:

The script filters the NHANES dataset to include only adults (age 20 and above) with non-missing values for smoking status (SmokeNow) and pulse rate (Pulse). It then categorizes individuals into age groups to simplify the analysis.

Filter the dataset for adults and non-missing smoking status and pulse rate

data <- NHANES %>%
  filter(Age >= 20, !is.na(SmokeNow), !is.na(Pulse))

Categorize age to simplify analysis

data$AgeGroup <- cut(data$Age, breaks = c(19, 30, 40, 50, 60, 70, Inf), labels = c("20-30", "31-40", "41-50", "51-60", "61-70", "71+"), right = FALSE)

Analyzing the Effect:

It calculates the mean pulse rate for each combination of age group and smoking status, preparing the data for visualization.

effect_analysis <- data %>%
  group_by(AgeGroup, SmokeNow) %>%
  summarise(MeanPulse = mean(Pulse, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'AgeGroup'. You can override using the
## `.groups` argument.

Visualizing the Effect:

Two visualizations are created. The first one is a bar chart showing the mean pulse rate by age group and smoking status. The second visualization is generated after fitting a linear model to predict pulse rate based on smoking status and age group, plotting predicted pulse rates for each combination of age group and smoking status.

ggplot(effect_analysis, aes(x = AgeGroup, y = MeanPulse, fill = SmokeNow)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Effect of Smoking on Pulse Rate by Age Group", x = "Age Group", y = "Mean Pulse Rate") +
  scale_fill_manual(values = c("No" = "blue", "Yes" = "red")) +
  theme_minimal()

Modeling and Prediction:

A linear model is used to understand the relationship between pulse rate and the interaction between smoking status and age group. Predictions from this model are then visualized, showing how pulse rate varies across age groups for smokers and non-smokers.

data$SmokeNow <- as.factor(data$SmokeNow)
data$AgeGroup <- as.factor(data$AgeGroup)

model <- lm(Pulse ~ SmokeNow * AgeGroup, data = data)
summary(model)
## 
## Call:
## lm(formula = Pulse ~ SmokeNow * AgeGroup, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.908  -8.846  -1.251   7.747  50.555 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                74.8456     0.9876  75.786  < 2e-16 ***
## SmokeNowYes                -1.5949     1.1767  -1.355  0.17539    
## AgeGroup31-40              -0.5924     1.2688  -0.467  0.64064    
## AgeGroup41-50              -2.3132     1.2130  -1.907  0.05661 .  
## AgeGroup51-60              -3.5522     1.1787  -3.014  0.00260 ** 
## AgeGroup61-70              -3.9168     1.2116  -3.233  0.00124 ** 
## AgeGroup71+                -5.4009     1.1614  -4.650 3.45e-06 ***
## SmokeNowYes:AgeGroup31-40   4.6388     1.5956   2.907  0.00367 ** 
## SmokeNowYes:AgeGroup41-50   4.4090     1.5363   2.870  0.00413 ** 
## SmokeNowYes:AgeGroup51-60   2.5286     1.5268   1.656  0.09779 .  
## SmokeNowYes:AgeGroup61-70   4.5745     1.7282   2.647  0.00816 ** 
## SmokeNowYes:AgeGroup71+     6.2645     1.9582   3.199  0.00139 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.06 on 3102 degrees of freedom
## Multiple R-squared:  0.03276,    Adjusted R-squared:  0.02933 
## F-statistic: 9.551 on 11 and 3102 DF,  p-value: < 2.2e-16
new_data <- expand.grid(AgeGroup = levels(data$AgeGroup),
                        SmokeNow = levels(data$SmokeNow))

# Predict values
new_data$PredictedPulse = predict(model, new_data)

levels(new_data$SmokeNow) <- c("Non-smoker", "Smoker")
library(ggplot2)

ggplot(new_data, aes(x = AgeGroup, y = PredictedPulse, color = SmokeNow)) +
  geom_line(aes(group = SmokeNow), linewidth = 1) + # Corrected for deprecated warning
  geom_point(size = 2) + # Add points to emphasize the actual predictions
  labs(title = "Predicted Pulse Rate by Smoking Status Across Age Groups",
       x = "Age Group", y = "Predicted Pulse Rate") +
  theme_minimal() +
  scale_color_manual(values = c("Non-smoker" = "blue", "Smoker" = "red"))

The linear model specified in the script (lm(Pulse ~ SmokeNow * AgeGroup, data = data)) includes an interaction term (SmokeNow * AgeGroup). This interaction term allows the model to estimate different effects of age on pulse rate for smokers compared to non-smokers. In statistical terms, it’s testing whether the slope of the relationship between age and pulse rate differs by smoking status.

Explanation for the coefficients!

Intercepts and main effects

  • (Intercept) 74.8456: This is the estimated average pulse rate for the reference group, which in this model is non-smokers aged 20-30. The estimate is highly significant, as indicated by *** (p < 0.001).
  • SmokeNowYes -1.5949: This suggests that, holding age constant at the reference group (20-30), smokers have a pulse rate that is 1.5949 units lower than non-smokers, though this effect is not statistically significant (p = 0.17539).
  • AgeGroup31-40 to AgeGroup71+: These coefficients represent the difference in pulse rate from the reference age group (20-30) for non-smokers, since they don’t involve SmokeNowYes. Most are negative, suggesting lower pulse rates in older age groups compared to the 20-30 age group, with varying levels of significance.

Interaction Effects

  • SmokeNowYes:AgeGroup31-40 to SmokeNowYes:AgeGroup71+: These coefficients show how the effect of being in a higher age group on pulse rate changes for smokers compared to non-smokers. For instance, the SmokeNowYes:AgeGroup31-40 coefficient of 4.6388 suggests that the difference in pulse rate between the 31-40 age group and the 20-30 age group is 4.6388 units greater for smokers than for non-smokers, and this effect is statistically significant (p = 0.00367).

Some other metrics

  • R2 is very low!!
  • F-statistic is highly significant (p < 0.001), suggesting that the model as a whole is significantly better at predicting pulse rate than a model with no predictors.