This script is designed to analyze and visualize the effect of smoking on pulse rate across different age groups.
Effect modification is an important concept in epidemiology and health research and in other domains(!), as it can reveal more nuanced relationships between variables than can be understood through simple main effects models.
We will use the NHANES, dplyr, and ggplot2 packages. These are essential for handling the dataset, performing data manipulation, and plotting the results.
if (!requireNamespace("NHANES", quietly = TRUE)) install.packages("NHANES")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(NHANES)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
The script filters the NHANES dataset to include only adults (age 20 and above) with non-missing values for smoking status (SmokeNow) and pulse rate (Pulse). It then categorizes individuals into age groups to simplify the analysis.
data <- NHANES %>%
filter(Age >= 20, !is.na(SmokeNow), !is.na(Pulse))
data$AgeGroup <- cut(data$Age, breaks = c(19, 30, 40, 50, 60, 70, Inf), labels = c("20-30", "31-40", "41-50", "51-60", "61-70", "71+"), right = FALSE)
It calculates the mean pulse rate for each combination of age group and smoking status, preparing the data for visualization.
effect_analysis <- data %>%
group_by(AgeGroup, SmokeNow) %>%
summarise(MeanPulse = mean(Pulse, na.rm = TRUE)) %>%
ungroup()
## `summarise()` has grouped output by 'AgeGroup'. You can override using the
## `.groups` argument.
Two visualizations are created. The first one is a bar chart showing the mean pulse rate by age group and smoking status. The second visualization is generated after fitting a linear model to predict pulse rate based on smoking status and age group, plotting predicted pulse rates for each combination of age group and smoking status.
ggplot(effect_analysis, aes(x = AgeGroup, y = MeanPulse, fill = SmokeNow)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Effect of Smoking on Pulse Rate by Age Group", x = "Age Group", y = "Mean Pulse Rate") +
scale_fill_manual(values = c("No" = "blue", "Yes" = "red")) +
theme_minimal()
A linear model is used to understand the relationship between pulse rate and the interaction between smoking status and age group. Predictions from this model are then visualized, showing how pulse rate varies across age groups for smokers and non-smokers.
data$SmokeNow <- as.factor(data$SmokeNow)
data$AgeGroup <- as.factor(data$AgeGroup)
model <- lm(Pulse ~ SmokeNow * AgeGroup, data = data)
summary(model)
##
## Call:
## lm(formula = Pulse ~ SmokeNow * AgeGroup, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.908 -8.846 -1.251 7.747 50.555
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.8456 0.9876 75.786 < 2e-16 ***
## SmokeNowYes -1.5949 1.1767 -1.355 0.17539
## AgeGroup31-40 -0.5924 1.2688 -0.467 0.64064
## AgeGroup41-50 -2.3132 1.2130 -1.907 0.05661 .
## AgeGroup51-60 -3.5522 1.1787 -3.014 0.00260 **
## AgeGroup61-70 -3.9168 1.2116 -3.233 0.00124 **
## AgeGroup71+ -5.4009 1.1614 -4.650 3.45e-06 ***
## SmokeNowYes:AgeGroup31-40 4.6388 1.5956 2.907 0.00367 **
## SmokeNowYes:AgeGroup41-50 4.4090 1.5363 2.870 0.00413 **
## SmokeNowYes:AgeGroup51-60 2.5286 1.5268 1.656 0.09779 .
## SmokeNowYes:AgeGroup61-70 4.5745 1.7282 2.647 0.00816 **
## SmokeNowYes:AgeGroup71+ 6.2645 1.9582 3.199 0.00139 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.06 on 3102 degrees of freedom
## Multiple R-squared: 0.03276, Adjusted R-squared: 0.02933
## F-statistic: 9.551 on 11 and 3102 DF, p-value: < 2.2e-16
new_data <- expand.grid(AgeGroup = levels(data$AgeGroup),
SmokeNow = levels(data$SmokeNow))
# Predict values
new_data$PredictedPulse = predict(model, new_data)
levels(new_data$SmokeNow) <- c("Non-smoker", "Smoker")
library(ggplot2)
ggplot(new_data, aes(x = AgeGroup, y = PredictedPulse, color = SmokeNow)) +
geom_line(aes(group = SmokeNow), linewidth = 1) + # Corrected for deprecated warning
geom_point(size = 2) + # Add points to emphasize the actual predictions
labs(title = "Predicted Pulse Rate by Smoking Status Across Age Groups",
x = "Age Group", y = "Predicted Pulse Rate") +
theme_minimal() +
scale_color_manual(values = c("Non-smoker" = "blue", "Smoker" = "red"))
The linear model specified in the script (lm(Pulse ~ SmokeNow *
AgeGroup, data = data)) includes an interaction term (SmokeNow *
AgeGroup). This interaction term allows the model to estimate different
effects of age on pulse rate for smokers compared to non-smokers. In
statistical terms, it’s testing whether the slope of the relationship
between age and pulse rate differs by smoking status.