In this lab assignment, you will practice interpreting interactions by visualizing them with different types of graphs. You’ll work with simulated datasets and explore interactions between categorical x categorical, linear x linear, and categorical x linear variables.
Please follow the instructions for each exercise, and use
ggplot2 for all visualizations.
Task: Use the dataset with two categorical variables and one outcome variable. Fit a model with a categorical x categorical interaction, and visualize the interaction using a bar graph with error bars.
Education_Level (e.g., “High School”, “College”) and
Job_Type (e.g., “Office”, “Field”), and an outcome variable
Job_Satisfaction.Education_Level and Job_Type on
Job_Satisfaction.emmeans to compare all groups. First run
emmeans and then pairs.# Simulate data
set.seed(123)
Education_Level <- factor(rep(c("High School", "College"), each = 50))
Job_Type <- factor(rep(c("Office", "Field"), each = 25, times = 2))
Job_Satisfaction <- ifelse(Education_Level == "College",
70 + 10 * (Job_Type == "Office"),
60 + 5 * (Job_Type == "Field")) + rnorm(100, sd = 5)
data1 <- data.frame(Education_Level, Job_Type, Job_Satisfaction)
# Calculate means and standard errors with correct handling
means1 <- data1 %>%
group_by(Education_Level, Job_Type) %>%
summarise(
Job_Satisfaction_Mean = mean(Job_Satisfaction),
SE = (sd(Job_Satisfaction) / sqrt(n())), # Calculate SE
lower = Job_Satisfaction_Mean - 1.96 * SE, # Lower bound of the confidence interval
upper = Job_Satisfaction_Mean + 1.96 * SE # Upper bound of the confidence interval
)
# Check the calculated values
print(means1)## # A tibble: 4 × 6
## # Groups: Education_Level [2]
## Education_Level Job_Type Job_Satisfaction_Mean SE lower upper
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 College Field 71.4 0.830 69.8 73.0
## 2 College Office 80.1 0.972 78.1 82.0
## 3 High School Field 65.5 0.919 63.7 67.3
## 4 High School Office 59.8 0.947 58.0 61.7
#Run your lm() model and summary() below using the data1 dataset
# Dummy Coding:
# High school is coded as 1, college is coded as 0
# Office is coded as 1, Field is coded as 0.
#Run your summary() below
model1 <- lm(Job_Satisfaction ~ Education_Level * Job_Type, data = data1)
summary(model1)##
## Call:
## lm(formula = Job_Satisfaction ~ Education_Level * Job_Type, data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5970 -2.8385 -0.2066 3.0467 10.3341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.4129 0.9187 77.735 < 2e-16
## Education_LevelHigh School -5.9022 1.2992 -4.543 1.61e-05
## Job_TypeOffice 8.6383 1.2992 6.649 1.80e-09
## Education_LevelHigh School:Job_TypeOffice -14.3157 1.8373 -7.792 7.83e-12
##
## (Intercept) ***
## Education_LevelHigh School ***
## Job_TypeOffice ***
## Education_LevelHigh School:Job_TypeOffice ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.593 on 96 degrees of freedom
## Multiple R-squared: 0.7344, Adjusted R-squared: 0.7261
## F-statistic: 88.47 on 3 and 96 DF, p-value: < 2.2e-16
Interpret any significant Effects:
All coefficients in the model are statistically significant (more than
.001), which represents strong effects for both main variables and their
interactions. The intercept 71.41 represents the average job
satisfaction for individuals with a college education working in field
jobs. Compared to this, high school graduates in field jobs report job
satisfaction scores 5.90 points lower. College-educated people working
an office job have a 8.64 point increase in jobsatisfaction compared to
field jobs, but show a singificant negative interaction (-14.32),
meaning the positive effect of having an office job doen’t apply the
same to high school graduates.
# Plot the bar graph of means with error bars.
ggplot(means1, aes(x = Job_Type, y = Job_Satisfaction_Mean, fill = Education_Level)) +
geom_bar(stat = "identity", position = position_dodge(), color = "black") +
geom_errorbar(aes(ymin = Job_Satisfaction_Mean - 1.96 * SE,
ymax = Job_Satisfaction_Mean + 1.96 * SE),
width = 0.2, position = position_dodge(0.9)) +
labs(title = "Job Satisfaction by Education Level and Job Type",
y = "Mean Job Satisfaction") +
theme_minimal()Interpretation of Plot: The bar graph show a clear interaction between education level and job types on jon satisfaction. For people with a college education with office jobs, they report a noticably higher job satisfaction than those in field jobs. In individuals with only a high school education, they report a lower satisfaction than those in field jobs. Office jobs are associated with higher satisfaction only for college-educated indivduals, and high school graduates report the lowest satisfaction office jobs.
Below run emmeans to estimate the marginal means.
## Education_Level Job_Type emmean SE df lower.CL upper.CL
## College Field 71.4 0.919 96 69.6 73.2
## High School Field 65.5 0.919 96 63.7 67.3
## College Office 80.1 0.919 96 78.2 81.9
## High School Office 59.8 0.919 96 58.0 61.7
##
## Confidence level used: 0.95
Below run pairs to perform pairwise comparisons (post
hoc tests) with Tukey adjustment for multiple comparisons.
## contrast estimate SE df t.ratio p.value
## College Field - High School Field 5.90 1.3 96 4.543 0.0001
## College Field - College Office -8.64 1.3 96 -6.649 <.0001
## College Field - High School Office 11.58 1.3 96 8.913 <.0001
## High School Field - College Office -14.54 1.3 96 -11.192 <.0001
## High School Field - High School Office 5.68 1.3 96 4.370 0.0002
## College Office - High School Office 20.22 1.3 96 15.562 <.0001
##
## P value adjustment: tukey method for comparing a family of 4 estimates
Task: Use the dataset with two continuous variables
and one outcome variable. Fit a model with a linear x linear
interaction, and visualize the interaction using a 2D plot with the
interactions library.
Age
and Weekly_Hours_Worked, and an outcome variable
Income.Age and Weekly_Hours_Worked.interact_plot() to visualize the
interaction.# Simulate data
set.seed(123)
Age <- rnorm(100, mean = 40, sd = 10)
Weekly_Hours_Worked <- rnorm(100, mean = 40, sd = 5)
Income <- 30000 + 500 * Age + 1000 * Weekly_Hours_Worked + 50 * Age * Weekly_Hours_Worked + rnorm(100, sd = 5000)
data2 <- data.frame(Age, Weekly_Hours_Worked, Income)# Fit the lm() model
model2 <- lm(Income ~ Age * Weekly_Hours_Worked, data = data2)
# summary() of the model
summary(model2)##
## Call:
## lm(formula = Income ~ Age * Weekly_Hours_Worked, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9360 -3389 -543 2948 11583
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56644.49 18039.76 3.140 0.00225 **
## Age -182.83 446.51 -0.409 0.68311
## Weekly_Hours_Worked 397.91 461.04 0.863 0.39025
## Age:Weekly_Hours_Worked 65.91 11.45 5.757 1.03e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4734 on 96 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9668
## F-statistic: 962.5 on 3 and 96 DF, p-value: < 2.2e-16
interact_plot(model2, pred = "Age", modx = "Weekly_Hours_Worked", plot.points = TRUE, interval = TRUE)Interpretation of the plot: The lines on the plot representing weekly hours worked are not parallel, meaning there is an interaction between age and weekly hours worked.The lines indicate income increases more with age for individuals who work morw hours per week.
Task: Use the simulated dataset with one categorical variable and one continuous variable as predictors. Fit a model with a categorical x linear interaction, and visualize the interaction using an interaction plot.
Gender and one continuous variable
Study_Hours, and an outcome variable
Test_Score.Gender and Study_Hours.ggplot2 to visualize
the interaction.# Simulate data
set.seed(123)
Gender <- factor(rep(c("Male", "Female"), each = 50))
Study_Hours <- rnorm(100, mean = 5, sd = 2)
Test_Score <- 60 + 10 * (Gender == "Female") + 5 * Study_Hours + 5 * (Gender == "Female") * Study_Hours + rnorm(100, sd = 5)
data3 <- data.frame(Gender, Study_Hours, Test_Score)# Fit the model
#Dummy coding: Female coded as 0, Male coded as 1
model3 <- lm(Test_Score ~ Gender * Study_Hours, data = data3)
# Summary of the model
summary(model3)##
## Call:
## lm(formula = Test_Score ~ Gender * Study_Hours, data = data3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1191 -3.3752 -0.4846 3.0552 15.0753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.3200 2.1265 34.009 < 2e-16 ***
## GenderMale -13.9836 2.9233 -4.784 6.22e-06 ***
## Study_Hours 9.5983 0.3805 25.223 < 2e-16 ***
## GenderMale:Study_Hours -4.5206 0.5323 -8.493 2.54e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.824 on 96 degrees of freedom
## Multiple R-squared: 0.9624, Adjusted R-squared: 0.9613
## F-statistic: 820.2 on 3 and 96 DF, p-value: < 2.2e-16
Interpretation for significant main effects: The model shows a significant relationship ebtween gender, studt hours, and test scores. The intercept is 72.32, meaning that when females study 0 hours, their expected test score is 72,32. The baseline test score for males is lower by 113.98 points, showing a negative coefficient for males. Overall, this means that without considering study hours, males score lower than females on average. Additioanlly, the effect of study hours on test scores is singificant: each additional hour of studying contributes to an increase of 9.60 points, holding gender constant.
# Plot the interaction
predict_grid <- expand.grid(
Gender = levels(data3$Gender),
Study_Hours = seq(min(data3$Study_Hours), max(data3$Study_Hours), length.out = 100)
)
predict_grid$Test_Score <- predict(lm(Test_Score ~ Gender * Study_Hours, data = data3), newdata = predict_grid)
ggplot(predict_grid, aes(x = Study_Hours, y = Test_Score, color = Gender)) +
geom_line(size = 1) +
labs(title = "Interaction between Gender and Study Hours on Test Score",
x = "Study Hours",
y = "Test Score") +
theme_minimal() +
scale_color_manual(values = c("blue", "red")) +
theme(legend.title = element_blank())## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Interpretation of graph: The slope of the line is steeper for females than it is for males, meaning the effect of additional study hours on test scores is stronger for females. Additionally, the lines are not parallel, meaing there is a significant interaction between gender and study hours. Overall, the graph concludes both males and females benefit from studying, but females see a greater improvement on their test scores when study hours increase.
Task: Given a multivariate dataset, create different types of graphs to visualize interactions and discuss which type of graph is most appropriate.
interact_plot for the continuous x continuous interaction.
Use any variables you’d like as long as they fit the variable type.# Simulate data
set.seed(123)
Age <- rnorm(100, mean = 35, sd = 8)
Experience <- rnorm(100, mean = 10, sd = 5)
Gender <- factor(rep(c("Male", "Female"), each = 50))
Job_Type <- factor(rep(c("Office", "Field"), each = 25, times = 2))
Salary <- 30000 + 1000 * Age + 2000 * Experience + 150 * Age * Experience +
5000 * (Gender == "Female") + rnorm(100, sd = 5000)
data4 <- data.frame(Age, Experience, Gender, Job_Type, Salary)
head(data4)## Age Experience Gender Job_Type Salary
## 1 30.51619 6.447967 Male Office 113921.3
## 2 33.15858 11.284419 Male Office 148415.8
## 3 47.46967 8.766541 Male Office 156098.7
## 4 35.56407 8.262287 Male Office 128880.7
## 5 36.03430 5.241907 Male Office 102779.7
## 6 48.72052 9.774861 Male Office 167324.5
## Age Experience Gender Job_Type Salary
## 95 45.88522 3.445992 Female Field 121507.64
## 96 30.19792 19.986067 Female Field 196034.22
## 97 52.49866 13.003544 Female Field 225240.32
## 98 47.26089 3.743643 Female Field 109532.84
## 99 33.11440 6.944170 Female Field 116600.46
## 100 26.78863 4.072600 Female Field 92548.31
# Categorical x Categorical Interaction (Bar graph with means and SE)
library(ggplot2)
library(dplyr)
summary_data <- data4 %>%
group_by(Gender, Job_Type) %>%
summarize(mean_salary = mean(Salary), se_salary = sd(Salary) / sqrt(n()))
ggplot(summary_data, aes(x = Job_Type, y = mean_salary, fill = Gender)) +
geom_bar(stat = "identity", position = "dodge", width = 0.7) +
geom_errorbar(aes(ymin = mean_salary - se_salary, ymax = mean_salary + se_salary),
position = position_dodge(0.7), width = 0.25) +
labs(x = "Job Type", y = "Mean Salary", title = "Categorical x Categorical Interaction (Gender x Job Type)") +
theme_minimal()Graph interpretation: The graph shows that females generally have higher mean salaries than males in both field and office jobs. The error bars suggest the variability in salary between these groups is not different y statistical significance.
# Categorical x Continuous plot
ggplot(data4, aes(x = Age, y = Salary, color = Gender)) +
geom_smooth(method = "lm", aes(group = Gender), se = TRUE) +
labs(x = "Age", y = "Salary", title = "Categorical x Continuous Interaction (Gender x Age)") +
theme_minimal()Graph interpretation: Both females and males show a positive relationship between age and salaary. However, males have a more gradual increase in salary with age, and females have a steeper increase in salary, particularly with younger ages.
# Continuous x Continuous plot
library(interactions)
lm_model <- lm(Salary ~ Age * Experience, data = data4)
interact_plot(lm_model, pred = "Age", modx = "Experience",
plot.points = TRUE, interval = TRUE,
main.title = "Continuous x Continuous Interaction (Age x Experience)")Graph interpretation: As experience increases, the effect of age on salary becomes bigger. Lines of different experiences show different slopes, indicating that the relationship with age and salary changes with the amount of experience an individual has.