library(ggplot2)  # needed for the ggplot() calls below
LGBdata <- read.delim("/Users/oliviabrady/Desktop/LGBdata/DS0001/37166-0001-Data.tsv")
Variables:
Life Satisfaction (W1LIFESAT) is my response variable, measured on a Likert scale of 1-7
Community Connectedness (W1CONNECTEDNESS) is my numeric predictor variable, measured on a Likert scale of 1-4
Cohort (COHORT) is my categorical predictor, showing which generation the participant is in: young adult (1), middle-aged adult (2), or older adult (3)
# fitting the model
slr <- lm(W1LIFESAT~W1CONNECTEDNESS, data=LGBdata)
summary(slr)
##
## Call:
## lm(formula = W1LIFESAT ~ W1CONNECTEDNESS, data = LGBdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5668 -1.2064 0.1541 1.3220 2.9786
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.66842 0.22807 16.085 <2e-16 ***
## W1CONNECTEDNESS 0.22459 0.07554 2.973 0.003 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.621 on 1446 degrees of freedom
## (70 observations deleted due to missingness)
## Multiple R-squared: 0.006075, Adjusted R-squared: 0.005388
## F-statistic: 8.839 on 1 and 1446 DF, p-value: 0.002998
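The relationship between life satisfaction and community connectedness is small but statistically significant: each one-point increase in connectedness is associated with a 0.22-point increase in life satisfaction (p=.003), and the model explains less than 1% of the variance in life satisfaction (R-squared = 0.006).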
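To put the size of this slope in concrete terms, the sketch below plugs the coefficients printed above into the fitted line at the two ends of the 1-4 connectedness scale. The values are copied by hand from the summary output rather than pulled from `slr`, and the names `b0`, `b1`, and `pred` are only illustrative.

```r
# Coefficients copied from the summary(slr) output above
b0 <- 3.66842   # intercept
b1 <- 0.22459   # slope for W1CONNECTEDNESS

# Predicted life satisfaction at the lowest and highest connectedness scores
connectedness <- c(1, 4)
pred <- b0 + b1 * connectedness
pred

# Predicted change across the full 1-4 range of the predictor
diff(pred)
```

Moving from the bottom to the top of the connectedness scale shifts predicted life satisfaction by roughly two-thirds of a point on the 1-7 scale, which is consistent with the low R-squared.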
# plotting the relationship
ggplot(LGBdata, aes(W1CONNECTEDNESS, W1LIFESAT)) +
geom_point(position="jitter", alpha=.5) +
geom_abline(intercept=slr$coefficients[1], slope=slr$coefficients[2], color="red") +
xlab("Community Connectedness") +
ylab("Life Satisfaction")
Dummy coding (columns: COHORT2, COHORT3), with the young cohort as the reference level:
young cohort: 0,0
middle cohort: 1,0
oldest cohort: 0,1
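R builds these indicator columns itself when a factor enters a model formula. One way to check the coding it uses is to ask for the contrast matrix of a factor with the same three levels. This is a minimal sketch that assumes R's default treatment contrasts; the `cohort` vector is a stand-in, not the real data.

```r
# Stand-in factor with the three cohort codes used in the data
cohort <- as.factor(c(1, 2, 3))

# Default treatment contrasts: rows are cohorts, columns are the dummy variables
dummy_coding <- contrasts(cohort)
dummy_coding
```

Each row is a cohort and each column is one of the indicators that appear in the regression output as `as.factor(COHORT)2` and `as.factor(COHORT)3`. The first level is the reference and gets a zero in both columns.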
#fitting the model
clr <- lm(W1LIFESAT~as.factor(COHORT), data=LGBdata)
summary(clr)
##
## Call:
## lm(formula = W1LIFESAT ~ as.factor(COHORT), data = LGBdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5210 -1.1793 0.2207 1.2790 2.8424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.15764 0.06303 65.958 < 2e-16 ***
## as.factor(COHORT)2 0.22165 0.10550 2.101 0.035807 *
## as.factor(COHORT)3 0.36339 0.09803 3.707 0.000217 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.621 on 1491 degrees of freedom
## (24 observations deleted due to missingness)
## Multiple R-squared: 0.009486, Adjusted R-squared: 0.008158
## F-statistic: 7.14 on 2 and 1491 DF, p-value: 0.0008204
The relationship between cohort and life satisfaction is significant (F(2, 1491) = 7.14, p=.0008). The youngest cohort has the lowest mean life satisfaction (4.16), the middle cohort is in between (4.38), and the oldest cohort has the highest (4.52).
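Because cohort is the only predictor in this model, the intercept is the mean of the reference cohort and each dummy coefficient is a difference from that mean. The sketch below rebuilds the three cohort means from the coefficients printed above; the variable names are illustrative and the numbers are copied by hand from the output.

```r
# Coefficients copied from the summary(clr) output above
intercept <- 4.15764   # mean life satisfaction for cohort 1 (reference level)
cohort2   <- 0.22165   # cohort 2 minus cohort 1
cohort3   <- 0.36339   # cohort 3 minus cohort 1

# Fitted mean for each cohort
cohort_means <- c(young  = intercept,
                  middle = intercept + cohort2,
                  older  = intercept + cohort3)
round(cohort_means, 2)
```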
# graphing the relationship
ggplot(LGBdata, aes(as.factor(COHORT), W1LIFESAT , fill= as.factor(COHORT))) +
geom_boxplot() +
xlab("Cohort") +
ylab("Life Satisfaction")
# fitting the model
mlr <- lm(W1LIFESAT~W1CONNECTEDNESS + as.factor(COHORT), data=LGBdata)
summary(mlr)
##
## Call:
## lm(formula = W1LIFESAT ~ W1CONNECTEDNESS + as.factor(COHORT),
## data = LGBdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8071 -1.2196 0.1626 1.2702 3.1974
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.40267 0.23776 14.312 < 2e-16 ***
## W1CONNECTEDNESS 0.25449 0.07560 3.366 0.000781 ***
## as.factor(COHORT)2 0.23493 0.10639 2.208 0.027396 *
## as.factor(COHORT)3 0.38653 0.09998 3.866 0.000115 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.613 on 1444 degrees of freedom
## (70 observations deleted due to missingness)
## Multiple R-squared: 0.01666, Adjusted R-squared: 0.01461
## F-statistic: 8.154 on 3 and 1444 DF, p-value: 2.199e-05
ggplot(LGBdata, aes(W1CONNECTEDNESS, W1LIFESAT, color= as.factor(COHORT))) +
geom_point(position="jitter", alpha=.7) +
geom_abline(intercept=mlr$coefficients[1], slope=mlr$coefficients[2], color="red") +
geom_abline(intercept=mlr$coefficients[1]+mlr$coefficients[3], slope=mlr$coefficients[2], color="green") +
geom_abline(intercept=mlr$coefficients[1]+mlr$coefficients[4], slope=mlr$coefficients[2], color="blue") +
xlab("Community Connectedness") +
ylab("Life Satisfaction")
# fitting the model
mlri <- lm(W1LIFESAT~W1CONNECTEDNESS + as.factor(COHORT) + W1CONNECTEDNESS*as.factor(COHORT), data=LGBdata)
summary(mlri)
##
## Call:
## lm(formula = W1LIFESAT ~ W1CONNECTEDNESS + as.factor(COHORT) +
## W1CONNECTEDNESS * as.factor(COHORT), data = LGBdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9099 -1.2099 0.1677 1.2805 3.1213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7971 0.3521 10.784 <2e-16 ***
## W1CONNECTEDNESS 0.1243 0.1143 1.088 0.277
## as.factor(COHORT)2 -0.4838 0.5768 -0.839 0.402
## as.factor(COHORT)3 -0.2809 0.5297 -0.530 0.596
## W1CONNECTEDNESS:as.factor(COHORT)2 0.2411 0.1914 1.260 0.208
## W1CONNECTEDNESS:as.factor(COHORT)3 0.2241 0.1758 1.275 0.203
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.613 on 1442 degrees of freedom
## (70 observations deleted due to missingness)
## Multiple R-squared: 0.01823, Adjusted R-squared: 0.01483
## F-statistic: 5.356 on 5 and 1442 DF, p-value: 6.941e-05
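The interactions between cohort and community connectedness are not significant (p=.21 and p=.20), and adding them makes the connectedness and cohort terms individually nonsignificant as well. Part of the reason is that, once the interaction is in the model, the cohort coefficients describe cohort differences at a connectedness score of 0, which lies outside the 1-4 scale.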
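The coefficients in an interaction model are easier to read once they are combined into a slope per cohort and a cohort gap at a realistic connectedness score. The sketch below does that arithmetic with values copied by hand from the output above. The names are illustrative, and the score `x <- 3` is an arbitrary in-range value chosen for the example, not something taken from the data.

```r
# Coefficients copied from the summary(mlri) output above
b_conn    <- 0.1243    # connectedness slope for cohort 1 (reference)
b_c2      <- -0.4838   # cohort 2 vs cohort 1 when connectedness = 0
b_c3      <- -0.2809   # cohort 3 vs cohort 1 when connectedness = 0
b_conn_c2 <- 0.2411    # extra connectedness slope for cohort 2
b_conn_c3 <- 0.2241    # extra connectedness slope for cohort 3

# Connectedness slope within each cohort
slopes <- c(young  = b_conn,
            middle = b_conn + b_conn_c2,
            older  = b_conn + b_conn_c3)
slopes

# Cohort gap relative to the young cohort at an in-range score (assumed value)
x <- 3
gap_at_x <- c(middle = b_c2 + b_conn_c2 * x,
              older  = b_c3 + b_conn_c3 * x)
gap_at_x
```

At a score of 3 the estimated cohort gaps are positive and close to the cohort coefficients from the model without the interaction, so the negative cohort coefficients above are a consequence of evaluating the gap at 0 rather than a reversal of the pattern.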
ggplot(LGBdata, aes(W1CONNECTEDNESS, W1LIFESAT, color= as.factor(COHORT))) +
geom_point(position="jitter", alpha=.7) +
geom_abline(intercept=mlri$coefficients[1], slope=mlri$coefficients[2], color="red") +
geom_abline(intercept=mlri$coefficients[1]+mlri$coefficients[3], slope=mlri$coefficients[2]+mlri$coefficients[5], color="green") +
geom_abline(intercept=mlri$coefficients[1]+mlri$coefficients[4], slope=mlri$coefficients[2]+mlri$coefficients[6], color="blue") +
xlab("Community Connectedness") +
ylab("Life Satisfaction")
Comparing the MSE of Each Model
# MSE of SLR = 2.6265
anova(slr)
## Analysis of Variance Table
##
## Response: W1LIFESAT
## Df Sum Sq Mean Sq F value Pr(>F)
## W1CONNECTEDNESS 1 23.2 23.2146 8.8386 0.002998 **
## Residuals 1446 3797.9 2.6265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# MSE of CLR = 2.6264
anova(clr)
## Analysis of Variance Table
##
## Response: W1LIFESAT
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(COHORT) 2 37.5 18.7515 7.1397 0.0008204 ***
## Residuals 1491 3915.9 2.6264
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# MSE of MLR = 2.6021
anova(mlr)
## Analysis of Variance Table
##
## Response: W1LIFESAT
## Df Sum Sq Mean Sq F value Pr(>F)
## W1CONNECTEDNESS 1 23.2 23.2146 8.9213 0.0028661 **
## as.factor(COHORT) 2 40.4 20.2187 7.7700 0.0004401 ***
## Residuals 1444 3757.5 2.6021
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# MSE of MLRI = 2.6016
anova(mlri)
## Analysis of Variance Table
##
## Response: W1LIFESAT
## Df Sum Sq Mean Sq F value Pr(>F)
## W1CONNECTEDNESS 1 23.2 23.2146 8.9233 0.0028631 **
## as.factor(COHORT) 2 40.4 20.2187 7.7717 0.0004394 ***
## W1CONNECTEDNESS:as.factor(COHORT) 2 6.0 3.0093 1.1567 0.3148135
## Residuals 1442 3751.5 2.6016
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The multiple linear regression model with an interaction (mlri) had the lowest MSE (2.6016), but I would choose the mlr model (MSE = 2.6021), because the added complexity of the interaction terms outweighs the benefit of a difference in MSE of about .0005. Additionally, the interaction terms were not significant (F = 1.16, p=.31) and decreased the significance of the other variables, while all variables in mlr were significant, meaning both cohort and community connectedness had a significant effect on life satisfaction. One caveat is that clr was fit on 1,494 observations while the other three models were fit on 1,448, so its MSE is not strictly comparable to the others.
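The choice between the models with and without the interaction can also be framed as a single F-test on the two interaction terms together. The sketch below recomputes that test from the mean squares in the ANOVA table above. The numbers are copied by hand and the variable names are illustrative.

```r
# Values copied from the anova(mlri) table above
ms_interaction <- 3.0093   # Mean Sq for W1CONNECTEDNESS:as.factor(COHORT)
ms_residual    <- 2.6016   # Mean Sq for Residuals (the MSE)
df_interaction <- 2
df_residual    <- 1442

# F statistic and upper-tail p-value for the interaction terms jointly
f_stat  <- ms_interaction / ms_residual
p_value <- pf(f_stat, df_interaction, df_residual, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```

With the fitted objects in the workspace, `anova(mlr, mlri)` runs the same nested-model comparison directly, since both models were fit on the same 1,448 rows.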
I learned that variables that seem to have a relatively small effect (either on the graph or through a small slope term) can still be statistically significant. Especially in a dataset with many variables and a concept as complex as life satisfaction, it makes sense that many variables will each have a small effect but can still be significant. Here, even the chosen model (mlr) explains under 2% of the variance in life satisfaction (R-squared = 0.017).