Simple Linear Regression

library(ggplot2)  # needed for the ggplot() calls in the plotting chunks

LGBdata <- read.delim("/Users/oliviabrady/Desktop/LGBdata/DS0001/37166-0001-Data.tsv")

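Variables:

W1LIFESAT: life satisfaction (response)
W1CONNECTEDNESS: community connectedness (continuous predictor)
COHORT: cohort (categorical predictor with three levels, coded 1 to 3)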

# fitting the model
slr <- lm(W1LIFESAT~W1CONNECTEDNESS, data=LGBdata)
summary(slr)
## 
## Call:
## lm(formula = W1LIFESAT ~ W1CONNECTEDNESS, data = LGBdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5668 -1.2064  0.1541  1.3220  2.9786 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.66842    0.22807  16.085   <2e-16 ***
## W1CONNECTEDNESS  0.22459    0.07554   2.973    0.003 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.621 on 1446 degrees of freedom
##   (70 observations deleted due to missingness)
## Multiple R-squared:  0.006075,   Adjusted R-squared:  0.005388 
## F-statistic: 8.839 on 1 and 1446 DF,  p-value: 0.002998

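The relationship between life satisfaction and community connectedness is positive and statistically significant (slope = 0.22, p = .003), but small: connectedness explains less than 1% of the variance in life satisfaction (R-squared = 0.006).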
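To see why such a small R-squared can still be significant, it helps to recall that in a simple regression the slope's t statistic depends only on R-squared and the residual degrees of freedom. The sketch below plugs in the values printed in the summary output; the variable names are my own, chosen for illustration.

```r
# Values copied from the summary(slr) output
r_squared <- 0.006075
df_resid  <- 1446

# With one predictor, t = r * sqrt(df) / sqrt(1 - r^2), where r = sqrt(R^2)
# (the estimated slope is positive, so r is positive)
r      <- sqrt(r_squared)
t_stat <- r * sqrt(df_resid) / sqrt(1 - r_squared)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

round(c(r = r, t = t_stat, p = p_value), 4)
```

The correlation is only about 0.08, but with more than 1,400 observations that is enough to reproduce the t value and p-value reported for W1CONNECTEDNESS.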

# plotting the relationship
ggplot(LGBdata, aes(W1CONNECTEDNESS, W1LIFESAT)) +
  geom_point(position="jitter", alpha=.5) +
  geom_abline(intercept=slr$coefficients[1], slope=slr$coefficients[2], color="red") +
  xlab("Community Connectedness") +
  ylab("Life Satisfaction")

Categorical Linear Regression

Dummy coding:

#fitting the model
clr <- lm(W1LIFESAT~as.factor(COHORT), data=LGBdata)
summary(clr)
## 
## Call:
## lm(formula = W1LIFESAT ~ as.factor(COHORT), data = LGBdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5210 -1.1793  0.2207  1.2790  2.8424 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.15764    0.06303  65.958  < 2e-16 ***
## as.factor(COHORT)2  0.22165    0.10550   2.101 0.035807 *  
## as.factor(COHORT)3  0.36339    0.09803   3.707 0.000217 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.621 on 1491 degrees of freedom
##   (24 observations deleted due to missingness)
## Multiple R-squared:  0.009486,   Adjusted R-squared:  0.008158 
## F-statistic:  7.14 on 2 and 1491 DF,  p-value: 0.0008204

Cohort is a significant predictor of life satisfaction (F = 7.14, p = .0008). The youngest cohort (cohort 1, the reference level) has the lowest life satisfaction and the oldest cohort has the highest.
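As a rough illustration of what dummy coding does here, the sketch below builds the design matrix for a toy three-level factor (a stand-in for as.factor(COHORT), not the real data) and then recovers the three cohort means from the coefficients printed in the summary output. It assumes R's default treatment contrasts.

```r
# Toy factor with three levels, standing in for as.factor(COHORT)
cohort <- factor(c(1, 1, 2, 2, 3, 3))

# model.matrix() shows the columns lm() builds: an intercept plus one
# 0/1 indicator for each non-reference level (level 1 is the reference)
X <- model.matrix(~ cohort)
print(X)

# Coefficients copied from the summary(clr) output
b <- c(intercept = 4.15764, cohort2 = 0.22165, cohort3 = 0.36339)

# Each cohort's fitted mean is the intercept plus that cohort's coefficient
cohort_means <- c(cohort1 = b[["intercept"]],
                  cohort2 = b[["intercept"]] + b[["cohort2"]],
                  cohort3 = b[["intercept"]] + b[["cohort3"]])
round(cohort_means, 3)
```

Read this way, the intercept is the mean for cohort 1 and the other two coefficients are differences from that mean.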

# graphing the relationship
ggplot(LGBdata, aes(as.factor(COHORT), W1LIFESAT , fill= as.factor(COHORT))) +
  geom_boxplot() +
  xlab("Cohort") +
  ylab("Life Satisfaction")

Multiple Linear Regression

# fitting the model
mlr <- lm(W1LIFESAT~W1CONNECTEDNESS + as.factor(COHORT), data=LGBdata)
summary(mlr)
## 
## Call:
## lm(formula = W1LIFESAT ~ W1CONNECTEDNESS + as.factor(COHORT), 
##     data = LGBdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8071 -1.2196  0.1626  1.2702  3.1974 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.40267    0.23776  14.312  < 2e-16 ***
## W1CONNECTEDNESS     0.25449    0.07560   3.366 0.000781 ***
## as.factor(COHORT)2  0.23493    0.10639   2.208 0.027396 *  
## as.factor(COHORT)3  0.38653    0.09998   3.866 0.000115 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.613 on 1444 degrees of freedom
##   (70 observations deleted due to missingness)
## Multiple R-squared:  0.01666,    Adjusted R-squared:  0.01461 
## F-statistic: 8.154 on 3 and 1444 DF,  p-value: 2.199e-05

# plotting the relationship
ggplot(LGBdata, aes(W1CONNECTEDNESS, W1LIFESAT, color= as.factor(COHORT))) +
  geom_point(position="jitter", alpha=.7) +
  geom_abline(intercept=mlr$coefficients[1], slope=mlr$coefficients[2], color="red") +
  geom_abline(intercept=mlr$coefficients[1]+mlr$coefficients[3], slope=mlr$coefficients[2], color="green") +
  geom_abline(intercept=mlr$coefficients[1]+mlr$coefficients[4], slope=mlr$coefficients[2], color="blue") +
  xlab("Community Connectedness") +
  ylab("Life Satisfaction")

Multiple Linear Regression with an Interaction

# fitting the model
mlri <- lm(W1LIFESAT~W1CONNECTEDNESS + as.factor(COHORT) + W1CONNECTEDNESS*as.factor(COHORT), data=LGBdata)
summary(mlri)
## 
## Call:
## lm(formula = W1LIFESAT ~ W1CONNECTEDNESS + as.factor(COHORT) + 
##     W1CONNECTEDNESS * as.factor(COHORT), data = LGBdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9099 -1.2099  0.1677  1.2805  3.1213 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          3.7971     0.3521  10.784   <2e-16 ***
## W1CONNECTEDNESS                      0.1243     0.1143   1.088    0.277    
## as.factor(COHORT)2                  -0.4838     0.5768  -0.839    0.402    
## as.factor(COHORT)3                  -0.2809     0.5297  -0.530    0.596    
## W1CONNECTEDNESS:as.factor(COHORT)2   0.2411     0.1914   1.260    0.208    
## W1CONNECTEDNESS:as.factor(COHORT)3   0.2241     0.1758   1.275    0.203    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.613 on 1442 degrees of freedom
##   (70 observations deleted due to missingness)
## Multiple R-squared:  0.01823,    Adjusted R-squared:  0.01483 
## F-statistic: 5.356 on 5 and 1442 DF,  p-value: 6.941e-05

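The interactions between cohort and community connectedness are not significant (p = .208 and p = .203). Adding them also makes the connectedness and cohort terms individually nonsignificant. In this model those terms describe only the slope for cohort 1 and the cohort differences at a connectedness score of 0, so this is not surprising.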
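One note on the formula: in R, `a * b` already expands to `a + b + a:b`, so listing the main effects separately is harmless but redundant. The sketch below checks this on simulated stand-in data; the data frame `dat` and its columns `x`, `g`, and `y` are hypothetical and are not the LGB dataset.

```r
set.seed(1)

# Simulated stand-in data (hypothetical; not the LGB dataset)
n   <- 300
dat <- data.frame(
  x = runif(n, 1, 4),
  g = factor(sample(1:3, n, replace = TRUE))
)
dat$y <- 3.5 + 0.2 * dat$x + rnorm(n)

# Formula written the long way: main effects plus x * g
fit_long  <- lm(y ~ x + g + x * g, data = dat)
# Shorter equivalent: x * g already expands to x + g + x:g
fit_short <- lm(y ~ x * g, data = dat)

# Both formulas produce the same terms and the same estimates
attr(terms(fit_long), "term.labels")
all.equal(coef(fit_long), coef(fit_short))
```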

# plotting the relationship
ggplot(LGBdata, aes(W1CONNECTEDNESS, W1LIFESAT, color= as.factor(COHORT))) +
  geom_point(position="jitter", alpha=.7) +
  geom_abline(intercept=mlri$coefficients[1], slope=mlri$coefficients[2], color="red") +
  geom_abline(intercept=mlri$coefficients[1]+mlri$coefficients[3], slope=mlri$coefficients[2]+mlri$coefficients[5], color="green") +
  geom_abline(intercept=mlri$coefficients[1]+mlri$coefficients[4], slope=mlri$coefficients[2]+mlri$coefficients[6], color="blue") +
  xlab("Community Connectedness") +
  ylab("Life Satisfaction")
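
The three lines in this plot come from adding the cohort and interaction coefficients to the cohort 1 intercept and slope. As a small worked example, the sketch below does that arithmetic with the estimates printed in the summary(mlri) output; the object names are my own.

```r
# Coefficients copied from the summary(mlri) output
b0    <- 3.7971   # intercept (cohort 1)
b_con <- 0.1243   # connectedness slope (cohort 1)
b_c2  <- -0.4838  # cohort 2 intercept shift
b_c3  <- -0.2809  # cohort 3 intercept shift
b_i2  <- 0.2411   # cohort 2 slope shift
b_i3  <- 0.2241   # cohort 3 slope shift

# One intercept and one slope per cohort, matching the geom_abline() calls
lines_by_cohort <- data.frame(
  cohort    = 1:3,
  intercept = c(b0, b0 + b_c2, b0 + b_c3),
  slope     = c(b_con, b_con + b_i2, b_con + b_i3)
)
lines_by_cohort
```

The estimated slopes for cohorts 2 and 3 are steeper than the slope for cohort 1, but the interaction p-values indicate that these differences are not distinguishable from zero.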

Model Selection

Comparing the MSE of Each Model

# MSE of SLR = 2.6265
anova(slr)
## Analysis of Variance Table
## 
## Response: W1LIFESAT
##                   Df Sum Sq Mean Sq F value   Pr(>F)   
## W1CONNECTEDNESS    1   23.2 23.2146  8.8386 0.002998 **
## Residuals       1446 3797.9  2.6265                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# MSE of CLR = 2.6264
anova(clr)
## Analysis of Variance Table
## 
## Response: W1LIFESAT
##                     Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(COHORT)    2   37.5 18.7515  7.1397 0.0008204 ***
## Residuals         1491 3915.9  2.6264                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# MSE of MLR = 2.6021
anova(mlr)
## Analysis of Variance Table
## 
## Response: W1LIFESAT
##                     Df Sum Sq Mean Sq F value    Pr(>F)    
## W1CONNECTEDNESS      1   23.2 23.2146  8.9213 0.0028661 ** 
## as.factor(COHORT)    2   40.4 20.2187  7.7700 0.0004401 ***
## Residuals         1444 3757.5  2.6021                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# MSE of MLRI = 2.6016
anova(mlri)
## Analysis of Variance Table
## 
## Response: W1LIFESAT
##                                     Df Sum Sq Mean Sq F value    Pr(>F)    
## W1CONNECTEDNESS                      1   23.2 23.2146  8.9233 0.0028631 ** 
## as.factor(COHORT)                    2   40.4 20.2187  7.7717 0.0004394 ***
## W1CONNECTEDNESS:as.factor(COHORT)    2    6.0  3.0093  1.1567 0.3148135    
## Residuals                         1442 3751.5  2.6016                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The multiple linear regression model with an interaction (mlri) had the lowest MSE, but I would choose the mlr model, because a 0.0005 difference in MSE does not justify the added complexity of an interaction term. Additionally, the interaction terms were not significant (F = 1.16, p = .31) and reduced the significance of the other variables, while all variables in mlr were significant, meaning both cohort and community connectedness had a significant effect on life satisfaction. Note that clr was fit on 1,494 observations rather than the 1,448 used by the other three models, so its MSE is not strictly comparable to theirs.
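
The comparison between mlr and mlri can also be framed as a partial F-test on the nested models. The sketch below reconstructs it from the rounded residual sums of squares printed in the anova tables, so the F value differs slightly from the 1.1567 shown there; the object names are my own.

```r
# Residual sums of squares and degrees of freedom copied from the
# anova() tables (mlr and mlri are fit on the same 1,448 rows)
rss_mlr  <- 3757.5; df_mlr  <- 1444
rss_mlri <- 3751.5; df_mlri <- 1442

# MSE is the residual sum of squares divided by its degrees of freedom
mse_mlr  <- rss_mlr / df_mlr
mse_mlri <- rss_mlri / df_mlri

# Partial F-test: does adding the interaction significantly reduce the RSS?
f_stat  <- ((rss_mlr - rss_mlri) / (df_mlr - df_mlri)) / mse_mlri
p_value <- pf(f_stat, df_mlr - df_mlri, df_mlri, lower.tail = FALSE)

round(c(mse_mlr = mse_mlr, mse_mlri = mse_mlri, F = f_stat, p = p_value), 4)
```

With the fitted objects available, `anova(mlr, mlri)` runs the same test directly on the unrounded values.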

I learned that variables that seem to have a relatively small effect (either on the graph or as a small slope term) can still be statistically significant. Especially in a dataset with many variables and a concept as complex as life satisfaction, it makes sense that many variables will each have a small effect but can still be significant.