1 Loading Libraries

#install.packages("sjPlot")

library(psych) # for the describe() command
library(car) # for the vif() command

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

library(sjPlot) # to visualize our results

2 Importing Data

# For HW, import the dataset you cleaned previously; this is the dataset you'll use throughout the rest of the semester

d <- read.csv(file="Data/projectdata.csv", header=TRUE)

3 State Your Hypothesis

We hypothesize that social media use, perceived social support, and life satisfaction will significantly predict levels of belonging.

4 Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct

str(d)

## 'data.frame':    2162 obs. of  7 variables:
##  $ ResponseID: chr  "R_BJN3bQqi1zUMid3" "R_2TGbiBXmAtxywsD" "R_12G7bIqN2wB2N65" "R_39pldNoon8CePfP" ...
##  $ age       : chr  "1 between 18 and 25" "1 between 18 and 25" "1 between 18 and 25" "1 between 18 and 25" ...
##  $ gender    : chr  "f" "m" "m" "f" ...
##  $ swb       : num  4.33 4.17 1.83 5.17 3.67 ...
##  $ belong    : num  2.8 4.2 3.6 4 3.4 4.2 3.9 3.6 2.9 2.5 ...
##  $ support   : num  6 6.75 5.17 5.58 6 ...
##  $ socmeduse : int  47 23 34 35 37 13 37 43 37 29 ...

# Place only continuous variables of interest in new dataframe, and name it "cont"
cont <- na.omit(subset(d, select=c(socmeduse, support, swb, belong)))
cont$row_id <- 1:nrow(cont)

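# Standardize all IVs (as.numeric() drops the one-column matrix that scale() returns)
cont$socmeduse <- as.numeric(scale(cont$socmeduse, center=TRUE, scale=TRUE))
cont$support <- as.numeric(scale(cont$support, center=TRUE, scale=TRUE))
cont$swb <- as.numeric(scale(cont$swb, center=TRUE, scale=TRUE))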

# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(cont)

##           vars    n    mean     sd  median trimmed    mad   min     max   range
## socmeduse    1 2162    0.00   1.00    0.09    0.03   0.86 -2.70    2.41    5.12
## support      2 2162    0.00   1.00    0.19    0.11   0.87 -4.88    1.30    6.17
## swb          3 2162    0.00   1.00    0.05    0.04   1.12 -2.59    1.93    4.52
## belong       4 2162    3.21   0.61    3.20    3.23   0.59  1.30    5.00    3.70
## row_id       5 2162 1081.50 624.26 1081.50 1081.50 801.35  1.00 2162.00 2161.00
##            skew kurtosis    se
## socmeduse -0.30     0.19  0.02
## support   -1.08     1.30  0.02
## swb       -0.35    -0.49  0.02
## belong    -0.27    -0.10  0.01
## row_id     0.00    -1.20 13.43
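As a minimal sketch of what the standardization step does, the snippet below z-scores the first ten `socmeduse` values shown in the `str()` output, once with `scale()` and once by hand. The object names (`x`, `z_scale`, `z_manual`, `z`) are illustrative only and are not part of the lab's code.

```r
# First ten socmeduse values from the str() output above
x <- c(47, 23, 34, 35, 37, 13, 37, 43, 37, 29)

z_scale  <- scale(x, center = TRUE, scale = TRUE)  # returns a one-column matrix
z_manual <- (x - mean(x)) / sd(x)                  # same z-scores as a plain vector

is.matrix(z_scale)                        # TRUE
all.equal(as.numeric(z_scale), z_manual)  # TRUE

# as.numeric() drops the matrix attributes so the result behaves like any other column
z <- as.numeric(z_scale)
round(c(mean = mean(z), sd = sd(z)), 2)
```

A standardized variable has a mean of 0 and an SD of 1, which is what the `describe()` output reports for the three IVs.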

# also use histograms to examine your continuous variables
hist(cont$socmeduse)

hist(cont$support)

hist(cont$swb)

hist(cont$belong)

# last, use scatterplots to examine each pairing of your continuous variables together
plot(cont$socmeduse, cont$belong)       # PUT YOUR DV 2ND (Y-AXIS)

plot(cont$support, cont$belong)       # PUT YOUR DV 2ND (Y-AXIS)

plot(cont$swb, cont$belong)       # PUT YOUR DV 2ND (Y-AXIS)

plot(cont$socmeduse, cont$support)

plot(cont$socmeduse, cont$swb)

plot(cont$support, cont$swb)

5 View Your Correlations

corr_output_m <- corr.test(cont)
corr_output_m

## Call:corr.test(x = cont)
## Correlation matrix 
##           socmeduse support   swb belong row_id
## socmeduse      1.00    0.19  0.09   0.27   0.02
## support        0.19    1.00  0.46   0.09   0.01
## swb            0.09    0.46  1.00  -0.15  -0.01
## belong         0.27    0.09 -0.15   1.00  -0.01
## row_id         0.02    0.01 -0.01  -0.01   1.00
## Sample Size 
## [1] 2162
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##           socmeduse support swb belong row_id
## socmeduse      0.00    0.00 0.0   0.00      1
## support        0.00    0.00 0.0   0.00      1
## swb            0.00    0.00 0.0   0.00      1
## belong         0.00    0.00 0.0   0.00      1
## row_id         0.37    0.59 0.6   0.74      0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

# CHECK FOR ANY CORRELATIONS AMONG YOUR IVs ABOVE .70 --> BAD (aka multicollinearity)
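The .70 check can also be done programmatically rather than by eye. The sketch below uses simulated data (not the lab's dataset) in which `swb` is deliberately built to correlate strongly with `support`, and lists any pair of IVs whose correlation exceeds .70 in absolute value. The names `sim`, `r`, `high`, and `flagged` are hypothetical.

```r
set.seed(1)
n <- 200
sim <- data.frame(socmeduse = rnorm(n), support = rnorm(n))
sim$swb <- 0.9 * sim$support + rnorm(n, sd = 0.3)  # built to correlate strongly with support

r <- cor(sim)                        # Pearson correlations among the IVs only
r[upper.tri(r, diag = TRUE)] <- NA   # keep each pair once (lower triangle)
high <- which(abs(r) > 0.70, arr.ind = TRUE)

flagged <- data.frame(iv1 = rownames(r)[high[, "row"]],
                      iv2 = colnames(r)[high[, "col"]],
                      r   = round(r[high], 2))
flagged
```

An empty `flagged` data frame would mean no pair of IVs crosses the cutoff.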

6 Run a Multiple Linear Regression

# ONLY use the commented-out line below if you need to remove outliers AFTER examining the Cook's distance and Residuals vs Leverage plots in your HW -- remember we practiced this in the ANOVA lab

#cont <- subset(cont, !(row_id %in% c(1970)))


# use the lm() command to run the regression. Put DV on the left,  IVs on the right separated by "+"
reg_model <- lm(belong  ~ socmeduse + support + swb, data = cont)
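If more than one case ever needs to be removed, it matters how the row IDs are compared. The toy example below (hypothetical data and IDs) shows that `!=` against a vector of IDs recycles the vector position by position and can silently remove nothing, while `%in%` tests membership.

```r
toy <- data.frame(row_id = 1:6, belong = c(3.2, 2.8, 4.1, 3.6, 2.9, 3.4))
drop_ids <- c(2, 5)  # hypothetical row_ids flagged as outliers

# != recycles drop_ids along row_id (2, 5, 2, 5, 2, 5) and compares position by position
wrong <- subset(toy, row_id != drop_ids)

# %in% asks "is this row_id anywhere in drop_ids?", which is the intended test
right <- subset(toy, !(row_id %in% drop_ids))

wrong$row_id  # 1 2 3 4 5 6 (nothing was removed)
right$row_id  # 1 3 4 6
```

With a single ID the two forms give the same result, so the `%in%` form is simply the safer habit.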

7 Check Your Assumptions

7.1 Multiple Linear Regression Assumptions

Assumptions we’ve discussed previously:

Observations should be independent
Variables should be continuous and normally distributed
Outliers should be identified and removed
Relationship between the variables should be linear
Homogeneity of variance (we are skipping here)
Residuals should be normal and have constant variance

New assumptions:

Number of cases should be adequate (N ≥ 80 + 8*m, where m is the number of IVs)
Independent variables should not be too correlated (aka multicollinearity)

7.2 Count Number of Cases

needed <- 80 + 8*3
nrow(cont) >= needed

## [1] TRUE

NOTE: For your homework, if you don’t have the required number of cases you’ll need to drop one of your independent variables. Reach out to me and we can figure out the best way to proceed!
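Rather than typing the number of IVs by hand, it can be read from the fitted model, which avoids a mismatch if a predictor is later dropped. This is a sketch on simulated data; `sim` and `sim_model` are stand-ins for `cont` and `reg_model`.

```r
set.seed(2)
sim <- data.frame(socmeduse = rnorm(150), support = rnorm(150), swb = rnorm(150))
sim$belong <- 3 + 0.2 * sim$socmeduse + rnorm(150)
sim_model <- lm(belong ~ socmeduse + support + swb, data = sim)

m <- length(attr(terms(sim_model), "term.labels"))  # number of IVs in the formula
needed <- 80 + 8 * m                                 # rule of thumb used in this lab

c(m = m, needed = needed, have = nobs(sim_model))
nobs(sim_model) >= needed
```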

7.3 Check for multicollinearity

Higher values indicate more multicollinearity
Cutoff is usually 5

# Variance Inflation Factor = VIF
vif(reg_model)

## socmeduse   support       swb 
##  1.037566  1.304689  1.267787

There is no evidence of high multicollinearity among the variables in the model, as all VIF values are well below the cutoff of 5. Therefore, no variables need to be removed on the basis of multicollinearity.
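To see where a VIF comes from: for each IV, regress it on the other IVs and compute 1 / (1 - R²). The sketch below does this by hand on simulated data (not the lab's dataset) and, if the car package is installed, compares the result with `vif()`. The names `sim`, `sim_model`, `ivs`, and `vif_manual` are illustrative.

```r
set.seed(3)
n <- 300
sim <- data.frame(socmeduse = rnorm(n), support = rnorm(n))
sim$swb <- 0.5 * sim$support + rnorm(n)   # swb overlaps somewhat with support
sim$belong <- 3 + 0.2 * sim$socmeduse + 0.1 * sim$support - 0.15 * sim$swb + rnorm(n)
sim_model <- lm(belong ~ socmeduse + support + swb, data = sim)

ivs <- c("socmeduse", "support", "swb")
vif_manual <- sapply(ivs, function(iv) {
  others <- setdiff(ivs, iv)
  # R-squared from predicting this IV with the remaining IVs
  r2 <- summary(lm(reformulate(others, response = iv), data = sim))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 3)

# Optional cross-check against car::vif(), only if car is available
if (requireNamespace("car", quietly = TRUE)) {
  print(all.equal(car::vif(sim_model), vif_manual))
}
```

A VIF of 1 means the IV shares no variance with the others; larger values mean more of its variance is predictable from the other IVs.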

7.4 Check linearity with Residuals vs Fitted plot

The plot below shows the residuals for each case plotted against the fitted (predicted) values. The red line is the average residual at each level of the predicted dependent variable. If the assumption of linearity is met, the red line should be roughly horizontal, which indicates that the residuals average to around zero across the range of fitted values. However, a bit of deviation is okay: just like with skewness and kurtosis, there’s a range that we can work in before non-linearity becomes a critical issue. For some examples of good Residuals vs Fitted plots and ones that show serious errors, check out this page.

plot(reg_model, 1)

The Residuals vs Fitted plot shows that the red line is mostly flat and centered around zero, suggesting that the linearity assumption is reasonably met. The residuals appear randomly scattered without a clear pattern, which is a good sign. While there is some slight spread and a few mild outliers, these are not severe enough to suggest major problems with the model. Overall, the linearity assumption appears to be met.
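For intuition about what this plot contains, it can be rebuilt from the fitted values and residuals directly. The sketch below uses simulated data with a truly linear relationship, draws a smoothed line similar to the red one, and also prints the average residual within quartiles of the fitted values as a rough numeric companion. All object names are hypothetical.

```r
set.seed(4)
sim <- data.frame(x = rnorm(200))
sim$y <- 1 + 0.5 * sim$x + rnorm(200)
sim_model <- lm(y ~ x, data = sim)

f <- fitted(sim_model)   # predicted values
e <- resid(sim_model)    # residuals

plot(f, e, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 3)              # reference line at zero
lines(lowess(f, e), col = "red")    # smoothed average residual

# Average residual within quartiles of the fitted values
bins <- cut(f, quantile(f, 0:4 / 4), include.lowest = TRUE)
bin_means <- tapply(e, bins, mean)
round(bin_means, 3)
```

When linearity holds, the bin averages hover near zero with no steady trend from the lowest bin to the highest.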

7.5 Check for outliers using Cook’s distance and a Residuals vs Leverage plot

The plots below both address leverage, or how much each data point is able to influence the regression line. Outliers are points that have undue influence on the regression line, the way that Bill Gates entering the room has an undue influence on the mean income.

The first plot, Cook’s distance, is a visualization of a score called (you guessed it) Cook’s distance, calculated for each case (aka row or participant) in the dataframe. Cook’s distance tells us how much the regression would change if the point were removed. The second plot also includes the residuals in the examination of leverage. The standardized residuals are on the y-axis and leverage is on the x-axis; this shows us which points have high residuals (are far from the regression line) and high leverage. Points that have large residuals and high leverage are especially worrisome, because they are far from the regression line but are also exerting a large influence on it.

# Cook's distance
plot(reg_model, 4)

# Residuals vs Leverage
plot(reg_model, 5)

The Cook’s distance plot shows that most data points have very little influence on the model, with only a few cases (observations 1378, 1530, and 2017) standing out slightly. None of these points has a Cook’s distance approaching the common rule-of-thumb cutoff of 1, so their influence is not extreme. In the Residuals vs Leverage plot, the same three observations appear with relatively higher leverage and residuals, but none falls beyond the Cook’s distance threshold. The solid red line in this plot is a smoothed trend of the standardized residuals across leverage (Cook’s distance thresholds are drawn as dashed contours), and it curves slightly upward on the right. Overall, there are no severe outliers based on either plot, and no observations need to be removed.
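The values behind both plots can also be inspected as numbers, which helps when deciding whether a case crosses the cutoff of 1. The sketch below uses simulated data with one deliberately planted influential case (row 100); none of it comes from the lab's dataset, and the object names are illustrative.

```r
set.seed(5)
sim <- data.frame(x = rnorm(100))
sim$y <- 2 + 0.5 * sim$x + rnorm(100)
sim[100, ] <- c(6, -8)   # plant one high-leverage case with a large residual
sim_model <- lm(y ~ x, data = sim)

cd   <- cooks.distance(sim_model)  # influence: one value per case
lev  <- hatvalues(sim_model)       # leverage (x-axis of Residuals vs Leverage)
zres <- rstandard(sim_model)       # standardized residuals (y-axis)

head(sort(cd, decreasing = TRUE), 3)  # the three most influential cases
which(cd > 1)                         # cases beyond the rule-of-thumb cutoff
round(c(leverage = lev[[100]], std_resid = zres[[100]]), 2)
```

In the lab's own model the equivalent check would be run on `reg_model`; a case needs both high leverage and a large residual to produce a large Cook's distance.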

7.6 Check normality of residuals with a Q-Q plot

This plot is a bit new. It’s called a Q-Q plot and shows the standardized residuals plotted against the quantiles of a normal distribution. If our residuals are perfectly normal, the points will fall on the dashed line. This page shows how different types of non-normality appear on a Q-Q plot.

It’s normal for Q-Q plots to show a bit of deviation at the ends. This page shows some examples that help us put our Q-Q plot into context.

plot(reg_model, 2)

The Q-Q plot shows that the standardized residuals from the regression model closely follow the expected diagonal line, indicating that the residuals are approximately normally distributed. There are minor deviations at the tails, but these are typical and within an acceptable range, so the assumption of normality is reasonably met.
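A Q-Q plot can be rebuilt from the standardized residuals with `qqnorm()` and `qqline()`, which makes it clearer what is being compared. This sketch uses simulated data with normal errors; the correlation between theoretical and sample quantiles is offered only as a rough summary of how closely the points follow a line, not as a formal test.

```r
set.seed(6)
sim <- data.frame(x = rnorm(200))
sim$y <- 1 + 0.5 * sim$x + rnorm(200)
sim_model <- lm(y ~ x, data = sim)

z <- rstandard(sim_model)   # standardized residuals

# qqnorm() returns the theoretical (x) and sample (y) quantiles it plots
qq <- qqnorm(z, main = "Q-Q plot of standardized residuals")
qqline(z, lty = 2)          # dashed reference line through the quartiles

qq_cor <- cor(qq$x, qq$y)   # close to 1 when the points hug the line
qq_cor
```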

7.7 Issues with My Data

Before interpreting our results, we assessed our variables to see if they met the assumptions for a multiple linear regression. We detected slight issues with linearity in a Residuals vs Fitted plot. However, we did not detect any outliers (by visually analyzing Cook’s Distance and Residuals vs Leverage plots) or any serious issues with the normality of our residuals (by visually analyzing a Q-Q plot), nor were there any issues of multicollinearity among our three independent variables.

8 View Test Output

summary(reg_model)

## 
## Call:
## lm(formula = belong ~ socmeduse + support + swb, data = cont)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02257 -0.36871  0.03267  0.38478  1.78645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.20985    0.01224 262.268  < 2e-16 ***
## socmeduse    0.16253    0.01247  13.034  < 2e-16 ***
## support      0.09385    0.01398   6.712 2.45e-11 ***
## swb         -0.14831    0.01378 -10.760  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5691 on 2158 degrees of freedom
## Multiple R-squared:  0.1242, Adjusted R-squared:  0.123 
## F-statistic:   102 on 3 and 2158 DF,  p-value: < 2.2e-16
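The statistics reported in the write-up can be pulled from the `summary()` object instead of being retyped. The following is a sketch on simulated data, with `sim`, `sim_model`, and `p_text` as hypothetical names.

```r
set.seed(8)
n <- 250
sim <- data.frame(socmeduse = rnorm(n), support = rnorm(n), swb = rnorm(n))
sim$belong <- 3.2 + 0.16 * sim$socmeduse - 0.15 * sim$swb + rnorm(n, sd = 0.6)
sim_model <- lm(belong ~ socmeduse + support + swb, data = sim)

s <- summary(sim_model)
f <- s$fstatistic   # named vector: value, numdf, dendf

# Overall model p-value from the F distribution
p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
p_text <- if (p < .001) "p < .001" else sprintf("p = %.3f", p)

sprintf("Adj. R2 = %.2f, F(%.0f, %.0f) = %.2f, %s",
        s$adj.r.squared, f[["numdf"]], f[["dendf"]], f[["value"]], p_text)
```

The denominator degrees of freedom equal the number of cases minus the number of IVs minus 1, which is how 2162 cases and 3 IVs give 2158 in the output above.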

# Note for section below: to type lowercase Beta below (ß) you need to hold down Alt key and type 225 on numeric keypad. If that doesn't work you should be able to copy/paste it from somewhere else

Effect size, based on Regression Beta (Estimate) value:

Trivial: Less than 0.10
Small: 0.10–0.29
Medium: 0.30–0.49
Large: 0.50 or greater
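These cutoffs can be applied with a small helper so the labels are not assigned by eye. `effect_label()` is a hypothetical function written for illustration; the estimates are the ones from the `summary(reg_model)` output above, and the sign is ignored because only the magnitude determines the label.

```r
effect_label <- function(beta) {
  cut(abs(beta),
      breaks = c(0, 0.10, 0.30, 0.50, Inf),
      labels = c("trivial", "small", "medium", "large"),
      right = FALSE)   # intervals are [0, .10), [.10, .30), [.30, .50), [.50, Inf)
}

# Estimates copied from the regression output
est <- c(socmeduse = 0.16253, support = 0.09385, swb = -0.14831)
data.frame(estimate = est, effect = effect_label(est))
```

This gives small for social media use, trivial for perceived social support, and small for life satisfaction.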

9 Write Up Results

To test our hypothesis that social media use, perceived social support, and life satisfaction would significantly predict levels of belonging, we used a multiple linear regression to model the associations between these variables. We checked whether our data met the assumptions of a multiple linear regression; although there were slight issues with linearity, we continued with the analysis.

Our model was statistically significant, Adj. R² = 0.12, F(3, 2158) = 102, p < .001. The relationships of social media use and perceived social support with levels of belonging were positive, with effect sizes that were small (social media use) and trivial (perceived social support) (per Cohen, 1988), while the relationship between our remaining predictor (life satisfaction) and levels of belonging was negative and had a small effect size. Full output from the regression model is reported in Table 1.

Table 1: Multiple Regression Model Predicting Levels of Belonging
	Levels of Belonging
Predictors	Estimates	SE	CI	p
Intercept	3.21	0.01	3.19 – 3.23	<0.001
Social Media Use	0.16	0.01	0.14 – 0.19	<0.001
Perceived Social Support	0.09	0.01	0.07 – 0.12	<0.001
Life Satisfaction	-0.15	0.01	-0.18 – -0.12	<0.001
Observations	2162
R² / R² adjusted	0.124 / 0.123
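The columns of a table like Table 1 (estimate, SE, 95% CI, p) can be assembled from a fitted model with base R. This sketch uses simulated data, so its numbers will not match Table 1. It is an assumption that the table was produced with sjPlot, which is loaded at the top of the lab; its `tab_model()` function renders a formatted table from a model object.

```r
set.seed(7)
n <- 300
sim <- data.frame(socmeduse = rnorm(n), support = rnorm(n), swb = rnorm(n))
sim$belong <- 3.2 + 0.16 * sim$socmeduse + 0.09 * sim$support -
  0.15 * sim$swb + rnorm(n, sd = 0.57)
sim_model <- lm(belong ~ socmeduse + support + swb, data = sim)

s   <- summary(sim_model)
est <- coef(s)                           # matrix: Estimate, Std. Error, t value, Pr(>|t|)
ci  <- confint(sim_model, level = 0.95)  # 95% confidence intervals

tab <- data.frame(
  Estimate = round(est[, "Estimate"], 2),
  SE       = round(est[, "Std. Error"], 2),
  CI       = sprintf("%.2f to %.2f", ci[, 1], ci[, 2]),
  p        = format.pval(est[, "Pr(>|t|)"], digits = 3, eps = 0.001)
)
tab
round(c(R2 = s$r.squared, R2_adj = s$adj.r.squared), 3)

# With sjPlot installed, a formatted version could be requested with, for example:
# sjPlot::tab_model(sim_model, show.se = TRUE, dv.labels = "Levels of Belonging")
```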

References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.

Running a Multiple Linear Regression

Kate Foley

2025-04-30