#install.packages("sjPlot")
library(psych) # for the describe() command
library(car) # for the vif() command
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
library(sjPlot) # to visualize our results
## Learn more about sjPlot with 'browseVignettes("sjPlot")'.
# For HW, import the dataset you cleaned previously; this will be the dataset you'll use throughout the rest of the semester
d <- read.csv(file="Data/projectdata.csv", header=T)
We hypothesize that openness to experience, intolerance of uncertainty, and self-esteem will significantly predict perceived stress, with higher levels of openness and self-esteem negatively associated with perceived stress, and higher intolerance of uncertainty positively associated with perceived stress.
# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame': 1252 obs. of 7 variables:
## $ X : int 1 20 30 31 33 49 57 81 86 104 ...
## $ gender : chr "female" "male" "female" "female" ...
## $ employment: chr "3 employed" "1 high school equivalent" "1 high school equivalent" "3 employed" ...
## $ big5_open : num 5.33 5.33 5 6 5 ...
## $ iou : num 3.19 4 1.59 3.37 1.7 ...
## $ rse : num 2.3 1.6 3.9 1.7 3.9 2.4 1.8 3.5 2.6 3 ...
## $ pss : num 3.25 3.75 1 3.25 2 2 4 1.25 2.5 2.5 ...
# Place only variables of interest in new dataframe
cont <- na.omit(subset(d, select=c(pss, big5_open,iou, rse)))
# Standardize all IVs
cont$big5_open <- scale(cont$big5_open, center=T, scale=T)
cont$iou <- scale(cont$iou, center=T, scale=T)
cont$rse <- scale(cont$rse, center=T, scale=T)
# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(cont)
## vars n mean sd median trimmed mad min max range skew
## pss 1 1252 2.95 0.95 3.00 2.95 1.11 1.00 5.00 4.00 0.06
## big5_open 2 1252 0.00 1.00 0.08 0.08 0.88 -3.78 1.56 5.33 -0.74
## iou 3 1252 0.00 1.00 -0.18 -0.06 1.09 -1.73 2.68 4.41 0.49
## rse 4 1252 0.00 1.00 0.11 0.02 1.03 -2.26 1.91 4.17 -0.21
## kurtosis se
## pss -0.76 0.03
## big5_open 0.43 0.03
## iou -0.60 0.03
## rse -0.73 0.03
# also use histograms to examine your continuous variables
hist(cont$big5_open)
hist(cont$iou)
hist(cont$rse)
hist(cont$pss)
# last, use scatterplots to examine your continuous variables together
plot(cont$big5_open, cont$pss) # PUT YOUR DV 2ND (Y-AXIS)
plot(cont$iou, cont$pss) # PUT YOUR DV 2ND (Y-AXIS)
plot(cont$rse, cont$pss) # PUT YOUR DV 2ND (Y-AXIS)
plot(cont$big5_open, cont$iou) # also plot your IVs against each other
plot(cont$big5_open, cont$rse)
plot(cont$iou, cont$rse)
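As an optional shortcut (not required for the HW), the base-R pairs() command draws every pairwise scatterplot in a single panel:
# optional: all pairwise scatterplots at once
pairs(cont)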
corr_output_m <- corr.test(cont)
corr_output_m
## Call:corr.test(x = cont)
## Correlation matrix
## pss big5_open iou rse
## pss 1.00 -0.04 0.64 -0.74
## big5_open -0.04 1.00 -0.09 0.10
## iou 0.64 -0.09 1.00 -0.66
## rse -0.74 0.10 -0.66 1.00
## Sample Size
## [1] 1252
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## pss big5_open iou rse
## pss 0.00 0.11 0 0
## big5_open 0.11 0.00 0 0
## iou 0.00 0.00 0 0
## rse 0.00 0.00 0 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
# CHECK FOR CORRELATIONS AMONG IVs ABOVE .70 --> BAD (aka multicollinearity)
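If you'd rather check this programmatically than by eye, this optional sketch pulls the IV-only correlations out of the corr.test() object above and flags any off-diagonal pair at or above .70:
# optional sketch: flag IV pairs with |r| >= .70
ivs <- c("big5_open", "iou", "rse")
iv_r <- corr_output_m$r[ivs, ivs]
which(abs(iv_r) >= .70 & row(iv_r) != col(iv_r), arr.ind=TRUE)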
# use this commented-out section below ONLY IF you need to remove outliers AFTER examining the Cook's distance and Residuals vs Leverage plots in your HW -- remember we practiced this in the ANOVA lab
#cont$row_id <- 1:nrow(cont)
#cont <- subset(cont, row_id != 1970)
# use the lm() command to run the regression
# DV on the left, IVs on the right separated by "+"
reg_model <- lm(pss ~ big5_open + iou + rse, data = cont)
Assumptions we’ve discussed previously: normality, homogeneity of variance, and the absence of influential outliers (all assessed here on the model’s residuals rather than on the raw variables).
New assumptions: an adequate sample size, the absence of multicollinearity among the IVs, and linearity.
needed <- 80 + 8*3 # rule of thumb: 80 cases plus 8 per predictor (we have 3 IVs)
nrow(cont) >= needed
## [1] TRUE
NOTE: For your homework, if you don’t have the required number of cases you’ll need to drop one of your independent variables. Reach out to me and we can figure out the best way to proceed!
# Variance Inflation Factor = VIF
vif(reg_model)
## big5_open iou rse
## 1.011011 1.781954 1.786101
NOTE: For your homework, you will need to discuss multicollinearity and any high values, but you don’t have to drop any variables.
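A common rule of thumb (conventions vary; 5 and 10 are the most-cited cutoffs) treats VIF values above 5 as a warning sign for multicollinearity. Here is an optional sketch that flags such predictors:
# optional sketch: flag predictors whose VIF exceeds the rule-of-thumb cutoff of 5
vif_vals <- vif(reg_model)
vif_vals[vif_vals > 5] # an empty result means no predictor exceeds the cutoff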
The plot below shows the residuals for each case and the fitted line. The red line is the average residual for the specified point of the dependent variable. If the assumption of linearity is met, the red line should be horizontal, indicating that the residuals average to around zero. However, a bit of deviation is okay – just like with skewness and kurtosis, there’s a range we can work in before non-linearity becomes a critical issue. For some examples of a good Residuals vs Fitted plot and ones that show serious errors, check out this page.
plot(reg_model, 1)
NOTE: For your homework, you’ll simply need to generate this plot and talk about whether your assumptions are met. This is going to be a judgement call, and that’s okay! In practice, you’ll always be making these judgement calls as part of a team, so this assignment is just about getting experience with it, not making the perfect call.
The plots below both address leverage, or how much each data point is able to influence the regression line. Outliers are points that have undue influence on the regression line, the way that Bill Gates entering the room has an undue influence on the mean income.
The first plot, Cook’s distance, is a visualization of a score called (you guessed it) Cook’s distance, calculated for each case (aka row or participant) in the dataframe. Cook’s distance tells us how much the regression would change if the point was removed. The second plot also includes the residuals in the examination of leverage. The standardized residuals are on the y-axis and leverage is on the x-axis; this shows us which points have high residuals (are far from the regression line) and high leverage. Points that have large residuals and high leverage are especially worrisome, because they are far from the regression line but are also exerting a large influence on it.
# Cook's distance
plot(reg_model, 4)
# Residuals vs Leverage
plot(reg_model, 5)
NOTE: For your homework, you’ll simply need to generate these plots, assess Cook’s distance in your dataset, and then identify any potential cases that are prominent outliers. Since we have some cutoffs, this process is a bit less subjective than some of the other assessments we’ve done here, which is a nice change!
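If you want the numbers behind the plot, the cooks.distance() command returns the Cook’s distance for every case; this optional sketch lists the largest values so you can compare them against whichever cutoff your course uses (1 and 4/n are both common conventions):
# optional sketch: view the largest Cook's distance values numerically
cooks_d <- cooks.distance(reg_model)
head(sort(cooks_d, decreasing=TRUE), 5) # compare against your chosen cutoff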
We examined Cook’s distance and residuals vs. leverage plots to identify influential outliers in our model. While cases 74, 436, and 1038 showed higher Cook’s distances than the others, none exceeded the threshold of concern, suggesting no undue influence on the regression results. Similarly, in the residuals vs. leverage plot, case 1038 demonstrated higher leverage, but its influence remained within acceptable bounds. As a result, no cases were removed, and all data points contribute to the final analysis.
This plot shows us the standardized residuals across the range of the regression line. Because the residuals are standardized, large residuals (whether positive or negative) are at the top of the plot, while small residuals (whether positive or negative) are at the bottom. If the assumption of homogeneity of variance (also called homoscedasticity) is met, the red line should be mostly flat and horizontal. If it deviates from the mean line, the variance is smaller or larger at that point of the regression line.
You can check out this page for some other examples of this type of plot.
Based on the Scale-Location plot, we assessed the assumption of homogeneity of variance in our regression model. The plot displays the square root of the standardized residuals versus the fitted values. The residuals appear to be evenly spread across the range of fitted values, with no clear funnel shape or systematic pattern, indicating that the assumption of homoscedasticity is met. This suggests that the variance of the residuals is consistent across levels of the predictor variables, supporting the validity of the regression results. However, while the visual inspection supports homogeneity of variance, additional tests could further confirm this finding if needed. Overall, no corrective measures, such as data transformation or robust standard errors, were deemed necessary for this analysis.
plot(reg_model, 3)
NOTE: For your homework, you’ll simply need to generate the Scale-Location plot and talk about how your plot compares to the ones pictured in the link above. Is it closer to the ‘good’ plots or one of the ‘bad’ plots? Again, this is a judgement call! It’s okay if you feel uncertain, and you won’t be penalized for that.
The Scale-Location plot suggests that the assumption of homoscedasticity is reasonably satisfied. The spread of residuals appears fairly consistent along the red trend line, with no pronounced funnel-shaped pattern indicating heteroscedasticity. A few cases, such as 1038, 86, and 74, display slightly larger residual values, but they do not severely impact the overall variance structure. Overall, the plot more closely resembles ‘good’ examples, supporting the validity of the regression model.
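The write-ups above note that additional tests could confirm homogeneity of variance. Since the car package is already loaded, one option is its non-constant variance test (a Breusch-Pagan-style test); this is optional, and a significant p-value would suggest heteroscedasticity:
# optional: formal test of non-constant error variance from the car package
ncvTest(reg_model)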
This plot is a bit new. It’s called a Q-Q plot and shows the standardized residuals plotted against a normal distribution. If our residuals are perfectly normal, the points will fall on the dashed line perfectly. This page shows how different types of non-normality appear on a Q-Q plot.
It’s normal for Q-Q plots to show a bit of deviation at the ends. This page shows some examples that help us put our Q-Q plot into context.
plot(reg_model, 2)
NOTE: For your homework, you’ll simply need to generate this plot and talk about how your plot compares to the ones pictured. Does your plot indicate any skew or kurtosis (i.e., thin/thick tails)? Is it closer to the ‘good’ or ‘bad’ plots from the second link above?
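As an optional numeric complement to the Q-Q plot, you can run a Shapiro-Wilk test on the residuals. Be aware that with n = 1252 even trivial deviations tend to come out significant, so the plot should carry more weight than the p-value:
# optional: formal normality test of the residuals (very sensitive at large n)
shapiro.test(resid(reg_model))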
The Q-Q plot for the regression residuals suggests that the residuals generally follow a normal distribution, as most points closely align with the diagonal line. However, minor deviations at the tails (e.g., cases 1038 and 86) indicate potential skew or kurtosis. Despite these deviations, the residuals appear reasonably normal, and no major transformations are deemed necessary. This plot aligns more closely with ‘good’ examples of Q-Q plots, supporting the validity of the model.

## Issues with My Data
Before interpreting our results, we assessed our variables to see if they met the assumptions for a multiple linear regression. We analyzed a Scale-Location plot and detected slight issues with homogeneity of variance, as well as slight issues with linearity in a Residuals vs Fitted plot. However, we did not detect any outliers (visually analyzing Cook’s Distance and Residuals vs Leverage plots) or any serious issues with the normality of our residuals (visually analyzing a Q-Q plot), nor were there any issues of multicollinearity among our three independent variables.
summary(reg_model)
##
## Call:
## lm(formula = pss ~ big5_open + iou + rse, data = cont)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.31397 -0.42300 0.00416 0.39861 2.56610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.95387 0.01733 170.46 <2e-16 ***
## big5_open 0.03365 0.01743 1.93 0.0538 .
## iou 0.24776 0.02314 10.71 <2e-16 ***
## rse -0.54262 0.02317 -23.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6131 on 1248 degrees of freedom
## Multiple R-squared: 0.5855, Adjusted R-squared: 0.5845
## F-statistic: 587.7 on 3 and 1248 DF, p-value: < 2.2e-16
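The confidence intervals reported in Table 1 below are not part of the summary() output; they can be obtained with confint():
# 95% confidence intervals for the coefficients (the CI column in Table 1)
round(confint(reg_model), 2)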
# you do not have to run the below code for the HW
#reg_model_test <- lm(rel_stability ~ satis, data = cont)
#summary(reg_model_test)
# Note for section below: to type a lowercase beta (β), copy/paste it from somewhere else; on Windows, holding Alt and typing 225 on the numeric keypad produces the similar-looking ß
Effect size, based on the regression beta (Estimate) value:

- Trivial: less than 0.10
- Small: 0.10–0.29
- Medium: 0.30–0.49
- Large: 0.50 or greater
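As an optional sketch, you can apply these cutoffs to the model in code. This follows the lab’s convention of reading the Estimate column directly against the guide above (the IVs were standardized with scale(), so the estimates are on that standardized scale):
# optional sketch: classify each predictor's |Estimate| by the cutoffs above
# result order matches the predictors: big5_open, iou, rse
betas <- abs(coef(reg_model)[-1]) # drop the intercept
cut(betas, breaks=c(0, .10, .30, .50, Inf), labels=c("trivial", "small", "medium", "large"), right=FALSE)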
To test our hypothesis that openness to experience, intolerance of uncertainty, and self-esteem would significantly predict perceived stress, we used a multiple linear regression to model the associations between these variables. We confirmed that our data met the assumptions of a linear regression, and although there were slight issues with homogeneity of variance and normality in the residuals, we proceeded with the analysis.
Our model was significant, Adj. R2 = …, F(3, …) = …, p < …. The relationship between self-esteem and perceived stress was significant and negative, with a … effect size (per Cohen, 1988), indicating that higher self-esteem was associated with lower levels of perceived stress. The relationships between openness to experience and intolerance of uncertainty with perceived stress were … and had effect sizes that were … (intolerance of uncertainty) or … (openness to experience). Full output from the regression model is reported in Table 1.

Table 1. Perceived Stress

| Predictors | Estimates | SE | CI | p |
|---|---|---|---|---|
| Intercept | 2.95 | 0.02 | 2.92 – 2.99 | <0.001 |
| Openness | 0.03 | 0.02 | -0.00 – 0.07 | 0.054 |
| Intolerance of Uncertainty | 0.25 | 0.02 | 0.20 – 0.29 | <0.001 |
| Self Esteem | -0.54 | 0.02 | -0.59 – -0.50 | <0.001 |
| Observations | 1252 | | | |
| R2 / R2 adjusted | 0.586 / 0.585 | | | |
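Table 1 has the shape of the regression tables produced by sjPlot, which we loaded at the top to visualize our results. A sketch of the call (show.se is needed because the SE column is not displayed by default):
# optional: produce a Table-1-style regression table with sjPlot
tab_model(reg_model, show.se=TRUE)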