1 Loading Libraries

#install.packages("sjPlot")

library(psych) # for the describe() command

## Warning: package 'psych' was built under R version 4.4.3

library(car) # for the vif() command

## Warning: package 'car' was built under R version 4.4.3

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

library(sjPlot) # to visualize our results

## Warning: package 'sjPlot' was built under R version 4.4.3

2 Importing Data

# For HW, import the dataset you cleaned previously; this is the dataset you'll use throughout the rest of the semester

d <- read.csv(file="Data/projectdata.csv", header=TRUE)

3 State Your Hypothesis

Openness and mental flexibility will significantly predict generalized anxiety disorder.

4 Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct

str(d)

## 'data.frame':    372 obs. of  7 variables:
##  $ X        : int  1 401 1390 2689 2752 2835 3935 4050 4058 4160 ...
##  $ income   : chr  "3 high" "3 high" "3 high" "2 middle" ...
##  $ mhealth  : chr  "none or NA" "obsessive compulsive disorder" "none or NA" "other" ...
##  $ big5_open: num  5.33 6 3 7 4 ...
##  $ mfq_state: num  3.62 5 3.5 3.12 3.88 ...
##  $ pas_covid: num  3.22 4 2.89 5 3.56 ...
##  $ gad      : num  1.86 2.14 1 4 2.14 ...

# Place only continuous variables of interest in new dataframe, and name it "cont"
cont <- na.omit(subset(d, select=c(big5_open, mfq_state, gad)))
cont$row_id <- 1:nrow(cont)

# Standardize all IVs
cont$big5_open <- as.numeric(scale(cont$big5_open, center=TRUE, scale=TRUE))  # scale() returns a matrix; as.numeric() keeps a plain column
cont$mfq_state <- as.numeric(scale(cont$mfq_state, center=TRUE, scale=TRUE))


# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(cont)

##           vars   n   mean     sd median trimmed    mad   min    max  range
## big5_open    1 372   0.00   1.00   0.10    0.05   0.94 -3.06   1.68   4.74
## mfq_state    2 372   0.00   1.00   0.25    0.05   0.84 -4.26   1.86   6.12
## gad          3 372   1.54   0.61   1.43    1.43   0.42  1.00   4.00   3.00
## row_id       4 372 186.50 107.53 186.50  186.50 137.88  1.00 372.00 371.00
##            skew kurtosis   se
## big5_open -0.49    -0.12 0.05
## mfq_state -0.73     1.19 0.05
## gad        1.57     2.39 0.03
## row_id     0.00    -1.21 5.58
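The means of 0 and SDs of 1 for the two IVs in the output above come from standardizing. As a minimal sketch of what scale() does, the toy vector below (made-up numbers, not the project data) is converted to z-scores by hand and compared with scale():

```r
# Hypothetical toy scores (not from the project data)
x <- c(2, 4, 4, 5, 7, 8)

# z-score by hand: subtract the mean, divide by the sample SD
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() turns it back into a plain vector
z_scale <- as.numeric(scale(x, center = TRUE, scale = TRUE))

all.equal(z_manual, z_scale)   # both approaches agree
round(mean(z_scale), 10)       # mean is 0 after centering
sd(z_scale)                    # SD is 1 after scaling
```

Standardizing changes only the units of the variable, so the shape of its histogram (skew and kurtosis) stays the same.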

# also use histograms to examine your continuous variables (all IVs and DV)
hist(cont$big5_open)

hist(cont$mfq_state)

hist(cont$gad)

# last, use scatterplots to examine each pairing of your continuous variables together
plot(cont$mfq_state, cont$gad)  # PUT YOUR DV 2ND (Y-AXIS)

plot(cont$big5_open, cont$gad)  # PUT YOUR DV 2ND (Y-AXIS)

plot(cont$mfq_state, cont$big5_open)  # Check relationship between IVs, order does not matter

5 View Your Correlations

corr_output_m <- corr.test(cont)
corr_output_m

## Call:corr.test(x = cont)
## Correlation matrix 
##           big5_open mfq_state   gad row_id
## big5_open      1.00      0.14  0.01   0.05
## mfq_state      0.14      1.00 -0.51   0.12
## gad            0.01     -0.51  1.00  -0.01
## row_id         0.05      0.12 -0.01   1.00
## Sample Size 
## [1] 372
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##           big5_open mfq_state  gad row_id
## big5_open      0.00      0.03 1.00   1.00
## mfq_state      0.01      0.00 0.00   0.09
## gad            0.80      0.00 0.00   1.00
## row_id         0.33      0.02 0.86   0.00
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

# CHECK FOR ANY CORRELATIONS AMONG YOUR IVs ABOVE .70 --> BAD (aka multicollinearity)
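With only two IVs the matrix is easy to read by eye, but with more predictors it can help to flag high pairs automatically. The sketch below uses simulated IVs (iv1, iv2, and iv3 are hypothetical names, and iv2 is deliberately built to overlap with iv1) and base R's cor() rather than corr.test():

```r
set.seed(1)
n <- 200
# Simulated IVs (hypothetical): iv2 is built to correlate strongly with iv1
iv1 <- rnorm(n)
iv2 <- 0.9 * iv1 + rnorm(n, sd = 0.3)
iv3 <- rnorm(n)
ivs <- data.frame(iv1, iv2, iv3)

r <- cor(ivs)                        # Pearson correlation matrix
r[upper.tri(r, diag = TRUE)] <- NA   # keep each pair once and drop the diagonal
high <- which(abs(r) > 0.70, arr.ind = TRUE)

# One row per flagged pair of IVs
flagged <- data.frame(
  var1 = rownames(r)[high[, "row"]],
  var2 = colnames(r)[high[, "col"]],
  r    = r[high]
)
flagged
```

An empty result would mean no pair of IVs crosses the .70 cutoff.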

6 Run a Multiple Linear Regression

# ONLY use the commented-out line below IF you need to remove outliers AFTER examining the Cook's distance and Residuals vs Leverage plots in your HW -- remember we practiced this in the ANOVA lab

#cont <- subset(cont, !(row_id %in% c(1970)))  # %in% also works if you list more than one row_id


# use the lm() command to run the regression. Put DV on the left,  IVs on the right separated by "+"
reg_model <- lm(gad  ~ mfq_state + big5_open, data = cont)

7 Check Your Assumptions

7.1 Multiple Linear Regression Assumptions

Assumptions we’ve discussed previously:

Observations should be independent
Variables should be continuous and normally distributed
Outliers should be identified and removed
Relationship between the variables should be linear
Homogeneity of variance [NOTE: We are skipping this here]
Residuals should be normal and have constant variance

New assumptions:

Psychological variables are measured reliably. Since my IVs (mfq_state and big5_open) and DV (gad) are psychological constructs, we assume they were measured using validated and reliable scales. If the measures are unreliable, measurement error can distort the regression estimates.
No significant interaction effects between IVs unless modeled. We assume that mfq_state and big5_open do not interact in predicting gad, unless we explicitly include an interaction term in the model. This keeps interpretation simple, but if the IVs do interact, a model with main effects only is misspecified.
No extreme outliers unduly influencing the model. We assume that no single case disproportionately influences the regression results (i.e., no case combines high leverage with a large Cook’s distance). This should be checked visually with Residuals vs Leverage and Cook’s distance plots.
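The no-interaction assumption can be checked rather than only assumed. As a hedged sketch on simulated data (x1, x2, and y are hypothetical stand-ins, not the project variables), one option is to fit the model with and without the interaction term and compare the two with anova():

```r
set.seed(2)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
# Simulated DV with no true interaction (hypothetical data)
y <- 1.5 - 0.3 * x1 + 0.05 * x2 + rnorm(n, sd = 0.5)
sim <- data.frame(y, x1, x2)

main_only   <- lm(y ~ x1 + x2, data = sim)
with_interx <- lm(y ~ x1 * x2, data = sim)   # x1 * x2 expands to x1 + x2 + x1:x2

# F-test of whether adding the interaction term improves the fit
cmp <- anova(main_only, with_interx)
cmp
```

A non-significant F for the second row suggests the simpler main-effects model is adequate for these data.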

7.2 Count Number of Cases

needed <- 80 + 8*2 
nrow(cont) >= needed

## [1] TRUE

NOTE: For your homework, if you don’t have the required number of cases reach out to me and we can figure out the best way to proceed!

7.3 Check for multicollinearity

Higher values indicate more multicollinearity
Cutoff is usually VIF > 5

# Variance Inflation Factor = VIF
vif(reg_model)

## mfq_state big5_open 
##  1.020514  1.020514

NOTE: For your homework, you will need to discuss multicollinearity and any high values in “Issues with My Data”, but you don’t have to drop any variables.
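To see where a VIF comes from, it can be computed by hand: regress one IV on the other IV(s) and take 1 / (1 - R²). With exactly two IVs, R² is just their squared correlation, which is why both VIFs in the output above are identical (r = .14 gives roughly 1 / (1 - .02) ≈ 1.02). The sketch below uses simulated predictors, not the project data:

```r
set.seed(3)
n <- 372
# Simulated predictors with a weak correlation (hypothetical data)
x1 <- rnorm(n)
x2 <- 0.14 * x1 + rnorm(n)

# VIF for x1: regress x1 on the other predictor(s), then 1 / (1 - R^2)
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# With exactly two predictors, R^2 equals the squared correlation
vif_from_r <- 1 / (1 - cor(x1, x2)^2)
c(vif_x1 = vif_x1, vif_from_r = vif_from_r)
```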

7.4 Check linearity with Residuals vs Fitted plot

The plot below shows each case’s residual plotted against its fitted (predicted) value. The red line traces the average residual across the range of fitted values. If the assumption of linearity is met, the red line should be roughly horizontal, which indicates that the residuals average to around zero everywhere. However, a bit of deviation is okay – just like with skewness and kurtosis, there’s a range that we can work in before nonlinearity becomes a critical issue. For some examples of good Residuals vs Fitted plots and ones that show serious errors, check out this page.

plot(reg_model, 1)  # 1 = Residuals vs Fitted only
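To get a feel for what a violated linearity assumption looks like, the sketch below simulates a curved relationship (hypothetical data) and fits it two ways. It uses lowess() to smooth the residuals against the fitted values, which approximates the red line drawn by plot(model, 1):

```r
set.seed(6)
n <- 200
x <- runif(n, -2, 2)
y <- 1 + x^2 + rnorm(n, sd = 0.3)    # truly curved relationship (hypothetical)

m_line <- lm(y ~ x)                  # misspecified: straight line only
m_quad <- lm(y ~ x + I(x^2))         # adds the squared term

# Smooth the residuals against the fitted values for each model
smooth_line <- lowess(fitted(m_line), resid(m_line))
smooth_quad <- lowess(fitted(m_quad), resid(m_quad))

# Range of the smoothed trend: large when curvature is missed, near zero otherwise
diff(range(smooth_line$y))
diff(range(smooth_quad$y))
```

Running plot(m_line, 1) and plot(m_quad, 1) shows the same contrast visually: a bent red line for the first model and a nearly flat one for the second.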

For this homework, I need to check for multicollinearity between my independent variables (mfq_state and big5_open). I will report any high correlation or VIF values in the “Issues with My Data” section. Even if multicollinearity is present, I am not required to drop any variables for this assignment—but I will acknowledge it and discuss potential implications for interpretation.

7.5 Check for outliers using Cook’s distance and a Residuals vs Leverage plot

The plots below both address leverage, or how much each data point is able to influence the regression line. Outliers are points that have undue influence on the regression line, the way that Bill Gates entering the room has an undue influence on the mean income.

The first plot, Cook’s distance, is a visualization of a score called (you guessed it) Cook’s distance, calculated for each case (aka row or participant) in the dataframe. Cook’s distance tells us how much the regression would change if the point was removed. The second plot also includes the residuals in the examination of leverage. The standardized residuals are on the y-axis and leverage is on the x-axis; this shows us which points have high residuals (are far from the regression line) and high leverage. Points that have large residuals and high leverage are especially worrisome, because they are far from the regression line but are also exerting a large influence on it.

# Cook's distance
plot(reg_model, 4)

# Residuals vs Leverage
plot(reg_model, 5)

NOTE: For your homework, you’ll simply need to generate these plots, assess Cook’s distance in your dataset, and then identify and remove any potential cases that are prominent outliers (like we did in the ANOVA lab). You will make a note of this in the “Issues with My Data” write-up.
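Alongside the plots, Cook’s distance can be pulled out as numbers with cooks.distance(). The sketch below is run on simulated data with one planted outlier. The 4/n cutoff is one common rule of thumb and an assumption here, not a rule from this lab, so treat flagged cases as candidates to inspect rather than automatic removals:

```r
set.seed(4)
n <- 100
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.5)
# Plant one influential case: extreme on x and far from the trend (hypothetical)
x[n] <- 6
y[n] <- -4
sim <- data.frame(row_id = seq_len(n), x, y)

m  <- lm(y ~ x, data = sim)
cd <- cooks.distance(m)               # one value per case

cutoff  <- 4 / n                      # rule-of-thumb cutoff (an assumption)
flagged <- sim$row_id[cd > cutoff]
flagged

# Drop flagged rows by ID; %in% handles one or many IDs
sim_clean <- subset(sim, !(row_id %in% flagged))
nrow(sim_clean)
```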

7.6 Check normality of residuals with a Q-Q plot

This plot is a bit new. It’s called a Q-Q plot and shows the standardized residuals plotted against the quantiles of a normal distribution. If our residuals are perfectly normal, the points will fall exactly on the dashed line. This page shows how different types of non-normality appear on a Q-Q plot.

It’s normal for Q-Q plots to show a bit of deviation at the ends. This page shows some examples that help us put our Q-Q plot into context.

plot(reg_model, 2)

NOTE: For your homework, you’ll simply need to generate this plot and think about how your plot compares to the normal/non-normal plots pictured in the links above. Does it seem like the points lie mostly along the straight diagonal line with either no or some minor deviations along each of the tails? If so, your residuals are likely normal enough to meet the assumption. You will talk about this in the write-up below.
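For a rough numeric companion to the visual check, the points behind a Q-Q plot can be extracted with qqnorm(plot.it = FALSE). The sketch below uses simulated data with normal errors (hypothetical, not the project data). The correlation between theoretical and sample quantiles is only an informal summary and does not replace looking at the plot:

```r
set.seed(5)
n <- 200
x <- rnorm(n)
y <- 1 + 0.4 * x + rnorm(n)          # normal errors by construction (hypothetical)
m <- lm(y ~ x)

z  <- rstandard(m)                   # standardized residuals
qq <- qqnorm(z, plot.it = FALSE)     # theoretical (x) vs sample (y) quantiles

# Points lying close to a straight line give a correlation near 1
cor(qq$x, qq$y)
```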

7.7 Issues with My Data

Before interpreting our results, we assessed our variables to see if they met the assumptions for a multiple linear regression. We detected slight issues with linearity in a Residuals vs Fitted plot. We did not detect any outliers (by visually analyzing Cook’s Distance and Residuals vs Leverage plots) or any serious issues with the normality of our residuals (by visually analyzing a Q-Q plot), nor were there any issues of multicollinearity among our two independent variables.

8 View Test Output

summary(reg_model)

## 
## Call:
## lm(formula = gad ~ mfq_state + big5_open, data = cont)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9805 -0.3268 -0.1178  0.2177  2.0160 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.54339    0.02707  57.020   <2e-16 ***
## mfq_state   -0.32014    0.02738 -11.692   <2e-16 ***
## big5_open    0.05331    0.02738   1.947   0.0523 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5221 on 369 degrees of freedom
## Multiple R-squared:  0.2705, Adjusted R-squared:  0.2665 
## F-statistic:  68.4 on 2 and 369 DF,  p-value: < 2.2e-16
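When copying values into a write-up, it is easy to mistype a number. One way to reduce that risk is to pull the statistics straight from the summary object. The sketch below does this on a simulated model (hypothetical data); the same fields exist on any lm() summary:

```r
set.seed(7)
n <- 120
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 - 0.4 * x1 + rnorm(n)        # hypothetical simulated data
m  <- lm(y ~ x1 + x2)
s  <- summary(m)

f <- s$fstatistic                    # named vector: value, numdf, dendf
p <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Assemble the reporting string directly from the model object
sprintf("Adj. R2 = %.2f, F(%d, %d) = %.2f, p = %.3g",
        s$adj.r.squared, as.integer(f["numdf"]), as.integer(f["dendf"]),
        f["value"], p)
```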

# Note for section below: to type lowercase Beta below (ß) you need to hold down Alt key and type 225 on numeric keypad. If that doesn't work you should be able to copy/paste it from somewhere else

Effect size, based on Regression ß (Beta Estimate) value in our output

Trivial: Less than 0.10 (|ß| < 0.10)
Small: 0.10–0.29 (0.10 ≤ |ß| ≤ 0.29)
Medium: 0.30–0.49 (0.30 ≤ |ß| ≤ 0.49)
Large: 0.50 or greater (|ß| ≥ 0.50)
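The cutoffs above apply to the size of the estimate regardless of its sign. As a small sketch, the helper below (effect_label is a hypothetical name, not a function from any package) labels an estimate by its absolute value. It treats each band as running up to the next cutoff, so a value such as 0.295 counts as Small:

```r
# Label a regression estimate using the cutoffs listed above (absolute value)
effect_label <- function(beta) {
  cut(abs(beta),
      breaks = c(0, 0.10, 0.30, 0.50, Inf),
      labels = c("Trivial", "Small", "Medium", "Large"),
      right = FALSE)                 # intervals are [lower, upper)
}

# Estimates copied from the summary() output above
est <- c(mfq_state = -0.32014, big5_open = 0.05331)
data.frame(estimate = est, size = effect_label(est))
```

For these two estimates the helper returns Medium for mfq_state and Trivial for big5_open.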

9 Write Up Results

To test our hypothesis that openness and mental flexibility significantly predict generalized anxiety disorder (GAD), we conducted a multiple linear regression using mfq_state (mental flexibility) and big5_open (openness) as independent variables and gad as the dependent variable. Assumption checks supported the use of linear regression, although we detected slight issues with linearity.

The overall model was statistically significant, Adj. R² = .27, F(2, 369) = 68.40, p < .001. Mental flexibility significantly predicted GAD (ß = -0.32, p < .001): higher mental flexibility was associated with lower levels of anxiety, a medium effect by the cutoffs above. Openness did not significantly predict GAD (ß = 0.05, p = .052); its estimate was positive but trivial in size. Our hypothesis was therefore only partially supported. Full output is presented in Table 1.

Table 1: Multiple Regression Model Predicting GAD
	Generalized Anxiety Disorder
Predictors	Estimates	SE	CI	p
(Intercept)	1.54	0.03	1.49 – 1.60	<0.001
Mental Flexibility	-0.32	0.03	-0.37 – -0.27	<0.001
Openness	0.05	0.03	-0.00 – 0.11	0.052
Observations	372
R² / R² adjusted	0.270 / 0.266
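The estimates, standard errors, confidence intervals, and p-values in a table like this can be cross-checked against the model object in base R. The sketch below does so on a simulated model (hypothetical data and variable names). It is a plain-data-frame stand-in for checking values and row labels, not a reproduction of the formatted table:

```r
set.seed(8)
n <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1.5 - 0.3 * x1 + 0.05 * x2 + rnorm(n, sd = 0.5)   # hypothetical data
m  <- lm(y ~ x1 + x2)

co <- coef(summary(m))               # Estimate, Std. Error, t value, Pr(>|t|)
ci <- confint(m, level = 0.95)       # 95% confidence limits per coefficient

tab <- data.frame(
  Predictor = rownames(co),
  Estimate  = round(co[, "Estimate"], 2),
  SE        = round(co[, "Std. Error"], 2),
  CI_low    = round(ci[, 1], 2),
  CI_high   = round(ci[, 2], 2),
  p         = signif(co[, "Pr(>|t|)"], 3),
  row.names = NULL
)
tab
```

The first row is always the intercept, so the predictor labels start on the second row.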

References

Cohen J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.

Running a Multiple Linear Regression

Payton Ashley

2025-06-18