1 Loading Libraries

library(psych) # for the describe() command
library(car) # for the vif() command

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

library(sjPlot) # to visualize our results

2 Importing Data

# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
# use EAMMi2 data
d <- read.csv(file="Data/arc_test.csv", header=TRUE)

3 State Your Hypothesis

We hypothesize that parental support (measured by support_parents), personality tendencies (measured by big_5), and COVID worry levels (measured by pas_covid) will significantly predict the amount of depressive symptoms (measured by the PHQ).

4 Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)

## 'data.frame':    2073 obs. of  7 variables:
##  $ X              : int  1 20 30 31 32 33 48 49 57 58 ...
##  $ group          : chr  "parent" "young person" "young person" "parent" ...
##  $ gender         : chr  "female" "male" "female" "female" ...
##  $ phq            : num  1.33 3.33 1 2.33 NA ...
##  $ support_parents: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ pas_covid      : num  3.22 4.56 3.33 4.22 NA ...
##  $ big_5          : num  4.67 5.33 5.33 5.33 NA ...

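# keep only the variables used in this analysis and drop rows with missing values
cont <- na.omit(subset(d, select=c(big_5, pas_covid, support_parents, phq)))
# standardize the predictors (mean 0, SD 1); the outcome (phq) stays on its original scale
cont$big_5 <- scale(cont$big_5, center=TRUE, scale=TRUE)
cont$pas_covid <- scale(cont$pas_covid, center=TRUE, scale=TRUE)
cont$support_parents <- scale(cont$support_parents, center=TRUE, scale=TRUE)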

# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(cont)

##                 vars   n mean   sd median trimmed  mad   min  max range  skew
## big_5              1 397  0.0 1.00   0.03    0.08 0.55 -2.96 2.27  5.22 -0.69
## pas_covid          2 397  0.0 1.00   0.13    0.03 0.98 -3.50 2.45  5.94 -0.29
## support_parents    3 397  0.0 1.00   0.10   -0.02 1.19 -1.50 1.70  3.20  0.15
## phq                4 397  2.6 0.86   2.67    2.61 0.99  1.00 4.00  3.00 -0.04
##                 kurtosis   se
## big_5               0.40 0.05
## pas_covid           0.30 0.05
## support_parents    -1.03 0.05
## phq                -1.07 0.04

# also use histograms to examine your continuous variables
hist(cont$big_5)

hist(cont$pas_covid)

hist(cont$support_parents)

# last, use scatterplots to examine your continuous variables together
plot(cont$big_5, cont$phq)

plot(cont$big_5, cont$pas_covid)

plot(cont$big_5, cont$support_parents)

plot(cont$pas_covid, cont$phq)

plot(cont$pas_covid, cont$support_parents)

plot(cont$support_parents, cont$phq)
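The scale() calls earlier in this section turn each predictor into z-scores, which is why describe() reports a mean of 0 and an SD of 1 for them. The sketch below shows the same calculation by hand. The vector `x` holds made-up scores, not values from the homework dataset.

```r
# Hypothetical scores for illustration only (not the homework data)
x <- c(2.5, 3.0, 4.5, 5.0, 6.0)

# z-score by hand: subtract the mean, then divide by the sample SD
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so as.numeric() flattens it to a vector
z_scale <- as.numeric(scale(x, center = TRUE, scale = TRUE))

all.equal(z_manual, z_scale)  # the two approaches agree
mean(z_scale)                 # approximately 0
sd(z_scale)                   # 1
```

Standardizing only changes the units of the predictors. Their correlations with each other and with phq are unaffected.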

5 View Your Correlations

corr_output_m <- corr.test(cont)
corr_output_m

## Call:corr.test(x = cont)
## Correlation matrix 
##                 big_5 pas_covid support_parents   phq
## big_5            1.00      0.22           -0.07  0.30
## pas_covid        0.22      1.00           -0.12  0.26
## support_parents -0.07     -0.12            1.00 -0.47
## phq              0.30      0.26           -0.47  1.00
## Sample Size 
## [1] 397
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##                 big_5 pas_covid support_parents phq
## big_5            0.00      0.00            0.15   0
## pas_covid        0.00      0.00            0.03   0
## support_parents  0.15      0.01            0.00   0
## phq              0.00      0.00            0.00   0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
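The note in the output that entries above the diagonal are "adjusted for multiple tests" reflects the default Holm adjustment in corr.test(). The sketch below reproduces the idea in base R. The data frame `sim` and its column names are simulated stand-ins, so the numbers will not match the output above.

```r
set.seed(1)

# Simulated stand-in data; the column names are hypothetical
sim <- data.frame(w = rnorm(100))
sim$x <- 0.3 * sim$w + rnorm(100)
sim$z <- rnorm(100)
sim$y <- 0.4 * sim$w - 0.5 * sim$z + rnorm(100)

# All unique pairs of columns (a 2 x 6 character matrix)
pairs <- combn(names(sim), 2)

# Raw p-value for each pairwise Pearson correlation
p_raw <- apply(pairs, 2, function(p) cor.test(sim[[p[1]]], sim[[p[2]]])$p.value)

# Holm adjustment across the six tests
p_holm <- p.adjust(p_raw, method = "holm")

data.frame(pair   = paste(pairs[1, ], pairs[2, ], sep = "-"),
           p_raw  = round(p_raw, 3),
           p_holm = round(p_holm, 3))
```

An adjusted p-value is never smaller than the raw one, which is why a pair can look significant below the diagonal and less so above it.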

6 Run a Multiple Linear Regression

# use the lm() command to run the regression
# dependent/outcome variable on the left, independent/predictor variables on the right
reg_model <- lm(phq ~ big_5 + pas_covid + support_parents, data = cont)

7 Check Your Assumptions

7.1 Multiple Linear Regression Assumptions

Observations should be independent
Number of cases should be adequate (N ≥ 80 + 8m, where m is the number of IVs)
Independent variables should not be too correlated (aka multicollinearity)
Variables should be continuous and normally distributed
Outliers should be identified and removed
Relationship between the variables should be linear
Residuals should be normal and have constant variance

7.2 Count Number of Cases

For your homework, if you don’t have the required number of cases you’ll need to drop one of your independent variables. Reach out to me and we can figure out the best way to proceed!

needed <- 80 + 8*3
nrow(cont) >= needed

## [1] TRUE

7.3 Check multicollinearity

Higher values indicate more multicollinearity
Cutoff is usually 5

For your homework, you will need to discuss multicollinearity and any high values, but you don’t have to drop any variables.

vif(reg_model)

##           big_5       pas_covid support_parents 
##        1.055551        1.066005        1.017553
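To see where these numbers come from, a VIF can be computed by hand: regress one predictor on the remaining predictors and take 1 / (1 - R²). The sketch below does this on simulated data. The helper `vif_by_hand()` and the variables `x1`, `x2`, and `x3` are hypothetical names used only for illustration.

```r
set.seed(2)
n <- 200

# Simulated predictors; x2 is deliberately correlated with x1
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 0.3 * x1 + 0.2 * x2 - 0.4 * x3 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

# Hypothetical helper: VIF for one predictor, given the other predictors
vif_by_hand <- function(target, others, data) {
  aux_model <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(aux_model)$r.squared)
}

preds <- c("x1", "x2", "x3")
vifs <- sapply(preds, function(p) vif_by_hand(p, setdiff(preds, p), sim))
vifs
```

A VIF of 1 means a predictor is uncorrelated with the others, so values just above 1, like the ones in the output above, indicate very little shared variance among the predictors.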

7.4 Check linearity with Residuals vs Fitted plot

plot(reg_model, 1)

7.5 Check for outliers using Cook’s distance and a Residuals vs Leverage plot

# Cook's distance
plot(reg_model, 4)

# Residuals vs Leverage
plot(reg_model, 5)
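The plots can be supplemented with the Cook's distance values themselves. The sketch below uses simulated data with one planted unusual case. The cutoff of 4/n is a common rule of thumb and an assumption here, not a threshold given in this assignment.

```r
set.seed(3)
n <- 100

# Simulated data with one deliberately unusual case in the last row
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)
x[n] <- 4
y[n] <- -3
sim <- data.frame(x, y)

fit <- lm(y ~ x, data = sim)

# Cook's distance for every case
cooks <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption, not from the assignment)
cutoff <- 4 / n
flagged <- which(cooks > cutoff)

flagged                                  # row numbers above the cutoff
head(sort(cooks, decreasing = TRUE), 3)  # the three most influential cases
```

Flagged cases are candidates for a closer look rather than automatic removal.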

7.6 Check homogeneity of variance in a Scale-Location plot

plot(reg_model, 3)
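The Scale-Location plot shows the square root of the absolute standardized residuals against the fitted values. The sketch below computes those two quantities directly on simulated data. The correlation at the end is only a crude numeric companion to the plot, not a formal test of constant variance.

```r
set.seed(4)

# Simulated data with constant error variance
x <- runif(150, 0, 10)
y <- 1 + 0.5 * x + rnorm(150)
fit <- lm(y ~ x)

# The two quantities on the axes of plot(fit, 3)
fitted_vals        <- fitted(fit)
sqrt_abs_std_resid <- sqrt(abs(rstandard(fit)))

# A value near 0 is consistent with a roughly flat red line
cor(fitted_vals, sqrt_abs_std_resid)
```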

7.7 Check normality of residuals with a Q-Q plot

This plot is a bit new. It’s called a Q-Q plot and shows the standardized residuals plotted against a normal distribution. If the residuals are perfectly normal, the points will fit on the dashed line perfectly. This page shows how different types of non-normality appear on a Q-Q plot.

It’s normal for Q-Q plots to show a bit of deviation at the ends, and ours shows some deviation in the top right corner. It doesn’t perfectly fit any of the provided images, but it suggests some negative (left) skew, which fits what we saw with the describe() command earlier.

This page shows some examples that help us put our Q-Q plot into context. Although it isn’t perfect, we don’t have any serious issues and are okay to proceed. For your homework, you’ll simply need to generate this plot and talk about how your plot compares to the ones pictured. Does it seem like any skew or kurtosis is indicated by your plot? Is it closer to the ‘good’ or the ‘bad’ plots from the second link?

plot(reg_model, 2)
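If a numeric companion to the Q-Q plot is useful, the sketch below extracts the plotted quantiles and runs a Shapiro-Wilk test on the standardized residuals. It uses simulated data, and the Shapiro-Wilk test is a supplementary check that is not part of this assignment.

```r
set.seed(5)

# Simulated data with normally distributed errors
x <- rnorm(200)
y <- 1 + 0.5 * x + rnorm(200)
fit <- lm(y ~ x)

std_res <- rstandard(fit)

# plot.it = FALSE returns the Q-Q coordinates without drawing the plot
qq <- qqnorm(std_res, plot.it = FALSE)

# Correlation between theoretical and sample quantiles;
# values close to 1 mean the points sit close to a straight line
cor(qq$x, qq$y)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals
shapiro.test(std_res)
```

With large samples the test can flag trivial departures from normality, so the plot is usually the more informative of the two.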

7.8 Issues with My Data

After checking the assumptions, I can draw firmer conclusions about my hypothesis. First, the case-count check returned TRUE (397 cases against the 104 required), so none of my variables needed to be dropped. Second, the VIF for each predictor was well below the cutoff of 5 (all between 1.02 and 1.07), so multicollinearity is not a concern.

I then compared the diagnostic plots with the example plots on the linked website. In the Residuals vs Fitted plot, the red line is not perfectly horizontal and has some small bumps, but none large enough to suggest a departure from linearity. The Cook’s distance and Residuals vs Leverage plots look at outliers and influential cases. There did not appear to be outliers influential enough to offset or invalidate the results, and the plots mostly resembled the ‘good’ examples on the website, so the conclusions drawn should hold some value. The Scale-Location plot, which checks homogeneity of variance, also looked good: the red line is roughly horizontal, which suggests that the residuals have approximately constant variance. Lastly, the Q-Q plot matched what was expected, with the points following the line closely and straying a little at the ends. Aside from these minor deviations, all of the plots and checks supported my assumptions.

8 View Test Output

summary(reg_model)

## 
## Call:
## lm(formula = phq ~ big_5 + pas_covid + support_parents, data = cont)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.56007 -0.53743  0.01631  0.50911  2.02397 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.60313    0.03593  72.447  < 2e-16 ***
## big_5            0.19958    0.03696   5.400 1.16e-07 ***
## pas_covid        0.13390    0.03715   3.605 0.000352 ***
## support_parents -0.37913    0.03629 -10.447  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7159 on 393 degrees of freedom
## Multiple R-squared:  0.3179, Adjusted R-squared:  0.3127 
## F-statistic: 61.04 on 3 and 393 DF,  p-value: < 2.2e-16
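The last three lines of the summary are linked: the F-statistic and the adjusted R² can both be recovered from R², the sample size, and the number of predictors. The sketch below uses the rounded values printed above, so the F-statistic matches the output only approximately.

```r
# Values taken from the summary output above
r2 <- 0.3179   # Multiple R-squared (rounded)
n  <- 397      # number of cases
k  <- 3        # number of predictors

df_model <- k
df_resid <- n - k - 1   # 393, the residual degrees of freedom

# F-statistic: explained variance per model df over unexplained variance per residual df
f_stat <- (r2 / df_model) / ((1 - r2) / df_resid)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# p-value for the overall model from the F distribution
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

round(c(F = f_stat, adj_R2 = adj_r2), 4)
p_value
```

The two degrees of freedom here, 3 and 393, are the ones that go inside the parentheses when the F-test is reported.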

# note for section below: to type lowercase Beta below (ß) you need to hold down Alt key and type 225 on numeric keypad. If that doesn't work you should be able to copy/paste it from somewhere else

9 Write Up Results

To test our hypothesis that parental support (measured by support_parents), personality tendencies (measured by big_5), and COVID worry levels (measured by pas_covid) would significantly predict depressive symptomology (measured by the PHQ), we used a multiple linear regression to model the relationship between the variables. We confirmed that our data met the assumptions of a linear regression. Although there were minor deviations when examining Cook’s distance and influence, we continued with the analysis.

Our model was statistically significant, Adj. R² = .31, F(3, 393) = 61.04, p < .001. Personality tendencies and COVID worries were positively related to depressive symptomology, whereas parental support was negatively related to it. The model as a whole has a large effect size (per Cohen, 1988). Full output from the regression model is reported in Table 1.

Table 1: Regression model of exhibiting depressive symptoms
	Depressive Symptomology (PHQ)
Predictors	Estimates	SE	CI	p
Intercept	2.60	0.04	2.53 – 2.67	<0.001
Personality Tendencies (big_5)	0.20	0.04	0.13 – 0.27	<0.001
Covid worries (pas_covid)	0.13	0.04	0.06 – 0.21	<0.001
Parental Support (support_parents)	-0.38	0.04	-0.45 – -0.31	<0.001
Observations	397
R² / R² adjusted	0.318 / 0.313
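The confidence intervals in Table 1 can be recovered from the estimates and standard errors in the summary output. The sketch below assumes they are 95% intervals based on the t distribution with the model's 393 residual degrees of freedom. The vector names are chosen for illustration.

```r
# Estimates and standard errors copied from the summary output
est <- c(Intercept = 2.60313, big_5 = 0.19958,
         pas_covid = 0.13390, support_parents = -0.37913)
se  <- c(0.03593, 0.03696, 0.03715, 0.03629)

# Residual degrees of freedom from the model
df_resid <- 393

# Critical t value for a two-sided 95% interval (assumed level)
t_crit <- qt(0.975, df_resid)

# Interval = estimate plus or minus critical value times standard error
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
round(ci, 2)
```

With the fitted model object available, `confint(reg_model)` gives the same intervals without manual arithmetic.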

References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.

Multiple Linear Regression Hw

Isabel Brandt

2023-06-10