1 Loading Libraries

#install.packages("broom")
#install.packages("ggplot2")

library(psych) # for the describe() command
library(broom) # for the augment() command
library(ggplot2) # to visualize our results

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

2 Importing Data

# For HW, import the dataset you cleaned previously, this will be the dataset you'll use throughout the rest of the semester

d <- read.csv(file="Data/arc_data_final_SP26.csv", header=T)

3 State Your Hypothesis

We hypothesize that people’s reported level of anxiety levels will significantly predict depression levels and the relationship will be positive. This means that as people report higher levels of anxiety, they will also report higher levels of depression.

My independent variable (the one doing the predicting) is: anxiety levels My dependent variable (the one being predicted) is: depression levels

4 Check Your Variables

# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again to be sure that everything is correct

str(d)

## 'data.frame':    996 obs. of  40 variables:
##  $ X                   : int  520 2814 3146 3295 717 6056 4753 5365 2044 1965 ...
##  $ gender              : chr  "female" "male" "female" "male" ...
##  $ trans               : chr  "no" "no" "no" "no" ...
##  $ sexual_orientation  : chr  "Prefer not to say" "Heterosexual/Straight" "Heterosexual/Straight" "Heterosexual/Straight" ...
##  $ ethnicity           : chr  "Prefer not to say" "White - British, Irish, other" "Asian/Asian British - Indian, Pakistani, Bangladeshi, other" "Asian/Asian British - Indian, Pakistani, Bangladeshi, other" ...
##  $ relationship_status : chr  "Single, never married" "Single, never married" "Prefer not to say" "Single, never married" ...
##  $ age                 : chr  "1 under 18" "1 under 18" "1 under 18" "1 under 18" ...
##  $ urban_rural         : chr  "town" "town" "town" "town" ...
##  $ income              : chr  NA NA NA NA ...
##  $ education           : chr  "1 equivalent to not completing high school" "prefer not to say" "2 equivalent to high school completion" "prefer not to say" ...
##  $ employment          : chr  "1 high school equivalent" "1 high school equivalent" "1 high school equivalent" "1 high school equivalent" ...
##  $ treatment           : chr  NA "not in treatment" NA "no psychological disorders" ...
##  $ health              : chr  "something else or not applicable" "something else or not applicable" "prefer not to say" "lung disease" ...
##  $ mhealth             : chr  "none or NA" "none or NA" "none or NA" "none or NA" ...
##  $ sleep_hours         : chr  "2 5-6 hours" "3 7-8 hours" "2 5-6 hours" "4 8-10 hours" ...
##  $ exercise            : chr  "1 less than 1 hour" "1 less than 1 hour" "1 less than 1 hour" "1 less than 1 hour" ...
##  $ pet                 : chr  "cat" "other" "no pets" "no pets" ...
##  $ covid_pos           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ covid_neg           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ big5_open           : num  3.67 4.33 5.67 6 5.67 ...
##  $ big5_con            : num  3 4 6 4 3.33 ...
##  $ big5_agr            : num  4.33 2.67 5.67 5.67 5 ...
##  $ big5_neu            : num  5.33 2.67 1 3.67 4.33 ...
##  $ big5_ext            : num  2 2.67 4.67 4.33 1.67 ...
##  $ pswq                : num  2.71 1.43 1.86 1.79 2.36 ...
##  $ iou                 : num  2.22 1.52 1.78 1.85 2.22 ...
##  $ mfq_26              : num  2.7 4.55 4.8 3.8 4.5 4 5.8 4.2 4.5 5.25 ...
##  $ mfq_state           : num  3 4.38 4.88 4.88 4.88 ...
##  $ rse                 : num  2.6 3.1 3.7 3 3 3 4 3.8 2.5 4 ...
##  $ school_covid_support: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ school_att          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ pas_covid           : num  3 3.44 4.67 2.44 1.56 ...
##  $ pss                 : num  2.75 2.25 3 2 1.75 2 1 1.25 3 1.25 ...
##  $ phq                 : num  1.56 1.44 1.11 1.33 1.44 ...
##  $ gad                 : num  1.14 1.29 1 1 1.14 ...
##  $ edeq12              : num  1.33 1.08 1 1 1.17 ...
##  $ brs                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ swemws              : num  3 2.86 4 3.57 3.86 ...
##  $ isolation_c         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ support             : num  2.83 3 4 4 3.67 ...

# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(d)

##                      vars   n    mean      sd  median trimmed     mad   min
## X                       1 996 5410.39 2529.61 6287.00 5608.85 2467.79 20.00
## gender*                 2 996    1.44    0.84    1.00    1.28    0.00  1.00
## trans*                  3 996    1.13    0.43    1.00    1.00    0.00  1.00
## sexual_orientation*     4 996    3.74    1.13    4.00    3.72    0.00  1.00
## ethnicity*              5 990    7.32    2.99    9.00    7.90    0.00  1.00
## relationship_status*    6 991    4.49    1.04    5.00    4.74    0.00  1.00
## age*                    7 996    1.08    0.27    1.00    1.00    0.00  1.00
## urban_rural*            8 985    2.73    1.14    3.00    2.79    1.48  1.00
## income*                 9   2    1.00    0.00    1.00    1.00    0.00  1.00
## education*             10 993    2.21    1.51    2.00    1.92    1.48  1.00
## employment*            11 996    1.13    0.58    1.00    1.00    0.00  1.00
## treatment*             12 864    2.79    1.16    3.00    2.63    1.48  1.00
## health*                13 918    6.78    1.04    7.00    6.97    0.00  1.00
## mhealth*               14 996    4.56    1.49    5.00    4.78    0.00  1.00
## sleep_hours*           15 726    2.95    1.03    3.00    2.97    1.48  1.00
## exercise*              16 630    2.27    0.96    2.00    2.18    1.48  1.00
## pet*                   17 974    5.05    2.04    5.50    5.12    2.22  1.00
## covid_pos              18 996    2.77    3.65    0.00    2.15    0.00  0.00
## covid_neg              19 996    1.63    2.03    0.00    1.34    0.00  0.00
## big5_open              20 950    5.26    1.13    5.33    5.35    0.99  1.00
## big5_con               21 949    4.49    1.15    4.33    4.50    0.99  1.00
## big5_agr               22 953    4.86    1.14    5.00    4.90    0.99  1.00
## big5_neu               23 943    4.67    1.47    5.00    4.77    1.48  1.00
## big5_ext               24 952    4.25    1.48    4.33    4.28    1.48  1.00
## pswq                   25 916    2.69    0.76    2.79    2.70    0.85  1.00
## iou                    26 838    2.72    0.94    2.59    2.69    1.04  1.00
## mfq_26                 27 803    4.21    0.68    4.25    4.23    0.67  1.50
## mfq_state              28 711    3.97    0.97    4.00    4.01    0.93  1.00
## rse                    29 810    2.48    0.71    2.40    2.47    0.74  1.00
## school_covid_support   30 137    1.51    0.34    1.60    1.52    0.30  1.00
## school_att             31 134    3.88    1.05    3.90    3.85    1.17  1.89
## pas_covid              32 744    3.25    0.68    3.33    3.27    0.66  1.00
## pss                    33 753    3.14    0.94    3.25    3.15    1.11  1.00
## phq                    34 727    2.28    0.86    2.22    2.23    0.99  1.00
## gad                    35 735    2.20    0.93    2.00    2.14    1.06  1.00
## edeq12                 36 696    1.95    0.77    1.83    1.89    0.86  1.00
## brs                    37 311    2.64    0.86    2.67    2.63    0.99  1.00
## swemws                 38 679    3.01    0.86    3.00    3.01    0.85  1.00
## isolation_c            39 720    2.29    0.81    2.25    2.30    1.11  1.00
## support                40 721    3.47    0.94    3.50    3.51    0.99  1.00
##                          max   range  skew kurtosis    se
## X                    8860.00 8840.00 -0.58    -0.93 80.15
## gender*                 4.00    3.00  1.53     0.65  0.03
## trans*                  3.00    2.00  3.52    11.43  0.01
## sexual_orientation*     6.00    5.00 -0.22     0.31  0.04
## ethnicity*              9.00    8.00 -1.40     0.17  0.10
## relationship_status*    5.00    4.00 -1.67     1.11  0.03
## age*                    2.00    1.00  3.11     7.67  0.01
## urban_rural*            4.00    3.00 -0.54    -1.13  0.04
## income*                 1.00    0.00   NaN      NaN  0.00
## education*              6.00    5.00  1.38     0.85  0.05
## employment*             5.00    4.00  5.02    26.15  0.02
## treatment*              6.00    5.00  1.22     1.62  0.04
## health*                 9.00    8.00 -2.11     6.75  0.03
## mhealth*                8.00    7.00 -1.33     1.81  0.05
## sleep_hours*            5.00    4.00 -0.05    -0.67  0.04
## exercise*               5.00    4.00  0.82     0.72  0.04
## pet*                    8.00    7.00 -0.22    -1.48  0.07
## covid_pos              15.00   15.00  1.11     0.16  0.12
## covid_neg               8.00    8.00  0.86    -0.52  0.06
## big5_open               7.00    6.00 -0.79     0.66  0.04
## big5_con                7.00    6.00 -0.06    -0.25  0.04
## big5_agr                7.00    6.00 -0.35    -0.05  0.04
## big5_neu                7.00    6.00 -0.55    -0.37  0.05
## big5_ext                7.00    6.00 -0.17    -0.82  0.05
## pswq                    5.00    4.00 -0.15    -0.89  0.03
## iou                     5.00    4.00  0.30    -0.80  0.03
## mfq_26                  6.00    4.50 -0.42     0.60  0.02
## mfq_state               6.00    5.00 -0.34    -0.05  0.04
## rse                     4.00    3.00  0.07    -0.74  0.02
## school_covid_support    2.00    1.00 -0.11    -1.24  0.03
## school_att              6.33    4.44  0.24    -0.73  0.09
## pas_covid               5.00    4.00 -0.31     0.17  0.03
## pss                     5.00    4.00 -0.13    -0.72  0.03
## phq                     4.00    3.00  0.37    -0.90  0.03
## gad                     4.00    3.00  0.43    -1.03  0.03
## edeq12                  4.00    3.00  0.53    -0.83  0.03
## brs                     5.00    4.00  0.13    -0.60  0.05
## swemws                  5.00    4.00  0.04    -0.43  0.03
## isolation_c             3.50    2.50 -0.03    -1.22  0.03
## support                 5.00    4.00 -0.29    -0.65  0.04

# next, use histograms to examine your continuous variables
hist(d$gad)

hist(d$phq)

# last, use scatterplots to examine your continuous variables together
# Remember to put INDEPENDENT VARIABLE FIRST, so that it goes on the x-axis
plot(d$gad, d$phq)

5 Run a Simple Regression

# to calculate standardized coefficients for the regression, we have to standardize our IV
d$gad_std <- scale(d$gad, center=T, scale=T)


# use the lm() command to run the regression
# dependent/outcome variable on the left of the ~, standardized independent/predictor variable on the right.
reg_model <- lm(phq ~ gad_std, data = d)

# NO PEEKING AT YOUR MODEL RESULTS YET!

6 Check Your Assumptions

6.1 Simple Regression Assumptions

Should have two measurements for each participant
Variables should be continuous and normally distributed
Relationship between the variables should be linear
Outliers should be identified and removed
Residuals should be approx. normal and have constant variance NOTE: We will NOT be evaluating whether our data meets this last assumption in this lab/homework.

6.2 Create plots and view residuals

# Create Plots
model.diag.metrics <- augment(reg_model)

# View Raw Residuals Plot
# NOTE: only replace the variables in 3 places in this line of code
ggplot(model.diag.metrics, aes(x = gad_std, y = phq)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = gad_std, yend = .fitted), color = "red", size = 0.3)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `geom_smooth()` using formula = 'y ~ x'

6.3 Check linearity with Residuals vs Fitted plot

The plot below shows the residuals for each case and the fitted line. The red line is the average residual for the specified point of the dependent variable. If the assumption of linearity is met, the red line should be horizontal. This indicates that the residuals average to around zero. You can see that for this lab, the plot shows some non-linearity because there are more data points below the regression line than there are above it. Thus, there are some negative residuals that don’t have positive residuals to cancel them out. However, a bit of deviation is okay – just like with skewness and kurtosis with non-normality – there is a range of acceptability that we can work in before non-linearity becomes a critical issue.

For some examples of good Residuals vs Fitted plot and ones that show serious errors, check out this page. Looking at these examples, you can see the first case has a plot in which the red line sticks pretty closely to the zero line, while the other cases show some serious deviation. Our plot for the lab is much closer to the ‘good’ plot than it is to the ‘serious issues’ plots. So we’ll consider our data okay and proceed with our analysis. Obviously, this is quite a subjective decision. The key takeaway is that these evaluations are closely tied to the context of our sample, our data, and what we’re studying. It’s almost always a judgement call.

You’ll notice in the bottom right corner, there are some points with numbers included: these are participants (“cases”, indicated by row number) who have the most influence on the regression line (and so they might be outliers). We’ll cover more about outliers in the next section.

[NOTE: All of the above text is informational. You do NOT need to edit it for the HW.]

plot(reg_model, 1) #Residual vs Fitted plot

Interpretation: Our Residual vs Fitted plot suggests there is linearity between our independent and dependent variables, and we are okay to proceed with our regression.

[Remember to revise the above interpretation in you HW assignment. Then delete this reminder and the bolded text below.]

For your HW: You need to generate this plot and then talk about how your plot compares to the ‘good’ / ‘bad, problematic’ plots linked to above in the “Issues with my Data” section below. Is it closer to the ‘good’ plots or one of the ‘bad’ plots? This is going to be a judgement call, so just do your best!

6.4 Check for outliers

The plot below addresses leverage, or how much each data point is able to influence the regression line. Outliers are points that have undue influence on the regression line, the way that Bill Gates entering the room has an undue influence on the mean income.

The Cook’s distance plot is a visualization of a score called (you guessed it) Cook’s distance, calculated for each case (aka participant) in the dataframe. Cook’s distance tells us how much the regression would change if that data point was removed. Ideally, we want all points to have the same influence on the regression line, although we accept that there will be some variability. The cutoff for a high Cook’s distance score is .50. For our lab data, some points do exert more influence than others, but none of them are close to the cutoff. Remember, the plot will always identify the 3 most extreme values; it is your job to identify if any of those values are beyond the cutoff value.

[NOTE: All of the above text is informational. You do NOT need to edit it for the HW.]

# Cook's distance
plot(reg_model, 4)

Interpretation: Our data does not have severe outliers.

[Remember to revise the above interpretation in you HW assignment. Then delete this reminder and the bolded text below.]

For your HW: You need to generate the plot, assess Cook’s distance in your dataset and identify any potential cases/participants that are prominent outliers using the cutoff for a high Cook’s distance score of .50. You will summarize this in the “Issues with my Data” section below.

6.5 Issues with My Data

Before interpreting our results, we assessed our variables to see if they met the assumptions for a simple linear regression. Analysis of a Residuals vs Fitted plot suggested that there is linearity, as the red line is very close to the dotted line;; closer to the examples of good plots linked above. We also checked Cook’s distance plot to detect outliers. all cases were below the recommended cutoff for Cook’s distance of 0.5, so no outliers were detected.

[Remember to revise the above paragraph in you HW assignment. Then delete this reminder.]

7 View Statistical Test Output

summary(reg_model)

## 
## Call:
## lm(formula = phq ~ gad_std, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7695 -0.3441 -0.0322  0.3202  2.1995 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.27770    0.01823  124.91   <2e-16 ***
## gad_std      0.70535    0.01817   38.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4883 on 715 degrees of freedom
##   (279 observations deleted due to missingness)
## Multiple R-squared:  0.6782, Adjusted R-squared:  0.6777 
## F-statistic:  1507 on 1 and 715 DF,  p-value: < 2.2e-16

# NOTE: For the write-up section below, to type lowercase Beta (ß) you need to hold down Alt key and type 225 on numeric keypad. If that doesn't work (upon releasing the Alt key), you should be able to copy/paste it from somewhere else in the write-up.

7.1 Effect size, based on Regression ß (Beta Estimate) value in our output:

Trivial: Less than 0.10 (ß < 0.10)
Small: 0.10–0.29 (0.10 < ß < 0.29)
Medium: 0.30–0.49 (0.30 < ß < 0.49)
Large: 0.50 or greater (ß > 0.50)

8 Write Up Results

To test our hypothesis that levels of anxiety significantly predicts levels of depression, and that the relationship would be positive, we used a simple linear regression to model the relationship between those variables. We confirmed that our data met the assumptions of a linear regression, checking the linearity of the relationship using a Residuals vs Fitted plot and checking for outliers using Cook’s distance plot. (Note: We are skipping the assumptions of normality and homogeneity of variance for this analysis.)

As predicted, we found that social exclusion significantly predicted loneliness, F(1, 715) = 1507, p < .001, Adj. R² = 0.6777. Additionally, the relationship between levels of anxiety and levels of depression was positive, ß = 0.71, t(715) = 38.82, p < .001 (refer to Figure 1). According to Cohen (1988), this constitutes a medium effect size (ß ).

References

Cohen J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.

Simple Regression Homework

Gabriel Sommerkamp

2026-04-12