#install.packages("psych")
#install.packages("broom")
#install.packages("ggplot2")
library(psych) # for the describe() command
library(broom) # for the augment() command
library(ggplot2) # to visualize our results
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
d <- read.csv(file="Data/projectdata.csv", header=TRUE)
We hypothesize that people’s reported level of perceived stress will significantly predict their level of subjective wellbeing, and that the relationship will be negative. This means that as people report higher levels of perceived stress, they will also report lower levels of subjective wellbeing.
My independent variable (the one doing the predicting) is: perceived stress
My dependent variable (the one being predicted) is: subjective wellbeing
str(d)
## 'data.frame': 3162 obs. of 7 variables:
## $ ResponseID: chr "R_BJN3bQqi1zUMid3" "R_2TGbiBXmAtxywsD" "R_12G7bIqN2wB2N65" "R_39pldNoon8CePfP" ...
## $ gender : chr "f" "m" "m" "f" ...
## $ socmeduse : int 47 23 34 35 37 13 37 43 37 29 ...
## $ stress : num 3.3 3.3 4 3.2 3.1 3.5 3.3 2.4 2.9 2.7 ...
## $ swb : num 4.33 4.17 1.83 5.17 3.67 ...
## $ belong : num 2.8 4.2 3.6 4 3.4 4.2 3.9 3.6 2.9 2.5 ...
## $ support : num 6 6.75 5.17 5.58 6 ...
describe(d)
## vars n mean sd median trimmed mad min max range
## ResponseID* 1 3162 1581.50 912.94 1581.50 1581.50 1172.00 1.0 3162.0 3161.0
## gender* 2 3162 1.28 0.49 1.00 1.21 0.00 1.0 3.0 2.0
## socmeduse 3 3162 34.44 8.57 35.00 34.72 7.41 11.0 55.0 44.0
## stress 4 3162 3.05 0.60 3.00 3.05 0.59 1.3 4.7 3.4
## swb 5 3162 4.48 1.32 4.67 4.53 1.48 1.0 7.0 6.0
## belong 6 3162 3.23 0.61 3.30 3.25 0.59 1.3 5.0 3.7
## support 7 3162 5.53 1.13 5.75 5.66 0.99 0.0 7.0 7.0
## skew kurtosis se
## ResponseID* 0.00 -1.20 16.24
## gender* 1.40 0.88 0.01
## socmeduse -0.31 0.26 0.15
## stress 0.03 -0.17 0.01
## swb -0.36 -0.45 0.02
## belong -0.26 -0.12 0.01
## support -1.10 1.43 0.02
hist(d$stress)
hist(d$swb)
plot(d$stress, d$swb)
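# standardize stress (mean 0, SD 1); as.numeric() drops the matrix wrapper that scale() returns
d$stress_std <- as.numeric(scale(d$stress, center=TRUE, scale=TRUE))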
reg_model <- lm(swb ~ stress_std, data = d)
model.diag.metrics <- augment(reg_model)
# scatterplot with the fitted line; red segments show each case's residual
ggplot(model.diag.metrics, aes(x = stress_std, y = swb)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  geom_segment(aes(xend = stress_std, yend = .fitted), color = "red", linewidth = 0.3)
## `geom_smooth()` using formula = 'y ~ x'
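The standardization step used to create stress_std can be checked on its own. The sketch below uses simulated scores (the vector x is hypothetical, not the project data) to show that scale() is equivalent to subtracting the mean and dividing by the standard deviation. Note that scale() returns a one-column matrix, so as.numeric() is used here to get a plain vector.

```r
# Simulated scores on a hypothetical 1-5 scale; NOT the project data
set.seed(1)
x <- round(runif(200, min = 1, max = 5), 1)

# scale() centers on the mean and divides by the standard deviation
x_std <- as.numeric(scale(x, center = TRUE, scale = TRUE))

# The same transformation written out by hand
x_manual <- (x - mean(x)) / sd(x)

all.equal(x_std, x_manual)  # TRUE: both approaches give the same z-scores
mean(x_std)                 # zero, up to floating-point rounding
sd(x_std)                   # 1
```

After standardizing, a one-unit change in the predictor means a one-standard-deviation change, which is how the slope in the regression should be read.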
The Residuals vs Fitted plot below shows the residual for each case plotted against its fitted (predicted) value. The red line is a smoothed average of the residuals across the range of fitted values. If the assumption of linearity is met, the red line should be roughly horizontal and close to zero, which indicates that the residuals average to around zero everywhere along the regression line. You can see that for our data, the plot shows some non-linearity because there are more data points below the regression line than there are above it. Thus, there are some negative residuals that don’t have positive residuals to cancel them out. However, a bit of deviation is okay. Just as with skewness and kurtosis when we evaluate normality, there is a range of acceptability that we can work in before non-linearity becomes a critical issue.
For some examples of good Residuals vs Fitted plots and ones that show serious errors, check out this page. Looking at these examples, you can see the first case has a plot in which the red line sticks pretty closely to the zero line, while the other cases show some serious deviation. Our plot is much closer to the ‘good’ plot than it is to the ‘serious issues’ plots, so we’ll consider our data okay and proceed with our analysis. Obviously, this is quite a subjective decision. The key takeaway is that these evaluations are closely tied to the context of our sample, our data, and what we’re studying. It’s almost always a judgement call.
You’ll notice in the bottom right corner, there are some points with numbers included: these are participants (“cases”, indicated by row number) who have the most influence on the regression line (and so they might be outliers). We’ll cover more about outliers in the next section.
plot(reg_model, 1) #Residual vs Fitted plot
Interpretation: Our Residual vs Fitted plot suggests there is some minor non-linearity between our independent and dependent variables, but we are okay to proceed with the regression.
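To make the logic of the Residuals vs Fitted plot concrete, here is a minimal sketch on simulated data (the variables x and y and the model m are hypothetical, not the project data). It computes the residuals directly, averages them within five bins of the fitted values, and then draws a plot similar to plot(reg_model, 1) by hand, using a lowess smooth for the red line.

```r
# Simulated data with a truly linear relationship; NOT the project data
set.seed(42)
n <- 300
x <- rnorm(n)
y <- 4.5 - 0.6 * x + rnorm(n)

m   <- lm(y ~ x)
res <- resid(m)    # observed minus predicted
fit <- fitted(m)   # predicted values

# With an intercept in the model, the residuals always average to zero overall
mean(res)

# For linearity, they should also average near zero within each region
# of the fitted values; here the fitted values are split into 5 equal-sized bins
bins <- cut(fit, breaks = quantile(fit, probs = seq(0, 1, 0.2)),
            include.lowest = TRUE)
bin_means <- tapply(res, bins, mean)
bin_means

# Hand-drawn version of the diagnostic plot: points, smoothed red line, zero line
plot(fit, res, xlab = "Fitted values", ylab = "Residuals")
lines(lowess(fit, res), col = "red")
abline(h = 0, lty = 2)
```

If the bin means drift steadily away from zero in one direction, or the red line bends noticeably, that suggests the relationship is not well described by a straight line.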
The plot below addresses leverage, or how much each data point is able to influence the regression line. Outliers are points that have undue influence on the regression line, the way that Bill Gates entering the room has an undue influence on the mean income.
The Cook’s distance plot is a visualization of a score called (you guessed it) Cook’s distance, calculated for each case (aka participant) in the dataframe. Cook’s distance tells us how much the regression would change if that data point were removed. Ideally, we want all points to have the same influence on the regression line, although we accept that there will be some variability. The cutoff we use for a high Cook’s distance score is 0.50. For our data, some points do exert more influence than others, but none of them are close to the cutoff. The plot will always label the 3 most extreme values; it is your job to determine whether any of those values are beyond the cutoff value.
# Cook's distance
plot(reg_model, 4)
Interpretation: Our data does not have severe outliers. All cases fell well below the Cook’s distance cutoff of 0.50, so no participants were identified as problematic outliers.
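The same check can be done numerically rather than by eye. The sketch below uses simulated data (x, y, and m are hypothetical, not the project data) with one deliberately extreme case added, then uses cooks.distance() to list the largest values and flag any case above the 0.50 cutoff used in this lab.

```r
# Simulated data; NOT the project data
set.seed(7)
n <- 100
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)

# Add one deliberately extreme case (row 101) far from the overall pattern
x <- c(x, 6)
y <- c(y, -8)

m  <- lm(y ~ x)
cd <- cooks.distance(m)   # one Cook's distance value per case

# The three largest values, i.e. the cases plot(m, 4) would label
head(sort(cd, decreasing = TRUE), 3)

# Row numbers of any cases above the 0.50 cutoff
flagged <- which(cd > 0.5)
flagged
```

For the model in this lab, the equivalent call would be which(cooks.distance(reg_model) > 0.5), which should return no cases if the plot has been read correctly.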
Before interpreting our results, we assessed our variables to see if they met the assumptions for a simple linear regression. Analysis of a Residuals vs Fitted plot suggested that there is some minor non-linearity, but not enough to violate the assumption of linearity. We also checked Cook’s distance plot to detect outliers. All cases were below the recommended cutoff for Cook’s distance of 0.50, so no outliers were detected.
summary(reg_model)
##
## Call:
## lm(formula = swb ~ stress_std, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2476 -0.8071 0.0747 0.8068 3.6266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.47554 0.02032 220.21 <2e-16 ***
## stress_std -0.66250 0.02033 -32.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.143 on 3160 degrees of freedom
## Multiple R-squared: 0.2516, Adjusted R-squared: 0.2513
## F-statistic: 1062 on 1 and 3160 DF, p-value: < 2.2e-16
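The numbers needed for a write-up can also be pulled out of the summary object rather than copied from the printout. This is a sketch on simulated data (x, y, and m are hypothetical, not the project data) using only base R accessors.

```r
# Simulated data; NOT the project data
set.seed(3)
x <- rnorm(150)
y <- 1 - 0.4 * x + rnorm(150)
m <- lm(y ~ x)
s <- summary(m)

coefs <- s$coefficients            # matrix: Estimate, Std. Error, t value, Pr(>|t|)
b     <- coefs["x", "Estimate"]    # slope
t_val <- coefs["x", "t value"]
p_val <- coefs["x", "Pr(>|t|)"]

adj_r2 <- s$adj.r.squared
f_stat <- s$fstatistic             # named vector: value, numdf, dendf
df_res <- m$df.residual            # residual degrees of freedom (n - 2 here)

# With a single predictor, the F statistic is the square of the slope's t statistic
all.equal(unname(f_stat["value"]), t_val^2)  # TRUE
```

The same relationship holds in the output above: the t value of -32.59 squared is approximately 1062, the reported F statistic.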
To test our hypothesis that perceived stress significantly predicts subjective wellbeing, and that the relationship would be negative, we used a simple linear regression to model the relationship between those variables. We confirmed that our data met the assumptions of a linear regression, checking the linearity of the relationship using a Residuals vs Fitted plot and checking for outliers using Cook’s distance plot. (Note: We are skipping the assumptions of normality and homogeneity of variance for this analysis.)
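As predicted, we found that perceived stress significantly predicted subjective wellbeing, Adj. R² = 0.25, F(1, 3160) = 1062, p < .001. Additionally, the relationship between perceived stress and subjective wellbeing was negative, b = −0.66, t(3160) = −32.59, p < .001 (refer to Figure 1): each one-standard-deviation increase in perceived stress predicted a 0.66-point decrease in subjective wellbeing. Because only the predictor was standardized, this slope is not a fully standardized coefficient. In a simple regression the standardized coefficient equals the correlation, whose magnitude is the square root of R² = 0.2516, so β ≈ −0.50. According to Cohen (1988), this sits at the threshold for a large effect size (|r| ≥ .50).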
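One point worth checking when reporting a standardized coefficient: when only the predictor is z-scored, as with stress_std in this lab, the slope is still expressed in the raw units of the outcome. The sketch below uses simulated data (the variables stress and swb here are hypothetical stand-ins, not the project data) to show how the two kinds of slope relate to each other and to the correlation.

```r
# Simulated stand-ins for the two variables; NOT the project data
set.seed(11)
n <- 500
stress <- rnorm(n, mean = 3, sd = 0.6)
swb    <- 7 - 0.8 * stress + rnorm(n, sd = 1.1)

stress_z <- as.numeric(scale(stress))
swb_z    <- as.numeric(scale(swb))

# Predictor standardized only: raw outcome units per 1 SD of the predictor
b_semi <- coef(lm(swb ~ stress_z))[["stress_z"]]

# Both variables standardized: the standardized beta
beta <- coef(lm(swb_z ~ stress_z))[["stress_z"]]

all.equal(beta, cor(stress, swb))   # TRUE: beta equals Pearson's r
all.equal(beta, b_semi / sd(swb))   # TRUE: dividing by the outcome's SD converts one to the other
```

Applied to the output above, dividing the slope of −0.66 by the standard deviation of swb (1.32 in the describe() output) gives roughly −0.50, matching the square root of the reported R-squared.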
References
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.