#install.packages("broom")
#install.packages("ggplot2")
library(psych) # for the describe() command
library(broom) # for the augment() command
library(ggplot2) # to visualize our results
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
# For HW, import the dataset you cleaned previously, this will be the dataset you'll use throughout the rest of the semester
d <- read.csv(file="Data/projectdata.csv", header=T)
We hypothesize that self-esteem will be significantly and negatively correlated with perceived stress, such that individuals with higher self-esteem will report lower levels of perceived stress.
My independent variable is: self-esteem
My dependent variable is: perceived stress
# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again to be sure that everything is correct
str(d)
## 'data.frame': 1252 obs. of 7 variables:
## $ X : int 1 20 30 31 33 49 57 81 86 104 ...
## $ gender : chr "female" "male" "female" "female" ...
## $ employment: chr "3 employed" "1 high school equivalent" "1 high school equivalent" "3 employed" ...
## $ big5_open : num 5.33 5.33 5 6 5 ...
## $ iou : num 3.19 4 1.59 3.37 1.7 ...
## $ rse : num 2.3 1.6 3.9 1.7 3.9 2.4 1.8 3.5 2.6 3 ...
## $ pss : num 3.25 3.75 1 3.25 2 2 4 1.25 2.5 2.5 ...
# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(d)
## vars n mean sd median trimmed mad min max range
## X 1 1252 4764.43 2595.60 4989.50 4831.52 3351.42 1 8867 8866
## gender* 2 1252 1.38 0.80 1.00 1.21 0.00 1 4 3
## employment* 3 1252 1.72 1.12 1.00 1.53 0.00 1 6 5
## big5_open 4 1252 5.25 1.12 5.33 5.34 0.99 1 7 6
## iou 5 1252 2.57 0.91 2.41 2.51 0.99 1 5 4
## rse 6 1252 2.62 0.72 2.70 2.64 0.74 1 4 3
## pss 7 1252 2.95 0.95 3.00 2.95 1.11 1 5 4
## skew kurtosis se
## X -0.17 -1.22 73.36
## gender* 1.74 1.40 0.02
## employment* 1.38 1.34 0.03
## big5_open -0.74 0.43 0.03
## iou 0.49 -0.60 0.03
## rse -0.21 -0.73 0.02
## pss 0.06 -0.76 0.03
# next, use histograms to examine your continuous variables
hist(d$rse)
hist(d$pss)
# last, use scatterplots to examine your continuous variables together
# Remember to put INDEPENDENT FIRST, so it goes on the x-axis
plot(d$rse, d$pss)
# to calculate standardized coefficients for the regression, we have to standardize our IV
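# scale() returns a one-column matrix, so as.numeric() converts it to a plain numeric vector
d$rse_std <- as.numeric(scale(d$rse, center = TRUE, scale = TRUE))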
# use the lm() command to run the regression
# dependent/outcome variable on the left of the ~, independent/predictor variable on the right.
reg_model <- lm(pss ~ rse_std, data = d)
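The following is a minimal sketch, on simulated data rather than the lab dataset, of what standardizing the predictor does to the slope. The data frame `sim` and its values are made up for illustration; only the column names mirror the lab. With only the predictor standardized (as in this lab), the slope is the change in the outcome, in its raw units, per one standard deviation of the predictor. With both variables standardized, the slope equals Pearson's r.

```r
# Simulated stand-in data (assumption: values are invented, not the lab data)
set.seed(1)
n <- 200
sim <- data.frame(rse = rnorm(n, mean = 2.6, sd = 0.7))
sim$pss <- 5 - 0.8 * sim$rse + rnorm(n, sd = 0.6)

# scale() returns a one-column matrix; as.numeric() gives a plain vector
sim$rse_std <- as.numeric(scale(sim$rse))
sim$pss_std <- as.numeric(scale(sim$pss))

# Predictor standardized only: slope is in raw pss units per 1 SD of rse
b_iv_only <- unname(coef(lm(pss ~ rse_std, data = sim))["rse_std"])

# Both variables standardized: slope equals the Pearson correlation
b_both <- unname(coef(lm(pss_std ~ rse_std, data = sim))["rse_std"])

r <- cor(sim$rse, sim$pss)
print(c(iv_only = b_iv_only, both = b_both, r = r))
```

In a simple regression the two slopes differ only by a factor of the outcome's standard deviation, so the sign and significance test are the same either way.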
# NO PEEKING AT YOUR MODEL RESULTS YET!
# do not edit this line of code
model.diag.metrics <- augment(reg_model)
# only replace the variables in 3 places in this line of code
ggplot(model.diag.metrics, aes(x = rse_std, y = pss)) +
geom_point() +
stat_smooth(method = lm, se = FALSE) +
geom_segment(aes(xend = rse_std, yend = .fitted), color = "red", linewidth = 0.3)
## `geom_smooth()` using formula = 'y ~ x'
The plot below shows the residuals for each case and the fitted line. The red line tracks the average residual across the range of fitted values. If the assumption of linearity is met, the red line should be roughly horizontal and close to zero, which indicates that the residuals average out to around zero along the whole line. For this lab, the plot shows some non-linearity: in places there are more data points below the regression line than above it, so some negative residuals have no positive residuals to cancel them out. However, a bit of deviation is okay. Just as with skewness and kurtosis when assessing normality, there is a range of acceptability that we can work within before non-linearity becomes a critical issue.
For some examples of good Residuals vs Fitted plots and ones that show serious problems, check out this page. Looking at these examples, you can see the first case has a plot in which the red line sticks pretty closely to the zero line, while the other cases show some serious deviation. Our plot for the lab is much closer to the ‘good’ plot than it is to the ‘serious issues’ plots, so we’ll consider our data okay and proceed with our analysis. Obviously, this is quite a subjective decision. The key takeaway is that these evaluations are closely tied to the context of our sample, our data, and what we’re studying. It’s almost always a judgement call.
You’ll notice in the bottom right corner, there are some points with numbers included: these are participants (“cases”, indicated by row number) who have the most influence on the regression line (and so they might be outliers). We’ll cover more about outliers in the next section.
[NOTE: All of the above text is informational. You do NOT need to edit it for the HW.]
plot(reg_model, 1)
Interpretation: The Residuals vs Fitted plot for the relationship between perceived stress (DV) and self-esteem (IV) shows no significant patterns in the residuals, indicating that the assumption of linearity is reasonably met. While there is a slight curve visible in the residuals, this non-linearity appears minimal and does not suggest severe violations of model assumptions. The data points are scattered fairly evenly around the horizontal red line, which indicates that the variance of the residuals is approximately constant.
For your HW: You need to generate this plot and then talk about how your plot compares to the ‘good’ / ‘bad, problematic’ plots linked to above in the “Issues with my Data” section below. Is it closer to the ‘good’ plots or one of the ‘bad’ plots? This is going to be a judgement call, so just do your best!
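As a rough numeric companion to the visual check, the sketch below smooths the residuals against the fitted values with base R's `lowess()`, which is similar in spirit to the red line in the Residuals vs Fitted plot. The data are simulated with a truly linear relationship (an assumption made purely for illustration), so the smoothed line should stay close to zero.

```r
# Simulated data with a truly linear relationship (assumption: not the lab data)
set.seed(2)
n <- 300
x <- rnorm(n)
y <- 3 - 0.7 * x + rnorm(n, sd = 0.6)
fit <- lm(y ~ x)

# Smooth the residuals across the range of fitted values
smooth <- lowess(fitted(fit), resid(fit))

# Largest distance of the smoothed line from zero, compared to the residual spread
max_dev <- max(abs(smooth$y))
mean_resid <- mean(resid(fit))
print(c(max_dev = max_dev, resid_sd = sd(resid(fit))))
```

A smoothed line that wanders far from zero relative to the spread of the residuals would point toward non-linearity; how far is too far remains a judgement call.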
The plot below addresses leverage, or how much each data point is able to influence the regression line. Outliers are points that have undue influence on the regression line, the way that Bill Gates entering the room has an undue influence on the mean income.
The Cook’s distance plot is a visualization of a score called (you guessed it) Cook’s distance, calculated for each case (aka participant) in the dataframe. Cook’s distance tells us how much the regression would change if that data point were removed. Ideally, we want all points to have the same influence on the regression line, although we accept that there will be some variability. The cutoff for a high Cook’s distance score is .5. For our data, some points do exert more influence than others, but none of them are close to the cutoff.
[NOTE: All of the above text is informational. You do NOT need to edit it for the HW.]
# Cook's distance
plot(reg_model, 4)
Interpretation: The Cook’s Distance plot indicates that most data points exert minimal influence on the regression model, as the Cook’s distances are generally low and do not exceed the common threshold for concern. However, a few observations, such as points 436, 461, and 1038, show slightly higher Cook’s distances. While these points may warrant further investigation to determine their potential influence on the model, they are not extreme enough to suggest severe outliers or highly influential observations.
For your HW: You need to generate the plot, assess Cook’s distance in your dataset and identify any potential cases/participants that are prominent outliers using the cutoff for a high Cook’s distance score of .5. You will summarize this in the “Issues with my Data” section below.
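If reading values off the plot is difficult, the same scores can be listed directly with base R's `cooks.distance()`. The sketch below uses simulated data with one deliberately planted extreme case (both are assumptions for illustration) and applies the .5 cutoff described above.

```r
# Simulated data (assumption: not the lab data) with one planted extreme case
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 3 - 0.7 * x + rnorm(n, sd = 0.6)
x[n] <- 6   # far from the other predictor values
y[n] <- 8   # far from what the trend would predict
fit <- lm(y ~ x)

# One Cook's distance per case, named by row number
cd <- cooks.distance(fit)

# Cases above the .5 cutoff used in this lab
flagged <- which(cd > 0.5)
print(flagged)

# The three largest scores, for comparison with the labels on the plot
print(head(sort(cd, decreasing = TRUE), 3))
```

With your own model, replacing `fit` with `reg_model` should give the row numbers to report; an empty result means no case exceeds the cutoff.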
Before interpreting our results, we assessed our variables to ensure they met the assumptions for a simple linear regression. Analysis of the Residuals vs Fitted plot suggested some minor non-linearity between self-esteem and perceived stress; however, this deviation is not substantial enough to violate the assumption of linearity. Additionally, we examined the Cook’s Distance plot to detect influential outliers. While a few observations, such as 436, 461, and 1038, showed slightly elevated Cook’s distances, all cases were well below the commonly used cutoff of 0.5, indicating no severe outliers or highly influential points were present.
summary(reg_model)
##
## Call:
## lm(formula = pss ~ rse_std, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.44422 -0.43448 0.01105 0.42447 2.90210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.95387 0.01811 163.11 <2e-16 ***
## rse_std -0.70329 0.01812 -38.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6408 on 1250 degrees of freedom
## Multiple R-squared: 0.5466, Adjusted R-squared: 0.5462
## F-statistic: 1507 on 1 and 1250 DF, p-value: < 2.2e-16
# note for write-up section below: to type lowercase Beta (ß) you need to hold down Alt key and type 225 on numeric keypad. If that doesn't work (upon releasing the Alt key), you should be able to copy/paste it from somewhere else.
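To reduce copying mistakes in the write-up, the reported statistics can be pulled from the summary object rather than retyped. This is a minimal sketch on simulated data; the object names (`sim`, `fit`, `s`) are placeholders, not part of the lab code.

```r
# Simulated data (assumption: not the lab data)
set.seed(4)
n <- 150
sim <- data.frame(x = rnorm(n))
sim$y <- 3 - 0.7 * sim$x + rnorm(n, sd = 0.6)
fit <- lm(y ~ x, data = sim)
s <- summary(fit)

# Model-level statistics
adj_r2 <- s$adj.r.squared
f_stat <- unname(s$fstatistic["value"])
df1 <- as.integer(s$fstatistic["numdf"])
df2 <- as.integer(s$fstatistic["dendf"])

# Predictor-level statistics from the coefficient table
slope <- s$coefficients["x", "Estimate"]
t_val <- s$coefficients["x", "t value"]
p_val <- s$coefficients["x", "Pr(>|t|)"]

cat(sprintf("Adj. R^2 = %.2f, F(%d, %d) = %.2f\n", adj_r2, df1, df2, f_stat))
cat(sprintf("Estimate = %.2f, t(%d) = %.2f\n", slope, df2, t_val))
```

Note that the degrees of freedom for the t test match the denominator degrees of freedom of the F test, and in a one-predictor model the squared t value equals F.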
Effect size, based on Regression Beta (Estimate) value:
Trivial: Less than 0.10
Small: 0.10–0.29
Medium: 0.30–0.49
Large: 0.50 or greater
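The thresholds above can be applied mechanically to the absolute value of the estimate. The helper below, `classify_effect()`, is a hypothetical function written for this illustration and is not part of any package.

```r
# Hypothetical helper: label an estimate using the thresholds listed above
classify_effect <- function(beta) {
  size <- abs(beta)  # the sign gives direction, not magnitude
  as.character(cut(
    size,
    breaks = c(0, 0.10, 0.30, 0.50, Inf),
    labels = c("Trivial", "Small", "Medium", "Large"),
    right = FALSE    # each interval includes its lower bound
  ))
}

classify_effect(c(0.05, 0.10, -0.35, -0.70))
```

Taking the absolute value first means a negative relationship is classified by its strength alone.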
To test our hypothesis that self-esteem will significantly predict perceived stress, and that the relationship will be negative, we used a simple linear regression to model the relationship between these variables. We confirmed that our data met the assumptions of linear regression by analyzing a Residuals vs Fitted plot, which indicated minor non-linearity but no severe violations of the assumption of linearity. We also checked for outliers using a Cook’s Distance plot, which showed no severe outliers or highly influential points, as all values were well below the commonly accepted cutoff of 0.5.
Note: we are skipping the assumptions of normality and homogeneity of variance for this assignment.
As predicted, we found that self-esteem significantly predicted perceived stress, Adj. R^2 = .55, F(1, 1250) = 1507, p < .001. The relationship between self-esteem and perceived stress was negative, ß = -0.70, t(1250) = -38.82, p < .001 (refer to Figure 1). According to Cohen (1988), this constitutes a large effect size (0.50 or greater).
References
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.