library(psych) # for the describe() command
library(car) # for the vif() command
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
library(sjPlot) # to visualize our results
## Warning: package 'sjPlot' was built under R version 4.3.3
## Learn more about sjPlot with 'browseVignettes("sjPlot")'.
# import the dataset you cleaned previously
# this will be the dataset you'll use throughout the rest of the semester
# use EAMMi2 data
d <- read.csv(file="data/realdata_final.csv", header=T)
set.seed(110)
d$fake <- (rnorm(n=nrow(d))-d$mindful)*d$pipwd
d$fake <- d$fake+(d$fake*rnorm(n=nrow(d)))
We hypothesize that self-efficacy (measured by the GSE), need to belong (measured by the NBS), and fake data (a fake variable generated for this lab) will significantly predict subjective well-being (measured by the SWLS).
# you only need to check the variables you're using in the current analysis
# although you checked them previously, it's always a good idea to look them over again and be sure that everything is correct
str(d)
## 'data.frame': 1620 obs. of 8 variables:
## $ ResponseId: chr "R_12G7bIqN2wB2N65" "R_2CfdmFw1NTliv4e" "R_24kJPxVOxMshN3Q" "R_3JmBijdH1z82QU2" ...
## $ gender : chr "m" "f" "f" "f" ...
## $ sibling : chr "at least one sibling" "only child" "at least one sibling" "at least one sibling" ...
## $ pipwd : num 2.33 3.53 3 3 3.13 ...
## $ mindful : num 2.2 3 3.4 4.6 3.87 ...
## $ belong : num 3.6 3.6 2.9 4.1 3.1 2.4 3 4.1 2.8 3.3 ...
## $ support : num 5.17 6.08 6 6 6 ...
## $ fake : num -2.3 3.94 -12.16 -14.31 -13.49 ...
cont <- na.omit(subset(d, select=c(pipwd, mindful, belong, support)))
cont$pipwd <- scale(cont$pipwd, center=T, scale=T)
cont$mindful <- scale(cont$mindful, center=T, scale=T)
cont$belong <- scale(cont$belong, center=T, scale=T)
cont$support <- scale(cont$support, center=T, scale=T)
# you can use the describe() command on an entire dataframe (d) or just on a single variable
describe(cont)
## vars n mean sd median trimmed mad min max range skew kurtosis
## pipwd 1 1620 0 1 0.12 -0.02 0.70 -3.21 3.67 6.88 0.12 1.34
## mindful 2 1620 0 1 -0.01 0.00 0.92 -2.87 2.77 5.64 -0.02 -0.13
## belong 3 1620 0 1 -0.05 0.02 0.99 -3.20 2.61 5.81 -0.22 -0.15
## support 4 1620 0 1 0.21 0.10 0.96 -4.69 1.36 6.05 -1.00 1.09
## se
## pipwd 0.02
## mindful 0.02
## belong 0.02
## support 0.02
# also use histograms to examine your continuous variables
hist(cont$pipwd)
hist(cont$mindful)
hist(cont$belong)
hist(cont$support)
# last, use scatterplots to examine your continuous variables together
plot(cont$pipwd, cont$mindful)
plot(cont$mindful, cont$belong)
plot(cont$mindful, cont$support)
plot(cont$pipwd, cont$belong)
plot(cont$pipwd, cont$support)
plot(cont$belong, cont$support)
corr_output_m <- corr.test(cont)
corr_output_m
## Call:corr.test(x = cont)
## Correlation matrix
## pipwd mindful belong support
## pipwd 1.00 0.19 -0.14 0.20
## mindful 0.19 1.00 -0.21 0.18
## belong -0.14 -0.21 1.00 0.08
## support 0.20 0.18 0.08 1.00
## Sample Size
## [1] 1620
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## pipwd mindful belong support
## pipwd 0 0 0 0
## mindful 0 0 0 0
## belong 0 0 0 0
## support 0 0 0 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
# use the lm() command to run the regression
# dependent/outcome variable on the left, independent/predictor variables on the right
reg_model <- lm(pipwd ~ belong + mindful + support, data = cont)
Assumptions we’ve discussed previously:
New assumptions:
For your homework, if you don’t have the required number of cases you’ll need to drop one of your independent variables. Reach out to me and we can figure out the best way to proceed!
needed <- 80 + 8*2
nrow(cont) >= needed
## [1] TRUE
For your homework, you will need to discuss multicollinearity and any high values, but you don’t have to drop any variables. As shown on our residuals vs fitted plot, we can see our data is partially non-linear, but there are not any specific values that range above the cutoff of 5. However, there are many values that land close to or at 4.
vif(reg_model)
## belong mindful support
## 1.063022 1.090262 1.049149
The plot below shows the residuals for each case and the fitted line. The red line is the average residual for the specified point of the dependent variable. If the assumption of linearity is met, the red line should be horizontal. This indicates that the residuals average to around zero. However, a bit of deviation is okay – just like with skewness and kurtosis, there’s a range that we can work in before non-normality becomes a critical issue. For some examples of good Residuals vs Fitted plot and ones that show serious errors, check out this page.
For your homework, you’ll simply need to generate this plot and talk about whether your assumptions are met. This is going to be a judgement call, and that’s okay! In practice, you’ll always be making these judgement calls as part of a team, so this assignment is just about getting experience with it, not making the perfect call. Based on the created graph, there is an even spread of data that does not group around the fitted line. Due to this line being horizontal across the graph we can see that the data is linear, and the residuals average around 0. However, with this even spread of data across the graph, we must keep skew and kurtosis in mind while moving forward with our analysis to ensure our overall results are not being influenced.
plot(reg_model, 1)
This plot is a bit new. It’s called a Q-Q plot and shows the standardized residuals plotted against a normal distribution. If our variables are perfectly normal, the points will fit on the dashed line perfectly. This page shows how different types of non-normality appear on a Q-Q plot.
It’s normal for Q-Q plots show a bit of deviation at the ends. This page shows some examples that help us put our Q-Q plot into context. For your homework, you’ll simply need to generate this plot and talk about how your plot compares to the ones pictured. Does it seem like any skew or kurtosis is indicated by your plot? Is it closer to the ‘good’/‘bad’ plots from the second link?
plot(reg_model, 2)
The plots below both address leverage, or how much each data point is able to influence the regression line. Outliers are points that have undue influence on the regression line, the way that Bill Gates entering the room has an undue influence on the mean income.
The first plot, Cook’s distance, is a visualization of a score called (you guessed it) Cook’s distance, calculated for each case (aka row or participant) in the dataframe. Cook’s distance tells us how much the regression would change if the point was removed. The second plot also includes the residuals in the examination of leverage. The standardized residuals are on the y-axis and leverage is on the x-axis; this shows us which points have high residuals (are far from the regression line) and high leverage. Points that have large residuals and high leverage are especially worrisome, because they are far from the regression line but are also exerting a large influence on it.
For your homework, you’ll simply need to generate these plots, assess Cook’s distance in your dataset, and then identify any potential cases that are prominent outliers. Since we have some cutoffs, that makes this process is a bit less subjective than some of the other assessments we’ve done here, which is a nice change!
# Cook's distance
plot(reg_model, 4)
# Residuals vs Leverage
plot(reg_model, 5)
This plot shows us the standardized residuals across the range of the regression line. Because the residuals are standarized, large residuals (whether positive or negative) are at the top of the plot, while small residuals (whether positive or negative) are at the bottom of the plot. If the assumption of homogeneity of variance (also called homoscedasticity) is met, the red line should be mostly horizontal. If it deviates from the mean line, that means that the variance is smaller or larger at that point of the regression line.
Once again, you can check out this page for some other examples of this type of plot. For your homework, you’ll simply need to generate this plot and talk about how your plot compares to the ones pictured. Is it closer to the ‘good’ plots or one of the ‘bad’ plots? Again, this is a judgement call! It’s okay if feel uncertain, and you won’t be penalized for that.
plot(reg_model, 3)
Before interpreting my results, I assessed my variables to see if they met the assumptions for a multiple linear regression. I analyzed a Scale-Location plot and detected some issues with homogeneity of variance, as well as some concern with linearity in a Residuals vs Fitted plot. Additionally, I detected some values that could be considered outliers (visually analyzing a Residuals vs Leverage plot) but did not find any serious issues with the normality of our residuals (visually analyzing a Q-Q plot).
summary(reg_model)
##
## Call:
## lm(formula = pipwd ~ belong + mindful + support, data = cont)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0783 -0.5372 0.0535 0.4624 3.8152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.966e-16 2.385e-02 0.000 1
## belong -1.302e-01 2.460e-02 -5.294 1.36e-07 ***
## mindful 1.321e-01 2.491e-02 5.301 1.31e-07 ***
## support 1.835e-01 2.444e-02 7.509 9.81e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9601 on 1616 degrees of freedom
## Multiple R-squared: 0.07999, Adjusted R-squared: 0.07829
## F-statistic: 46.84 on 3 and 1616 DF, p-value: < 2.2e-16
# you do not have to run the below code for the HW
reg_model_test <- lm(pipwd ~ belong, data = cont)
summary(reg_model_test)
##
## Call:
## lm(formula = pipwd ~ belong, data = cont)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1879 -0.5784 0.1095 0.3943 3.7155
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.168e-16 2.460e-02 0.000 1
## belong -1.428e-01 2.461e-02 -5.805 7.72e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9901 on 1618 degrees of freedom
## Multiple R-squared: 0.0204, Adjusted R-squared: 0.0198
## F-statistic: 33.7 on 1 and 1618 DF, p-value: 7.723e-09
plot_model(reg_model, type="std")
plot_model(reg_model, type="emm", terms = c("belong"))
plot_model(reg_model, type="emm", terms = c("mindful"))
plot_model(reg_model, type="emm", terms = c("support"))
# note for section below: to type lowercase Beta below (ß) you need to hold down Alt key and type 225 on numeric keypad. If that doesn't work you should be able to copy/paste it from somewhere else
To test our hypothesis that positive identity as a person with a disability, mindfulness, and percieved social support would significantly predict one’s need to belong, we used a multiple linear regression to model the relationship between the variables. We confirmed that our data met the assumptions of a linear regression, and although there were some issues with homogeneity of variance and linearity we continued with the analysis anyway.
Our model was statistically significant, Adj. R2 = .0198, F(1,1618) = 33.7, p < .001. The relationship between Mindfulness and Positive Identity as Person with Disability, as well as the relationship between Positive Identity as Person with Disability and Perceived Social Support were positive and has a large effect size (per Cohen, 1988), while the relationships between our remaining predictors (Positive Identity as Person with Disability and Need to belong) were negative and had effect sizes that were small or trivial. Full output from the regression model is reported in Table 1.
| Positive Identity as Person with Disability | ||||
|---|---|---|---|---|
| Predictors | Estimates | SE | CI | p |
| Intercept | -0.00 | 0.02 | -0.05 – 0.05 | 1.000 |
| Need to Belong | -0.13 | 0.02 | -0.18 – -0.08 | <0.001 |
| Positive Identity as Person with Disability | 0.13 | 0.02 | 0.08 – 0.18 | <0.001 |
| Percieved Social Support | 0.18 | 0.02 | 0.14 – 0.23 | <0.001 |
| Observations | 1620 | |||
| R2 / R2 adjusted | 0.080 / 0.078 | |||
References
Cohen J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.