Multiple Linear Regression HW

Author

Michelle Karpinski

Loading Libraries

library(psych) # for the describe() command
library(car) # for the vif() command
library(sjPlot) # to visualize our results

Importing Data

d <- read.csv(file="Data/mydata.csv", header=T)

# since we're focusing on our continuous variables, we're going to drop our categorical variables. this will make some of the later steps easier.
d <- subset(d, select=-c(edu, income))

Examine Your Variables

# use the scale() command to standardize all of your variables (IVs and DVs)
d$swb <- scale(d$swb, center=T, scale=T)
d$efficacy <- scale(d$efficacy, center=T, scale=T)
d$exploit <- scale(d$exploit, center=T, scale=T)
d$stress <- scale(d$stress, center=T, scale=T)

# also use histograms to examine your continuous variables
hist(d$swb)

hist(d$efficacy)

hist(d$exploit)

hist(d$stress)

# use scatterplots to examine your continuous variables in pairs
plot(d$swb, d$efficacy)

plot(d$swb, d$exploit)

plot(d$swb, d$stress)

plot(d$efficacy, d$exploit)

plot(d$efficacy, d$stress)

plot(d$stress, d$exploit)
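
As an optional shortcut, base R's pairs() command draws all of the pairwise scatterplots in a single panel instead of six separate plots; a minimal sketch, assuming d contains only the four standardized variables at this point (as the correlation matrix below shows):

# optional: all pairwise scatterplots in one figure
pairs(d)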

# create a correlation matrix to examine the relationships between your variables
corr_output_m <- corr.test(d)
corr_output_m
Call:corr.test(x = d)
Correlation matrix 
           swb efficacy exploit stress
swb       1.00     0.40   -0.08  -0.50
efficacy  0.40     1.00   -0.01  -0.40
exploit  -0.08    -0.01    1.00   0.03
stress   -0.50    -0.40    0.03   1.00
Sample Size 
[1] 3148
Probability values (Entries above the diagonal are adjusted for multiple tests.) 
         swb efficacy exploit stress
swb        0     0.00    0.00   0.00
efficacy   0     0.00    0.67   0.00
exploit    0     0.67    0.00   0.14
stress     0     0.00    0.07   0.00

 To see confidence intervals of the correlations, print with the short=FALSE option

State Your Hypothesis - PART OF YOUR WRITEUP

State your hypotheses. Remember, you will have three IVs and one DV, and all variables will be continuous. You should describe how each of your IVs will relate to the DV (e.g., ‘higher scores on X will be predictive of higher scores on Y’).

We are trying to predict one dependent variable (DV) with a multiple regression using three IVs, describing how the variables relate to each other.

Homework Hypotheses:

H1: We predict that subjective well-being will be positively related to efficacy.

H2: We predict that exploitativeness will be positively related to efficacy.

H3: We predict that stress will be negatively related to efficacy.

H4: When we examine all three independent variables together, we predict that stress will be the strongest predictor of efficacy, and that the effect of exploitativeness will disappear once the effects of the other variables are accounted for.

Check Your Assumptions

Multiple Linear Regression Assumptions

  • Observations should be independent (confirmed by data report)
  • Number of cases should be adequate (N ≥ 80 + 8m, where m is the number of IVs). If you don’t have enough, you’ll need to drop an IV. (will check this below)
  • Independent variables should not be too correlated (aka multicollinearity). (will check this below)
  • Relationship between the variables should be linear. (will check this below)
  • Outliers should be identified and removed. (will check this below)
  • Residuals should be normally distributed and have constant variance. (will check this below) - a residual is the distance of each point from the predicted regression line (like a line of best fit)

Check Number of Cases

For your homework, if you don’t have the required number of cases you’ll need to drop one of your independent variables. Reach out to me if this happens and we can figure out the best way to proceed!

needed <- 80 + 8*3
nrow(d) >= needed
[1] TRUE

Run a Multiple Linear Regression

To check the following assumptions, we run our regression and then check some output and diagnostic plots BEFORE looking at our results.

# use this commented-out section only if you need to remove outliers
# to drop a single outlier, remove the # at the beginning of the line and use this code:
# d <- subset(d, row_id != 1108)

# to drop multiple outliers, remove the # at the beginning of the line and use this code:
# d <- subset(d, !(row_id %in% c(1108, 602)))

# use the lm() command to run the regression!!!
# dependent/outcome variable on the left, independent/predictor variables on the right
reg_model <- lm(efficacy ~ exploit + stress + swb, data = d)

Check multicollinearity

  • Higher values indicate more multicollinearity. This usually requires dropping a variable. For your homework, you will need to discuss multicollinearity and any high values, but you don’t have to drop any variables.
  • Cutoff is usually 5
vif(reg_model)
 exploit   stress      swb 
1.006223 1.337258 1.344105 
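
If you'd rather have R flag any problematic values than scan them by eye, here is a minimal sketch using the cutoff of 5 mentioned above (vif_values is just a name chosen here):

# store the VIF values and list any that exceed the usual cutoff of 5
vif_values <- vif(reg_model)
vif_values[vif_values > 5]   # an empty result means no values exceed the cutoff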

Check linearity with Residuals vs Fitted plot

READ THIS TEXT

This plot (below) shows the residuals for each case and the fitted line. The red line is the average residual at each point along the fitted values. If the assumption of linearity is met, the red line should be horizontal. This indicates that the residuals average to around zero. You can see that for this lab, the plot shows some non-linearity because there are more datapoints below the regression line than there are above it. Thus, there are some negative residuals that don’t have positive residuals to cancel them out. However, a bit of deviation is okay – just like with skewness and kurtosis, there’s a range that we can work in before non-normality or non-linearity becomes a critical issue. For some examples of good Residuals vs Fitted plots and ones that show serious errors, check out this page.

For your homework, you’ll simply need to generate this plot and talk about how your plot compares to the good and problematic plots linked to above. Is it closer to the ‘good’ plots or one of the ‘bad’ plots? This is going to be a judgement call, and that’s okay! In practice, you’ll always be making these judgement calls as part of a team, so this assignment is just about getting experience with it, not making the perfect call.

We checked nonlinearity visually…

plot(reg_model, 1)

Check for outliers using Cook’s distance and a Residuals vs Leverage plot

READ THIS TEXT

The plots below both address leverage, or how much each data point is able to influence the regression line. Outliers are points that have undue influence on the regression line, the way that Bill Gates entering the room has an undue influence on the mean income.

The first plot, Cook’s distance, is a visualization of a score called (you guessed it) Cook’s distance, calculated for each case (aka row or participant) in the dataframe. Cook’s distance tells us how much the regression would change if the point was removed. The second plot also includes the residuals in the examination of leverage. The standardized residuals are on the y-axis and leverage is on the x-axis; this shows us which points have high residuals (are far from the regression line) and high leverage. Points that have large residuals and high leverage are especially worrisome, because they are far from the regression line but are also exerting a large influence on it.

For your homework, you’ll simply need to generate these plots, assess Cook’s distance in your dataset, and then identify any potential cases that are prominent outliers. Since we have some cutoffs, this process is a bit less subjective than some of the other assessments we’ve done here, which is a nice change! Cutoff = 0.5

# Cook's distance
plot(reg_model, 4)

# Residuals vs Leverage
plot(reg_model, 5)
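
If you want to list specific cases instead of reading them off the plot, the cooks.distance() command returns the value for every row; a minimal sketch using the 0.5 cutoff from above (cooks_d is just a name chosen here):

# Cook's distance for every case; list the row numbers of any above the 0.5 cutoff
cooks_d <- cooks.distance(reg_model)
which(cooks_d > 0.5)   # an empty result means no prominent outliers by this cutoff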

Check homogeneity of variance in a Scale-Location plot

READ THIS TEXT

This plot is similar to the ones we’ve seen, but it shows us the standardized residuals across the range of the regression line. Because the residuals are standardized, large residuals (whether positive or negative) are at the top of the plot, while small residuals (whether positive or negative) are at the bottom of the plot. If the assumption of homogeneity of variance (also called homoscedasticity) is met, the red line should be mostly horizontal. If it deviates from the mean line, that means that the variance is smaller or larger at that point of the regression line. Once again, you can check out this page for some other examples of this type of plot. (Notice that the Scale-Location plot is the third in the grids.)

For your homework, you’ll simply need to generate this plot and talk about how your plot compares to the ones pictured. Is it closer to the ‘good’ plots or one of the ‘bad’ plots? Again, this is a judgement call! It’s okay if you feel uncertain, and you won’t be penalized for that.

We checked the diagnostic plot (rather than Levene’s test, which produces a p-value and is conservative).

plot(reg_model, 3)
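
If you would like a numeric check alongside the plot, the car package (already loaded for vif()) includes ncvTest(), a test for non-constant variance; this is just a supplement, and the Scale-Location plot remains the main check here:

# optional: test for non-constant error variance; a significant p-value suggests heteroscedasticity
ncvTest(reg_model)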

Check normality of residuals with a Q-Q plot

READ THIS TEXT

This plot is a bit new. It’s called a Q-Q plot and shows the standardized residuals plotted against a normal distribution. If our variables are perfectly normal, the points will fit on the dashed line perfectly. This page shows how different types of non-normality appear on a Q-Q plot. It’s normal for Q-Q plots to show a bit of deviation at the ends.

This page also shows some examples that help us put our Q-Q plot into context. Although it isn’t perfect, we don’t have any serious issues and are okay to proceed.

For your homework, you’ll simply need to generate this plot and talk about how your plot compares to the ones pictured. Does it seem like any skew or kurtosis is indicated by your plot? Is it closer to the ‘good’/‘bad’ plots from the second link?

plot(reg_model, 2)
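
If the Q-Q plot is hard to read, a histogram of the residuals shows the same distribution in a more familiar form; a minimal sketch as an optional second look:

# optional: histogram of the residuals as a second look at their distribution
hist(resid(reg_model))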

Issues with My Data - PART OF YOUR WRITEUP

Our observations were independent and there was a sufficient number of cases. Our VIF values were within the required range, suggesting little to no multicollinearity. We checked linearity with a Residuals vs Fitted plot and found that the assumption of linearity was met. We checked for outliers using Cook’s distance and found that the values were within the cutoff, and a Residuals vs Leverage plot showed that our data points were close to the line, with relatively low residuals and leverage, which was not concerning. We checked homogeneity of variance with a Scale-Location diagnostic plot and found that our data appear to meet that assumption. Lastly, we evaluated the normality of residuals using a Q-Q plot; through visual inspection, we concluded that our residuals are approximately normally distributed.

View Test Output

Effect size cutoffs from Cohen (1988):

  • Trivial: < .1
  • Small: between .1 and .3
  • Medium: between .3 and .5
  • Large: > .5

summary(reg_model)

Call:
lm(formula = efficacy ~ exploit + stress + swb, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9260 -0.5878 -0.0619  0.6033  2.6450 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.273e-16  1.580e-02    0.00    1.000    
exploit      2.156e-02  1.585e-02    1.36    0.174    
stress      -2.735e-01  1.828e-02  -14.96   <2e-16 ***
swb          2.621e-01  1.832e-02   14.31   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8866 on 3144 degrees of freedom
Multiple R-squared:  0.2147,    Adjusted R-squared:  0.214 
F-statistic: 286.5 on 3 and 3144 DF,  p-value: < 2.2e-16
# look at the p-value of the overall model first
# next, look at adjusted R-squared: the model explains _% of the variance in the DV
# finally, look at the regression coefficients: check the direction (+/-) of each hypothesized relationship and its p-value
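
Table 1 below is in the style of a formatted regression table from the sjPlot package loaded at the top; a minimal sketch of a tab_model() call that produces a similar table (show.se = TRUE adds the SE column; the default row labels will be the raw variable names, and the pred.labels and dv.labels arguments can relabel them to match Table 1):

# optional: a formatted regression table (estimates, CIs, p-values, R-squared) in the style of Table 1
tab_model(reg_model, show.se = TRUE)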

Write Up Results

For the purpose of this analysis, we predicted that subjective well-being would be positively related to efficacy, exploitativeness would be positively related to efficacy, and stress would be negatively related to efficacy. We further predicted that, when we examined all three independent variables together, stress would be the strongest predictor of efficacy, and that the effect of exploitativeness would disappear when accounting for the effects of the other variables. Our data appeared to adhere to all of the assumptions of multiple linear regression. With a significant overall model, F(3, 3144) = 286.5, p < .001, Adj. R^2 = .214, we found that our predictions regarding the relationships between efficacy and two of our independent variables, stress and subjective well-being, were supported (Table 1), with both relationships displaying a small effect size (Cohen, 1988). However, the positive relationship between efficacy and exploitativeness was not statistically significant (p > .05).

Table 1: Regression Output for the Relationships Between Exploitativeness, Stress, Subjective Well-Being, and Efficacy

Dependent variable: Efficacy

Predictors              Estimates   SE     CI              p
Intercept                -0.00      0.02   -0.03 –  0.03   1.000
Exploitativeness          0.02      0.02   -0.01 –  0.05   0.174
Stress                   -0.27      0.02   -0.31 – -0.24   <0.001
Subjective Well-Being     0.26      0.02    0.23 –  0.30   <0.001

Observations            3148
R2 / R2 adjusted        0.215 / 0.214


 

References

Cohen J. (1988). Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic.