In the previous module we discussed bivariate regression models, where one predictor or \(x\)-variable is used to create a model for one \(y\)-variable or dependent variable. If we use \(k\) predictors, say, \(x_1, x_2, \cdots, x_k\), then our linear model becomes \[y = \alpha + \beta_1x_1 + \beta_2x_2 + \cdots+ \beta_kx_k \] whose parameters will be estimated by statistics calculated from our sample: \[\hat y = a + b_1x_1 + b_2x_2 + \cdots+ b_kx_k\] In some stats apps, the coefficient \(a\) is referred to as \(b_0\). The analysis will be much the same as in the bivariate case, but assessing linearity in the multivariate setting takes more than a scatter plot.
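In R, such a model is specified by joining the predictors with plus signs in the lm formula. Here is a minimal sketch using hypothetical variables y, x1, x2, and x3 in a hypothetical data frame df:
# Hypothetical three-predictor model (y, x1, x2, x3, and df are placeholders)
fit = lm(y ~ x1 + x2 + x3, data = df)
coef(fit)   # the estimates a (or b0), b1, b2, b3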
The data set we will use primarily is Data3350, which was produced in 2015 during an undergraduate research project about personality and humor. The VarsData3350 PDF file has descriptions of each variable in the Data3350 file. Both are available for download in D2L. Be sure to put the Data3350 file in your R folder in Documents, and make sure your working directory is set the same way (Session menu). The code block below uses the library function to load the mosaic and readxl packages and then imports the data frame used in this module: Data3350.
library(mosaic)   # stats and plotting tools used throughout the course
library(readxl)   # provides read_excel for importing .xlsx files
Data3350 = read_excel("Data3350.xlsx")   # import the data set
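Once the import runs, it is worth a quick check that the data frame loaded as expected:
dim(Data3350)     # number of rows and columns
names(Data3350)   # variable names (described in the VarsData3350 PDF)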
The assumptions for linear models are very much the same as before.
For the multivariate case, a scatter plot is insufficient. Instead, we will use a “fitted values” plot to assess linearity of the entire model including all predictors. We will still use the qq-plot to assess normality.
We completed the Anxiety vs. Optimism example in the previous unit. Let’s rework it briefly and then create a multi-predictor model.
mod = lm(Anx ~ Opt, data = Data3350)
summary(mod)
Call:
lm(formula = Anx ~ Opt, data = Data3350)
Residuals:
Min 1Q Median 3Q Max
-16.8569 -5.5833 -0.3629 5.5158 20.6461
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.5964 2.9234 20.386 < 2e-16 ***
Opt -1.1243 0.1452 -7.741 1.68e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.915 on 142 degrees of freedom
(21 observations deleted due to missingness)
Multiple R-squared: 0.2968, Adjusted R-squared: 0.2918
F-statistic: 59.93 on 1 and 142 DF, p-value: 1.68e-12
We will use the qqmath function (from the lattice package, which loads along with mosaic) to create the qq-plots, and the native R plot function to get the fitted values plots.
qqmath(~resid(mod))
This is imperfect because we cannot see the line we’re supposed to be comparing the points against. We use the type parameter to specify that we want both the points (“p”) and the reference line (“r”). Type “l” gives a connected-line scatter plot, which we don’t want. Here’s the corrected code block.
qqmath(~resid(mod), type = c("p","r"))
Now we can see that the entirety of the plot looks great except for a couple of outliers on the far left. These should not worry us at all. The qq-plot gives good evidence that the normality assumption is valid.
R’s standard plot function generates four diagnostic plots for a model, but we only need the first one, which explains the “which” parameter below.
plot(mod , which = 1)
For a fitted values plot, we compare the fitted values (from the model) to the residuals. If the linearity assumption is satisfied, the plot will have a roughly horizontal line of fit (i.e., residuals scattered randomly around \(y=0\)). In the plot above, note the red line is quite close to the dotted line \(y = 0\). We have strong justification for the linearity assumption.
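To see what plot(mod, which = 1) is doing, we can build essentially the same picture ourselves from the model’s fitted values and residuals using lattice’s xyplot (loaded with mosaic). This is just a sketch, not a replacement for the diagnostic plot:
# Residuals vs. fitted values, built by hand
xyplot(resid(mod) ~ fitted(mod), type = c("p", "r"),
       xlab = "Fitted values", ylab = "Residuals")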
The qq-plot follows the 45-degree line quite closely, so the normality assumption appears to be satisfied as well. We begin to worry when the qq-plot deviates sharply from the 45-degree line, either showing a curve or S-shape or, sometimes, just a couple of outliers towards the endpoints that lie far away from the line.
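If you want to see what a troubling qq-plot looks like, one option is to simulate clearly non-normal data and plot it; rexp generates right-skewed values, so the points curve away from the reference line:
# Simulated example of a non-normal qq-plot (illustration only)
set.seed(42)                             # any seed works; chosen arbitrarily
qqmath(~rexp(150), type = c("p", "r"))   # right-skewed data curve away from the line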
summary(mod)
Call:
lm(formula = Anx ~ Opt, data = Data3350)
Residuals:
Min 1Q Median 3Q Max
-16.8569 -5.5833 -0.3629 5.5158 20.6461
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.5964 2.9234 20.386 < 2e-16 ***
Opt -1.1243 0.1452 -7.741 1.68e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.915 on 142 degrees of freedom
(21 observations deleted due to missingness)
Multiple R-squared: 0.2968, Adjusted R-squared: 0.2918
F-statistic: 59.93 on 1 and 142 DF, p-value: 1.68e-12
In the second row from the bottom of the output, consider “Multiple R-squared: 0.2968.” We only have one predictor in our model, so the magnitude of the correlation coefficient \(r\) is the square root of the R-squared value, with the sign matching the slope (negative here): \[r = -\sqrt{0.2968}\approx -0.5448\]
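We can verify this in R. Since the slope of Opt is negative, the correlation itself is the negative square root; this sketch assumes mosaic’s formula interface to cor, with use = "complete.obs" to handle the missing values:
# The correlation should match the signed square root of R-squared
cor(Anx ~ Opt, data = Data3350, use = "complete.obs")
-sqrt(summary(mod)$r.squared)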
We will extend the model to include a second and then a third predictor. The analysis for multiple regression is not that much more difficult except for swapping the scatter plot (bivariate) for the fitted values plot (multivariate).
We will use the Optimism vs. Anxiety model from above but add two more predictor variables: Neuroticism (Neuro) and Self-Esteem (SE), variables which are likely to have bivariate correlations with Anxiety. The model formula is: \[\text{Anx}\sim \text{Opt}+\text{Neuro}+\text{SE}\] In R model formulas, a plus sign (+) simply adds a predictor to the model, while an asterisk (*) crosses two predictors, including their interaction; the asterisk typically appears when a grouping or category variable is involved.
Here’s the model when we add only one additional variable: Neuro.
mod2 = lm(Anx ~ Opt + Neuro, data = Data3350)
summary(mod2)
Call:
lm(formula = Anx ~ Opt + Neuro, data = Data3350)
Residuals:
Min 1Q Median 3Q Max
-16.2644 -4.9861 -0.1781 4.6335 15.8281
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.76274 4.67521 6.152 7.56e-09 ***
Opt -0.42275 0.15506 -2.726 0.00722 **
Neuro 0.40004 0.04932 8.110 2.29e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.519 on 140 degrees of freedom
(22 observations deleted due to missingness)
Multiple R-squared: 0.5292, Adjusted R-squared: 0.5225
F-statistic: 78.7 on 2 and 140 DF, p-value: < 2.2e-16
Checking for normality works like before, except that R calculates the residuals based upon the 2-predictor model. The qq-plot, despite the model including multiple predictors, is analyzed exactly as before. Here, our model residuals do not vary much from a linear pattern, so the normality assumption is justified. Note that the native R plot function can generate the qq-plot as well, if we ask it for its first two graphics. The result (which uses standardized residuals) shows essentially the same pattern as the qqmath plot used earlier, which you can check for yourself. Here’s the quick way to get both:
plot(mod2, which = c(1,2))
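To check the claim about the qq-plots yourself, plot the same model’s residuals both ways:
# Two routes to the qq-plot for mod2
qqmath(~resid(mod2), type = c("p", "r"))   # raw residuals
plot(mod2, which = 2)                      # standardized residuals, same pattern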
For the Anxiety model, the linearity assumption seems problematic, especially for values in the lowest quartile of the distribution. Should we stop analyzing? Recall that violations of the linearity assumption are a big problem: regression statistics are robust only with respect to the normality assumption. If we do proceed, it must be with an abundance of caution.
The new model has a “Multiple R-squared: 0.5292,” a large uptick in variance accounted for. Recall that before, our model \(R^2\) indicated we were accounting for about 30% of the variance in Anxiety. Now, we’re accounting for almost 53% of the variance, an increase of more than 20 percentage points. While the linearity assumption seems wobbly, the model is working better in terms of prediction. We will let the example develop for instructional purposes to see what happens, noting that we do have legitimate concerns about linearity.
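mosaic’s rsquared helper makes this comparison quick:
# Variance accounted for by the one- and two-predictor models
rsquared(mod)    # about 0.30
rsquared(mod2)   # about 0.53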
Here’s the model when we add both Neuro and SE.
mod3 = lm(Anx ~ Opt + Neuro + SE, data = Data3350)
summary(mod3)
Call:
lm(formula = Anx ~ Opt + Neuro + SE, data = Data3350)
Residuals:
Min 1Q Median 3Q Max
-16.4391 -4.9244 0.2459 4.6683 15.6539
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.72105 5.80871 5.805 4.17e-08 ***
Opt -0.36660 0.15941 -2.300 0.0229 *
Neuro 0.36639 0.05449 6.723 4.24e-10 ***
SE -0.08915 0.06240 -1.429 0.1553
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.495 on 139 degrees of freedom
(22 observations deleted due to missingness)
Multiple R-squared: 0.5361, Adjusted R-squared: 0.526
F-statistic: 53.53 on 3 and 139 DF, p-value: < 2.2e-16
plot(mod3, which = c(1,2))
The normality assumption seems fine, as the qq-plot shows values that do not deviate much from the 45-degree line. The problem, again, is linearity. Given that the SE variable did not contribute meaningfully to the model, one should not be surprised that the fitted values plot looks remarkably similar to the 2-predictor model’s, and is still problematic.
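Although it goes slightly beyond this course, R’s anova function can compare the nested two- and three-predictor models directly; a non-significant F-statistic would confirm the impression that SE adds little. This works here because both models were fit to the same 143 complete cases:
# Nested-model comparison: does adding SE improve the fit?
anova(mod2, mod3)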
While we don’t have time in this course to discuss exploratory modeling, we can give a sense of how hierarchical regression modeling works. Suppose we believe the following five variables might predict Thrill-seeking because each one has a significant bivariate correlation with Thrill.
Let’s create a linear model with all five variables: \[\text{Thrill} \sim \text{Anx} + \text{Narc} + \text{Play} + \text{CHS} + \text{Opt}\]
mod4 = lm(Thrill ~ Anx + Narc + Play + CHS + Opt, data = Data3350)
summary(mod4)
Call:
lm(formula = Thrill ~ Anx + Narc + Play + CHS + Opt, data = Data3350)
Residuals:
Min 1Q Median 3Q Max
-13.102 -3.569 0.480 3.569 12.091
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.78315 5.37263 1.821 0.071010 .
Anx -0.08378 0.05730 -1.462 0.146239
Narc 0.61388 0.16302 3.766 0.000255 ***
Play 0.04295 0.02512 1.710 0.089782 .
CHS 0.06594 0.10607 0.622 0.535311
Opt 0.13931 0.11565 1.205 0.230650
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.908 on 125 degrees of freedom
(34 observations deleted due to missingness)
Multiple R-squared: 0.2296, Adjusted R-squared: 0.1988
F-statistic: 7.451 on 5 and 125 DF, p-value: 3.787e-06
Note that our multiple R-squared is 0.2296, so our model accounts for about 23% of the variance in Thrill. This may not seem like much, but in the social sciences it indicates that some meaningful relationships exist here. Before we check the diagnostics, notice the linear regression \(t\)-tests for the slope coefficients, whose \(p\)-values are listed in the last column of the “Coefficients” table. Only Narc is strongly significant. The \(p\)-values for CHS and Opt show that they are the worst predictors in the model, despite Opt having the 2nd-highest bivariate correlation with Thrill.
The predictors are all correlated with one another to some degree, so some of them are accounting for the same pieces of variance in the model: they’re fighting each other to predict roughly the same aspects of Thrill-seeking. Checking the diagnostic plots below, we see an issue on the left of the fitted values plot, but the qq-plot seems reasonable. Let’s remove the ill-fitting predictor CHS from the model.
plot(mod4, which = c(1,2))
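Before refitting, we can look at the overlap among the predictors directly with a correlation matrix (base R’s cor on the relevant columns, assuming they are all numeric scores; pairwise.complete.obs handles the missing values):
# Pairwise correlations among Thrill and its candidate predictors
round(cor(Data3350[, c("Thrill", "Anx", "Narc", "Play", "CHS", "Opt")],
          use = "pairwise.complete.obs"), 2)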
mod5 = lm(Thrill ~ Anx + Narc + Play + Opt, data = Data3350)
summary(mod5)
Call:
lm(formula = Thrill ~ Anx + Narc + Play + Opt, data = Data3350)
Residuals:
Min 1Q Median 3Q Max
-13.4020 -3.5840 0.3962 3.5528 12.0888
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.40509 4.68510 2.434 0.016321 *
Anx -0.09623 0.05355 -1.797 0.074748 .
Narc 0.63071 0.16037 3.933 0.000138 ***
Play 0.04737 0.02403 1.971 0.050949 .
Opt 0.13005 0.11441 1.137 0.257830
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.896 on 126 degrees of freedom
(34 observations deleted due to missingness)
Multiple R-squared: 0.2272, Adjusted R-squared: 0.2027
F-statistic: 9.262 on 4 and 126 DF, p-value: 1.358e-06
Notice that the R-squared value is 0.2272, only slightly less than before, and while Narc is still the only strongly significant predictor, both Anx and Play have grown much more significant without the CHS predictor in the model. Apparently, those three predictors were all fighting over the same pieces of variance. Also, Opt is just not working out. After checking the diagnostics, we’ll remove it.
plot(mod5, which = c(1,2))
The diagnostics have changed very little. We still have the lack of linearity on the left side of the fitted values plot and a reasonable qq-plot. Let’s see what removing Opt does.
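As an aside, R’s update function can drop a term from an existing model without retyping the whole formula; the sketch below (mod6_alt is a hypothetical name) produces the same fit as the explicit lm call that follows:
# Shortcut: refit mod5 with Opt removed
mod6_alt = update(mod5, . ~ . - Opt)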
mod6 = lm(Thrill ~ Anx + Narc + Play, data = Data3350)
summary(mod6)
Call:
lm(formula = Thrill ~ Anx + Narc + Play, data = Data3350)
Residuals:
Min 1Q Median 3Q Max
-13.4698 -3.7267 0.2912 3.6745 11.4900
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14.30159 3.93603 3.634 0.000404 ***
Anx -0.12766 0.04592 -2.780 0.006260 **
Narc 0.65882 0.15863 4.153 5.97e-05 ***
Play 0.05250 0.02363 2.221 0.028092 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.902 on 127 degrees of freedom
(34 observations deleted due to missingness)
Multiple R-squared: 0.2193, Adjusted R-squared: 0.2008
F-statistic: 11.89 on 3 and 127 DF, p-value: 6.479e-07
Notice that the R-squared value is 0.2193, still just a tiny bit less than before. Interestingly, both Anx and Play are now significant predictors, and Anx is strongly significant. Given the \(p\)-values from the table, we would say that Narc is the “most influential predictor” of Thrill, with Anx a secondary predictor and Play significant but of less importance to the model. Hopefully you have gotten a small taste of exploratory modeling. On assignments for this class, you will be asked only to investigate and evaluate linear models, not design them yourself. In any case, this set of examples helps to show what “bad” predictors look like and how they affect the model.
plot(mod6, which = c(1,2))
Sadly, getting rid of two ineffective predictors hasn’t improved our model diagnostics. This makes sense, because most of the variance the model accounts for was due to the three predictors we still have. We do have a linearity issue, which calls the validity of the model into question. Is it severe? No.
What should we do? In a more advanced statistics class, we might try a transformation: if we can locate the source of the non-linearity, we might be able to take the square root or natural log of a predictor to remove it. For this class, since the lack of linearity was not too severe, we can report our analysis and discuss the potential non-linearity.
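As a purely hypothetical sketch of what a transformation looks like in R (we are not claiming this fixes our model, and it assumes the Anx scores are positive so the log is defined):
# Hypothetical: log-transform one predictor and refit
mod_log = lm(Thrill ~ log(Anx) + Narc + Play, data = Data3350)
plot(mod_log, which = 1)   # re-check linearity in the fitted values plot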
Extend the Thrill-seeking model by adding Sex as a fourth predictor: \[\text{Thrill} \sim \text{Anx} + \text{Narc} + \text{Play} + \text{Sex}\] Evaluate and analyze your model including all diagnostic plots.
Using the HSAG variable from the Data3350 data frame, build a linear model for Aggressive Humor using the predictors Narcissism and Self-Defeating Humor Style: \[\text{HSAG} \sim \text{Narc} + \text{HSSD}\] Evaluate and analyze your model including all diagnostic plots.
Add the Eating Attitudes variable Eat as the third predictor in your model for HSAG:\[\text{HSAG} \sim \text{Narc} + \text{HSSD} + \text{Eat}\] Evaluate and analyze your model including all diagnostic plots, and compare your three-predictor model with your results from the two-predictor model. Is the correlation between Eat and HSAG positive or negative? How can you tell from the model summary statistics output? If higher scores on the Eat variable indicate higher levels of being calorie conscious, knowing the fat content of food items and thinking about burning calories when working out, does this relationship between Eat and HSAG make sense?
Using the OCD variable from the Data3350 data frame, build a two-predictor linear model using Perf and TypeA as predictors: \[\text{OCD} \sim \text{Perf} + \text{TypeA}\] Evaluate and analyze your model including all diagnostic plots.
Add Anx as the third predictor in your model for OCD: \[\text{OCD} \sim \text{Perf} + \text{TypeA}+ \text{Anx}\] Evaluate and analyze your model including all diagnostic plots, and compare your three-predictor model with your results from the two-predictor model.