Multiple Linear Regression
For recent data in Jacksonville, Florida, on y = selling price of home (in dollars), x1 = size of home (in square feet), and x2 = lot size (in square feet), the prediction equation is ŷ = −10,536 + 53.8x1 + 2.84x2.
Written in terms of the variable names, the prediction equation is
\[{selling\;price}={-10,536}+{53.8}*{size\;of\;home}+{2.84}*{lot\;size}\]
To calculate the predicted selling price, I’ll substitute the given values for size of home and lot size.
# assign variables
a <- -10536
b1 <- 53.8
b2 <- 2.84
x1 <- 1240
x2 <- 18000
actualP <- 145000
# calculate predicted price
predP <- a + (b1 * x1) + (b2 * x2)
# view predicted price
predP
[1] 107296
# calculate residual
residual <- actualP - predP
# view residual
residual
[1] 37704
A one-square-foot increase in home size predicts a $53.80 increase in selling price. To calculate the square-foot increase in lot size that would predict the same increase in selling price, I’ll use the following simple equation.
\[2.84*{lot\;size\;increase}=53.80\] or \[{lot\;size\;increase}=\frac{53.80}{2.84}\]
# calculate lot size increase
lotInc <- b1/b2
# view lot size increase
lotInc
[1] 18.94366
(Data file: salary in the alr4 R package). The data file concerns salary and other characteristics of all faculty in a small Midwestern college, collected in the early 1980s for presentation in legal proceedings in which discrimination against women in salary was at issue. All persons in the data hold tenured or tenure-track positions; temporary faculty are not included. The variables include degree, a factor with levels Masters and PhD; rank, a factor with levels Asst, Assoc, and Prof; sex, a factor with levels Male and Female; year, years in current rank; ysdeg, years since highest degree; and salary, academic year salary in dollars.
First I’ll inspect the data.
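A minimal sketch of how the data can be loaded and summarized, assuming the alr4 package is installed:
# load the alr4 package, which contains the salary data
library(alr4)
# summarize the data
summary(salary)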
degree rank sex year ysdeg
Masters:34 Asst :18 Male :38 Min. : 0.000 Min. : 1.00
PhD :18 Assoc:14 Female:14 1st Qu.: 3.000 1st Qu.: 6.75
Prof :20 Median : 7.000 Median :15.50
Mean : 7.481 Mean :16.12
3rd Qu.:11.000 3rd Qu.:23.25
Max. :25.000 Max. :35.00
salary
Min. :15000
1st Qu.:18247
Median :23719
Mean :23798
3rd Qu.:27258
Max. :38045
There are five predictor variables, three of which are categorical (degree, rank, sex) and two of which are quantitative (year, which refers to years in current rank, and ysdeg, which refers to years since highest degree earned).
The null hypothesis is that the mean salary for males and females is the same. That is, sex has no effect on salary. The alternative hypothesis, then, is that the mean salary for males and females is not the same.
Thus \[H_0:\mu_1=\mu_2\] and \[H_a:\mu_1\ne\mu_2\]
I’ll use the t.test() function to conduct a two-sample t-test.
# conduct t test
t.test(salary ~ sex, data = salary, var.equal = TRUE)
Two Sample t-test
data: salary by sex
t = 1.8474, df = 50, p-value = 0.0706
alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
95 percent confidence interval:
-291.257 6970.550
sample estimates:
mean in group Male mean in group Female
24696.79 21357.14
While we can see that the mean salary for males ($24,696.79) differs from that for females ($21,357.14), the \(P\)-value of .071 tells us that a difference this large would not be so unlikely under the null hypothesis as to force us to reject it. We are thus unable to reject the null hypothesis at the conventional .05 level.
Now I’ll fit a multiple regression model and get a 95% confidence interval for the difference in mean salaries.
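A sketch of the model fit, using the formula shown in the output below; fitS is the object name referenced by the confint() call further down:
# fit the multiple regression model
fitS <- lm(salary ~ degree + rank + sex + year + ysdeg, data = salary)
# view the model summary
summary(fitS)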
Call:
lm(formula = salary ~ degree + rank + sex + year + ysdeg, data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15746.05 800.18 19.678 < 0.0000000000000002 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankAssoc 5292.36 1145.40 4.621 0.000032163431 ***
rankProf 11118.76 1351.77 8.225 0.000000000162 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 0.000008653790 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 0.00000000000000022
# get CI
confint(fitS, "sexFemale", level = .95)
2.5 % 97.5 %
sexFemale -697.8183 3030.565
The baseline category for sex is male, which we know because sexFemale is included in the above regression output. The coefficient for sexFemale tells us that, controlling for all other variables, a female professor is predicted to earn an average of $1,166.37 more than a male professor.
The 95% confidence interval runs from −$697.82 to $3,030.57. This tells us that, controlling for the other variables, a female professor is predicted to earn anywhere from $697.82 less than a male professor to $3,030.57 more. Because the confidence interval includes zero, the data are consistent with male professors earning more, with female professors earning more, or with no difference between them.
We are thus unable to reject the null hypothesis at this time. This corresponds with the conclusion reached based on the \(P\)-value.
We can see the coefficient/slope of each predictor variable by looking at the Estimate column in the summary of our lm() output (above). The coefficient/slope of each predictor indicates the expected mean change in our response variable (salary) for each one-unit increase in that predictor, holding all other variables constant. The Std. Error column gives the standard error of each estimate, the t value column gives the \(t\) statistic, and the Pr(>|t|) column gives the \(P\)-value.
Looking at the predictor variables relative to one another, it appears that one’s rank and number of years in that rank are the most significant predictors of salary. The high \(P\)-values for every other predictor tell us that we cannot rule out the possibility that their effects are zero.
We can see from the summary of our lm() output (above) that the baseline category of the rank variable is Asst (assistant professor), because no rankAsst term appears in the output. R treats the first factor level as the reference (baseline) category and absorbs it into the intercept, which avoids perfect collinearity among the dummy variables. We can use the levels() function to retrieve the factor levels of a variable and the relevel() function to change which level serves as the baseline.
Here I’ll change the baseline category from Assistant Prof. to Full Prof.
# get factor levels
levels(salary$rank)
[1] "Asst" "Assoc" "Prof"
Call:
lm(formula = salary ~ degree + rankNew + sex + year + ysdeg,
data = salary)
Residuals:
Min 1Q Median 3Q Max
-4045.2 -1094.7 -361.5 813.2 9193.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26864.81 1375.29 19.534 < 0.0000000000000002 ***
degreePhD 1388.61 1018.75 1.363 0.180
rankNewAsst -11118.76 1351.77 -8.225 0.000000000162 ***
rankNewAssoc -5826.40 1012.93 -5.752 0.000000727809 ***
sexFemale 1166.37 925.57 1.260 0.214
year 476.31 94.91 5.018 0.000008653790 ***
ysdeg -124.57 77.49 -1.608 0.115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2398 on 45 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.8357
F-statistic: 44.24 on 6 and 45 DF, p-value: < 0.00000000000000022
Two things to note: first, the rank coefficients now express how assistant and associate professors compare with the new baseline, full professor, which is why they are negative; second, all of the other estimates, standard errors, \(R^2\), and the overall \(F\)-statistic are unchanged, because changing the baseline simply reparameterizes the model without changing its fit.
We can also exclude particular variables from a regression model. Here I’ll exclude rank and see what happens.
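A sketch of the fit, matching the Call shown below (fitS3 is a hypothetical object name):
# fit the model with all predictors except rank
fitS3 <- lm(salary ~ . - rank, data = salary)
# view the model summary
summary(fitS3)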
Call:
lm(formula = salary ~ . - rank, data = salary)
Residuals:
Min 1Q Median 3Q Max
-8146.9 -2186.9 -491.5 2279.1 11186.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17183.57 1147.94 14.969 < 0.0000000000000002 ***
degreePhD -3299.35 1302.52 -2.533 0.014704 *
sexFemale -1286.54 1313.09 -0.980 0.332209
year 351.97 142.48 2.470 0.017185 *
ysdeg 339.40 80.62 4.210 0.000114 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3744 on 47 degrees of freedom
Multiple R-squared: 0.6312, Adjusted R-squared: 0.5998
F-statistic: 20.11 on 4 and 47 DF, p-value: 0.000000001048
Ignoring rank altogether changes the model rather dramatically. The coefficients for having a PhD, being female, and years since highest degree have all reversed sign. In this model, the predictors with the smallest \(P\)-values are years since highest degree (ysdeg), years in current rank (year), and having a PhD.
Placed in the context of an investigation of sex discrimination, I’m not sure we could justify ignoring rank. Discrimination of any kind could easily manifest itself in the promotion of one group over another, indicating that we should control for it (i.e. include it in our regression model) instead of ignoring it.
If the claim is that which dean one was hired by is a predictor of salary, a logical new variable would be the binary variable dean, with factor levels “new” and “old.” I’ll use the mutate() function to create this new variable, assigning each professor a category based on when he/she was hired. Those hired in the past fifteen years will be assigned to the new dean and those hired prior (over fifteen years ago) to the old dean.
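A sketch of how this could be done with dplyr; I’m assuming the cutoff is ysdeg <= 15 (hired within the last fifteen years), which is consistent with the rows shown below, and salaryD is the name used in the later model calls:
# load dplyr for mutate()
library(dplyr)
# assign each professor to the new or old dean based on the assumed ysdeg <= 15 cutoff
salaryD <- salary %>%
  mutate(dean = factor(ifelse(ysdeg <= 15, "new", "old")))
# view the first twenty rows
head(salaryD, 20)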
degree rank sex year ysdeg salary dean
1 Masters Prof Male 25 35 36350 old
2 Masters Prof Male 13 22 35350 old
3 Masters Prof Male 10 23 28200 old
4 Masters Prof Female 7 27 26775 old
5 PhD Prof Male 19 30 33696 old
6 Masters Prof Male 16 21 28516 old
7 PhD Prof Female 0 32 24900 old
8 Masters Prof Male 16 18 31909 old
9 PhD Prof Male 13 30 31850 old
10 PhD Prof Male 13 31 32850 old
11 Masters Prof Male 12 22 27025 old
12 Masters Assoc Male 15 19 24750 old
13 Masters Prof Male 9 17 28200 old
14 PhD Assoc Male 9 27 23712 old
15 Masters Prof Male 9 24 25748 old
16 Masters Prof Male 7 15 29342 new
17 Masters Prof Male 13 20 31114 old
18 PhD Assoc Male 11 14 24742 new
19 PhD Assoc Male 10 15 22906 new
20 PhD Prof Male 6 21 24450 old
Now I can fit a new model that includes this new variable.
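A sketch of the fit, matching the Call shown below (fitD is a hypothetical object name):
# fit a model with all variables, including dean
fitD <- lm(salary ~ ., data = salaryD)
# view the model summary
summary(fitD)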
Call:
lm(formula = salary ~ ., data = salaryD)
Residuals:
Min 1Q Median 3Q Max
-3621.2 -1336.8 -271.6 530.1 9247.6
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15516.79 814.81 19.043 < 0.0000000000000002 ***
degreePhD 1135.00 1031.16 1.101 0.277
rankAssoc 5234.01 1138.47 4.597 0.000035985932 ***
rankProf 11411.45 1362.02 8.378 0.000000000116 ***
sexFemale 1084.09 921.49 1.176 0.246
year 460.35 95.09 4.841 0.000016263785 ***
ysdeg -47.86 97.71 -0.490 0.627
deanold -1749.09 1372.83 -1.274 0.209
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2382 on 44 degrees of freedom
Multiple R-squared: 0.8602, Adjusted R-squared: 0.838
F-statistic: 38.68 on 7 and 44 DF, p-value: < 0.00000000000000022
We can see from the regression output that the baseline category is those hired by the new dean. Based on this model, we expect those hired by the old dean to earn an average of $1,749.09 less than those hired by the new dean. This difference is not statistically significant, however, so based on this model we would not be able to reject the null hypothesis that the hiring dean has no effect on salary.
It is possible that there is some multicollinearity within this model. Multicollinearity means that some or all of the predictor variables in a multiple regression model are highly correlated with one another; here, dean was constructed directly from ysdeg, so the two are bound to be strongly related. One symptom is that adding dean barely improves the fit: \(R^2\) was .855 in our first model and is .860 in this newest one, an indication that the new variable adds little independent predictive power.
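One way to check this, sketched here assuming the car package is available and using the hypothetical object name fitD from above, is to look at variance inflation factors and at how strongly dean tracks ysdeg:
# load car for variance inflation factors
library(car)
# generalized VIFs (GVIFs), since the model contains factor predictors
vif(fitD)
# correlation between ysdeg and an indicator for being hired under the old dean
cor(salaryD$ysdeg, as.numeric(salaryD$dean == "old"))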
It’s also worth noting that larger sample sizes can help mitigate the problems posed by multicollinearity. A general rule is that a dataset should have 10x as many observations as predictor variables contained within the model. According to this guideline, we would ideally have 60 observations for our new model with 6 variables. Since we have only 52, we may be entering into the territory of a too-small sample size.
Given the potential for multicollinearity, I’ll fit a new model removing the variable ysdeg and compare the two.
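A sketch of the fit, matching the Call shown below (fitD2 is a hypothetical object name):
# refit the dean model without ysdeg
fitD2 <- lm(salary ~ . - ysdeg, data = salaryD)
# view the model summary
summary(fitD2)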
Call:
lm(formula = salary ~ . - ysdeg, data = salaryD)
Residuals:
Min 1Q Median 3Q Max
-3403.3 -1387.0 -167.0 528.2 9233.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15491.84 806.32 19.213 < 0.0000000000000002 ***
degreePhD 818.93 797.48 1.027 0.3100
rankAssoc 4972.66 997.17 4.987 0.00000961362451 ***
rankProf 11096.95 1191.00 9.317 0.00000000000454 ***
sexFemale 907.14 840.54 1.079 0.2862
year 434.85 78.89 5.512 0.00000164625970 ***
deanold -2163.46 1072.04 -2.018 0.0496 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2362 on 45 degrees of freedom
Multiple R-squared: 0.8594, Adjusted R-squared: 0.8407
F-statistic: 45.86 on 6 and 45 DF, p-value: < 0.00000000000000022
We can see that the adjusted \(R^2\) value of this final model is 0.8407, which is higher than the adjusted \(R^2\) values for the original model (0.8357) and the model with both ysdeg and dean (0.838). This is a good indication that this final model is the best-fitting of the three.
First I’ll inspect the data.
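Assuming the house selling price data have already been loaded as hSP (the name used in the model calls below), the structure can be inspected with str():
# inspect the structure of the house selling price data
str(hSP)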
'data.frame': 100 obs. of 7 variables:
$ case : int 1 2 3 4 5 6 7 8 9 10 ...
$ Taxes: int 3104 1173 3076 1608 1454 2997 4054 3002 6627 320 ...
$ Beds : int 4 2 4 3 3 3 3 3 5 3 ...
$ Baths: int 2 1 2 2 3 2 2 2 4 2 ...
$ New : int 0 0 0 0 0 1 0 1 0 0 ...
$ Price: int 279900 146500 237700 200000 159900 499900 265500 289900 587000 70000 ...
$ Size : int 2048 912 1654 2068 1477 3153 1355 2075 3990 1160 ...
It’s worth noting that while the variable New is conceptually a categorical (dummy) variable, R has read it in as an integer. I’m not sure whether that will matter for my analysis, but I’ll bear in mind that I may need to convert it to a factor at some point.
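A sketch of the fit whose summary appears below (fitP is a hypothetical object name):
# fit a model with New and Size as predictors
fitP <- lm(Price ~ New + Size, data = hSP)
# view the model summary
summary(fitP)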
Call:
lm(formula = Price ~ New + Size, data = hSP)
Residuals:
Min 1Q Median 3Q Max
-205102 -34374 -5778 18929 163866
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -40230.867 14696.140 -2.738 0.00737 **
New 57736.283 18653.041 3.095 0.00257 **
Size 116.132 8.795 13.204 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 53880 on 97 degrees of freedom
Multiple R-squared: 0.7226, Adjusted R-squared: 0.7169
F-statistic: 126.3 on 2 and 97 DF, p-value: < 0.00000000000000022
The coefficient/slope for the predictor variable New is 57736.283, indicating that, holding size constant, a new house is predicted to sell for an average of $57,736.28 more than a comparable house that is not new.
The coefficient/slope for the predictor variable Size is 116.132, indicating that, holding New constant, we would expect an average increase in selling price of $116.13 for each one-square-foot increase in house size.
Both coefficients are statistically significant at the .01 level, so we can be confident in saying that neither the effect of size nor the effect of being new is zero.
New is a binary (or dummy) variable where 1 = “new” and 0 = “not new”. Thus our regression equation will look a little different from the case in which all predictors are quantitative. Since New has two levels (“new” and “not new”), our equation needs only one \(z\) term.
Starting with \[E(y)=\alpha+\beta{x}+\beta_1{z}\] we get \[{estimated\;selling\;price}=-40,230.867+116.132*size+57,736.283*z\] where \(z\)=1 if the house is new and \(z\)=0 if it’s not new.
If the house is new then the final term of the prediction equation will be 57,736.283 and the equation can be simplified to \[{estimated\;selling\;price}=17,505.42+116.132*size\] If the house is not new then the final term of the above equation will be 0 and the equation can be simplified to \[{estimated\;selling\;price}=-40,230.867+116.132*size\]
In both cases, the effect of size remains the same: a one-square-foot increase in size predicts an average increase in price of $116.13. Both coefficients/slopes are statistically significant, so we can be confident that neither is zero. Put more simply, a house that is new will be more expensive than a comparable one that is not, but for both new and not-new houses, bigger houses will likely be more expensive.
We can calculate the price of a 3,000-square-foot house that is new as follows \[{estimated\;selling\;price}=17,505.42+116.132*3,000\]
and for a house that is not new as follows \[{estimated\;selling\;price}=-40,230.867+116.132*3,000\]
# assign objects
a <- -40230.867
b1 <- 116.132
bZ <- 57736.283
# calculate estimated price if new
priceNew <- a + (b1*3000) + (bZ*1)
# calculate price if not new
priceNotNew <- a + (b1*3000) + (bZ*0)
Thus, the estimated price for a 3,000-square-foot house that is new is $365,901.42. The estimated price for a 3,000-square-foot house that is not new is $308,165.13.
In addition to allowing us to include multiple predictor variables, multiple regression allows us to indicate interaction terms by creating cross-product terms. An interaction between two predictor variables exists when the value of one changes the effect of the other on the response variable.
I’ll now fit a model in which size interacts with whether the house is new or not new.
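A sketch of the fit, matching the Call shown below; fitP2 is the object name used in the interact_plot() call further down:
# fit a model with an interaction between Size and New
fitP2 <- lm(Price ~ Size * New, data = hSP)
# view the model summary
summary(fitP2)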
Call:
lm(formula = Price ~ Size * New, data = hSP)
Residuals:
Min 1Q Median 3Q Max
-175748 -28979 -6260 14693 192519
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22227.808 15521.110 -1.432 0.15536
Size 104.438 9.424 11.082 < 0.0000000000000002 ***
New -78527.502 51007.642 -1.540 0.12697
Size:New 61.916 21.686 2.855 0.00527 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52000 on 96 degrees of freedom
Multiple R-squared: 0.7443, Adjusted R-squared: 0.7363
F-statistic: 93.15 on 3 and 96 DF, p-value: < 0.00000000000000022
Looking at the output of our regression model, the interaction term Size:New is statistically significant, which indicates that the effect of size on selling price differs between new and not-new houses.
To see this interaction, we can plot the prediction lines for new houses and for houses that are not new separately.
# load the interactions package, which provides interact_plot()
library(interactions)
# plot predicted price against size separately for new and not-new houses
interact_plot(fitP2, pred = Size, modx = New, data = hSP, modx.labels = c("Not New", "New"))
This makes it easy to see what the summary of our lm() output (above) tells us: an increase in size predicts an increase in selling price for both houses that are new and those that are not, but being new confers an additional increase in selling price of $61.92 per one-square-foot increase in house size. The larger the house, the larger the difference in price based on being new or not.
Based on our new regression output, our new prediction equation is \[{estimated\;selling\;price}=-22,227.808+104.438*size-78,527.502*z+61.916*size*z\] where \(z\)=1 if the house is new and \(z\)=0 if it’s not.
# assign objects
a <- -22227.808
b1 <- 104.438
b2 <- -78527.502
b3 <- 61.916
# calculate price if new
priceNew2 <- a + (b1 * 3000) + (b2 * 1) + (b3 * 3000 * 1)
# calculate price if not new
priceNotNew2 <- a + (b1 * 3000) + (b2 * 0) + (b3 * 3000 * 0)
Thus, when we include the interaction term \(size*new\), the estimated price for a 3,000-square-foot house that is new is $398,306.69 and the estimated price for a 3,000-square-foot house that is not new is $291,086.19.
# calculate price if new
priceNew3 <- a + (b1 * 1500) + (b2 * 1) + (b3 * 1500 * 1)
# calculate price if not new
priceNotNew3 <- a + (b1 * 1500) + (b2 * 0) + (b3 * 1500 * 0)
The estimated price for a 1,500-square-foot house that is new is $148,775.69 and the estimated price for a 1,500-square-foot house that is not new is $134,429.19.
For a 3,000-square-foot house, the difference in price between one that’s new and one that’s not is $107,220.50.
For a 1,500-square-foot house, the difference in price between one that’s new and one that’s not is $14,346.50.
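These differences can be verified directly from the objects computed above:
# difference between new and not new at 3,000 square feet
priceNew2 - priceNotNew2
# difference between new and not new at 1,500 square feet
priceNew3 - priceNotNew3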
This fits with what’s represented in the above plot. The difference between the lines for new houses and not new houses is relatively small around the 1,500-square-foot mark on the \(x\)-axis and much larger around the 3,000-square-foot mark. As the houses get bigger in size, the impact of whether the house is new or not new increases.
The model with the interaction between size and new seems more appropriate to me, for a few reasons.