Sample: Linear Regression

Using linear regression and evaluating models

Megan Georges
2022-08-25

Fitting Model for House Selling Prices

(modified practice question from Alan Agresti’s Statistical Methods for the Social Sciences - 5th Edition, questions 14.3 & 14.4)

(Data: house.selling.price.2 from smss package)

house.selling.price.2 correlation matrix and model fit using four predictors of selling price

With these four predictors…

For backward elimination, which variable would be deleted first?

data("house.selling.price.2")
head(house.selling.price.2)
      P    S Be Ba New
1  48.5 1.10  3  1   0
2  55.0 1.01  3  2   0
3  68.0 1.45  3  2   0
4 137.0 2.40  3  3   0
5 309.4 3.30  4  3   1
6  17.5 0.40  1  1   0
PriceALL <- lm(P ~ ., data = house.selling.price.2)
summary(PriceALL)

Call:
lm(formula = P ~ ., data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16

For backward elimination, you begin by fitting a model that uses all available explanatory variables to predict the outcome (in this case, selling price). One by one, you delete the least significant explanatory variable remaining in the model, i.e., the one with the largest P-value. In this example, we would delete Beds first, since it has the largest P-value (0.487).
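As a quick check, drop1() refits the model leaving each predictor out in turn and reports an F test for each (for a single-df term this is equivalent to the t test in the summary above):

drop1(PriceALL, test = "F")
# Be should show the smallest F statistic and largest P-value (0.487),
# confirming it is the first candidate for deletion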

For forward selection, which variable would be added first?

With forward selection, you begin with no explanatory variables, then add one variable at a time to the model. The variable added at each step should be the most significant one, i.e., the one with the smallest P-value among the candidate explanatory variables. The book (SMSS) notes that this is equivalently the variable with the largest t-statistic and the largest increase in R-squared. In this example, the first variable added to the model is Size, given its extremely small P-value (< 2e-16).
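The same step can be run mechanically with add1(), which compares each candidate variable against a starting model; a minimal sketch, starting from an intercept-only model:

PriceOnly <- lm(P ~ 1, data = house.selling.price.2)  # intercept-only starting model
add1(PriceOnly, scope = ~ S + Be + Ba + New, test = "F")
# S should show the largest F statistic and the largest drop in RSS,
# so it is the first variable added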

Why do you think that BEDS has such a large P-value in the multiple regression model, even though it has a substantial correlation with PRICE?

While Beds does have a substantial bivariate correlation with Price, that relationship diminishes sharply once the other explanatory variables enter the model. Beds is itself correlated with the other predictors (notably Size), so after controlling for them, Beds carries little additional information about selling price; this collinearity is what produces its large P-value in the multiple regression.
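One way to see this is the correlation matrix (a sketch; the exact values come from the data, but Be should correlate substantially with both P and S):

cor(house.selling.price.2)
# Be should show a sizable correlation with P, but also with S,
# which is why its partial effect shrinks in the multiple regression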

Using Various Criteria to Find Best Model

Using software with these four predictors, find the model that would be selected using each criterion.

PriceOnly <- lm(P ~ 1, data = house.selling.price.2)
PriceALL <- lm(P ~ ., data = house.selling.price.2)
Stepwise <- step(PriceOnly, direction = "both", scope = formula(PriceALL))
Start:  AIC=705.63
P ~ 1

       Df Sum of Sq    RSS    AIC
+ S     1    145097  34508 554.22
+ Ba    1     91484  88121 641.41
+ Be    1     62578 117028 667.79
+ New   1     22833 156772 694.99
<none>              179606 705.63

Step:  AIC=554.22
P ~ S

       Df Sum of Sq    RSS    AIC
+ New   1      7275  27234 534.20
+ Ba    1      4476  30033 543.30
<none>               34508 554.22
+ Be    1        40  34468 556.11
- S     1    145097 179606 705.63

Step:  AIC=534.2
P ~ S + New

       Df Sum of Sq    RSS    AIC
+ Ba    1      3550  23684 523.21
+ Be    1       589  26645 534.17
<none>               27234 534.20
- New   1      7275  34508 554.22
- S     1    129539 156772 694.99

Step:  AIC=523.21
P ~ S + New + Ba

       Df Sum of Sq   RSS    AIC
<none>              23684 523.21
+ Be    1       131 23553 524.70
- Ba    1      3550 27234 534.20
- New   1      6349 30033 543.30
- S     1     54898 78582 632.75
summary(Stepwise)

Call:
lm(formula = P ~ S + New + Ba, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.804  -9.496   0.917   7.931  73.338 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
S             62.263      4.335  14.363  < 2e-16 ***
New           18.371      3.761   4.885 4.54e-06 ***
Ba            20.072      5.495   3.653 0.000438 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared:  0.8681,    Adjusted R-squared:  0.8637 
F-statistic: 195.3 on 3 and 89 DF,  p-value: < 2.2e-16

Using stepwise regression, the strongest model for predicting selling price uses Size, New, and Bath as explanatory variables. Next, I’ll compare R-squared and adjusted R-squared for the model using all explanatory variables and the model that excludes Beds.

R-squared:

summary(PriceALL)

Call:
lm(formula = P ~ ., data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-36.212  -9.546   1.277   9.406  71.953 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -41.795     12.104  -3.453 0.000855 ***
S             64.761      5.630  11.504  < 2e-16 ***
Be            -2.766      3.960  -0.698 0.486763    
Ba            19.203      5.650   3.399 0.001019 ** 
New           18.984      3.873   4.902  4.3e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.36 on 88 degrees of freedom
Multiple R-squared:  0.8689,    Adjusted R-squared:  0.8629 
F-statistic: 145.8 on 4 and 88 DF,  p-value: < 2.2e-16
ElimBed <- lm(P ~ . - Be, data = house.selling.price.2)
summary(ElimBed)

Call:
lm(formula = P ~ . - Be, data = house.selling.price.2)

Residuals:
    Min      1Q  Median      3Q     Max 
-34.804  -9.496   0.917   7.931  73.338 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -47.992      8.209  -5.847 8.15e-08 ***
S             62.263      4.335  14.363  < 2e-16 ***
Ba            20.072      5.495   3.653 0.000438 ***
New           18.371      3.761   4.885 4.54e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.31 on 89 degrees of freedom
Multiple R-squared:  0.8681,    Adjusted R-squared:  0.8637 
F-statistic: 195.3 on 3 and 89 DF,  p-value: < 2.2e-16

As expected, the model with the most explanatory variables (Size, New, Bed, Bath) has the highest R-squared value at 0.8689. Therefore, if you were to select a model solely based on maximizing the R-squared value, it would be: \(\hat{y}\) = -41.79 + 64.76(Size) - 2.77(Beds) + 19.2(Baths) + 18.98(New)

PriceALL$coefficients
(Intercept)           S          Be          Ba         New 
 -41.794504   64.760606   -2.765564   19.203105   18.984237 

Adjusted R-squared:

However, if you were to select a model based on adjusted R-squared, which corrects for R-squared’s upward bias as more variables are added, the best model for predicting selling price would exclude Beds and use Size, Baths, and New as explanatory variables. The adjusted R-squared value sees a slight increase when Beds is removed (from 0.8629 to 0.8637). The model would be: \(\hat{y}\) = -47.99 + 62.26(Size) + 20.07(Baths) + 18.37(New)

ElimBed$coefficients
(Intercept)           S          Ba         New 
  -47.99179    62.26268    20.07203    18.37096 
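For reference, the adjustment applies the penalty \(R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}\), where n is the number of observations and p the number of predictors. With n = 93 (88 residual df + 4 predictors + 1 intercept), the full model’s value can be reproduced by hand:

1 - (1 - 0.8689) * (93 - 1) / (93 - 4 - 1)  # ≈ 0.8629, matching summary(PriceALL)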

PRESS:

MPV::PRESS(PriceALL)
[1] 28390.22
MPV::PRESS(ElimBed)
[1] 27860.05

When considering PRESS, a smaller value indicates a better predictive model. Comparing the two models, the lower PRESS for the model excluding Beds would again lead us to select Size, Baths, and New as the variables for predicting selling price.
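PRESS is the sum of squared leave-one-out prediction errors. For a linear model it can be computed directly from the ordinary residuals and the hat (leverage) values, a useful check if the MPV package is unavailable (a minimal sketch):

press <- function(model) {
  e <- residuals(model)     # ordinary residuals
  h <- hatvalues(model)     # leverages
  sum((e / (1 - h))^2)      # squared leave-one-out residuals, summed
}
press(PriceALL)  # should match MPV::PRESS(PriceALL): 28390.22
press(ElimBed)   # should match MPV::PRESS(ElimBed): 27860.05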

AIC:

AIC(PriceALL)
[1] 790.6225
AIC(ElimBed)
[1] 789.1366

When considering the AIC for both models, the value is slightly lower for the model that excludes Beds. Therefore, the AIC would lead us to use the model with Size, Baths, and New as explanatory variables to predict selling price.
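Note that AIC() and the AIC column printed by step() are on different scales: step() uses extractAIC(), which drops additive constants from the log-likelihood. The two differ only by a constant for a fixed dataset, so model rankings agree:

extractAIC(ElimBed)  # edf and AIC on step()'s scale: 4, 523.21, as in the step trace above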

BIC:

BIC(PriceALL)
[1] 805.8181
BIC(ElimBed)
[1] 801.7996

Lastly, like AIC, the BIC value is lower for the model that excludes Bed as a variable. Once again, we’d select the model that uses Size, Baths, and New as explanatory variables to predict selling price.
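BIC is AIC with the per-parameter penalty of 2 replaced by log(n); with n = 93, log(93) ≈ 4.53, so BIC penalizes extra parameters more heavily than AIC. In R the relationship can be checked through AIC()’s k argument:

n <- nrow(house.selling.price.2)  # 93
AIC(ElimBed, k = log(n))          # identical to BIC(ElimBed): 801.7996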

Selected Model

Given the results from the various criteria above, the model I would prefer for predicting selling price is the one that excludes Beds and uses Size, Baths, and New as variables: \(\hat{y}\) = -47.99 + 62.26(Size) + 20.07(Baths) + 18.37(New). Every criterion except R-squared (which cannot be used alone to judge model strength, since it never decreases as variables are added) indicates that this model is slightly stronger in predictive power than the model including all four variables.


Multiple Regression with Diagnostic Plots

(Data file: trees from base R)

From the documentation: “This data set provides measurements of the diameter, height and volume of timber in 31 felled black cherry trees. Note that the diameter (in inches) is erroneously labelled Girth in the data. It is measured at 4 ft 6 in above the ground.”

Use the trees data to build a basic model of tree volume prediction. In particular…

data("trees")
head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

Fit a multiple regression model with the Volume as the outcome and Girth and Height as the explanatory variables.

VOL <- lm(Volume ~ Girth + Height, data = trees)
summary(VOL)

Call:
lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4065 -2.6493 -0.2876  2.2003  8.4847 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-squared:  0.948, Adjusted R-squared:  0.9442 
F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16

\(\hat{y}\) = -57.99 + 4.71(Girth) + 0.34(Height)
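To use the fitted model for prediction, predict() returns estimated volumes for new trees (a sketch with hypothetical girth and height values):

new_tree <- data.frame(Girth = 12, Height = 75)  # hypothetical tree
predict(VOL, newdata = new_tree, interval = "prediction")
# point estimate: -57.99 + 4.71 * 12 + 0.34 * 75 ≈ 24.0 cubic feet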

Run regression diagnostic plots on the model. Based on the plots, do you think any of the regression assumptions is violated?

par(mfrow = c(2, 3)) 
plot(VOL, which = 1:6)

Based on the residuals vs. fitted values plot, the central points bounce roughly randomly above and below 0, but the points at the lowest and highest fitted values have large residuals. The red trend line should run flat along 0, but instead it is U-shaped; this curvature suggests a violation of the linearity assumption.

With the normal Q-Q plot, it’s difficult to say confidently that the normality assumption is violated. The points generally follow the reference line, but the upper points deviate above it. The deviation is noteworthy, but it’s hard to reach a firm conclusion from this plot alone.

In the scale-location plot, the line is not horizontal (the trend decreases and then increases), suggesting a violation of the constant variance (homoscedasticity) assumption.

The Cook’s distance values (seen in the Cook’s distance and residuals vs. leverage plots) show that the 31st observation exceeds the 4/n threshold (the red dashed line), meaning that this single observation has an outsized influence on the fitted model.
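The flagged observation can be confirmed numerically (a sketch; observation 31 is the largest tree in the data):

cd <- cooks.distance(VOL)
which(cd > 4 / nrow(trees))  # observations above the 4/n threshold; 31 should appear
cd[31]                       # Cook's distance for the flagged tree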


Simple Linear Regression with Diagnostic Plots

(modified practice question from Sanford Weisberg’s Applied Linear Regression - 4th Edition, question 9.16)

(Data: florida in alr package)

Background: “In the 2000 election for U.S. president, the counting of votes in Florida was controversial. In Palm Beach County in south Florida, for example, voters used a so-called butterfly ballot. Some believe that the layout of the ballot caused some voters to cast votes for Buchanan when their intended choice was Gore.”

Simple Linear Regression Model

The Buchanan vote is the outcome and the Bush vote is the explanatory variable

data("florida")
head(florida)
           Gore   Bush Buchanan
ALACHUA   47300  34062      262
BAKER      2392   5610       73
BAY       18850  38637      248
BRADFORD   3072   5413       65
BREVARD   97318 115185      570
BROWARD  386518 177279      789
SLR <- lm(Buchanan ~ Bush, data = florida)
summary(SLR)

Call:
lm(formula = Buchanan ~ Bush, data = florida)

Residuals:
    Min      1Q  Median      3Q     Max 
-907.50  -46.10  -29.19   12.26 2610.19 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.529e+01  5.448e+01   0.831    0.409    
Bush        4.917e-03  7.644e-04   6.432 1.73e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 353.9 on 65 degrees of freedom
Multiple R-squared:  0.3889,    Adjusted R-squared:  0.3795 
F-statistic: 41.37 on 1 and 65 DF,  p-value: 1.727e-08

Regression Diagnostic Plots

par(mfrow = c(2, 3)) 
plot(SLR, which = 1:6)

Is Palm Beach County an outlier based on the diagnostic plots? Why or why not?

Based on the diagnostic plots, Palm Beach County is an outlier. In the residuals vs. fitted plot, the Palm Beach County residual is very large (over 2,000); in the model summary, the third quartile of the residuals is 12.26, yet the maximum is 2610.19, a jump indicative of an outlier. The normal Q-Q plot likewise shows that the residuals are generally normal except for Palm Beach County, which deviates sharply from the line. The Cook’s distance plot flags two points of concern, DADE and PALM BEACH, at roughly 2, well beyond either common threshold (a Cook’s distance over 1, or 4/n = 4/67 ≈ 0.06 with 67 observations). Finally, the residuals vs. leverage plot places the Palm Beach County standardized residual beyond the dashed Cook’s distance contour. Together these plots indicate that Palm Beach County is an outlier with the potential to strongly influence the regression model.
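These visual impressions can be confirmed numerically (a sketch):

cd <- cooks.distance(SLR)
sort(cd, decreasing = TRUE)[1:3]  # PALM BEACH and DADE should top the list
which(cd > 4 / nrow(florida))     # counties above the 4/n ≈ 0.06 threshold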