1 A simple routine to find optimal predictive models based on a given dataset

1.1 Objective

Our overall objective is to find a (linear) regression model that helps us predict the price of homes from a given dataset.

I want my predictions to be as accurate as possible (to make as few errors as possible).

I want my prediction model to be as simple and easy to understand as possible.

With some prior knowledge of house hunting, I believe a priori that the size (SqFt) and location (zip code) of a house would be important information.

In regression lingo, home price is the response variable. The factors we use to predict home prices are called predictor variables.

1.2 Loading the dataset and required packages

# load the ggplot2 package
library(ggplot2)
# import dataset house
house <- read.csv("/cloud/project/house.csv")

# restrict the data to houses below 5,000 SqFt and $600,000
house = house[house$SqFt < 5000 & house$Price < 6e+5, ]

We will learn the following new packages:

#install and library 2 packages for summarizing regression results
library(leaps) #command regsubsets()
library(sjPlot) #command tab_model()
library(texreg) #command screenreg(), or texreg()
## Version:  1.36.23
## Date:     2017-03-03
## Author:   Philip Leifeld (University of Glasgow)
## 
## Please cite the JSS article in your publications -- see citation("texreg").
library(performance) #command compare_performance()

1.3 Using package leaps to find an optimal predictive model

To apply the routine, we make a list of all information we will be considering for our regression:

  • The response variable: Price
  • Candidates for predictor variables: SqFt, factor(Zip), BedRooms, Baths, and Garage
  • The Price-per-SqFt relationship could change by factor(Zip): SqFt*factor(Zip)
  • The Price-per-SqFt relationship could change for houses with more than 1 garage: SqFt*(Garage>1)
  • The Price-per-SqFt relationship could change for houses with more than 2 bathrooms: SqFt*(Baths>2)

The number of potential predictors we are considering is 7 + 3 + 2 + 2 = 14. (How does this work? The five base variables expand to 7 columns because factor(Zip) contributes 3 dummy variables; SqFt*factor(Zip) adds 3 interaction terms; and each of SqFt*(Garage>1) and SqFt*(Baths>2) adds an indicator plus its interaction with SqFt, for 2 terms apiece.)
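As a sanity check on this count, we can build the design matrix of the complete model and count its columns. A minimal sketch, assuming the house dataset loaded above:

# build the design matrix of the complete model; each predictor becomes one column
X = model.matrix(Price ~ SqFt + factor(Zip) + BedRooms + Baths + Garage +
                   SqFt*factor(Zip) + SqFt*(Garage > 1) + SqFt*(Baths > 2),
                 data = house)
ncol(X) - 1  # subtract the intercept column; should return 14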

#use command regsubsets from the leaps package
library(leaps)

#considering all models that could be extracted from the complete model
#Price~SqFt+BedRooms+Baths+Garage+factor(Zip)+SqFt*factor(Zip)+SqFt*(Garage>1)+SqFt*(Baths>2)

#run command regsubsets and save results in object opmodels
opmodels = regsubsets(Price ~SqFt+factor(Zip)+BedRooms+Baths+Garage+
                       SqFt*factor(Zip)+SqFt*(Garage>1)+SqFt*(Baths>2),
                    # enter dataset
                    data = house, 
                    # a couple of technical options
                    nbest = 1, method = "seqrep")

#to print results, use summary()
summary(opmodels)
## Subset selection object
## Call: regsubsets.formula(Price ~ SqFt + factor(Zip) + BedRooms + Baths + 
##     Garage + SqFt * factor(Zip) + SqFt * (Garage > 1) + SqFt * 
##     (Baths > 2), data = house, nbest = 1, method = "seqrep")
## 14 Variables  (and intercept)
##                     Forced in Forced out
## SqFt                    FALSE      FALSE
## factor(Zip)5            FALSE      FALSE
## factor(Zip)6            FALSE      FALSE
## factor(Zip)9            FALSE      FALSE
## BedRooms                FALSE      FALSE
## Baths                   FALSE      FALSE
## Garage                  FALSE      FALSE
## Garage > 1TRUE          FALSE      FALSE
## Baths > 2TRUE           FALSE      FALSE
## SqFt:factor(Zip)5       FALSE      FALSE
## SqFt:factor(Zip)6       FALSE      FALSE
## SqFt:factor(Zip)9       FALSE      FALSE
## SqFt:Garage > 1TRUE     FALSE      FALSE
## SqFt:Baths > 2TRUE      FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: 'sequential replacement'
##          SqFt factor(Zip)5 factor(Zip)6 factor(Zip)9 BedRooms Baths Garage
## 1  ( 1 ) "*"  " "          " "          " "          " "      " "   " "   
## 2  ( 1 ) "*"  " "          " "          " "          " "      "*"   " "   
## 3  ( 1 ) "*"  " "          " "          " "          " "      "*"   " "   
## 4  ( 1 ) "*"  " "          "*"          " "          " "      "*"   " "   
## 5  ( 1 ) "*"  " "          "*"          " "          " "      "*"   "*"   
## 6  ( 1 ) " "  " "          "*"          " "          " "      "*"   "*"   
## 7  ( 1 ) " "  " "          "*"          " "          " "      "*"   "*"   
## 8  ( 1 ) " "  " "          "*"          " "          " "      "*"   "*"   
##          Garage > 1TRUE Baths > 2TRUE SqFt:factor(Zip)5 SqFt:factor(Zip)6
## 1  ( 1 ) " "            " "           " "               " "              
## 2  ( 1 ) " "            " "           " "               " "              
## 3  ( 1 ) " "            " "           " "               "*"              
## 4  ( 1 ) " "            " "           " "               "*"              
## 5  ( 1 ) " "            " "           " "               "*"              
## 6  ( 1 ) "*"            " "           " "               "*"              
## 7  ( 1 ) "*"            " "           "*"               "*"              
## 8  ( 1 ) "*"            "*"           " "               "*"              
##          SqFt:factor(Zip)9 SqFt:Garage > 1TRUE SqFt:Baths > 2TRUE
## 1  ( 1 ) " "               " "                 " "               
## 2  ( 1 ) " "               " "                 " "               
## 3  ( 1 ) " "               " "                 " "               
## 4  ( 1 ) " "               " "                 " "               
## 5  ( 1 ) " "               " "                 " "               
## 6  ( 1 ) " "               "*"                 " "               
## 7  ( 1 ) " "               "*"                 " "               
## 8  ( 1 ) " "               "*"                 "*"

The command regsubsets of the leaps package:

  • Searches the possible combinations of the 14 predictor variables.
  • Among possible models with 1 predictor variable, the best predictor is SqFt.
  • Among possible models with 2 predictor variables, the best combo is SqFt & Baths.
  • Among possible models with 3 predictor variables, the best combo is SqFt, Baths & SqFt:factor(Zip)6 (the SqFt slope for homes in zip code 6).

regsubsets reports the best model of each size up to 8 predictors; it stops at 8 because its option nvmax defaults to 8.
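To see how the fit changes with model size, we can inspect the selection criteria that summary() stores for each size. A minimal sketch (summary.regsubsets keeps, among others, adjusted R-squared and BIC):

sopt = summary(opmodels)
# adjusted R-squared (higher is better) and BIC (lower is better) for sizes 1 to 8
round(cbind(size = 1:8, adjr2 = sopt$adjr2, bic = sopt$bic), 3)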

For my purpose, I decide that the model with 5 predictor variables is optimal. This combo of 5 predictors is SqFt, Baths, Garage, the zip code 6 indicator, and the interaction between zip code 6 and SqFt.

An R command that fits this optimal regression for me is:

opmod = lm(Price~SqFt*factor(Zip==6)+Baths+Garage, data = house)
summary(opmod)
## 
## Call:
## lm(formula = Price ~ SqFt * factor(Zip == 6) + Baths + Garage, 
##     data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -157677  -18007   -2047   17260  218114 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -29267.366   5844.763  -5.007 7.69e-07 ***
## SqFt                          65.112      3.896  16.714  < 2e-16 ***
## factor(Zip == 6)TRUE      -65883.242  11455.659  -5.751 1.56e-08 ***
## Baths                      21124.707   3639.127   5.805 1.15e-08 ***
## Garage                      9904.628   2750.238   3.601 0.000349 ***
## SqFt:factor(Zip == 6)TRUE     40.453      5.533   7.311 1.08e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35630 on 493 degrees of freedom
## Multiple R-squared:  0.7954, Adjusted R-squared:  0.7934 
## F-statistic: 383.4 on 5 and 493 DF,  p-value: < 2.2e-16

Exercise: Use the regsubsets() command to find the best combination of 3 predictor variables from the variables SqFt+factor(Zip)+BedRooms+Baths+Garage to predict Price from the dataset house.

Please make sure to name your result object something different from opmodels.

#use command regsubsets from the leaps package
library(leaps)

#considering all models that could be extracted from the complete model
#Price~SqFt+factor(Zip)+BedRooms+Baths+Garage

summary(regsubsets(Price ~SqFt+factor(Zip)+BedRooms+Baths+Garage,
                    # enter dataset
                    data = house, 
                    # a couple of technical options
                    nbest = 1, method = "seqrep"))
## Subset selection object
## Call: regsubsets.formula(Price ~ SqFt + factor(Zip) + BedRooms + Baths + 
##     Garage, data = house, nbest = 1, method = "seqrep")
## 7 Variables  (and intercept)
##              Forced in Forced out
## SqFt             FALSE      FALSE
## factor(Zip)5     FALSE      FALSE
## factor(Zip)6     FALSE      FALSE
## factor(Zip)9     FALSE      FALSE
## BedRooms         FALSE      FALSE
## Baths            FALSE      FALSE
## Garage           FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: 'sequential replacement'
##          SqFt factor(Zip)5 factor(Zip)6 factor(Zip)9 BedRooms Baths Garage
## 1  ( 1 ) "*"  " "          " "          " "          " "      " "   " "   
## 2  ( 1 ) "*"  " "          " "          " "          " "      "*"   " "   
## 3  ( 1 ) "*"  " "          "*"          " "          " "      "*"   " "   
## 4  ( 1 ) "*"  " "          "*"          " "          " "      "*"   "*"   
## 5  ( 1 ) "*"  " "          "*"          " "          "*"      "*"   "*"   
## 6  ( 1 ) "*"  "*"          "*"          "*"          "*"      "*"   " "   
## 7  ( 1 ) "*"  "*"          "*"          "*"          "*"      "*"   "*"

The best combination of three predictor variables is SqFt, factor(Zip)6 (the zip code 6 indicator), and Baths.

Exercise: Use the regsubsets() command to find the best combination of 3 predictor variables from the variables SqFt+factor(Zip)+BedRooms+Baths+factor(Garage) to predict Price from the dataset house.

Please make sure to name your result object something different from opmodels.

#use command regsubsets from the leaps package
library(leaps)

#considering all models that could be extracted from the complete model
#Price~SqFt+factor(Zip)+BedRooms+Baths+factor(Garage)

summary(regsubsets(Price ~SqFt+factor(Zip)+BedRooms+Baths+factor(Garage),
                    # enter dataset
                    data = house, 
                    # a couple of technical options
                    nbest = 1, method = "seqrep"))
## Subset selection object
## Call: regsubsets.formula(Price ~ SqFt + factor(Zip) + BedRooms + Baths + 
##     factor(Garage), data = house, nbest = 1, method = "seqrep")
## 9 Variables  (and intercept)
##                 Forced in Forced out
## SqFt                FALSE      FALSE
## factor(Zip)5        FALSE      FALSE
## factor(Zip)6        FALSE      FALSE
## factor(Zip)9        FALSE      FALSE
## BedRooms            FALSE      FALSE
## Baths               FALSE      FALSE
## factor(Garage)1     FALSE      FALSE
## factor(Garage)2     FALSE      FALSE
## factor(Garage)3     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: 'sequential replacement'
##          SqFt factor(Zip)5 factor(Zip)6 factor(Zip)9 BedRooms Baths
## 1  ( 1 ) "*"  " "          " "          " "          " "      " "  
## 2  ( 1 ) "*"  " "          " "          " "          " "      "*"  
## 3  ( 1 ) "*"  " "          " "          " "          " "      "*"  
## 4  ( 1 ) "*"  " "          "*"          " "          " "      "*"  
## 5  ( 1 ) "*"  " "          "*"          " "          "*"      "*"  
## 6  ( 1 ) "*"  " "          "*"          " "          "*"      "*"  
## 7  ( 1 ) "*"  "*"          "*"          "*"          "*"      "*"  
## 8  ( 1 ) "*"  "*"          "*"          "*"          "*"      "*"  
##          factor(Garage)1 factor(Garage)2 factor(Garage)3
## 1  ( 1 ) " "             " "             " "            
## 2  ( 1 ) " "             " "             " "            
## 3  ( 1 ) " "             " "             "*"            
## 4  ( 1 ) " "             " "             "*"            
## 5  ( 1 ) " "             " "             "*"            
## 6  ( 1 ) "*"             " "             "*"            
## 7  ( 1 ) " "             " "             "*"            
## 8  ( 1 ) "*"             "*"             " "

The best combination of three predictor variables is SqFt, Baths, and factor(Garage)3 (the indicator for houses with 3 garages).

1.4 The plan for the rest of the discussion

To conduct and understand this process of model selection with confidence, we will need to master a number of technical skills:

  1. Interpret regression results for categorical and dummy variables.
  2. Interpret regression results for interactions between variables.
  3. Learn criteria for evaluating predictive models.
  4. Combine all the skills above to determine a model that best fits our objectives.

2 Regression with indicator and categorical variables

Categorical variables are mathematically simpler than quantitative variables, but they can be less straightforward to operate on and interpret in a regression setting. This section makes sure you get comfortable with them.

2.1 Review categorical variables

Example: Use the command table() to describe the distribution of the variable Garage. (Distribution means the list of all possible outcomes, here 0 to 3 garages, and the frequencies of the respective outcomes.)

table(house$Garage)
## 
##   0   1   2   3 
##  41  60 350  48

Exercise: Use the command table() to describe the distribution of the variable Baths. (Hint: You’ll need factor(Baths) for table() to understand that it’s dealing with a factor variable.)

table(factor(house$Baths))
## 
##   1 1.5   2 2.5   3 3.5   4 4.5 
##  82  46 166 143  34  19   5   4

2.2 Creating indicator/dummy variables

From these categorical variables we can create indicator variables (aka “dummy variables”). For example, we can create the variable bath2 as follows:

house$bath2 = factor(house$Baths>2)

table(factor(house$Baths))
## 
##   1 1.5   2 2.5   3 3.5   4 4.5 
##  82  46 166 143  34  19   5   4
table(factor(house$bath2))
## 
## FALSE  TRUE 
##   294   205

Observation: The command factor(house$Baths>2) creates a new variable bath2 that takes value TRUE if a house has more than 2 baths and FALSE if a house has 2 or fewer baths.
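A side note: the factor() wrapper is optional for modeling purposes; the raw logical, or an equivalent 0/1 numeric, works in lm() too. A hypothetical alternative, with a variable name of our choosing:

# numeric 0/1 version of the same indicator: 1 = more than 2 baths, 0 otherwise
house$bath2_num = as.integer(house$Baths > 2)
table(house$bath2_num)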

Exercise: create a variable called gar1 to indicate whether a house has more than 1 garage.

house$gar1 = factor(house$Garage>1)

table(factor(house$Garage))
## 
##   0   1   2   3 
##  41  60 350  48
table(factor(house$gar1))
## 
## FALSE  TRUE 
##   101   398

101 houses have 1 or fewer garages and 398 houses have more than 1 garage.

2.3 Regression interpretation of categorical variables

Exercise: Regress Price on Baths to find the average linear prediction of price per bathroom and interpret the estimated coefficient under Baths.

lm(Price~ Baths, data = house)
## 
## Call:
## lm(formula = Price ~ Baths, data = house)
## 
## Coefficients:
## (Intercept)        Baths  
##      -19491        83183

On average, each additional bathroom is associated with an $83,183 increase in predicted price.
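For example, the model’s predicted price for a house with 2 baths is the intercept plus twice the per-bath coefficient:

# prediction for a 2-bath house: intercept + 2 * (price per bath)
-19491 + 2*83183
## [1] 146875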

Exercise: Run a regression of Price on factor variable bath2.

lm(Price~ bath2, data = house)
## 
## Call:
## lm(formula = Price ~ bath2, data = house)
## 
## Coefficients:
## (Intercept)    bath2TRUE  
##      114830        97884

The estimated intercept is the average price for houses with 2 or fewer baths (bath2 = FALSE). The estimated coefficient under bath2TRUE is how much more, on average, houses with more than 2 baths cost than the others.

Exercise: Run a regression of Price on factor(Baths>2). Compare your regression result with the regression above.

lm(Price~ factor(Baths>2), data = house)
## 
## Call:
## lm(formula = Price ~ factor(Baths > 2), data = house)
## 
## Coefficients:
##           (Intercept)  factor(Baths > 2)TRUE  
##                114830                  97884

Interpretation:

The coefficient under bath2TRUE equals about 98,000. This means that the difference in average house Price between the group with more than 2 baths and the group with 2 or fewer baths is about 98,000.

In addition, the coefficient under (Intercept) is about 115,000. For an indicator regression, this intercept does have a meaning: the average Price of houses with 2 or fewer baths equals about 115,000.

We can extend the interpretation of regression results on indicator variables to that of categorical variables.

Exercise: Regress Price on factor(Baths) and interpret the result

lm(Price~factor(Baths), data = house)
## 
## Call:
## lm(formula = Price ~ factor(Baths), data = house)
## 
## Coefficients:
##      (Intercept)  factor(Baths)1.5    factor(Baths)2  factor(Baths)2.5  
##            82637             34072             47575            107874  
##   factor(Baths)3  factor(Baths)3.5    factor(Baths)4  factor(Baths)4.5  
##           117433            232537            238103            409613
  1. The average price of a house with 1 bathroom is 82,637 (1 bathroom houses are treated as default).
  2. The average price of a house with 3.5 bathrooms is 82,637 + 232,537 = 315,174.

Technical note: When regressing on a categorical variable, the first value of the variable is always treated as the default (baseline). The estimated coefficients associated with the other values of the categorical variable are interpreted as how much higher the average price of that category is than the default average.
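Because a regression on a single categorical predictor simply reproduces the group means, we can check these interpretations against the raw data. A quick sketch:

# group means should match the intercept and intercept + coefficient (up to rounding)
mean(house$Price[house$Baths == 1])    # close to the intercept, 82,637
mean(house$Price[house$Baths == 3.5])  # close to 82,637 + 232,537 = 315,174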

3 Multiple-variable regression with interactions between continuous and categorical variables

In this section we learn how to interpret regression results when predictors include different types of variables.

We code 3 models that differ from one another by only one predictor at a time, and compare how that one difference changes the regression results of the models.

When considering a few models at a time, we use command screenreg() to put their regression estimates side by side in an orderly manner.

We estimate our models from randomly sampled data, which subjects the model estimates to random sampling error. We will briefly discuss the implications for interpretation.

3.1 Multivariable regression examples

Example: From dataset house, regress Price on SqFt and factor(Zip==6). Save your result as model moda (for “model a”).

moda = lm(Price~SqFt+factor(Zip==6), data=house)
summary(moda)
## 
## Call:
## lm(formula = Price ~ SqFt + factor(Zip == 6), data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -177126  -20879    -147   18538  222136 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -23839.355   5034.440  -4.735 2.86e-06 ***
## SqFt                     97.013      2.699  35.940  < 2e-16 ***
## factor(Zip == 6)TRUE  15420.928   4046.159   3.811 0.000156 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39580 on 496 degrees of freedom
## Multiple R-squared:  0.746,  Adjusted R-squared:  0.745 
## F-statistic: 728.3 on 2 and 496 DF,  p-value: < 2.2e-16

Example: Run a linear multiple-variable regression model and call it modb. This model enables you to predict Price of a house based on consideration of SqFt, whether the house is in Zip 6, and whether it has more than 2 BedRooms.

modb = lm(Price~SqFt+factor(Zip==6)+factor(BedRooms>2), data=house)
summary(modb)
## 
## Call:
## lm(formula = Price ~ SqFt + factor(Zip == 6) + factor(BedRooms > 
##     2), data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -179074  -20740     655   17867  220790 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -13132.973   7684.514  -1.709   0.0881 .  
## SqFt                         98.309      2.783  35.321  < 2e-16 ***
## factor(Zip == 6)TRUE      15896.370   4044.707   3.930  9.7e-05 ***
## factor(BedRooms > 2)TRUE -14044.753   7629.697  -1.841   0.0662 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39490 on 495 degrees of freedom
## Multiple R-squared:  0.7477, Adjusted R-squared:  0.7462 
## F-statistic:   489 on 3 and 495 DF,  p-value: < 2.2e-16

Interaction term between two predictor variables: Run a linear multiple-variable regression model and call it modc. This model enables you to predict the Price of a house based on SqFt and allows the Price-per-SqFt relationship to adjust depending on whether the house is in Zip 6.

To do this we apply a technique called an interaction term, here between SqFt and factor(Zip==6).

#regress Price on the interaction term SqFt*factor(Zip==6)
modc = lm(Price~SqFt*factor(Zip==6), data=house)
summary(modc)
## 
## Call:
## lm(formula = Price ~ SqFt * factor(Zip == 6), data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -163356  -19499    -397   15363  241829 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -6435.145   5492.266  -1.172    0.242    
## SqFt                          86.751      3.014  28.783  < 2e-16 ***
## factor(Zip == 6)TRUE      -61378.923  12184.295  -5.038 6.62e-07 ***
## SqFt:factor(Zip == 6)TRUE     39.139      5.886   6.649 7.80e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37960 on 495 degrees of freedom
## Multiple R-squared:  0.7668, Adjusted R-squared:  0.7654 
## F-statistic: 542.6 on 3 and 495 DF,  p-value: < 2.2e-16

3.2 Present and compare model estimates

Each regression model contains a ton of numerical information. To help compare them, we install 2 (optional) packages that summarize and compare model regression estimates:

#install and library 2 packages for summarizing regression results
library(sjPlot) #command tab_model()
library(texreg) #command screenreg(), or texreg()
tab_model(moda, modb, modc, 
          show.aic = TRUE, show.ci = FALSE,collapse.se = TRUE, p.style = "asterisk")
                            Price (moda)               Price (modb)               Price (modc)
Predictors                  Estimates (SE)             Estimates (SE)             Estimates (SE)
(Intercept)                 -23839.35 *** (5034.44)    -13132.97 (7684.51)        -6435.14 (5492.27)
SqFt                        97.01 *** (2.70)           98.31 *** (2.78)           86.75 *** (3.01)
factor(Zip == 6)TRUE        15420.93 *** (4046.16)     15896.37 *** (4044.71)     -61378.92 *** (12184.30)
factor(BedRooms > 2)TRUE                               -14044.75 (7629.70)
SqFt:factor(Zip == 6)TRUE                                                         39.14 *** (5.89)
Observations                499                        499                        499
R2 / R2 adjusted            0.746 / 0.745              0.748 / 0.746              0.767 / 0.765
AIC                         11986.062                  11984.658                  11945.369
* p<0.05   ** p<0.01   *** p<0.001
screenreg(list(moda, modb, modc))
## 
## ======================================================================
##                            Model 1        Model 2        Model 3      
## ----------------------------------------------------------------------
## (Intercept)                -23839.35 ***  -13132.97       -6435.14    
##                             (5034.44)      (7684.51)      (5492.27)   
## SqFt                           97.01 ***      98.31 ***      86.75 ***
##                                (2.70)         (2.78)         (3.01)   
## factor(Zip == 6)TRUE        15420.93 ***   15896.37 ***  -61378.92 ***
##                             (4046.16)      (4044.71)     (12184.30)   
## factor(BedRooms > 2)TRUE                  -14044.75                   
##                                            (7629.70)                  
## SqFt:factor(Zip == 6)TRUE                                    39.14 ***
##                                                              (5.89)   
## ----------------------------------------------------------------------
## R^2                             0.75           0.75           0.77    
## Adj. R^2                        0.74           0.75           0.77    
## Num. obs.                     499            499            499       
## RMSE                        39582.51       39487.55       37963.23    
## ======================================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

Model interpretation:

  1. Use each of the models above to manually predict the price of a house that is 2000 sqft, located in zip code 5, and has 3 bedrooms. Call these predictions preda1, predb1, predc1, respectively
-23839.35 + 2000*97.01 + 0*15420.93      #model A
## [1] 170180.6
-13132.97 + 2000*98.31 + 0*15896.37 + 1*(-14044.75)   #model B
## [1] 169442.3
-6435.14 + 2000*86.75 + 0 + 0      #model C
## [1] 167064.9
  2. Use each of the models above to manually predict the price of a house that is also 2000 sqft, located in zip code 6, and has 2 bedrooms. Call these predictions preda2, predb2, predc2, respectively
-23839.35 + 2000*97.01 + 1*15420.93      #model A
## [1] 185601.6
-13132.97 + 2000*98.31 + 1*15896.37 + 0*(-14044.75)   #model B
## [1] 199383.4
-6435.14 + 2000*86.75 + 1*(-61378.92) + 2000*39.14    #model C
## [1] 183965.9
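Rather than plugging coefficients in by hand, we can also let R compute these predictions with predict(). A minimal sketch using modb (the newdata frame only needs the raw variables, since the model formula builds the indicators itself):

# two hypothetical houses: 2000 sqft in zip 5 with 3 bedrooms, and 2000 sqft in zip 6 with 2 bedrooms
newhouses = data.frame(SqFt = c(2000, 2000), Zip = c(5, 6), BedRooms = c(3, 2))
predict(modb, newdata = newhouses)  # matches the model-b hand calculations above, up to rounding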
  3. In model b,
    • The estimated coefficient for factor(Zip == 6)TRUE equals 15896.37. This means that, using model b, a house in zip code 6 is predicted to cost 15896.37 more than a comparable house outside zip code 6.
    • The estimate has a standard error of 4044.71. This means the coefficient estimate is subject to an error of about 4044.71 because our data come from a random sample.
    • The estimate gets *** for a p-value less than 0.001. This means that if location in zip code 6 did NOT make a difference for house price, controlling for other variables, the chance that we would obtain a coefficient estimate as big as 15896.37 would be minuscule.
  4. Also in model b,
    • The estimated coefficient for factor(BedRooms > 2)TRUE equals about -14,000. This means that, using model b, a house with more than 2 bedrooms is predicted to cost about 14,000 less than a house with 2 or fewer bedrooms when we also account for SqFt.
    • The estimate has a standard error of about 7,600. This means the coefficient estimate is subject to an error of about 7,600 because our data come from a random sample.
    • The estimate gets no stars because its p-value is larger than 5%. This means that if having more than 2 bedrooms did NOT make a difference for house price, controlling for other variables, a coefficient as large in magnitude as -14,000 could easily arise by chance.
  5. In model c, the interaction term SqFt*factor(Zip==6) means that we are running 2 simultaneous regressions (see the sketch after this list):
    • The first regression is for houses whose zip code is not 6. For this group, the estimated prediction formula is: -6435 + 86.75*SqFt.
    • The second regression is for houses in zip code 6. For this group, the estimated prediction formula is: -6435 + 86.75*SqFt - 61379 + 39.14*SqFt.
    • Based on these formulas, one additional square foot for a house outside zip code 6 predicts an $86.75 increase in price, with a statistical error of about $3. This estimate gets 3 stars, so it is statistically significant.
    • Based on these formulas, one additional square foot for a house in zip code 6 predicts an additional $39.14 on top of the $86.75, with a statistical error of about $5.9. This estimate also gets 3 stars.
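We can recover these two implied regression lines directly from the fitted coefficients. A minimal sketch using modc (coefficient order follows the summary above):

b = coef(modc)
# houses outside zip 6: intercept and slope come straight from the first two coefficients
c(intercept = unname(b[1]), slope = unname(b[2]))
# houses in zip 6: the indicator shifts the intercept, the interaction shifts the slope
c(intercept = unname(b[1] + b[3]), slope = unname(b[2] + b[4]))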

3.3 Exercises

  1. Code a regression model, called modd, that predicts Price based on the interaction between SqFt and factor(BedRooms>2).
modd = lm(Price~SqFt*factor(BedRooms>2), data=house)
summary(modd)
## 
## Call:
## lm(formula = Price ~ SqFt * factor(BedRooms > 2), data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -189584  -21014     392   16804  226129 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -2401.11   23228.91  -0.103    0.918    
## SqFt                              89.32      20.46   4.365 1.55e-05 ***
## factor(BedRooms > 2)TRUE      -24778.35   23862.87  -1.038    0.300    
## SqFt:factor(BedRooms > 2)TRUE     11.57      20.65   0.560    0.576    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40090 on 495 degrees of freedom
## Multiple R-squared:   0.74,  Adjusted R-squared:  0.7384 
## F-statistic: 469.6 on 3 and 495 DF,  p-value: < 2.2e-16
  2. Code a regression model, called mode, that adds the predictor variable factor(Zip==6) to modd.
mode = lm(Price~SqFt*factor(BedRooms>2)+factor(Zip==6), data=house)
summary(mode)
## 
## Call:
## lm(formula = Price ~ SqFt * factor(BedRooms > 2) + factor(Zip == 
##     6), data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -179600  -20536     553   17922  220381 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2534.75   22926.98   0.111    0.912    
## SqFt                              83.79      20.22   4.145 4.00e-05 ***
## factor(BedRooms > 2)TRUE      -30211.09   23557.86  -1.282    0.200    
## factor(Zip == 6)TRUE           16013.28    4049.85   3.954 8.81e-05 ***
## SqFt:factor(BedRooms > 2)TRUE     14.77      20.37   0.725    0.469    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39510 on 494 degrees of freedom
## Multiple R-squared:  0.748,  Adjusted R-squared:  0.7459 
## F-statistic: 366.5 on 4 and 494 DF,  p-value: < 2.2e-16
  3. Code a regression model, called modf, that adds the predictor variable Baths to mode.
modf = lm(Price~Baths+SqFt*factor(BedRooms>2)+factor(Zip==6), data=house)
summary(modf)
## 
## Call:
## lm(formula = Price ~ Baths + SqFt * factor(BedRooms > 2) + factor(Zip == 
##     6), data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -158529  -21452    -399   16159  212328 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -4875.03   21751.28  -0.224  0.82275    
## Baths                          27531.78    3647.73   7.548 2.16e-13 ***
## SqFt                              59.07      19.44   3.039  0.00250 ** 
## factor(BedRooms > 2)TRUE      -44551.45   22407.73  -1.988  0.04734 *  
## factor(Zip == 6)TRUE           14310.91    3844.88   3.722  0.00022 ***
## SqFt:factor(BedRooms > 2)TRUE     19.41      19.31   1.005  0.31539    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37440 on 493 degrees of freedom
## Multiple R-squared:  0.7741, Adjusted R-squared:  0.7718 
## F-statistic: 337.8 on 5 and 493 DF,  p-value: < 2.2e-16
  4. Write a screenreg() command to report regression results for all models moda to modf.
screenreg(list(moda, modb, modc, modd, mode, modf))
## 
## =======================================================================================================================
##                                Model 1        Model 2        Model 3        Model 4        Model 5        Model 6      
## -----------------------------------------------------------------------------------------------------------------------
## (Intercept)                    -23839.35 ***  -13132.97       -6435.14       -2401.11        2534.75       -4875.03    
##                                 (5034.44)      (7684.51)      (5492.27)     (23228.91)     (22926.98)     (21751.28)   
## SqFt                               97.01 ***      98.31 ***      86.75 ***      89.32 ***      83.79 ***      59.07 ** 
##                                    (2.70)         (2.78)         (3.01)        (20.46)        (20.22)        (19.44)   
## factor(Zip == 6)TRUE            15420.93 ***   15896.37 ***  -61378.92 ***                  16013.28 ***   14310.91 ***
##                                 (4046.16)      (4044.71)     (12184.30)                     (4049.85)      (3844.88)   
## factor(BedRooms > 2)TRUE                      -14044.75                     -24778.35      -30211.09      -44551.45 *  
##                                                (7629.70)                    (23862.87)     (23557.86)     (22407.73)   
## SqFt:factor(Zip == 6)TRUE                                        39.14 ***                                             
##                                                                  (5.89)                                                
## SqFt:factor(BedRooms > 2)TRUE                                                   11.57          14.77          19.41    
##                                                                                (20.65)        (20.37)        (19.31)   
## Baths                                                                                                      27531.78 ***
##                                                                                                            (3647.73)   
## -----------------------------------------------------------------------------------------------------------------------
## R^2                                 0.75           0.75           0.77           0.74           0.75           0.77    
## Adj. R^2                            0.74           0.75           0.77           0.74           0.75           0.77    
## Num. obs.                         499            499            499            499            499            499       
## RMSE                            39582.51       39487.55       37963.23       40086.20       39506.46       37442.37    
## =======================================================================================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
  5. Compare how the coefficient estimate for SqFt changes from modc to modf. As the estimate decreases and its standard error grows, it earns fewer stars and, therefore, becomes less statistically significant.

3.4 Random sampling and interpretation of regression estimates

Since our data come from a random sample, the dataset is one among many possible outcomes of a random process.

We estimate our model coefficients based on the dataset; therefore, the coefficients we obtain are also going to be random.

Let’s pretend that we could run the same regression model using data from the population and obtain a set of coefficients for the model.

Compared to the coefficients obtained from the population model (if we COULD run it), the coefficients we obtain by regressing on the random sample are subject to errors from the random sampling procedure.

4 Model selection criteria

4.1 R-squared, adjusted R-squared, and AIC

In choosing an optimal model, we use two technical criteria:

  1. The simpler the model and the fewer predictor variables, the better. (The technical word is “parsimonious”, which is a fancy word for “stingy”.) For example, model b uses one more variable than model a. This means that, unless model b makes better predictions than model a (it does not), the additional variable is an unnecessary complication.

  2. The less (total squared) prediction error, the better. A direct measurement of how well a model reduces predictive errors is called R^2. For example, the R^2 of model c above is 0.77. This means that the model removes roughly 77% of the prediction error relative to predicting Price solely with the average price in the sample. The higher this percentage, the better a model is at reducing prediction errors.
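As a quick check, R^2 can be computed by hand from the residuals: it is 1 minus the ratio of the model’s squared error to the squared error of predicting every house at the average price. A sketch using modc:

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
1 - sum(resid(modc)^2) / sum((house$Price - mean(house$Price))^2)
# should print about 0.767, matching the summary of modc above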

To balance minimizing the complexity of the model against minimizing its prediction errors, we learn the two most widely used criteria for regression models:

  1. Adjusted R-squared: the “adjusted” part refers to a penalty for the number of predictor variables.
  2. AIC, an acronym for Akaike Information Criterion; lower AIC values indicate a better balance between fit and complexity.
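Base R can compute AIC directly, which lets us preview the values before turning to the reporting packages:

# lower AIC indicates a better balance between fit and complexity
AIC(moda, modb, modc)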

4.2 Comparing models’ overall performance

The command compare_performance() from the performance package provides a comprehensive list of criteria, including some we have not yet learned.

#load package "performance" to compare model's performance based on a number of criterion
library(performance)

compare_performance(moda, modb, modc)
## # Comparison of Model Performance Indices
## 
## Model | Type |      AIC |      BIC |   R2 | R2_adjusted |     RMSE |          BF
## --------------------------------------------------------------------------------
## moda  |   lm | 11986.06 | 12002.91 | 0.75 |        0.74 | 39463.35 |            
## modb  |   lm | 11984.66 | 12005.72 | 0.75 |        0.75 | 39328.96 |        0.25
## modc  |   lm | 11945.37 | 11966.43 | 0.77 |        0.77 | 37810.76 | 8.34868e+07

For convenience, though, the tab_model() command reports adjusted R-squared by default. We can add AIC quite simply by an option:

tab_model(moda, modb, modc, 
          show.aic = TRUE, show.ci = FALSE,collapse.se = TRUE, p.style = "asterisk")
                            Price (moda)               Price (modb)               Price (modc)
Predictors                  Estimates (SE)             Estimates (SE)             Estimates (SE)
(Intercept)                 -23839.35 *** (5034.44)    -13132.97 (7684.51)        -6435.14 (5492.27)
SqFt                        97.01 *** (2.70)           98.31 *** (2.78)           86.75 *** (3.01)
factor(Zip == 6)TRUE        15420.93 *** (4046.16)     15896.37 *** (4044.71)     -61378.92 *** (12184.30)
factor(BedRooms > 2)TRUE                               -14044.75 (7629.70)
SqFt:factor(Zip == 6)TRUE                                                         39.14 *** (5.89)
Observations                499                        499                        499
R2 / R2 adjusted            0.746 / 0.745              0.748 / 0.746              0.767 / 0.765
AIC                         11986.062                  11984.658                  11945.369
* p<0.05   ** p<0.01   *** p<0.001

4.3 The anova test compares performance between competing models

Criteria such as AIC and adjusted R-squared are subject to _____ errors. Why? _______

Since model estimates from a randomly sampled dataset are all ______, the AIC and adjusted R-squared we obtain from the models are also _______.

To test the difference in performance between 2 models, we use command anova:

anova(moda,modb)
## Analysis of Variance Table
## 
## Model 1: Price ~ SqFt + factor(Zip == 6)
## Model 2: Price ~ SqFt + factor(Zip == 6) + factor(BedRooms > 2)
##   Res.Df        RSS Df  Sum of Sq      F  Pr(>F)  
## 1    496 7.7712e+11                               
## 2    495 7.7184e+11  1 5283649620 3.3885 0.06625 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(moda,modc)
## Analysis of Variance Table
## 
## Model 1: Price ~ SqFt + factor(Zip == 6)
## Model 2: Price ~ SqFt * factor(Zip == 6)
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1    496 7.7712e+11                                   
## 2    495 7.1340e+11  1 6.3723e+10 44.215 7.796e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So, even though model b has a better (lower) AIC than model a, this difference could have been due to ______. To test this, we apply the anova test.

In contrast, model modc’s improvement over moda is very unlikely to be due to random sampling error.

4.4 Thinking critically about model criteria

  • Prediction is subject to error. If you predict a coin flip one time, you could be right or you could be wrong. If you predict the price of one particular house, you’re ____________, and the question is how much you’d be off by. However, when you have a sample of houses to predict, and you are comfortable with prediction errors, then making the choice to ________ is a useful practice for science, business, and policy making.
  • It is important to realize that the technical criteria we defined above to evaluate regression models are based on their performance over a _________, but not necessarily for _________, nor even for _________. This recognition may affect our choice of the optimal model later.

5 Determining an optimal model

Let’s revisit the list of optimal models we created with regsubsets() at the beginning of the discussion:

summary(opmodels)
## Subset selection object
## Call: regsubsets.formula(Price ~ SqFt + factor(Zip) + BedRooms + Baths + 
##     Garage + SqFt * factor(Zip) + SqFt * (Garage > 1) + SqFt * 
##     (Baths > 2), data = house, nbest = 1, method = "seqrep")
## 14 Variables  (and intercept)
##                     Forced in Forced out
## SqFt                    FALSE      FALSE
## factor(Zip)5            FALSE      FALSE
## factor(Zip)6            FALSE      FALSE
## factor(Zip)9            FALSE      FALSE
## BedRooms                FALSE      FALSE
## Baths                   FALSE      FALSE
## Garage                  FALSE      FALSE
## Garage > 1TRUE          FALSE      FALSE
## Baths > 2TRUE           FALSE      FALSE
## SqFt:factor(Zip)5       FALSE      FALSE
## SqFt:factor(Zip)6       FALSE      FALSE
## SqFt:factor(Zip)9       FALSE      FALSE
## SqFt:Garage > 1TRUE     FALSE      FALSE
## SqFt:Baths > 2TRUE      FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: 'sequential replacement'
##          SqFt factor(Zip)5 factor(Zip)6 factor(Zip)9 BedRooms Baths Garage
## 1  ( 1 ) "*"  " "          " "          " "          " "      " "   " "   
## 2  ( 1 ) "*"  " "          " "          " "          " "      "*"   " "   
## 3  ( 1 ) "*"  " "          " "          " "          " "      "*"   " "   
## 4  ( 1 ) "*"  " "          "*"          " "          " "      "*"   " "   
## 5  ( 1 ) "*"  " "          "*"          " "          " "      "*"   "*"   
## 6  ( 1 ) " "  " "          "*"          " "          " "      "*"   "*"   
## 7  ( 1 ) " "  " "          "*"          " "          " "      "*"   "*"   
## 8  ( 1 ) " "  " "          "*"          " "          " "      "*"   "*"   
##          Garage > 1TRUE Baths > 2TRUE SqFt:factor(Zip)5 SqFt:factor(Zip)6
## 1  ( 1 ) " "            " "           " "               " "              
## 2  ( 1 ) " "            " "           " "               " "              
## 3  ( 1 ) " "            " "           " "               "*"              
## 4  ( 1 ) " "            " "           " "               "*"              
## 5  ( 1 ) " "            " "           " "               "*"              
## 6  ( 1 ) "*"            " "           " "               "*"              
## 7  ( 1 ) "*"            " "           "*"               "*"              
## 8  ( 1 ) "*"            "*"           " "               "*"              
##          SqFt:factor(Zip)9 SqFt:Garage > 1TRUE SqFt:Baths > 2TRUE
## 1  ( 1 ) " "               " "                 " "               
## 2  ( 1 ) " "               " "                 " "               
## 3  ( 1 ) " "               " "                 " "               
## 4  ( 1 ) " "               " "                 " "               
## 5  ( 1 ) " "               " "                 " "               
## 6  ( 1 ) " "               "*"                 " "               
## 7  ( 1 ) " "               "*"                 " "               
## 8  ( 1 ) " "               "*"                 "*"

Exercise: Code 5 regression models based on the combinations suggested by regsubsets() above for 1, 2, 4, 5, and 6 optimal predictor variables, respectively. (We skip mod3 because it takes more work than it is worth.) Name the models mod1, mod2, mod4, mod5, and mod6, respectively. I have written the first three for you.

mod1=lm(Price~SqFt, data = house)
mod2=lm(Price~SqFt+Baths, data = house)
mod4=lm(Price~SqFt+Baths+SqFt*(Zip=="6"), data = house)
mod5=lm(Price~SqFt+Baths+Garage+SqFt*(Zip=="6"), data = house)
mod6=lm(Price~Baths+Garage+SqFt*(Zip=="6")+ SqFt*(Garage>1), data = house)

Exercise: Use the commands tab_model() or screenreg() to summarize regression estimates from the models above.

screenreg(list(mod1, mod2, mod4, mod5, mod6))
## 
## ==============================================================================================
##                      Model 1        Model 2        Model 3        Model 4        Model 5      
## ----------------------------------------------------------------------------------------------
## (Intercept)          -23906.23 ***  -42467.46 ***  -24375.75 ***  -29267.37 ***   25011.56 *  
##                       (5102.45)      (5500.07)      (5753.21)      (5844.76)     (12323.69)   
## SqFt                     99.49 ***      79.05 ***      67.16 ***      65.11 ***      16.34    
##                          (2.66)         (3.80)         (3.90)         (3.90)         (9.72)   
## Baths                                26366.04 ***   25598.88 ***   21124.71 ***   24224.99 ***
##                                      (3659.06)      (3461.69)      (3639.13)      (3569.82)   
## Zip == "6"TRUE                                     -63679.57 ***  -65883.24 ***  -56596.49 ***
##                                                    (11577.07)     (11455.66)     (11139.77)   
## SqFt:Zip == "6"TRUE                                    39.34 ***      40.45 ***      36.29 ***
##                                                        (5.59)         (5.53)         (5.37)   
## Garage                                                              9904.63 ***   22398.66 ***
##                                                                    (2750.24)      (4447.87)   
## Garage > 1TRUE                                                                   -93675.19 ***
##                                                                                  (14809.49)   
## SqFt:Garage > 1TRUE                                                                  51.00 ***
##                                                                                      (9.70)   
## ----------------------------------------------------------------------------------------------
## R^2                       0.74           0.76           0.79           0.80           0.81    
## Adj. R^2                  0.74           0.76           0.79           0.79           0.81    
## Num. obs.               499            499            499            499            499       
## RMSE                  40117.51       38207.87       36058.22       35629.15       34326.87    
## ==============================================================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

regsubsets suggests that the 6-predictor combination is best in technical terms, but notice that its 6-predictor model drops the standalone SqFt term; SqFt enters only through the interaction terms.

Observation: In purely technical terms (AIC), we can keep improving the models by _____. We know this because, as we add the new predictor variables recommended by regsubsets(), R-squared ______ and AIC ______.

Exercise: Conduct an anova test to compare performance of mod5 and mod6 and see if the improvement from mod5 to mod6 is due to random sampling errors.

anova(mod5,mod6)
## Analysis of Variance Table
## 
## Model 1: Price ~ SqFt + Baths + Garage + SqFt * (Zip == "6")
## Model 2: Price ~ Baths + Garage + SqFt * (Zip == "6") + SqFt * (Garage > 
##     1)
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
## 1    493 6.2583e+11                                  
## 2    491 5.7856e+11  2 4.727e+10 20.058 4.232e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is a disadvantage of models mod1 and mod2?

What is a disadvantage of model mod6 (and of models with more interaction terms)?

Exercise: Conduct an anova test to compare the performance of mod4 and mod5:

Between models mod4 and mod5, which one is more reliable and useful?

Exercise: What if we replace Baths by factor(Baths) in model 5? Compare the performance of this model with model 5.

Exercise: What if we replace Garage by factor(Garage) in model 5? Compare the performance of this model with model 5.
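A sketch for the first of these two exercises, using a model name of our choosing (mod5b); the factor(Garage) variant is analogous:

# replace the numeric Baths in mod5 with its categorical version
mod5b = lm(Price ~ SqFt + factor(Baths) + Garage + SqFt*(Zip == "6"), data = house)
compare_performance(mod5, mod5b)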

6 What we learned

  1. To find optimal regression models from a set of predictor variables, we run the command regsubsets() from the package leaps.
  2. To run regression models that predict a response variable Y based on predictor variables X, we code the command lm(Y~X, data = ).
  3. The linear regression model assumes a linear relationship that can be written as Y = a + b*X, where a and b are numbers. The model then estimates the coefficients a and b to minimize predictive errors. For a given value x of X, we use the model to predict Y by calculating Y = a + b*x.
    • To compare regression results between models, we use commands such as screenreg(), tab_model(), and compare_performance().
  4. We calculate basic performance criteria (AIC, adjusted R-squared) using the package performance.
  5. We select models based on the available predictors, predictive performance, and usefulness for the purpose at hand.