Our overall objective is to find a (linear) regression model that helps us predict the price of homes from a data set.
I want my prediction to be as accurate as possible (make as few errors as possible).
I want my prediction model to be as simple and easy to understand as possible.
With some prior knowledge from house hunting, I believe a priori that the size (SqFt) and location (zip code) of a house will be important information.
In regression lingo, home price is the response variable. The factors we use to predict home prices are called predictor variables.
# load the ggplot2 package
library(ggplot2)
# import dataset house
house <- read.csv("/cloud/project/house.csv")
# restrict the sample to houses under 5,000 SqFt priced below $600,000
house = house[house$SqFt < 5000 & house$Price < 6e+05, ]
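Before modeling, a quick sanity check on the filtered data (a sketch; these are the columns used throughout this section):
#peek at the variables we will use
str(house[, c("Price", "SqFt", "BedRooms", "Baths", "Garage", "Zip")])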
We will learn the following new packages:
#install and library 2 packages for summarizing regression results
library(leaps) #command regsubsets()
library(sjPlot) #command tab_model()
library(texreg) #command screenreg(), or texreg()
## Version: 1.36.23
## Date: 2017-03-03
## Author: Philip Leifeld (University of Glasgow)
##
## Please cite the JSS article in your publications -- see citation("texreg").
library(performance) #command compare_performance()
We use the command regsubsets() from the package leaps to find an optimal predictive model. To apply the routine, we make a list of all the information we will consider for our regression:
- Response: Price
- Main effects: SqFt, factor(Zip), BedRooms, Baths, and Garage
- The price-per-SqFt relationship could change by factor(Zip): SqFt*factor(Zip)
- The price-per-SqFt relationship could change for houses with more than 1 garage: SqFt*(Garage>1)
- The price-per-SqFt relationship could change for houses with more than 2 bathrooms: SqFt*(Baths>2)
The number of potential predictors we are considering is 7+3+2+2=14. (How does this work? The main effects contribute 7 columns: SqFt, three Zip dummies, BedRooms, Baths, and Garage. SqFt*factor(Zip) adds 3 interaction columns, and each of SqFt*(Garage>1) and SqFt*(Baths>2) adds an indicator plus its interaction with SqFt, 2 columns each.)
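One quick way to verify the count of 14 (a sketch): build the design matrix of the complete model and count its columns, excluding the intercept.
#model.matrix() expands factors and interactions into numeric columns
X = model.matrix(Price~SqFt+factor(Zip)+BedRooms+Baths+Garage+
                   SqFt*factor(Zip)+SqFt*(Garage>1)+SqFt*(Baths>2),
                 data = house)
ncol(X) - 1
## [1] 14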
#use command regsubsets from command leaps package
library(leaps)
#considering all models that could be extracted from the complete model
#Price~SqFt+BedRooms+Baths+Garage+factor(Zip)+SqFt*factor(Zip)+SqFt*(Garage>1)+SqFt*(Baths>2)
#run command regsubsets and save results in object opmodels
opmodels = regsubsets(Price ~SqFt+factor(Zip)+BedRooms+Baths+Garage+
SqFt*factor(Zip)+SqFt*(Garage>1)+SqFt*(Baths>2),
# enter dataset
data = house,
# a couple of technical options
nbest = 1, method = "seqrep")
#to print results, use summary()
summary(opmodels)
## Subset selection object
## Call: regsubsets.formula(Price ~ SqFt + factor(Zip) + BedRooms + Baths +
## Garage + SqFt * factor(Zip) + SqFt * (Garage > 1) + SqFt *
## (Baths > 2), data = house, nbest = 1, method = "seqrep")
## 14 Variables (and intercept)
## Forced in Forced out
## SqFt FALSE FALSE
## factor(Zip)5 FALSE FALSE
## factor(Zip)6 FALSE FALSE
## factor(Zip)9 FALSE FALSE
## BedRooms FALSE FALSE
## Baths FALSE FALSE
## Garage FALSE FALSE
## Garage > 1TRUE FALSE FALSE
## Baths > 2TRUE FALSE FALSE
## SqFt:factor(Zip)5 FALSE FALSE
## SqFt:factor(Zip)6 FALSE FALSE
## SqFt:factor(Zip)9 FALSE FALSE
## SqFt:Garage > 1TRUE FALSE FALSE
## SqFt:Baths > 2TRUE FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: 'sequential replacement'
## SqFt factor(Zip)5 factor(Zip)6 factor(Zip)9 BedRooms Baths Garage
## 1 ( 1 ) "*" " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " "*" " "
## 3 ( 1 ) "*" " " " " " " " " "*" " "
## 4 ( 1 ) "*" " " "*" " " " " "*" " "
## 5 ( 1 ) "*" " " "*" " " " " "*" "*"
## 6 ( 1 ) " " " " "*" " " " " "*" "*"
## 7 ( 1 ) " " " " "*" " " " " "*" "*"
## 8 ( 1 ) " " " " "*" " " " " "*" "*"
## Garage > 1TRUE Baths > 2TRUE SqFt:factor(Zip)5 SqFt:factor(Zip)6
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " "*"
## 4 ( 1 ) " " " " " " "*"
## 5 ( 1 ) " " " " " " "*"
## 6 ( 1 ) "*" " " " " "*"
## 7 ( 1 ) "*" " " "*" "*"
## 8 ( 1 ) "*" "*" " " "*"
## SqFt:factor(Zip)9 SqFt:Garage > 1TRUE SqFt:Baths > 2TRUE
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " "*" " "
## 7 ( 1 ) " " "*" " "
## 8 ( 1 ) " " "*" "*"
The command regsubsets of package leaps reports, for each model size, the best combination of predictors:
- The best single predictor is SqFt.
- The best pair is SqFt & Baths.
- The best trio is SqFt, Baths & SqFt:factor(Zip)6 (SqFt of homes in zip code 6).
- regsubsets determines that, after 8 predictors, the models no longer improve, so it stops at 8.
For my purpose, I decide that the model with 5 predictor variables is optimal. This combination of 5 predictors is SqFt, Baths, Garage, factor(Zip)6, and the interaction between SqFt and factor(Zip)6.
An R command that runs this optimal regression for me is:
opmod = lm(Price~SqFt*factor(Zip==6)+Baths+Garage, data = house)
summary(opmod)
##
## Call:
## lm(formula = Price ~ SqFt * factor(Zip == 6) + Baths + Garage,
## data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -157677 -18007 -2047 17260 218114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29267.366 5844.763 -5.007 7.69e-07 ***
## SqFt 65.112 3.896 16.714 < 2e-16 ***
## factor(Zip == 6)TRUE -65883.242 11455.659 -5.751 1.56e-08 ***
## Baths 21124.707 3639.127 5.805 1.15e-08 ***
## Garage 9904.628 2750.238 3.601 0.000349 ***
## SqFt:factor(Zip == 6)TRUE 40.453 5.533 7.311 1.08e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35630 on 493 degrees of freedom
## Multiple R-squared: 0.7954, Adjusted R-squared: 0.7934
## F-statistic: 383.4 on 5 and 493 DF, p-value: < 2.2e-16
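Reading off the coefficients, we can hand-compute a prediction (a sketch for a hypothetical 2,000-SqFt house in zip code 6 with 2 baths and a 2-car garage; all numbers come from the summary above):
#intercept + SqFt + zip-6 shift + baths + garage + zip-6 extra slope
-29267.366 + 65.112*2000 - 65883.242 + 21124.707*2 + 9904.628*2 + 40.453*2000
## [1] 178038.1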
Exercise: Use the regsubsets() command to find the best combination of 3 predictor variables from the variables SqFt+factor(Zip)+BedRooms+Baths+Garage to predict Price from the dataset house.
Please make sure to give your object a name different from opmodels.
#use command regsubsets from command leaps package
library(leaps)
#considering all models that could be extracted from the complete model
#Price~SqFt+factor(Zip)+BedRooms+Baths+Garage
summary(regsubsets(Price ~SqFt+factor(Zip)+BedRooms+Baths+Garage,
# enter dataset
data = house,
# a couple of technical options
nbest = 1, method = "seqrep"))
## Subset selection object
## Call: regsubsets.formula(Price ~ SqFt + factor(Zip) + BedRooms + Baths +
## Garage, data = house, nbest = 1, method = "seqrep")
## 7 Variables (and intercept)
## Forced in Forced out
## SqFt FALSE FALSE
## factor(Zip)5 FALSE FALSE
## factor(Zip)6 FALSE FALSE
## factor(Zip)9 FALSE FALSE
## BedRooms FALSE FALSE
## Baths FALSE FALSE
## Garage FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: 'sequential replacement'
## SqFt factor(Zip)5 factor(Zip)6 factor(Zip)9 BedRooms Baths Garage
## 1 ( 1 ) "*" " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " "*" " "
## 3 ( 1 ) "*" " " "*" " " " " "*" " "
## 4 ( 1 ) "*" " " "*" " " " " "*" "*"
## 5 ( 1 ) "*" " " "*" " " "*" "*" "*"
## 6 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
The best combination of three predictor variables is SqFt, factor(Zip)6, and Baths.
Exercise: Use the regsubsets() command to find the best combination of 3 predictor variables from the variables SqFt+factor(Zip)+BedRooms+Baths+factor(Garage) to predict Price from the dataset house.
Please make sure to give your object a name different from opmodels.
#use command regsubsets from command leaps package
library(leaps)
#considering all models that could be extracted from the complete model
#Price~SqFt+factor(Zip)+BedRooms+Baths+factor(Garage)
summary(regsubsets(Price ~SqFt+factor(Zip)+BedRooms+Baths+factor(Garage),
# enter dataset
data = house,
# a couple of technical options
nbest = 1, method = "seqrep"))
## Subset selection object
## Call: regsubsets.formula(Price ~ SqFt + factor(Zip) + BedRooms + Baths +
## factor(Garage), data = house, nbest = 1, method = "seqrep")
## 9 Variables (and intercept)
## Forced in Forced out
## SqFt FALSE FALSE
## factor(Zip)5 FALSE FALSE
## factor(Zip)6 FALSE FALSE
## factor(Zip)9 FALSE FALSE
## BedRooms FALSE FALSE
## Baths FALSE FALSE
## factor(Garage)1 FALSE FALSE
## factor(Garage)2 FALSE FALSE
## factor(Garage)3 FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: 'sequential replacement'
## SqFt factor(Zip)5 factor(Zip)6 factor(Zip)9 BedRooms Baths
## 1 ( 1 ) "*" " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " "*"
## 3 ( 1 ) "*" " " " " " " " " "*"
## 4 ( 1 ) "*" " " "*" " " " " "*"
## 5 ( 1 ) "*" " " "*" " " "*" "*"
## 6 ( 1 ) "*" " " "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*"
## factor(Garage)1 factor(Garage)2 factor(Garage)3
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " "*"
## 4 ( 1 ) " " " " "*"
## 5 ( 1 ) " " " " "*"
## 6 ( 1 ) "*" " " "*"
## 7 ( 1 ) " " " " "*"
## 8 ( 1 ) "*" "*" " "
The best combination of three predictor variables is SqFt, Baths, and factor(Garage)3 (the indicator for Garage equal to 3).
To conduct and understand this process of model selection with confidence, we will need to master a number of technical skills.
Categorical variables are mathematically simpler than quantitative variables, but they can be less straightforward to operate on and interpret in a regression setting. This section makes sure you get comfortable with them.
Example: Use the command table() to describe the distribution of variable Garage: distribution means the list of all possible outcomes (0 to 3 garages) and the frequencies for the respective outcomes.
table(house$Garage)
##
## 0 1 2 3
## 41 60 350 48
Exercise: Use the command table() to describe the distribution of the variable Baths. (Hint: You can wrap it as factor(Baths) so that table() treats it as a factor variable.)
table(factor(house$Baths))
##
## 1 1.5 2 2.5 3 3.5 4 4.5
## 82 46 166 143 34 19 5 4
From these categorical variables we can create indicator variables (aka “dummy variables”). For example, we can create the variable bath2 as follows:
house$bath2 = factor(house$Baths>2)
table(factor(house$Baths))
##
## 1 1.5 2 2.5 3 3.5 4 4.5
## 82 46 166 143 34 19 5 4
table(factor(house$bath2))
##
## FALSE TRUE
## 294 205
Observation: The command factor(house$Baths>2) creates a new variable bath2 that takes the value TRUE if a house has more than 2 baths and FALSE if it has 2 or fewer baths.
Exercise: Create a variable called gar1 to indicate whether a house has more than 1 garage.
house$gar1 = factor(house$Garage>1)
table(factor(house$Garage))
##
## 0 1 2 3
## 41 60 350 48
table(factor(house$gar1))
##
## FALSE TRUE
## 101 398
101 houses have 1 or fewer garages and 398 houses have more than 1 garage.
Exercise: Regress Price on Baths to find the average linear prediction of price per bathroom, and interpret the estimated coefficient under Baths.
lm(Price~ Baths, data = house)
##
## Call:
## lm(formula = Price ~ Baths, data = house)
##
## Coefficients:
## (Intercept) Baths
## -19491 83183
On average, each additional bath predicts an increase of $83,183 in price.
Exercise: Run a regression of Price on factor variable bath2.
lm(Price~ bath2, data = house)
##
## Call:
## lm(formula = Price ~ bath2, data = house)
##
## Coefficients:
## (Intercept) bath2TRUE
## 114830 97884
The estimated intercept is the average price for houses with 2 or fewer baths (bath2 = FALSE). The estimated coefficient under bath2TRUE is how much more, on average, houses with more than 2 baths cost than the others.
Exercise: Run a regression of Price on factor(Baths>2). Compare your regression result with the regression above.
lm(Price~ factor(Baths>2), data = house)
##
## Call:
## lm(formula = Price ~ factor(Baths > 2), data = house)
##
## Coefficients:
## (Intercept) factor(Baths > 2)TRUE
## 114830 97884
Interpretation:
The coefficient under bath2TRUE equals 97,884 (roughly 98,000). This means that the difference in average house Price between the group with more than two baths and the group with 2 or fewer baths is about 98,000.
In addition, the coefficient under (Intercept) is 114,830 (roughly 115,000). For an indicator regression, this intercept does have a meaning: namely, the average Price of houses with 2 or fewer baths equals about 115,000.
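A quick numeric check of this reading (a sketch; a regression on a single indicator simply fits the two group means):
#average price implied for houses with more than 2 baths
114830 + 97884
## [1] 212714
#the group means in the data should match (up to rounding)
mean(house$Price[house$Baths <= 2])
mean(house$Price[house$Baths > 2])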
We can extend this interpretation of regression results from indicator variables to categorical variables.
Exercise: Regress Price on factor(Baths) and interpret the result
lm(Price~factor(Baths), data = house)
##
## Call:
## lm(formula = Price ~ factor(Baths), data = house)
##
## Coefficients:
## (Intercept) factor(Baths)1.5 factor(Baths)2 factor(Baths)2.5
## 82637 34072 47575 107874
## factor(Baths)3 factor(Baths)3.5 factor(Baths)4 factor(Baths)4.5
## 117433 232537 238103 409613
Technical note: When regressing on a categorical variable, the first level of the variable is always treated as the default (baseline). The estimated coefficient associated with each other level is interpreted as how much the average price of that category exceeds the default average.
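To see which level is the baseline, and to recover a category's average from the output above (a sketch):
#the first level listed is the default category
levels(factor(house$Baths))
#implied average price of a 2.5-bath house: intercept + factor(Baths)2.5
82637 + 107874
## [1] 190511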
In this section we learn how to interpret regression results when the predictors include different types of variables.
We code 3 models that differ from one another by only one predictor at a time and compare how that one difference changes the regression results.
When considering a few models at a time, we use the command screenreg() to put their regression estimates side by side in an orderly manner.
We create our models from randomly sampled data, which subjects the model estimates to random sampling errors. We will briefly discuss the implications for interpretation.
Example: From dataset house, regress Price on SqFt and factor(Zip==6). Save your result as model moda (for “model a”).
moda = lm(Price~SqFt+factor(Zip==6), data=house)
summary(moda)
##
## Call:
## lm(formula = Price ~ SqFt + factor(Zip == 6), data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -177126 -20879 -147 18538 222136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23839.355 5034.440 -4.735 2.86e-06 ***
## SqFt 97.013 2.699 35.940 < 2e-16 ***
## factor(Zip == 6)TRUE 15420.928 4046.159 3.811 0.000156 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39580 on 496 degrees of freedom
## Multiple R-squared: 0.746, Adjusted R-squared: 0.745
## F-statistic: 728.3 on 2 and 496 DF, p-value: < 2.2e-16
Example: Run a linear multiple-variable regression model and call it modb. This model enables you to predict Price of a house based on consideration of SqFt, whether the house is in Zip 6, and whether it has more than 2 BedRooms.
modb = lm(Price~SqFt+factor(Zip==6)+factor(BedRooms>2), data=house)
summary(modb)
##
## Call:
## lm(formula = Price ~ SqFt + factor(Zip == 6) + factor(BedRooms >
## 2), data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179074 -20740 655 17867 220790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13132.973 7684.514 -1.709 0.0881 .
## SqFt 98.309 2.783 35.321 < 2e-16 ***
## factor(Zip == 6)TRUE 15896.370 4044.707 3.930 9.7e-05 ***
## factor(BedRooms > 2)TRUE -14044.753 7629.697 -1.841 0.0662 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39490 on 495 degrees of freedom
## Multiple R-squared: 0.7477, Adjusted R-squared: 0.7462
## F-statistic: 489 on 3 and 495 DF, p-value: < 2.2e-16
Interaction term between two predictor variables: Run a linear multiple-variable regression model and call it modc. This model enables you to predict the Price of a house based on SqFt while allowing the price-per-SqFt relationship to adjust depending on whether the house is in Zip 6.
To do this we apply a technique called an interaction term, here between SqFt and factor(Zip==6).
#regress Price on the interaction term SqFt*factor(Zip==6)
modc = lm(Price~SqFt*factor(Zip==6), data=house)
summary(modc)
##
## Call:
## lm(formula = Price ~ SqFt * factor(Zip == 6), data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -163356 -19499 -397 15363 241829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6435.145 5492.266 -1.172 0.242
## SqFt 86.751 3.014 28.783 < 2e-16 ***
## factor(Zip == 6)TRUE -61378.923 12184.295 -5.038 6.62e-07 ***
## SqFt:factor(Zip == 6)TRUE 39.139 5.886 6.649 7.80e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37960 on 495 degrees of freedom
## Multiple R-squared: 0.7668, Adjusted R-squared: 0.7654
## F-statistic: 542.6 on 3 and 495 DF, p-value: < 2.2e-16
Each regression model contains a ton of numerical information. To help compare them, we install 2 (optional) packages that summarize and compare model regression estimates:
#install and library 2 packages for summarizing regression results
library(sjPlot) #command tab_model()
library(texreg) #command screenreg(), or texreg()
tab_model(moda, modb, modc,
show.aic = TRUE, show.ci = FALSE,collapse.se = TRUE, p.style = "asterisk")
|  | Price | Price | Price |
|---|---|---|---|
| Predictors | Estimates | Estimates | Estimates |
| (Intercept) | -23839.35 *** (5034.44) | -13132.97 (7684.51) | -6435.14 (5492.27) |
| SqFt | 97.01 *** (2.70) | 98.31 *** (2.78) | 86.75 *** (3.01) |
| factor(Zip == 6)TRUE | 15420.93 *** (4046.16) | 15896.37 *** (4044.71) | -61378.92 *** (12184.30) |
| factor(BedRooms > 2)TRUE |  | -14044.75 (7629.70) |  |
| SqFt : factor(Zip == 6)TRUE |  |  | 39.14 *** (5.89) |
| Observations | 499 | 499 | 499 |
| R2 / R2 adjusted | 0.746 / 0.745 | 0.748 / 0.746 | 0.767 / 0.765 |
| AIC | 11986.062 | 11984.658 | 11945.369 |
screenreg(list(moda, modb, modc))
##
## ======================================================================
## Model 1 Model 2 Model 3
## ----------------------------------------------------------------------
## (Intercept) -23839.35 *** -13132.97 -6435.14
## (5034.44) (7684.51) (5492.27)
## SqFt 97.01 *** 98.31 *** 86.75 ***
## (2.70) (2.78) (3.01)
## factor(Zip == 6)TRUE 15420.93 *** 15896.37 *** -61378.92 ***
## (4046.16) (4044.71) (12184.30)
## factor(BedRooms > 2)TRUE -14044.75
## (7629.70)
## SqFt:factor(Zip == 6)TRUE 39.14 ***
## (5.89)
## ----------------------------------------------------------------------
## R^2 0.75 0.75 0.77
## Adj. R^2 0.74 0.75 0.77
## Num. obs. 499 499 499
## RMSE 39582.51 39487.55 37963.23
## ======================================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
Model interpretation: using each model's coefficients, we can plug in a hypothetical 2,000-SqFt house and compare predictions, first for a house outside zip code 6, then for one in zip code 6:
-23839.35 + 2000*97.01 + 0*15420.93 #model A
## [1] 170180.6
-13132.97 + 2000*98.31 + 0*15896.37 + 1*(-14044.75) #model B
## [1] 169442.3
-6435.14 + 2000*86.75 + 0 + 0 #model C
## [1] 167064.9
-23839.35 + 2000*97.01 + 1*15420.93 #model A
## [1] 185601.6
-13132.97 + 2000*98.31 + 1*15896.37 + 0*(-14044.75) #model B
## [1] 199383.4
-6435.14 + 2000*86.75 + 1*(-61378.92) + 2000*39.14 #model C
## [1] 183965.9
- The coefficient under factor(Zip == 6)TRUE equals 15896.37. This means that, using model b, a house in zip code 6 is predicted to cost 15,896.37 more than a comparable house outside zip code 6.
- The *** marks a p-value less than 0.001. This means that if location in zip code 6 did NOT make a difference for house price, controlling for the other variables, the chance that we would obtain a coefficient estimate as big as 15,896.37 would be minuscule.
- The coefficient under factor(BedRooms > 2)TRUE equals about -14,000. This means that, using model b, a house with more than 2 bedrooms is predicted to cost about 14,000 less than a house with 2 or fewer bedrooms when we also consider SqFt.
- That coefficient earns no * because its p-value (0.0662) is larger than 5%. This means that if having more than 2 bedrooms did NOT make a difference for house price, controlling for the other variables, obtaining a coefficient as big (in magnitude) as -14,000 would still be fairly likely.
- The interaction SqFt*factor(Zip==6) means that we are running 2 simultaneous regressions: one price-per-SqFt line for houses in zip code 6 and another for all other houses, as the sketch below shows.
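We can recover modc's two implied regression lines from its coefficients (a minimal sketch; coef() returns them in the order shown in the summary):
#(Intercept), SqFt, factor(Zip == 6)TRUE, SqFt:factor(Zip == 6)TRUE
b <- coef(modc)
c(intercept = b[[1]], slope = b[[2]])                    #houses outside zip 6
c(intercept = b[[1]] + b[[3]], slope = b[[2]] + b[[4]])  #houses in zip 6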
Exercise:
Code a model modd that predicts Price based on the interaction between SqFt and factor(BedRooms>2).
modd = lm(Price~SqFt*factor(BedRooms>2), data=house)
summary(modd)
##
## Call:
## lm(formula = Price ~ SqFt * factor(BedRooms > 2), data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -189584 -21014 392 16804 226129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2401.11 23228.91 -0.103 0.918
## SqFt 89.32 20.46 4.365 1.55e-05 ***
## factor(BedRooms > 2)TRUE -24778.35 23862.87 -1.038 0.300
## SqFt:factor(BedRooms > 2)TRUE 11.57 20.65 0.560 0.576
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40090 on 495 degrees of freedom
## Multiple R-squared: 0.74, Adjusted R-squared: 0.7384
## F-statistic: 469.6 on 3 and 495 DF, p-value: < 2.2e-16
Code a model mode that adds the predictor variable factor(Zip==6) to modd.
mode = lm(Price~SqFt*factor(BedRooms>2)+factor(Zip==6), data=house)
summary(mode)
##
## Call:
## lm(formula = Price ~ SqFt * factor(BedRooms > 2) + factor(Zip ==
## 6), data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -179600 -20536 553 17922 220381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2534.75 22926.98 0.111 0.912
## SqFt 83.79 20.22 4.145 4.00e-05 ***
## factor(BedRooms > 2)TRUE -30211.09 23557.86 -1.282 0.200
## factor(Zip == 6)TRUE 16013.28 4049.85 3.954 8.81e-05 ***
## SqFt:factor(BedRooms > 2)TRUE 14.77 20.37 0.725 0.469
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39510 on 494 degrees of freedom
## Multiple R-squared: 0.748, Adjusted R-squared: 0.7459
## F-statistic: 366.5 on 4 and 494 DF, p-value: < 2.2e-16
Code a model modf that adds the predictor variable Baths to mode.
modf = lm(Price~Baths+SqFt*factor(BedRooms>2)+factor(Zip==6), data=house)
summary(modf)
##
## Call:
## lm(formula = Price ~ Baths + SqFt * factor(BedRooms > 2) + factor(Zip ==
## 6), data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -158529 -21452 -399 16159 212328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4875.03 21751.28 -0.224 0.82275
## Baths 27531.78 3647.73 7.548 2.16e-13 ***
## SqFt 59.07 19.44 3.039 0.00250 **
## factor(BedRooms > 2)TRUE -44551.45 22407.73 -1.988 0.04734 *
## factor(Zip == 6)TRUE 14310.91 3844.88 3.722 0.00022 ***
## SqFt:factor(BedRooms > 2)TRUE 19.41 19.31 1.005 0.31539
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37440 on 493 degrees of freedom
## Multiple R-squared: 0.7741, Adjusted R-squared: 0.7718
## F-statistic: 337.8 on 5 and 493 DF, p-value: < 2.2e-16
Use screenreg() to compare models moda to modf.
screenreg(list(moda, modb, modc, modd, mode, modf))
##
## =======================================================================================================================
## Model 1 Model 2 Model 3 Model 4 Model 5 Model 6
## -----------------------------------------------------------------------------------------------------------------------
## (Intercept) -23839.35 *** -13132.97 -6435.14 -2401.11 2534.75 -4875.03
## (5034.44) (7684.51) (5492.27) (23228.91) (22926.98) (21751.28)
## SqFt 97.01 *** 98.31 *** 86.75 *** 89.32 *** 83.79 *** 59.07 **
## (2.70) (2.78) (3.01) (20.46) (20.22) (19.44)
## factor(Zip == 6)TRUE 15420.93 *** 15896.37 *** -61378.92 *** 16013.28 *** 14310.91 ***
## (4046.16) (4044.71) (12184.30) (4049.85) (3844.88)
## factor(BedRooms > 2)TRUE -14044.75 -24778.35 -30211.09 -44551.45 *
## (7629.70) (23862.87) (23557.86) (22407.73)
## SqFt:factor(Zip == 6)TRUE 39.14 ***
## (5.89)
## SqFt:factor(BedRooms > 2)TRUE 11.57 14.77 19.41
## (20.65) (20.37) (19.31)
## Baths 27531.78 ***
## (3647.73)
## -----------------------------------------------------------------------------------------------------------------------
## R^2 0.75 0.75 0.77 0.74 0.75 0.77
## Adj. R^2 0.74 0.75 0.77 0.74 0.75 0.77
## Num. obs. 499 499 499 499 499 499
## RMSE 39582.51 39487.55 37963.23 40086.20 39506.46 37442.37
## =======================================================================================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
Observe how the coefficient estimate for SqFt changes from modc to modf: as the estimate decreases, it earns fewer stars and, therefore, becomes less statistically significant.
Since our data come from a random sample, the dataset is one among many possible outcomes of a random variable.
We estimate our model coefficients based on the dataset; therefore, the coefficients we obtain are also going to be random.
Let’s pretend that we could run the same regression model using data from the whole population and obtain a set of coefficients for the model.
Compared to the coefficients obtained from the population model (if we COULD run it), the coefficients we obtain by regression on the random sample are subject to errors from the random sampling procedure.
In choosing an optimal model, we use two technical criteria:
- The simpler the model and the fewer predictor variables, the better. (The technical word is “parsimonious”, which is a fancy word for “stingy”.) For example, model b uses one more variable than model a. This means that, unless model b makes better predictions than model a (it does not), the additional variable is an unnecessary complication.
- The less (total squared) prediction error, the better. A direct measurement of how well a model reduces predictive errors is called R^2. For example, the R^2 of model c above is 0.77. This means that the model removes roughly 77% of the prediction error we would make by predicting Price solely with the average price in the sample. The higher this percentage, the better a model is at reducing prediction errors.
To balance model complexity against prediction errors for regression models, we learn the two most widely used criteria: adjusted R-squared and AIC. A quick base-R check is sketched below.
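Both criteria can be pulled from base R before we turn to the packaged comparison (a minimal sketch; lower AIC is better, higher adjusted R-squared is better):
#penalized-fit comparison of the three models
AIC(moda, modb, modc)
#adjusted R-squared of one model
summary(modc)$adj.r.squared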
The command compare_performance() from the “performance” package provides a comprehensive list of criteria, including some we have not yet learned.
#load package "performance" to compare model's performance based on a number of criterion
library(performance)
compare_performance(moda, modb, modc)
## # Comparison of Model Performance Indices
##
## Model | Type | AIC | BIC | R2 | R2_adjusted | RMSE | BF
## --------------------------------------------------------------------------------
## moda | lm | 11986.06 | 12002.91 | 0.75 | 0.74 | 39463.35 |
## modb | lm | 11984.66 | 12005.72 | 0.75 | 0.75 | 39328.96 | 0.25
## modc | lm | 11945.37 | 11966.43 | 0.77 | 0.77 | 37810.76 | 8.34868e+07
For convenience, though, the tab_model() command reports adjusted R-squared by default. We can add AIC quite simply with an option:
tab_model(moda, modb, modc,
show.aic = TRUE, show.ci = FALSE,collapse.se = TRUE, p.style = "asterisk")
|  | Price | Price | Price |
|---|---|---|---|
| Predictors | Estimates | Estimates | Estimates |
| (Intercept) | -23839.35 *** (5034.44) | -13132.97 (7684.51) | -6435.14 (5492.27) |
| SqFt | 97.01 *** (2.70) | 98.31 *** (2.78) | 86.75 *** (3.01) |
| factor(Zip == 6)TRUE | 15420.93 *** (4046.16) | 15896.37 *** (4044.71) | -61378.92 *** (12184.30) |
| factor(BedRooms > 2)TRUE |  | -14044.75 (7629.70) |  |
| SqFt : factor(Zip == 6)TRUE |  |  | 39.14 *** (5.89) |
| Observations | 499 | 499 | 499 |
| R2 / R2 adjusted | 0.746 / 0.745 | 0.748 / 0.746 | 0.767 / 0.765 |
| AIC | 11986.062 | 11984.658 | 11945.369 |
Criteria such as AIC and adjusted R-squared are subject to random sampling errors. Why?
Since model estimates from a randomly sampled dataset are all random, the AIC and adjusted R-squared we obtain from the models are also random.
To test the difference in performance between 2 models, we use the command anova():
anova(moda,modb)
## Analysis of Variance Table
##
## Model 1: Price ~ SqFt + factor(Zip == 6)
## Model 2: Price ~ SqFt + factor(Zip == 6) + factor(BedRooms > 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 496 7.7712e+11
## 2 495 7.7184e+11 1 5283649620 3.3885 0.06625 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(moda,modc)
## Analysis of Variance Table
##
## Model 1: Price ~ SqFt + factor(Zip == 6)
## Model 2: Price ~ SqFt * factor(Zip == 6)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 496 7.7712e+11
## 2 495 7.1340e+11 1 6.3723e+10 44.215 7.796e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So, even though model b has a better (lower) AIC than model a, this difference could have been due to random sampling error. To test this, we applied the anova test above; its p-value of 0.066 means we cannot rule out chance at the 5% level.
In contrast, model modc’s improvement over moda is very unlikely to be due to random sampling error.
Let’s revisit the list of optimal models we created with regsubsets() at the beginning of the discussion:
summary(opmodels)
## Subset selection object
## Call: regsubsets.formula(Price ~ SqFt + factor(Zip) + BedRooms + Baths +
## Garage + SqFt * factor(Zip) + SqFt * (Garage > 1) + SqFt *
## (Baths > 2), data = house, nbest = 1, method = "seqrep")
## 14 Variables (and intercept)
## Forced in Forced out
## SqFt FALSE FALSE
## factor(Zip)5 FALSE FALSE
## factor(Zip)6 FALSE FALSE
## factor(Zip)9 FALSE FALSE
## BedRooms FALSE FALSE
## Baths FALSE FALSE
## Garage FALSE FALSE
## Garage > 1TRUE FALSE FALSE
## Baths > 2TRUE FALSE FALSE
## SqFt:factor(Zip)5 FALSE FALSE
## SqFt:factor(Zip)6 FALSE FALSE
## SqFt:factor(Zip)9 FALSE FALSE
## SqFt:Garage > 1TRUE FALSE FALSE
## SqFt:Baths > 2TRUE FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: 'sequential replacement'
## SqFt factor(Zip)5 factor(Zip)6 factor(Zip)9 BedRooms Baths Garage
## 1 ( 1 ) "*" " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " "*" " "
## 3 ( 1 ) "*" " " " " " " " " "*" " "
## 4 ( 1 ) "*" " " "*" " " " " "*" " "
## 5 ( 1 ) "*" " " "*" " " " " "*" "*"
## 6 ( 1 ) " " " " "*" " " " " "*" "*"
## 7 ( 1 ) " " " " "*" " " " " "*" "*"
## 8 ( 1 ) " " " " "*" " " " " "*" "*"
## Garage > 1TRUE Baths > 2TRUE SqFt:factor(Zip)5 SqFt:factor(Zip)6
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " "*"
## 4 ( 1 ) " " " " " " "*"
## 5 ( 1 ) " " " " " " "*"
## 6 ( 1 ) "*" " " " " "*"
## 7 ( 1 ) "*" " " "*" "*"
## 8 ( 1 ) "*" "*" " " "*"
## SqFt:factor(Zip)9 SqFt:Garage > 1TRUE SqFt:Baths > 2TRUE
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " "*" " "
## 7 ( 1 ) " " "*" " "
## 8 ( 1 ) " " "*" "*"
Exercise: Code 5 regression models based on the combinations suggested by regsubsets() above for 1, 2, 4, 5, and 6 optimal predictor variables, respectively. (We skip mod3 because it takes more work than it is worth.) Name the models mod1, mod2, mod4, mod5, and mod6, respectively. I have written the first three for you.
mod1=lm(Price~SqFt, data = house)
mod2=lm(Price~SqFt+Baths, data = house)
mod4=lm(Price~SqFt+Baths+SqFt*(Zip=="6"), data = house)
mod5=lm(Price~SqFt+Baths+Garage+SqFt*(Zip=="6"), data = house)
mod6=lm(Price~Baths+Garage+SqFt*(Zip=="6")+ SqFt*(Garage>1), data = house)
Exercise: Use the commands tab_model() or screenreg() to summarize the regression estimates from the models above.
screenreg(list(mod1, mod2, mod4, mod5, mod6))
##
## ==============================================================================================
## Model 1 Model 2 Model 3 Model 4 Model 5
## ----------------------------------------------------------------------------------------------
## (Intercept) -23906.23 *** -42467.46 *** -24375.75 *** -29267.37 *** 25011.56 *
## (5102.45) (5500.07) (5753.21) (5844.76) (12323.69)
## SqFt 99.49 *** 79.05 *** 67.16 *** 65.11 *** 16.34
## (2.66) (3.80) (3.90) (3.90) (9.72)
## Baths 26366.04 *** 25598.88 *** 21124.71 *** 24224.99 ***
## (3659.06) (3461.69) (3639.13) (3569.82)
## Zip == "6"TRUE -63679.57 *** -65883.24 *** -56596.49 ***
## (11577.07) (11455.66) (11139.77)
## SqFt:Zip == "6"TRUE 39.34 *** 40.45 *** 36.29 ***
## (5.59) (5.53) (5.37)
## Garage 9904.63 *** 22398.66 ***
## (2750.24) (4447.87)
## Garage > 1TRUE -93675.19 ***
## (14809.49)
## SqFt:Garage > 1TRUE 51.00 ***
## (9.70)
## ----------------------------------------------------------------------------------------------
## R^2 0.74 0.76 0.79 0.80 0.81
## Adj. R^2 0.74 0.76 0.79 0.79 0.81
## Num. obs. 499 499 499 499 499
## RMSE 40117.51 38207.87 36058.22 35629.15 34326.87
## ==============================================================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
regsubsets creates the candidate models and suggests that the 6-predictor model is the best, but note that its suggested 6-predictor combination does not include SqFt as a standalone term.
Observation: In purely technical terms (AIC), we can keep improving the models by adding the predictor variables recommended by regsubsets(). We know this because, as we add them, R-squared increases and AIC decreases.
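We can see this pattern in one table (a sketch reusing compare_performance() from the performance package loaded earlier):
#AIC, BIC, R2, adjusted R2, and RMSE for all five candidates
compare_performance(mod1, mod2, mod4, mod5, mod6)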
Exercise: Conduct an anova test to compare performance of mod5 and mod6 and see if the improvement from mod5 to mod6 is due to random sampling errors.
anova(mod5,mod6)
## Analysis of Variance Table
##
## Model 1: Price ~ SqFt + Baths + Garage + SqFt * (Zip == "6")
## Model 2: Price ~ Baths + Garage + SqFt * (Zip == "6") + SqFt * (Garage >
## 1)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 493 6.2583e+11
## 2 491 5.7856e+11 2 4.727e+10 20.058 4.232e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What is a disadvantage of models mod1 and mod2?
What is a disadvantage of model mod6 (and of models with more interaction terms)?
Exercise: Conduct an anova test to compare the performance of mod4 and mod5:
Between models mod4 and mod5, which one is more reliable and useful?
Exercise: What if we replace Baths by factor(Baths) in model 5? Compare the performance of this model with model 5.
Exercise: What if we replace Garage by factor(Garage) in model 5? Compare the performance of this model with model 5.
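A possible starting point for these two exercises (a sketch; the object names mod5_fb and mod5_fg are my own):
#model 5 with Baths entered as a factor
mod5_fb = lm(Price~SqFt+factor(Baths)+Garage+SqFt*(Zip=="6"), data = house)
#model 5 with Garage entered as a factor
mod5_fg = lm(Price~SqFt+Baths+factor(Garage)+SqFt*(Zip=="6"), data = house)
compare_performance(mod5, mod5_fb, mod5_fg)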
In summary: to predict a response variable Y based on predictor variables X, we code the command lm(Y~X, data = ).
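Putting the pieces together, a generic template (a sketch with hypothetical placeholder names y, x1, x2, and df):
#main effects plus an interaction with a factor variable
fit <- lm(y ~ x1*factor(x2), data = df)
summary(fit)  #coefficient estimates and significance stars
AIC(fit)      #penalized fit, for comparing candidate models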