0.1 Introduction

BEH Statistical Consulting presents this report to ACME Real Estate Investing, Inc. Our task is to develop a model to predict the selling price of a given home in Ames, Iowa. ACME can use this information to identify undervalued properties that are good investment opportunities.

Our model performs extremely well, accounting for roughly 93% of the variability in price. The root mean squared error (RMSE) is about $17,500.

In the rest of this report, we explain how we developed our model and discuss its implications.

0.2 Data

Our model is based on the ames dataset, which records the details of 2580 property sales in Ames, Iowa, with 81 variables describing each record. The dataset has been partitioned at random into three pieces: a training data set, a testing data set, and a validation data set. Our model is developed with the training data, then tested and validated with the other data subsets.

Although the data is comprehensive, it is biased by its exclusive focus on the years 2006-2010, which closely encompass the Great Recession. These years were extremely bad for real estate, to put it very mildly. Generalizing from our results should take this history into account.

Statistically speaking, we report correlations in this report. It is intuitive to associate causality with the prediction variables in our models, but because our research is purely observational, we can only scientifically report correlation, not causation. Any causal language used in this report is colloquial and should not be interpreted scientifically.

Our ames dataset is adapted from Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project [DeCock 2011].

load("ames_train.Rdata")
load("ames_test.Rdata")
load("ames_validation.Rdata")

We model with R using the following packages:

library(BAS)
## Warning: package 'BAS' was built under R version 3.4.2
library(MASS)
library(lm.beta)
library(tidyverse)
## Warning: package 'dplyr' was built under R version 3.4.2
library(png)
library(grid)

1 Exploratory Data Analysis (EDA)

1.1 Clean data

As a critical prelude to EDA, we first look for glitches in the data that would compromise our analysis. We find many glitches and clean them. In general the glitches are of the following types:
  • Factor variables with blank "" levels, which we repair;
  • Erroneous values of MS.SubClass for several properties, which we repair;
  • Variables coded as integers that should be factors, which we repair.
The following are also repaired. These are technically not glitches but do adversely impact modeling:
  • Abnormal sales are excluded from our model;
  • Missing values (NA) are, as appropriate, recoded as None or 0;
  • Factor variables that are intrinsically scalar are converted to integers;
  • Factor levels with very low representation (<10) are merged, as appropriate, into factor levels with more significant sample sizes (>10).

In order to test and validate our model, we need to clean ames_test and ames_validation exactly the same way that we clean ames_train. Because our cleaning is extensive, we do it just once on a combined collection of all the data (training, testing, and validating). After cleaning all the data, we partition the full dataset into exactly the same randomized subsets, ames_train, ames_test, and ames_validation.

# Compile all data into one place for cleaning
ames.clean.all <- rbind(ames_train,ames_test,ames_validation)

# keep track of original partitions
PID.train <- ames_train %>% select(PID)
PID.test <- ames_test %>% select(PID)
PID.validation <- ames_validation %>% select(PID)

1.1.1 Repair blank factor levels

The data include many factor-type variables. Several of these variables have an extraneous level of "" that we remove.

# Bsmt.Qual
ames.clean.all$Bsmt.Qual <- factor(ames.clean.all$Bsmt.Qual, levels = levels(ames.clean.all$Bsmt.Qual)[2:5])

# Bsmt.Cond
ames.clean.all$Bsmt.Cond <- factor(ames.clean.all$Bsmt.Cond, levels = levels(ames.clean.all$Bsmt.Cond)[2:6])

# Bsmt.Exposure
ames.clean.all$Bsmt.Exposure <- factor(ames.clean.all$Bsmt.Exposure, levels = levels(ames.clean.all$Bsmt.Exposure)[2:5])

# BsmtFin.Type.1
ames.clean.all$BsmtFin.Type.1 <- factor(ames.clean.all$BsmtFin.Type.1, levels = levels(ames.clean.all$BsmtFin.Type.1)[2:7])

# BsmtFin.Type.2
ames.clean.all$BsmtFin.Type.2 <- factor(ames.clean.all$BsmtFin.Type.2, levels = levels(ames.clean.all$BsmtFin.Type.2)[2:7])

# Garage.Qual
ames.clean.all$Garage.Qual <- factor(ames.clean.all$Garage.Qual, levels = levels(ames.clean.all$Garage.Qual)[2:6])

# Garage.Cond
ames.clean.all$Garage.Cond <- factor(ames.clean.all$Garage.Cond, levels = levels(ames.clean.all$Garage.Cond)[2:6])

# Garage.Finish
ames.clean.all$Garage.Finish <- factor(ames.clean.all$Garage.Finish, levels = levels(ames.clean.all$Garage.Finish)[2:4])

# Electrical
ames.clean.all$Electrical <- factor(ames.clean.all$Electrical, levels = levels(ames.clean.all$Electrical)[2:6])

# Mas.Vnr.Type
ames.clean.all$Mas.Vnr.Type <- factor(ames.clean.all$Mas.Vnr.Type, levels = levels(ames.clean.all$Mas.Vnr.Type)[2:6])

1.1.2 Repair MS.SubClass

Variable MS.SubClass codes dwelling type, and it is defined in terms of House.Style, Bldg.Type, and Year.Built. Unfortunately this variable is miscoded quite a few times in the data. Rather than track down each error, we recalculate MS.SubClass from scratch according to its definition:

ames.clean.all[(ames.clean.all$Bldg.Type=="1Fam" &
              ames.clean.all$House.Style=="1Story" &
              ames.clean.all$Year.Built>=1946 &
              ames.clean.all$MS.SubClass!=40),"MS.SubClass"] <- 20
ames.clean.all[(ames.clean.all$Bldg.Type=="1Fam" &
              ames.clean.all$House.Style=="1Story" &
              ames.clean.all$Year.Built<1946 &
              ames.clean.all$MS.SubClass!=40),"MS.SubClass"] <- 30
ames.clean.all[(ames.clean.all$Bldg.Type=="1Fam" &
              ames.clean.all$House.Style=="1.5Unf"),"MS.SubClass"] <- 45
ames.clean.all[(ames.clean.all$Bldg.Type=="1Fam" &
              ames.clean.all$House.Style=="1.5Fin"),"MS.SubClass"] <- 50
ames.clean.all[(ames.clean.all$Bldg.Type=="1Fam" &
              ames.clean.all$House.Style=="2Story" &
              ames.clean.all$Year.Built>=1946),"MS.SubClass"] <- 60
ames.clean.all[(ames.clean.all$Bldg.Type=="1Fam" &
              ames.clean.all$House.Style=="2Story" &
              ames.clean.all$Year.Built<1946),"MS.SubClass"] <- 70
ames.clean.all[(ames.clean.all$Bldg.Type=="1Fam" &
                  (ames.clean.all$House.Style=="2.5Unf" |
                     ames.clean.all$House.Style=="2.5Fin")),"MS.SubClass"] <- 75
ames.clean.all[(ames.clean.all$Bldg.Type=="1Fam" &
              ames.clean.all$House.Style=="SLvl"),"MS.SubClass"] <- 80
ames.clean.all[(ames.clean.all$Bldg.Type=="1Fam" &
              ames.clean.all$House.Style=="SFoyer"),"MS.SubClass"] <- 85
ames.clean.all[ames.clean.all$Bldg.Type=="Duplex","MS.SubClass"] <- 90
ames.clean.all[((ames.clean.all$Bldg.Type=="Twnhs" | ames.clean.all$Bldg.Type=="TwnhsE") &
              ames.clean.all$House.Style=="1Story" &
              ames.clean.all$Year.Built<1946),"MS.SubClass"] <- 120
ames.clean.all[((ames.clean.all$Bldg.Type=="Twnhs" | ames.clean.all$Bldg.Type=="TwnhsE") &
              (ames.clean.all$House.Style=="1.5Fin" |
                 ames.clean.all$House.Style=="1.5Unf")), "MS.SubClass"] <- 150
ames.clean.all[((ames.clean.all$Bldg.Type=="Twnhs" | ames.clean.all$Bldg.Type=="TwnhsE") &
              ames.clean.all$House.Style=="2Story" &
              ames.clean.all$Year.Built>=1946),"MS.SubClass"] <- 160
ames.clean.all[((ames.clean.all$Bldg.Type=="Twnhs" | ames.clean.all$Bldg.Type=="TwnhsE") &
              (ames.clean.all$House.Style=="SLvl" |
                 ames.clean.all$House.Style=="SFoyer") &
              ames.clean.all$Year.Built>=1946),"MS.SubClass"] <- 180
ames.clean.all[ames.clean.all$Bldg.Type=="2fmCon","MS.SubClass"] <- 190

1.1.3 Repair missing values

Some variables have large numbers of NAs. We replace NA with None (or 0) because NAs are excluded from the model-buildling, wheras we want to model None as a legitimate value.

addLevel <- function(x, newlevel=NULL){ 
  if(is.factor(x)) 
    return(factor(x, levels=c(levels(x), newlevel))) 
  return(x) 
}

# Pool.QC
ames.clean.all$Pool.QC <- addLevel(ames.clean.all$Pool.QC, newlevel = "None")
ames.clean.all[is.na(ames.clean.all$Pool.QC),"Pool.QC"] <- "None"

# Misc.Feature
ames.clean.all$Misc.Feature <- addLevel(ames.clean.all$Misc.Feature, newlevel = "None")
ames.clean.all[is.na(ames.clean.all$Misc.Feature),"Misc.Feature"] <- "None"

# Alley
ames.clean.all$Alley <- addLevel(ames.clean.all$Alley, newlevel = "None")
ames.clean.all[is.na(ames.clean.all$Alley),"Alley"] <- "None"

# Fence
ames.clean.all$Fence <- addLevel(ames.clean.all$Fence, newlevel = "None")
ames.clean.all[is.na(ames.clean.all$Fence),"Fence"] <- "None"

# Fireplace.Qu
ames.clean.all$Fireplace.Qu <- addLevel(ames.clean.all$Fireplace.Qu, newlevel = "None")
ames.clean.all[is.na(ames.clean.all$Fireplace.Qu),"Fireplace.Qu"] <- "None"

# Lot.Frontage -- NA means 0
ames.clean.all[is.na(ames.clean.all$Lot.Frontage),"Lot.Frontage"] <- 0

# Garage.Yr.Blt
# Not adjusted here because we remove this variable entirely from consideration

# Garage.Qual
ames.clean.all$Garage.Qual <- addLevel(ames.clean.all$Garage.Qual, newlevel = "None")
ames.clean.all[is.na(ames.clean.all$Garage.Qual),"Garage.Qual"] <- "None"

# Garage.Cond
ames.clean.all$Garage.Cond <- addLevel(ames.clean.all$Garage.Cond, newlevel = "None")
ames.clean.all[is.na(ames.clean.all$Garage.Cond),"Garage.Cond"] <- "None"

# Garage.Type
ames.clean.all$Garage.Type <- addLevel(ames.clean.all$Garage.Type, newlevel = "None")
ames.clean.all[is.na(ames.clean.all$Garage.Type),"Garage.Type"] <- "None"

# Garage.Finish
ames.clean.all$Garage.Finish <- addLevel(ames.clean.all$Garage.Finish, newlevel = "None")
ames.clean.all[is.na(ames.clean.all$Garage.Finish),"Garage.Finish"] <- "None"

# Bsmt.Qual
ames.clean.all$Bsmt.Qual <- addLevel(ames.clean.all$Bsmt.Qual, newlevel = "None")
ames.clean.all[is.na(ames.clean.all$Bsmt.Qual),"Bsmt.Qual"] <- "None"

# Bsmt.Cond
ames.clean.all$Bsmt.Cond <- addLevel(ames.clean.all$Bsmt.Cond, newlevel = "None")
ames.clean.all[is.na(ames.clean.all$Bsmt.Cond),"Bsmt.Cond"] <- "None"

# Bsmt.Exposure
ames.clean.all$Bsmt.Exposure <- addLevel(ames.clean.all$Bsmt.Exposure, newlevel = "None")
ames.clean.all[is.na(ames.clean.all$Bsmt.Exposure),"Bsmt.Exposure"] <- "None"

# BsmtFin.Type.1
ames.clean.all$BsmtFin.Type.1 <- addLevel(ames.clean.all$BsmtFin.Type.1, newlevel = "None")
ames.clean.all[is.na(ames.clean.all$BsmtFin.Type.1),"BsmtFin.Type.1"] <- "None"

# BsmtFin.Type.2
ames.clean.all$BsmtFin.Type.2 <- addLevel(ames.clean.all$BsmtFin.Type.2, newlevel = "None")
ames.clean.all[is.na(ames.clean.all$BsmtFin.Type.2),"BsmtFin.Type.2"] <- "None"

# Bsmt.Full.Bath -- NA means 0
ames.clean.all[is.na(ames.clean.all$Bsmt.Full.Bath),"Bsmt.Full.Bath"] <- 0
# Ignore Bsmt.Half.Bath completely

# Mas.Vnr.Area -- NA means 0 
ames.clean.all[is.na(ames.clean.all$Mas.Vnr.Area),"Mas.Vnr.Area"] <- 0

#Mas.Vnr.Type
ames.clean.all[is.na(ames.clean.all$Mas.Vnr.Type),"Mas.Vnr.Type"] <- "None"

1.1.4 Exclude abnormal sales

The data include broadly different categories of sales, as described by the variable Sale.Condition. We focus only on Normal sales, which are 834 of 1000. The other categories of sales include foreclosures, short sales, partial sales, and family sales, which potentially follow entirely different pricing models than normal sales.

ames.clean.all <- ames.clean.all %>% filter(Sale.Condition=="Normal")

1.1.5 Repair non-scalar integers

A couple variables are stored as integers even though the magnitude of the integer is meaningless to the variable. These variables should be factors, not integers:

# MS.subclass
ames.clean.all$MS.SubClass <- as.factor(ames.clean.all$MS.SubClass)

# Month Sold
ames.clean.all$Mo.Sold <- as.factor(ames.clean.all$Mo.Sold)

# Technically PID is another case, but we never use it in the linear modeling

1.1.6 Convert scalar factors to integers

Sometimes a factor variable has a clearly implied magnitude. For example, a Likert scale of poor, fair, typical, good, excellent has a clear progression from low (poor) to high (excellent), and we can reasonably recode it as 1, 2, 3, 4, 5. It will help our ability to develop a linear model if we identify every such “scalar” factor and convert it to an integer:

# silence warnings in this block
oldw <- getOption("warn")
options(warn = -1)

# "Quality" likert scales

numeric_score <- 
  data.frame(Kitchen.Qual = c("None","Po","Fa","TA","Gd","Ex"),
             Kitchen.Qual.int = c(0,1,2,3,4,5))
ames.clean.all <- left_join(ames.clean.all,numeric_score,by="Kitchen.Qual")

colnames(numeric_score) <- c("Exter.Qual", "Exter.Qual.int")
ames.clean.all <- left_join(ames.clean.all,numeric_score,by="Exter.Qual")

colnames(numeric_score) <- c("Bsmt.Qual", "Bsmt.Qual.int")
ames.clean.all <- left_join(ames.clean.all,numeric_score,by="Bsmt.Qual")

colnames(numeric_score) <- c("Garage.Qual", "Garage.Qual.int")
ames.clean.all <- left_join(ames.clean.all,numeric_score,by="Garage.Qual")

colnames(numeric_score) <- c("Heating.QC", "Heating.QC.int")
ames.clean.all <- left_join(ames.clean.all,numeric_score,by="Heating.QC")

colnames(numeric_score) <- c("Fireplace.Qu", "Fireplace.Qu.int")
ames.clean.all <- left_join(ames.clean.all,numeric_score,by="Fireplace.Qu")

# "Condition" Likert scales

colnames(numeric_score) <- c("Exter.Cond", "Exter.Cond.int")
ames.clean.all <- left_join(ames.clean.all,numeric_score,by="Exter.Cond")

colnames(numeric_score) <- c("Bsmt.Cond", "Bsmt.Cond.int")
ames.clean.all <- left_join(ames.clean.all,numeric_score,by="Bsmt.Cond")

colnames(numeric_score) <- c("Garage.Cond", "Garage.Cond.int")
ames.clean.all <- left_join(ames.clean.all,numeric_score,by="Garage.Cond")

# More subtle cases that are judged to be likert scales
# Based on definitions of factor levels in datadocumentation.txt

numeric_score <- 
  data.frame(BsmtFin.Type.1 = c("None","Unf","LwQ","Rec","BLQ","ALQ", "GLQ"),
             BsmtFin.Type.1.int = c(0,1,2,3,4,5,6))
ames.clean.all <- left_join(ames.clean.all,numeric_score,by="BsmtFin.Type.1")

colnames(numeric_score) <- c("BsmtFin.Type.2", "BsmtFin.Type.2.int")
ames.clean.all <- left_join(ames.clean.all,numeric_score,by="BsmtFin.Type.2")

numeric_score <- 
  data.frame(Bsmt.Exposure = c("None","No","Mn","Av","Gd"),
             Bsmt.Exposure.int = c(0,1,2,3,4))
ames.clean.all <- left_join(ames.clean.all,numeric_score,by="Bsmt.Exposure")

numeric_score <- 
  data.frame(Land.Slope = c("Gtl","Mod","Sev"),
             Land.Slope.int = c(0,1,2))
ames.clean.all <- left_join(ames.clean.all,numeric_score,by="Land.Slope")

options(warn = oldw)

1.1.7 Merge underpopulated factor levels

Many factor variables include some levels that occur very infrequently (<10 times) in ames_train. To make each level statistically significant, we merge underpopulated levels. We do this carefully to preserve as much information as possible.

Note that we do not merge any levels of Neighborhood. The neighborhoods with very few houses are all far too distinct to be merged. Generally the smaller neighborhoods are planned developments of townhouses and condos (BrDale, Blueste), or they are planned communities for retirement (GrnHill) or affordable housing (MeadowV).

# MS.Subclass: adjustments to houses with 1.5 stories and/or finished attics
ames.clean.all[ames.clean.all$MS.SubClass=="40",]$House.Style <- "1.5Fin"
ames.clean.all[ames.clean.all$MS.SubClass=="40",]$MS.SubClass <- "50"

# House.Style: equate all 2.5 houses
ames.clean.all$House.Style <- addLevel(ames.clean.all$House.Style, newlevel = "2.5Story")
ames.clean.all[ames.clean.all$MS.SubClass=="75",]$House.Style <- "2.5Story"

# MS.Zoning: equate all non-residential zones
ames.clean.all$MS.Zoning <- addLevel(ames.clean.all$MS.Zoning, newlevel = "Other")
ames.clean.all[ames.clean.all$MS.Zoning=="A (agr)",]$MS.Zoning <- "Other"
ames.clean.all[ames.clean.all$MS.Zoning=="C (all)",]$MS.Zoning <- "Other"
ames.clean.all[ames.clean.all$MS.Zoning=="I (all)",]$MS.Zoning <- "Other"

# Condition1.2: compile condition.1 and condition.2 into one variable
ames.clean.all$Condition.1 <- as.character(ames.clean.all$Condition.1)
ames.clean.all$Condition.2 <- as.character(ames.clean.all$Condition.2)
ames.clean.all <- ames.clean.all %>% 
  mutate(Condition.1.2 =
             as.character(
               ifelse(Condition.2=="Norm" | Condition.1==Condition.2,
                      Condition.1,
                      "Feedr + RR"))) %>%
  select(-Condition.1,
         -Condition.2)
ames.clean.all$Condition.1.2 <- as.factor(ames.clean.all$Condition.1.2)

# Lot.Shape: equate IR3 and IR2
ames.clean.all[ames.clean.all$Lot.Shape=="IR3",]$Lot.Shape <- "IR2"

# Lot.Config: equate FR3 and FR2
ames.clean.all[ames.clean.all$Lot.Config=="FR3",]$Lot.Config <- "FR2"

# we will ignore Land.Slope completely

# Roof.Style: equate Mansard and Gambrel
ames.clean.all[ames.clean.all$Roof.Style=="Mansard",]$Roof.Style <- "Gambrel"


# Exterior.1st & 2nd: multiple adjustments
ames.clean.all$Exterior.1st <- addLevel(ames.clean.all$Exterior.1st, newlevel = "Other")

ames.clean.all[ames.clean.all$Exterior.1st=="AsphShn",]$Exterior.1st <- "Other"
ames.clean.all[ames.clean.all$Exterior.2nd=="AsphShn",]$Exterior.2nd <- "Other"

ames.clean.all[ames.clean.all$Exterior.1st=="BrkComm",]$Exterior.1st <- "BrkFace"
# ames.clean.all[ames.clean.all$Exterior.2nd=="Brk Com",]$Exterior.2nd <- "BrkFace"

ames.clean.all[ames.clean.all$Exterior.1st=="CBlock",]$Exterior.1st <- "Other"
ames.clean.all[ames.clean.all$Exterior.1st=="PreCast",]$Exterior.1st <- "Other"
ames.clean.all[ames.clean.all$Exterior.1st=="ImStucc",]$Exterior.1st <- "Other"
ames.clean.all[ames.clean.all$Exterior.2nd=="ImStucc",]$Exterior.2nd <- "Other"

# Mas.Vnr.Type: equate types of brick
ames.clean.all[ames.clean.all$Mas.Vnr.Type=="BrkCmn",]$Mas.Vnr.Type <- "BrkFace"

# Foundation: equate "other" types
ames.clean.all$Foundation <- addLevel(ames.clean.all$Foundation, newlevel = "Other")
ames.clean.all[ames.clean.all$Foundation=="Slab",]$Foundation <- "Other"
ames.clean.all[ames.clean.all$Foundation=="Stone",]$Foundation <- "Other"
ames.clean.all[ames.clean.all$Foundation=="Wood",]$Foundation <- "Other"

# Electrical: equate poor and fair old-style fuses
ames.clean.all[is.na(ames.clean.all$Electrical),"Electrical"] <- "FuseF"
ames.clean.all[ames.clean.all$Electrical=="FuseP",]$Electrical <- "FuseF"

# Functional: equate Maj2 and Maj1
ames.clean.all[ames.clean.all$Functional=="Maj2",]$Functional <- "Maj1"

# sale.type: multiple adjustments
ames.clean.all[is.na(ames.clean.all$Sale.Type),"Sale.Type"] <- "Oth"
ames.clean.all[ames.clean.all$Sale.Type=="CWD","Sale.Type"] <- "WD "
ames.clean.all[ames.clean.all$Sale.Type=="VWD","Sale.Type"] <- "WD "
ames.clean.all[ames.clean.all$Sale.Type=="ConLD","Sale.Type"] <- "Con"
ames.clean.all[ames.clean.all$Sale.Type=="ConLI","Sale.Type"] <- "Con"
ames.clean.all[ames.clean.all$Sale.Type=="ConLw","Sale.Type"] <- "Con"

1.1.8 Restore data partitions

We are done cleaning. Now we restore the original data partitions. Any transformations from here on must be applied to all three partitions.

# Save original data
ames_train_original <- ames_train
ames_test_original <- ames_test
ames_validation_original <- ames_validation

# Put cleaned data in its place
ames_train <- inner_join(PID.train, ames.clean.all, by = "PID")
ames_test <- inner_join(PID.test, ames.clean.all, by = "PID")
ames_validation <- inner_join(PID.validation, ames.clean.all, by = "PID")

1.2 EDA

1.2.1 Variable transformations

Our primary variable of interest, price, is extremely right-skewed. By using the logarithm of price, we obtain a more normally distributed variable that is more suitable for our modeling.

Similarly, Lot.Area is very right-skewed, and it is an important predictor of price. We log-transform Lot.Area as well.

ames_train <- ames_train %>% mutate(
  log_price = log(price),
  log_lot = log(Lot.Area)
)

# Do same to test and validation data

ames_test <- ames_test %>% mutate(
  log_price = log(price),
  log_lot = log(Lot.Area)
)
ames_validation <- ames_validation %>% mutate(
  log_price = log(price),
  log_lot = log(Lot.Area)
)

1.2.2 Stagnating prices

The first significant finding is that we see no growth in prices over time.

ames_train %>% ggplot(aes(x=Yr.Sold,y=price/1000)) + geom_jitter() + geom_smooth() + ggtitle("Prices are stagnant during Great Recession") + xlab("Year Sold") + ylab("Price in $1000 (log scale)") + scale_y_log10(breaks=c(20,50,100,200,500))
## `geom_smooth()` using method = 'loess'

This is bad news for potential investors. The key fact here is that our dataset covers 2006-2010 and so it represents the epicenter of the Great Recession. Given that timing, it is actually surprising that we don’t see prices tanking in 2009-2010. News reports confirm that Ames prices held steady through this horrible time for real estate. A local expert in finance was quoted, “The recession hit the whole country hard, and in places like Ames, what we tended to see was house prices didn’t fall, but they didn’t rise, either.” (See this article from the Ames Tribune.)

Housing prices in Ames are now climbing steadily and have been for years (per the news analysis quoted above). Our model does not reflect this and would need significant retraining to predict the price growth that Ames experiences now.

1.2.3 Neighborhoods

Exact addresses for all property records are available in a free Excel spreadsheet on the reports page of the Ames Assessor’s website. With geoservices.tamu.edu we obtain precise latitude and longitude for each address. We save all this location information in CSV format, which we join to the dataframe here:

ames_train <- 
  left_join(ames_train, read.csv("ames_train_geo2.csv"), by="PID")

With this information we can display longitude=x and latitude=y for a house and integrate charts with a Google map of Ames.

# Locate neighborhood labels

ames_nhood <- ames_train %>% 
  filter(!is.na(Latitude)) %>%
  group_by(Neighborhood) %>%
  summarise(n=n(),
            Longitude=mean(Longitude),
            Latitude=mean(Latitude))

# Nudge for legibility
ames_nhood$Latitude[ames_nhood$Neighborhood=="Somerst"] <-
  ames_nhood$Latitude[ames_nhood$Neighborhood=="Somerst"] - 0.003
ames_nhood$Latitude[ames_nhood$Neighborhood=="Veenker"] <-
  ames_nhood$Latitude[ames_nhood$Neighborhood=="Veenker"] - 0.003
ames_nhood$Latitude[ames_nhood$Neighborhood=="Mitchel"] <-
  ames_nhood$Latitude[ames_nhood$Neighborhood=="Mitchel"] + 0.005
ames_nhood$Longitude[ames_nhood$Neighborhood=="MeadowV"] <-
  ames_nhood$Longitude[ames_nhood$Neighborhood=="MeadowV"] + 0.005
ames_nhood$Longitude[ames_nhood$Neighborhood=="BrDale"] <-
  ames_nhood$Longitude[ames_nhood$Neighborhood=="BrDale"] + 0.002
ames_nhood$Latitude[ames_nhood$Neighborhood=="BrDale"] <-
  ames_nhood$Latitude[ames_nhood$Neighborhood=="BrDale"] + 0.001
ames_nhood$Longitude[ames_nhood$Neighborhood=="NPkVill"] <-
  ames_nhood$Longitude[ames_nhood$Neighborhood=="NPkVill"] + 0.002
ames_nhood$Latitude[ames_nhood$Neighborhood=="NPkVill"] <-
  ames_nhood$Latitude[ames_nhood$Neighborhood=="NPkVill"] - 0.001
ames_nhood$Longitude[ames_nhood$Neighborhood=="NWAmes"] <-
  ames_nhood$Longitude[ames_nhood$Neighborhood=="NWAmes"] - 0.002

googlemap <- readPNG("ames_googlemap.png")
ames_train %>% filter(!is.na(Latitude)) %>%
  ggplot(aes(x=Longitude,y=Latitude,color=Neighborhood)) + 
  annotation_custom(rasterGrob(googlemap,
                               width = unit(1,"npc"),
                               height = unit(1,"npc")),
                    -Inf, Inf, -Inf, Inf) +
  geom_point(alpha=0.15) + 
  scale_y_continuous(limits = c(41.988,42.073)) +
  scale_x_continuous(limits = c(-93.696,-93.565)) +
  geom_label(data=ames_nhood,label=ames_nhood$Neighborhood,alpha=0.8,size=3) + ggtitle("Neighborhoods of Ames, IA")
## Warning: Removed 5 rows containing missing values (geom_point).

The big empty space surrounded by a ring of neighborhoods is the campus of Iowa State University.

1.2.3.1 Neighborhood and Price

Neighborhoods are priced differently from each other, as one might expect. In particular we see most of the highest prices in NRidgHt and `NoRidge, right next to the Ames Country Club.

ames_train %>% filter(!is.na(Latitude)) %>% 
  ggplot(aes(x=Longitude,y=Latitude,color=cut_number(price/1000,7))) + 
    annotation_custom(rasterGrob(googlemap,
                               width = unit(1,"npc"),
                               height = unit(1,"npc")),
                    -Inf, Inf, -Inf, Inf) +
  scale_y_continuous(limits = c(41.988,42.073)) +
  scale_x_continuous(limits = c(-93.696,-93.565)) +
  geom_point() + 
  ggtitle("House price by location") + 
  scale_color_discrete(name="Price in $1000")
## Warning: Removed 5 rows containing missing values (geom_point).

1.2.3.2 Neighborhood and Year Built

Housing construction in Ames has radiated outward over the decades. The very oldest homes (pre-1910) are just east of the university. A ring of 1910-1950 construction surrounds the campus. Newer construction then radiates outward from campus, especially north and west. Much of the south remains non-residential.

ames_train %>% filter(!is.na(Latitude)) %>% 
  mutate(decade=as.integer(trunc(Year.Built/10)*10)) %>% 
  ggplot(aes(x=Longitude,y=Latitude,color=cut_width(decade,20))) + 
    annotation_custom(rasterGrob(googlemap,
                               width = unit(1,"npc"),
                               height = unit(1,"npc")),
                    -Inf, Inf, -Inf, Inf) +
  scale_y_continuous(limits = c(41.988,42.073)) +
  scale_x_continuous(limits = c(-93.696,-93.565)) +
  geom_point() + 
  scale_color_discrete(name="Year Built") + 
  ggtitle("Housing construction history by location")
## Warning: Removed 5 rows containing missing values (geom_point).

1.2.4 Price per square foot vs type of dwelling

Within the typical range of house sizes, the amount of living area has a strong positive linear relationship to price. In a scatter plot, the slope of the line represents the rate of increase, and can be interpreted as something analogous to price per square foot. This slope or rate of increase changes according to the type of dwelling.

We create a new variable Dwell.Type that groups together houses with similar slopes or rates of price increase per area:

simplify_ms.subclass <- data.frame(
  MS.SubClass = 
    as.factor(c(20,30,40,45,50,60,70,75,80,85,90,120,150,160,180,190)),
  Dwell.Type = 
    c("1Fam 1Story", "1Fam 1Story", "40", "1Fam 1Story", "1Fam 2Story", 
      "1Fam 2Story", "1Fam 2Story", "1Fam 2Story", "Split", "Split", 
      "Twnhs2 or Dup", "Twnhs1 or split", "150", "Twnhs2 or Dup", 
      "Twnhs1 or split", "2FamCon")
)

ames_train <- left_join(ames_train, simplify_ms.subclass, by="MS.SubClass")

# Do same to test and validation data

ames_test <- left_join(ames_test, simplify_ms.subclass, by="MS.SubClass")
ames_validation <- left_join(ames_validation, simplify_ms.subclass,
                             by="MS.SubClass")
In the chart below we see that the rate of price increase differs across Dwell.Type generally as follows:
  1. One-story and split-level townhouses have the steepest slope, followed closely by one-story one-family houses;
  2. Two-story houses (one-family, townhouses, and duplex) have generally the next steepest slope;
  3. Two-family conversions have the lowest slope, and split-level one-families the second-lowest.
ames_train %>% ggplot(aes(x=area,y=price/1000,color=Dwell.Type)) + geom_point() + geom_smooth(method = "lm") + ggtitle("Price vs Area trends differ by dwelling type") + xlab("Area (square feet)") + ylab("Price in $1000 (log scale)") + scale_y_log10(breaks=c(20,50,100,200,500))

1.2.5 Price vs year built

All else held equal, a newer house tends to cost more than an older house. We observe this as a positive correlation between Year.Built and log_price.

Below we see evidence that this correlation only holds to a certain age limit, after which a house is simply “very old” and additional aging no longer corresponds to a price decrease. In fact, very old houses appear to increase in price, perhaps as they become more “vintage” or “historical.”

The price impact of aging appears to hit its limit in the 1940s. We incorporate this into our model by creating a new variable floored.Year.Built where any house built before 1946 is simply “very old” and gets a floored-year-built value of 1945.

very.old <- 1945

ames_train <- ames_train %>% mutate(
  floored.Year.Built = ifelse(Year.Built<=very.old,very.old,Year.Built))

# Do the same for test and validation data
ames_test <- ames_test %>% mutate(
  floored.Year.Built = ifelse(Year.Built<very.old,very.old,Year.Built))
ames_validation <- ames_validation %>% mutate(
  floored.Year.Built = ifelse(Year.Built<very.old,very.old,Year.Built))

Below we see that with this transformation, floored.Year.Built has a strong linear correlation to log_price, as opposed to Year.Built, which has a distinctly U-shaped correlation.

gather(ames_train, Year.Built, floored.Year.Built, key=age.method, value=yr.blt) %>% ggplot(aes(x=yr.blt,y=price/1000,color=age.method)) + geom_jitter(alpha=0.2) + geom_smooth() + facet_wrap(~age.method) + ggtitle("Price vs Year Built") + xlab("Year of house construction") + ylab("Price in $1000 (log scale)") + scale_y_log10(breaks=c(20,50,100,200,500))
## `geom_smooth()` using method = 'loess'


2 Model development and assessment

2.1 A simple initial model

Based on our EDA, we consider the following 10 variables to be important determinants of the price of a property:
  • Overall.Qual: The quality rating (from 1 to 10) is one of the strongest predictors of house price.
  • Overall.Cond: Even though most homes receive an average condition rating of 5 out of 10, there is enough variability in this rating to help predict price.
  • floored.Year.Built: As discussed above.
  • Dwell.Type and area: As discussed above.
  • BsmtFin.SF.1: Finished basement living area is not included in area but it impacts price the same way.
  • Neighborhood: Location, location, location! Neighborhoods are priced differently, as discussed above.
  • MS.Zoning: High-density residential is cheaper than medium- or low-density. Non-residential (agricultural, commercial, industrial) is cheaper still.
  • log_lot More land predicts higher price.
  • Kitchen.Qual.int: Doesn’t everybody want an awesome kitchen?

Below we construct a linear model for log_price using these 10 variables. We obtain excellent results, with \(R^2 = 0.937\).

model.ez <- lm(log_price ~
                 Overall.Qual +
                 Overall.Cond +
                 floored.Year.Built +
                 area +
                 log_lot +
                 Neighborhood +
                 Dwell.Type +
                 MS.Zoning +
                 Kitchen.Qual.int +
                 BsmtFin.SF.1,
               data = ames_train)
summary(model.ez)
## 
## Call:
## lm(formula = log_price ~ Overall.Qual + Overall.Cond + floored.Year.Built + 
##     area + log_lot + Neighborhood + Dwell.Type + MS.Zoning + 
##     Kitchen.Qual.int + BsmtFin.SF.1, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57616 -0.05634  0.00549  0.05608  0.30859 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -4.217e-01  8.915e-01  -0.473 0.636333    
## Overall.Qual               7.261e-02  4.763e-03  15.244  < 2e-16 ***
## Overall.Cond               5.042e-02  3.675e-03  13.721  < 2e-16 ***
## floored.Year.Built         5.136e-03  4.425e-04  11.607  < 2e-16 ***
## area                       3.269e-04  1.260e-05  25.948  < 2e-16 ***
## log_lot                    1.096e-01  1.123e-02   9.759  < 2e-16 ***
## NeighborhoodBlueste        6.840e-02  7.449e-02   0.918 0.358777    
## NeighborhoodBrDale        -4.604e-02  6.136e-02  -0.750 0.453321    
## NeighborhoodBrkSide       -1.595e-03  5.109e-02  -0.031 0.975103    
## NeighborhoodClearCr       -6.121e-03  5.413e-02  -0.113 0.910005    
## NeighborhoodCollgCr       -8.925e-02  4.334e-02  -2.059 0.039804 *  
## NeighborhoodCrawfor        6.346e-02  4.899e-02   1.295 0.195527    
## NeighborhoodEdwards       -1.041e-01  4.540e-02  -2.294 0.022044 *  
## NeighborhoodGilbert       -9.091e-02  4.544e-02  -2.000 0.045794 *  
## NeighborhoodGreens         5.839e-02  6.335e-02   0.922 0.356960    
## NeighborhoodGrnHill        3.439e-01  8.177e-02   4.206 2.90e-05 ***
## NeighborhoodIDOTRR        -4.188e-02  5.534e-02  -0.757 0.449355    
## NeighborhoodMeadowV       -7.410e-02  5.264e-02  -1.408 0.159625    
## NeighborhoodMitchel       -4.447e-02  4.521e-02  -0.983 0.325682    
## NeighborhoodNAmes         -3.208e-02  4.471e-02  -0.718 0.473169    
## NeighborhoodNoRidge       -4.267e-02  4.583e-02  -0.931 0.352157    
## NeighborhoodNPkVill       -1.610e-02  6.362e-02  -0.253 0.800238    
## NeighborhoodNridgHt       -7.869e-03  4.320e-02  -0.182 0.855508    
## NeighborhoodNWAmes        -1.227e-01  4.562e-02  -2.689 0.007322 ** 
## NeighborhoodOldTown       -9.708e-02  5.081e-02  -1.911 0.056423 .  
## NeighborhoodSawyer        -5.471e-02  4.565e-02  -1.198 0.231165    
## NeighborhoodSawyerW       -1.295e-01  4.432e-02  -2.921 0.003583 ** 
## NeighborhoodSomerst       -1.446e-02  5.231e-02  -0.276 0.782349    
## NeighborhoodStoneBr       -3.632e-02  4.830e-02  -0.752 0.452357    
## NeighborhoodSWISU         -7.163e-02  5.544e-02  -1.292 0.196751    
## NeighborhoodTimber        -4.699e-02  4.856e-02  -0.968 0.333506    
## NeighborhoodVeenker       -3.234e-02  5.298e-02  -0.610 0.541771    
## Dwell.Type1Fam 2Story     -4.156e-02  1.108e-02  -3.752 0.000188 ***
## Dwell.Type2FamCon         -1.991e-02  2.634e-02  -0.756 0.449783    
## Dwell.TypeSplit            8.863e-04  1.490e-02   0.059 0.952594    
## Dwell.TypeTwnhs1 or split  1.233e-02  1.937e-02   0.637 0.524553    
## Dwell.TypeTwnhs2 or Dup   -8.697e-02  1.832e-02  -4.748 2.44e-06 ***
## MS.ZoningRH               -1.566e-01  5.518e-02  -2.838 0.004660 ** 
## MS.ZoningRL               -2.471e-02  3.681e-02  -0.671 0.502141    
## MS.ZoningRM               -8.966e-02  4.049e-02  -2.214 0.027079 *  
## MS.ZoningOther            -3.454e-01  5.991e-02  -5.766 1.17e-08 ***
## Kitchen.Qual.int           3.960e-02  8.184e-03   4.839 1.57e-06 ***
## BsmtFin.SF.1               1.385e-04  9.759e-06  14.195  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09836 on 791 degrees of freedom
## Multiple R-squared:  0.937,  Adjusted R-squared:  0.9337 
## F-statistic: 280.1 on 42 and 791 DF,  p-value: < 2.2e-16

Discussion: The model coefficients in the above summary table indicate how each variable impacts the price of a house. When everything else is held constant and the variable changes by one unit, the coefficient indicates how much log_price changes. Therefore positive coefficients indicate variables that correlate positively to price, and negative coefficients indicate variables that correlate negatively to price.

As long as the coefficient is close to 0 (roughly between -0.2 and 0.2) then we can easily estimate the variable’s impact on price (as opposed to log_price). Multiply the coefficient by 100 and interpret the result as percent. For each unit change in the variable (all else held constant), the price changes by that percentage.

For example, the coefficient for Overall.Qual is 0.073. The variable Overall.Qual is a rating score from 0 to 10. For each unit increase in that score (all else held constant), the average price of a house increases by 7.3%

For factor variables like Neighborhood, Dwell.Type, and MS.Zoning, it is important to compare coefficients of the different levels for that factor and interpret them relative to each other. For example, the neighborhood GrnHill has a coefficient of 0.34, which is by far the largest of any neighbohood. The next largest is Greens with 0.08. A coefficient of 0.34 actually indicates a percentage price of 40% (0.34 is a bit too far from 0 for our simple percentage conversion to work). A coefficient of 0.08 indicates an 8% percentage increase. So, all else held equal, a house in GrnHill is 32% more expensive on average than one in Greens. This is the premium that elderly people pay to live in the retirement community of Green Hill. (We hope that there are extra services or benefits associated with these retirement homes that offset or justify this premium.)

Looking at another factor, we see the coefficients for Dwell.Type levels indicate a 4-8% price drop associated with 2-story townhouses (again, all else held equal). Our previous EDA on dwelling type does not explain this. With further analysis we could find interactions between our variables that cause this seeming discrepancy.

The last group of coefficients we call out is for MS.Zoning. We see non-residential (Other) corresponds to roughly a 40% price drop The next lowest-priced zoning is high-density residential (RH) with a price drop of about 16%. So, all else held equal, a non-residential house would on average be 24% cheaper than one in a high-density zone.

As indicated by the p-values in the summary, it is very unlikely that the correlations we observe are due to random flukes in our data. We can be confident that these results are meaningful.


2.2 Developing / Selecting a full model

In this section we use two different methods to develop candidate full models: backward elimination with BIC, and backward elimination with AIC. We use backwards elimination because it is an ideal way to take a somewhat-too-large set of candidate variables and winnow it down to a more parsimonious model.

AIC and BIC are similar in how they step backwards, differing in the formula they use to evaluate and compare models. AIC tends to favor models with more variables, while BIC has a more aggresive eliminating effect that tends to produce more parsimony. We can integrate AIC and BIC by starting with the results of BIC and then using the results of AIC to see if there are any variables it includes that we may want to add back into the BIC-derived model.

We choose 65 variables as our starting point for backward elimination. The original data has 81 variables and we have about 100 total after cleaning and EDA. We choose 65 based on two main steps. First we cull variables that are strongly colinear. Second, we use Markov Chain Monte Carlo (MCMC) to explore the possibilities of choosing from the remaining variables. From that MCMC we examine the posterior probabilities of the variable coefficients and choose those with inclusion probability of at least 0.5. We exclude our MCMC calculations from this report, mostly because of technical incompatibility between MCMC and knitting in R. (These technical incompatibilities ultimately do not impact our final results at all.)

candidate.vars <-
  c("PID", "price", "log_price", "Alley", "Bedroom.AbvGr", "Bldg.Type", "Bsmt.Cond.int", "Bsmt.Exposure.int", "Bsmt.Full.Bath", "Bsmt.Half.Bath", "Bsmt.Qual.int", "Bsmt.Unf.SF", "BsmtFin.SF.1", "BsmtFin.SF.2", "BsmtFin.Type.1.int", "BsmtFin.Type.2.int", "Central.Air", "Condition.1.2", "Dwell.Type", "Electrical", "Enclosed.Porch", "Exter.Cond.int", "Exter.Qual.int", "Exterior.1st", "Fence", "Fireplace.Qu.int", "Fireplaces", "floored.Year.Built", "Foundation", "Full.Bath", "Functional", "Garage.Area", "Garage.Cars", "Garage.Cond.int", "Garage.Finish", "Garage.Qual.int", "Garage.Type", "Half.Bath", "Heating.QC.int", "House.Style", "Kitchen.AbvGr", "Kitchen.Qual.int", "Land.Contour", "Land.Slope.int", "log_lot", "Lot.Config", "Lot.Frontage", "Lot.Shape", "Low.Qual.Fin.SF", "Mas.Vnr.Area", "Mas.Vnr.Type", "Misc.Val", "Mo.Sold", "MS.Zoning", "Neighborhood", "Open.Porch.SF", "Overall.Cond", "Overall.Qual", "Paved.Drive", "Sale.Type", "Screen.Porch", "Total.Bsmt.SF", "TotRms.AbvGrd", "Wood.Deck.SF", "X1st.Flr.SF", "X2nd.Flr.SF", "X3Ssn.Porch", "Year.Remod.Add", "Yr.Sold")

ames_train <- ames_train %>% select(candidate.vars)
ames_test <- ames_test %>% select(candidate.vars)
ames_validation <- ames_validation %>% select(candidate.vars)

We create model.full with all candidate variables, to use as a starting point for backward elimination:

model.full <- lm(log_price ~ . -PID -price, data = ames_train)

2.2.1 BIC

Backward elimination with BIC produces a model with 20 variables and \(R^2=0.941\). A small but real improvement over model.ez.

Surprisingly, this model does not include Neighborhood.

model.step.bic <- 
  stepAIC(model.full, direction = "backward", k=log(834), trace = 0)
model.bic <- eval(model.step.bic$call)
summary(model.bic)
## 
## Call:
## lm(formula = log_price ~ Bedroom.AbvGr + Bsmt.Full.Bath + Bsmt.Unf.SF + 
##     BsmtFin.SF.1 + BsmtFin.SF.2 + BsmtFin.Type.1.int + Central.Air + 
##     Exter.Qual.int + Fireplace.Qu.int + floored.Year.Built + 
##     Garage.Cars + Heating.QC.int + Kitchen.Qual.int + log_lot + 
##     MS.Zoning + Overall.Cond + Overall.Qual + Paved.Drive + X1st.Flr.SF + 
##     X2nd.Flr.SF, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50095 -0.05167  0.00238  0.05597  0.52726 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.376e+00  5.848e-01   7.483 1.89e-13 ***
## Bedroom.AbvGr      -1.799e-02  5.932e-03  -3.032 0.002506 ** 
## Bsmt.Full.Bath      2.712e-02  8.701e-03   3.118 0.001888 ** 
## Bsmt.Unf.SF         9.531e-05  1.568e-05   6.080 1.85e-09 ***
## BsmtFin.SF.1        1.549e-04  1.884e-05   8.221 8.02e-16 ***
## BsmtFin.SF.2        1.291e-04  2.397e-05   5.387 9.40e-08 ***
## BsmtFin.Type.1.int  8.721e-03  2.587e-03   3.371 0.000784 ***
## Central.AirY        8.748e-02  1.746e-02   5.011 6.64e-07 ***
## Exter.Qual.int      3.070e-02  1.037e-02   2.962 0.003150 ** 
## Fireplace.Qu.int    1.421e-02  2.380e-03   5.970 3.55e-09 ***
## floored.Year.Built  2.618e-03  2.963e-04   8.837  < 2e-16 ***
## Garage.Cars         2.996e-02  6.407e-03   4.676 3.43e-06 ***
## Heating.QC.int      1.776e-02  4.366e-03   4.067 5.24e-05 ***
## Kitchen.Qual.int    2.915e-02  8.105e-03   3.597 0.000342 ***
## log_lot             9.932e-02  8.338e-03  11.912  < 2e-16 ***
## MS.ZoningRH        -1.946e-01  4.306e-02  -4.520 7.10e-06 ***
## MS.ZoningRL        -6.386e-02  1.799e-02  -3.551 0.000407 ***
## MS.ZoningRM        -1.163e-01  1.971e-02  -5.901 5.32e-09 ***
## MS.ZoningOther     -3.338e-01  4.387e-02  -7.609 7.66e-14 ***
## Overall.Cond        4.258e-02  3.662e-03  11.628  < 2e-16 ***
## Overall.Qual        6.387e-02  4.577e-03  13.954  < 2e-16 ***
## Paved.DriveP       -4.666e-03  2.215e-02  -0.211 0.833248    
## Paved.DriveY        4.990e-02  1.539e-02   3.242 0.001236 ** 
## X1st.Flr.SF         2.799e-04  1.886e-05  14.841  < 2e-16 ***
## X2nd.Flr.SF         2.667e-04  1.320e-05  20.210  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09416 on 809 degrees of freedom
## Multiple R-squared:  0.941,  Adjusted R-squared:  0.9392 
## F-statistic: 537.2 on 24 and 809 DF,  p-value: < 2.2e-16

2.2.2 AIC

Backward elimination with AIC produces a model with many more variables and \(R^2=0.962\). A significant increase in predictive performance. How close can we get to this \(R^2\) without using so many variables?

model.step.aic <- 
  stepAIC(model.full, direction = "backward", k=2, trace = 0)
model.aic <- eval(model.step.aic$call)
summary(model.aic)
## 
## Call:
## lm(formula = log_price ~ Alley + Bldg.Type + Bsmt.Cond.int + 
##     Bsmt.Exposure.int + Bsmt.Full.Bath + Bsmt.Unf.SF + BsmtFin.SF.1 + 
##     BsmtFin.SF.2 + BsmtFin.Type.1.int + Central.Air + Condition.1.2 + 
##     Exter.Qual.int + Exterior.1st + Fireplace.Qu.int + floored.Year.Built + 
##     Foundation + Functional + Garage.Cars + Heating.QC.int + 
##     Kitchen.AbvGr + Kitchen.Qual.int + log_lot + Lot.Config + 
##     Lot.Frontage + Low.Qual.Fin.SF + Mas.Vnr.Area + Mas.Vnr.Type + 
##     Mo.Sold + MS.Zoning + Neighborhood + Open.Porch.SF + Overall.Cond + 
##     Overall.Qual + Paved.Drive + Sale.Type + Screen.Porch + Wood.Deck.SF + 
##     X1st.Flr.SF + X2nd.Flr.SF + X3Ssn.Porch + Year.Remod.Add, 
##     data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50797 -0.04372  0.00039  0.04457  0.28419 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.420e+00  9.413e-01   1.508 0.131885    
## AlleyPave                5.034e-02  2.708e-02   1.859 0.063458 .  
## AlleyNone                4.147e-02  1.950e-02   2.127 0.033741 *  
## Bldg.Type2fmCon         -5.472e-03  2.552e-02  -0.214 0.830244    
## Bldg.TypeDuplex         -8.862e-02  3.038e-02  -2.917 0.003645 ** 
## Bldg.TypeTwnhs          -4.722e-02  2.522e-02  -1.872 0.061617 .  
## Bldg.TypeTwnhsE         -2.208e-02  1.769e-02  -1.248 0.212282    
## Bsmt.Cond.int            1.428e-02  8.095e-03   1.764 0.078127 .  
## Bsmt.Exposure.int        8.447e-03  3.502e-03   2.412 0.016109 *  
## Bsmt.Full.Bath           2.333e-02  7.942e-03   2.937 0.003418 ** 
## Bsmt.Unf.SF              9.339e-05  1.698e-05   5.499 5.31e-08 ***
## BsmtFin.SF.1             1.527e-04  1.903e-05   8.020 4.26e-15 ***
## BsmtFin.SF.2             1.261e-04  2.323e-05   5.427 7.82e-08 ***
## BsmtFin.Type.1.int       7.840e-03  2.328e-03   3.368 0.000798 ***
## Central.AirY             1.008e-01  1.696e-02   5.945 4.29e-09 ***
## Condition.1.2Feedr      -5.475e-04  2.504e-02  -0.022 0.982565    
## Condition.1.2Feedr + RR -2.130e-02  4.458e-02  -0.478 0.632836    
## Condition.1.2Norm        6.903e-02  2.098e-02   3.290 0.001051 ** 
## Condition.1.2PosA        8.067e-02  3.927e-02   2.054 0.040320 *  
## Condition.1.2PosN        8.341e-02  3.564e-02   2.341 0.019527 *  
## Condition.1.2RRAe        4.528e-02  3.725e-02   1.216 0.224490    
## Condition.1.2RRAn        4.614e-02  4.433e-02   1.041 0.298273    
## Condition.1.2RRNe       -5.146e-02  6.402e-02  -0.804 0.421784    
## Condition.1.2RRNn       -7.258e-03  6.218e-02  -0.117 0.907115    
## Exter.Qual.int           2.385e-02  9.659e-03   2.469 0.013780 *  
## Exterior.1stBrkFace      1.283e-01  3.458e-02   3.710 0.000223 ***
## Exterior.1stCemntBd      1.241e-01  4.003e-02   3.101 0.002005 ** 
## Exterior.1stHdBoard      5.258e-02  3.234e-02   1.626 0.104375    
## Exterior.1stMetalSd      8.667e-02  3.160e-02   2.743 0.006239 ** 
## Exterior.1stPlywood      5.695e-02  3.380e-02   1.685 0.092431 .  
## Exterior.1stStucco       7.555e-02  3.733e-02   2.024 0.043341 *  
## Exterior.1stVinylSd      5.486e-02  3.197e-02   1.716 0.086567 .  
## Exterior.1stWd Sdng      7.492e-02  3.146e-02   2.381 0.017502 *  
## Exterior.1stWdShing      7.575e-02  3.654e-02   2.073 0.038503 *  
## Exterior.1stOther        5.184e-02  8.941e-02   0.580 0.562207    
## Fireplace.Qu.int         1.057e-02  2.233e-03   4.734 2.65e-06 ***
## floored.Year.Built       3.299e-03  4.794e-04   6.882 1.28e-11 ***
## FoundationCBlock         2.166e-02  1.285e-02   1.685 0.092341 .  
## FoundationPConc          3.121e-02  1.447e-02   2.158 0.031284 *  
## FoundationOther          1.199e-01  3.046e-02   3.937 9.06e-05 ***
## FunctionalMin1           2.261e-02  4.214e-02   0.537 0.591684    
## FunctionalMin2           2.272e-02  4.084e-02   0.556 0.578183    
## FunctionalMod           -5.999e-02  4.296e-02  -1.396 0.163026    
## FunctionalTyp            3.419e-02  3.722e-02   0.919 0.358642    
## Garage.Cars              3.270e-02  6.136e-03   5.329 1.32e-07 ***
## Heating.QC.int           1.313e-02  4.304e-03   3.050 0.002372 ** 
## Kitchen.AbvGr            5.368e-02  3.019e-02   1.778 0.075789 .  
## Kitchen.Qual.int         1.389e-02  7.473e-03   1.858 0.063518 .  
## log_lot                  7.832e-02  1.085e-02   7.221 1.31e-12 ***
## Lot.ConfigCulDSac        1.942e-02  1.415e-02   1.372 0.170340    
## Lot.ConfigFR2           -1.437e-02  1.659e-02  -0.866 0.386921    
## Lot.ConfigInside         1.215e-02  8.196e-03   1.482 0.138765    
## Lot.Frontage             1.943e-04  9.998e-05   1.943 0.052350 .  
## Low.Qual.Fin.SF          2.024e-04  5.916e-05   3.421 0.000660 ***
## Mas.Vnr.Area             5.875e-05  2.670e-05   2.200 0.028111 *  
## Mas.Vnr.TypeNone         2.012e-02  9.460e-03   2.127 0.033759 *  
## Mas.Vnr.TypeStone       -8.223e-03  1.504e-02  -0.547 0.584825    
## Mo.Sold2                 3.340e-02  2.025e-02   1.650 0.099462 .  
## Mo.Sold3                -3.955e-03  1.769e-02  -0.224 0.823193    
## Mo.Sold4                -1.513e-02  1.764e-02  -0.857 0.391459    
## Mo.Sold5                 3.551e-02  1.654e-02   2.147 0.032158 *  
## Mo.Sold6                 1.616e-02  1.595e-02   1.013 0.311579    
## Mo.Sold7                 1.807e-02  1.652e-02   1.094 0.274362    
## Mo.Sold8                 2.473e-02  1.758e-02   1.407 0.159884    
## Mo.Sold9                 1.718e-02  1.988e-02   0.864 0.387860    
## Mo.Sold10               -9.122e-03  2.082e-02  -0.438 0.661400    
## Mo.Sold11                2.980e-02  2.055e-02   1.451 0.147325    
## Mo.Sold12                2.078e-02  2.105e-02   0.987 0.323814    
## MS.ZoningRH             -5.736e-02  4.878e-02  -1.176 0.240089    
## MS.ZoningRL              9.990e-03  3.397e-02   0.294 0.768791    
## MS.ZoningRM             -2.843e-02  3.734e-02  -0.761 0.446720    
## MS.ZoningOther          -3.001e-01  5.301e-02  -5.660 2.18e-08 ***
## NeighborhoodBlueste     -3.471e-02  6.350e-02  -0.547 0.584842    
## NeighborhoodBrDale      -4.783e-02  5.323e-02  -0.899 0.369145    
## NeighborhoodBrkSide      2.539e-03  4.466e-02   0.057 0.954675    
## NeighborhoodClearCr      1.111e-02  4.676e-02   0.238 0.812259    
## NeighborhoodCollgCr     -7.242e-02  3.712e-02  -1.951 0.051449 .  
## NeighborhoodCrawfor      6.368e-02  4.229e-02   1.506 0.132596    
## NeighborhoodEdwards     -8.631e-02  3.930e-02  -2.196 0.028414 *  
## NeighborhoodGilbert     -6.816e-02  3.934e-02  -1.733 0.083596 .  
## NeighborhoodGreens       5.801e-02  5.584e-02   1.039 0.299204    
## NeighborhoodGrnHill      4.730e-01  7.087e-02   6.673 4.98e-11 ***
## NeighborhoodIDOTRR      -8.901e-03  4.848e-02  -0.184 0.854392    
## NeighborhoodMeadowV     -1.835e-01  5.064e-02  -3.624 0.000311 ***
## NeighborhoodMitchel     -4.332e-02  3.910e-02  -1.108 0.268234    
## NeighborhoodNAmes       -3.979e-02  3.893e-02  -1.022 0.307065    
## NeighborhoodNoRidge     -2.916e-02  4.017e-02  -0.726 0.468116    
## NeighborhoodNPkVill     -8.180e-03  5.766e-02  -0.142 0.887227    
## NeighborhoodNridgHt     -4.649e-03  3.804e-02  -0.122 0.902762    
## NeighborhoodNWAmes      -9.593e-02  4.067e-02  -2.359 0.018600 *  
## NeighborhoodOldTown     -6.345e-02  4.500e-02  -1.410 0.158945    
## NeighborhoodSawyer      -1.954e-02  3.981e-02  -0.491 0.623628    
## NeighborhoodSawyerW     -8.172e-02  3.859e-02  -2.118 0.034521 *  
## NeighborhoodSomerst      1.872e-02  4.581e-02   0.409 0.682829    
## NeighborhoodStoneBr     -7.949e-03  4.216e-02  -0.189 0.850511    
## NeighborhoodSWISU       -4.229e-02  4.752e-02  -0.890 0.373864    
## NeighborhoodTimber      -4.018e-02  4.126e-02  -0.974 0.330477    
## NeighborhoodVeenker     -1.292e-02  4.597e-02  -0.281 0.778712    
## Open.Porch.SF            1.623e-04  5.309e-05   3.058 0.002314 ** 
## Overall.Cond             3.997e-02  3.878e-03  10.306  < 2e-16 ***
## Overall.Qual             4.967e-02  4.413e-03  11.257  < 2e-16 ***
## Paved.DriveP            -4.555e-03  2.027e-02  -0.225 0.822257    
## Paved.DriveY             2.448e-02  1.539e-02   1.590 0.112199    
## Sale.TypeCon             4.615e-02  2.998e-02   1.539 0.124225    
## Sale.TypeOth            -1.460e-01  8.660e-02  -1.685 0.092334 .  
## Sale.TypeWD              2.070e-03  2.204e-02   0.094 0.925218    
## Screen.Porch             1.376e-04  5.683e-05   2.421 0.015730 *  
## Wood.Deck.SF             4.650e-05  2.796e-05   1.663 0.096718 .  
## X1st.Flr.SF              2.601e-04  1.989e-05  13.072  < 2e-16 ***
## X2nd.Flr.SF              2.600e-04  1.099e-05  23.671  < 2e-16 ***
## X3Ssn.Porch              1.715e-04  9.886e-05   1.735 0.083120 .  
## Year.Remod.Add           7.823e-04  2.485e-04   3.148 0.001714 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07975 on 722 degrees of freedom
## Multiple R-squared:  0.9622, Adjusted R-squared:  0.9564 
## F-statistic: 165.6 on 111 and 722 DF,  p-value: < 2.2e-16

2.2.3 Choose model

Our chosen model for now is based on BIC. In later sections we will improve it based on the results of AIC.

model.chosen <- model.bic

2.3 Initial Model Residuals

We check the residuals to confirm that linear modelling assumptions hold true. Everything looks good except for heavy tails in the overall distribution of residuals.

Linear relationship for every variable: The residual scatterplots for all numerical variables show appropriate random scatter around 0. There are many numerical variables we could plot. Here are two of the most important, Overall.Qual and X2nd.Flr.SF:

plot(model.chosen$residuals ~ ames_train$Overall.Qual)

# plot(model.chosen$residuals ~ ames_train$Overall.Cond)
plot(model.chosen$residuals ~ ames_train$X2nd.Flr.SF)

# plot(model.chosen$residuals ~ ames_train$X1st.Flr.SF)
# plot(model.chosen$residuals ~ ames_train$BsmtFin.SF.1)
# plot(model.chosen$residuals ~ ames_train$BsmtFin.SF.2)
# plot(model.chosen$residuals ~ ames_train$Bsmt.Unf.SF)
# plot(model.chosen$residuals ~ ames_train$floored.Year.Built)
# plot(model.chosen$residuals ~ ames_train$log_lot)
# plot(model.chosen$residuals ~ ames_train$Fireplace.Qu.int)
# plot(model.chosen$residuals ~ ames_train$Exter.Qual.int)
# plot(model.chosen$residuals ~ ames_train$BsmtFin.Type.1.int)
# plot(model.chosen$residuals ~ ames_train$Kitchen.Qual.int)
# plot(model.chosen$residuals ~ ames_train$Heating.QC.int)
# plot(model.chosen$residuals ~ ames_train$Bedroom.AbvGr)
# plot(model.chosen$residuals ~ ames_train$Garage.Cars)

Normal distribution: The histogram and Q-Q plot show an essentially normal distribution of residuals, but with tails that are too heavy. We recommend additional analysis to investigate the exact points in the heavy tails.

hist(model.chosen$residuals)

qqnorm(model.chosen$residuals)
qqline(model.chosen$residuals)

Constant variability of residuals: Plotting residuals vs fitted values does not show a fan shape.

plot(model.chosen$residuals ~ model.chosen$fitted.values)

Independent Residuals: Plotting residuals by index does not reveal any time pattern.

plot(model.chosen$residuals)


2.4 Initial Model RMSE

Root mean squared error (RSME) is our preferred way to report average error of model predictions. Our model predicts log_price and so by definition the residuals are log(actual price) - log(predicted price). Using the properties of logarithms and simple arithmetic this means that (predicted price) = (actual price) / exp(residual). With that derivation we compute RSME = $16700. We round RSME to the nearest $100.

ames_train$res.log_price <- model.chosen$residuals

ames_train <- ames_train %>% mutate(
  predicted.price = price / exp(res.log_price),
  residual = price - predicted.price
)

# residual sum of squares
RSS <- c(crossprod(ames_train$residual))

# Mean squared error:
MSE <- RSS / length(ames_train$residual)

# RMSE:
as.integer(round(sqrt(MSE)/100)*100)
## [1] 16700

2.5 Testing the full model


We next test our model with out-of-sample data in ames_test. This gives us a good indication if the model suffers from “overfitting” (being tuned too specifically to details in the training data).

In order to run the model on ames_test, we need to remove one property from ames_test in the Landmrk neighborhood. This neighborhood is not present in ames_train. Without re-training, the model cannot accomodate a neighborhood it has never encountered.

We see that our model achieves \(R^2=0.921\) on the test data, with RMSE of $18300. Compared to results in training, that is a 2% decrease in \(R^2\) and a $1600 increase in RMSE. These differences are reasonable, and yield a successful test of our model on out-of-sample data.

ames_test <- ames_test %>% filter(Neighborhood!="Landmrk")

ames_test$predicted.log.price <- predict.lm(model.chosen, ames_test)

ames_test <- ames_test %>% mutate(
  res.log_price = log_price - predicted.log.price,
  predicted.price = price / exp(res.log_price),
  residual = price - predicted.price
)

1 - sum(ames_test$res.log_price^2)/sum((ames_test$log_price-mean(ames_test$log_price))^2)
## [1] 0.9211636
# residual sum of squares
RSS <- c(crossprod(ames_test$residual))

# Mean squared error:
MSE <- RSS / length(ames_test$residual)

# RMSE:
as.integer(round(sqrt(MSE)/100)*100)
## [1] 18300

As an extra test, we check the distribution of residuals, and observe results very similar to what we see with the training data.

hist(ames_test$res.log_price)

qqnorm(ames_test$res.log_price)
qqline(ames_test$res.log_price)

With the test data, it appears there is a low outlier even more extreme than what we see in the training data. This suggests that there may be a small but significant number of sales priced much less than our model would predict. This actually seems very plausible. We recommend additional research to investigate this possibility.


3 Developing of a Final Model

We use model.chosen as the starting point to develop our final model. This chosen model has exactly 20 variables, but we hope to find a better 20. We do in fact improve the model by adding 3 variables and dropping 3 variables.

We use package lm.beta to provide standardized model coefficients to help us judge the relative importance of each variable in the model. With that judgment we can choose variables from model.aic to add to model.chosen. Then we use the same kind of judgment to choose variables to drop so that we have no more than 20 variables in our final model.

The standardized coefficients of lm.beta come from scaling each variable of the model so that its mean is 0 and its standard deviation is 1. Then “unit change” is consistent across variables, and the absolute values of the coefficients give us a handy way to judge the relative impact of each variable.

The full process is in function compare.std.coef() which is documented in the appendix.

First we compare the variables in model.aic, model.bic, and model.chosen Note that the latter two models are idential for now. By looking for big red bars at the bottom of this chart, we see some impactful variables that are in model.aic but not the other two models.

model.comparison <- compare.std.coef()

# Show top 30 variables
gather(head(model.comparison,30) %>% select(-bit3.models,-impact), 
       chosen.max, aic.max, bic.max, key = model, value = imp) %>%
  ggplot(aes(x=var ,y=imp, fill=model)) + 
  geom_col(position = position_dodge()) + 
  coord_flip(ylim = c(0,0.35)) + 
  ylab("Abs(standardized coefficient)") + 
  xlab("Model variable") + 
  ggtitle("Standardized coefficients for model variables\nNegative bars = not in model") + 
  scale_fill_discrete(name="Model",labels=c("AIC","BIC","Chosen (=BIC)"))

Three variables stand out: Exterior.1st, Neighborhood, and Condition.1.2. We add these to model.chosen.

model.chosen <- lm(formula = log_price ~ BsmtFin.SF.1 + floored.Year.Built +
    Overall.Qual + Bsmt.Unf.SF + X2nd.Flr.SF + Overall.Cond +
    log_lot + MS.Zoning + BsmtFin.SF.2 + Central.Air + Fireplace.Qu.int +
    Exter.Qual.int + BsmtFin.Type.1.int + Kitchen.Qual.int +
    Heating.QC.int + Paved.Drive + X1st.Flr.SF + Bsmt.Full.Bath +
    Bedroom.AbvGr + Garage.Cars +
      Exterior.1st +
      Neighborhood +
      Condition.1.2,
    data = ames_train)

That gives us 23 variables, so we look for the 3 least impactful to drop, which would correspond to the three shortest blue bars in this chart:

model.comparison <- compare.std.coef()

gather(model.comparison %>% 
         filter(is.finite(chosen.max)) %>% 
         select(-bit3.models,-impact), 
       chosen.max, aic.max, bic.max, key = model, value = imp) %>%
  ggplot(aes(x=var ,y=imp, fill=model)) + 
  geom_col(position = position_dodge()) + 
  coord_flip(ylim = c(0,0.35)) + 
  ylab("Abs(standardized coefficient)") + 
  xlab("Model variable") + 
  ggtitle("Standardized coefficients for model variables\nNegative bars = not in model") + 
  scale_fill_discrete(name="Model",labels=c("AIC","BIC","Updated Chosen"))

The three variables with the smallest standardized model.chosen coefficients are BedroomAbvGr, Exter.Qual.int, and Heating.QC.int. We verify with this query:

model.comparison %>% 
  filter(is.finite(chosen.max)) %>% 
  arrange(chosen.max) %>% 
  select(var,chosen.max) %>% 
  rename(std.coeff.chosen.model = chosen.max) %>%
  head(6)
## # A tibble: 6 x 2
##                  var std.coeff.chosen.model
##               <fctr>                  <dbl>
## 1      Bedroom.AbvGr            0.009958002
## 2     Exter.Qual.int            0.026330290
## 3     Heating.QC.int            0.031714859
## 4        Paved.Drive            0.038274722
## 5     Bsmt.Full.Bath            0.039388324
## 6 BsmtFin.Type.1.int            0.040218476

We drop BedroomAbvGr, Exter.Qual.int, and Heating.QC.int to obtain our final model.

3.1 Final Model


Our final model has \(R^2=0.954\) and RSME = $14800. This is a good improvement on BIC without all the complexity of the many variables selected by AIC. We are proud of this model. We next discuss, test, and validate it.

model.final <- lm(formula = log_price ~ 
                    Overall.Qual + # 1
                    Overall.Cond + # 2
                    floored.Year.Built + #3
                    X1st.Flr.SF + #4
                    X2nd.Flr.SF + #5
                    log_lot + #6
                    MS.Zoning + #7
                    BsmtFin.SF.1 + #8 
                    BsmtFin.SF.2 + #9
                    Bsmt.Unf.SF + #10
                    BsmtFin.Type.1.int + #11
                    Bsmt.Full.Bath + #12
                    Central.Air + #13
                    Fireplace.Qu.int + #14
                    Kitchen.Qual.int + #15
                    Paved.Drive + #16
                    Garage.Cars + #17
                    Exterior.1st + #18
                    Neighborhood + #19
                    Condition.1.2, #20
                  data = ames_train)
summary(model.final)
## 
## Call:
## lm(formula = log_price ~ Overall.Qual + Overall.Cond + floored.Year.Built + 
##     X1st.Flr.SF + X2nd.Flr.SF + log_lot + MS.Zoning + BsmtFin.SF.1 + 
##     BsmtFin.SF.2 + Bsmt.Unf.SF + BsmtFin.Type.1.int + Bsmt.Full.Bath + 
##     Central.Air + Fireplace.Qu.int + Kitchen.Qual.int + Paved.Drive + 
##     Garage.Cars + Exterior.1st + Neighborhood + Condition.1.2, 
##     data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52692 -0.04647  0.00097  0.05222  0.27713 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.126e+00  8.856e-01   1.271 0.204006    
## Overall.Qual             5.640e-02  4.414e-03  12.777  < 2e-16 ***
## Overall.Cond             4.923e-02  3.501e-03  14.063  < 2e-16 ***
## floored.Year.Built       4.282e-03  4.399e-04   9.735  < 2e-16 ***
## X1st.Flr.SF              2.615e-04  1.738e-05  15.046  < 2e-16 ***
## X2nd.Flr.SF              2.641e-04  1.036e-05  25.499  < 2e-16 ***
## log_lot                  9.436e-02  9.906e-03   9.525  < 2e-16 ***
## MS.ZoningRH             -8.333e-02  5.055e-02  -1.649 0.099656 .  
## MS.ZoningRL             -1.011e-02  3.394e-02  -0.298 0.765842    
## MS.ZoningRM             -7.835e-02  3.727e-02  -2.102 0.035841 *  
## MS.ZoningOther          -3.095e-01  5.450e-02  -5.679 1.92e-08 ***
## BsmtFin.SF.1             1.666e-04  1.797e-05   9.269  < 2e-16 ***
## BsmtFin.SF.2             1.464e-04  2.250e-05   6.504 1.41e-10 ***
## Bsmt.Unf.SF              1.040e-04  1.479e-05   7.029 4.62e-12 ***
## BsmtFin.Type.1.int       6.981e-03  2.401e-03   2.908 0.003747 ** 
## Bsmt.Full.Bath           2.950e-02  8.192e-03   3.601 0.000337 ***
## Central.AirY             9.131e-02  1.627e-02   5.613 2.78e-08 ***
## Fireplace.Qu.int         1.218e-02  2.297e-03   5.303 1.49e-07 ***
## Kitchen.Qual.int         3.573e-02  7.209e-03   4.956 8.86e-07 ***
## Paved.DriveP             1.581e-03  2.119e-02   0.075 0.940546    
## Paved.DriveY             4.353e-02  1.571e-02   2.771 0.005723 ** 
## Garage.Cars              2.920e-02  6.210e-03   4.702 3.06e-06 ***
## Exterior.1stBrkFace      1.050e-01  3.573e-02   2.938 0.003398 ** 
## Exterior.1stCemntBd      1.076e-01  4.104e-02   2.623 0.008890 ** 
## Exterior.1stHdBoard      3.728e-02  3.337e-02   1.117 0.264361    
## Exterior.1stMetalSd      7.352e-02  3.262e-02   2.254 0.024492 *  
## Exterior.1stPlywood      2.613e-02  3.451e-02   0.757 0.449159    
## Exterior.1stStucco       6.621e-02  3.924e-02   1.687 0.091951 .  
## Exterior.1stVinylSd      5.000e-02  3.319e-02   1.507 0.132333    
## Exterior.1stWd Sdng      6.316e-02  3.276e-02   1.928 0.054261 .  
## Exterior.1stWdShing      6.628e-02  3.805e-02   1.742 0.081921 .  
## Exterior.1stOther        2.965e-02  9.387e-02   0.316 0.752162    
## NeighborhoodBlueste     -4.201e-02  6.342e-02  -0.662 0.507889    
## NeighborhoodBrDale      -5.867e-02  5.204e-02  -1.127 0.259915    
## NeighborhoodBrkSide      3.329e-02  4.335e-02   0.768 0.442845    
## NeighborhoodClearCr      1.318e-02  4.746e-02   0.278 0.781258    
## NeighborhoodCollgCr     -6.177e-02  3.648e-02  -1.693 0.090790 .  
## NeighborhoodCrawfor      6.004e-02  4.232e-02   1.419 0.156419    
## NeighborhoodEdwards     -7.671e-02  3.897e-02  -1.968 0.049393 *  
## NeighborhoodGilbert     -5.804e-02  3.846e-02  -1.509 0.131643    
## NeighborhoodGreens       6.117e-02  5.635e-02   1.085 0.278068    
## NeighborhoodGrnHill      4.453e-01  7.357e-02   6.053 2.22e-09 ***
## NeighborhoodIDOTRR       2.560e-02  4.767e-02   0.537 0.591325    
## NeighborhoodMeadowV     -1.572e-01  5.207e-02  -3.019 0.002622 ** 
## NeighborhoodMitchel     -4.049e-02  3.867e-02  -1.047 0.295421    
## NeighborhoodNAmes       -4.182e-02  3.804e-02  -1.099 0.272031    
## NeighborhoodNoRidge     -1.982e-02  3.945e-02  -0.502 0.615514    
## NeighborhoodNPkVill     -3.558e-02  5.739e-02  -0.620 0.535523    
## NeighborhoodNridgHt     -2.323e-03  3.718e-02  -0.062 0.950199    
## NeighborhoodNWAmes      -1.040e-01  3.988e-02  -2.609 0.009267 ** 
## NeighborhoodOldTown     -3.178e-02  4.345e-02  -0.731 0.464750    
## NeighborhoodSawyer      -3.034e-02  3.925e-02  -0.773 0.439804    
## NeighborhoodSawyerW     -8.830e-02  3.836e-02  -2.302 0.021611 *  
## NeighborhoodSomerst      9.831e-04  4.490e-02   0.022 0.982539    
## NeighborhoodStoneBr     -2.622e-02  4.356e-02  -0.602 0.547459    
## NeighborhoodSWISU       -3.704e-02  4.774e-02  -0.776 0.438083    
## NeighborhoodTimber      -3.595e-02  4.091e-02  -0.879 0.379863    
## NeighborhoodVeenker     -8.393e-03  4.753e-02  -0.177 0.859880    
## Condition.1.2Feedr       1.438e-02  2.564e-02   0.561 0.574960    
## Condition.1.2Feedr + RR -1.569e-02  4.640e-02  -0.338 0.735266    
## Condition.1.2Norm        7.874e-02  2.150e-02   3.662 0.000267 ***
## Condition.1.2PosA        9.597e-02  4.053e-02   2.368 0.018143 *  
## Condition.1.2PosN        1.060e-01  3.687e-02   2.876 0.004136 ** 
## Condition.1.2RRAe        4.535e-02  3.843e-02   1.180 0.238316    
## Condition.1.2RRAn        6.453e-02  4.593e-02   1.405 0.160447    
## Condition.1.2RRNe        1.237e-03  6.620e-02   0.019 0.985097    
## Condition.1.2RRNn       -1.349e-02  6.549e-02  -0.206 0.836814    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08579 on 767 degrees of freedom
## Multiple R-squared:  0.9535, Adjusted R-squared:  0.9495 
## F-statistic: 238.5 on 66 and 767 DF,  p-value: < 2.2e-16
ames_train$predicted.log.price <- model.final$fitted.values

ames_train <- ames_train %>% mutate(
  res.log_price = log_price - predicted.log.price,
  predicted.price = price / exp(res.log_price),
  residual = price - predicted.price
)

# residual sum of squares
RSS <- c(crossprod(ames_train$residual))

# Mean squared error:
MSE <- RSS / length(ames_train$residual)

# RMSE:
as.integer(round(sqrt(MSE)/100)*100)
## [1] 14800

3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.


We transformed many variables as explained in the preceding sections. The most fundamental transformations are logarithmic ones for log_price and log_lot. These two variables are very right-skewed. A logarithmic tranformation makes both of them approximately normally distributed, which helps greatly with linear modeling.

We considered a similar transformation for log_area. It turns out, however, that our model.ez performs better with area. We also notice that all our candidate full models do not use area at all, but instead drill down and use multiple more specific variables representing different parts of area, like X1st.Flr.SF (1st floor area), X2nd.Flr.SF (2nd floor area), and BsmtFin.SF.1 (finished basement area).

A complete list of transformed variables that made it into model.final:

  • Bsmt.Full.Bath: NA changed to 0
  • Condition.1.2: A synthesis of Condition.1 and Condition.2
  • Rarely occuring values merged into bigger categories
    • floored.Year.Built
    • MS.Zoning
    • Exterior.1st
  • Scalar factors converted to integers
    • BsmtFin.Type.1.int
    • Fireplace.Qu.int
    • Kitchen.Qual.int
  • Log transforms
    • log_price
    • log_lot

3.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.


Short answer: no.

In linear modeling it is possible to add terms that model the impact of variable interactions. This is similar in spirit to our analysis of how dwelling type correlates to different slopes of price per area. Continuing along the lines of that example, below we add the interaction term area * Dwell.Type to model.ez. Notice that the resulting coefficients that start with area:Dwell correspond well with the slope analysis in our EDA.

ames_train_area <- ames_train_original %>% select(PID,area)
ames_train <- left_join(ames_train, ames_train_area)
## Joining, by = "PID"
model.interact <- lm(log_price ~
                 Overall.Qual +
                 Overall.Cond +
                 floored.Year.Built +
                 area +
                 log_lot +
                 Neighborhood +
                 Dwell.Type +
                 MS.Zoning +
                 Kitchen.Qual.int +
                 BsmtFin.SF.1 +
                area * Dwell.Type,
               data = ames_train)
summary(model.interact)
## 
## Call:
## lm(formula = log_price ~ Overall.Qual + Overall.Cond + floored.Year.Built + 
##     area + log_lot + Neighborhood + Dwell.Type + MS.Zoning + 
##     Kitchen.Qual.int + BsmtFin.SF.1 + area * Dwell.Type, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59736 -0.05155  0.00705  0.05709  0.29815 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -6.685e-01  8.846e-01  -0.756  0.45009    
## Overall.Qual                    6.809e-02  4.844e-03  14.055  < 2e-16 ***
## Overall.Cond                    5.156e-02  3.676e-03  14.024  < 2e-16 ***
## floored.Year.Built              5.257e-03  4.391e-04  11.971  < 2e-16 ***
## area                            3.836e-04  1.960e-05  19.565  < 2e-16 ***
## log_lot                         1.050e-01  1.123e-02   9.354  < 2e-16 ***
## NeighborhoodBlueste             9.259e-02  7.596e-02   1.219  0.22325    
## NeighborhoodBrDale             -2.223e-02  6.374e-02  -0.349  0.72734    
## NeighborhoodBrkSide             2.362e-02  5.193e-02   0.455  0.64933    
## NeighborhoodClearCr             2.029e-02  5.520e-02   0.368  0.71331    
## NeighborhoodCollgCr            -6.288e-02  4.510e-02  -1.394  0.16367    
## NeighborhoodCrawfor             9.327e-02  5.041e-02   1.850  0.06463 .  
## NeighborhoodEdwards            -7.688e-02  4.706e-02  -1.634  0.10275    
## NeighborhoodGilbert            -6.267e-02  4.694e-02  -1.335  0.18223    
## NeighborhoodGreens              1.061e-01  6.565e-02   1.617  0.10633    
## NeighborhoodGrnHill             3.525e-01  8.097e-02   4.353 1.52e-05 ***
## NeighborhoodIDOTRR             -1.834e-02  5.582e-02  -0.329  0.74259    
## NeighborhoodMeadowV            -4.323e-02  5.684e-02  -0.761  0.44715    
## NeighborhoodMitchel            -1.767e-02  4.681e-02  -0.378  0.70586    
## NeighborhoodNAmes              -2.613e-03  4.622e-02  -0.057  0.95493    
## NeighborhoodNoRidge            -5.634e-03  4.773e-02  -0.118  0.90607    
## NeighborhoodNPkVill             2.135e-02  6.597e-02   0.324  0.74626    
## NeighborhoodNridgHt             1.310e-02  4.430e-02   0.296  0.76749    
## NeighborhoodNWAmes             -9.484e-02  4.716e-02  -2.011  0.04467 *  
## NeighborhoodOldTown            -6.890e-02  5.165e-02  -1.334  0.18257    
## NeighborhoodSawyer             -2.615e-02  4.711e-02  -0.555  0.57897    
## NeighborhoodSawyerW            -1.020e-01  4.616e-02  -2.209  0.02747 *  
## NeighborhoodSomerst             3.910e-03  5.359e-02   0.073  0.94186    
## NeighborhoodStoneBr            -1.327e-02  4.834e-02  -0.275  0.78375    
## NeighborhoodSWISU              -3.998e-02  5.653e-02  -0.707  0.47958    
## NeighborhoodTimber             -3.007e-02  5.005e-02  -0.601  0.54819    
## NeighborhoodVeenker             1.547e-02  5.448e-02   0.284  0.77650    
## Dwell.Type1Fam 2Story           3.639e-02  3.090e-02   1.178  0.23925    
## Dwell.Type2FamCon               1.652e-01  7.796e-02   2.118  0.03445 *  
## Dwell.TypeSplit                 1.692e-01  5.524e-02   3.064  0.00226 ** 
## Dwell.TypeTwnhs1 or split      -5.384e-02  7.514e-02  -0.716  0.47391    
## Dwell.TypeTwnhs2 or Dup        -3.316e-02  5.420e-02  -0.612  0.54084    
## MS.ZoningRH                    -1.634e-01  5.472e-02  -2.986  0.00292 ** 
## MS.ZoningRL                    -3.528e-02  3.657e-02  -0.965  0.33496    
## MS.ZoningRM                    -9.494e-02  4.032e-02  -2.355  0.01878 *  
## MS.ZoningOther                 -3.425e-01  5.976e-02  -5.731 1.43e-08 ***
## Kitchen.Qual.int                3.646e-02  8.201e-03   4.446 1.00e-05 ***
## BsmtFin.SF.1                    1.285e-04  9.924e-06  12.946  < 2e-16 ***
## area:Dwell.Type1Fam 2Story     -6.129e-05  1.998e-05  -3.067  0.00224 ** 
## area:Dwell.Type2FamCon         -1.324e-04  4.920e-05  -2.690  0.00729 ** 
## area:Dwell.TypeSplit           -1.383e-04  4.351e-05  -3.179  0.00154 ** 
## area:Dwell.TypeTwnhs1 or split  5.470e-05  5.925e-05   0.923  0.35622    
## area:Dwell.TypeTwnhs2 or Dup   -5.242e-05  3.510e-05  -1.494  0.13569    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09736 on 786 degrees of freedom
## Multiple R-squared:  0.9387, Adjusted R-squared:  0.935 
## F-statistic: 255.9 on 47 and 786 DF,  p-value: < 2.2e-16

Our final model does not include Dwell.Type and we did not find any other interactions worth adding to our model.


3.4 Variable Selection

We discuss variable selection at length in earlier sections. Briefly, we followed these steps: (1) extensive EDA, (2) culling colinear variables, (3) MCMC to prioritize variables to include in the full candidate list, (4) backward elimination BIC as a starting point for the final model, (5) backward elimination AIC to help identify improvements to make to the final model, (6) standardized coefficients to prioritize the last few changes to the final model. These steps suit our desire to explore the data at length and to use our statistical expertise as a way to help us learn more about real estate pricing.


3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.

Our testing on out-of-sample data was successful. This confirms our model training. It suggests one possible change to our model. In both training and test data, we see a small but significant number of houses that sell for much less than predicted by our model. The number of these outliers is very small, however, and so we do not change the model for now. We recommend additional investigation.


4 Final Model Assessment

4.1 Final Model Residual

Looking at residuals of model.final, we re-check the same model assumptions that we checked with model.chosen. Everything still looks good, and the heavy tails are much reduced.

Linear relationship: All numeric variables show linear relationship with the response variable. Here is one of many, first floor area:

# plot(model.final$residuals ~ ames_train$Overall.Qual)
# plot(model.final$residuals ~ ames_train$Overall.Cond)
# plot(model.final$residuals ~ ames_train$X2nd.Flr.SF)
plot(model.final$residuals ~ ames_train$X1st.Flr.SF)

Normal distribution: The heavy tails are much reduced. This is great news for our final model. We still do see one major negative outlier. This is a persistent trend in our analysis that merits further investigation.

hist(model.final$residuals)

qqnorm(model.final$residuals)
qqline(model.final$residuals)

Constant variability of residuals: Plotting residuals vs fitted values does not show a fan shape.

plot(model.final$residuals ~ model.final$fitted.values)


4.2 Final Model RMSE

Our final model has \(R^2=0.953\) and RMSE = $14800 on ames_train. For homes with a median price around $159000, this means the model prediction is typically within 9-10% of the actual price. That may not be accurate enough to submit as a formal bid on a house, but it is accurate enough to help identify promising investment opportunities.

ames_train$predicted.log.price <- predict.lm(model.final, ames_train)

ames_train <- ames_train %>% mutate(
  res.log_price = log_price - predicted.log.price,
  predicted.price = price / exp(res.log_price),
  residual = price - predicted.price
)

1 - sum(ames_train$res.log_price^2)/sum((ames_train$log_price-mean(ames_train$log_price))^2)
## [1] 0.9535378
# residual sum of squares
RSS <- c(crossprod(ames_train$residual))

# Mean squared error:
MSE <- RSS / length(ames_train$residual)

# RMSE:
as.integer(round(sqrt(MSE)/100)*100)
## [1] 14800

4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?

Overall, our model is very strong and reliably predicts price very well even on out-of-sample data.

Notable weaknesses are the following:
  • Being trained with data from the Great Recession, our model has a very unrealistic expectation that prices never increase with time.
  • A small but seemingly significant number of houses sell for much less than predicted.
  • Our model cannot predict prices in one neighborhood (Landmrk) which was absent from the training data.

4.4 Final Model Validation

4.4.1 RSME

Finally we check our model with ames_validation, which hasn’t been used for testing until now. The results are very similar to those with ames_test, and again the difference with ames_train is reasonable.
  • ames_validation: \(R^2=0.93\) and RSME = $17800;
  • ames_test: \(R^2=0.93\) and RSME = $17100;
  • ames_training: \(R^2=0.95\) and RSME = $14800.
ames_test$predicted.log.price <- predict.lm(model.final, ames_test)

ames_test <- ames_test %>% mutate(
  res.log_price = log_price - predicted.log.price,
  predicted.price = price / exp(res.log_price),
  residual = price - predicted.price
)

1 - sum(ames_test$res.log_price^2)/sum((ames_test$log_price-mean(ames_test$log_price))^2)
## [1] 0.9322736
# residual sum of squares
RSS <- c(crossprod(ames_test$residual))

# Mean squared error:
MSE <- RSS / length(ames_test$residual)

# RMSE:
as.integer(round(sqrt(MSE)/100)*100)
## [1] 17100
ames_validation <- ames_validation %>% filter(Neighborhood!="Landmrk")

ames_validation$predicted.log.price <- predict.lm(model.final, ames_validation)

ames_validation <- ames_validation %>% mutate(
  res.log_price = log_price - predicted.log.price,
  predicted.price = price / exp(res.log_price),
  residual = price - predicted.price
)

1 - sum(ames_validation$res.log_price^2)/sum((ames_validation$log_price-mean(ames_validation$log_price))^2)
## [1] 0.929661
# residual sum of squares
RSS <- c(crossprod(ames_validation$residual))

# Mean squared error:
MSE <- RSS / length(ames_validation$residual)

# RMSE:
as.integer(round(sqrt(MSE)/100)*100)
## [1] 17800

4.4.2 Predictive confidence intervals

We check how many houses are priced within the 95% predictive confidence interval of our model. If our modeling is rock-solid, we should see roughly 95% of them inside the interval:
  • ames_train: 806 out of 834 = 96.6%
  • ames_test: 766 out of 817 = 93.8%
  • ames_validation: 715 out of 764 = 93.6%

Because of the sample size, it is unlikely that the lower values are a fluke. This is evidence that something is slightly awry with how our model reflects uncertainty. To refinforce this evidence, we also check the distribution of residuals with the validation data. We see very heavy tails, indicating a problem with a small but significant number of large-magnitude residuals (both positive and negative).

# Test predictive confidence intervals with training

predicted <- predict.lm(model.final, ames_train, se.fit = TRUE, interval = "predict")
ames_train$lwr <- predicted$fit[,'lwr']
ames_train$upr <- predicted$fit[,'upr']
ames_train <- ames_train %>% mutate(
  inside.interval = (log(price)>=lwr & log(price)<=upr)
)
ames_train %>% count(inside.interval)
## # A tibble: 2 x 2
##   inside.interval     n
##             <lgl> <int>
## 1           FALSE    28
## 2            TRUE   806
# Test predictive confidence intervals with test data

predicted <- predict.lm(model.final, ames_test, se.fit = TRUE, interval = "predict")
ames_test$lwr <- predicted$fit[,'lwr']
ames_test$upr <- predicted$fit[,'upr']
ames_test <- ames_test %>% mutate(
  inside.interval = (log(price)>=lwr & log(price)<=upr)
)
ames_test %>% count(inside.interval)
## # A tibble: 2 x 2
##   inside.interval     n
##             <lgl> <int>
## 1           FALSE    51
## 2            TRUE   766
# Test predictive confidence intervals with validation data

predicted <- predict.lm(model.final, ames_validation, se.fit = TRUE, interval = "predict")
ames_validation$lwr <- predicted$fit[,'lwr']
ames_validation$upr <- predicted$fit[,'upr']
ames_validation <- ames_validation %>% mutate(
  inside.interval = (log(price)>=lwr & log(price)<=upr)
)
ames_validation %>% count(inside.interval)
## # A tibble: 2 x 2
##   inside.interval     n
##             <lgl> <int>
## 1           FALSE    49
## 2            TRUE   715

Here is the heavy-tailed distribution of residuals with validation data:

hist(ames_validation$res.log_price)

qqnorm(ames_validation$res.log_price)
qqline(ames_validation$res.log_price)

4.4.3 Overvalued and undervalued houses

We compile a list of all homes where log(price) was under- or over-predicted by more than 0.3. We draw from all three partitions, train, test, and validation.

These 16 properties deserve further investigation to determine why the model is so far off.

large_resids <-
  rbind(ames_train %>% select(PID, price, predicted.price, res.log_price),
      ames_test %>% select(PID, price, predicted.price, res.log_price),
      ames_validation %>% 
        select(PID, price, predicted.price, res.log_price)) %>%
  filter(abs(res.log_price)>0.3) %>%
  arrange(desc(abs(res.log_price)))

homes_large_resids <-
  left_join(large_resids, 
            read.csv("ames_address_PID.csv"), 
            by="PID") %>% 
  select(-City,-State) %>%
  arrange(res.log_price)

homes_large_resids
## # A tibble: 16 x 5
##          PID  price predicted.price res.log_price                 Address
##        <int>  <int>           <dbl>         <dbl>                  <fctr>
##  1 532378120  62383       120532.21    -0.6586242       1313 GARFIELD AVE
##  2 911102170  40000        67747.98    -0.5269151        208 S WALNUT AVE
##  3 535379060  64500       103956.04    -0.4773029        1407 DOUGLAS AVE
##  4 909101330  35000        53683.09    -0.4277501           202 STATE AVE
##  5 534451170  52000        79314.59    -0.4221783         1430 SUMMIT AVE
##  6 902302150  90000       134762.02    -0.4037007           824 CLARK AVE
##  7 535125010 180000       258897.05    -0.3634736           2735 DUFF AVE
##  8 905103060 121500       174525.64    -0.3621574            3925 ROSS RD
##  9 533352170 130500       186288.34    -0.3559225           1314 IOWA CIR
## 10 535126100 110000       148650.24    -0.3011158           2407 DUFF AVE
## 11 909455060 200000       146008.98     0.3146492 1510 LITTLE BLUESTEM CT
## 12 923125030  81500        59099.24     0.3213850         3325 S DUFF AVE
## 13 535454070 166000       118410.77     0.3378281           133 E 13TH ST
## 14 902400110 475000       334343.63     0.3511455            804 DUFF AVE
## 15 905376090 216000       144336.34     0.4031322        3714 WOODLAND ST
## 16 905426030 260000       159661.39     0.4876263         3424 OAKLAND ST

Here is where these homes are located:

homes_large_resids <- left_join(homes_large_resids, read.csv("large_resid_geo.csv"), by="PID")

homes_large_resids %>% 
  ggplot(aes(x=Longitude,y=Latitude,color=cut_width(sign(res.log_price),1))) + 
    annotation_custom(rasterGrob(googlemap,
                               width = unit(1,"npc"),
                               height = unit(1,"npc")),
                    -Inf, Inf, -Inf, Inf) +
  scale_y_continuous(limits = c(41.988,42.073)) +
  scale_x_continuous(limits = c(-93.696,-93.565)) +
  geom_point(aes(size=abs(res.log_price))) + 
  ggtitle("Overvalued and undervalued homes") + 
  scale_color_discrete(name="Over/Under valued",labels=c("High prediction","Low prediction")) + 
  scale_size_continuous(name="Magnitude of log_price error")

That is a good starting point to launch an investigation of high- and low-residual homes.


5 Conclusion

We developed a good model of Ames real estate prices, with \(R^2=0.93\) and RMSE of about $17500 on out-of-sample data. This model is a great first step to help ACME identify opportunities for investment. To be more helpful, the model needs retraining to include Landmrk, and it needs retraining to model price increases with time. Further investigation is needed regarding a small but significant number of very large residuals, which may lead to new insights in modeling.

The relatively linear narrative of this report belies the fact that we developed our model with a highly iterative multi-faceted approach. We have documented all the facets in this report. (Except for MCMC. As we said previously, MCMC displays technical inconsistencies with knitting in R, and so we omitted MCMC from this report.) We found it quite rewarding to see how all these facets could work together to produce our final model.

What we learned about the data: Diligent data cleaning pays off. Before we did any data cleaning, we thought that \(R^2=0.7\) was good. After we cleaned the data, we literally never saw \(R^2\) below 0.89 again, no matter how simple a model we tried. The cleaning is extremely tedious work. We wonder how many people, with greater statistical expertise than we have, are unknowingly held back by “dirty” data.

What we learned about our model: We are surprised how little the location actually contributed to the model. Based on our peripheral research to explain this, it appears that several other important predictors are highly co-linear with location, such as overall quality and year built. We are also interested how our model development discarded the single variable area in favor of a plethora of more specific area measures in different parts of the home. We did not anticipate this, but having seen the results, it makes sense to model price per area differently in different parts of the home.

We are curious how well other students/researchers’ models perform with this data.


6 Appendix

Function to compute and compare standardized coefficients

compare.std.coef
## function() {
##   
## model.aic.beta <- lm.beta(model.aic)
## model.bic.beta <- lm.beta(model.bic)
## model.chosen.beta <- lm.beta(model.chosen)
## 
## coefficients.all <- data.frame(
##   var_w_level = names(model.full$coefficients)
## )
## 
## coefficients.aic <- data.frame(
##   var_w_level = names(model.aic.beta$standardized.coefficients),
##   aic.val = model.aic.beta$standardized.coefficients
## )
## coefficients.bic <- data.frame(
##   var_w_level = names(model.bic.beta$standardized.coefficients),
##   bic.val = model.bic.beta$standardized.coefficients
## )
## coefficients.chosen <- data.frame(
##   var_w_level = names(model.chosen.beta$standardized.coefficients),
##   chosen.val = model.chosen.beta$standardized.coefficients
## )
## 
## # silence warnings in this block
## oldw <- getOption("warn")
## options(warn = -1)
## 
## coefficients.all <- 
##   left_join(
##     left_join(
##       left_join(coefficients.all, 
##                 coefficients.chosen, 
##                 by = "var_w_level"
##                 ),
##       coefficients.aic,
##       by = "var_w_level"
##       ),
##     coefficients.bic,
##     by = "var_w_level"
##     ) 
## 
## coefficients.all <- coefficients.all %>% group_by(var_w_level) %>%
##   mutate(
##     max.impact = max(abs(chosen.val),abs(aic.val),abs(bic.val), na.rm=TRUE)
##   )
## 
## options(warn = oldw)
## 
## 
## coef.important <- coefficients.all %>% filter(is.finite(max.impact), var_w_level!="(Intercept)") %>% arrange(desc(max.impact))
## 
## coef.important$var_w_level <- 
##   factor(coef.important$var_w_level,
##          levels = 
##            coef.important$var_w_level[order(coef.important$max.impact,
##                                             decreasing = FALSE)])
## 
## 
## # silence warnings in this block
## oldw <- getOption("warn")
## options(warn = -1)
## 
## model.comparison <- 
##   left_join(coef.important, 
##             read.csv("var_w_level.csv"),
##             by="var_w_level"
##             ) %>%
##   group_by(var) %>% 
##   summarise(chosen.max=max(abs(chosen.val), na.rm=TRUE),
##             aic.max=max(abs(aic.val), na.rm = TRUE),
##             bic.max=max(abs(bic.val), na.rm = TRUE),
##             bit3.models = 2*is.finite(chosen.max) +
##                           is.finite(aic.max) + 
##                           4*is.finite(bic.max),
##             impact=max(max.impact,na.rm=TRUE)) %>% 
##   arrange(desc(bit3.models),desc(impact))
## 
## options(warn = oldw)
## 
## model.comparison$var <- 
##   factor(model.comparison$var,
##          levels = 
##            model.comparison$var[order(model.comparison$bit3.models,
##                                       model.comparison$impact, 
##                                       decreasing = FALSE)])
## 
## return(model.comparison)
## }
## <bytecode: 0x0000000015da9d78>