The 2008 financial crisis affected the lives of everyone in the United States—some more than others—and there is an abundance of retrospective research that seeks to identify which sectors of the economy were hit hardest during this time and why. The first sector that comes to mind is housing, as the financial crisis was “characterized by a rise in subprime mortgage delinquencies and foreclosures, and the resulting decline of securities” backed by these subprime mortgages (University of North Carolina Department of Statistics 2016). For this reason, we decided to use our multifaceted data science toolbox to answer meaningful questions about the housing market. The ultimate goal of this project is to build a regression model that predicts a well-defined outcome. So, while we will not be analyzing the housing market from an economic perspective like the majority of the literature, we can offer a predictive model for academic purposes.
As a conceptual framework for our research, we adopted Peng and Matsui’s epicycles of analysis to iteratively manage the steps of our data analysis (Peng and Matsui 2017). After choosing the broad topic of the housing market, we narrowed the scope of our research by brainstorming as a team to identify the specific question we wanted to answer. We unanimously decided to predict the sale price of a residential house based on common characteristics that a property appraiser or assessor’s office would use to assess property value.
The dataset we chose for our project contains information from the Ames Assessor’s Office on residential property sales that occurred in Ames, Iowa from 2006 to 2010 (De Cock 2011). The dataset was published in the Journal of Statistics Education for students and researchers alike to have an opportunity to practice predictive modeling on messy, real-world data. The dataset contains 2930 observations of 80 variables, and the unit of observation is a single property sale in Ames, Iowa in a given year.
Of the 80 variables, 23 are nominal, 23 are ordinal, 14 are discrete, and 20 are continuous. The variables included are basic characteristics that anyone wanting to buy a house would be interested in. For the most part, the different variables may be split up into specific groups. In general, the 20 continuous variables relate to measurements of area dimensions for each observation. These include, among others, the sizes of lots, rooms, porches, and garages. The 14 discrete variables mostly have to do with the number of bedrooms, bathrooms, kitchens, etc. that a given property has. There are several geographic categorical variables that start profiling properties at the individual Parcel ID level and end at the neighborhood level. The rest of the nominal variables identify characteristics of the property and dwelling type/structure. Most of the ordinal variables are rankings of the quality/condition of rooms and lot characteristics. For more information on the variables in the dataset, consult the included DataDescription.txt file.
The dataset was fairly messy with regard to the number of missing values and the way levels of categorical variables were organized and coded. To tackle the large amount of cleaning required, we split the data into four roughly equal parts, leaving about 20 variables for each team member to clean. We also standardized the variable names by converting them all to lowercase before dividing the variables up accordingly.
There were 13960 missing values in the dataset when we started to clean and analyze it. Table 1 breaks down the number of missing values by variable. After analyzing the variables with missing data and consulting the aforementioned DataDescription.txt file, we noticed that most of the missing values were actually not missing. The documentation states the data curator coded certain variables as NA to specify that a house did not have that feature. For example, in the case of the alley variable, NA means there is No Alley access to the property. By recoding these variables, we were able to fix most of the missing values in categorical variables.
| Variable Name | Number of Missing Values |
|---|---|
| pool.qc | 2917 |
| misc.feature | 2824 |
| alley | 2732 |
| fence | 2358 |
| fireplace.qu | 1422 |
| lot.frontage | 490 |
| garage.yr.blt | 159 |
| garage.qual | 158 |
| garage.cond | 158 |
| garage.type | 157 |
| garage.finish | 157 |
| bsmt.qual | 79 |
| bsmt.cond | 79 |
| bsmt.exposure | 79 |
| bsmtfin.type.1 | 79 |
| bsmtfin.type.2 | 79 |
Filling missing values for the quantitative variables was just as straightforward. Most of the quantitative variables missing large numbers of values were NA because the corresponding house feature was absent. For example, all missing values in the garage.yr.blt variable occur because the house does not have a garage, and the same reasoning applies to the square footage variables. For variables that were missing only a small number of values, we used mode (categorical) and median (quantitative) imputation to fill missing values.
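As a rough illustration of the recoding and imputation steps just described, the snippet below sketches the approach; the data frame name (ames), the zero fill for garage.yr.blt, and the choice of ms.zoning for the mode example are assumptions rather than the exact choices made in Cleaning.Rmd.

# illustrative sketch of the cleaning steps; object names and fill choices are assumptions
# recode "structurally missing" categorical NAs to an explicit level
ames$alley <- ifelse(is.na(ames$alley), "No Alley", as.character(ames$alley))
ames$pool.qc <- ifelse(is.na(ames$pool.qc), "No Pool", as.character(ames$pool.qc))
# quantitative features tied to an absent house feature (here filled with 0 as a placeholder)
ames$garage.yr.blt[is.na(ames$garage.yr.blt)] <- 0
# sparse missingness: median imputation for numeric, mode imputation for categorical
ames$lot.frontage[is.na(ames$lot.frontage)] <- median(ames$lot.frontage, na.rm = TRUE)
mode_level <- names(which.max(table(ames$ms.zoning)))
ames$ms.zoning[is.na(ames$ms.zoning)] <- mode_level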
Once the variables were cleaned and all missing values were filled, we performed exploratory data analysis. We noticed that the distribution of sale price was heavily skewed to the right; Figure 1 illustrates this skewness. The high-end outliers would make accurate prediction difficult, as they would significantly drive up the root mean square error (RMSE) of our predictions. Keeping these observations would have disproportionately penalized our model’s predictions.
Figure 1: Histogram of Sale Price Shows Many High-end Outliers
To remedy this issue, we considered two possible solutions: 1) drop the outlier observations; or 2) transform the saleprice variable by taking its natural logarithm. We ultimately decided to drop the top and bottom one percent of observations, as our goal is to predict typical sales in Ames, Iowa. Taking the natural logarithm of sale price would put our predictions on a logarithmic scale and reduce the interpretability of the model’s output. In total, we dropped only 58 observations by excluding the top and bottom one percent of sale price.
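The trim itself can be done with sample quantiles; a minimal sketch follows, assuming the cleaned frame is named ames (the exact boundary handling that yields the 58 dropped observations may differ slightly).

# drop the top and bottom one percent of sale price (boundary handling is illustrative)
cuts <- quantile(ames$saleprice, probs = c(0.01, 0.99))
ames_trimmed <- ames[ames$saleprice > cuts[1] & ames$saleprice < cuts[2], ]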
The cleaned dataset has 2838 observations of 82 variables, meaning we dropped only 92 observations (3.14% of the original data) throughout the cleaning process. After cleaning the data, we used the caret package’s createDataPartition function to split the cleaned dataset into a train and a test set, with 70 percent of the data in train and the remaining 30 percent in test. The three resulting datasets—cleaned, train, and test—can be found in the included project repository (https://github.com/mikearango/DATS_Final), along with the data cleaning file, Cleaning.Rmd. The knitted HTML output of Cleaning.Rmd can be viewed at http://rpubs.com/yesh747/GW_DATS_Final_Cleaning.
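A sketch of the split, assuming the trimmed frame from above; the seed is illustrative, not the one used in Cleaning.Rmd.

library(caret)
set.seed(123)  # illustrative seed
# stratified 70/30 split on the outcome
in_train <- createDataPartition(ames_trimmed$saleprice, p = 0.7, list = FALSE)
df_train <- ames_trimmed[in_train, ]
df_test  <- ames_trimmed[-in_train, ]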
We began our post-cleaning analysis and model development by generating several simple models. Linear models are a natural starting point, as they are the easiest to fit and the most intuitively interpretable. The first model we created was an ordinary least squares (OLS) model containing all the predictors in the dataset; its results can be seen in the regression output below (Table 2: Model 1).
# kitchen sink model
all.model <- lm(saleprice ~ ., data = df_train)
# predict on test set
pred.all <- predict(all.model, df_test)
RMSE.all <- sqrt(mean((pred.all - test_y)^2))
The test RMSE of the kitchen sink model with all variables is $30198.86. It is worth noting that we do not cross-validate the simpler linear models, so they are not directly comparable to later models, for which a robust k-fold cross-validation technique is implemented. We only look at the test RMSE of the simple linear models to see how they handle observations they have not seen before. From the regression output below, we can tell there are several instances of perfect multicollinearity: some coefficient estimates are not generated, and the output warns that the fitted values come from a rank-deficient matrix. A rank-deficient design matrix is not full rank, meaning some columns are linear combinations of other columns. Many predictors also have high p-values and appear to be statistically insignificant. All predictors that demonstrated perfect multicollinearity or had a p-value > 0.2 were dropped to reduce dimensionality and keep only variables that account for actual variance in sale price rather than random noise. We then ran the modified model to see if these changes result in a lower prediction RMSE (Table 2: Model 2).
# calculate variance inflation factors
vif(model.2)
## ms.zoning lot.area street land.contour
## 1.298679 2.411166 1.053408 1.220227
## land.slope neighborhood condition.1 condition.2
## 1.307085 1.196766 1.071314 1.028327
## bldg.type house.style overall.qual overall.cond
## 2.384922 1.796207 3.430808 1.392270
## year.built roof.style exterior.1st exterior.2nd
## 3.722310 1.202457 3.933904 4.059303
## mas.vnr.type mas.vnr.area exter.qual bsmt.qual
## 1.502206 1.834623 2.236477 2.357395
## bsmt.exposure bsmtfin.sf.1 bsmtfin.sf.2 bsmt.unf.sf
## 1.440682 5.126374 1.628612 4.276286
## heating.qc x1st.flr.sf x2nd.flr.sf bsmt.full.bath
## 1.494538 4.801416 3.475515 2.168480
## bsmt.half.bath full.bath bedroom.abvgr functional
## 1.196433 2.392854 2.097285 1.166541
## fireplaces fireplace.qu garage.yr.blt garage.cars
## 1.538815 1.163415 1.698895 6.452749
## garage.area wood.deck.sf enclosed.porch screen.porch
## 5.643962 1.248809 1.219637 1.101414
## misc.feature sale.type sale.condition ln.lot.frontage
## 1.053644 1.314799 1.070549 2.859249
## ln.lot.area
## 5.251100
Surprisingly, the test RMSE of the modified model increased slightly to $30200.18. This is most likely because we are no longer overfitting the model as badly as before. As the variance inflation factors (VIFs) show, the high levels of multicollinearity between variables have been drastically reduced, but there are still some variables that need to be dropped. As a rule of thumb, economists usually drop variables with a VIF greater than five to tease out multicollinearity. The \(R^2\) of the model is 0.874, while the adjusted \(R^2\) is 0.871. This shows that despite the high number of predictor variables in the model (45), we have not been penalized heavily, since most variables account for a significant amount of variance in sale price. We can further simplify and improve this model by dropping highly correlated variables and insignificant predictors.
# linear model 3
model.3 <- lm(saleprice ~ overall.qual + neighborhood + bldg.type + bsmt.qual + total.bsmt.sf + gr.liv.area + full.bath + half.bath + totrms.abvgrd + kitchen.qual + garage.qual + garage.area + wood.deck.sf + screen.porch + misc.feature, data = df_train)
# Calculate RMSE on test set
pred.3 <- predict(model.3, df_test)
RMSE.3 <- sqrt(mean((pred.3 - test_y)^2))
The simplest model still explains about 83 percent of the variation in sale price—only about 4 percent less than the far more complicated models. However, if we look at the intercepts of the three models, the third model has the most reasonable value (about $7,000). Hypothetically, this implies that a typical residential property in Ames, Iowa with none of the features included in the model would be priced, on average, at about $7,000. This is far more reasonable than the roughly $500,000 or -$700,000 intercepts of the first and second models. Thus, even though the simplest model has the largest test RMSE ($33777.69) of the three linear models, the third model is the most statistically sound model for prediction.
| saleprice | |||
| (1) | (2) | (3) | |
| ms.zoning | -2,361.22** | -2,161.52** | |
| (1,121.70) | (1,055.99) | ||
| lot.frontage | 135.21 | ||
| (113.17) | |||
| lot.area | 0.20* | 0.24** | |
| (0.12) | (0.11) | ||
| street | 24,371.36* | 24,103.85* | |
| (12,896.88) | (13,056.23) | ||
| alley | 436.90 | ||
| (2,548.54) | |||
| lot.shape | -331.95 | ||
| (678.92) | |||
| land.contour | 1,419.47 | 1,221.97 | |
| (940.15) | (946.14) | ||
| lot.config | 440.10 | ||
| (372.48) | |||
| land.slope | 8,618.25*** | 7,677.98** | |
| (3,220.91) | (3,239.62) | ||
| neighborhood | 150.80 | 114.16 | 76.27 |
| (100.64) | (100.75) | (107.25) | |
| condition.1 | -1,424.27** | -988.88 | |
| (675.89) | (677.64) | ||
| condition.2 | -14,630.07*** | -12,195.09*** | |
| (3,439.55) | (3,475.48) | ||
| bldg.type | -2,613.51*** | -3,700.89*** | -4,638.57*** |
| (767.78) | (714.04) | (560.03) | |
| house.style | -1,416.73*** | -1,418.19*** | |
| (411.16) | (399.61) | ||
| overall.qual | 12,429.25*** | 13,917.63*** | 17,571.79*** |
| (814.27) | (793.75) | (782.04) | |
| overall.cond | 5,086.54*** | 5,572.92*** | |
| (708.14) | (621.87) | ||
| year.built | 249.91*** | 334.88*** | |
| (50.38) | (37.22) | ||
| year.remod.add | 42.19 | ||
| (44.29) | |||
| roof.style | 1,873.22** | 2,165.68*** | |
| (782.36) | (779.18) | ||
| roof.matl | -3,656.56 | ||
| (5,308.46) | |||
| exterior.1st | -1,113.66*** | -1,100.15*** | |
| (366.87) | (368.45) | ||
| exterior.2nd | 623.08** | 585.59** | |
| (295.38) | (295.06) | ||
| mas.vnr.type | 1,203.40* | 1,236.71* | |
| (666.90) | (671.40) | ||
| mas.vnr.area | 18.99*** | 19.11*** | |
| (4.45) | (4.47) | ||
| exter.qual | -5,488.00*** | -8,512.54*** | |
| (1,363.74) | (1,294.04) | ||
| exter.cond | 885.55 | ||
| (824.49) | |||
| foundation | 635.81 | ||
| (1,135.09) | |||
| bsmt.qual | -2,384.14*** | -2,880.30*** | -5,305.23*** |
| (510.81) | (505.40) | (505.38) | |
| bsmt.cond | -391.49 | ||
| (631.22) | |||
| bsmt.exposure | -1,795.42*** | -1,855.53*** | |
| (437.81) | (441.27) | ||
| bsmtfin.type.1 | -350.49 | ||
| (349.24) | |||
| bsmtfin.sf.1 | 21.77*** | 23.45*** | |
| (3.19) | (2.97) | ||
| bsmtfin.type.2 | -2.69 | ||
| (652.98) | |||
| bsmtfin.sf.2 | 14.85*** | 16.27*** | |
| (5.48) | (4.38) | ||
| bsmt.unf.sf | 9.87*** | 9.78*** | |
| (2.81) | (2.66) | ||
| total.bsmt.sf | 28.61*** | ||
| (2.07) | |||
| heating | -3,775.58 | ||
| (3,693.85) | |||
| heating.qc | -765.44* | -1,199.70*** | |
| (403.34) | (397.90) | ||
| central.air | -1,833.50 | ||
| (3,068.63) | |||
| electrical | -630.97 | ||
| (824.61) | |||
| x1st.flr.sf | 37.12*** | 41.82*** | |
| (3.82) | (3.30) | ||
| x2nd.flr.sf | 54.22*** | 47.13*** | |
| (4.53) | (2.52) | ||
| low.qual.fin.sf | 8.33 | ||
| (12.71) | |||
| gr.liv.area | 41.72*** | ||
| (3.02) | |||
| bsmt.full.bath | 5,423.25*** | 5,497.06*** | |
| (1,611.68) | (1,607.62) | ||
| bsmt.half.bath | -6,303.91** | -5,594.29** | |
| (2,519.41) | (2,551.06) | ||
| full.bath | 2,716.35 | 1,019.91 | -341.26 |
| (1,760.59) | (1,603.29) | (1,725.80) | |
| half.bath | 661.96 | 3,851.52** | |
| (1,744.11) | (1,604.71) | ||
| bedroom.abvgr | -2,251.99** | -2,158.85** | |
| (1,110.50) | (1,012.50) | ||
| kitchen.abvgr | -14,588.11*** | ||
| (3,547.68) | |||
| kitchen.qual | -7,329.63*** | -11,452.20*** | |
| (986.62) | (1,011.57) | ||
| totrms.abvgrd | 1,493.72* | -1,452.48* | |
| (783.89) | (751.51) | ||
| functional | 2,644.75*** | 2,236.88*** | |
| (680.85) | (669.71) | ||
| fireplaces | 6,303.08*** | 7,188.58*** | |
| (1,112.86) | (1,102.28) | ||
| fireplace.qu | -2,477.46*** | -3,105.26*** | |
| (529.11) | (533.28) | ||
| garage.type | 124.53 | ||
| (437.13) | |||
| garage.yr.blt | -4.86** | -6.35*** | |
| (2.19) | (1.76) | ||
| garage.finish | -1,396.07** | ||
| (619.79) | |||
| garage.cars | 6,464.39*** | 5,900.78*** | |
| (1,961.79) | (1,949.00) | ||
| garage.area | 21.60*** | 22.98*** | 46.14*** |
| (6.49) | (6.48) | (4.14) | |
| garage.qual | -972.41 | 1,264.54 | |
| (799.65) | (793.72) | ||
| garage.cond | 967.95 | ||
| (991.05) | |||
| paved.drive | 1,554.32 | ||
| (1,405.31) | |||
| wood.deck.sf | 10.51** | 14.32*** | 34.50*** |
| (5.10) | (5.07) | (5.49) | |
| open.porch.sf | -11.86 | ||
| (9.60) | |||
| enclosed.porch | 24.90** | 25.16** | |
| (10.20) | (10.20) | ||
| x3ssn.porch | 9.44 | ||
| (26.38) | |||
| screen.porch | 41.27*** | 44.30*** | 61.06*** |
| (10.32) | (10.45) | (11.51) | |
| pool.area | 3.96 | ||
| (17.68) | |||
| pool.qc | -6,722.17 | ||
| (6,149.03) | |||
| fence | -377.94 | ||
| (556.42) | |||
| misc.feature | 8,073.51*** | 7,738.92** | 7,938.94** |
| (3,122.94) | (3,152.99) | (3,531.19) | |
| mo.sold | -105.98 | ||
| (216.76) | |||
| yr.sold | -521.00 | ||
| (443.62) | |||
| sale.type | -4,562.80*** | -5,115.05*** | |
| (1,168.12) | (1,174.10) | ||
| sale.condition | -8,716.80*** | -9,096.40*** | |
| (2,101.68) | (2,131.99) | ||
| ln.lot.frontage | -11,499.13 | -2,636.28 | |
| (7,953.15) | (2,904.21) | ||
| ln.lot.area | 9,659.24*** | 7,352.87*** | |
| (2,714.59) | (2,585.67) | ||
| house.age | |||
| yrs.since.remod | |||
| has.2nd.floor | -11,065.17*** | ||
| (3,277.88) | |||
| Constant | 504,876.80 | -700,374.60*** | 6,938.26 |
| (893,342.10) | (82,138.23) | (10,288.62) | |
| N | 1,989 | 1,989 | 1,989 |
| R2 | 0.88 | 0.87 | 0.83 |
| Adjusted R2 | 0.88 | 0.87 | 0.83 |
| Residual Std. Error | 24,845.42 (df = 1911) | 25,416.30 (df = 1943) | 28,868.88 (df = 1973) |
| F Statistic | 184.06*** (df = 77; 1911) | 298.36*** (df = 45; 1943) | 662.67*** (df = 15; 1973) |
| Note: *p < .1; **p < .05; ***p < .01 |||
From these three simple models, we have seen several clues that OLS may not be the best model to predict sale price given the dataset. We hypothesize this is because several of the classical assumptions of OLS are violated. Ordinary least squares regression models make five key assumptions about the data: 1) the relationship between the predictors and the response is linear in the parameters; 2) the design matrix is full rank (no perfect multicollinearity); 3) the predictors are exogenous, so the errors have mean zero conditional on the predictors; 4) the errors are spherical (homoscedastic and uncorrelated with one another); and 5) the errors are approximately normally distributed.
In the cleaning process, we decided to log-transform a few variables (such as lot area) because their relationship with the response variable was not inherently linear. Moreover, we found that most of the categorical quality variables were not linearly related to sale price. Figure 2 clearly illustrates this, as median sale price seems to increase quadratically as the overall quality of a house increases.
Figure 2: Median Sale Price by Overall Quality Rating
In addition to the non-linearity between the independent and dependent variables, the regression output above showed that the full rank assumption is not met, as indicated by NA values for regression coefficients and very high VIFs. We plotted several graphs to check the assumptions about the distribution and correlation of the residuals, which can be seen in Figure 3. The plot of residuals vs. fitted values suggests correlation among the residuals, as illustrated by the non-linearity of the red trend line. The second graph in Figure 3 is a quantile-quantile plot, which gives us reason to believe the distribution of the errors may not be approximately normal. While four of the five assumptions are clearly violated, exogeneity of the independent variables appears to hold, as the mean of the residuals is approximately equal to zero.
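The residual checks in Figure 3 come from standard linear-model diagnostics; a minimal sketch follows (whether the report plotted all.model or model.2 is an assumption).

# base-R diagnostics: residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
par(mfrow = c(2, 2))
plot(all.model)
par(mfrow = c(1, 1))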
Figure 3: Residual Plots of the Simple OLS Model to Check Error Term Assumptions
In this section, we take a systematic approach to generating a regression model for predicting sale prices of residential properties in Ames, Iowa from 2006 to 2010. This model serves to better illuminate the factors that influence sale price and can be considered another form of exploratory data analysis. We first sorted the independent variables by their \(R^2\) values from univariate OLS regressions on sale price, since we cannot compute a correlation coefficient between a categorical independent variable and sale price. By computing these \(R^2\) values, we were able to better understand which variables singularly explain the most variance in sale price; a sketch of this ranking is shown below, followed by the top 20 results.
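One way to build the ranking is to fit a univariate OLS model for each predictor and record its \(R^2\); the sketch below assumes the cleaned training frame used earlier (df_train).

# rank predictors by the R^2 of a univariate regression on saleprice
predictors <- setdiff(names(df_train), "saleprice")
r2 <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "saleprice"), data = df_train)
  summary(fit)$r.squared
})
r2_table <- data.frame(Variable = predictors, RSquared = r2)
head(r2_table[order(-r2_table$RSquared), ], 20)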
## Variable RSquared
## 15 overall.qual 0.6374596
## 10 neighborhood 0.5903957
## 28 bsmt.qual 0.5010441
## 44 gr.liv.area 0.4914032
## 25 exter.qual 0.4577111
## 51 kitchen.qual 0.4527205
## 59 garage.cars 0.4297990
## 60 garage.area 0.4112817
## 36 total.bsmt.sf 0.3717401
## 41 x1st.flr.sf 0.3520744
## 80 house.age 0.3311327
## 17 year.built 0.3302985
## 58 garage.finish 0.3271849
## 55 fireplace.qu 0.3079559
## 47 full.bath 0.3061259
## 81 yrs.since.remod 0.2976624
## 18 year.remod.add 0.2955791
## 27 foundation 0.2877760
## 56 garage.type 0.2710881
## 52 totrms.abvgrd 0.2356259
Based on the ordering above, we see that some variables explain much more of the variance in sale price than others. This list suggests that the overall quality of the house, the neighborhood the house is located in, and several other factors are relatively important. Figure 4 shows that sale price varies widely across neighborhoods, which demonstrates the real estate adage of “location, location, location.”
Figure 4: Boxplot of Sale Price by Neighborhood
In order to best understand the influence of variables that individually explain relatively large portions of variance in sale price, the following multivariate OLS regression models were run:
| saleprice | ||
| (1) | (2) | |
| overall.qual | 11,046.01*** | |
| (882.10) | ||
| neighborhoodIDOTRR | -14,525.50** | 3,320.73 |
| (7,233.71) | (6,829.34) | |
| neighborhoodBrDale | 13,760.72* | -4,550.72 |
| (7,954.41) | (8,024.00) | |
| neighborhoodOldTown | -17,347.13*** | -2,869.43 |
| (6,439.57) | (6,067.01) | |
| neighborhoodBrkSide | -3,242.00 | 10,769.40* |
| (6,802.63) | (6,480.99) | |
| neighborhoodEdwards | -15,943.27** | 6,602.48 |
| (6,643.53) | (6,106.12) | |
| neighborhoodSWISU | -15,226.92* | -4,119.43 |
| (8,125.98) | (7,865.01) | |
| neighborhoodSawyer | -8,587.73 | 14,113.34** |
| (6,814.02) | (6,256.24) | |
| neighborhoodNPkVill | 14,971.06* | -1,466.86 |
| (8,973.11) | (9,043.62) | |
| neighborhoodBlueste | 1,381.42 | -13,411.84 |
| (12,477.83) | (12,492.57) | |
| neighborhoodNAmes | -7,015.38 | 13,740.13** |
| (6,437.42) | (5,909.23) | |
| neighborhoodMitchel | -7,475.07 | 15,238.90** |
| (7,166.64) | (6,585.19) | |
| neighborhoodSawyerW | -5,805.53 | 11,502.79* |
| (6,886.17) | (6,403.19) | |
| neighborhoodNWAmes | -3,937.52 | 12,367.67* |
| (6,863.00) | (6,395.51) | |
| neighborhoodGilbert | -9,476.82 | 14,133.43** |
| (6,941.87) | (6,310.24) | |
| neighborhoodGreens | 15,331.28 | 6,674.83 |
| (11,950.16) | (11,972.17) | |
| neighborhoodBlmngtn | -802.52 | -2,273.02 |
| (8,386.19) | (8,329.74) | |
| neighborhoodCollgCr | 1,521.38 | 17,436.13*** |
| (6,656.20) | (6,195.67) | |
| neighborhoodCrawfor | 17,436.60** | 34,022.77*** |
| (7,079.72) | (6,631.68) | |
| neighborhoodClearCr | 1,932.84 | 33,226.90*** |
| (8,427.60) | (7,689.73) | |
| neighborhoodSomerst | 20,901.74*** | 24,623.34*** |
| (6,681.96) | (6,534.81) | |
| neighborhoodTimber | 16,318.01** | 34,539.87*** |
| (7,604.46) | (7,102.50) | |
| neighborhoodVeenker | 19,233.11** | 39,188.60*** |
| (9,297.28) | (8,841.85) | |
| neighborhoodNridgHt | 47,162.73*** | 55,359.66*** |
| (7,134.14) | (6,858.63) | |
| neighborhoodStoneBr | 52,188.52*** | 54,116.89*** |
| (7,896.43) | (7,794.25) | |
| neighborhoodNoRidge | 43,509.76*** | 54,903.64*** |
| (7,747.59) | (7,432.32) | |
| bsmt.qual | 7,646.17*** | 5,002.74*** |
| (1,077.19) | (1,078.44) | |
| total.bsmt.sf | 16.64*** | 18.90*** |
| (2.10) | (2.05) | |
| gr.liv.area | 40.66*** | 40.13*** |
| (1.75) | (1.74) | |
| exter.qual | 9,672.94*** | 3,928.98** |
| (1,853.89) | (1,879.30) | |
| kitchen.qual | 12,669.79*** | 9,998.07*** |
| (1,454.90) | (1,468.84) | |
| garage.area | 31.16*** | 42.29*** |
| (4.10) | (3.88) | |
| garage.finish | 5,634.58*** | |
| (956.88) | ||
| fireplace.qu | 3,566.17*** | 3,370.64*** |
| (424.37) | (423.79) | |
| yrs.since.remod | -252.13*** | -199.24*** |
| (43.15) | (42.91) | |
| ln.lot.area | 19,770.29*** | |
| (1,886.92) | ||
| Constant | -226,059.80*** | -80,902.35*** |
| (16,897.88) | (9,193.92) | |
| N | 1,988 | 1,989 |
| R2 | 0.86 | 0.86 |
| Adjusted R2 | 0.85 | 0.85 |
| Residual Std. Error | 27,138.41 (df = 1952) | 27,016.38 (df = 1954) |
| F Statistic | 329.38*** (df = 35; 1952) | 342.61*** (df = 34; 1954) |
| p < .1; p < .05; p < .01 | ||
From these two models, we can observe that the overall quality of a house is significant and positively correlated with saleprice at the 1% level. The neighborhood a given house is located in generally has a significant impact on sales price in Model 2 at the 5% level. However, it is worth noting that some neighborhoods appear to be uncorrelated with any change in the sale price. All other variables used in the model appear to have significant relationships with sale price at the 1% level.
Using these multivariate OLS regression models to predict sale price in the test data results in fairly low RMSE values, which can be seen in Table 3 below. However, these models are likely not strong performers for prediction because of high multicollinearity between the independent variables in the model. For example, overall.qual will be highly correlated with kitchen.qual and exter.qual, and the neighborhoods with higher prices may be highly correlated with overall.qual, kitchen.qual, and exter.qual. We already saw in the Exploratory Data Analysis and Ordinary Least Squares Regression section that OLS is not the best model for predicting sale price given our housing data, as many classical assumptions are violated. Nevertheless, these models are important in that they help us develop an understanding of the variables that influence sale price.
| Model | Test RMSE |
|---|---|
| Model 1 | $29932 |
| Model 2 | $29740 |
Principal Component Regression (PCR) is an application of Principal Component Analysis (PCA) to regression modeling. By creating components—orthogonal transformations of possibly correlated variables into sets of uncorrelated variables—we can reduce the dimensionality of the feature space and minimize correlation between independent variables by opting to use these components in the regression model, in place of the original variables. This can create a better prediction model, with a lower risk of overfitting, than a traditional OLS regression model. However, interpretability of the estimated regression coefficients is lost as each component variable is an orthogonal transformation of the original variables in the dataset. In this section, we use the PCR method to build a more robust and sound predictive model.
Because PCR methods work best with numerical data, we decided to drop all factor variables from our training dataset, keeping only integer and numerical datatypes, before running a model. The dataset contains 40 variables after dropping factors. Then, we generated a kitchen sink PCR model using all the remaining variables in the training dataset to predict saleprice.
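A minimal sketch of how the numeric-only frame used in the pcr() call below could be constructed; the name df_train_num matches that call, though the exact column handling in our scripts may differ slightly.

# keep only integer/numeric columns (the response, saleprice, is numeric and is retained)
num_cols <- sapply(df_train, is.numeric)
df_train_num <- df_train[, num_cols]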
library(pls)  # pcr() comes from the pls package
pcr.saleprice <- pcr(saleprice ~ ., data = df_train_num, scale = TRUE, validation = "CV")
After fitting a kitchen sink PCR model to get a baseline of how the model fits the data, we performed feature selection. To select the number of components to use for prediction, we examined how the cross-validated RMSE changed as we increased the number of principal components in the model; the results can be seen in Figure 5 below. From the regression summary and the validated RMSE plot (Figure 5), we see large increases in the percentage of variance in sale price explained, and corresponding decreases in RMSE, from 6 to 7 components and again from 23 to 24. Thus, we assessed models with 1, 7, and 24 components, as sketched below.
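A sketch of how the three candidate component counts can be compared on the test set; df_test_num is assumed to be the numeric-only version of the test frame.

# test RMSE for a given number of principal components
rmse_ncomp <- function(ncomp) {
  pred <- predict(pcr.saleprice, newdata = df_test_num, ncomp = ncomp)
  sqrt(mean((as.numeric(pred) - df_test_num$saleprice)^2))
}
sapply(c(1, 7, 24), rmse_ncomp)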
Figure 5: Cross-Validated RMSE by Number of Principal Components
The table below compares summary statistics of the PCR predictions (using 1, 7, and 24 components) to the actual sale prices in the test dataset.
| Statistic | N | Median | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| test_data | 849 | 160,000 | 69,045.96 | 62,383 | 455,000 |
| Prediction.1.Component | 849 | 172,976.20 | 61,862.66 | 37,678.95 | 622,881.70 |
| Prediction.7.Component | 849 | 173,170.20 | 63,569.53 | 10,893.61 | 665,717.10 |
| Prediction.24.Component | 849 | 172,128.70 | 65,360.61 | 18,940.55 | 714,284.50 |
Comparing the three models on the test dataset, we see the following RMSE values: $36420.37 with 1 component, $33924.24 with 7 components, and $33344.96 with 24 components (these also appear in Table 4 below).
Model 2 appears to be the best, as Model 3 risks overfitting in exchange for a marginally improved RMSE. This is a clear example of the bias-variance trade-off in practice.
Many of the more complex machine learning models used below work better with numeric inputs, so we converted all categorical variables to integers. Then we created our own folds for k-fold cross-validation (with k = 5) so that we can compare the out-of-sample RMSE of our trained models on the same cross-validation folds, ensuring an apples-to-apples comparison. In addition to making our own folds, we created a custom trainControl object that tells caret to use these folds.
# set seed for reproducibility
set.seed(256)
# create 5 folds to be used in cross validation
myFolds <- createFolds(train_mike, k = 5)
# create a custom trainControl object to use our folds; index = myFolds
myControl = trainControl(verboseIter = FALSE, index = myFolds)
This section deals with some slightly more complicated prediction models. We ran a few random forest regression models, followed by two elastic net regularization models, and finished off with two more specialized models (k-NN regression and gradient boosting).
Random forests are aggregations of several decision tree models. A single decision tree is very fast but not very accurate; the earlier comparisons between the decision tree and a simple linear model illustrated this. While random forests take longer to run, they are very accurate and robust to overfitting. Random forests improve accuracy by fitting many trees, each to a bootstrap sample of the original dataset, a technique called bootstrap aggregation or bagging. In addition to bagging, “random forests change how the classification or regression trees are constructed. In standard trees, each node is split using the best split among all variables. In a random forest, each node is split using the best among a subset of predictors randomly chosen at that node” (Liaw and Wiener 2002).
set.seed(12)
# train the model
modelrf1 <- train(
# formula
saleprice ~ .,
# data
train_mike,
# fast random forest
method = "ranger",
importance = "impurity",
# grid search for optimal number of columns to randomly sample at each split
tuneGrid = data.frame(mtry = seq(2, 80, 2)),
# set trControl to our custom trainControl object
trControl = myControl
)
# print the model
print(modelrf1)
## Random Forest
##
## 1989 samples
## 81 predictor
##
## No pre-processing
## Resampling: Bootstrapped (5 reps)
## Summary of sample sizes: 13, 17, 15, 22, 15
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 54756.52 0.7091453
## 4 49699.01 0.7214052
## 6 48742.55 0.7173932
## 8 47887.77 0.7119289
## 10 47377.39 0.7223138
## 12 47048.90 0.7127126
## 14 46976.19 0.7128854
## 16 46735.22 0.7097362
## 18 47076.57 0.7033098
## 20 46707.23 0.7049011
## 22 46800.19 0.6950737
## 24 46728.82 0.6984333
## 26 46513.82 0.6950204
## 28 46825.14 0.6922453
## 30 46323.01 0.6937254
## 32 46985.10 0.6881620
## 34 46684.63 0.6866490
## 36 46471.22 0.6886453
## 38 46738.11 0.6851717
## 40 46296.22 0.6918906
## 42 46482.95 0.6823281
## 44 46312.35 0.6884379
## 46 46765.75 0.6777096
## 48 46740.75 0.6787865
## 50 46558.02 0.6827944
## 52 46586.09 0.6782660
## 54 46435.39 0.6813909
## 56 46832.13 0.6759283
## 58 46553.33 0.6829673
## 60 46694.15 0.6772624
## 62 46698.44 0.6804232
## 64 46600.05 0.6751546
## 66 46434.20 0.6787169
## 68 46567.68 0.6801114
## 70 46681.59 0.6789195
## 72 46609.75 0.6746143
## 74 46856.95 0.6769123
## 76 46297.66 0.6764235
## 78 47018.30 0.6710605
## 80 46538.05 0.6782378
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 40.
We started by running a kitchen sink model (including all independent variables), since random forest does its own feature selection. Random forest is simple to tune because it has only two hyperparameters we need to worry about—the number of variables in the random subset at each node and the number of trees in the forest. We ran a grid search for the optimal number of variables to include at each split while holding the number of trees constant at 500 (the default value). As the output above and Figure 6 show, the optimal number of predictors at each split (mtry) was 40. Using the optimal number of predictors, the model yields a training RMSE of $46296.22, and the importance of the top 20 predictor variables is shown in Figure 7.
Figure 6: Training RMSE of Random Forest by Number of Predictors (mtry)
Figure 7: Top 20 Independent Variables by Importance
Next, we removed the zero variance predictors before running the next random forest regression to see if this would lower the training RMSE. Since removing zero variance predictors changes the number of variables in the dataset, we ran a new grid search for the second random forest model. It is worth noting, however, that the regression output showed no variables were actually removed in pre-processing. Regardless, these modifications made the model slightly worse and increased the RMSE to $46327.18. We hypothesize that this is because we ran a smaller grid search, since we had expected a handful of variables in a dataset this large to be zero variance predictors.
In an effort to reduce the dimensionality of the dataset, we decided to drop near-zero variance predictors, center and scale all remaining variables, and perform a Principal Components Analysis (PCA) before feeding the data into the third random forest model.
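The pre-processing can be handed to caret directly through the preProcess argument; the sketch below mirrors the output that follows, while the seed and the mtry grid of 1 through 10 are assumptions.

set.seed(12)  # illustrative seed
modelrf3 <- train(
  saleprice ~ .,
  train_mike,
  method = "ranger",
  # drop near-zero variance predictors, center and scale, then apply PCA
  preProcess = c("nzv", "center", "scale", "pca"),
  tuneGrid = data.frame(mtry = 1:10),
  trControl = myControl
)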
## Random Forest
##
## 1989 samples
## 81 predictor
##
## Pre-processing: centered (62), scaled (62), principal component
## signal extraction (62), remove (19)
## Resampling: Bootstrapped (5 reps)
## Summary of sample sizes: 13, 17, 15, 22, 15
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 1 67833.29 0.3563931
## 2 63820.00 0.4216134
## 3 60929.07 0.4709958
## 4 58702.32 0.4959577
## 5 56539.22 0.5073119
## 6 55063.55 0.5219993
## 7 53326.99 0.5451829
## 8 52474.46 0.5408367
## 9 51833.96 0.5423409
## 10 50727.18 0.5552007
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 10.
Surprisingly, this model had the highest training RMSE of the three random forest models run to this point ($50727.18). The regression output shows that 19 variables were removed because they were near-zero variance predictors and the remaining 62 were centered, scaled, and subjected to PCA. This model also explained the least amount of variation in sale price.
Elastic net regularization is a mixed method that is closely linked to the lasso (least absolute shrinkage and selection operator) and ridge regression. Lasso and ridge regression are two very common ways of dealing with collinearity among variables and can be seen as a substitute for stepwise regression and dimensionality reduction techniques. However, each method penalizes the coefficients differently: lasso regression penalizes the sum of the absolute values of the coefficients (an L1 penalty), which can shrink some coefficients exactly to zero, whereas ridge regression penalizes the sum of the squared coefficients (an L2 penalty), which shrinks coefficients toward zero without eliminating them. Elastic net regularization is a combination of the two with a mixing parameter \(\alpha\), bounded between zero and one, where \(\alpha = 0\) corresponds to pure ridge regression and \(\alpha = 1\) to pure lasso regression (this is the convention used by the glmnet package we fit below). Elastic net also has a regularization parameter \(\lambda\), ranging from zero to infinity, that represents the size of the penalty; higher values of lambda correspond to simpler models (The MathWorks 2017).
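For reference, the criterion minimized by glmnet (the engine behind caret’s “glmnet” method) makes the role of the two tuning parameters explicit:

\[
\min_{\beta_0,\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]
\]

Setting \(\alpha = 1\) leaves only the \(\lVert\beta\rVert_1\) (lasso) term, \(\alpha = 0\) leaves only the squared (ridge) term, and \(\lambda\) scales the overall strength of the penalty.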
set.seed(500)
# Train glmnet with custom trainControl and tuning: model
enet1 <- train(
# formula
saleprice ~ .,
# data
train_mike,
# set grid search parameters for alpha
tuneGrid = expand.grid(alpha = seq(0,1,0.1),
lambda = seq(0.0001, 1, 10)),
# use glmnet method for lasso, ridge, and elastic net
method = "glmnet",
# trainControl
trControl = myControl
)
# Print model output to console
print(enet1)
## glmnet
##
## 1989 samples
## 81 predictor
##
## No pre-processing
## Resampling: Bootstrapped (5 reps)
## Summary of sample sizes: 13, 17, 15, 22, 15
## Resampling results across tuning parameters:
##
## alpha RMSE Rsquared
## 0.0 54241.07 0.4982317
## 0.1 65806.73 0.5870422
## 0.2 70465.42 0.5705243
## 0.3 72072.57 0.5602204
## 0.4 72408.13 0.5550632
## 0.5 73241.00 0.5502134
## 0.6 73475.49 0.5492261
## 0.7 72782.38 0.5542529
## 0.8 75491.28 0.5401454
## 0.9 74684.25 0.5545181
## 1.0 44758.84 0.6681522
##
## Tuning parameter 'lambda' was held constant at a value of 0.0001
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.0001.
For the first model, we ran a grid search for the optimal mixing parameter while holding the penalty factor constant. Figure 8 shows the optimal value of \(\alpha\) was one for the grid we searched, which implies that a pure lasso penalty minimizes the RMSE of our predictions among the regularization models. Note that we only searched \(0 \leq \alpha \leq 1\) in steps of 0.1; it is entirely possible that \(\alpha_{optimal} \in (0.9,1)\), but a finer grid search would have been more computationally expensive. The \(\alpha = 1\) model achieves the lowest training RMSE of the complex models so far ($44758.84), and this is before we even search for the RMSE-minimizing value of the regularization parameter.
Figure 8: RMSE of Elastic Net Regularization for Varying Levels of the Mixing Parameter
Since we found the optimal value of the mixing parameter to be one, we fit a pure lasso model as the second regularization model (the corresponding train object is named ridge in the code and output below). Holding \(\alpha\) constant and equal to one, we performed a grid search for the RMSE-minimizing value of the regularization parameter. Again, a very fine grid search would be computationally expensive, so we searched \(\lambda \in [0, 15000]\) in increments of 1,000.
set.seed(1267)
# Train glmnet with custom trainControl and tuning: model
ridge <- train(
# formula
saleprice ~ .,
# data
train_mike,
# set grid search parameters for lambda
tuneGrid = expand.grid(alpha = 1,
lambda = (0:15) * 1000),
# use glmnet method for lasso, ridge, and elastic net
method = "glmnet",
# trainControl
trControl = myControl
)
# Print model output to console
print(ridge)
## glmnet
##
## 1989 samples
## 81 predictor
##
## No pre-processing
## Resampling: Bootstrapped (5 reps)
## Summary of sample sizes: 13, 17, 15, 22, 15
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared
## 0 44758.84 0.6681522
## 1000 43745.22 0.6829261
## 2000 43350.02 0.6850346
## 3000 42909.67 0.6831511
## 4000 42039.22 0.6891672
## 5000 41908.15 0.6850255
## 6000 41903.15 0.6792737
## 7000 41864.39 0.6748741
## 8000 41868.84 0.6711100
## 9000 41933.36 0.6673462
## 10000 42086.41 0.6627630
## 11000 42394.97 0.6567517
## 12000 42838.36 0.6497837
## 13000 43369.07 0.6423837
## 14000 44062.11 0.6331428
## 15000 44753.20 0.6247382
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 7000.
The results show that the optimal model drops our training RMSE to $41864.39, which is significantly better than before we tuned the regularization parameter. Figure 9 below shows that \(\lambda_{optimal} = 7000\). We can be reasonably confident that we have found at least a local minimum for \(\lambda\), as the figure shows a U-shaped curve that bottoms out at \(\lambda = 7000\).
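Because the selected penalty is a pure lasso, some coefficients are shrunk exactly to zero at the chosen \(\lambda\); they can be inspected from the underlying glmnet fit, as sketched below (names assume the ridge train object above).

# coefficients of the final glmnet fit at the selected lambda; zeros were dropped by the L1 penalty
coef(ridge$finalModel, s = ridge$bestTune$lambda)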
Figure 9: RMSE of Elastic Net Regularization for Varying Levels of the Regularization Parameter
The k-nearest neighbors algorithm is a non-parametric method that can be used for classification or regression. For each point we want to predict, we take the k closest training examples under a given distance metric (Euclidean, Manhattan, Minkowski, etc.) and combine their responses (in the kknn implementation used below, via a kernel-weighted average of the neighbors). Because this process is repeated for each point, the technique is computationally expensive on larger datasets. It is important to find the right value of k, as we can easily overfit by picking a k that is too low or underfit by picking a k that is too high.
set.seed(579)
# Train k-NN regression with custom trainControl: model
knn1 <- train(
# formula
saleprice ~ .,
# data
train_mike,
# knn regression
method = "kknn",
# trainControl
trControl = myControl
)
# Print model to console
print(knn1)
## k-Nearest Neighbors
##
## 1989 samples
## 81 predictor
##
## No pre-processing
## Resampling: Bootstrapped (5 reps)
## Summary of sample sizes: 13, 17, 15, 22, 15
## Resampling results across tuning parameters:
##
## kmax RMSE Rsquared
## 5 47494.79 0.6024412
## 7 47430.67 0.6129031
## 9 47432.89 0.6159750
##
## Tuning parameter 'distance' was held constant at a value of 2
##
## Tuning parameter 'kernel' was held constant at a value of optimal
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were kmax = 7, distance = 2 and
## kernel = optimal.
Most of the literature uses a value of k between three and nine, but we ran a grid search between five and nine for the optimal value of k. The training RMSE for k-nearest neighbors regression was $47430.67, which is higher than most of our models. Figure 10 shows that \(k_{max} = 7\) optimizes the results of the model on the training data.
Figure 10: RMSE for Varying Numbers of Maximum Neighbors
The last model we ran is a gradient boosted linear model that uses penalized least squares to minimize the RMSE on the training data. This model fits an additive model in a forward stagewise manner and uses the concept of gradient descent—minimizing a function by moving in the opposite direction of the gradient—to add new base learners to the model. For the boosting model, we decided not to prune and to search for the optimal number of boosting iterations (mstop) between 10 and 150 in increments of 10.
glmBoostGrid1 = expand.grid(mstop = seq(10, 150, 10),
prune = c("no"))
set.seed(1254)
# Train glmboost with custom trainControl and tuning grid: model
glmboost1 <- train(
# formula
saleprice ~ .,
# data
train_mike,
tuneGrid = glmBoostGrid1,
# gradient boosted generalized linear model
method = "glmboost",
# trainControl
trControl = myControl
)
# Print model to console
print(glmboost1)
## Boosted Generalized Linear Model
##
## 1989 samples
## 81 predictor
##
## No pre-processing
## Resampling: Bootstrapped (5 reps)
## Summary of sample sizes: 13, 17, 15, 22, 15
## Resampling results across tuning parameters:
##
## mstop RMSE Rsquared
## 10 51018.94 0.5469574
## 20 43646.83 0.6351759
## 30 42211.98 0.6587426
## 40 41693.69 0.6715619
## 50 41982.33 0.6745126
## 60 42241.63 0.6741847
## 70 42575.89 0.6737011
## 80 42955.73 0.6712201
## 90 43303.32 0.6701332
## 100 43853.97 0.6657788
## 110 44318.39 0.6618264
## 120 44615.73 0.6600767
## 130 45009.99 0.6589798
## 140 45439.37 0.6573927
## 150 45768.18 0.6564806
##
## Tuning parameter 'prune' was held constant at a value of no
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mstop = 40 and prune = no.
The gradient boosted linear model gave us a marginally lower training RMSE ($41693.69) than the tuned lasso model from the previous section. Figure 11 shows the optimal number of boosting iterations for this model was 40. It would be convenient if the boosted linear model also performed best on the test data, since its coefficients are easily interpretable, just as in OLS regression.
Figure 11: RMSE of Gradient Boosted Model at Different Numbers of Boosting Iterations
To compare the models that were cross-validated on the same folds, we made a list of all the trained models and passed it to the resamples function in the caret package, which collects the resampling results of each model. This allows us to make statistical comparisons of performance across models. Below, Figure 12 shows a box-and-whisker plot of the RMSE across resamples for each model, and Figure 13 shows the corresponding box-and-whisker plot of \(R^2\). A sketch of this comparison is shown below.
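A sketch of the comparison, assuming the model object names used in the sections above (modelrf2 and modelrf3 are the second and third random forests, whose code was not printed in full).

# collect the resampling results of every model trained on the same folds
model_list <- list("rf1" = modelrf1, "rf2" = modelrf2, "rf3" = modelrf3,
                   "elastic net" = enet1, "ridge" = ridge,
                   "knn" = knn1, "glmboost" = glmboost1)
resamps <- resamples(model_list)
summary(resamps)
bwplot(resamps, metric = "RMSE")      # Figure 12
bwplot(resamps, metric = "Rsquared")  # Figure 13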
##
## Call:
## summary.resamples(object = resamps)
##
## Models: rf1, rf2, rf3, elastic net, ridge, knn, glmboost
## Number of resamples: 5
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf1 38010 41670 47780 46300 49780 54240 0
## rf2 37490 42050 47520 46170 49910 53900 0
## rf3 40740 42370 53920 50730 56680 59920 0
## elastic net 37200 42490 46870 44760 48500 48740 0
## ridge 36460 38370 40930 41860 46750 46810 0
## knn 44210 46120 47580 47430 48670 50570 0
## glmboost 34480 41540 43350 41690 43410 45690 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## rf1 0.6091 0.6725 0.7057 0.6919 0.7147 0.7575 0
## rf2 0.6208 0.6835 0.6987 0.6929 0.7053 0.7559 0
## rf3 0.4151 0.4354 0.5698 0.5552 0.6530 0.7027 0
## elastic net 0.5665 0.6531 0.6877 0.6682 0.7068 0.7267 0
## ridge 0.5829 0.6451 0.6902 0.6749 0.7220 0.7343 0
## knn 0.5714 0.5904 0.6203 0.6129 0.6209 0.6615 0
## glmboost 0.6319 0.6361 0.6535 0.6716 0.6665 0.7698 0
Figure 12: RMSE of Bootstrap Samples of The Trained Models
Figure 13: R-squared of Bootstrap Samples of The Trained Models
We generated predictions from each of our models on the test dataset and calculated the RMSE of each; the results can be seen in Table 4 below, and a sketch of the computation follows.
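This sketch reuses the model_list above; test_mike is assumed to be the test set with the same integer encoding as train_mike, and the PCR rows of Table 4 were computed separately.

# test-set RMSE for each caret model
rmse <- function(model, data, actual) {
  sqrt(mean((predict(model, data) - actual)^2))
}
sapply(model_list, rmse, data = test_mike, actual = test_mike$saleprice)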
| Model Name | Test RMSE |
|---|---|
| pcr.1 | $36420.37 |
| pcr.7 | $33924.24 |
| pcr.24 | $33344.96 |
| rf1 | $22272.18 |
| rf2 | $22058.44 |
| rf3 | $27642.27 |
| elastic net | $30015.88 |
| ridge | $31782.36 |
| k-NN | $28943.03 |
| glmboost | $31778.77 |
From Table 4, we can see that rf1 and rf2 are likely the strongest and most reliable prediction models, as they have much lower test RMSE values than the other models.
Table 5 compares the predictions of our two best models to the actual sale prices in the test dataset. The medians and spreads of the predicted values are close to those of the actual sale prices.
| Statistic | N | Median | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| Actual_SalePrice | 849 | 160,000 | 69,045.96 | 62,383 | 455,000 |
| Prediction_RF1_SalePrice | 849 | 160,987.40 | 63,721.28 | 80,094.48 | 402,857.60 |
| Prediction_RF2_SalePrice | 849 | 161,187.30 | 63,018.59 | 81,806.04 | 397,026.70 |
Using the following input parameters (from the first observation of the test dataset), the rf1 model predicts the sale price to be $178034.34, and the rf2 model predicts the sale price to be $178034.34. The true price of this house is $189900.
ms.zoning lot.frontage lot.area street alley lot.shape land.contour
1 6 74 13830 2 2 1 4
lot.config land.slope neighborhood condition.1 condition.2 bldg.type
1 5 1 9 3 3 1
house.style overall.qual overall.cond year.built year.remod.add
1 6 5 5 1997 1998
roof.style roof.matl exterior.1st exterior.2nd mas.vnr.type mas.vnr.area
1 2 1 11 12 3 0
exter.qual exter.cond foundation bsmt.qual bsmt.cond bsmt.exposure
1 4 4 3 3 6 5
bsmtfin.type.1 bsmtfin.sf.1 bsmtfin.type.2 bsmtfin.sf.2 bsmt.unf.sf
1 3 791 7 0 137
total.bsmt.sf heating heating.qc central.air electrical x1st.flr.sf
1 928 1 3 2 5 928
x2nd.flr.sf low.qual.fin.sf gr.liv.area bsmt.full.bath bsmt.half.bath
1 701 0 1629 0 0
full.bath half.bath bedroom.abvgr kitchen.abvgr kitchen.qual
1 2 1 3 1 5
totrms.abvgrd functional fireplaces fireplace.qu garage.type
1 6 7 1 6 2
garage.yr.blt garage.finish garage.cars garage.area garage.qual
1 1997 1 2 482 6
garage.cond paved.drive wood.deck.sf open.porch.sf enclosed.porch
1 6 3 212 34 0
x3ssn.porch screen.porch pool.area pool.qc fence misc.feature mo.sold
1 0 0 0 3 3 2 3
yr.sold sale.type sale.condition ln.lot.frontage ln.lot.area house.age
1 2010 3 1 4.304065 9.534595 13
yrs.since.remod has.2nd.floor
1 12 1
If we change the garage size from a 2 car garage to a 3 car garage, model rf1 predicts price to increase to $182090.83 (a 2.28% change in sale price). Model rf2 predicts price to increase to $187004.04 (a 4.46% change in sale price).
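The scenario above amounts to editing one field of the observation and re-predicting; a sketch follows, again assuming the test frame is named test_mike.

# take the first test observation, change the garage size, and re-predict
obs <- test_mike[1, ]
predict(modelrf1, obs)   # baseline prediction with a 2-car garage
obs$garage.cars <- 3
predict(modelrf1, obs)   # prediction with a 3-car garage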
These models are primarily limited by external validity. The data are limited to Ames, Iowa from 2006 to 2010, and our tests of predictive ability (test RMSE) were conducted on held-out data from the same market. Applying these models to other areas (such as New York City), to commercial markets, or to time periods before 2006 or after 2010 may therefore produce unreliable predictions. Additionally, the 2008 financial crisis may make these models specific to this time period and region. Future work could compare Ames, Iowa to other housing markets or develop a model based on housing data from a larger region or at a national scope. The difficulty in developing such datasets is maintaining consistency in variables across regions, especially for subjective features like overall house quality (overall.qual).
We attempted several approaches to modeling. We found that OLS linear regression models failed to satisfy the assumptions needed for proper interpretation. However, a regression model does allow us to understand some factors that are important to sale price, such as overall house quality and neighborhood. With the more complicated predictive models, we sacrificed interpretability of variable coefficients for predictive ability. The most robust models were the random forest models (rf1, test RMSE: $22272.18, and rf2, test RMSE: $22058.44).
De Cock, Dean. 2011. “Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project.” Journal of Statistics Education 19 (3). www.amstat.org/publications/jse/v19n3/decock.pdf.
Liaw, Andy, and Matthew Wiener. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22. http://www.bios.unc.edu/~dzeng/BIOS740/randomforest.pdf.
Peng, Roger D., and Elizabeth Matsui. 2017. The Art of Data Science. https://bookdown.org/rdpeng/artofdatascience/.
The MathWorks, Inc. 2017. “Lasso and Elastic Net.” https://www.mathworks.com/help/stats/lasso-and-elastic-net.html.
University of North Carolina Department of Statistics. 2016. “Subprime Mortgage Crisis.” http://www.stat.unc.edu/faculty/cji/fys/2012/Subprime%20mortgage%20crisis.pdf.