Introduction

The goal of this analysis is to build a model that predicts house prices and identifies the most important factors in house value. I took this to mean that the real estate company is mainly interested in a good prediction model and would value predictive power over interpretability and inference. For this reason, my final prediction model used gradient boosting. This yielded a low prediction error at the cost of interpretability. Still, I was able to obtain variable importance plots to provide PA-VA Realty with the most important factors.

Exploratory Data Analysis

First, I read in the data and kept the names train and test. I ran a summary for all columns in the train dataset, and plotted each variable against price. This gave me a broader picture of the distribution of each predictor and its relationship with price. Right off the bat, the variable with the strongest relationship to price seemed to be square feet. I also checked the summary statistics for the test dataset to compare it with the train dataset, and they looked similar enough to not cause too many issues. In both datasets, the lot area variable was skewed far to the right, indicated by its mean exceeding its median. The categorical variables for the train and test set also had similar distributions.

Then, I checked each column for NA values and found that fireplace was the only column with any missing values. Rather than deleting these rows, I decided to impute the NA cells with the column mean for both the training and test set. I also removed the ID column from the training set since it had no relationship with price. Then I removed observations from the training set that had rare levels of categorical variables. There was a single observation with mobile home as the description, and this caused trouble with the validation set approach to calculating the MSE, since a level that appears in only one split cannot be predicted by a model fit on the other. So I decided to delete this row from train, as well as homes that had log or concrete as the exterior finish. In all, I deleted 8 rows from the train set before I started building models.
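As a minimal sketch of the imputation step described above, written in Python with made-up fireplace counts: the helper name impute_with_mean is hypothetical, and filling each dataset with its own column mean is an assumption about how the step was applied, not a record of the original code.

```python
from statistics import mean

def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values], fill

# Illustrative data only; None marks a missing fireplace count.
train_fireplaces = [1.0, 0.0, None, 2.0, 1.0]
test_fireplaces = [None, 1.0, 0.0]

train_filled, train_mean = impute_with_mean(train_fireplaces)
test_filled, test_mean = impute_with_mean(test_fireplaces)
print(train_filled, test_filled)
```

An alternative design would fill the test set with the training mean so that no information from the test data enters preprocessing; with only one sparsely missing column the difference is likely small.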

I then took a quick look at the lot area variable again, and found it strange that some houses had 0 sqft and others had extremely high values. I filtered the rows with lot area equal to 0 and found they were all condominiums, which made sense. Despite the median lot area being 7744 sqft, there were 62 observations with 100,000 sqft or more of lot area. I decided to keep these in the training set since there were far too many to dismiss them as outliers. We would need to take this large variation into account when building models so that houses with large lot areas can have accurate predictions.

I viewed histograms of sqft and price to get a better sense of how each is distributed. Both were far more symmetrical than lot area, with only a few outliers in the right tail.

Lastly, I deleted the zipcode column completely from the dataset. I tried to include it in the models at first (after converting it to a factor), but it greatly increased the dimensionality of the models and its levels were often insignificant. We still have the average income column, which is directly tied to zip code, so we can use this variable to obtain some information about the relationship between zip code and home prices.

Model Building

Before training any models, I split the train dataset into train1 and test1. I tried to include cross-validated error whenever possible, but in some instances had to use validation set error.

The first model was the baseline linear model with all predictors. The 10-fold CV MSE was 18,198,966,972. Interestingly, almost every predictor was significant. I manually adjusted the linear model to include only the significant predictors and obtained a 10-fold CV MSE of 18,592,376,008. My manual adjustment did not decrease the MSE, so I moved on to automated methods. I also looked at the regression diagnostics for these models and found evidence of heteroskedasticity in the standardized residual plot.
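The 10-fold CV MSE reported here can be illustrated with a short sketch. This is not the original code: it is Python on synthetic data, it uses a single predictor (a stand-in for sqft) rather than all predictors, and the names fit_line and kfold_cv_mse are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def kfold_cv_mse(xs, ys, k=10, seed=0):
    """Average of the k held-out-fold MSEs."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in idx if i not in held]
        b0, b1 = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in held_out]
        fold_mses.append(sum(errors) / len(errors))
    return sum(fold_mses) / k

# Synthetic stand-in data: price is linear in sqft plus noise with sd 20,000.
rng = random.Random(1)
sqft = [rng.uniform(800, 4000) for _ in range(200)]
price = [50_000 + 150 * s + rng.gauss(0, 20_000) for s in sqft]
cv_mse = kfold_cv_mse(sqft, price, k=10)
print(f"10-fold CV MSE: {cv_mse:,.0f}")
```

Because MSE is in squared dollars, values in the billions correspond to typical errors in the tens or low hundreds of thousands of dollars, which is why the figures in this section look so large.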

I tried both ridge regression and lasso, with the tuning parameter chosen by cross-validation. The 10-fold CV MSE for the chosen ridge regression model was 18,247,763,194 and for lasso it was 18,836,506,616. I then tried OLS after lasso and obtained a CV MSE of 17,869,313,365. These methods attempt to reduce variance and protect against overfitting, but ridge and lasso ended up performing slightly worse than OLS, and OLS after lasso improved on it by less than 2%. This told me that variance was not the main problem: OLS was already very biased, and a more flexible approach was needed.

I built a single tree using train1 and estimated the MSE on the validation set (test1). This yielded an MSE of 20,770,133,160. I tried pruning to prevent overfitting, but the number of terminal nodes chosen by CV was the same as in the full original tree, so pruning changed nothing.

The breakthrough occurred when I tried bagging, random forest, and boosting. The validation set MSE from bagging was 8,684,574,624, which was a vast improvement. I then used a for loop to find the best mtry parameter for the random forest model with respect to validation set error. The best mtry was 3, and the resulting MSE for this model was 6,556,605,797. For the gradient boosted model, I tuned the learning rate (lambda) and the number of trees simultaneously using the validation set approach. The best number of trees was 500, and the best lambda was 0.1. This resulted in an MSE of 6,170,604,767.
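To make the joint tuning of learning rate and number of trees concrete, here is a small self-contained sketch. It is an illustration under stated assumptions, not the model used in this report: it is Python rather than the original software, the boosting is a bare-bones version with one-split stumps on a single synthetic predictor, and the names fit_stump and boost_path and the grid values are hypothetical.

```python
import random

def fit_stump(xs, residuals):
    """Best one-split regression stump under squared error: (threshold, left_mean, right_mean)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    n, total = len(xs), sum(residuals)
    best, left_sum = None, 0.0
    for pos in range(1, n):
        i = order[pos - 1]
        left_sum += residuals[i]
        if xs[order[pos]] == xs[i]:
            continue  # cannot split between equal x values
        right_sum = total - left_sum
        # Maximizing this score is equivalent to minimizing the split's SSE.
        score = left_sum ** 2 / pos + right_sum ** 2 / (n - pos)
        if best is None or score > best[0]:
            threshold = (xs[i] + xs[order[pos]]) / 2
            best = (score, threshold, left_sum / pos, right_sum / (n - pos))
    return best[1:]

def boost_path(x_tr, y_tr, x_val, y_val, n_trees, lam):
    """Fit stumps to residuals sequentially; return validation MSE after each tree."""
    base = sum(y_tr) / len(y_tr)
    pred_tr = [base] * len(y_tr)
    pred_val = [base] * len(y_val)
    path = []
    for _ in range(n_trees):
        resid = [y - p for y, p in zip(y_tr, pred_tr)]
        thr, left, right = fit_stump(x_tr, resid)
        pred_tr = [p + lam * (left if x <= thr else right) for x, p in zip(x_tr, pred_tr)]
        pred_val = [p + lam * (left if x <= thr else right) for x, p in zip(x_val, pred_val)]
        path.append(sum((y - p) ** 2 for y, p in zip(y_val, pred_val)) / len(y_val))
    return path

# Synthetic non-linear data, split into training and validation parts.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(300)]
ys = [x ** 2 + rng.gauss(0, 3) for x in xs]
x_tr, y_tr, x_val, y_val = xs[:200], ys[:200], xs[200:], ys[200:]

tree_grid = [50, 100, 200]       # hypothetical grid values
lambda_grid = [0.01, 0.1, 0.5]   # hypothetical grid values
results = {}
for lam in lambda_grid:
    path = boost_path(x_tr, y_tr, x_val, y_val, max(tree_grid), lam)
    for n in tree_grid:
        results[(n, lam)] = path[n - 1]  # validation MSE with the first n trees

best_n, best_lam = min(results, key=results.get)
print(best_n, best_lam, results[(best_n, best_lam)])
```

The two parameters are searched together because they trade off against each other: a smaller learning rate generally needs more trees to reach the same fit. Since the same validation set is used both to pick the parameters and to report the error, the reported MSE may be somewhat optimistic.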

The best performing models in terms of MSE were the random forest and the gradient boosted model. The variable importance rankings for the two were similar: sqft was by far the most important variable, followed by number of bathrooms. After these top two, there was a steep dropoff in relative variable importance, and the models began to diverge (total rooms and lot area were 3rd and 4th for random forest, but lot area and rooftype were 3rd and 4th for boosting). The remaining variables were more or less equal in importance according to these models.

Conclusion

Overall, the most challenging aspect of this dataset was predicting observations with very high prices. For the chosen prediction model, the MSE for observations with a price greater than $1,000,000 (the 97th percentile for test1) was 34,898,460,468, roughly 6.6 times the MSE of 5,286,260,545 for observations below $1,000,000. So this prediction model should perform well for houses in an ordinary price range, but will have difficulty predicting home prices for outliers. There was also a fair amount of non-linearity and high dimensionality in the data, evidenced by the poor regression diagnostics. This made interpretation difficult, and flexible non-parametric methods like random forest and gradient boosting had to be used to reduce bias and make good predictions. I would trust my best model, for the most part, since its MSE was very low relative to the other models. However, it is a black box method and there were no tests for significance, so I would not put my complete trust in the model or its measures of variable importance.
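The split of the MSE by price range can be computed as in the following sketch. The prices and predictions are invented for illustration, and the helper names mse and mse_by_threshold are hypothetical; only the $1,000,000 cutoff comes from the analysis above.

```python
def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mse_by_threshold(actual, predicted, threshold):
    """Return (MSE for actual > threshold, MSE for actual <= threshold)."""
    above = [(a, p) for a, p in zip(actual, predicted) if a > threshold]
    below = [(a, p) for a, p in zip(actual, predicted) if a <= threshold]
    mse_above = mse([a for a, _ in above], [p for _, p in above])
    mse_below = mse([a for a, _ in below], [p for _, p in below])
    return mse_above, mse_below

# Invented validation prices and predictions, in dollars.
actual = [250_000, 400_000, 900_000, 1_200_000, 2_500_000]
predicted = [260_000, 390_000, 950_000, 1_000_000, 2_000_000]

mse_high, mse_low = mse_by_threshold(actual, predicted, 1_000_000)
print(f"above $1M: {mse_high:,.0f}  at or below $1M: {mse_low:,.0f}")
```

Because errors are squared, a handful of expensive homes with large misses can dominate the overall MSE, which is consistent with the gap between the two figures reported above.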

If I had more time, I would have included polynomial and interaction terms in the linear model, and chosen the best model based on a combination of forward selection, backward selection, and best subsets (or OLS after lasso). This would have given me a flexible model that is more interpretable than the random forest or boosted models. However, the model selection algorithms take a long time to run on my PC, and it was not worth the time.