Thesis

After considering the variables described below, I created a model that predicts home values with a median error of about $2,000 and maximum errors of around +/- $180,000. The model has an R-squared value of 0.64.

Method

I removed several outliers from the data, including houses priced above $500k or below $75k and houses with more than 4 bathrooms. I then split the data into training and test groups so that I could build a model that maximizes R^2 while avoiding overfitting. I fit several candidate models and ran cor.tests to determine the best variables to include.
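The filtering and splitting steps described above could look roughly like the sketch below. It is an illustration only: the `homes` data frame is simulated, the `bathrooms` column name is assumed, and the 70/30 split ratio is an assumption, since the split proportion is not stated here.

```r
# Hypothetical stand-in for the raw data; `bathrooms` is an assumed column name.
set.seed(42)
n <- 1000
homes <- data.frame(
  sale_price  = round(runif(n, 40000, 700000)),
  gr_liv_area = round(runif(n, 600, 3500)),
  bathrooms   = sample(1:6, n, replace = TRUE)
)

# Drop the outliers: prices outside $75k-$500k and homes with more than 4 bathrooms.
kept <- subset(homes,
               sale_price >= 75000 & sale_price <= 500000 & bathrooms <= 4)

# Random split into training and test rows (70/30 is an assumed ratio).
train_idx <- sample(nrow(kept), size = floor(0.7 * nrow(kept)))
train <- kept[train_idx, ]
test  <- kept[-train_idx, ]
```

Fitting on `train` and checking R^2 on `test` is what guards against overfitting: a model that only memorizes the training rows will score noticeably worse on the held-out rows.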

After looking at the correlations shown above, running several cor.tests, and testing multiple models, I settled on a model based on the following variables:

  • Living Area: This variable had the strongest positive correlation with sale price, which is to be expected.

  • Lot Area: The pairs diagram doesn’t reveal it as clearly, but there is a positive relationship between lot area and sale price.

  • Age of the house: There is a strong negative correlation between the age of the house and the sale price. This relationship was more linear and had a tighter confidence interval than remodel age.

  • Central Air (AC): Every home above a certain price had central air, which suggests that it plays a role in pricing.

  • All of these variables had p-values below 2.2e-16 in the cor.tests.

  • While not included in the final model, the number of bathrooms (full + half) had a strong positive correlation with sale price. Lot area was ultimately chosen over the number of bathrooms because it is a continuous variable and produced a higher R^2 value.
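For readers unfamiliar with `cor.test`, the sketch below shows the kind of check used when screening variables. The data are simulated for illustration (the slope and noise level are made up), so the numbers it produces are not the ones from this analysis.

```r
# Simulated example data; not the housing data used in this report.
set.seed(1)
n <- 200
gr_liv_area <- runif(n, 600, 3500)
sale_price  <- 50000 + 90 * gr_liv_area + rnorm(n, sd = 30000)

# Pearson correlation test between one predictor and the response.
ct <- cor.test(gr_liv_area, sale_price)

ct$estimate   # correlation coefficient
ct$conf.int   # 95% confidence interval for the correlation
ct$p.value    # p-value for the null hypothesis of zero correlation
```

When a p-value is smaller than machine precision, R's printed output reports it as `< 2.2e-16` rather than as an exact value.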

A summary of the final model is shown below:

## 
## Call:
## lm(formula = sale_price ~ gr_liv_area + lot_area + age + ac, 
##     data = t_test)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110698  -24315   -2101   18887  153105 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 52622.8109 10118.8883   5.200          0.000000277 ***
## gr_liv_area    93.9162     3.4857  26.943 < 0.0000000000000002 ***
## lot_area        0.7566     0.1805   4.192          0.000032034 ***
## age          -934.4950    57.4332 -16.271 < 0.0000000000000002 ***
## ac          13380.4335  7719.8676   1.733               0.0836 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38240 on 575 degrees of freedom
## Multiple R-squared:  0.7097, Adjusted R-squared:  0.7077 
## F-statistic: 351.5 on 4 and 575 DF,  p-value: < 0.00000000000000022
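Error figures such as a median error, a maximum error, and an out-of-sample R^2 can be computed by predicting on held-out rows. The sketch below shows one way to do that with the same model formula. The data are simulated with made-up coefficients, and the 70/30 split and the 0/1 coding of `ac` are assumptions, so it will not reproduce the numbers in this report.

```r
# Simulated data using the same column names as the model formula.
set.seed(7)
n <- 800
sim <- data.frame(
  gr_liv_area = runif(n, 600, 3500),
  lot_area    = runif(n, 2000, 20000),
  age         = runif(n, 0, 100),
  ac          = rbinom(n, 1, 0.9)   # assumed coding: 1 = central air, 0 = none
)
sim$sale_price <- 50000 + 90 * sim$gr_liv_area + 0.8 * sim$lot_area -
  900 * sim$age + 13000 * sim$ac + rnorm(n, sd = 38000)

# Hold out 30% of the rows (assumed ratio) and fit on the rest.
train_idx <- sample(n, size = floor(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit <- lm(sale_price ~ gr_liv_area + lot_area + age + ac, data = train)

# Prediction errors on the held-out rows.
pred <- predict(fit, newdata = test)
err  <- test$sale_price - pred

median_err <- median(err)   # typical signed error
err_range  <- range(err)    # largest under- and over-prediction

# Out-of-sample R^2: 1 - SSE / SST on the test rows.
test_r2 <- 1 - sum(err^2) / sum((test$sale_price - mean(test$sale_price))^2)
```

An out-of-sample R^2 computed this way is usually somewhat lower than the `Multiple R-squared` printed by `summary()`, because the latter is measured on the same rows the model was fit to.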