Variable Selection

After analyzing the factors in the Ames, Iowa housing prices dataset, I found that the best predictors of a house's sale price were the number of fireplaces, the year the house was built, the size of the garage, the above-ground living area, and whether or not the house was in good condition.

Model Building

Based on the correlation plot above, all of the variables I picked have a positive correlation with sale price, with total living area and garage area having the strongest correlations. A positive correlation means that as the variable increases, the sale price is expected to increase as well. The variables I chose also have fairly low correlation with each other, so collinearity is not a big issue. Based on these correlations, I decided to use them in my model, but first I need to split the data into a training set and a test set.
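
As a rough sketch of the correlation check described above, the code below computes pairwise Pearson correlations with `cor()`. The data frame `toy` and all of its values are made up for illustration and are not the Ames data; only the column names match the variables used in this report.

```r
# Toy data standing in for the Ames data (assumption: values are invented).
set.seed(1)
n <- 200
toy <- data.frame(
  GrLivArea  = round(runif(n, 600, 3000)),
  GarageArea = round(runif(n, 0, 900)),
  YearBuilt  = sample(1900:2009, n, replace = TRUE),
  Fireplaces = sample(0:2, n, replace = TRUE)
)
toy$SalePrice <- 70 * toy$GrLivArea + 80 * toy$GarageArea + rnorm(n, sd = 20000)

# Pairwise Pearson correlations among the response and the candidate predictors.
cor_mat <- cor(toy)

# Correlation of each variable with SalePrice (the "SalePrice" row).
round(cor_mat["SalePrice", ], 2)
```

Large off-diagonal values between two predictors in `cor_mat` would be a warning sign for collinearity.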

Data Splitting

To verify the performance of the model I create, I split the data into a training set (houses sold from 2006 to 2009) and a test set (houses sold in 2010).
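
A split by sale year could look like the sketch below. It assumes the sale year is stored in a column named `YrSold`; the data frame `ames` and its rows are hypothetical placeholders, and `t_test` is an assumed name chosen to mirror the `t_train` name used in the model call.

```r
# Hypothetical stand-in for the full data set; only the sale year matters here.
ames <- data.frame(
  YrSold    = c(2006, 2007, 2008, 2009, 2010, 2010),
  SalePrice = c(208500, 181500, 223500, 140000, 250000, 143000)
)

# Training set: houses sold from 2006 to 2009. Test set: houses sold in 2010.
t_train <- subset(ames, YrSold <= 2009)
t_test  <- subset(ames, YrSold == 2010)
```

Splitting on time rather than at random means the model is always evaluated on sales that happened after the ones it was trained on.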

This plot shows that the distributions of the Sale Price in my training and test sets follow similar patterns. This suggests that a model trained on the training data should generalize reasonably well to the test set.

Model Performance

## 
## Call:
## lm(formula = SalePrice ~ GrLivArea + GarageArea + is_good_condition + 
##     YearBuilt + Fireplaces, data = t_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470200  -23863   -4074   17377  297015 
## 
## Coefficients:
##                       Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)       -1543368.827    62283.652 -24.780 < 0.0000000000000002 ***
## GrLivArea               72.162        2.059  35.049 < 0.0000000000000002 ***
## GarageArea              81.187        4.961  16.366 < 0.0000000000000002 ***
## is_good_condition    30547.979     3757.902   8.129 0.000000000000000664 ***
## YearBuilt              780.214       32.357  24.113 < 0.0000000000000002 ***
## Fireplaces           18030.575     1482.407  12.163 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 42920 on 2583 degrees of freedom
## Multiple R-squared:  0.7164, Adjusted R-squared:  0.7158 
## F-statistic:  1305 on 5 and 2583 DF,  p-value: < 0.00000000000000022

The linear regression model for predicting the Sale Price of a house sold in Ames, Iowa can be expressed as:

\[ \begin{align*} \text{SalePrice} = & -1543368.827 + 72.162 \times \text{GrLivArea} \\ & + 81.187 \times \text{GarageArea} \\ & + 30547.979 \times \text{is_good_condition} \\ & + 780.214 \times \text{YearBuilt} \\ & + 18030.575 \times \text{Fireplaces} \end{align*} \]

Where:

  • \(\text{GrLivArea}\) represents the above ground living area in square feet.
    • Holding the other variables fixed, each additional square foot of living area is expected to increase the price of the house by about 72 dollars.
  • \(\text{GarageArea}\) represents the garage area in square feet.
    • Holding the other variables fixed, each additional square foot of garage area is expected to increase the price of the house by about 81 dollars.
  • \(\text{is_good_condition}\) is a binary variable indicating whether the property is in good condition.
    • Holding the other variables fixed, a house in good condition (average or better) is expected to sell for about 30,500 dollars more than one that is not.
  • \(\text{YearBuilt}\) represents the year the house was built.
    • Holding the other variables fixed, a house built one year later (a newer house) is expected to sell for about 780 dollars more.
  • \(\text{Fireplaces}\) represents the number of fireplaces in the house.
    • Holding the other variables fixed, each additional fireplace is expected to increase the price of the house by about 18,000 dollars.
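
To make the equation concrete, here is a small worked example that plugs one house into the fitted coefficients. The coefficients are copied from the model summary; the house itself (1,500 square feet of living area, a 480 square foot garage, good condition, built in 2000, one fireplace) is invented for illustration.

```r
# Coefficients copied from the model summary above.
coefs <- c(
  "(Intercept)"     = -1543368.827,
  GrLivArea         = 72.162,
  GarageArea        = 81.187,
  is_good_condition = 30547.979,
  YearBuilt         = 780.214,
  Fireplaces        = 18030.575
)

# A made-up house (assumption); the intercept term is always multiplied by 1.
house <- c(
  "(Intercept)"     = 1,
  GrLivArea         = 1500,
  GarageArea        = 480,
  is_good_condition = 1,
  YearBuilt         = 2000,
  Fireplaces        = 1
)

# Predicted price is the sum of each coefficient times its variable's value.
predicted_price <- sum(coefs * house[names(coefs)])
predicted_price
```

For this hypothetical house the prediction comes out to roughly 212,850 dollars. The large negative intercept is offset by the YearBuilt term, which is why the intercept on its own has no practical interpretation.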

This plot shows that the actual price vs predicted price follows a linear trend for the most part. My model does not perform as well on the higher priced houses, where it typically predicts a lower sale price than is actually seen.

This plot shows that the errors of my model on the test data set are approximately normally distributed around zero. On the test data, the model has an r^2 value of .73 and an RMSE of 38826.

  • This RMSE value means that my price predictions on the test data set are typically off by about 39,000 dollars, which for house prices, I think is fairly decent.
  • This r^2 value means that about 73% of the variation in the Sale Price is explained by the factors in my model.
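
For reference, the two metrics can be computed as in the sketch below. The helper names `rmse` and `r_squared` and the four example prices are assumptions made for illustration, and this is one common definition of r^2 (one minus the ratio of residual to total sum of squares), which may differ from how the value reported above was calculated.

```r
# Root mean squared error: the typical size of a prediction error.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R-squared as 1 - (residual sum of squares / total sum of squares).
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

# Made-up actual and predicted prices (not from the Ames test set).
actual    <- c(200000, 150000, 300000, 120000)
predicted <- c(210000, 140000, 270000, 130000)

example_rmse <- rmse(actual, predicted)
example_r2   <- r_squared(actual, predicted)
```

Because RMSE squares the errors before averaging, a few large misses (such as the underpredicted high priced houses) raise it more than many small ones would.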