After analyzing the various factors in the Ames, Iowa housing prices
dataset, I found that the ones that best predicted the sale price of a
house were the number of fireplaces it contained, the year it was built,
the size of its garage, the square footage of its living area, and
whether or not it was in good condition.
Based on the correlation plot above, all of the variables I picked have a positive correlation with sale price, with total living area and garage area having the strongest correlations. A positive correlation means that as the variable increases, the sale price tends to increase as well. The variables I chose also have fairly low correlation with each other, so collinearity is not a big issue. Based on these correlations, I decided to use them in my model, but first I needed to split the data into a training set and a test set.
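As a rough sketch of how these correlations could be checked in R, the helper below ranks predictors by their correlation with sale price and also returns the correlations among the predictors themselves. The function name `price_correlations`, the data frame name `housing`, and the toy rows are assumptions made for illustration; they are not the actual Ames data or the code used for the plot.

```r
# Hypothetical helper: correlation of each predictor with the target,
# plus the predictor-by-predictor correlation matrix (collinearity check).
price_correlations <- function(housing, predictors, target = "SalePrice") {
  with_target <- sapply(
    predictors,
    function(v) cor(housing[[v]], housing[[target]])
  )
  list(
    with_target = sort(with_target, decreasing = TRUE),
    among_predictors = cor(housing[, predictors, drop = FALSE])
  )
}

# Made-up toy rows, only to show the call; not real Ames observations.
toy <- data.frame(
  SalePrice  = c(120000, 150000, 185000, 210000, 260000),
  GrLivArea  = c(900, 1200, 1500, 1700, 2300),
  GarageArea = c(250, 200, 480, 400, 600)
)
result <- price_correlations(toy, c("GrLivArea", "GarageArea"))
print(result$with_target)
print(result$among_predictors)
```

Large off-diagonal values in `among_predictors` would be a warning sign for collinearity; low values support using the predictors together.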
To verify the performance of the model, I split the data into a training set (houses sold from 2006 to 2009) and a test set (houses sold in 2010).
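A minimal sketch of such a year-based split is shown below. It assumes the sale year is stored in a numeric column named `YrSold`; the helper name `split_by_sale_year`, the data frame name `housing`, and the toy rows are hypothetical.

```r
# Hypothetical helper: hold out one sale year as the test set and keep
# all earlier years for training. Assumes a numeric YrSold column.
split_by_sale_year <- function(housing, test_year = 2010) {
  list(
    train = housing[housing$YrSold < test_year, , drop = FALSE],
    test  = housing[housing$YrSold == test_year, , drop = FALSE]
  )
}

# Made-up rows to show the call; not real Ames observations.
toy <- data.frame(
  YrSold    = c(2006, 2007, 2008, 2009, 2010, 2010),
  SalePrice = c(150000, 162000, 171000, 158000, 149000, 205000)
)
parts <- split_by_sale_year(toy)
print(nrow(parts$train))
print(nrow(parts$test))
```

Splitting by time rather than at random means the model is always evaluated on sales that happened after the ones it was trained on.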
This plot shows that the distributions of sale price in the training and test sets follow similar patterns. This suggests that a model trained on the training data should generalize reasonably well to the test set.
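The visual comparison can be complemented with a numeric one. The sketch below lines up a few quantiles of sale price for the two sets; the helper name `compare_price_quantiles`, the chosen quantiles, and the toy values are assumptions for illustration only.

```r
# Hypothetical helper: side-by-side quantiles of SalePrice for two sets.
compare_price_quantiles <- function(train, test,
                                    probs = c(0.1, 0.25, 0.5, 0.75, 0.9)) {
  rbind(
    train = quantile(train$SalePrice, probs),
    test  = quantile(test$SalePrice, probs)
  )
}

# Made-up prices to show the call; not real Ames observations.
toy_train <- data.frame(
  SalePrice = c(110000, 135000, 160000, 180000, 240000, 310000)
)
toy_test <- data.frame(
  SalePrice = c(115000, 140000, 155000, 190000, 250000)
)
quantile_table <- compare_price_quantiles(toy_train, toy_test)
print(quantile_table)
```

If the two rows are close at each quantile, the sets cover a similar range of prices.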
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + GarageArea + is_good_condition +
## YearBuilt + Fireplaces, data = t_train)
##
## Residuals:
##     Min       1Q   Median       3Q      Max
## -470200   -23863    -4074    17377   297015
##
## Coefficients:
##                       Estimate Std. Error t value             Pr(>|t|)
## (Intercept)       -1543368.827  62283.652 -24.780 < 0.0000000000000002 ***
## GrLivArea               72.162      2.059  35.049 < 0.0000000000000002 ***
## GarageArea              81.187      4.961  16.366 < 0.0000000000000002 ***
## is_good_condition    30547.979   3757.902   8.129 0.000000000000000664 ***
## YearBuilt              780.214     32.357  24.113 < 0.0000000000000002 ***
## Fireplaces           18030.575   1482.407  12.163 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 42920 on 2583 degrees of freedom
## Multiple R-squared: 0.7164, Adjusted R-squared: 0.7158
## F-statistic: 1305 on 5 and 2583 DF, p-value: < 0.00000000000000022
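As a quick consistency check on this summary, the size of the training set and the adjusted R-squared can be recovered from the reported values. The symbols n (number of training observations) and p (number of predictors, here 5) are introduced for this check and do not appear in the output itself. The residual degrees of freedom equal n - p - 1, so:

\[ n = 2583 + 5 + 1 = 2589 \]

\[ R^2_{\text{adj}} = 1 - (1 - R^2) \, \frac{n - 1}{n - p - 1} = 1 - (1 - 0.7164) \times \frac{2588}{2583} \approx 0.7159 \]

This agrees with the reported 0.7158 up to rounding of the displayed R-squared. The two values are nearly identical because five predictors is very small relative to 2589 observations.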
The linear regression model for predicting the Sale Price of a house sold in Ames, Iowa can be expressed as:
\[ \begin{aligned} \text{SalePrice} = & -1543368.827 + 72.162 \times \text{GrLivArea} \\ & + 81.187 \times \text{GarageArea} \\ & + 30547.979 \times \text{is_good_condition} \\ & + 780.214 \times \text{YearBuilt} \\ & + 18030.575 \times \text{Fireplaces} \end{aligned} \]
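Where GrLivArea is the living area in square feet, GarageArea is the size of the garage in square feet, is_good_condition indicates whether the house is in good condition, YearBuilt is the year the house was built, and Fireplaces is the number of fireplaces.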
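To illustrate how the fitted equation is used, consider a hypothetical house (not taken from the dataset) with 1500 square feet of living area, a 500 square foot garage, good condition, built in 2000, and one fireplace. This assumes is_good_condition is coded as 1 for a house in good condition:

\[ \begin{aligned} \text{SalePrice} = & -1543368.827 + 72.162 \times 1500 + 81.187 \times 500 \\ & + 30547.979 \times 1 + 780.214 \times 2000 + 18030.575 \times 1 \\ \approx & \ 214474 \end{aligned} \]

The large negative intercept is offset mostly by the YearBuilt term, which contributes about 1.56 million for a house built in 2000. Holding the other variables fixed, each additional square foot of living area raises the predicted price by about 72 dollars.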
This plot shows that actual price versus predicted price follows a linear trend for the most part. My model does not perform as well on the higher-priced houses, where it typically predicts a lower sale price than was actually observed.
This plot shows that the errors of my model on the test set are approximately normally distributed around zero. On the test data, the model has an R-squared of 0.73 and an RMSE of 38826.
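For reference, a minimal sketch of how these two test metrics are commonly computed is given below. The helper name `test_metrics` is hypothetical, and the R-squared here is defined as one minus the ratio of residual to total sum of squares, which is an assumption; the value reported above may have been computed differently (for example, as a squared correlation).

```r
# Hypothetical helper: RMSE and R-squared from actual and predicted values.
# R-squared is computed as 1 - SS_residual / SS_total (an assumed definition).
test_metrics <- function(actual, predicted) {
  errors <- actual - predicted
  rmse <- sqrt(mean(errors^2))
  r_squared <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)
  c(rmse = rmse, r_squared = r_squared)
}

# Made-up values to show the call; not real model output.
toy_actual    <- c(150000, 200000, 250000, 300000)
toy_predicted <- c(160000, 190000, 255000, 280000)
metrics <- test_metrics(toy_actual, toy_predicted)
print(metrics)
```

With a fitted model, `predicted` would typically come from calling `predict()` on the test set. RMSE is in the same units as sale price, so it can be read as a typical prediction error.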