In this model, I am trying to predict house sale prices using a few variables. I predicted that age, quality, and the number of garage car spaces would have a significant impact on price. I found that while age tends to decrease the sale price, quality and the number of garage spaces increase it.
As you can see in the table below, most of the houses in our data are rated above average. I wanted to focus specifically on higher-quality houses to see the effect on sale price. I have corrected for outliers, as there are a few that may be skewing our data. I decided to cap the number of garage car spaces at 4, as most of our data is between 1 and 3. I have created a field that tells us how old each house is. I have also created a field called “goodquality” that labels any house previously rated Very Excellent, Excellent, or Good as Prefer, and labels anything else as Avoid. This allows me to focus primarily on the high-quality houses.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   1.000   2.000   1.766   2.000   5.000
##
##  Above_Average        Average  Below_Average      Excellent           Fair
##            732            825            226            107             40
##           Good           Poor Very_Excellent      Very_Good      Very_Poor
##            602             13             31            350              4
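The preparation steps described above (capping garage spaces, computing house age, and building the quality label) can be sketched in R as follows. This is only an illustration on a made-up data frame: the names `houses`, `garage_cars`, `year_built`, `year_sold`, and `overall_qual` are assumptions, and the original code may have been written differently.

```r
# Hypothetical toy data; the data frame and column names are assumptions,
# not the ones used in the original analysis.
houses <- data.frame(
  garage_cars  = c(0, 1, 2, 3, 5),
  year_built   = c(1950, 1978, 1995, 2003, 2008),
  year_sold    = c(2008, 2009, 2007, 2010, 2009),
  overall_qual = c("Average", "Good", "Excellent", "Fair", "Very_Excellent")
)

# Cap garage spaces at 4 so the rare 5-car garage does not act as an outlier.
houses$garage_cars_capped <- pmin(houses$garage_cars, 4)

# Age of each house at the time of sale.
houses$years_old <- houses$year_sold - houses$year_built

# Label the quality levels named in the text as "Prefer", everything else "Avoid".
preferred <- c("Very_Excellent", "Excellent", "Good")
houses$goodquality <- ifelse(houses$overall_qual %in% preferred, "Prefer", "Avoid")

print(houses[, c("garage_cars_capped", "years_old", "goodquality")])
```

Using `pmin()` keeps every row in the data and only pulls the extreme value down to the cap, which is one simple way to limit the influence of outliers without dropping observations.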
I used the log command to transform the sale prices so they are better
distributed, which reduces the skew that outliers introduce into our data.
After running a simple correlation, we can see there is a definite
correlation between sale price and garage car spaces, years old, and overall
quality. The correlation between sale price and years old is negative,
meaning that as a house gets older, its sale price drops.
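A minimal sketch of the log transformation and the correlation check, using a handful of made-up values rather than the real data (the vector names here are assumptions chosen for illustration):

```r
# Toy values for illustration only; not the data used in this report.
sale_price  <- c(105000, 129500, 172000, 215000, 244000, 755000)
garage_cars <- c(1, 1, 2, 2, 3, 3)
years_old   <- c(60, 48, 31, 12, 7, 2)

# The natural log compresses the long right tail of the prices.
log_sale_price <- log(sale_price)

# The most expensive house sits much closer to the rest on the log scale.
max(sale_price) / median(sale_price)
max(log_sale_price) / median(log_sale_price)

# Pearson correlations with log price.
cor(log_sale_price, garage_cars)  # positive in this toy example
cor(log_sale_price, years_old)    # negative in this toy example
```

The sign of each correlation gives the direction of the relationship, while its size (between -1 and 1) gives the strength.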
## Model
In this model, we have an adjusted R-squared of 0.5944, meaning that about 59.44 percent of the variation in log sale price is explained by the model, which is a fairly strong fit. We can also see that if a house is considered good quality, price increases. As age increases by a year, price decreases. Finally, as the number of garage car spaces increases, price increases. We can verify the significance of these effects by looking at the p-value of each variable, all of which are very small.
##
## Call:
## lm(formula = log_sale_price ~ goodquality + years_old_mod + garage_cars_capped,
## data = tt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48536 -0.14183 -0.00819 0.14153 1.03738
##
## Coefficients:
##                      Estimate Std. Error t value            Pr(>|t|)
## (Intercept)        11.7605323  0.0185734  633.19 <0.0000000000000002 ***
## goodquality         0.4476312  0.0238930   18.73 <0.0000000000000002 ***
## years_old_mod      -0.0044717  0.0001887  -23.69 <0.0000000000000002 ***
## garage_cars_capped  0.2278066  0.0076938   29.61 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2596 on 2926 degrees of freedom
## Multiple R-squared: 0.5948, Adjusted R-squared: 0.5944
## F-statistic: 1432 on 3 and 2926 DF, p-value: < 0.00000000000000022
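Because the response is the log of sale price, the coefficients are easier to read once converted to percentage changes in price. The sketch below does that conversion with the estimates copied from the summary above. It assumes that `goodquality` enters the model as a 0/1 indicator, which the output suggests but does not confirm.

```r
# Coefficient estimates copied from the model summary above.
coefs <- c(
  goodquality        =  0.4476312,
  years_old_mod      = -0.0044717,
  garage_cars_capped =  0.2278066
)

# In a log-linear model, a one-unit increase in a predictor multiplies the
# predicted price by exp(coefficient), holding the other predictors fixed.
# Assumption: goodquality is a 0/1 indicator, so "one unit" means Prefer vs. Avoid.
pct_change <- (exp(coefs) - 1) * 100
round(pct_change, 1)
```

Read this way, a good-quality house is predicted to sell for roughly 56 percent more, each additional garage space adds roughly 26 percent, and each additional year of age lowers the predicted price by a little under half a percent.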
As shown below, our RMSE is 48,338.93, which shows how far our predicted prices are from the actual prices: a typical prediction misses the actual sale price by roughly $48,338.93. Although we see a clear relationship between price and quality, house age, and number of garage car spaces, there may be things we are missing, such as location, size, and the number of bathrooms in the house. This model could be developed further to make it more accurate, but I have chosen to focus on only 3 variables.
## rmse
## 1 48338.93
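For reference, an RMSE in dollars can be computed from a log-price model along the following lines. This is a sketch on made-up numbers, and it assumes the predictions were converted back from the log scale with `exp()` before being compared to actual prices; the report does not show how the figure above was produced.

```r
# Toy values for illustration only; not the data or results from this report.
actual_price   <- c(150000, 200000, 320000)
pred_log_price <- log(c(140000, 215000, 300000))  # stand-in for model predictions

# Assumption: predictions are back-transformed from the log scale to dollars.
pred_price <- exp(pred_log_price)

# Root mean squared error in dollars.
rmse <- sqrt(mean((actual_price - pred_price)^2))
rmse
```

Because the errors are squared before averaging, large misses count for more than small ones, so RMSE is usually somewhat larger than the plain average of the absolute errors.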