Introduction

In this report I model the sale prices of houses using a few different variables. I expected that age, quality, and the number of garage car spaces would have a significant impact on price. I found that age tends to decrease the sale price, while quality and the number of garage spaces increase it.

Variables and Data Cleanup

As you can see in the table below, most of the houses in our data are rated above average. I wanted to focus specifically on higher quality houses to see the effect on sale price. I have corrected for outliers, as there are a few that may be skewing our data. I decided to cap the number of garage car spaces at 4, as most of our data is between 1 and 3. I have created a field that tells us how old each house is. I have also created a field called “goodquality” that labels any house previously labeled as Very Excellent, Excellent, or Good as Prefer, and anything else as Avoid. This allows me to focus primarily on the high quality houses.
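The cleanup steps described above could be sketched roughly as follows. This is a minimal illustration on made-up data, not the code behind this report: the data frame `houses` and the columns `garage_cars`, `year_built`, `year_sold`, and `overall_qual` are assumed names, and the age is assumed to be measured at the time of sale.

```r
# Toy data standing in for the real data set; all column names are assumed.
houses <- data.frame(
  garage_cars  = c(0, 2, 3, 5),
  year_built   = c(1950, 1995, 2005, 2008),
  year_sold    = c(2010, 2008, 2009, 2010),
  overall_qual = c("Average", "Good", "Excellent", "Very_Excellent"),
  stringsAsFactors = FALSE
)

# Cap garage spaces at 4 so the rare larger values do not dominate.
houses$garage_cars_capped <- pmin(houses$garage_cars, 4)

# Age of the house in years (assumed here to be age at the time of sale).
houses$years_old <- houses$year_sold - houses$year_built

# Label the chosen quality levels as Prefer and everything else as Avoid.
preferred <- c("Very_Excellent", "Excellent", "Good")
houses$goodquality <- ifelse(houses$overall_qual %in% preferred, "Prefer", "Avoid")

houses
```

Using `pmin()` keeps every row in the data while limiting the influence of the few houses with very large garages.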

Key Variables

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.766   2.000   5.000
## 
##  Above_Average        Average  Below_Average      Excellent           Fair 
##            732            825            226            107             40 
##           Good           Poor Very_Excellent      Very_Good      Very_Poor 
##            602             13             31            350              4

Variable Outliers

I used a log transformation to distribute the sale prices more evenly, which reduces the skew that outliers introduce into our data.
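To illustrate why the log helps, here is a small sketch on made-up prices (the values and variable names are assumptions for illustration only). It compares how far the largest price sits from the median before and after the transformation.

```r
# Made-up sale prices with one very expensive house.
sale_price <- c(90000, 120000, 150000, 180000, 755000)
log_sale_price <- log(sale_price)

# Distance of the largest value from the median, relative to the median.
gap_raw <- (max(sale_price) - median(sale_price)) / median(sale_price)
gap_log <- (max(log_sale_price) - median(log_sale_price)) / median(log_sale_price)

gap_raw
gap_log
```

On the log scale the expensive house sits much closer to the rest of the data, so it pulls less on the fitted model.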

Variable Correlation

After running a simple correlation, we can see there is a definite correlation between sale price and garage car spaces, years old, and overall quality. The correlation between sale price and years old is negative, meaning that as a house gets older, its sale price drops.

Model

In this model we have an adjusted R-squared of 0.5944, meaning that 59.44 percent of the variation in log sale price can be explained by the model, which is a moderately strong fit. We can also see that if a house is considered good quality, price increases. With each additional year of age, price decreases. Finally, as the number of garage car spaces increases, price increases. We can verify the significance of each variable by looking at its p-value, all of which are very small.

## 
## Call:
## lm(formula = log_sale_price ~ goodquality + years_old_mod + garage_cars_capped, 
##     data = tt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.48536 -0.14183 -0.00819  0.14153  1.03738 
## 
## Coefficients:
##                      Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)        11.7605323  0.0185734  633.19 <0.0000000000000002 ***
## goodquality         0.4476312  0.0238930   18.73 <0.0000000000000002 ***
## years_old_mod      -0.0044717  0.0001887  -23.69 <0.0000000000000002 ***
## garage_cars_capped  0.2278066  0.0076938   29.61 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2596 on 2926 degrees of freedom
## Multiple R-squared:  0.5948, Adjusted R-squared:  0.5944 
## F-statistic:  1432 on 3 and 2926 DF,  p-value: < 0.00000000000000022
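Because the outcome is the log of sale price, each coefficient is easier to read once it is converted to an approximate percent change in price. The sketch below does that arithmetic using the estimates copied from the summary above; the conversion 100 * (exp(b) - 1) is the standard reading of a log-outcome regression, and the variable names `coefs` and `pct_change` are only illustrative.

```r
# Coefficients copied from the model summary above.
coefs <- c(goodquality        = 0.4476312,
           years_old_mod      = -0.0044717,
           garage_cars_capped = 0.2278066)

# With a log outcome, a one-unit increase multiplies the predicted price by
# exp(b), so the percent change in price is 100 * (exp(b) - 1).
pct_change <- 100 * (exp(coefs) - 1)
round(pct_change, 1)
```

Read this way, a good quality house is predicted to sell for roughly 56 percent more, each additional garage space adds roughly 26 percent, and each additional year of age lowers the predicted price by roughly 0.4 percent, holding the other variables fixed.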

Predictions and Limitations

As shown below, our RMSE is 48,338.93, which shows that our predicted prices and the actual prices differ noticeably: a typical prediction is off by roughly $48,338.93 from the actual sale price. Although we see a fairly clear relationship between price and quality, house age, and number of garage car spaces, there may be things we are missing, for instance location, size, and the number of bathrooms in the house. This model could be developed further to make it more accurate, but I have chosen to focus on only three variables.

##       rmse
## 1 48338.93
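For reference, an RMSE in dollars from a log-price model can be computed along the following lines. This is a sketch on made-up numbers, assuming the log-scale predictions are converted back to dollars with `exp()` before the error is measured; the vectors here are placeholders for the real prices and for the output of `predict()` on the fitted model.

```r
# Made-up actual prices and log-scale predictions (placeholders only).
actual_price   <- c(120000, 150000, 210000, 305000)
pred_log_price <- log(c(130000, 140000, 200000, 330000))

# Convert the predictions back to dollars before measuring the error.
pred_price <- exp(pred_log_price)

# Root mean squared error: square the errors, average them, take the root.
rmse <- sqrt(mean((actual_price - pred_price)^2))
rmse
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones, so it is best read as a typical error size rather than a plain average.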