Part 1 - Introduction

Are above ground living area (in square feet) and the overall finish of a house predictive of its sale price?

If you are in the market to buy or sell a property, what factors drive the sale price? This question interests many buyers and sellers. In this research project, I explore data from Ames, Iowa to see which factors drive sale price.

Part 2 - Data

The data was collected directly from the Ames City Assessor’s Office in the form of a data dump from their records system. The initial data set contained 113 variables describing 3970 property sales that had occurred in Ames, Iowa between 2006 and 2010. The variables were a mix of nominal, ordinal, continuous, and discrete variables used in calculation of assessed values and included physical property measurements in addition to computation variables used in the city’s assessment process. Some of the variables were removed as they were related to weighting and adjustment factors used in the city’s current modeling system. They required special knowledge or previous calculation for their use.

After removal of these extraneous variables, 80 variables remained that were directly related to property sales. Most of the variables are exactly the type of information that a typical home buyer would want to know about a potential property (e.g. When was it built? How big is the lot? How many square feet of living space is in the dwelling? Is the basement finished? How many bathrooms are there?).

The refined data set contains 2930 house sales with the property details. The two main features that can be directly related to the sale price of a property are above ground living area and overall quality of the house. Sale price and above ground living area are continuous variables. Overall quality is an ordered categorical variable with 10 levels, ranging from 1 (very poor) to 10 (very excellent). Sale price is the response variable and the other two are explanatory variables.

Since this is an observational study, the results can only show association between the explanatory variables and the response variable, not causation. The data set contains less than 10% of the total sales in Iowa, but every sale comes from Ames, so the results generalize most directly to the Ames housing market; extending them to the rest of Iowa assumes that other local markets behave similarly.

Part 3 - Exploratory data analysis

dim(data_set)
## [1] 2930   81
str(data_set[,c(1:10, 81)])
## Classes 'tbl_df', 'tbl' and 'data.frame':    2930 obs. of  11 variables:
##  $ Gr.Liv.Area   : int  1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
##  $ Overall.Qual  : Factor w/ 10 levels "1","2","3","4",..: 6 5 6 7 5 6 8 8 8 7 ...
##  $ SalePrice     : int  215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
##  $ PID           : int  526301100 526350040 526351010 526353030 527105010 527105030 527127150 527145080 527146030 527162130 ...
##  $ MS.SubClass   : int  20 20 20 20 60 60 120 120 120 60 ...
##  $ MS.Zoning     : Factor w/ 7 levels "A (agr)","C (all)",..: 6 5 6 6 6 6 6 6 6 6 ...
##  $ Lot.Frontage  : int  141 80 81 93 74 78 41 43 39 60 ...
##  $ Lot.Area      : int  31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
##  $ Street        : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley         : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
##  $ Sale.Condition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...

The response variable - SalePrice

library(ggplot2)
library(scales)  # provides comma() for the axis labels

ggplot(data = data_set, mapping = aes(x = SalePrice)) +
  geom_histogram(fill = "blue", binwidth = 10000) +
  scale_x_continuous(breaks = seq(0, 800000, by = 100000), labels = comma)

qqnorm(data_set$SalePrice)
qqline(data_set$SalePrice)

library(psych)  # provides describe()
describe(data_set$SalePrice)
##    vars    n     mean       sd median  trimmed     mad   min    max  range
## X1    1 2930 180796.1 79886.69 160000 170429.1 54856.2 12789 755000 742211
##    skew kurtosis      se
## X1 1.74      5.1 1475.84

Sale prices are right-skewed (skew = 1.74), with the mean (180,796) well above the median (160,000). This is expected, as few people can afford to buy expensive houses.
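
As a rough illustration of what the skew statistic measures, the sketch below computes a moment-based sample skewness in base R. The data are simulated log-normal values, not the Ames prices, and `sample_skewness` is a hypothetical helper; the formula used here (third central moment divided by the 1.5 power of the second) is one common definition and may differ slightly from the variant that describe() reports.

```r
# Moment-based sample skewness: m3 / m2^(3/2).
# sample_skewness is a hypothetical helper, not part of any package.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)  # second central moment
  m3 <- mean(centered^3)  # third central moment
  m3 / m2^1.5
}

set.seed(1)
# Simulated right-skewed "prices" (assumption: log-normal, for illustration only)
fake_prices <- rlnorm(2930, meanlog = 12, sdlog = 0.4)

skew_raw <- sample_skewness(fake_prices)
mean_above_median <- mean(fake_prices) > median(fake_prices)

print(skew_raw)           # positive for right-skewed data
print(mean_above_median)  # the long right tail pulls the mean above the median
```

A positive skewness together with a mean above the median is the same pattern seen in the SalePrice summary.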

Correlations with SalePrice

library(corrplot)

# Indices of the numeric columns
numericVars <- which(sapply(data_set, is.numeric))
numericVarNames <- names(numericVars)
all_numVar <- data_set[, numericVars]
# Correlations of all numeric variables
cor_numVar <- cor(all_numVar, use = "pairwise.complete.obs")
# Sort on decreasing correlation with SalePrice
cor_sorted <- as.matrix(sort(cor_numVar[, "SalePrice"], decreasing = TRUE))
# Keep only variables whose absolute correlation with SalePrice exceeds 0.5
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x) > 0.5)))
cor_numVar <- cor_numVar[CorHigh, CorHigh]
corrplot.mixed(cor_numVar, tl.col = "black", tl.pos = "lt")

From the visualization above, the numeric variable most highly correlated with SalePrice is Gr.Liv.Area.

Above grade (ground) living area square feet

Gr.Liv.Area is the continuous variable with the highest correlation with SalePrice. The scatterplot below shows a roughly linear relationship between the two.

ggplot(data=data_set, aes(x=Gr.Liv.Area, y=SalePrice))+
geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)

describe(data_set$Gr.Liv.Area)
##    vars    n    mean     sd median trimmed    mad min  max range skew
## X1    1 2930 1499.69 505.51   1442 1452.25 461.09 334 5642  5308 1.27
##    kurtosis   se
## X1     4.12 9.34

Overall quality

Overall quality, a categorical variable, is also included in the model. As mentioned before, it rates the overall material and finish of the house on a scale from 1 (very poor) to 10 (very excellent).

ggplot(data=data_set, mapping = aes(x=factor(Overall.Qual), y=SalePrice))+
geom_boxplot(col='blue') + labs(x='Overall Quality') +
scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)

Part 4 - Inference

Multiple Linear Regression

multiple_reg <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual, data = data_set)
summary(multiple_reg)
## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Overall.Qual, data = data_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -443413  -19743     182   18346  228183 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -172.24   18783.62  -0.009 0.992685    
## Gr.Liv.Area        54.77       1.69  32.408  < 2e-16 ***
## Overall.Qual2   16234.61   21414.04   0.758 0.448435    
## Overall.Qual3   25475.71   19638.73   1.297 0.194659    
## Overall.Qual4   43475.04   18893.03   2.301 0.021455 *  
## Overall.Qual5   65979.88   18778.45   3.514 0.000449 ***
## Overall.Qual6   82762.85   18797.81   4.403 1.11e-05 ***
## Overall.Qual7  113605.35   18831.17   6.033 1.81e-09 ***
## Overall.Qual8  167934.86   18903.89   8.884  < 2e-16 ***
## Overall.Qual9  254166.12   19176.31  13.254  < 2e-16 ***
## Overall.Qual10 294564.71   20165.97  14.607  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37450 on 2919 degrees of freedom
## Multiple R-squared:  0.781,  Adjusted R-squared:  0.7803 
## F-statistic:  1041 on 10 and 2919 DF,  p-value: < 2.2e-16
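
To make the coefficient table concrete, the sketch below turns the fitted equation into a prediction by hand. The coefficients are copied (rounded) from the output above; the house itself (1,500 square feet above ground, overall quality 6) is a hypothetical example chosen for illustration.

```r
# Coefficients copied (rounded) from the summary(multiple_reg) output above.
# Overall.Qual level 1 is the baseline, so it has no coefficient of its own.
coefs <- c(
  "(Intercept)"   = -172.24,
  "Gr.Liv.Area"   = 54.77,
  "Overall.Qual6" = 82762.85
)

# Hypothetical house: 1,500 sq ft above ground, overall quality 6.
living_area <- 1500

predicted_price <- coefs[["(Intercept)"]] +
  coefs[["Gr.Liv.Area"]] * living_area +
  coefs[["Overall.Qual6"]]

print(predicted_price)
```

The result is about 164,746. With the fitted object available, predict() called with a matching newdata data frame should give the same figure up to rounding of the coefficients.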

P-values and parameter estimates should only be trusted if the conditions for the regression are reasonable.

hist(multiple_reg$residuals)

qqnorm(multiple_reg$residuals)
qqline(multiple_reg$residuals)

plot(multiple_reg$residuals ~ data_set$Gr.Liv.Area)
abline(h = 0)

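Nearly normal residuals: The histogram and Q-Q plot follow the normal model closely for most observations, although the regression summary shows a few extreme residuals (minimum -443,413 against a residual standard error of 37,450).
Constant variability: The residuals scatter around 0 with roughly constant variability across Gr.Liv.Area.
Independent observations: We assume this condition is met, since the sales in the data set represent less than 10% of the population.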
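
The constant variability condition can also be checked numerically rather than by eye. The sketch below is one possible approach, run on simulated data rather than the Ames data: it fits a simple regression, splits the observations into three equal-sized groups by the predictor, and compares the residual standard deviation in each group. The variable names and the simulated relationship are assumptions made for illustration.

```r
set.seed(42)
# Simulated stand-ins for living area and price (assumption: not the Ames data)
n <- 500
area <- runif(n, min = 500, max = 3000)
price <- 20000 + 55 * area + rnorm(n, sd = 30000)

fit <- lm(price ~ area)

# Split observations into three equal-sized groups by area
area_group <- cut(area,
                  breaks = quantile(area, probs = c(0, 1/3, 2/3, 1)),
                  include.lowest = TRUE,
                  labels = c("small", "medium", "large"))

# Residual standard deviation within each group; similar values are
# consistent with constant variability
resid_sd_by_group <- tapply(residuals(fit), area_group, sd)
print(round(resid_sd_by_group))
```

If the spread in one group were several times larger than in another, the constant variability condition would be in doubt.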

Confidence interval

The 95% confidence intervals for the model coefficients are:

confint(multiple_reg, level = 0.95)
##                      2.5 %       97.5 %
## (Intercept)    -37002.7360  36658.26327
## Gr.Liv.Area        51.4576     58.08536
## Overall.Qual2  -25753.5362  58222.76195
## Overall.Qual3  -13031.4547  63982.88297
## Overall.Qual4    6430.0166  80520.06680
## Overall.Qual5   29159.5176 102800.23772
## Overall.Qual6   45904.5355 119621.15786
## Overall.Qual7   76681.6251 150529.07081
## Overall.Qual8  130868.5584 205001.16948
## Overall.Qual9  216565.6560 291766.58704
## Overall.Qual10 255023.7281 334105.68391
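
Each interval above is the estimate plus or minus a critical t value times the standard error. As a sketch, the Gr.Liv.Area row can be reproduced from the rounded estimate and standard error in the regression summary; because those inputs are rounded, the result matches confint() only to within about 0.01.

```r
# Estimate and standard error for Gr.Liv.Area, copied (rounded) from the
# regression summary; 2919 is the residual degrees of freedom.
estimate  <- 54.77
std_error <- 1.69
df_resid  <- 2919

# Critical t value for a two-sided 95% interval
t_star <- qt(0.975, df = df_resid)

ci <- c(lower = estimate - t_star * std_error,
        upper = estimate + t_star * std_error)
print(round(ci, 2))
```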

Part 5 - Conclusion

  • The interpretation of the coefficients in multiple regression is slightly different from that of simple regression. The estimate for Gr.Liv.Area means that, holding overall quality constant, each additional square foot of above ground living area is associated with a sale price about $54.77 higher (95% confidence interval: $51.46 to $58.09).
  • The research question provided insight into which key factors could drive the price of a property on sale.
  • Since this is an observational study, the analysis shows association rather than causation: sale price is strongly associated with above ground living area and overall quality (adjusted R-squared of 0.78).
  • Fitting a model with Lasso regression is a natural next step for this research question.

References

De Cock, D. (2011). Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education, Volume 19, Number 3.