Are above ground living area (in square feet) and overall finish of the house predictive of its sale price?
If you are in the market to buy or sell a property, what factors drive the sale price? This question interests many buyers and sellers. In this research project, I explore data from Ames, Iowa to see which factors drive sale price.
The data was collected directly from the Ames City Assessor’s Office in the form of a data dump from their records system. The initial data set contained 113 variables describing 3970 property sales that had occurred in Ames, Iowa between 2006 and 2010. The variables were a mix of nominal, ordinal, continuous, and discrete variables used in calculation of assessed values and included physical property measurements in addition to computation variables used in the city’s assessment process. Some of the variables were removed as they were related to weighting and adjustment factors used in the city’s current modeling system. They required special knowledge or previous calculation for their use.
After removal of these extraneous variables, 80 variables remained that were directly related to property sales. Most of the variables are exactly the type of information that a typical home buyer would want to know about a potential property (e.g. When was it built? How big is the lot? How many square feet of living space is in the dwelling? Is the basement finished? How many bathrooms are there?).
The refined data set contains 2930 house sales with the property details. The two main features that can be directly related to the sale price of a property are above ground living area and overall quality of the house. Sale price and above ground living area are continuous variables. Overall quality is a categorical variable with 10 levels, ranging from 1 (very poor) to 10 (very excellent). Sale price is the response variable and the other two are explanatory variables.
Since this is an observational study, the results can only show association, not causation, between the explanatory variables and the response variable. The data set contains less than 10% of total sales, so the results can be generalized to the housing market in Iowa only to the extent that sales in Ames are representative of the state.
dim(data_set)
## [1] 2930 81
str(data_set[,c(1:10, 81)])
## Classes 'tbl_df', 'tbl' and 'data.frame': 2930 obs. of 11 variables:
## $ Gr.Liv.Area : int 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
## $ Overall.Qual : Factor w/ 10 levels "1","2","3","4",..: 6 5 6 7 5 6 8 8 8 7 ...
## $ SalePrice : int 215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
## $ PID : int 526301100 526350040 526351010 526353030 527105010 527105030 527127150 527145080 527146030 527162130 ...
## $ MS.SubClass : int 20 20 20 20 60 60 120 120 120 60 ...
## $ MS.Zoning : Factor w/ 7 levels "A (agr)","C (all)",..: 6 5 6 6 6 6 6 6 6 6 ...
## $ Lot.Frontage : int 141 80 81 93 74 78 41 43 39 60 ...
## $ Lot.Area : int 31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
## $ Sale.Condition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
library(ggplot2)  # plotting
library(scales)   # provides comma() for axis labels
ggplot(data = data_set, mapping = aes(x = SalePrice)) +
  geom_histogram(fill = "blue", binwidth = 10000) +
  scale_x_continuous(breaks = seq(0, 800000, by = 100000), labels = comma)
# normal Q-Q plot of sale price with reference line
qqnorm(data_set$SalePrice)
qqline(data_set$SalePrice)
library(psych)  # provides describe()
describe(data_set$SalePrice)
## vars n mean sd median trimmed mad min max range
## X1 1 2930 180796.1 79886.69 160000 170429.1 54856.2 12789 755000 742211
## skew kurtosis se
## X1 1.74 5.1 1475.84
Sale prices are right-skewed (skew 1.74, with the mean of 180796.1 above the median of 160000). This is expected, as few people can afford to buy expensive houses.
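The link between right skew, a positive skewness value, and a mean above the median can be illustrated with a small sketch. It uses simulated log-normal values rather than the Ames data, and the helper `skewness()` is a hypothetical moment-based definition written for this example; it is not necessarily the exact formula that `describe()` applies.

```r
# Simulated right-skewed "prices" (hypothetical values, not the Ames data)
set.seed(1)
prices <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5 (helper defined for this sketch)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

skewness(prices)               # positive for a right-skewed sample
mean(prices) > median(prices)  # the long right tail pulls the mean above the median
```

A symmetric sample such as `c(1, 2, 3)` gives a skewness of zero under this definition.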
library(corrplot)  # provides corrplot.mixed()
numericVars <- which(sapply(data_set, is.numeric))
numericVarNames <- names(numericVars)
all_numVar <- data_set[, numericVars]
# correlations of all numeric variables
cor_numVar <- cor(all_numVar, use = "pairwise.complete.obs")
# sort on decreasing correlation with SalePrice
cor_sorted <- as.matrix(sort(cor_numVar[, "SalePrice"], decreasing = TRUE))
# keep only variables whose absolute correlation with SalePrice exceeds 0.5
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x) > 0.5)))
cor_numVar <- cor_numVar[CorHigh, CorHigh]
corrplot.mixed(cor_numVar, tl.col = "black", tl.pos = "lt")
From the visualization above, Gr.Liv.Area is highly correlated with SalePrice.
Gr.Liv.Area is the continuous variable with the highest correlation to SalePrice. The scatter plot of the two variables shows a roughly linear relationship.
ggplot(data = data_set, aes(x = Gr.Liv.Area, y = SalePrice)) +
  geom_point(col = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "black", aes(group = 1)) +
  scale_y_continuous(breaks = seq(0, 800000, by = 100000), labels = comma)
describe(data_set$Gr.Liv.Area)
## vars n mean sd median trimmed mad min max range skew
## X1 1 2930 1499.69 505.51 1442 1452.25 461.09 334 5642 5308 1.27
## kurtosis se
## X1 4.12 9.34
Overall quality, a categorical variable, is also included in the model. As mentioned before, it rates the overall material and finish of the house on a scale from 1 (very poor) to 10 (very excellent).
ggplot(data = data_set, mapping = aes(x = factor(Overall.Qual), y = SalePrice)) +
  geom_boxplot(col = "blue") + labs(x = "Overall Quality") +
  scale_y_continuous(breaks = seq(0, 800000, by = 100000), labels = comma)
multiple_reg <- lm(SalePrice ~ Gr.Liv.Area + Overall.Qual, data = data_set)
summary(multiple_reg)
##
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Overall.Qual, data = data_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -443413 -19743 182 18346 228183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -172.24 18783.62 -0.009 0.992685
## Gr.Liv.Area 54.77 1.69 32.408 < 2e-16 ***
## Overall.Qual2 16234.61 21414.04 0.758 0.448435
## Overall.Qual3 25475.71 19638.73 1.297 0.194659
## Overall.Qual4 43475.04 18893.03 2.301 0.021455 *
## Overall.Qual5 65979.88 18778.45 3.514 0.000449 ***
## Overall.Qual6 82762.85 18797.81 4.403 1.11e-05 ***
## Overall.Qual7 113605.35 18831.17 6.033 1.81e-09 ***
## Overall.Qual8 167934.86 18903.89 8.884 < 2e-16 ***
## Overall.Qual9 254166.12 19176.31 13.254 < 2e-16 ***
## Overall.Qual10 294564.71 20165.97 14.607 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37450 on 2919 degrees of freedom
## Multiple R-squared: 0.781, Adjusted R-squared: 0.7803
## F-statistic: 1041 on 10 and 2919 DF, p-value: < 2.2e-16
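The output above has one row per quality level from 2 to 10 and none for level 1. That pattern is what R's default treatment coding of an unordered factor produces: the first level acts as the baseline, and each remaining coefficient is an offset from it at a fixed living area. The sketch below shows the same mechanism on a toy data frame; the names `toy`, `area`, `qual`, and `price` and all values are made up for illustration.

```r
# Toy data (hypothetical values): a 3-level quality factor plus living area
toy <- data.frame(
  area  = c(1000, 1200, 1500, 1800, 2000, 2400),
  qual  = factor(c("1", "1", "2", "2", "3", "3")),
  price = c(100, 112, 150, 165, 210, 232) * 1000
)

# model.matrix() exposes the indicator columns lm() builds for a factor;
# the baseline level "1" gets no column of its own
X <- model.matrix(price ~ area + qual, data = toy)
colnames(X)

fit <- lm(price ~ area + qual, data = toy)

# Two houses with the same area that differ only in quality level (1 vs 2)
new_houses <- data.frame(
  area = c(1500, 1500),
  qual = factor(c("1", "2"), levels = levels(toy$qual))
)

# The gap between the two predictions equals the qual2 coefficient
diff(predict(fit, new_houses))
coef(fit)["qual2"]
```

Read the same way, the Overall.Qual8 estimate above is the expected price gap between a level 8 and a level 1 house with the same living area.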
P-values and parameter estimates should only be trusted if the conditions for the regression are reasonable.
hist(multiple_reg$residuals)
qqnorm(multiple_reg$residuals)
qqline(multiple_reg$residuals)
plot(multiple_reg$residuals ~ data_set$Gr.Liv.Area)
abline(h = 0)
Nearly normal residuals: The residuals closely follow the normal model in the Q-Q plot, apart from a few extreme values (the largest negative residual is -443413).
Constant variability: The residuals scatter around 0 with roughly constant variability across Gr.Liv.Area.
Independent observations: We assume this condition is met because the sample represents less than 10% of the population.
The 95% confidence intervals for the model coefficients are:
confint(multiple_reg, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -37002.7360 36658.26327
## Gr.Liv.Area 51.4576 58.08536
## Overall.Qual2 -25753.5362 58222.76195
## Overall.Qual3 -13031.4547 63982.88297
## Overall.Qual4 6430.0166 80520.06680
## Overall.Qual5 29159.5176 102800.23772
## Overall.Qual6 45904.5355 119621.15786
## Overall.Qual7 76681.6251 150529.07081
## Overall.Qual8 130868.5584 205001.16948
## Overall.Qual9 216565.6560 291766.58704
## Overall.Qual10 255023.7281 334105.68391
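As a rough check on how to read the Gr.Liv.Area row, the sketch below rebuilds that interval from the rounded estimate and standard error in the regression summary, then rescales it to a more tangible change in area. The numbers are copied from the output above; `extra_sqft` is a hypothetical value chosen for illustration.

```r
# Values copied from the regression summary and confint() output above
slope    <- 54.77                              # estimate for Gr.Liv.Area
se_slope <- 1.69                               # its standard error
ci_slope <- c(lower = 51.4576, upper = 58.08536)
df_resid <- 2919                               # residual degrees of freedom

# estimate +/- t* x SE; roughly 51.46 and 58.08, matching confint()
# up to the rounding of the inputs
rebuilt <- slope + c(-1, 1) * qt(0.975, df = df_resid) * se_slope
rebuilt

# Expected price difference for a hypothetical 100 extra square feet,
# holding overall quality fixed
extra_sqft <- 100
slope * extra_sqft      # 5477
ci_slope * extra_sqft   # 5145.76 to 5808.536
```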
The Gr.Liv.Area coefficient (54.77) is the expected difference in sale price, in dollars, for each additional square foot of above ground living area, holding overall quality constant. Its 95% confidence interval runs from 51.46 to 58.09.
Reference: Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education, Volume 19, Number 3 (2011).