Exploratory Data Analysis of Housing Prices in Ames, Iowa

by Ronak Vimadalal

Introduction

Housing prices are an important factor in decision making at all levels: from the recently graduated student to policy-makers, all are concerned by this extremely volatile index. This analysis is a first step in predicting the housing prices in Ames, IA through advanced regression techniques.

Overview of the Dataset

## [1] 1460   81

It can be seen that we have data for 1460 house sales spanning around 80 variables. An overview of the variables can be seen below:

## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
##  $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 16 12 22 15 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
##  $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
##  $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
##  $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
##  $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
##  $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
##  $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
##  $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : Factor w/ 9 levels "COD","CWD","Con",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

Univariate Plots Section

As a first step, I plot a histogram of housing prices. The expected shape is a bell curve of a normal distribution.

As predicted, the data for the housing prices takes the shape of a bell curve.

Taking a step further, I decided to see how the time of purchase and the state of the house when it was sold impacted its selling price.

The above charts reveal certain surprises. In 2010, the number of houses sold dropped to less than half of the number in 2009. Otherwise, all the other years have a fairly similar number of houses sold. This could be due to the foreclosure crisis in 2010 when people put off selling their houses due to houses being forclosured in ther neighborhood leading to a reduction in prices.

Surprisingly, this trend does not hold in the houses sold per month. May, June and July have much higher house sales than the other months and the months of December-January have no house sales. This can be explained by a survey conducted by Zillow , an online real estate database company, which attributes high sales in middle of the year to “buyer-desparation” since buyers jump at every offer during this period due to a dearth in the houses being put up for sale early in the year.

Most of the houses sold have an overwhelming condition of 5. Houses sold below this quality are rare, since that would mean that the owner would rebuild or conduct extensive repairs to the house. Houses sold above 5 are greater than those above 5 but are relatively low compared to houses of condition 5. This would mean that for a house to be sold it has to have a minimum acceptable condition of 5.

It could be insightful to explore the age of the houses sold. As seen in the plot below, a vast majority of houses were built between 1950 and 2010. A major dip is seen around the 1980’s, although it can be attributed to lesser homes being built during the 80’s due to the housing market crash in that period. There are some homes built more than centuries ago. It would be interesting to plot these versus the price and quality in the next section.

Following this, I wanted to explore which neighbourhoods have the most number of houses sold.

North Ames and Old Town have the most houses sold. It should be noted that certain areas do overlap, since the exact coordinates of each purchase are not known. The locations are plotted by querying the name of the neighborhood to the Google Maps API.

Exploratory Univariate Analysis Conclusion

There are 1460 house sale transactions, each with 80 attributes. The month of sale of the house has shown an unusual but explainanble pattern which would need to be incorporated in the final model. The main feautures of the dataset seem to be the time of sale of the house, the neighborhood and the Sale Condition. According to J. Zietz, E.N.Zietz. and Sirmans certain factors have a higher coefficient of regression as the prices increase. This could be expected for ‘luxury’ factors such as Pool Area and Garage Area. This will be explored in detail in the following factors.

Investigating how different factors play diferent roles at difeerent prices will be the key to this EDA.

Bivariate Plots Section

##     0%    25%    50%    75%   100% 
##  34900 129975 163000 214000 755000

As discussed in the earlier section, I decide to explore the impact of a few factors w.r.t SalePrice. I plot a few boxplots to see the variation of the factors with price. Some results are in accordance with what is expected. There is a large spread of prices on houses that are on level ground since level ground is the most suitable to construct upon and offers sturdy foundations, especially for homes of larger sizes. Thus, the pricier homes are almost exclusively found on level terrain. A similar explanation holds for paved and gravelled streets, with the former being vastly preferred.

In the following section, I want to look at variables that would have a higher regression coefficient at higher prices.

## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

It can be seen that having an excellent pool significantly boosts the median price of the house. Most of the houses have a garage size from 250 sq.ft. to 1000sq.ft. It can be explained that there is an upper limit on the size of a car, no matter how expensive it is, so Garage Area is limited to a certain range. This is also evident from a large majority of higher-priced homes having space in their garages for just 2 cars. In order to gain further insight, I deicde to conduct a Spearman’s correlation test:

## 
##  Spearman's rank correlation rho
## 
## data:  prices$GarageArea and prices$SalePrice
## S = 181860000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.6493785

It is fairly correlated to the price, but I would predict that it would impact it only to a certain extent. This will be discussed further in the next section.

A clear preference for a paved drive emerges, especially amongst the higher-end homeowners.

Thereafter, I consider a highly important factor: the Neighborhood. It is seen that Northridge, Northridge Heights and Stony Brook are the most expensive neighborhoods. Northridge Heights has a signficantly large range in the sale price of the houses.

ggplot(data=prices, aes(x=Neighborhood, y=(SalePrice/1000), fill=Neighborhood))+
  geom_boxplot()+
  scale_y_continuous(name="Sale Price")+
  theme(axis.text.x  = element_text(angle=90, vjust=0.5))

In the plot below, I delve into the kind of sale that is the most prevalent. Warranty Deed(WD) sales are extremely popular at all prices, followed by new homes, which have a higher price as is intuitive.

Exploratory Bivariate Analysis Conclusion

Some factors such as level terrain and paved street are highly desired by buyers of all classes. Thus not having these in a home would highly reduce its price. Some features such as a pool and a garage that could occupy 2 cars would positively impact the selling price. The elitest properties are concentrated in a select group of localities, namely Northridge(including Northridge Heights) and Stony Brooks. I also realised that the lot area is not highly correlated to the sale price.

## Warning in cor.test.default(prices$LotArea, prices$SalePrice, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  prices$LotArea and prices$SalePrice
## S = 281930000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4564606

Multivariate Plots Section

Condition of the House and Neighborhood

In the plot below, I explore the condition of the houses by neighborhood and its impact on prices. It can be seen that most of the houses in the posh neighborhoods also lie in the category of ‘barely acceptable’ i.e. rated a 5/10. It can be presumed that buyers of such homes are looking for redevelopment of their properties or the house is merely for investment.

To delve deeprer I decide to do a correlation test between condition and sale price.

## 
##  Spearman's rank correlation rho
## 
## data:  prices$OverallCond and prices$SalePrice
## S = 585770000, p-value = 7.119e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.1293249

This correlation coefficient points towards the lack of an significant correaltion I can conclude that the condition of the house is a benchmark rating. An increase beyond 5 does not merit an increase in sale Price but lower than 5 would cause a significant drop in the house price.

Garage Area

In the plot above, it is visible that uptill 1000 sq.ft the sale price is proportional to the size of the garage to occupy about 4 cars. It seems that the celine is 4 cars per house. To verify this I plot the same plot as above, but replace the number of cars with the number of bedrooms.

It is inferred that a house of with 6-7 bedrooms needs around 4 cars to park. Since houses with bedrooms greater than this are rare, parking spots accomodating further than the aforesaid requirement do not correlate with a higher proce increase. Below 1000 sq.ft. the garage are has a correlation of around 0.64 with the price of the house.

## 
##  Spearman's rank correlation rho
## 
## data:  a$GarageArea and a$SalePrice
## S = 177960000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.6461048

Pool Size and Quality

A pool area of 550sq. ft. seems to be the most preferred size for the pool. Having an excellent pool would result in a much higher price that a pool of the same size but maintained worse. Increasing pool size does not have the outcome of increasing the price of the house.

Price Per Sq.Ft of Living Area and Building Type

In order to gain deeper understanding of the prices of the house, the metric Sale Price per sq.ft. of living area is used. Northridge Height has the highest priced homes. The most popular kind of homes are 1 family homes. This further explains the apparent limit of 4 cars per household, even in the most expensive houses.

Conclusion

In this analysis a few variables are explored from the massive data set provided by Kaggle. I define a few variables to be as ‘benchmark’ variables wherein attainement of the same would result in much higher prices, but further increase from a specified level does not reciprocate an increase in the sale price. Pool size and condition of the house are two such variables. Garage size is shown to have an effect on the house price till the size reaches 1000 sq.ft. following which the curve smoothens out to show no changes in the price. The location of the house also impacts its price with certain localities having a much higher ranges of the house sold.Surprisingly, the house is much likely to be sold in May or June than in December or January. As a seller, having a Warranty Deed would make it easiest to sell a house.

In a regression model, factors such as the existence of a pool, moderate terrain and acceptable conditions along with availibility of public utilities would have the ‘benchmark’ effect on the price. The location of the house would also play a role in predicting the price of the house. Till a certain degree the size of a garage would warrant an increase in the price of the house but after that its effect diminishes.

Simple Regression

From our data we create a simple regressional model

## 
## Call:
## lm(formula = prices$SalePrice ~ prices$GarageArea + prices$LotArea + 
##     prices$PoolArea, data = prices)
## 
## Coefficients:
##       (Intercept)  prices$GarageArea     prices$LotArea  
##         63613.242            220.357              1.221  
##   prices$PoolArea  
##            87.561

Y=63613.242+220.357 x GarageArea + 1.221 x LotArea + 87.561 x PoolArea

References

Zietz, J. “Determinants of House Prices: A Quantile Regression Approach.” The Journal of Real Estate Finance and Economics. Accessed August 12, 2017.

Olick, Diana. “The best month ??? and day ??? to list your home.” CNBC. March 23, 2016. Accessed August 12, 2017. https://www.cnbc.com/2016/03/22/the-best-month-and-dayto-list-your-home.html.

“Economics of Housing Externalities.” Accessed August 12, 2017. https://www.bing.com/cr?IG=375595E8C9414FBF8D0C01EC16678E49&CID=1DEF5E8B39F8688F388E545638FE691D&rd=1&h=b-WNg3yVZW6CmbeqQxsLXDecilGp8zj33jN2LwhiwxI&v=1&r=https%3a%2f%2fwww.princeton.edu%2f%7eerossi%2fEHE.pdf&p=DevEx,5393.1.