Do some cleaning

# Cleaning the file - Zillow_DF
zillow_df <- res_all %>%
  mutate(pricelisted = as.numeric(str_replace_all(pricelisted,"[^0-9]*",""))) %>%
  mutate(squarefoot = as.numeric(str_replace(squarefoot,",","")))
## Warning: NAs introduced by coercion
# create new dataset without missing data
zillow_df <- na.omit(zillow_df)

Visualisation

Now that data frame is ready we can see what it looks like. After cleaning the data, we have 308 observations. Below shows the how my dataframe looks like for the last few observations.

# 1) Number of Observations :
nrow(zillow_df)
## [1] 308
zillow_df %>% select(1:5) %>% tail()
## # A tibble: 6 x 5
##   address                              pricelisted bedrooms bathrooms squarefoot
##   <chr>                                      <dbl>    <dbl>     <dbl>      <dbl>
## 1 435 Brookwood Dr, Athens, GA 30605        240229        4         3       2039
## 2 125 Putters Dr, Athens, GA 30607          262504        3         3       2569
## 3 170 Cook Cir, Athens, GA 30601            122791        3         1        925
## 4 455 Sandstone Dr, Athens, GA 30605        269410        4         3       2464
## 5 102 Kara Dr, Athens, GA 30606             233593        4         3       2548
## 6 107 Whitehead Ter E, Athens, GA 306~      162982        3         2       1358
attach(zillow_df)  ## For easy access

Visual: Detecting Outliers in Price Of Housing

  • Below boxplot shows how the house prices are distributed and detects the outliers in the distribution of the prices.
  • The boxplot shows where the median price lies and how there are observations beyond the third quartile of the distribution.
  • In actual research such outliers should be analysed and corrected for the outliers.
## Outlier in Price

boxplot(pricelisted, main="Boxplot of House Prices", 
        ylab = "House Price Listed ($)", yaxt = "n", 
        col = "gold", medcol = "red", boxlty = 0, axes=FALSE,
        whisklty = 1,  staplelwd = 4, outpch = 8, outcex = 1)
axis(2, at = seq(0, max(pricelisted), 70000), las = 2, cex.axis=0.5)

Visual: Plot between Price and Square Foot for Correlational Exposition.

  • Below plot explores the relationship between price and square foot of the house.
## Relationship between Price and Squarefoot. 
plot(pricelisted ~ squarefoot, 
     pch=15,  col = rgb((1:7)/7, 0, 0)[as.factor(bedrooms)],
     yaxt = "n", xaxt = "n", axes = FALSE, main="", 
     ylab="House Prices Listed ($)",
     xlab = "Square Foot Of Houses")
axis(2, at = seq(0, max(pricelisted), 125000), las = 2, cex.axis=0.5)
axis(1,at = seq(300, max(squarefoot), 500), las=2, cex.axis = 0.6 )
legend("bottomright", legend = paste("Bedrooms", 1:7), 
            col = rgb((1:7)/7, 0, 0), pch = 15)
title("Here is a simple plot of Square footage vs. price \ngrouped by number of bedrooms", cex.main = 1,   font.main= 2, col.main=rgb(1, 0, 0) ) 

  • We see a positive relation between price of housings and square foot of the house. So as the size of the house increases price increases, too.

Regression analysis: OLS

olsmodel <- lm(pricelisted ~ bedrooms + bathrooms + squarefoot)
summary(olsmodel)
## 
## Call:
## lm(formula = pricelisted ~ bedrooms + bathrooms + squarefoot)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -478936  -86809  -26342   46713  833992 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38741.6    31083.0   1.246 0.213580    
## bedrooms    -41459.3    12676.9  -3.270 0.001197 ** 
## bathrooms    45238.8    11982.8   3.775 0.000192 ***
## squarefoot     147.4       12.6  11.699  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 156700 on 304 degrees of freedom
## Multiple R-squared:  0.6486, Adjusted R-squared:  0.6451 
## F-statistic:   187 on 3 and 304 DF,  p-value: < 2.2e-16
olsmodel$coefficients
## (Intercept)    bedrooms   bathrooms  squarefoot 
##  38741.6281 -41459.2630  45238.7803    147.4233

On average, what is the value of having an additional bedroom in a home?

In the above model results, we see that having an additional bedroom in a home decreases the price of a house by -4.145926310^{4} dollars, ceteris paribus. It is statistically significant at 1% level of significance.

Comment : Since bedrooms is a categorical variable treating it as continuous is not capturing the non-linearity in the bedroom capacity and price of houses.

On average, what is the value of having an additional bathroom in a home?

In the above model results, we see that having an additional bathroom in a home increases the price of a house by 4.52387810^{4} dollars, ceteris paribus. It is statistically significant at 1% level of significance.

Comment : Since bathroom is a categorical variable treating it as continuous is not capturing the non-linearity in the bathroom capacity and price of houses.

  • To capture further non-linearitry it should be interesting to interact the bedroom and bathroom variable and examine its effect on prices as both factors are simultaneously examined to make a decision.

Squarefoot and Price

The correlational relation observed in the above plot is found to be evident in the OLS results above. In the above model results, we see that increasing the squarefoot by 1 unit increases the price of a house by 147.4233309 , ceteris paribus. It is statistically significant at 1% level of significance.

OLS Analysis using Log-Transformation

In this section we take the log of prices and log of squarefoot to control for extreme distribution of the variables. We find the same relation as above in the un-transformed variable. Here the coefficient against “log(squarefoot)” is the price elasticity of increasing squarefoot by 1 unit. All variables are statistically significant at 1% level of significance.

## In Log -form to capture coefficient as elasticity 
olslogmodel <- lm(log(pricelisted) ~ bedrooms + bathrooms + log(squarefoot))
summary(olslogmodel)
## 
## Call:
## lm(formula = log(pricelisted) ~ bedrooms + bathrooms + log(squarefoot))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5364 -0.2361 -0.0574  0.2107  1.2888 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      6.23237    0.52620  11.844  < 2e-16 ***
## bedrooms        -0.10228    0.03426  -2.985  0.00307 ** 
## bathrooms        0.14672    0.02921   5.022 8.72e-07 ***
## log(squarefoot)  0.82703    0.08238  10.039  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3999 on 304 degrees of freedom
## Multiple R-squared:  0.6279, Adjusted R-squared:  0.6242 
## F-statistic:   171 on 3 and 304 DF,  p-value: < 2.2e-16
olslogmodel$coefficients
##     (Intercept)        bedrooms       bathrooms log(squarefoot) 
##       6.2323660      -0.1022793       0.1467194       0.8270350

Goodness of Fit :

Find the r squared using the summary(olsmodel) command. How good is the fit?

  • We see the R-squared is 0.6486034 for linear model and 0.6278817 for log-transformed model. It tells us that the about 64.8603398 % of the variation in prices is explained by the independent variables.

What elements should be considered in this model to increase the fit?

To improve the fit, a couple of things could be done:

  • Distance To the University of Georgia campus. This can be done by extracting the latitudes and longitudes of each home and calculate the distance to UGA all form Zillow html source code. The hypothesis is that houses closer to UGA would be valued higher.

  • How long the house has been listed? Another feature could be to gather the information on how long a house has been on the market on Zillow. The hypothesis is that longer the listing listed on Zillow the lower the price as that could indicate some trouble selling the house.

  • Longitudinal Analysis: The model can be better improved by constructing a panel data with time varying causal variable like pollution particulate count in each county/zipcode/district/census tract, etc. Panel data will help to control for unobserved heterogeneity. the pollution data can be extracted from EPA’s AQS (Air Quality System) database

  • Fancier: Image Analysis: extract from the images posted on Zillow whether the house shows a garden, porch, trees, flowers, etc.

Bonus Task #1: General Web Scraping Code.

BONUS Task #2: Extra Hedonic Pricing Regression

## This extracts the numeric terms from days on Zillow comments.
zillow_df$daysonZillow <- as.numeric(gsub("([0-9]+).*$", "\\1", zillow_df$zillowcomments))
## Warning: NAs introduced by coercion
##
zillow_df.new <- na.omit(zillow_df)
## We have now 222 houses that have information on how many days on Zillow

attach(zillow_df.new)
## The following objects are masked from zillow_df:
## 
##     address, bathrooms, bedrooms, pricelisted, squarefoot, typeofsale,
##     zillow_link, zillowcomments
##Running OLS With this new variable and the new Zillow dataset.

olsmodel.new <- lm(pricelisted ~ bedrooms + bathrooms + squarefoot+ daysonZillow)
summary(olsmodel.new)
## 
## Call:
## lm(formula = pricelisted ~ bedrooms + bathrooms + squarefoot + 
##     daysonZillow)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -392605  -88805  -37775   34035  817543 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   18170.87   37621.34   0.483  0.62958    
## bedrooms     -25534.82   15849.12  -1.611  0.10859    
## bathrooms     49015.90   15625.87   3.137  0.00194 ** 
## squarefoot      119.83      16.94   7.072 2.03e-11 ***
## daysonZillow    237.53     102.04   2.328  0.02083 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 161400 on 219 degrees of freedom
## Multiple R-squared:  0.5605, Adjusted R-squared:  0.5525 
## F-statistic: 69.83 on 4 and 219 DF,  p-value: < 2.2e-16
olsmodel.new$coefficients
##  (Intercept)     bedrooms    bathrooms   squarefoot daysonZillow 
##   18170.8734  -25534.8181   49015.9009     119.8271     237.5253

Here I am sing the number of days house was listed on Zillow. We have 224 Houses that have information on “number of days on Zillow”.

COMMENTS:

References