Run a housing search from Zillow

Zillow is a leading web-based real estate information service in the United States. I collect listing data from Zillow to analyze what determines house prices, using a hedonic pricing model.

I search for single-family houses in Duluth, GA with 1+ bedrooms and any number of bathrooms. The URL of my search is below:

Zillow link

Conduct some data cleaning

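library(dplyr)
library(stringr)
# Pull the numeric fields out of the scraped "details" and "price" text
zillow_df <- zillow_df %>%
  # bedrooms and bathrooms: the digits immediately before "bd" and "ba"
  mutate(bedrooms = as.integer(str_trim(str_extract(details, "[\\d ]*(?=bd)")))) %>%
  mutate(bathrooms = as.integer(str_trim(str_extract(details, "[\\d ]*(?=ba)")))) %>%
  # square footage: extract the number, then drop the thousands separator
  mutate(sqft = str_trim(str_extract(details, "[\\d ,]*(?=sqft)"))) %>%
  mutate(sqft = as.numeric(str_replace(sqft,",",""))) %>%
  # price: keep the digits only
  mutate(price = as.numeric(str_replace_all(price,"[^0-9]*","")))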

zillow_df$type <- type
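
To see what the regular expressions above extract, here is a minimal sketch on a single listing. The `details` and `price` strings are made-up examples of what I assume the scraped text looks like; they are not rows from the actual data.

```r
library(stringr)

# Hypothetical scraped strings (assumed format, not taken from the real data)
details <- "3 bds2 ba1,875 sqft"
price   <- "$234,900"

# Digits (and spaces) immediately before "bd", "ba", and "sqft"
bedrooms  <- as.integer(str_trim(str_extract(details, "[\\d ]*(?=bd)")))
bathrooms <- as.integer(str_trim(str_extract(details, "[\\d ]*(?=ba)")))
sqft_raw  <- str_trim(str_extract(details, "[\\d ,]*(?=sqft)"))
sqft      <- as.numeric(str_replace(sqft_raw, ",", ""))

# Strip everything that is not a digit from the price
price_num <- as.numeric(str_replace_all(price, "[^0-9]*", ""))

print(c(bedrooms = bedrooms, bathrooms = bathrooms, sqft = sqft, price = price_num))
```

If the real strings separated the fields with commas, the sqft pattern would also capture the separator, so the patterns would probably need adjusting for that format.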

head(zillow_df)
##   bedrooms bathrooms sqft  price             type
## 1        3         3 1875 234900   House for sale
## 2        3         4 2792 369000 New construction
## 3        4         3 2394 300000   House for sale
## 4        4         4 3561 447500   House for sale
## 5        4         3 2088 325000   House for sale
## 6        4         3 2397 317000   House for sale
# Create the new dataset without missing values
zillow_df <- na.omit(zillow_df)

nrow(zillow_df) # 229 observations remain; the regressions below use 228 because one listing has no foreclosure value
## [1] 229

Visual analysis

The boxplot below shows the distribution of house prices and flags the outliers. The outliers lie on the high side, above the third quartile of the distribution. I do not remove them for this analysis.

boxplot(zillow_df$price,main="Boxplot of House Prices", 
        ylab = "House Price Listed ($)", yaxt = "n", 
        col = "yellow", medcol = "red", boxlty = 0, axes=FALSE,
        whisklty = 1,  staplelwd = 4, outpch = 8, outcex = 1)
axis(2, at = seq(0, max(zillow_df$price), 50000), las = 2, cex.axis=0.5)
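
As a rough illustration of how the boxplot decides which points are outliers, the sketch below applies the default 1.5 × IQR rule with `boxplot.stats()`. The vector is a small illustrative sample (the six prices printed by `head(zillow_df)` plus the maximum price reported by `summary()`), not the full dataset, so the thresholds differ from those in the plot.

```r
# Small illustrative vector, not the full dataset
prices <- c(234900, 369000, 300000, 447500, 325000, 317000, 4975000)

stats <- boxplot.stats(prices)   # same rule boxplot() uses by default (coef = 1.5)
upper_whisker <- stats$stats[5]  # largest value that is not flagged as an outlier
outliers <- stats$out            # values more than 1.5 * IQR beyond the hinges

print(upper_whisker)
print(outliers)
```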

This scatter plot shows a positive relationship between price and square footage. It also reveals some outliers.

g1<-ggplot(zillow_df, aes(y=price, x=sqft, color=as.factor(bedrooms))) +
  geom_point() +
  labs(y="Price", x="Square Footage",
  title="Scatter plot between price and square footage")

g1 + scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

The boxplot of price by listing type compares prices across types. The majority of listings are of type “House for sale”.

g2<-ggplot(zillow_df, aes(type, price))+
  geom_boxplot(outlier.color="red") +
  labs(x="Type", y="Price")

g2 + scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

I create a dummy variable for foreclosure to examine whether foreclosure status has a negative effect on house prices.

# Start from the listing type, then recode each known type to 0 or 1
zillow_df$foreclosure <- as.character(zillow_df$type)
zillow_df$foreclosure[zillow_df$foreclosure == "House for sale"] <- 0
zillow_df$foreclosure[zillow_df$foreclosure == "For sale by owner"] <- 0
zillow_df$foreclosure[zillow_df$foreclosure == "New construction"] <- 0
zillow_df$foreclosure[zillow_df$foreclosure == "Pre-foreclosure / Auction"] <- 1
# Any type not recoded above is still text here, so as.numeric() turns it into NA
zillow_df$foreclosure <- as.numeric(zillow_df$foreclosure)
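
The same recoding can be sketched more compactly with `ifelse()`. This is an alternative under my own assumptions, not the code used for the results in this post. The `type` vector is hypothetical, and “Lot / Land for sale” stands in for any listing type that the recoding does not handle; such a type ends up as NA, which is presumably where the single NA that `summary()` reports for foreclosure comes from.

```r
# Hypothetical listing types; the last one is not covered by the recoding
type <- c("House for sale", "Pre-foreclosure / Auction",
          "New construction", "Lot / Land for sale")

non_foreclosure <- c("House for sale", "For sale by owner", "New construction")

# 1 for pre-foreclosure listings, 0 for the other known types, NA for anything else
foreclosure <- ifelse(type == "Pre-foreclosure / Auction", 1,
                      ifelse(type %in% non_foreclosure, 0, NA))
print(foreclosure)
```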

summary(zillow_df)
##     bedrooms       bathrooms           sqft           price        
##  Min.   :2.000   Min.   : 1.000   Min.   :  980   Min.   : 109275  
##  1st Qu.:3.000   1st Qu.: 3.000   1st Qu.: 2120   1st Qu.: 281844  
##  Median :4.000   Median : 4.000   Median : 2970   Median : 396990  
##  Mean   :4.288   Mean   : 4.306   Mean   : 3850   Mean   : 596679  
##  3rd Qu.:5.000   3rd Qu.: 5.000   3rd Qu.: 4712   3rd Qu.: 619000  
##  Max.   :7.000   Max.   :13.000   Max.   :13756   Max.   :4975000  
##                                                                    
##      type            foreclosure    
##  Length:229         Min.   :0.0000  
##  Class :character   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000  
##                     Mean   :0.1974  
##                     3rd Qu.:0.0000  
##                     Max.   :1.0000  
##                     NA's   :1

Regression Analysis

Hedonic pricing model: linear-linear model

model <- lm(price ~ sqft+bedrooms+bathrooms+foreclosure, zillow_df)
summary(model)
## 
## Call:
## lm(formula = price ~ sqft + bedrooms + bathrooms + foreclosure, 
##     data = zillow_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -724743 -111759   -3405   84558 2219245 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  108996.65   92394.23   1.180 0.239380    
## sqft            198.82      18.15  10.956  < 2e-16 ***
## bedrooms    -147931.37   28007.65  -5.282 3.03e-07 ***
## bathrooms     84429.30   24228.76   3.485 0.000593 ***
## foreclosure  -36327.65   49358.60  -0.736 0.462508    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 283400 on 223 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7939, Adjusted R-squared:  0.7902 
## F-statistic: 214.8 on 4 and 223 DF,  p-value: < 2.2e-16
model$coefficients
##  (Intercept)         sqft     bedrooms    bathrooms  foreclosure 
##  108996.6488     198.8229 -147931.3682   84429.3005  -36327.6525

R-squared is 0.794, which means that 79.4% of the variation in house prices is explained by the independent variables in the model.

“sqft”, “bedrooms”, and “bathrooms” are statistically significant at the 1% level. However, “bedrooms” has a negative sign, which I did not expect. A likely explanation is that the model controls for square footage: more rooms dividing up the same space are not necessarily more valuable. “foreclosure” has the negative sign I expected, but it is not statistically significant.

On average, holding the other variables constant, an additional bedroom decreases the price of a house by $147,931.

On average, an additional bathroom increases the price of a house by $84,429.

On average, a one-square-foot increase in sqft increases the price of a house by $199.
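
To make these coefficients concrete, the sketch below computes the fitted price for one house by hand. The coefficients are copied from `model$coefficients`; the house itself (2,000 sqft, 3 bedrooms, 2 bathrooms, not a foreclosure) is a hypothetical example, not a listing from the data.

```r
# Coefficients copied from model$coefficients above
b <- c(intercept = 108996.6488, sqft = 198.8229, bedrooms = -147931.3682,
       bathrooms = 84429.3005, foreclosure = -36327.6525)

# Hypothetical house: 2,000 sqft, 3 bedrooms, 2 bathrooms, not a foreclosure
house <- c(intercept = 1, sqft = 2000, bedrooms = 3, bathrooms = 2, foreclosure = 0)

# Fitted price = intercept + sum of (coefficient * value)
predicted_price <- sum(b * house)
print(round(predicted_price))
```

The fitted price is about $231,700. Adding a fourth bedroom to the same house, with square footage unchanged, would lower the prediction by the bedroom coefficient.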

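Hedonic pricing model: log-linear model

Hedonic pricing models often use a log-linear specification because the coefficients have a convenient interpretation: each one is the approximate proportional change in price for a one-unit change in the regressor.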

logmodel <- lm(log(price)~ sqft+bedrooms+bathrooms+foreclosure, zillow_df)
summary(logmodel)
## 
## Call:
## lm(formula = log(price) ~ sqft + bedrooms + bathrooms + foreclosure, 
##     data = zillow_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6174 -0.1088 -0.0107  0.1129  0.8478 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.9491804  0.0656519 182.008  < 2e-16 ***
## sqft         0.0001612  0.0000129  12.504  < 2e-16 ***
## bedrooms     0.0128394  0.0199012   0.645    0.519    
## bathrooms    0.0995622  0.0172161   5.783 2.46e-08 ***
## foreclosure -0.1676608  0.0350724  -4.780 3.18e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2014 on 223 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9108, Adjusted R-squared:  0.9092 
## F-statistic: 569.2 on 4 and 223 DF,  p-value: < 2.2e-16
logmodel$coefficients
##   (Intercept)          sqft      bedrooms     bathrooms   foreclosure 
## 11.9491804021  0.0001612462  0.0128394036  0.0995621896 -0.1676608233

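R-squared for the log-linear model is 91.1%, higher than the 79.4% of the linear model, although the two are not strictly comparable because the dependent variables differ (price versus log(price)). The better fit is likely because the log transformation brings the right-skewed “price” variable closer to a normal distribution.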
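
As a rough guide to reading the log-linear coefficients, the sketch below converts each slope into a percentage change in price using 100 × (exp(b) − 1). The coefficients are copied from `logmodel$coefficients`; the conversion is the standard one for a log-transformed dependent variable and was not part of the original analysis.

```r
# Slope coefficients copied from logmodel$coefficients above
b <- c(sqft = 0.0001612462, bedrooms = 0.0128394036,
       bathrooms = 0.0995621896, foreclosure = -0.1676608233)

# Exact percentage change in price for a one-unit increase in each regressor
pct_change <- 100 * (exp(b) - 1)
print(round(pct_change, 2))
```

Read this way, an additional bathroom is associated with a price roughly 10.5% higher, and a pre-foreclosure listing with a price roughly 15.4% lower, holding the other variables constant.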