Task: Analyse the Athens Real Estate Market

Do some cleaning

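library(tidyverse)  # dplyr (mutate, %>%), stringr (str_extract, str_trim), ggplot2

zillow_df <- zillow %>%
  # the digits immediately before "bds" / "ba" are the bedroom and bathroom counts
  mutate(bedrooms = as.integer(str_trim(str_extract(details, "[\\d ]*(?=bds)")))) %>%
  mutate(bathrooms = as.integer(str_trim(str_extract(details, "[\\d ]*(?=ba)")))) %>%
  # listings shown as "-- sqft" yield an empty string, which becomes NA
  mutate(sqft = str_trim(str_extract(details, "[\\d ,]*(?=sqft)"))) %>%
  mutate(sqft = as.numeric(str_replace(sqft, ",", ""))) %>%
  # keep only the digits of the raw price string
  mutate(price = as.numeric(str_replace_all(p, "[^0-9]*", "")))
# convert the original price column to numeric as well
zillow_df$p <- as.numeric(gsub('[$,]', '', zillow_df$p))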
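
To check how the lookahead patterns in the cleaning step behave, here is a minimal sketch in base R that applies the same three patterns to two of the `details` strings from the data. The helper `extract_first` is a hypothetical stand-in for `str_extract`, written so that the snippet runs without any packages. It is an illustration, not the code used to build `zillow_df`.

```r
# Hypothetical helper: first match of a Perl-style pattern, or NA if there is none.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- as.vector(m) != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

# Two listing strings copied from the data: one with a size, one with "-- sqft".
details <- c("3 bds3 ba1,700 sqft- New construction",
             "3 bds2 ba-- sqft- New construction")

# Same patterns as the cleaning step: digits/spaces followed by the unit label.
bedrooms  <- as.integer(trimws(extract_first(details, "[\\d ]*(?=bds)")))
bathrooms <- as.integer(trimws(extract_first(details, "[\\d ]*(?=ba)")))
sqft_raw  <- trimws(extract_first(details, "[\\d ,]*(?=sqft)"))

# Drop the thousands separator; an empty string converts to NA.
sqft <- as.numeric(gsub(",", "", sqft_raw, fixed = TRUE))

print(data.frame(bedrooms, bathrooms, sqft))
```

The second listing has no size, so the pattern matches only the space before `sqft`, which trims to an empty string and converts to `NA`. Rows like this one are dropped later by `ggplot` and `lm`.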

Visualization

head(zillow_df)
##        p                               details bedrooms bathrooms sqft  price
## 1 252500 3 bds3 ba1,700 sqft- New construction        3         3 1700 252500
## 2 250000   2 bds1 ba1,104 sqft- House for sale        2         1 1104 250000
## 3 279900    3 bds2 ba-- sqft- New construction        3         2   NA 279900
## 4 269900 3 bds2 ba1,675 sqft- New construction        3         2 1675 269900
## 5 795900      5 bds6 ba-- sqft- House for sale        5         6   NA 795900
## 6 449000    4 bds4 ba-- sqft- New construction        4         4   NA 449000
F1 <- ggplot(zillow_df, aes(x=sqft, y=price, size=sqft, color=as.factor(bedrooms))) + 
  geom_point()+
  theme_classic()
F1
## Warning: Removed 18 rows containing missing values (geom_point).

Regression analysis

reg1 <- lm(p ~ bedrooms + bathrooms + sqft, zillow_df)
summary(reg1)
## 
## Call:
## lm(formula = p ~ bedrooms + bathrooms + sqft, data = zillow_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -149983  -30102    -232   21766  157196 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  47965.83   56447.25   0.850   0.4029    
## bedrooms    -50910.13   23405.71  -2.175   0.0386 *  
## bathrooms    -3023.46   20639.06  -0.146   0.8846    
## sqft           215.63      14.58  14.788 1.82e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65760 on 27 degrees of freedom
##   (18 observations deleted due to missingness)
## Multiple R-squared:  0.9319, Adjusted R-squared:  0.9243 
## F-statistic: 123.1 on 3 and 27 DF,  p-value: 7.316e-16
reg1$coefficients
## (Intercept)    bedrooms   bathrooms        sqft 
##  47965.8339 -50910.1268  -3023.4602    215.6277

Answer: The regression result is shown above. Both bedrooms and sqft are statistically significant. The coefficient on bedrooms is -50910.13, which is significant at the \(\alpha=0.05\) level. It suggests that, holding sqft and bathrooms fixed, each additional bedroom lowers the expected price by $50910.13. Because total area is held constant, an extra bedroom means smaller rooms, so room size appears to matter to buyers. The coefficient on sqft is 215.63, which is significant at the \(\alpha=0.001\) level. It suggests that, holding bedrooms and bathrooms fixed, each additional square foot raises the expected price by $215.63. The estimated value of an additional bathroom is -$3023.46, but it is not statistically significant.
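
To make the "holding other variables fixed" reading concrete, the sketch below plugs the coefficients printed by `summary(reg1)` into the fitted equation by hand. The function `predict_price` and the example house (3 bedrooms, 2 bathrooms, 1675 sqft, as in row 4 of the data) are chosen only for illustration. With the fitted model available, `predict(reg1, newdata = ...)` would do the same job.

```r
# Coefficients copied (rounded) from the summary(reg1) output.
coefs <- c(intercept = 47965.83,
           bedrooms  = -50910.13,
           bathrooms = -3023.46,
           sqft      = 215.63)

# Hypothetical helper: evaluate the fitted linear equation by hand.
predict_price <- function(bedrooms, bathrooms, sqft) {
  unname(coefs["intercept"] +
         coefs["bedrooms"]  * bedrooms +
         coefs["bathrooms"] * bathrooms +
         coefs["sqft"]      * sqft)
}

base_price     <- predict_price(3, 2, 1675)
extra_bedroom  <- predict_price(4, 2, 1675)  # one more bedroom, same area
extra_sqft     <- predict_price(3, 2, 1676)  # one more square foot

# Each difference reproduces the corresponding coefficient.
print(extra_bedroom - base_price)
print(extra_sqft - base_price)
```

The two differences equal the bedrooms and sqft coefficients, which is what "the effect of one more unit with the other variables constant" means in a linear model.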

Goodness of fit

Answer: The fit of this regression is very good: \(R^2\) is 0.9319. I think we could also include dummy variables for house type (condo, house, apartment, etc.) to improve the fit of the model. Following the hedonic pricing model, we could also add external factors to the model, such as the nearby air pollution index, education level, and crime rate. Crime reports can be obtained from https://spotcrime.com/. Air pollution maps can be obtained from https://aqicn.org/map/usa/. With more external and environmental variables, we should be able to predict house prices from location information with more confidence.
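
As a rough sketch of the dummy-variable idea, the `details` strings in the data end with a listing label such as "New construction" or "House for sale". The snippet below assumes that the label is whatever follows `sqft-` and turns it into a factor. The name `listing_type` is hypothetical, and these labels are not the same thing as the condo/house/apartment types mentioned above, so this only shows the mechanics.

```r
# Listing strings copied from the first rows of the data.
details <- c("3 bds3 ba1,700 sqft- New construction",
             "2 bds1 ba1,104 sqft- House for sale",
             "3 bds2 ba-- sqft- New construction")

# Assumption: the listing label is the text after "sqft-".
listing_type <- factor(trimws(sub("^.*sqft-", "", details)))

# R expands a factor into 0/1 dummy columns, with the first level as baseline.
dummies <- model.matrix(~ listing_type)
print(dummies)

# In the regression, the factor could simply be added to the formula, e.g.
# lm(p ~ bedrooms + bathrooms + sqft + listing_type, zillow_df)
```

With only 31 complete observations in the current fit, each extra dummy uses up a degree of freedom, so adjusted \(R^2\) is the safer number to compare across models.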