Scraping RentFaster

Overview

Have you ever asked the question: “Why does my rent cost so much?” I thought I would use this project to look into how different features of a rental property (size, location, style) affect its monthly rental price. Using RentFaster (a popular Canadian rental listings website), this project examines those factors and attempts to build a model that predicts a property’s monthly rent from the details in its advertisement.

Where are all the Listings?

First off, we should get an idea of where these rental listings are geographically located in Calgary.

This plot shows where listings are concentrated geographically in Calgary, and how they divide among the city’s quadrants.

Effect of Property Type

The most significant factor affecting rental price is the type of property (think apartment vs. house). This plot shows the relative effect of each property type. Shared residences come at a significant discount, as you might expect. It is curious to me that houses and condos carry similar premiums, and that lofts command such a high premium relative to every other property type!
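
Concretely, each premium is the fitted coefficient on that property type’s dummy variable, relative to the baseline level (presumably apartments, the level absent from the model summary in the Methodology section). A quick sketch using the stepForward2 model fitted below:

## Property-type effects relative to the baseline type omitted from the summary
typeEffects <- coef(stepForward2)[grepl("^type", names(coef(stepForward2)))]
sort(typeEffects)  # shared housing at a steep discount, lofts at the largest premium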

Effects of ‘Add-ons’

When a landlord offers to include internet in the rent, or to let you keep dogs in the residence, what sort of premium can you expect to pay?

We can see that if internet is included in the rent, you can expect to pay a hefty premium! At typical Canadian internet prices, that is seriously overpaying.
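
Since the model is linear, each add-on premium can be read directly off the corresponding dummy coefficient of the stepForward2 model fitted in the Methodology section below. A quick sketch:

## Each add-on premium is the fitted dummy coefficient, all else held fixed
coef(stepForward2)["intInclTRUE"]  # monthly premium when internet is included (~$196)
coef(stepForward2)["dogsYes"]      # monthly premium when dogs are allowed (~$101)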

How does this Model Perform?

One question to ask is: how effective is this model? How well does it predict prices for listings it hasn’t seen? We can both quantify and visualize this model’s performance, and compare it to other models.

Here, the x-axis shows the actual price and the y-axis the predicted price. The RMSEs of the training and test sets are similar, which suggests we’re not overfitting the data. The average error on the test set is also close to zero, indicating that we’re not systematically over- or under-predicting price. The visual does reveal something these metrics don’t capture: our model consistently under-predicts high-priced rentals. This suggests that a more flexible model may better capture the true relationship.
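
For reference, the two metrics quoted here are simple functions of the per-listing prediction errors. A minimal sketch (these helper names are mine; the Methodology section computes the same quantities inline):

## Error metrics used throughout: error = predicted - actual for each listing
rmse <- function(error) sqrt(mean(error^2))   # penalizes large misses heavily
avgError <- function(error) mean(error)       # sign reveals systematic over/under-prediction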

Methodology

How did we get these results?

First, we need to load the appropriate packages.

library(jsonlite)
library(dplyr)
library(ggmap)
library(car)
library(Hmisc)
library(httr)
library(XML)
library(colorspace)
library(ggplot2)   # attached by ggmap as a dependency, but loaded explicitly since we call ggplot() directly

Web Scraping

Now, we have to scrape the RentFaster website.
We do this using the jsonlite package.
The following code builds a vector of URLs covering every listing priced between $500 and $2500, then collects the data from those listings into a single data frame.

#########################
#  Scraping Rentfaster  #
#########################

## Scrape using jsonlite package
## Split into $25 intervals, since the json file only returns the first 500 rows of data
## Then join these dataframes together
prices <- seq(500, 2500, 25)
urls <- paste0('https://www.rentfaster.ca/api/map.json?e=undefined&beds=&baths=&type=&price_range_adv%5Bfrom%5D=', 
               prices, 
               '&price_range_adv%5Bto%5D=', 
               prices + 24, 
               '&area=51.18725531061486%2C-113.72722341074217%2C50.90397819717543%2C-114.3877763892578')

## Fetch each price band, then combine all listings in one rbind
## (avoids growing a data frame one iteration at a time)
listingsList <- lapply(urls, function(u) fromJSON(txt = u)$listings)
rfData <- do.call(rbind, listingsList)

Next, we scrape a Wikipedia HTML table to acquire the quadrant of each community, and add it to the listings data frame.

######################
# Scraping Wikipedia #
######################

## Scrape using the XML and httr packages ##
## The page's first table lists the communities; columns 1 and 2 hold community and quadrant

url <- "https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Calgary"
getObject <- GET(url) 
commsParsed <- htmlParse(getObject)
pageTables <- readHTMLTable(commsParsed, stringsAsFactors = FALSE, header = T)
communityTbl <- pageTables[[1]]
communityTbl <- communityTbl[-nrow(communityTbl),]  # drop the table's last row, which is not a community entry

## Add Quadrant to Rentfaster data by comparing community with Wikipedia data

## match() finds each listing's community in the Wikipedia table (NA if absent)
rfData$quadrant <- communityTbl[,2][match(rfData$community, communityTbl[,1])]

Data Cleaning

The next step is to correct the variables’ classes (e.g. the number of bedrooms should be numeric), and to convert the lists of included utilities into individual dummy variables.

###################
## Data Cleaning ##
###################

## Select useful variables ##

rfDataCleaned <- rfData %>%
  dplyr::select(title, rented, availability, a, unit, city, community, quadrant, intro, latitude, longitude, type, preferred_contact, sq_feet, price, bedrooms, den, baths, cats, dogs, utilities_included)

## Correct the class of variables ##

#str(rfDataCleaned)

rfDataCleaned$type <- unlist(rfDataCleaned$type)
rfDataCleaned$a <- as.Date(rfDataCleaned$a)
rfDataCleaned$rented <- as.factor(rfDataCleaned$rented)
rfDataCleaned$availability <- as.factor(rfDataCleaned$availability)
rfDataCleaned$unit <- as.numeric(rfDataCleaned$unit)
rfDataCleaned$city <- as.factor(rfDataCleaned$city)
rfDataCleaned$community <- as.factor(rfDataCleaned$community)
rfDataCleaned$quadrant <- as.factor(rfDataCleaned$quadrant)
rfDataCleaned$latitude <- as.numeric(rfDataCleaned$latitude)
rfDataCleaned$longitude <- as.numeric(rfDataCleaned$longitude)
rfDataCleaned$type <- as.factor(rfDataCleaned$type)
rfDataCleaned$sq_feet <- as.numeric(rfDataCleaned$sq_feet)
rfDataCleaned$price <- as.numeric(rfDataCleaned$price)
rfDataCleaned$bedrooms <- as.numeric(rfDataCleaned$bedrooms)
rfDataCleaned$den <- as.factor(rfDataCleaned$den)
rfDataCleaned$baths <- as.numeric(rfDataCleaned$baths)
rfDataCleaned$cats <- ifelse(rfDataCleaned$cats == "1",
                       "Yes",
                       "No")
rfDataCleaned$dogs <- ifelse(rfDataCleaned$dogs == "1",
                       "Yes",
                       "No")
## Convert the utilities_included list column into one logical dummy per utility
utilities <- c(heatIncl = "Heat", cableIncl = "Cable", elecIncl = "Electricity",
               waterIncl = "Water", intIncl = "Internet")
for (col in names(utilities)) {
  rfDataCleaned[[col]] <- sapply(rfDataCleaned$utilities_included,
                                 function(u) utilities[[col]] %in% u)
}

Variable Selection

Variable selection is vital: the model should work only with informative predictors.
The following code first measures the collinearity of the variables using their variance inflation factors (for factor variables, car’s vif() reports GVIF^(1/(2*Df)), whose square is comparable to an ordinary VIF), then uses AIC-based stepwise regression via step() to select a final set of variables.

########################
## Variable Selection ##
########################

## Use VIF to eliminate collinear variables ##
## Create a loop that evaluates the VIF of the model's variables
## If the largest GVIF^(1/(2*Df)) of a variable is above 2, eliminate it and run the model again

modelData <- rfDataCleaned %>%
  dplyr::select(community, latitude, longitude, type, sq_feet, bedrooms, den, baths, cats, dogs, heatIncl, cableIncl, elecIncl, waterIncl, intIncl, price)
modelData[modelData$community == "Bearspaw",] <- NA  # blank out Bearspaw listings; na.exclude() drops them later
while (TRUE) {
  modl <- lm(price ~ ., data = modelData)
  dfVif <- modl %>%
    vif() %>%
    as.data.frame()
  ind <- which.max(dfVif$`GVIF^(1/(2*Df))`)
  if (dfVif$`GVIF^(1/(2*Df))`[ind] > 2){
    modelData <- modelData[,-ind]
  } else {break}
}

## Use step() with direction = "both" to select variables by AIC ##


modelDataNAexcl <- na.exclude(modelData)
fullModel <- lm(price ~ ., data = modelDataNAexcl)
nullModel <- lm(price ~ 1, data = modelDataNAexcl)
stepForward1 <- step(nullModel, 
                    scope = list(lower = nullModel, upper = fullModel),
                    direction = "both")

However, we do have an issue: the model contains 226 variables, most of them community dummies! Luckily, since we scraped each community’s quadrant, we can use quadrant in place of community. Quadrant has 8 levels; community has about 200.

modelData <- rfDataCleaned %>%
  dplyr::select(quadrant, latitude, longitude, type, sq_feet, bedrooms, den, baths, cats, dogs, heatIncl, cableIncl, elecIncl, waterIncl, intIncl, price)

## VIF ##

while (TRUE) {
  modl <- lm(price ~ ., data = modelData)
  dfVif <- modl %>%
    vif() %>%
    as.data.frame()
  ind <- which.max(dfVif$`GVIF^(1/(2*Df))`)
  if (dfVif$`GVIF^(1/(2*Df))`[ind] > 2){
    modelData <- modelData[,-ind]
  } else {break}
}

## step() ##

modelDataNAexcl <- na.exclude(modelData)
## Random 70/30 train/test split (call set.seed() first for a reproducible split)
train <- sample(nrow(modelDataNAexcl), 0.7 * nrow(modelDataNAexcl))
trainData <- modelDataNAexcl[train,]
testData <- modelDataNAexcl[-train,]
fullModel <- lm(price ~ ., data = trainData)
nullModel <- lm(price ~ 1, data = trainData)
stepForward2 <- step(nullModel, 
                    scope = list(lower = nullModel, upper = fullModel),
                    direction = "both")

The summary of the final regression model:

summary(stepForward2)
## 
## Call:
## lm(formula = price ~ type + baths + intIncl + quadrant + sq_feet + 
##     dogs + elecIncl + bedrooms + den + latitude + cats + waterIncl + 
##     longitude, data = trainData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1034.97  -176.02   -32.72   146.65   942.04 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.524e+04  1.793e+04   0.850 0.395355    
## typeBasement   -2.805e+02  2.666e+01 -10.522  < 2e-16 ***
## typeCondo       1.430e+02  1.701e+01   8.408  < 2e-16 ***
## typeDuplex     -6.438e+01  3.624e+01  -1.777 0.075799 .  
## typeHouse       1.591e+02  3.245e+01   4.904 1.02e-06 ***
## typeLoft        6.224e+02  1.278e+02   4.869 1.22e-06 ***
## typeMain Floor -4.554e+01  2.934e+01  -1.552 0.120796    
## typeShared     -9.940e+02  3.456e+01 -28.762  < 2e-16 ***
## typeTownhouse   9.576e+01  2.602e+01   3.681 0.000239 ***
## baths           2.035e+02  1.331e+01  15.293  < 2e-16 ***
## intInclTRUE     1.956e+02  1.993e+01   9.814  < 2e-16 ***
## quadrantNE/NW   2.758e+02  4.970e+01   5.549 3.29e-08 ***
## quadrantNE/SE  -1.797e+02  1.291e+02  -1.392 0.164146    
## quadrantNW      1.676e+02  3.198e+01   5.241 1.78e-07 ***
## quadrantNW/NE   2.752e+01  4.062e+01   0.677 0.498189    
## quadrantSE      1.210e+02  3.323e+01   3.642 0.000278 ***
## quadrantSW      2.389e+02  3.414e+01   6.997 3.67e-12 ***
## quadrantSW/SE   3.435e+02  3.267e+01  10.513  < 2e-16 ***
## sq_feet         1.703e-01  2.041e-02   8.344  < 2e-16 ***
## dogsYes         1.015e+02  1.921e+01   5.280 1.45e-07 ***
## elecInclTRUE    7.432e+01  1.815e+01   4.094 4.43e-05 ***
## bedrooms        6.014e+01  1.053e+01   5.714 1.29e-08 ***
## denYes          6.307e+01  1.954e+01   3.228 0.001271 ** 
## latitude        4.784e+02  1.626e+02   2.941 0.003309 ** 
## catsYes        -4.308e+01  1.923e+01  -2.241 0.025177 *  
## waterInclTRUE   4.622e+01  2.005e+01   2.305 0.021284 *  
## longitude       3.428e+02  1.567e+02   2.188 0.028810 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 251.6 on 1821 degrees of freedom
## Multiple R-squared:  0.6756, Adjusted R-squared:  0.6709 
## F-statistic: 145.8 on 26 and 1821 DF,  p-value: < 2.2e-16
coef(stepForward2)
##    (Intercept)   typeBasement      typeCondo     typeDuplex      typeHouse 
##  15240.2710246   -280.5394776    142.9769266    -64.3759785    159.1227624 
##       typeLoft typeMain Floor     typeShared  typeTownhouse          baths 
##    622.3552804    -45.5431110   -994.0087776     95.7586471    203.4957578 
##    intInclTRUE  quadrantNE/NW  quadrantNE/SE     quadrantNW  quadrantNW/NE 
##    195.6019459    275.7964445   -179.7202299    167.6335989     27.5159791 
##     quadrantSE     quadrantSW  quadrantSW/SE        sq_feet        dogsYes 
##    121.0265037    238.8501886    343.4567733      0.1703257    101.4511217 
##   elecInclTRUE       bedrooms         denYes       latitude        catsYes 
##     74.3179672     60.1387854     63.0742541    478.4048009    -43.0846302 
##  waterInclTRUE      longitude 
##     46.2171551    342.7776985
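
To make these coefficients concrete, here is a hypothetical sketch of predicting the rent of a made-up listing. Every feature value below is my own illustration, and I am assuming the den/cats/dogs factors use “Yes”/“No” levels, as the coefficient names suggest:

## A hypothetical 900 sq ft NW condo: 2 bed, 1 bath, den, internet included,
## dogs allowed, cats not allowed. All values are illustrative.
newListing <- data.frame(type = "Condo", quadrant = "NW", sq_feet = 900,
                         bedrooms = 2, baths = 1, den = "Yes",
                         cats = "No", dogs = "Yes",
                         intIncl = TRUE, elecIncl = FALSE, waterIncl = FALSE,
                         latitude = 51.08, longitude = -114.13)
predict(stepForward2, newdata = newListing)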

R Visualizations

What factors affect the cost of rent? Let’s take a look.

######################
## R Visualizations ##
######################

## Creating a data frame to plot the coefficient effects ##

plotdata <- data.frame(coef(stepForward2))
plotdata$names <- rownames(plotdata)
rownames(plotdata) <- NULL
plot1data <- plotdata[plotdata$names %in% c("intInclTRUE", "cableInclTRUE", "dogsYes", "catsYes", "denYes", "elecInclTRUE", "waterInclTRUE", "heatInclTRUE"),]
plot1data <- plot1data %>%
  arrange(desc(coef.stepForward2.))

## Plot the price adjustment based on different factors ##

ggplot(data = plot1data, aes(x = reorder(names, -coef.stepForward2.), y = coef.stepForward2.)) +
  geom_col(fill = rainbow_hcl(nrow(plot1data))) +
  xlab("") +
  ylab("Price Adjustment ($)") +
  ggtitle("Price Adjustment for Factors") + 
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90)) + 
  geom_text(
    aes(label = ifelse(plot1data$coef.stepForward2. >= 0, 
                       "",
                       "-") %>%
          paste0("$", abs(round(plot1data$coef.stepForward2., 0))),
        y = ifelse(plot1data$coef.stepForward2. >= 0,
                   plot1data$coef.stepForward2. + 5,
                   plot1data$coef.stepForward2. - 10)),
    position = position_dodge(0.9),
    vjust = 0)

## Plot the price adjustment based on Rental Type ##

plot2data <- plotdata[grepl("type", plotdata$names),]
plot2data$names <- gsub("type", "", plot2data$names)

ggplot(data = plot2data, aes(x = reorder(names, coef.stepForward2.), y = coef.stepForward2.)) +
  geom_col(fill = rainbow_hcl(nrow(plot2data))) +
  xlab("") +
  ylab("Price Adjustment ($)") +
  ggtitle("Price Adjustment for Rental Type") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90)) +
  geom_text(
    aes(label = ifelse(plot2data$coef.stepForward2. >= 0, 
                       "",
                       "-") %>%
          paste0("$", abs(round(plot2data$coef.stepForward2., 0))),
        y = ifelse(plot2data$coef.stepForward2. >= 0,
                   plot2data$coef.stepForward2. + 15,
                   plot2data$coef.stepForward2. - 50)),
    position = position_dodge(0.9),
    vjust = 0)

Evaluating Predictive Performance of the Linear Model

As discussed in the overview, we can quantify and visualize the model’s predictive performance on new data. The code below computes the RMSE and average error for the training and test sets, and plots actual versus predicted prices for each.

predDFtrain <- data.frame(preds = predict(stepForward2, trainData), 
                          price = trainData$price, 
                          error = predict(stepForward2, trainData) - trainData$price) %>%
  arrange(price)
predDFtrain <- data.frame(ind = c(1:nrow(predDFtrain)), predDFtrain)

avgErrorTrain <- mean(predDFtrain$error)
avgErrorTrainText <- paste0("Avg Error: $",round(avgErrorTrain, 2))
rmseTrain <- sqrt(mean(predDFtrain$error^2))
rmseTrainText <- paste0("RMSE: $",round(rmseTrain, 2))

ggplot(predDFtrain, aes(x = price, y = preds)) + 
  geom_point(aes(alpha = I(0.2))) +
  geom_abline(slope = 1, intercept = 0, col = "red", size = 1) +
  xlim(c(0, 3000)) +
  ylim(c(0, 3000)) + 
  ggtitle("Training Data: Actual vs Predicted Price") + 
  xlab("Actual Price ($)") +
  ylab("Predicted Price ($)") +
  theme(plot.title = element_text(hjust = 0.5)) + 
  annotate("text", x = 2500, y = 800, label = rmseTrainText) + 
  annotate("text", x = 2500, y = 600, label = avgErrorTrainText)

## Plotting the Model Error on Test Data ##

preds <- predict(stepForward2, newdata = testData)
predDF <- data.frame(ind = c(1:length(preds)), price = testData$price, preds, error = preds - testData$price)

avgErrorTest <- mean(predDF$error)
avgErrorTestText <- paste0("Avg Error: $",round(avgErrorTest, 2))
rmseTest <- sqrt(mean(predDF$error^2))
rmseTestText <- paste0("RMSE: $",round(rmseTest, 2))


ggplot(predDF, aes(x = price, y = preds)) + 
  geom_point(aes(alpha = I(0.3))) +
  geom_abline(slope = 1, intercept = 0, col = "red", size = 1) +
  xlim(c(0, 3000)) +
  ylim(c(0, 3000)) + 
  ggtitle("Test Data: Actual vs Predicted Price") + 
  xlab("Actual Price ($)") +
  ylab("Predicted Price ($)") +
  theme(plot.title = element_text(hjust = 0.5))  + 
  annotate("text", x = 2500, y = 800, label = rmseTestText) + 
  annotate("text", x = 2500, y = 600, label = avgErrorTestText)

Plotting Listings by Quadrant

This plot shows where listings are concentrated geographically in Calgary, and how the quadrants we scraped from Wikipedia separate them.

qmplot(longitude, latitude, data = modelDataNAexcl, size = I(1), darken = 0.6, color = quadrant, alpha = I(1), maptype = 'toner-lite', geom = 'point', 
       main = "RentFaster Listings by Quadrant") + 
  theme(plot.title = element_text(hjust = 0.5))

Conclusion

And that’s it! In the future, I plan to use a more flexible machine learning technique, random forests, to improve the model’s predictive performance. Random forests are not as explainable as linear regression (which is why I chose regression for this project), but they will likely fit this data better. Thanks for reading!
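
As a preview of that future work, here is a minimal sketch of what the comparison might look like, assuming the randomForest package (this is not part of the current analysis):

## Minimal sketch of the planned random-forest comparison (assumes randomForest)
library(randomForest)
set.seed(1)  # illustrative seed
rfModel <- randomForest(price ~ ., data = trainData)
rfPreds <- predict(rfModel, newdata = testData)
sqrt(mean((rfPreds - testData$price)^2))  # test RMSE, comparable to the linear model's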