Have you ever asked the question: “Why does my rent cost so much?” I thought I would use this project to look into how different features of a rental property (size, location, style) affect its monthly rental price. Using data from RentFaster (a popular Canadian rental listing website), this project examines those factors and attempts to build a model that predicts a listing's monthly rent from the features in its advertisement.
First off, we should get an idea of where these rental listings are geographically located in Calgary.
This plot shows where listings are concentrated geographically in Calgary, and how they are separated by their Quadrants.
The most significant factor affecting rental price is the type of property (think apartment vs house). This plot shows the relative effects of varying property types. Shared residences come at a significant discount, as could be expected. It is curious to me that houses and condos have similar premiums, and that lofts have such a high premium relative to all other property types!
When a landlord says to you that he will include internet in the cost of rent, or that he will let you have dogs in the residence, what sort of premium can you expect to pay?
We can see that if internet is included in the rent, you can expect to pay a premium of roughly $196 per month! This is, at least in Canada, seriously overpaying for internet.
One question to ask is: how effective is this model? How well does it predict prices it hasn't seen before? We can both quantify and visualize the model's performance, and compare it to other models.
Here, the x-axis shows the actual price and the y-axis the predicted price. The RMSE of the training and test sets are similar, which suggests we are not overfitting to the training data. The Average Error for the test set is also quite low, indicating that we are not systematically over- or under-predicting price. The plot does reveal something these metrics don't capture: high-priced rentals are consistently under-predicted by the model. This suggests that a different, more flexible model may better capture the true relationship.
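For reference, both metrics are straightforward to compute from actual and predicted prices. This is a minimal sketch with made-up numbers; actual and predicted are placeholders, not objects from the analysis below:
## Hypothetical rents and model predictions, for illustration only
actual    <- c(1200, 950, 1800, 2100, 1400)
predicted <- c(1150, 1000, 1750, 1900, 1450)
error <- predicted - actual
mean(error)           # Average Error: near zero means no systematic over/under-prediction
sqrt(mean(error^2))   # RMSE: typical size of an individual prediction error, in dollars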
How did we get these results?
First, we need to load the appropriate packages.
library(jsonlite)   # parse the RentFaster JSON API
library(dplyr)      # data manipulation
library(ggmap)      # map plot of listings (also attaches ggplot2)
library(car)        # vif() for collinearity checks
library(Hmisc)      # miscellaneous helper functions
library(httr)       # GET() request for the Wikipedia page
library(XML)        # parse the Wikipedia HTML table
library(colorspace) # rainbow_hcl() palette for the bar charts
Now, we have to scrape the RentFaster website.
We do this using the jsonlite package.
The following code creates a vector of URLs that together cover all listings priced from $500 to $2,524, split into $25 windows because the API returns at most 500 rows per request. The listings from each request are then combined into a single data frame.
#########################
# Scraping Rentfaster #
#########################
## Scrape using jsonlite package
## Split into $25 intervals, since the json file only returns the first 500 rows of data
## Then join these dataframes together
prices <- seq(500, 2500, 25)
urls <- paste0('https://www.rentfaster.ca/api/map.json?e=undefined&beds=&baths=&type=&price_range_adv%5Bfrom%5D=',
               prices,
               '&price_range_adv%5Bto%5D=',
               prices + 24,
               '&area=51.18725531061486%2C-113.72722341074217%2C50.90397819717543%2C-114.3877763892578')
rfData <- fromJSON(txt = urls[1])
rfData <- rfData$listings
for (i in 2:length(urls)){
  newdf <- fromJSON(txt = urls[i])
  rfData <- rbind(rfData, newdf$listings)
}
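A quick sanity check after the scrape can be worthwhile. This is a sketch; id is an assumed column name for the listing identifier, so inspect names(rfData) to confirm what the API actually returns:
nrow(rfData)                 # total number of listings pulled across all price windows
sum(duplicated(rfData$id))   # duplicates can appear, e.g. if a listing's price changes mid-scrape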
Next, we scrape a Wikipedia HTML table to acquire the quadrant of each community, and add it to the listings dataframe.
######################
# Scraping Wikipedia #
######################
## Scrape using XML and httr packages ##
## The community name is in the first column of the table; the quadrant is in the second ##
url <- "https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Calgary"
getObject <- GET(url)
commsParsed <- htmlParse(getObject)
pageTables <- readHTMLTable(commsParsed, stringsAsFactors = FALSE, header = T)
communityTbl <- pageTables[[1]]
communityTbl <- communityTbl[-nrow(communityTbl),]   # drop the table's final row
## Add Quadrant to the Rentfaster data by matching each listing's community against the Wikipedia table ##
quadrant <- c()
for (i in 1:nrow(rfData)){
  matches <- rfData$community[i] == communityTbl[,1]
  if (sum(matches, na.rm = TRUE) == 0){
    quadrant[i] <- NA
  } else {
    quadrant[i] <- communityTbl[,2][which(matches)][1]
  }
}
rfData$quadrant <- quadrant
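It is worth checking how well the community names matched up (a quick sketch; the exact counts depend on when the site was scraped):
sum(is.na(rfData$quadrant))   # listings whose community did not appear in the Wikipedia table
table(rfData$quadrant)        # number of listings in each quadrant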
The next step is to correct the variables’ classes (e.g. the number of bedrooms should be numeric) and to convert the list of included utilities into individual dummy variables.
###################
## Data Cleaning ##
###################
## Select useful variables ##
rfDataCleaned <- rfData %>%
  dplyr::select(title, rented, availability, a, unit, city, community, quadrant, intro,
                latitude, longitude, type, preferred_contact, sq_feet, price, bedrooms,
                den, baths, cats, dogs, utilities_included)
## Correct the class of variables ##
#str(rfDataCleaned)
rfDataCleaned$type <- unlist(rfDataCleaned$type)
rfDataCleaned$a <- as.Date(rfDataCleaned$a)
rfDataCleaned$rented <- as.factor(rfDataCleaned$rented)
rfDataCleaned$availability <- as.factor(rfDataCleaned$availability)
rfDataCleaned$unit <- as.numeric(rfDataCleaned$unit)
rfDataCleaned$city <- as.factor(rfDataCleaned$city)
rfDataCleaned$community <- as.factor(rfDataCleaned$community)
rfDataCleaned$quadrant <- as.factor(rfDataCleaned$quadrant)
rfDataCleaned$latitude <- as.numeric(rfDataCleaned$latitude)
rfDataCleaned$longitude <- as.numeric(rfDataCleaned$longitude)
rfDataCleaned$type <- as.factor(rfDataCleaned$type)
rfDataCleaned$sq_feet <- as.numeric(rfDataCleaned$sq_feet)
rfDataCleaned$price <- as.numeric(rfDataCleaned$price)
rfDataCleaned$bedrooms <- as.numeric(rfDataCleaned$bedrooms)
rfDataCleaned$den <- as.factor(rfDataCleaned$den)
rfDataCleaned$baths <- as.numeric(rfDataCleaned$baths)
rfDataCleaned$cats <- ifelse(rfDataCleaned$cats == "1", "Yes", "No")
rfDataCleaned$dogs <- ifelse(rfDataCleaned$dogs == "1", "Yes", "No")
## Convert the utilities_included list into one logical dummy variable per utility ##
utilIncluded <- function(utility){
  sapply(rfDataCleaned$utilities_included, function(u) utility %in% u)
}
rfDataCleaned$heatIncl  <- utilIncluded("Heat")
rfDataCleaned$cableIncl <- utilIncluded("Cable")
rfDataCleaned$elecIncl  <- utilIncluded("Electricity")
rfDataCleaned$waterIncl <- utilIncluded("Water")
rfDataCleaned$intIncl   <- utilIncluded("Internet")
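For a quick feel of how common each inclusion is (a small sketch; the proportions depend on the listings scraped):
## Share of listings that bundle each utility into the rent
colMeans(rfDataCleaned[, c("heatIncl", "cableIncl", "elecIncl", "waterIncl", "intIncl")])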
Variable selection is vital so that the model works only with genuinely predictive variables.
The following code measures the collinearity of the predictors (using their variance inflation factors), and then uses stepwise regression (based on AIC) to select a parsimonious combination of variables.
########################
## Variable Selection ##
########################
## Use VIF to eliminate collinear variables ##
## Loop: fit the full model, compute each term's GVIF^(1/(2*Df)),
## and if the largest value exceeds 2, drop that variable and refit
modelData <- rfDataCleaned %>%
  dplyr::select(community, latitude, longitude, type, sq_feet, bedrooms, den, baths,
                cats, dogs, heatIncl, cableIncl, elecIncl, waterIncl, intIncl, price)
modelData[which(modelData$community == "Bearspaw"),] <- NA   # blank out Bearspaw listings
while (TRUE) {
  modl <- lm(price ~ ., data = modelData)
  dfVif <- modl %>%
    vif() %>%
    as.data.frame()
  ind <- which.max(dfVif$`GVIF^(1/(2*Df))`)
  if (dfVif$`GVIF^(1/(2*Df))`[ind] > 2){
    modelData <- modelData[,-ind]   # price is the last column, so term i maps to column i
  } else {break}
}
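After the loop finishes, we can confirm which predictors survived the collinearity screen (a quick check; the surviving set depends on the scraped data):
dfVif                # final VIF table: every value at or below the threshold of 2
names(modelData)     # predictors (plus price) that remain after the screen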
## Use step() with direction 'both' to choose variables by AIC ##
modelDataNAexcl <- na.exclude(modelData)
fullModel <- lm(price ~ ., data = modelDataNAexcl)
nullModel <- lm(price ~ 1, data = modelDataNAexcl)
stepForward1 <- step(nullModel,
                     scope = list(lower = nullModel, upper = fullModel),
                     direction = "both")
However, we do have an issue: the resulting model has 226 terms, almost all of them dummy variables for individual communities. Luckily, since we scraped each community's Quadrant from Wikipedia, we can use it in place of Community: Quadrant has 8 levels, while community has about 200.
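A quick way to see where that blow-up comes from (a sketch; exact counts depend on when the site was scraped):
length(coef(stepForward1))         # ~226 coefficients when community is in the model
nlevels(rfDataCleaned$community)   # roughly 200 communities
nlevels(rfDataCleaned$quadrant)    # 8 quadrants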
modelData <- rfDataCleaned %>%
  dplyr::select(quadrant, latitude, longitude, type, sq_feet, bedrooms, den, baths,
                cats, dogs, heatIncl, cableIncl, elecIncl, waterIncl, intIncl, price)
## VIF ##
while (TRUE) {
  modl <- lm(price ~ ., data = modelData)
  dfVif <- modl %>%
    vif() %>%
    as.data.frame()
  ind <- which.max(dfVif$`GVIF^(1/(2*Df))`)
  if (dfVif$`GVIF^(1/(2*Df))`[ind] > 2){
    modelData <- modelData[,-ind]
  } else {break}
}
## step(), this time on a 70/30 train/test split ##
modelDataNAexcl <- na.exclude(modelData)
train <- sample(nrow(modelDataNAexcl), 0.7 * nrow(modelDataNAexcl))
trainData <- modelDataNAexcl[train,]
testData <- modelDataNAexcl[-train,]
fullModel <- lm(price ~ ., data = trainData)
nullModel <- lm(price ~ 1, data = trainData)
stepForward2 <- step(nullModel,
                     scope = list(lower = nullModel, upper = fullModel),
                     direction = "both")
The summary of the final regression model:
summary(stepForward2)
##
## Call:
## lm(formula = price ~ type + baths + intIncl + quadrant + sq_feet +
## dogs + elecIncl + bedrooms + den + latitude + cats + waterIncl +
## longitude, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1034.97 -176.02 -32.72 146.65 942.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.524e+04 1.793e+04 0.850 0.395355
## typeBasement -2.805e+02 2.666e+01 -10.522 < 2e-16 ***
## typeCondo 1.430e+02 1.701e+01 8.408 < 2e-16 ***
## typeDuplex -6.438e+01 3.624e+01 -1.777 0.075799 .
## typeHouse 1.591e+02 3.245e+01 4.904 1.02e-06 ***
## typeLoft 6.224e+02 1.278e+02 4.869 1.22e-06 ***
## typeMain Floor -4.554e+01 2.934e+01 -1.552 0.120796
## typeShared -9.940e+02 3.456e+01 -28.762 < 2e-16 ***
## typeTownhouse 9.576e+01 2.602e+01 3.681 0.000239 ***
## baths 2.035e+02 1.331e+01 15.293 < 2e-16 ***
## intInclTRUE 1.956e+02 1.993e+01 9.814 < 2e-16 ***
## quadrantNE/NW 2.758e+02 4.970e+01 5.549 3.29e-08 ***
## quadrantNE/SE -1.797e+02 1.291e+02 -1.392 0.164146
## quadrantNW 1.676e+02 3.198e+01 5.241 1.78e-07 ***
## quadrantNW/NE 2.752e+01 4.062e+01 0.677 0.498189
## quadrantSE 1.210e+02 3.323e+01 3.642 0.000278 ***
## quadrantSW 2.389e+02 3.414e+01 6.997 3.67e-12 ***
## quadrantSW/SE 3.435e+02 3.267e+01 10.513 < 2e-16 ***
## sq_feet 1.703e-01 2.041e-02 8.344 < 2e-16 ***
## dogsYes 1.015e+02 1.921e+01 5.280 1.45e-07 ***
## elecInclTRUE 7.432e+01 1.815e+01 4.094 4.43e-05 ***
## bedrooms 6.014e+01 1.053e+01 5.714 1.29e-08 ***
## denYes 6.307e+01 1.954e+01 3.228 0.001271 **
## latitude 4.784e+02 1.626e+02 2.941 0.003309 **
## catsYes -4.308e+01 1.923e+01 -2.241 0.025177 *
## waterInclTRUE 4.622e+01 2.005e+01 2.305 0.021284 *
## longitude 3.428e+02 1.567e+02 2.188 0.028810 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 251.6 on 1821 degrees of freedom
## Multiple R-squared: 0.6756, Adjusted R-squared: 0.6709
## F-statistic: 145.8 on 26 and 1821 DF, p-value: < 2.2e-16
coef(stepForward2)
## (Intercept) typeBasement typeCondo typeDuplex typeHouse
## 15240.2710246 -280.5394776 142.9769266 -64.3759785 159.1227624
## typeLoft typeMain Floor typeShared typeTownhouse baths
## 622.3552804 -45.5431110 -994.0087776 95.7586471 203.4957578
## intInclTRUE quadrantNE/NW quadrantNE/SE quadrantNW quadrantNW/NE
## 195.6019459 275.7964445 -179.7202299 167.6335989 27.5159791
## quadrantSE quadrantSW quadrantSW/SE sq_feet dogsYes
## 121.0265037 238.8501886 343.4567733 0.1703257 101.4511217
## elecInclTRUE bedrooms denYes latitude catsYes
## 74.3179672 60.1387854 63.0742541 478.4048009 -43.0846302
## waterInclTRUE longitude
## 46.2171551 342.7776985
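Since R takes the first factor level alphabetically as the reference, the omitted baselines here are most likely Apartment for type and NE for quadrant; each factor coefficient above is therefore the estimated premium or discount relative to that baseline, holding the other predictors fixed. For example, pulling out the loft premium:
## Estimated loft premium over the baseline property type, all else held equal
## (about $622 per month in this fit)
coef(stepForward2)["typeLoft"]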
What factors affect the cost of rent? Let’s take a look.
######################
## R Visualizations ##
######################
## Creating a data frame to plot the coefficient effects ##
plotdata <- data.frame(coef(stepForward2))
plotdata$names <- rownames(plotdata)
rownames(plotdata) <- NULL
plot1data <- plotdata[plotdata$names %in% c("intInclTRUE", "cableInclTRUE", "dogsYes", "catsYes",
                                            "denYes", "elecInclTRUE", "waterInclTRUE", "heatInclTRUE"),]
plot1data <- plot1data %>%
  arrange(desc(coef.stepForward2.))
## Plot the price adjustment based on different factors ##
ggplot(data = plot1data, aes(x = reorder(names, -coef.stepForward2.), y = coef.stepForward2.)) +
  geom_col(fill = rainbow_hcl(nrow(plot1data))) +
  xlab("") +
  ylab("Price Adjustment ($)") +
  ggtitle("Price Adjustment for Factors") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90)) +
  geom_text(
    aes(label = ifelse(plot1data$coef.stepForward2. >= 0,
                       "",
                       "-") %>%
          paste0("$", abs(round(plot1data$coef.stepForward2., 0))),
        y = ifelse(plot1data$coef.stepForward2. >= 0,
                   plot1data$coef.stepForward2. + 5,
                   plot1data$coef.stepForward2. - 10)),
    position = position_dodge(0.9),
    vjust = 0)
## Plot the price adjustment based on Rental Type ##
plot2data <- plotdata[grepl("type", plotdata$names),]
plot2data$names <- gsub("type", "", plot2data$names)
ggplot(data = plot2data, aes(x = reorder(names, coef.stepForward2.), y = coef.stepForward2.)) +
  geom_col(fill = rainbow_hcl(nrow(plot2data))) +
  xlab("") +
  ylab("Price Adjustment ($)") +
  ggtitle("Price Adjustment for Rental Type") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90)) +
  geom_text(
    aes(label = ifelse(plot2data$coef.stepForward2. >= 0,
                       "",
                       "-") %>%
          paste0("$", abs(round(plot2data$coef.stepForward2., 0))),
        y = ifelse(plot2data$coef.stepForward2. >= 0,
                   plot2data$coef.stepForward2. + 15,
                   plot2data$coef.stepForward2. - 50)),
    position = position_dodge(0.9),
    vjust = 0)
Finally, how effective is this model? The following code computes the RMSE and average error for the training and test sets, and plots actual versus predicted prices for each.
## Plotting the Model Error on Training Data ##
predDFtrain <- data.frame(preds = predict(stepForward2, trainData),
                          price = trainData$price,
                          error = predict(stepForward2, trainData) - trainData$price) %>%
  arrange(price)
predDFtrain <- data.frame(ind = c(1:nrow(predDFtrain)), predDFtrain)
avgErrorTrain <- mean(predDFtrain$error)
avgErrorTrainText <- paste0("Avg Error: $", round(avgErrorTrain, 2))
rmseTrain <- sqrt(mean(predDFtrain$error^2))
rmseTrainText <- paste0("RMSE: $", round(rmseTrain, 2))
ggplot(predDFtrain, aes(x = price, y = preds)) +
  geom_point(aes(alpha = I(0.2))) +
  geom_abline(slope = 1, intercept = 0, col = "red", size = 1) +
  xlim(c(0, 3000)) +
  ylim(c(0, 3000)) +
  ggtitle("Training Data: Actual vs Predicted Price") +
  xlab("Actual Price ($)") +
  ylab("Predicted Price ($)") +
  theme(plot.title = element_text(hjust = 0.5)) +
  annotate("text", x = 2500, y = 800, label = rmseTrainText) +
  annotate("text", x = 2500, y = 600, label = avgErrorTrainText)
## Plotting the Model Error on Test Data ##
preds <- predict(stepForward2, newdata = testData)
predDF <- data.frame(ind = c(1:length(preds)), price = testData$price, preds, error = preds - testData$price)
avgErrorTest <- mean(predDF$error)
avgErrorTestText <- paste0("Avg Error: $", round(avgErrorTest, 2))
rmseTest <- sqrt(mean(predDF$error^2))
rmseTestText <- paste0("RMSE: $", round(rmseTest, 2))
ggplot(predDF, aes(x = price, y = preds)) +
  geom_point(aes(alpha = I(0.3))) +
  geom_abline(slope = 1, intercept = 0, col = "red", size = 1) +
  xlim(c(0, 3000)) +
  ylim(c(0, 3000)) +
  ggtitle("Test Data: Actual vs Predicted Price") +
  xlab("Actual Price ($)") +
  ylab("Predicted Price ($)") +
  theme(plot.title = element_text(hjust = 0.5)) +
  annotate("text", x = 2500, y = 800, label = rmseTestText) +
  annotate("text", x = 2500, y = 600, label = avgErrorTestText)
This plot shows where listings are concentrated geographically in Calgary, and how the Quadrants we scraped from Wikipedia separate them out.
qmplot(longitude, latitude, data = modelDataNAexcl, size = I(1), darken = 0.6, color = quadrant,
       alpha = I(1), maptype = 'toner-lite', geom = 'point',
       main = "RentFaster Listings by Quadrant") +
  theme(plot.title = element_text(hjust = 0.5))
And that’s it! In the future, I plan to implement a more advanced machine learning technique called Random Forests to improve the predictive performance of the model. This method is not as explainable as linear regression (which is why I chose regression for this project), but it will likely be better able to fit the data here. Thanks for reading!