I will be constructing a model that predicts the price of an Airbnb listing based on several variables, explained further in the following section. Since price is a continuous outcome, I will fit models best suited for regression. I will start by explaining a little about what Airbnb is and what it does, then why this model may be useful. Once we have a good idea of the kind of data involved, I will load my data set and begin data cleaning and splitting. Next, I will move on to exploratory data analysis to get a better understanding of the specific data we will be working with. Following this, I will begin the model construction process, and lastly I will test my final model on testing data that is held out throughout model construction.
Airbnb is a company that operates entirely online, allowing travelers to book accommodations anywhere in the world. Anyone can make an account either to post a listing, a "stay" that people can rent, or to book accommodations; once you request to stay somewhere, the host must approve your stay. You can think of Airbnb as a platform or marketplace where people go to offer or rent an accommodation.
Airbnb is mainly focused on home-stays, so many people rent their home, or part of it, to travelers who need a place to stay. You simply go online and enter the dates of your stay, the location, and the number of people with you, much like booking a hotel. Many different listings will then pop up with different prices depending on the type of accommodation, the location, and other similar factors. You can also book through your phone, as Airbnb has a mobile app.
This model could help hosts who want to post a new accommodation decide how much to list their place for. Given information about a listing, the model estimates what similar places rent for on average.
library(yardstick)
library(ISLR)
library(ISLR2)
library(rpart.plot)
library(vip)
library(janitor)
library(randomForest)
library(xgboost)
library(ranger)
library(discrim)
library(ggplot2)
library(tidyverse)
library(tidymodels)
library(corrr)
library(corrplot)
library(ggthemes)
library(readr)
library(klaR)
library(MASS)
library(dplyr)
library(glmnet)
tidymodels_prefer()
set.seed(777)
I intend to use a data set from insideairbnb.com, which sources its data from publicly available information on the official Airbnb site. The data set includes information about the hosts, the living space/listing itself, and reviews, and is restricted to the Los Angeles area. There are 33,330 observations and 74 variables in the raw file; however, I edited the .csv and removed variables that were not necessary, such as URLs to listings and free-text descriptions of the host/listing. The data set has both numeric and character/string variables as well as categorical and continuous variables. Most of the data is continuous, and I will convert the data I need into continuous variables in the Data Cleaning section.
Note: a code book with descriptions of all variables used is on my GitHub.
I converted the dates to a standard format and converted bathrooms_text to numeric, which, for example, changes "1 bath" to 1 (dropping the word "bath" in every row).
airbnb <- read_csv("airbnb_listings_final.csv",
col_types = cols(host_since = col_date(format = "%m/%d/%Y"),
first_review = col_date(format = "%m/%d/%Y"),
last_review = col_date(format = "%m/%d/%Y"),
bathrooms_text = col_number()))
The data set I am using already has standardized names for all variables, so there is no need to use janitor to clean them up, but there were still some modifications to be made. After removing many variables and loading in the data set, I realized there were still a few things to take out since they are not very relevant to predicting price. After this I have a total of 31 columns: 30 predictors plus the outcome, price.
airbnb <- airbnb %>%
select(-amenities, -host_name, -neighbourhood_cleansed)
Convert dates to numeric: this is so I can see whether the date a host joined and the dates of their reviews affect how they price their listing. I am now able to use these in the EDA and model building below.
airbnb$first_review <- as.numeric(airbnb$first_review)
airbnb$last_review <- as.numeric(airbnb$last_review)
airbnb$host_since <- as.numeric(airbnb$host_since)
Converting true/false values to 0 and 1 (0 for false and 1 for true). Again, this allows me to use these predictors in the regression analysis and models below.
airbnb$host_is_superhost <- as.numeric(airbnb$host_is_superhost)
airbnb$host_has_profile_pic <- as.numeric(airbnb$host_has_profile_pic)
airbnb$host_identity_verified <- as.numeric(airbnb$host_identity_verified)
airbnb$instant_bookable <- as.numeric(airbnb$instant_bookable)
I will change property_type to 1 or 2 depending on whether it is an entire home (1) or just a room (2) being rented, so I can see how price differs for each type of accommodation.
airbnb$property_type [airbnb$property_type == 'Entire home/apt'] <- 1
airbnb$property_type [airbnb$property_type == 'Private room'] <- 2
airbnb$property_type <- as.integer(airbnb$property_type)
Lastly, I will convert host_response_time to 1, 2, 3, or 4 depending on how fast the host responds, 1 being the fastest and 4 the slowest ('within an hour', 'within a few hours', 'within a day', and 'a few days or more', respectively). I am leaving all of these variables in to see if a host's engagement with the app affects the price they set in any way.
airbnb$host_response_time [airbnb$host_response_time == 'within an hour'] <- 1
airbnb$host_response_time [airbnb$host_response_time == 'within a few hours'] <- 2
airbnb$host_response_time [airbnb$host_response_time == 'within a day'] <- 3
airbnb$host_response_time [airbnb$host_response_time == 'a few days or more'] <- 4
airbnb$host_response_time <- as.integer(airbnb$host_response_time)
Note: the point of converting everything to numeric types was so I can include them in my EDA; otherwise I would convert them to dummy variables in my recipes.
I now want to check to see that all my data looks good and ready to go
head(airbnb)
## # A tibble: 6 x 31
## id host_since host_response_time host_response_rate host_acceptance_rate
## <dbl> <dbl> <int> <dbl> <dbl>
## 1 109 14057 NA NA NA
## 2 2708 14138 2 100 100
## 3 2732 14139 1 100 37
## 4 2864 14147 NA NA NA
## 5 3021 14154 NA NA NA
## 6 5728 14308 2 100 82
## # ... with 26 more variables: host_is_superhost <dbl>,
## # host_total_listings_count <dbl>, host_has_profile_pic <dbl>,
## # host_identity_verified <dbl>, property_type <int>, accommodates <dbl>,
## # bathrooms_text <dbl>, bedrooms <dbl>, beds <dbl>, price <dbl>,
## # minimum_nights <dbl>, maximum_nights <dbl>, availability_90 <dbl>,
## # number_of_reviews <dbl>, first_review <dbl>, last_review <dbl>,
## # review_scores_rating <dbl>, review_scores_cleanliness <dbl>, ...
Note: I will deal with missing values by imputing in my recipes later.
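For reference, a minimal sketch of what recipe-based imputation could look like (using mean imputation as one option; the recipe name here is hypothetical, and I ultimately drop the NAs in the model building section instead):
impute_recipe <- recipe(price ~ ., data = airbnb) %>%
step_impute_mean(all_numeric_predictors()) # fill numeric NAs with the column mean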
I will use an 80-20 split for my training and testing data, respectively. I stratified on the number of guests each listing accommodates, as I feel that will be the biggest factor in predicting price, and stratifying splits each level proportionally.
set.seed(777)
airbnb_split <- initial_split(airbnb, prop = 0.8, strata = accommodates)
airbnb_train <- training(airbnb_split)
airbnb_test <- testing(airbnb_split)
The training data set has 26,662 observations and the testing set has 6,667 observations, for a total of 33,329 observations.
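These counts can be verified directly:
dim(airbnb_train) # rows and columns of the training set
dim(airbnb_test) # rows and columns of the testing set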
I will be using the training set for this section where each observation is a different listing.
I am going to start by looking at boxplots of a few variables whose outliers I am most worried about affecting the data set.
airbnb_train %>%
ggplot(aes(x = accommodates)) +
geom_boxplot()
airbnb_train %>%
ggplot(aes(x = price, y = factor(property_type))) + # factor() gives each type its own box
geom_boxplot()
## Warning: Removed 689 rows containing non-finite values (stat_boxplot).
airbnb_train %>%
ggplot(aes(x = bedrooms)) +
geom_boxplot()
## Warning: Removed 3033 rows containing non-finite values (stat_boxplot).
Looking at these boxplots, it seems we have some heavy outliers for price and a few for the number of people accommodated. Sure enough, once I went back and sorted the data by price and accommodates, I saw that there are a few listings priced at up to $25,000!
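A quick sketch of how these extremes can be surfaced by sorting the training data:
airbnb_train %>%
arrange(desc(price)) %>% # most expensive listings first
select(price, accommodates, bedrooms) %>%
head()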
The vast majority of listings, however, are under $500, so I will filter out these extreme observations from my training and testing sets now in order to continue my EDA and get a true sense of what my data looks like. Filtering on price conveniently filters out the outliers for other predictors as well, such as accommodates and the number of bedrooms.
airbnb_train <- airbnb_train %>%
filter(price < 500)
airbnb_test <- airbnb_test %>%
filter(price < 500)
airbnb_train <- airbnb_train %>%
filter(accommodates < 11)
airbnb_test <- airbnb_test %>%
filter(accommodates < 11)
dim(airbnb_train)
## [1] 24270 31
dim(airbnb_test)
## [1] 6059 31
airbnb_train now has 24,270 observations (a loss of about 2,200) and the testing set has 6,059 observations (a loss of about 550), so they still represent about 80% and 20% of the remaining 30,329 total observations.
airbnb_train %>%
ggplot(aes(x= price)) +
geom_boxplot()
This looks much better than before. We can also break price down by the number of people accommodated and by property type:
airbnb_train %>%
ggplot(aes(reorder(accommodates,price),price)) +
geom_boxplot() +
coord_flip() +
labs(
title = "Price by Number of People Accommodated",
x = "Max Number of People That Can be Accomodated"
)
airbnb_train %>%
ggplot(aes(reorder(property_type,price),price)) +
geom_boxplot() +
coord_flip() +
labs(
title = "Price by Property Type",
x = "Property Type 1 = entire unit/house, 2 = room only"
)
Here we see a strong correlation between price and the number of people that can be accommodated. Even for listings where the number of guests allowed is low, the price can go very high if you rent a more luxurious accommodation. Although Airbnb splits accommodation type into either an entire house/apartment or just a bedroom, the accommodations themselves can be anything from a regular suburban house in Inglewood to a glass dome in the middle of the Mojave desert!
We must also note that there is a significant amount of missing data for property type, which makes me hesitant to use it, even though a whole accommodation should clearly command a higher price than just a room.
Before moving forward, I also note that price is on a per-night basis. The data was collected straight from the Airbnb website, and every observation was collected on the same day. At the end of each stay the guest rates the accommodation on several things, such as overall experience, cleanliness, and value, all of which are included in the data set.
airbnb_train %>%
ggplot(aes(x = price)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
airbnb_train %>%
ggplot(aes(x = accommodates)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see that the median price lies around $100, the median number of people a listing can accommodate is about two, and both variables have a rightward (positive) skew.
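A quick numeric check of those medians:
summary(airbnb_train$price) # five-number summary of price
summary(airbnb_train$accommodates) # five-number summary of accommodates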
Now let's take a look at a correlation matrix between a few of the predictors:
airbnb_train %>%
select(host_since,host_response_time,host_is_superhost,accommodates,
price, number_of_reviews, review_scores_rating, property_type) %>%
cor(use = "complete.obs") %>%
corrplot(type = "lower", diag = FALSE)
Here we again see the relationship between price and accommodates. Interestingly, there is a negative correlation between the number of reviews and how recently a host joined, which makes sense: hosts who joined recently have not had as many guests stay in their accommodation. A superhost is someone who consistently goes above and beyond to meet guests' needs and is very engaged with them, which explains the correlation between superhost status and review rating.
Now, for my model building, I will be trying out the following models on my data: linear regression, a tuned random forest, and a tuned boosted tree.
I will organize my code into three parts: building the models, running/tuning them, and model analysis.
From the EDA I saw that there was a significant amount of missing data, so I looked back through the observations more carefully and found many missing values across the review-score variables. I have decided to drop all observations with NAs, as I should still have more than enough data to construct a model.
final_train <- airbnb_train %>% drop_na()
final_test <- airbnb_test %>% drop_na()
dim(final_train)
## [1] 9354 31
dim(final_test)
## [1] 2397 31
I assigned this data to a new set. This leaves 9,354 observations in the training set, down from 24,270, and 2,397 observations in the testing set, down from 6,059, still roughly an 80-20 split.
I can now start the model building process, but first I re-checked the price distribution to confirm it stayed the same:
airbnb_train %>%
ggplot(aes(x = price)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I will first set up a linear regression model, starting with a recipe:
airbnb_recipe <- recipe(price ~ . , data = final_train) %>% # predict price with all predictors
step_normalize(all_predictors()) # this centers and scales all variables
I do not believe there was a strong enough correlation between any of the predictors to warrant an interaction term, and I converted all true/false and categorical predictors into numeric codes during data cleaning, so there is no need for dummy coding.
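For reference, had the categorical predictors been left as factors, the recipe could dummy-code them instead; a minimal sketch (the recipe name here is hypothetical):
airbnb_recipe_alt <- recipe(price ~ ., data = final_train) %>%
step_dummy(all_nominal_predictors()) %>% # indicator columns for each factor level
step_normalize(all_numeric_predictors()) # then center and scale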
I now create a linear regression object and workflow:
lm_model <- linear_reg() %>%
set_engine("lm")
lm_wf <- workflow() %>%
add_recipe(airbnb_recipe) %>%
add_model(lm_model)
I now fit my model:
airbnb_fit <- fit(lm_wf, final_train)
I will now set up a random forest model and workflow:
set.seed(777)
rf_mod <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
rf_wf <- workflow() %>%
add_recipe(airbnb_recipe) %>%
add_model(rf_mod)
mtry is the number of predictors randomly sampled at each split of the tree models, so it cannot be lower than 1 (no predictors would be chosen) or greater than 30, since we have 30 predictors. If mtry = 30 the model reduces to bagging; a sketch of that case follows these definitions.
trees is the number of trees grown in each forest.
min_n is the minimum number of data points a node must contain in order to be split further.
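As mentioned, setting mtry to the full predictor count reduces the random forest to bagging; a sketch of that special case (not tuned or fit here, and the 500 trees is an arbitrary choice):
bagging_spec <- rand_forest(mtry = 30, trees = 500) %>% # mtry = all 30 predictors => bagging
set_engine("ranger") %>%
set_mode("regression")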
I will now set up a grid for each hyperparameter we are tuning and fold the data for cross-validation:
param_grid <- grid_regular(mtry(range = c(6,28)),
trees(range = c(2, 5)),
min_n(), levels = c(10,10,10))
# Fold
airbnb_fold <- vfold_cv(final_train, v = 5, strata = price) # 5-fold CV stratified on price
Now lets fit the tuned model:
set.seed(777)
tune_res_rf <- tune_grid(
rf_wf,
resamples = airbnb_fold,
grid = param_grid
)
Note: I ran and saved this model outside of the .Rmd in a separate R script and will just load it in during analysis, since tuning took about 15-20 minutes.
First we Set up the Boosted Tree Model and workflow:
boost_spec <- boost_tree(trees = tune(), tree_depth = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
boost_wf <- workflow() %>%
add_recipe(airbnb_recipe) %>%
add_model(boost_spec)
Tuning trees lets us try models with different numbers of trees, and tree_depth controls the number of splits allowed in each tree.
Next we set up the grid for trees and tree_depth:
boost_grid <- grid_regular(trees(range = c(50, 350)),
tree_depth(), levels = c(4,8))
Now fit the tuned model:
set.seed(777)
tune_res_boost <- tune_grid(
boost_wf,
resamples = airbnb_fold,
grid = boost_grid
)
Note: I ran and saved this model outside of the .Rmd in the same R script and will just load it in during analysis, since tuning took about 15 minutes.
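For reference, a sketch of the save step run in that script (assuming the two tuned objects are saved together into the file loaded below):
save(tune_res_rf, tune_res_boost, file = "tunedmodels.rda")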
I will first load in the models from my .rda file:
load(file = "tunedmodels.rda")
I will predict price on the training data and see how the linear model fares against the actual values:
multi_metric <- metric_set(rmse, rsq, mae)
airbnb_predict <- predict(airbnb_fit, final_train) %>%
bind_cols(final_train %>% select(price))
multi_metric(airbnb_predict, truth = price, estimate = .pred)
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 70.4
## 2 rsq standard 0.530
## 3 mae standard 51.2
Looking at the rmse and R^2 values, we see that the linear model is not great at predicting price. With an R^2 just above 0.5, the model only explains about 50% of the variance in the response around its mean, so about half the variation is not captured. The rmse of 70 is not good either, since most listings are priced in the low hundreds of dollars.
First, let's look at some performance metrics:
collect_metrics(tune_res_rf) %>%
arrange(mean)
## # A tibble: 800 x 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 6 2 2 rsq standard 0.456 5 0.00800 Preprocessor1_Model~
## 2 8 2 2 rsq standard 0.464 5 0.0102 Preprocessor1_Model~
## 3 28 2 2 rsq standard 0.466 5 0.00866 Preprocessor1_Model~
## 4 10 2 2 rsq standard 0.467 5 0.00428 Preprocessor1_Model~
## 5 25 2 2 rsq standard 0.469 5 0.00791 Preprocessor1_Model~
## 6 15 2 2 rsq standard 0.473 5 0.00760 Preprocessor1_Model~
## 7 13 2 2 rsq standard 0.474 5 0.00655 Preprocessor1_Model~
## 8 20 2 2 rsq standard 0.475 5 0.00559 Preprocessor1_Model~
## 9 8 2 6 rsq standard 0.475 5 0.00490 Preprocessor1_Model~
## 10 10 2 6 rsq standard 0.477 5 0.00893 Preprocessor1_Model~
## # ... with 790 more rows
We can select the best from these now:
show_best(tune_res_rf, metric = "rsq")
## # A tibble: 5 x 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 18 5 40 rsq standard 0.597 5 0.00425 Preprocessor1_Model3~
## 2 18 5 35 rsq standard 0.593 5 0.00478 Preprocessor1_Model3~
## 3 23 5 35 rsq standard 0.591 5 0.00664 Preprocessor1_Model3~
## 4 18 5 31 rsq standard 0.591 5 0.00130 Preprocessor1_Model3~
## 5 23 5 27 rsq standard 0.591 5 0.00108 Preprocessor1_Model2~
show_best(tune_res_rf, metric = "rmse")
## # A tibble: 5 x 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 18 5 40 rmse standard 65.2 5 0.343 Preprocessor1_Model3~
## 2 18 5 35 rmse standard 65.5 5 0.586 Preprocessor1_Model3~
## 3 8 5 27 rmse standard 65.7 5 0.266 Preprocessor1_Model2~
## 4 23 5 35 rmse standard 65.7 5 0.672 Preprocessor1_Model3~
## 5 18 5 31 rmse standard 65.7 5 0.261 Preprocessor1_Model3~
The best rmse value is 65.24286 and the best rsq value is 0.5969
We can also look at a graph of the performance:
autoplot(tune_res_rf)
From this graph it is clear that the more trees used, the better the rsq and rmse values, which is expected. However, the rsq value is still not great, suggesting we should tune over more trees. The model also performed best using around 15-20 randomly selected predictors.
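If I were to retune, a wider grid over trees might look like this sketch (the ranges and levels here are hypothetical):
param_grid_wide <- grid_regular(mtry(range = c(6, 28)),
trees(range = c(100, 1000)), # many more trees than the 2-5 tried above
min_n(), levels = 5)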
First, let's look at some performance metrics:
collect_metrics(tune_res_boost) %>%
arrange(mean)
## # A tibble: 64 x 8
## trees tree_depth .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 50 1 rsq standard 0.558 5 0.00560 Preprocessor1_Model01
## 2 150 1 rsq standard 0.582 5 0.00526 Preprocessor1_Model02
## 3 250 1 rsq standard 0.591 5 0.00519 Preprocessor1_Model03
## 4 50 15 rsq standard 0.593 5 0.00553 Preprocessor1_Model29
## 5 150 15 rsq standard 0.593 5 0.00553 Preprocessor1_Model30
## 6 250 15 rsq standard 0.593 5 0.00553 Preprocessor1_Model31
## 7 350 15 rsq standard 0.593 5 0.00553 Preprocessor1_Model32
## 8 350 1 rsq standard 0.596 5 0.00507 Preprocessor1_Model04
## 9 50 13 rsq standard 0.598 5 0.00477 Preprocessor1_Model25
## 10 150 13 rsq standard 0.598 5 0.00474 Preprocessor1_Model26
## # ... with 54 more rows
Let's select the best models from these:
show_best(tune_res_boost, metric = "rsq")
## # A tibble: 5 x 8
## trees tree_depth .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 250 3 rsq standard 0.656 5 0.00488 Preprocessor1_Model07
## 2 350 3 rsq standard 0.655 5 0.00397 Preprocessor1_Model08
## 3 50 5 rsq standard 0.654 5 0.00331 Preprocessor1_Model09
## 4 150 3 rsq standard 0.653 5 0.00520 Preprocessor1_Model06
## 5 150 5 rsq standard 0.650 5 0.00266 Preprocessor1_Model10
show_best(tune_res_boost, metric = "rmse")
## # A tibble: 5 x 8
## trees tree_depth .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 250 3 rmse standard 60.2 5 0.586 Preprocessor1_Model07
## 2 350 3 rmse standard 60.4 5 0.516 Preprocessor1_Model08
## 3 50 5 rmse standard 60.4 5 0.424 Preprocessor1_Model09
## 4 150 3 rmse standard 60.5 5 0.600 Preprocessor1_Model06
## 5 150 5 rmse standard 60.8 5 0.356 Preprocessor1_Model10
Looking at this, the best rsq value for the boosted tree model was 0.6561402 and the best rmse value was 60.23864. This is very close to the random forest model, but a bit better, so I will choose the boosted tree as my final model.
Let's look at a plot of the tuned parameters:
autoplot(tune_res_boost)
Here we can see that a shallow tree depth of 3 performs best, giving a spike in rsq and a decline in rmse. The number of trees did not affect performance as much as I thought it would; results are similar across tree counts, with 250 trees performing best at that depth.
I will now construct the final model using the best of the tuned boosted tree models:
best_boosted <- select_best(tune_res_boost, metric = 'rsq')
final_wf <- finalize_workflow(boost_wf, best_boosted)
final_fit <- fit(final_wf, final_train)
We can now see how well our final model did vs the testing set:
airbnb_metric <- metric_set(rsq)
model_test_predictions <- predict(final_fit, new_data = final_test) %>%
bind_cols(final_test %>% select(price))
model_test_predictions %>%
airbnb_metric(truth = price, estimate = .pred)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.648
augment(final_fit, new_data = final_test) %>%
rmse(truth = price, estimate = .pred)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 58.2
We get an rsq of 0.648, similar to the 0.6561402 we got on the training folds, and an rmse of 58.2, which is also close to (and a little better than) the previous rmse of 60.23864.
Overall, our models were not the best at predicting price, but the boosted tree model was the strongest, and when transferred to the testing data it performed about the same as on the training set. The best configuration used 250 trees with a shallow depth of 3; in boosting, deeper trees and ever more rounds tend toward over-fitting, which is consistent with depth mattering more here than the raw number of trees. This model likely outperformed the others because predicting price from so many predictors involves relationships too complicated for a plain lm model.
If I were to go back, I would tune over more trees for the random forest model or try a different model altogether, perhaps ridge regression. Overall, this was a great way to test different models on real-world data and navigate the errors and problems that come with it. While our models were not the best at predicting Airbnb prices, I feel I learned a lot about the data and how I could improve this project in the future.
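For instance, a ridge specification with the glmnet engine (already loaded above) might look like this sketch, with the penalty tuned over a grid just like the other models:
ridge_spec <- linear_reg(penalty = tune(), mixture = 0) %>% # mixture = 0 gives the ridge penalty
set_engine("glmnet")
ridge_wf <- workflow() %>%
add_recipe(airbnb_recipe) %>%
add_model(ridge_spec)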