We will begin by loading the following libraries: tidyverse, readr, data.table, icesTAF, caret, lubridate, glmnet, scales, stringr, dplyr, ggmap, ggcorrplot, treemapify, rpart, nnet, formatR, rmarkdown, and knitr with the “pacman” package. (If a package below is missing, p_load will automatically download it from CRAN).
if (!require(pacman)) install.packages("pacman", repos = "http://cran.us.r-project.org")
library(pacman)
pacman::p_load(tidyverse, readr, data.table, icesTAF, caret, lubridate, ggthemes,
ggplot2, glmnet, scales, stringr, dplyr, ggmap, ggcorrplot, treemapify, rpart,
nnet, formatR, rmarkdown, knitr)
# Download the Dataset:
if (!dir.exists("SAirbnb")) mkdir("SAirbnb")
if (!file.exists("./SAirbnb/seattle_01.csv")) download.file("https://raw.githubusercontent.com/christianmckinnon/Seattle-Airbnb-Listings/master/seattle_01.csv",
"./SAirbnb/seattle_01.csv")
# Read the Data:
suppressWarnings(airbnb <- read_csv("./SAirbnb/seattle_01.csv"))
# Set the number of significant digits to 4
options(digits = 4)
Check the dimensions of the dataset:
dim(airbnb)
## [1] 7576 18
There are 7576 observations and 18 features
str(airbnb)
## tibble [7,576 x 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ X1 : num [1:7576] 0 1 2 3 4 5 6 7 8 9 ...
## $ room_id : num [1:7576] 2318 3335 4291 5682 6606 ...
## $ host_id : num [1:7576] 2536 4193 35749 8993 14942 ...
## $ room_type : chr [1:7576] "Entire home/apt" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ address : chr [1:7576] "Seattle, WA, United States" "Seattle, WA, United States" "Seattle, WA, United States" "Seattle, WA, United States" ...
## $ reviews : num [1:7576] 21 1 63 462 134 130 401 35 36 76 ...
## $ overall_satisfaction: num [1:7576] 5 NA 4.5 5 4.5 4.5 5 5 5 4.5 ...
## $ accommodates : num [1:7576] 8 4 2 2 2 2 2 4 3 4 ...
## $ bedrooms : num [1:7576] 4 2 1 0 1 1 1 2 2 1 ...
## $ bathrooms : num [1:7576] 2.5 1 1 1 1 3 1 1 1 1 ...
## $ price : num [1:7576] 250 100 82 49 90 65 78 165 95 115 ...
## $ last_modified : POSIXct[1:7576], format: "2018-12-20 03:46:14" "2018-12-20 04:08:45" ...
## $ latitude : num [1:7576] 47.6 47.5 47.7 47.5 47.7 ...
## $ longitude : num [1:7576] -122 -122 -122 -122 -122 ...
## $ location : chr [1:7576] "0101000020E6100000D449B6BA9C925EC0416326512FCE4740" "0101000020E61000006FBBD05CA7915EC04DF564FED1C34740" "0101000020E6100000BDAB1E300F945EC0FB93F8DC09D84740" "0101000020E6100000FCC7427408975EC009E1D1C611C34740" ...
## $ name : chr [1:7576] "Casa Madrona - Urban Oasis, 1 block from the Park!" "Sweet Seattle Urban Homestead 2 Bdr" "Sunrise in Seattle Master Suite" "Cozy Studio, min. to downtown -WiFi" ...
## $ currency : chr [1:7576] "USD" "USD" "USD" "USD" ...
## $ rate_type : chr [1:7576] "nightly" "nightly" "nightly" "nightly" ...
## - attr(*, "spec")=
## .. cols(
## .. X1 = col_double(),
## .. room_id = col_double(),
## .. host_id = col_double(),
## .. room_type = col_character(),
## .. address = col_character(),
## .. reviews = col_double(),
## .. overall_satisfaction = col_double(),
## .. accommodates = col_double(),
## .. bedrooms = col_double(),
## .. bathrooms = col_double(),
## .. price = col_double(),
## .. last_modified = col_datetime(format = ""),
## .. latitude = col_double(),
## .. longitude = col_double(),
## .. location = col_character(),
## .. name = col_character(),
## .. currency = col_character(),
## .. rate_type = col_character()
## .. )
We note that there 18 features of numeric, character, and POSIXct classes. We also notice that there is already one visible NA value in the “overall_satisfaction” feature.
Coerce airbnb tibble into a data frame for later partitioning purposes:
airbnb <- as.data.frame(airbnb)
class(airbnb)
## [1] "data.frame"
sum(is.na(airbnb))
## [1] 1475
There are 1475 NAs in the dataset.
colSums(is.na(airbnb))
## X1 room_id host_id
## 0 0 0
## room_type address reviews
## 0 0 0
## overall_satisfaction accommodates bedrooms
## 1473 0 0
## bathrooms price last_modified
## 2 0 0
## latitude longitude location
## 0 0 0
## name currency rate_type
## 0 0 0
We find that “overall_satisfaction” has 1473 NAs while bathrooms has 2.
The visualization confirms there are only two features with missing values and that the all other observations are present.
After confirming that there are only 2 features with NAs, (overall_satisfaction & bathrooms) we proceed to clean the dataset. As overall_satisfaction has 1473 NAs, assigning a value of 0 will significantly skew the ratings negatively. Therefore we will fill the values with the mean to provide for more accurate predictive pricing in our models.
Convert NAs to the mean value.
airbnb$overall_satisfaction[is.na(airbnb$overall_satisfaction)] <- mean(airbnb$overall_satisfaction,
na.rm = TRUE)
The mean is roughly 4.84.
mean(airbnb$overall_satisfaction)
## [1] 4.841
Confirm the absence of NAs in this feature:
head(airbnb$overall_satisfaction)
## [1] 5.000 4.841 4.500 5.000 4.500 4.500
For the bathrooms feature, there are only 2 NAs and so we set them to zero.
airbnb <- airbnb %>% replace_na(list(bathrooms = 0))
Now confirm the absence of any NAs in the dataset:
sum(is.na(airbnb))
## [1] 0
names(airbnb)
## [1] "X1" "room_id" "host_id"
## [4] "room_type" "address" "reviews"
## [7] "overall_satisfaction" "accommodates" "bedrooms"
## [10] "bathrooms" "price" "last_modified"
## [13] "latitude" "longitude" "location"
## [16] "name" "currency" "rate_type"
There are 18 features and those less related to price prediction will be dropped to refine our EDA and Modeling Focus:
head(airbnb$X1)
## [1] 0 1 2 3 4 5
“X1” is simply a numerical list for the dataset.
head(airbnb$room_id)
## [1] 2318 3335 4291 5682 6606 9419
head(airbnb$host_id)
## [1] 2536 4193 35749 8993 14942 30559
“room_id” & “host_id” are simply numbers arbitrarily assigned to identify rooms and hosts.
Check for unique values of the feature “address”
airbnb %>% select(address) %>% distinct()
Note that there are 27 values with different formats and 12 repeated instances of “Seattle.” Any neighborhood of Seattle, the Chinese language version of “Seattle” and listings with only the State of Washington, will all be converted to Seattle. “WA” and “United States” will also be removed as they are redundant.
Replace the Chinese version of “Seattle” separately using regex:
Reassign the column to the feature “address”:
airbnb$address <- gsub("Seattle, United States", "Seattle", gsub("Seattle United States",
"Seattle", address_clean2))
Now confirm there are only 14 different cities and sort by the greatest numbers of listings:
###: Let’s explore a data visualization to confirm this:
Note: As the remaining locations are all cities, the feature “address” will later be renamed “city.”
It is clear the vast majority of listings are in Seattle (6791).
The “last_modified” feature refers to the date a listing was updated and nearly all values occur on 2018-12-20, telling us very little about the data, therefore this feature will be removed.
head(airbnb$last_modified)
## [1] "2018-12-20 03:46:14 UTC" "2018-12-20 04:08:45 UTC"
## [3] "2018-12-20 03:04:19 UTC" "2018-12-20 04:11:25 UTC"
## [5] "2018-12-20 03:12:38 UTC" "2018-12-20 04:08:20 UTC"
The “location” feature will be removed in favor of using “latitude” & “longitude.”
head(airbnb$location)
## [1] "0101000020E6100000D449B6BA9C925EC0416326512FCE4740"
## [2] "0101000020E61000006FBBD05CA7915EC04DF564FED1C34740"
## [3] "0101000020E6100000BDAB1E300F945EC0FB93F8DC09D84740"
## [4] "0101000020E6100000FCC7427408975EC009E1D1C611C34740"
## [5] "0101000020E6100000D47D00529B955EC07782FDD7B9D34740"
## [6] "0101000020E6100000145D177E70945EC0522B4CDF6BC64740"
The “name” feature will be removed as it is a categorical description of each listing.
head(airbnb$name)
## [1] "Casa Madrona - Urban Oasis, 1 block from the Park!"
## [2] "Sweet Seattle Urban Homestead 2 Bdr"
## [3] "Sunrise in Seattle Master Suite"
## [4] "Cozy Studio, min. to downtown -WiFi"
## [5] "Fab, private seattle urban cottage!"
## [6] "Glorious sun room w/ memory foambed"
The “currency” feature will be dropped as all rates are in US Dollars.
airbnb %>% select(currency) %>% distinct()
Check for distinct values of “rate_type”:
airbnb %>% select(rate_type) %>% distinct()
After confirming only one unique value, “nightly,” we determine this feature can be removed.
Remove the above mentioned features and rename the columns:
Reorder the columns:
Confirm the features have been tidied and reordered with only 10 features:
names(airbnb)
## [1] "price" "city" "rating" "reviews_sum" "room_type"
## [6] "bedrooms" "bathrooms" "accommodates" "latitude" "longitude"
Check the first few values of the cleaned dataset:
head(airbnb)
Now we will begin analyzing the features of our dataset to inform our price prediction modeling approach.
Remove non-numeric features:
airbnb_num <- airbnb %>% select(-c(city, room_type))
Create the correlation matrix:
airbnb_cor <- cor(airbnb_num)
Plot the Correlogram:
We discover the price is moderately correlated with the number of people a listing can accommodate as well as the number of bedrooms and bathrooms. Surprisingly, there is little correlation between the location (latitude & longitude) and price. This relationship will be explored with a scatterplot.
Next we explore a plot of the price distribution below $300 to determine the most common price ranges.
The plot reveals a significant portion of the prices are below $100/night.
Let’s use the ggmap package to load a map of Seattle and visualize which areas are more expensive than others.
Create the map using stamenmap:
Note that the quantiles and price range will influence the pricing scale color:
summary(airbnb$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15 65 88 113 125 5900
According to the above summary, we know that Q3 of the IQR was $125, so let’s determine what percentage of prices are less than or equal to $300.
quantile(airbnb$price)
## 0% 25% 50% 75% 100%
## 15 65 88 125 5900
sum(airbnb$price <= 300)/length(airbnb$price)
## [1] 0.9652
~ 96.5% of listings are <= $300, therefore we filter our price to remove outliers.
Filter the dataframe airbnb_map:
airbnb_map <- airbnb %>% filter(price <= 300)
It does not appear that one area is significantly more expensive than another and that most prices are less than $1000, though outliers may be skewing our data. As such, let’s attempt to refine our heatmap by filtering the data.
The visualization suggests the vast majority of listings in Seattle are between USD 50 - 150 / night and are relatively concentrated in Seattle proper (as opposed to Newcastle, Mercer Island, or Bellevue).
Confirm the percentage of listings between USD 50 - 150 per night:
sum(airbnb$price >= 50 & airbnb$price <= 150)/length(airbnb$price)
## [1] 0.6998
Nearly 70% of listings range from USD 50 - 150/night.
Arrange the cities by listing_sum in a dataframe for the Treemap. Plot the Treemap to visualize the distribution of listings by city:
The Treemap confirms the overwhelming number of listings in Seattle compared to the other cities.
Plot the Visualization:
According to this visualization, it would seem location (latitude & longitude) is highly correlated with price. Let’s explore why this was not the case in our correlogram. Let’s replot the visualization with the sum of listings included.
Plot the visualization with Mean Price & Percentage of Total Listings:
We learn that over 89% of listings are in Seattle with a mean price of $112/night. The top 7 listings with the highest average prices barely equal 1% (~0.8%) of the total listings though their average price is $226.10/night. The lower percentage of total listings in these locations at the higher price range, with almost 90% of listings in Seattle likely explains the lower correlation between price and location in this dataset.
The “rating” feature is based on a numerical rating from 0 to 5 with a mean of 4.841. As this is a relatively high mean rating, we will explore its relationship mean price by city to determine its price prediction potential.
Create the dataframe rating_comp to compare mean price and rating. Set the parameters for the dual-axis plot. Plot the Barplot (Rating) with Overlapping Line (Price in RED):
Check the range of mean ratings by city:
range(rating_comp$mean_rating)
## [1] 4.796 5.000
We notice that due to the tight range of mean ratings by city, there is a very low correlation between price and rating. For example, a high price does not guarantee a higher rating as the most expensive city Bothell does not have the highest rating. Furthermore, cities with lower average prices tend to have mean ratings higher than Bothell.
Now let’s compare the total number of reviews and different price ranges:
The visualization shows us that there are generally fewer reviews for listings with higher prices (above ~$500). We can infer that this is because there are fewer stays at these higher-priced listings.
There are 3 different room types: Entire home / apartment, Private Room, & Shared Room. Let’s explore the relationship between room type and average price:
Bathrooms range in quantity from 0 to 8 in increments of 0.5 per listing. Let’s explore a visualization that delineates the mean price per listing based on the number of bathrooms available:
We notice that generally, the higher the number of bathrooms per listing, the higher the price. Interestingly, listings with 0 bathrooms actually had a higher average price than listings with 0.5 - 1.5, and 8 bathrooms! It is clear that a listings with 8 bathrooms having a lower mean price than those with one bathroom is suspect and most certainly an outlier.
Now let’s visualize a barplot with the total number of listings per bathroom count:
It is evident that the vast majority of listings (~74%) have only 1 bathroom.
The bedrooms feature details the number of bedrooms available per listing and has been shown in our correlogram to be positively correlated with the price per listing.
Let’s explore a visualization that details the mean price per listing versus the number of beds available:
The visualization confirms the moderately positive correlation between price and number of bedrooms. This feature is very likely to contribute to the accuracy of our price prediction models.
Now let’s explore listing quantity by number of bedrooms:
1 bedroom listings lead the distribution with 4,282 listings (56.5%). As a result of the above findings, there is a high probability of a guest renting a 1 bedroom, 1 bath unit.
Generally, we might predict that the mean price of a listing per night would increase if it could provide space for more guests.
Let’s explore this idea with a visualization:
The visualization confirms there is a general increase in average price as the number of guests able to be accommodated increases.
Finally, let’s check the number of people most units can accommodate:
The majority of listings can accommodate 4 or fewer guests with 2 guests being the most popular.
The correlogram starts us off by highlighting important relationships between price, bedrooms, bathrooms, and accommodates which are later confirmed by visualizations and quantitative analysis. Additionally, we learn location (using latitude and longitude), rating, and reviews sum all have certain potential to increase the accuracy of our predictive models if included in our formula. “Room_type”, an important feature, could be vectorized into a quantitative scale, though for the purposes of the following regression models, will be revisited in our conclusion.
Partition Airbnb Combined & Test Sets for the Final Model:
airbnb_combined will be used to fit the Final Model as it is equal to the sum of the training and validation sets and equal to 90% of the total data. airbnb_test, comprised of the remaining 10% of the data, will be used as the test set for the final model.
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
test_index <- createDataPartition(y = airbnb$price, times = 1, p = 0.1, list = F)
airbnb_combined <- airbnb[-test_index, ]
airbnb_test <- airbnb[test_index, ]
# Remove test_index:
rm(test_index)
Split Training and Validation Sets from airbnb_combined to train our models: airbnb_train will constitute 80% and validation the remaining 20% of airbnb_combined.
set.seed(123, sample.kind = "Rounding")
test_index <- createDataPartition(y = airbnb_combined$price, times = 1, p = 0.2,
list = F)
airbnb_train <- airbnb_combined[-test_index, ]
validation <- airbnb_combined[test_index, ]
# Remove test_index once again:
rm(test_index)
Results will be based on calculating the Root Mean Square Error on the test set. The RMSE weighs large errors more heavily and tends to be the preferred measurement of error in regression models when large errors are undesirable.
Mathematically, it is defined as:
\[\sqrt{\frac{1}{N}\sum_{u,i}(\hat{y}_{u,i}-y_{u,i})^2}\]
RMSE <- function(true_ratings, predicted_ratings) {
sqrt(mean((true_ratings - predicted_ratings)^2))
}
This model will use the median price of the training set to calculate a “baseline” RMSE as a benchmark. The formula can be defined as follows with \(y_{u,i}\) = predicted rating, M = the median of the observed data, \(\epsilon_{u,i}\) = independent errors centered at 0.
\[y_{u,i} = M + \epsilon_{u,i}\]
airbnb_train_median <- median(airbnb_train$price)
# Table the Results
MM_RMSE <- RMSE(validation$price, airbnb_train_median)
results_table <- tibble(Model_Type = "Baseline Median", RMSE = MM_RMSE) %>% mutate(RMSE = sprintf("%0.2f",
RMSE))
knitr::kable(results_table)
| Model_Type | RMSE |
|---|---|
| Baseline Median | 99.05 |
The Baseline Model achieves an RMSE of 99.05.
The formula has been determined by the above EDA as well as experimentation on the training models. Additionally, the formula will reduce lines of code.
airbnb_form <- price ~ rating + reviews_sum + bedrooms + bathrooms + accommodates +
latitude + longitude
With linear regression, we attempt to predict a (dependent) \(y\) variable, in this case “price”, with the(independent) input \(x\) variables: rating + reviews_sum + bedrooms + bathrooms + accommodates + latitude + longitude in order to build a statistically significant model with a p-value less than .05.
lm_airbnb <- lm(airbnb_form, data = airbnb_train)
# Create the prediction
lm_preds <- predict(lm_airbnb, validation)
# Table the Results
LM_RMSE <- RMSE(validation$price, lm_preds)
results_table <- tibble(Model_Type = c("Baseline Median", "Linear"), RMSE = c(MM_RMSE,
LM_RMSE)) %>% mutate(RMSE = sprintf("%0.2f", RMSE))
knitr::kable(results_table)
| Model_Type | RMSE |
|---|---|
| Baseline Median | 99.05 |
| Linear | 73.80 |
The Linear Model vastly improves upon the Baseline Model with an RMSE of 73.8.
This model introduces regularization on a generalized linear model using penalized maximum likelihood and tuning optimal alpha and lambda parameters unlike Lasso and Ridge Regression Models. It combines advantages of both models by penalizing by the L1 & L2-norms. Mathematically, the estimates can be defined as follows:
\[\hat{\beta} \equiv \underset{\beta}{\operatorname{argmin}} (\| y-X \beta \|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \|\beta\|_1)\]
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_enr <- train(airbnb_form, data = airbnb_train, method = "glmnet", preProcess = c("center",
"scale"), tuneLength = 10, trace = F)
# Confirm the optimal alpha and lambda parameters
train_enr$bestTune
# alpha = 0.1, and lambda = 21.24
# Create the Prediction:
elastic_preds <- predict(train_enr, validation)
# Table the Results
ENR_RMSE <- RMSE(validation$price, elastic_preds)
results_table <- tibble(Model_Type = c("Baseline Median", "Linear", "Elastic Net Regression"),
RMSE = c(MM_RMSE, LM_RMSE, ENR_RMSE)) %>% mutate(RMSE = sprintf("%0.2f", RMSE))
knitr::kable(results_table)
| Model_Type | RMSE |
|---|---|
| Baseline Median | 99.05 |
| Linear | 73.80 |
| Elastic Net Regression | 74.06 |
The Elastic Net with Regression underperforms the Linear Model with an RMSE of 74.06.
Regression trees or regression decision trees partition data recursively into smaller, more simplistic partitions where specific models can be applied using conditional statements on the predictors. It is generally not as robust as models with bootstrap aggregation or Random Forest Models.
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_rpart <- train(airbnb_form, method = "rpart", data = airbnb_train, tuneGrid = data.frame(cp = seq(0,
0.05, len = 25)), preProcess = c("center", "scale"))
# Check the bestTune to find the final complexity parameter used for the model:
train_rpart$bestTune
# The final cp is 0.00625.
# Create the Prediction:
rt_preds <- predict(train_rpart, validation)
# Table the Results:
RT_RMSE <- RMSE(validation$price, rt_preds)
results_table <- tibble(Model_Type = c("Baseline Median", "Linear", "Elastic Net Regression",
"Regression Tree"), RMSE = c(MM_RMSE, LM_RMSE, ENR_RMSE, RT_RMSE)) %>% mutate(RMSE = sprintf("%0.2f",
RMSE))
knitr::kable(results_table)
| Model_Type | RMSE |
|---|---|
| Baseline Median | 99.05 |
| Linear | 73.80 |
| Elastic Net Regression | 74.06 |
| Regression Tree | 76.03 |
The Regression Tree model underperforms the both the Linear Model and the Elastic Net Regression Model with an RMSE of 76.03.
Random Forests are similar to the Regression Tree model, though they average several random subsets of decision trees in order improve predictive performance.
# Set the tuneGrid parameters:
rf_tune <- expand.grid(.mtry = c(1:3))
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_rf <- train(airbnb_form, data = airbnb_train, method = "rf", ntree = 150, tuneGrid = rf_tune,
nSamp = 1000, preProcess = c("center", "scale"))
# Check the bestTune:
train_rf$bestTune
# The bestTune is a mtry of 1
# Create the Prediction:
rf_preds <- predict(train_rf, validation)
# Table the Results
RF_RMSE <- RMSE(validation$price, rf_preds)
results_table <- tibble(Model_Type = c("Baseline Median", "Linear", "Elastic Net Regression",
"Regression Tree", "Random Forest"), RMSE = c(MM_RMSE, LM_RMSE, ENR_RMSE, RT_RMSE,
RF_RMSE)) %>% mutate(RMSE = sprintf("%0.2f", RMSE))
knitr::kable(results_table)
| Model_Type | RMSE |
|---|---|
| Baseline Median | 99.05 |
| Linear | 73.80 |
| Elastic Net Regression | 74.06 |
| Regression Tree | 76.03 |
| Random Forest | 66.57 |
The Random Forest significantly improves upon the all previous Models with an RMSE of 66.57.
Bagging, similar to Random Forest Models, uses aggregation to create more robust predictions. It is an ensemble method that improves stability over classic regression tree models while attempting to minimize overfitting.
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_bag <- train(airbnb_form, data = airbnb_train, method = "treebag", importance = T,
tuneLength = 10, preProcess = c("center", "scale"))
# Create the Prediction
bag_preds <- predict(train_bag, validation)
# Table the Results
BAG_RMSE <- RMSE(validation$price, bag_preds)
results_table <- tibble(Model_Type = c("Baseline Median", "Linear", "Elastic Net Regression",
"Regression Tree", "Random Forest", "BAG"), RMSE = c(MM_RMSE, LM_RMSE, ENR_RMSE,
RT_RMSE, RF_RMSE, BAG_RMSE)) %>% mutate(RMSE = sprintf("%0.2f", RMSE))
knitr::kable(results_table)
| Model_Type | RMSE |
|---|---|
| Baseline Median | 99.05 |
| Linear | 73.80 |
| Elastic Net Regression | 74.06 |
| Regression Tree | 76.03 |
| Random Forest | 66.57 |
| BAG | 68.98 |
The BAG Model underperforms the Random Forest Model with an RMSE of 68.98.
k Nearest Neighbors uses the parameter \(k\) to tune a weighted average of “neighbors” that is calculated by an algorithm that measures Euclidean or Mahalanobis distance. It is often used in pattern recognition and a popular modeling technique for both classification and regression prediction.
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_knn <- train(airbnb_form, method = "knn", data = airbnb_train, tuneLength = 5,
preProcess = c("center", "scale"))
# Find best value for k:
train_knn$bestTune
# k = 13 is the final value used for the model.
# Create the Prediction
knn_preds <- predict(train_knn, validation)
# Table the Results
kNN_RMSE <- RMSE(validation$price, knn_preds)
results_table <- tibble(Model_Type = c("Baseline Median", "Linear", "Elastic Net Regression",
"Regression Tree", "Random Forest", "BAG", "kNN"), RMSE = c(MM_RMSE, LM_RMSE,
ENR_RMSE, RT_RMSE, RF_RMSE, BAG_RMSE, kNN_RMSE)) %>% mutate(RMSE = sprintf("%0.2f",
RMSE))
knitr::kable(results_table)
| Model_Type | RMSE |
|---|---|
| Baseline Median | 99.05 |
| Linear | 73.80 |
| Elastic Net Regression | 74.06 |
| Regression Tree | 76.03 |
| Random Forest | 66.57 |
| BAG | 68.98 |
| kNN | 69.94 |
The kNN Model underperforms both the RF and BAG CV Models with an RMSE of 69.94.
ANNs or artificial neural networks are algorithms modeled on the human brain, that generally utilize one to three “nodes” to cluster raw inputs and predict real-world data. It is known as a “universal approximator” due to its ability to accurately determine correlations and event probabilities. Mathematically, ANNs assign weights to inputs that reduce noise and amplify predictive power, which are then passed through Net input and activation functions before being released as the output.
# Create the tuneGrid parameters:
NN_grid <- expand.grid(size = c(1, 5, 20), decay = c(0, 0.01, 0.1))
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_NN <- train(airbnb_form, data = airbnb_train, method = "nnet", linout = T,
trace = F, tuneGrid = NN_grid, preProc = c("center", "scale"))
# Check bestTune:
train_NN$bestTune
# The optimal size is 5 and decay = 0.01
# Create the Prediction:
NN_preds <- predict(train_NN, validation)
# Table the Results:
NN_RMSE <- RMSE(validation$price, NN_preds)
results_table <- tibble(Model_Type = c("Baseline Median", "Linear", "Elastic Net Regression",
"Regression Tree", "Random Forest", "BAG", "kNN", "Neural Net"), RMSE = c(MM_RMSE,
LM_RMSE, ENR_RMSE, RT_RMSE, RF_RMSE, BAG_RMSE, kNN_RMSE, NN_RMSE)) %>% mutate(RMSE = sprintf("%0.2f",
RMSE))
knitr::kable(results_table)
| Model_Type | RMSE |
|---|---|
| Baseline Median | 99.05 |
| Linear | 73.80 |
| Elastic Net Regression | 74.06 |
| Regression Tree | 76.03 |
| Random Forest | 66.57 |
| BAG | 68.98 |
| kNN | 69.94 |
| Neural Net | 71.41 |
The Neural Net Model underperforms all models except for the Baseline Linear & Regression Tree Models with an RMSE of 71.41.
# Set the mtry to 1 as determined from the previous tuning on airbnb_train:
tune_grid_rf <- expand.grid(mtry = 1)
# Set the seed for reproducibility:
set.seed(123, sample.kind = "Rounding")
train_rf_final <- train(airbnb_form, data = airbnb_combined, method = "rf", ntree = 150,
tuneGrid = tune_grid_rf, nSamp = 1000, preProcess = c("center", "scale"))
# Create the Prediction:
rf_preds_final <- predict(train_rf_final, airbnb_test)
# Table the Results
RFF_RMSE <- RMSE(airbnb_test$price, rf_preds_final)
results_table <- tibble(Model_Type = c("Baseline Median", "Linear", "Elastic Net Regression",
"Regression Tree", "Random Forest", "BAG", "kNN", "Neural Net", "Random Forest Final (Test Set)"),
RMSE = c(MM_RMSE, LM_RMSE, ENR_RMSE, RT_RMSE, RF_RMSE, BAG_RMSE, kNN_RMSE, NN_RMSE,
RFF_RMSE)) %>% mutate(RMSE = sprintf("%0.2f", RMSE))
| Model_Type | RMSE |
|---|---|
| Baseline Median | 99.05 |
| Linear | 73.80 |
| Elastic Net Regression | 74.06 |
| Regression Tree | 76.03 |
| Random Forest | 66.57 |
| BAG | 68.98 |
| kNN | 69.94 |
| Neural Net | 71.41 |
| Random Forest Final (Test Set) | 71.43 |
Bradley Boehmke, “Regression Trees”
https://uc-r.github.io/regression_trees
Elastic net regularization, https://en.wikipedia.org/wiki/Elastic_net_regularization
Irizarry, Rafael A., “Introduction to Data Science: Data Analysis and Prediction Algorithms in R” https://rafalab.github.io/dsbook/
Pathmind, “A Beginner’s Guide to Neural Networks and Deep Learning” https://pathmind.com/wiki/neural-network