Airbnb is one of the world’s most innovative online marketplaces for offering and arranging lodging, primarily homestays, or tourism experiences. Founded in 2008 by Brian Chesky, Joe Gebbia, and Nathan Blecharczyk, Airbnb has facilitated millions of hospitality entrepreneurs monetize their spaces and their passions while keeping the financial benefits of tourism in their communities. Airbnb market is well blooming in London which had more than 80,000 listings as of June 2020.
The objective of this project is to understand the factors that influence the prices listed on Airbnb in London. It derives from curiosity about specific factors that cause the rental prices for leasing short-term lodging seems easily changed over time.
Initially, we begin by loading the packages that will be required throughout the course of the analysis.
library(tidyverse)
# library(tidyr)
# library(dplyr)
# library(ggplot2)
library(lubridate)
library(caret)
library(data.table)
library(DT)
library(kableExtra)
library(knitr)
library(glue)
library(scales)
library(plotly)
library(leaflet)
library(rgdal)The descriptions of the packages are in the table below.
| Packages | Description |
|---|---|
| tidyverse | Collection of R packages (tidyr, dplyr, ggplot2) designed for data science that works harmoniously with other packages |
| tidyr | Changing the layout of the data sets, to convert data into a tidy format |
| dplyr | For data manipulation |
| ggplot2 | Customizable graphical representation |
| lubridate | Dates formatting |
| caret | For data Pre-Processing and Feature Selection |
| data.table | For data manipulation that can be reducing programming and compute time tremendously |
| DT | An R interface to the DataTables library |
| kableExtra | Styling an Interactive Data Tables within Markdown |
| knitr | A general-purpose tool for dynamic report generation |
Now, let’s load the dataset into the R-Environment.
This project uses the data from Get the Data tab in insideairbnb website. The insideairbnb is a database website built by Murray Cox and designed by John Morris that collected all their dataset from the Airbnb website.
This project uses the dataset that contains Airbnb listings in London, United Kingdom. The time series dataset records the information of more than 80,000 listings as of 11 June 2020 (downloaded in August 2020).
The dataset is in .csv format, so we will useread.csv method to read the dataset.
After importing the dataset, Let’s take a peek of our dataset!
The dataset has 83,711 rows and 106 columns.
## Rows: 83,711
## Columns: 106
## $ id <dbl> 11551, 13913, 15400, 1...
## $ listing_url <chr> "https://www.airbnb.co...
## $ scrape_id <dbl> 2.020061e+13, 2.020061...
## $ last_scraped <date> 2020-06-16, 2020-06-1...
## $ name <chr> "Arty and Bright Londo...
## $ summary <chr> "Unlike most rental ap...
## $ space <chr> "Amenities Bedding: 1 ...
## $ description <chr> "Unlike most rental ap...
## $ experiences_offered <chr> "family", "business", ...
## $ neighborhood_overview <chr> "Not even 10 minutes b...
## $ notes <chr> "No Smoking (very stri...
## $ transit <chr> "Tons of buses (24hrs)...
## $ access <chr> "Guest will have acces...
## $ interaction <chr> "No interaction with g...
## $ house_rules <chr> "No Smoking (very stri...
## $ thumbnail_url <lgl> NA, NA, NA, NA, NA, NA...
## $ medium_url <lgl> NA, NA, NA, NA, NA, NA...
## $ picture_url <chr> "https://a0.muscache.c...
## $ xl_picture_url <lgl> NA, NA, NA, NA, NA, NA...
## $ host_id <dbl> 43039, 54730, 60302, 6...
## $ host_url <chr> "https://www.airbnb.co...
## $ host_name <chr> "Adriano", "Alina", "P...
## $ host_since <date> 2009-10-03, 2009-11-1...
## $ host_location <chr> "London, England, Unit...
## $ host_about <chr> "Hello, I'm a friendly...
## $ host_response_time <chr> "within an hour", "N/A...
## $ host_response_rate <chr> "100%", "N/A", "100%",...
## $ host_acceptance_rate <chr> "100%", "89%", "83%", ...
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, F...
## $ host_thumbnail_url <chr> "https://a0.muscache.c...
## $ host_picture_url <chr> "https://a0.muscache.c...
## $ host_neighbourhood <chr> "Brixton", "LB of Isli...
## $ host_listings_count <dbl> 3, 4, 1, 18, 3, 3, 2, ...
## $ host_total_listings_count <dbl> 3, 4, 1, 18, 3, 3, 2, ...
## $ host_verifications <chr> "['email', 'phone', 'r...
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE...
## $ host_identity_verified <lgl> TRUE, TRUE, TRUE, TRUE...
## $ street <chr> "London, United Kingdo...
## $ neighbourhood <chr> "Brixton", "LB of Isli...
## $ neighbourhood_cleansed <chr> "Lambeth", "Islington"...
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA...
## $ city <chr> "London", "Islington",...
## $ state <chr> NA, "Greater London", ...
## $ zipcode <chr> "SW9 8DG", "N4 3", "SW...
## $ market <chr> "London", "London", "L...
## $ smart_location <chr> "London, United Kingdo...
## $ country_code <chr> "GB", "GB", "GB", "GB"...
## $ country <chr> "United Kingdom", "Uni...
## $ latitude <dbl> 51.46225, 51.56802, 51...
## $ longitude <dbl> -0.11732, -0.11121, -0...
## $ is_location_exact <lgl> TRUE, TRUE, TRUE, FALS...
## $ property_type <chr> "Apartment", "Apartmen...
## $ room_type <chr> "Entire home/apt", "Pr...
## $ accommodates <dbl> 4, 2, 2, 6, 2, 2, 2, 5...
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 2.0, 1....
## $ bedrooms <dbl> 1, 1, 1, 3, 1, 1, 1, 3...
## $ beds <dbl> 3, 0, 1, 3, 1, 0, 1, 3...
## $ bed_type <chr> "Real Bed", "Real Bed"...
## $ amenities <chr> "{TV,\"Cable TV\",Inte...
## $ square_feet <dbl> NA, 538, NA, NA, NA, N...
## $ price <chr> "$88.00", "$65.00", "$...
## $ weekly_price <chr> "$645.00", "$333.00", ...
## $ monthly_price <chr> "$2,350.00", "$1,176.0...
## $ security_deposit <chr> "$400.00", "$100.00", ...
## $ cleaning_fee <chr> "$35.00", "$15.00", "$...
## $ guests_included <dbl> 2, 1, 2, 3, 1, 1, 1, 1...
## $ extra_people <chr> "$20.00", "$15.00", "$...
## $ minimum_nights <dbl> 2, 1, 10, 3, 3, 10, 1,...
## $ maximum_nights <dbl> 180, 29, 50, 365, 21, ...
## $ minimum_minimum_nights <dbl> 2, 1, 10, 3, 2, 10, 1,...
## $ maximum_minimum_nights <dbl> 2, 1, 10, 4, 3, 10, 1,...
## $ minimum_maximum_nights <dbl> 1125, 29, 50, 60, 21, ...
## $ maximum_maximum_nights <dbl> 1125, 29, 50, 365, 21,...
## $ minimum_nights_avg_ntm <dbl> 2.0, 1.0, 10.0, 3.2, 3...
## $ maximum_nights_avg_ntm <dbl> 1125.0, 29.0, 50.0, 29...
## $ calendar_updated <chr> "3 months ago", "14 mo...
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE...
## $ availability_30 <dbl> 12, 12, 0, 9, 11, 13, ...
## $ availability_60 <dbl> 42, 42, 0, 39, 41, 43,...
## $ availability_90 <dbl> 72, 72, 13, 69, 71, 73...
## $ availability_365 <dbl> 347, 347, 288, 326, 34...
## $ calendar_last_scraped <date> 2020-06-16, 2020-06-1...
## $ number_of_reviews <dbl> 192, 21, 89, 42, 0, 12...
## $ number_of_reviews_ltm <dbl> 9, 5, 4, 2, 0, 8, 1, 1...
## $ first_review <date> 2010-03-21, 2010-08-1...
## $ last_review <date> 2020-03-26, 2020-02-2...
## $ review_scores_rating <dbl> 91, 97, 96, 94, NA, 96...
## $ review_scores_accuracy <dbl> 9, 10, 10, 10, NA, 10,...
## $ review_scores_cleanliness <dbl> 9, 10, 10, 9, NA, 10, ...
## $ review_scores_checkin <dbl> 10, 10, 10, 9, NA, 10,...
## $ review_scores_communication <dbl> 10, 10, 10, 9, NA, 10,...
## $ review_scores_location <dbl> 9, 9, 10, 10, NA, 9, 1...
## $ review_scores_value <dbl> 9, 9, 9, 9, NA, 10, 8,...
## $ requires_license <lgl> FALSE, FALSE, FALSE, F...
## $ license <lgl> NA, NA, NA, NA, NA, NA...
## $ jurisdiction_names <lgl> NA, NA, NA, NA, NA, NA...
## $ instant_bookable <lgl> TRUE, FALSE, TRUE, TRU...
## $ is_business_travel_ready <lgl> FALSE, FALSE, FALSE, F...
## $ cancellation_policy <chr> "strict_14_with_grace_...
## $ require_guest_profile_picture <lgl> FALSE, FALSE, TRUE, FA...
## $ require_guest_phone_verification <lgl> TRUE, FALSE, TRUE, FAL...
## $ calculated_host_listings_count <dbl> 2, 3, 1, 15, 2, 3, 2, ...
## $ calculated_host_listings_count_entire_homes <dbl> 2, 1, 1, 15, 0, 0, 1, ...
## $ calculated_host_listings_count_private_rooms <dbl> 0, 2, 0, 0, 2, 3, 1, 0...
## $ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0...
## $ reviews_per_month <dbl> 1.54, 0.18, 0.70, 0.37...
The dataset we just imported is a raw dataset.
For the Explanatory Data purpose, we need to remove or modify data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.
Let’s see how many missing values in each column.
# Counting missing values in each column
numMissVal <-sapply(raw.data, function(x) sum(length(which(is.na(x)))))
# Result table
kable(as.data.frame(numMissVal)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width = "100%", height = "250px")| numMissVal | |
|---|---|
| id | 0 |
| listing_url | 0 |
| scrape_id | 0 |
| last_scraped | 0 |
| name | 25 |
| summary | 4864 |
| space | 25948 |
| description | 3014 |
| experiences_offered | 0 |
| neighborhood_overview | 30119 |
| notes | 52472 |
| transit | 30845 |
| access | 40186 |
| interaction | 35013 |
| house_rules | 35960 |
| thumbnail_url | 83711 |
| medium_url | 83711 |
| picture_url | 0 |
| xl_picture_url | 83711 |
| host_id | 0 |
| host_url | 0 |
| host_name | 11 |
| host_since | 11 |
| host_location | 208 |
| host_about | 36128 |
| host_response_time | 11 |
| host_response_rate | 11 |
| host_acceptance_rate | 11 |
| host_is_superhost | 11 |
| host_thumbnail_url | 11 |
| host_picture_url | 11 |
| host_neighbourhood | 19231 |
| host_listings_count | 11 |
| host_total_listings_count | 11 |
| host_verifications | 0 |
| host_has_profile_pic | 11 |
| host_identity_verified | 11 |
| street | 0 |
| neighbourhood | 0 |
| neighbourhood_cleansed | 0 |
| neighbourhood_group_cleansed | 83711 |
| city | 453 |
| state | 14854 |
| zipcode | 1899 |
| market | 726 |
| smart_location | 0 |
| country_code | 0 |
| country | 0 |
| latitude | 0 |
| longitude | 0 |
| is_location_exact | 0 |
| property_type | 0 |
| room_type | 0 |
| accommodates | 0 |
| bathrooms | 133 |
| bedrooms | 164 |
| beds | 984 |
| bed_type | 3 |
| amenities | 0 |
| square_feet | 83333 |
| price | 0 |
| weekly_price | 77281 |
| monthly_price | 79131 |
| security_deposit | 29532 |
| cleaning_fee | 22104 |
| guests_included | 0 |
| extra_people | 0 |
| minimum_nights | 0 |
| maximum_nights | 0 |
| minimum_minimum_nights | 0 |
| maximum_minimum_nights | 0 |
| minimum_maximum_nights | 0 |
| maximum_maximum_nights | 0 |
| minimum_nights_avg_ntm | 0 |
| maximum_nights_avg_ntm | 0 |
| calendar_updated | 0 |
| has_availability | 0 |
| availability_30 | 0 |
| availability_60 | 0 |
| availability_90 | 0 |
| availability_365 | 0 |
| calendar_last_scraped | 0 |
| number_of_reviews | 0 |
| number_of_reviews_ltm | 0 |
| first_review | 21568 |
| last_review | 21568 |
| review_scores_rating | 23226 |
| review_scores_accuracy | 23286 |
| review_scores_cleanliness | 23276 |
| review_scores_checkin | 23335 |
| review_scores_communication | 23284 |
| review_scores_location | 23333 |
| review_scores_value | 23334 |
| requires_license | 0 |
| license | 83711 |
| jurisdiction_names | 83711 |
| instant_bookable | 0 |
| is_business_travel_ready | 0 |
| cancellation_policy | 0 |
| require_guest_profile_picture | 0 |
| require_guest_phone_verification | 0 |
| calculated_host_listings_count | 0 |
| calculated_host_listings_count_entire_homes | 0 |
| calculated_host_listings_count_private_rooms | 0 |
| calculated_host_listings_count_shared_rooms | 0 |
| reviews_per_month | 21568 |
Observation findings :
There are still dollar ($) tags in the price, weekly_price, monthly_price, security_deposit, extra_people, and cleaning_fee. These problems will affect numeric manipulation in our analysis, so we need to remove the dollar ($) tags.
# The updated columns stored in `clean.data` data frame
clean.data <- copy(raw.data)
# remove the "$" tag
clean.data$price <- as.numeric(gsub('[$,]','', raw.data$price))
clean.data$weekly_price <- as.numeric(gsub('[$,]','', raw.data$weekly_price))
clean.data$monthly_price <- as.numeric(gsub('[$,]','', raw.data$monthly_price))
clean.data$security_deposit <- as.numeric(gsub('[$,]','', raw.data$security_deposit))
clean.data$extra_people <- as.numeric(gsub('[$,]','', raw.data$extra_people))
clean.data$cleaning_fee <- as.numeric(gsub('[$,]','', raw.data$cleaning_fee))
# result table
head(clean.data[c("price","weekly_price","monthly_price","security_deposit","cleaning_fee")])Some columns add little or no value to our analysis, so we need to remove these columns.
Variance measures how far a set of data is spreading out. A zero variance indicates that all of the data values are identical (no variation). It also means no deviation from the data. Thus, we cannot do our analysis with zero variance columns. These columns are removed using nearzeroVar method from Caret package.
Columns Removed:
# Calculating the variance of each columns
zvCols <- nearZeroVar(clean.data, saveMetrics = TRUE)
# Extract column names that has a zero variance
ZVnames=rownames(subset(zvCols, nzv== TRUE))
# Remove columns that has a zero variance
clean.data <- clean.data[ , !(names(clean.data) %in% ZVnames)]
# result
ZVnames## [1] "scrape_id"
## [2] "experiences_offered"
## [3] "thumbnail_url"
## [4] "medium_url"
## [5] "xl_picture_url"
## [6] "host_has_profile_pic"
## [7] "neighbourhood_group_cleansed"
## [8] "state"
## [9] "market"
## [10] "country_code"
## [11] "country"
## [12] "bed_type"
## [13] "has_availability"
## [14] "requires_license"
## [15] "license"
## [16] "jurisdiction_names"
## [17] "is_business_travel_ready"
## [18] "require_guest_profile_picture"
## [19] "require_guest_phone_verification"
## [20] "calculated_host_listings_count_shared_rooms"
Variance measures how far a set of data is spreading out. Character columns with near 100% variance indicate that all of the data values are different in each row. These columns don’t provide any group-level information to a larger population.
Columns Removed:
# Selecting distinct character columns
dist.df <-
zeroVar.data %>%
select_if(is.character) %>%
summarise_all(list(~n_distinct(.)))
# Distinct cOlumns that has a near 100% variance
dist.df <-
dist.df %>%
gather(key = var_name, value = value, 1:ncol(dist.df))
# Proportions of distinct cOlumns that has a near 100% variance
dist.df$percentUnique <- round(dist.df$value/nrow(zeroVar.data),2)
# Distinct columns that has variance proportions > 20%
unique.vars <- dist.df %>%
filter(percentUnique > 0.2) %>%
pull(var_name)
# Remove columns that has variance proportions > 20%
zeroVar.data <- zeroVar.data[,!(names(zeroVar.data) %in% unique.vars)]
# result
unique.vars## [1] "listing_url" "name" "summary"
## [4] "space" "description" "neighborhood_overview"
## [7] "notes" "transit" "access"
## [10] "interaction" "house_rules" "picture_url"
## [13] "host_url" "host_about" "host_thumbnail_url"
## [16] "host_picture_url" "zipcode" "amenities"
Columns with over than 50% NA values are removed.
Columns Removed:
# Count NA rows in each columns
na_df <- zeroVar.data %>%
summarise_all(list(~sum(is.na(.))))
# Summary of NA rows in each column
na_df <- na_df %>%
gather(key = var_name, value = value, 1:ncol(na_df))
# Proportions of NA rows in each column
na_df$numNa <- round(na_df$value/nrow(zeroVar.data),2)
# Column that has 50% NA rows
na_var <- na_df %>%
filter(numNa > 0.5) %>% pull(var_name)
# Remove columns that has proportions > 50% variance
zeroVar.data <- zeroVar.data[,!(names(zeroVar.data) %in% na_var)]
# result
na_var## [1] "square_feet" "weekly_price" "monthly_price"
Column names contain strings “night”, and “location” are removed, except minimum_nightsand review_scores_location, because those columns are irrelevant information to our analysis.
Columns Removed:
# Make the pattern
pattern <- colnames(
zeroVar.data %>%
select(contains("night"),
contains("location")
)
)
# Excluding "minimun_nights" and "review_scores_location" in the pattern
pattern <- pattern[!(pattern == "minimum_nights" | pattern == "review_scores_location")]
# Remove columns that has similar names with columns in the pattern
zeroVar.data <- zeroVar.data[,!(names(zeroVar.data) %in% pattern)]
# result
pattern## [1] "maximum_nights" "minimum_minimum_nights" "maximum_minimum_nights"
## [4] "minimum_maximum_nights" "maximum_maximum_nights" "minimum_nights_avg_ntm"
## [7] "maximum_nights_avg_ntm" "host_location" "smart_location"
## [10] "is_location_exact"
There aren’t any better pattern left, so some of the columns are removed manually.
Columns Removed:
# List of columns that will be removed manually
dropCol <-
c(
"host_neighbourhood",
"host_listings_count",
"host_verifications",
"host_identity_verified",
"street",
"neighbourhood",
"city",
"property_type",
"calendar_updated",
"calendar_last_scraped",
"number_of_reviews_ltm",
"first_review",
"calculated_host_listings_count",
"calculated_host_listings_count_entire_homes",
"calculated_host_listings_count_private_rooms",
"guests_included",
"extra_people"
)
zeroVar.data <- zeroVar.data[,!(names(zeroVar.data) %in% dropCol)]
# result
dropCol## [1] "host_neighbourhood"
## [2] "host_listings_count"
## [3] "host_verifications"
## [4] "host_identity_verified"
## [5] "street"
## [6] "neighbourhood"
## [7] "city"
## [8] "property_type"
## [9] "calendar_updated"
## [10] "calendar_last_scraped"
## [11] "number_of_reviews_ltm"
## [12] "first_review"
## [13] "calculated_host_listings_count"
## [14] "calculated_host_listings_count_entire_homes"
## [15] "calculated_host_listings_count_private_rooms"
## [16] "guests_included"
## [17] "extra_people"
Some columns need to be changed as factors.
# data format : factor
zeroVar.data$host_name <- as.factor(zeroVar.data$host_name)
zeroVar.data$host_response_time <- as.factor(zeroVar.data$host_response_time)
zeroVar.data$host_response_rate <- as.factor(zeroVar.data$host_response_rate)
zeroVar.data$host_acceptance_rate <- as.factor(zeroVar.data$host_acceptance_rate)
zeroVar.data$host_total_listings_count <- as.factor(zeroVar.data$host_total_listings_count)
zeroVar.data$neighbourhood_cleansed <- as.factor(zeroVar.data$neighbourhood_cleansed)
zeroVar.data$room_type <- as.factor(zeroVar.data$room_type)
zeroVar.data$cancellation_policy <- as.factor(zeroVar.data$cancellation_policy)
zeroVar.data$review_scores_rating <- as.factor(zeroVar.data$review_scores_rating)
zeroVar.data$review_scores_accuracy <- as.factor(zeroVar.data$review_scores_accuracy)
zeroVar.data$review_scores_cleanliness <- as.factor(zeroVar.data$review_scores_cleanliness)
zeroVar.data$review_scores_checkin <- as.factor(zeroVar.data$review_scores_checkin)
zeroVar.data$review_scores_communication <- as.factor(zeroVar.data$review_scores_communication)
zeroVar.data$review_scores_location <- as.factor(zeroVar.data$review_scores_location)
zeroVar.data$review_scores_value <- as.factor(zeroVar.data$review_scores_value)last_scraped, host_since, and last_review changed into ymd format using lubridate library.
After filtering the dataset, some columns still have missing values.
# Counting missing values in each column
clean_NAval <-sapply(zeroVar.data, function(x) sum(length(which(is.na(x)))))
# Result table
kable(as.data.frame(clean_NAval)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
scroll_box(width = "100%", height = "250px")| clean_NAval | |
|---|---|
| id | 0 |
| last_scraped | 0 |
| host_id | 0 |
| host_name | 11 |
| host_since | 11 |
| host_response_time | 11 |
| host_response_rate | 11 |
| host_acceptance_rate | 11 |
| host_is_superhost | 11 |
| host_total_listings_count | 11 |
| neighbourhood_cleansed | 0 |
| latitude | 0 |
| longitude | 0 |
| room_type | 0 |
| accommodates | 0 |
| bathrooms | 133 |
| bedrooms | 164 |
| beds | 984 |
| price | 0 |
| security_deposit | 29532 |
| cleaning_fee | 22104 |
| minimum_nights | 0 |
| availability_30 | 0 |
| availability_60 | 0 |
| availability_90 | 0 |
| availability_365 | 0 |
| number_of_reviews | 0 |
| last_review | 21568 |
| review_scores_rating | 23226 |
| review_scores_accuracy | 23286 |
| review_scores_cleanliness | 23276 |
| review_scores_checkin | 23335 |
| review_scores_communication | 23284 |
| review_scores_location | 23333 |
| review_scores_value | 23334 |
| instant_bookable | 0 |
| cancellation_policy | 0 |
| reviews_per_month | 21568 |
The missing value can severely distort the distribution of the data. However, there isn’t any better way to deal with missing data. Removing columns or rows with missing values can produce a bias in the analysis. Note that imputation does not necessarily give better results.
Alvira Swalin gave a better explanation about how to handle missing data in towardsdatascience.com. The methods to handle missing values are as follow:
Based on the previous flowchart, our dataset has both general problems, continuous and categorical. We use data imputation only on factor and numeric data format.
The numerical columns that have missing values are bathrooms, bedrooms, beds, security_deposit, and cleaning_fee. we can replace the missing values in these numerical columns with the average of the corresponding column.
First, get the averages of each numeric column.
# Find average value
baths_mean <- mean(zeroVar.data$bathrooms, na.rm = TRUE)
bedrms_mean <- mean(zeroVar.data$bedrooms, na.rm = TRUE)
beds_mean <- mean(zeroVar.data$beds, na.rm = TRUE)
seqdep_mean <- mean(zeroVar.data$security_deposit, na.rm = TRUE)
cleanfee_mean <- mean(zeroVar.data$cleaning_fee, na.rm = TRUE)
monthlyreviews_mean <- mean(zeroVar.data$reviews_per_month, na.rm = TRUE)Then replace the missing values with the averages of the corresponding columns.
zeroVar.data$bathrooms[is.na(zeroVar.data$bathrooms)] <- baths_mean
zeroVar.data$bedrooms[is.na(zeroVar.data$bedrooms)] <- bedrms_mean
zeroVar.data$beds[is.na(zeroVar.data$beds)] <- beds_mean
zeroVar.data$security_deposit[is.na(zeroVar.data$security_deposit)] <- seqdep_mean
zeroVar.data$cleaning_fee[is.na(zeroVar.data$cleaning_fee)] <- cleanfee_mean
zeroVar.data$reviews_per_month[is.na(zeroVar.data$reviews_per_month)] <- monthlyreviews_meanThe categorical columns that have missing values are host_name, host_response_time, host_response_rate, host_acceptance_rate, host_total_listings_count, neighbourhood_cleansed, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, and cancellation_policy
We can treat the missing values in these categorical as levels. As example, let’s see the levels attribute of host_response_time. The missing values are treated as one of the levels in the host_response_time.
## [1] "a few days or more" "N/A" "within a day"
## [4] "within a few hours" "within an hour"
Here’s the cleaned data set:
The summary of the variables of dataset are in the table below.
| No. | Variable | Class | Description |
|---|---|---|---|
| 1 | id | numeric | Unique listing ID |
| 2 | last_scraped | date | Latest scrapped data |
| 3 | host_id | numeric | Host ID |
| 4 | host_name | factor | Name of the host |
| 5 | host_since | date | The date of user listed as a host |
| 6 | host_response_time | factor | How long the host response to user’s request |
| 7 | host_response_rate | factor | Percentage of host response time to numbers of request |
| 8 | host_acceptance_rate | factor | Percentage of host accepts the request to numbers of request |
| 9 | host_is_superhost | logical | Whether the host is super host |
| 10 | host_total_listings_count | factor | Amount of listing a host have |
| 11 | neighbourhood_cleansed | factor | Area in London |
| 12 | city | factor | City in London |
| 13 | latitude | numeric | Latitude coordinates |
| 14 | longitude | numeric | Longitude coordinates |
| 15 | room_type | factor | The type of room provided in the listing (Entire home/apt, Private room, or Shared room) |
| 16 | accommodates | numeric | The number of people that the listing can accommodate |
| 17 | bathrooms | numeric | The number of bathrooms in the listing |
| 18 | bedrooms | numeric | The number of bedrooms in the listing |
| 19 | beds | numeric | The number of beds in the listing |
| 20 | price | numeric | The price to stay in the listing for one night. |
| 21 | security_deposit | numeric | The price of security deposit according to 60% of a listing’s nightly rate |
| 22 | cleaning_fee | numeric | Additional charge for cleaning the listing, set by the hosts and paid once |
| 24 | minimum_nights | numeric | Amount of minimum nights |
| 25 | availability_30 | numeric | Number of days when listing is available in a month for booking |
| 26 | availability_60 | numeric | Number of days when listing is available in two months for booking |
| 27 | availability_90 | numeric | Number of days when listing is available in three months for booking |
| 28 | availability_365 | numeric | Number of days when listing is available in a year for booking |
| 29 | number_of_reviews | numeric | Amount of reviews the listing get |
| 30 | last_review | date | Latest review |
| 31 | review_scores_rating | factor | Rating for host for the overall experience |
| 32 | review_scores_accuracy | factor | Rating for host for the overall accuracy |
| 33 | review_scores_cleanliness | factor | Rating for host for the overall cleanliness |
| 34 | review_scores_checkin | factor | Rating for host for the overall check-in |
| 35 | review_scores_communication | factor | Rating for host for the overall host communication with a user |
| 36 | review_scores_location | factor | Rating for host for the overall listing location |
| 37 | review_scores_value | factor | Rating for host for the overall listing vallue |
| 38 | instant_bookable | logical | Whether the listing is instant bookable |
| 39 | cancellation_policy | factor | The type of cancelation policy |
| 40 | reviews_per_month | numeric | Amount of reviews per month the listing get |
# Count the numbers of listings of each room type
property_df <- zeroVar.data %>%
group_by(neighbourhood_cleansed, room_type) %>%
summarize(Freq = n()) %>% # number of observations within a group
ungroup()
# Count the total listing of each neighborhood
total_property <- zeroVar.data %>%
filter(room_type %in% c("Private room", "Entire home/apt", "Shared room", "Hotel room")) %>%
group_by(neighbourhood_cleansed) %>%
summarize(sum = n())
# Merge the data and select the top 5 cities
merge_prop <- merge (property_df, total_property, by= "neighbourhood_cleansed") %>%
arrange(-sum, -Freq) %>%
head(20)
# Sort the neighborhood from the most listing
merge_prop$size_f <- factor(merge_prop$neighbourhood_cleansed, levels=c('Westminster','Tower Hamlets','Hackney','Camden','Kensington and Chelsea'))
# Assign each neighborhood and room type with specific color
color_table <- tibble(
neighbourhood_cleansed = c('Westminster','Tower Hamlets','Hackney','Camden','Kensington and Chelsea'),
color = c("salmon", "goldenrod1", "springgreen3", "lightslateblue", "violetred")
)
ct_room <- tibble(
room_type = c('Entire home/apt','Private room','Hotel room','Shared room'),
color = c("yellow", "blue4", "lightblue", "darkgreen")
)
# a factor variable in the same order as the color table
merge_prop$neighbourhood_cleansed <- factor(merge_prop$neighbourhood_cleansed, levels = color_table$neighbourhood_cleansed)
merge_prop$room_type <- factor(merge_prop$room_type, levels = ct_room$room_type)The top 5 Cities with the most listing go to:
# Plot the data by the neighborhood
merge_prop %>%
ggplot() +
geom_bar(aes(reorder(as.factor(neighbourhood_cleansed), Freq), Freq, fill=neighbourhood_cleansed), stat = 'identity') +
geom_text(aes(neighbourhood_cleansed, sum, label = sum), hjust = 2.0, color = "white") +
labs(x="Room Type", y="Count") +
scale_fill_manual(values = color_table$color) +
theme(legend.position = 'none') +
ggtitle("The Most Listing in London by Neighborhood") +
xlab("Neighbourhood") +
ylab("Number of Listings") +
theme(legend.position = 'none',
plot.title = element_text(color = 'black', size = 14, face = 'bold', hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text()) +
coord_flip()
Now, let’s break down each neighborhood to their room type!
# Plot the data by the room type
merge_prop %>%
ggplot() +
geom_bar(aes(reorder(as.factor(room_type), Freq), Freq, fill=room_type), stat = 'identity') +
scale_fill_manual(values = ct_room$color) +
theme(legend.position = "none")+
labs(title = 'Number of Room Type by Neighbourhood',
x='Room Type',
y='Number of Listings') +
facet_wrap(~size_f, ncol = 2) +
theme(legend.position = 'none',
plot.title = element_text(color = 'black', size = 14, face = 'bold', hjust = 0.5)) +
coord_flip()
Observation findings :
Let’s find out about the correlation between mean price and number of listings!
# Count the mean price for every neighborhood
mean_price_neighbor <- zeroVar.data %>%
group_by(neighbourhood_cleansed) %>%
summarize(mean_neighbor = mean(price, na.rm = TRUE)) %>% # the number of observations within a group
ungroup()
# Count the total listing of each neighborhood
listing_sum <- zeroVar.data %>%
group_by(neighbourhood_cleansed) %>%
summarize(sum = n())
# Merge the data and select the top 5 cities with the most listing
merge_mean <- merge (mean_price_neighbor, listing_sum, by= "neighbourhood_cleansed") %>%
arrange(-sum, -mean_neighbor) %>%
head(5)
# Assign each neighborhood and room type with specific color
color_table <- tibble(
neighbourhood_cleansed = c('Westminster','Tower Hamlets','Hackney','Camden','Kensington and Chelsea'),
color = c("salmon", "goldenrod1", "springgreen3", "lightslateblue", "violetred")
)
ct_room <- tibble(
room_type = c('Entire home/apt','Private room','Hotel room','Shared room'),
color = c("yellow", "blue4", "lightblue", "darkgreen")
)
# a factor variable in the same order as the color table
merge_mean$neighbourhood_cleansed <- factor(merge_mean$neighbourhood_cleansed, levels = color_table$neighbourhood_cleansed)City with the highest mean price goes to:
# Plot the data
merge_mean %>%
ggplot(aes(x = reorder(neighbourhood_cleansed, mean_neighbor), y = mean_neighbor, fill = neighbourhood_cleansed)) +
geom_col(stat ="identity") +
scale_fill_manual(values = color_table$color) +
coord_flip() +
theme_gray() +
labs(x = "Neighbourhood Group", y = "Price") +
geom_text(aes(label = round(mean_neighbor,digit = 2)), hjust = 2.0, color = "white", size = 3.5) +
ggtitle("Mean Price comparison for each Neighbourhood", subtitle = "Price vs Neighbourhood") +
xlab("Neighbourhood") +
ylab("Mean Price ($)") +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank())Well, what about the interaction between the mean price of those cities and their room type ?
# Count mean price for each room type in every neighborhood
property_mean <- zeroVar.data %>%
group_by(neighbourhood_cleansed, room_type) %>%
summarize(mean_room = mean(price, na.rm = TRUE)) %>% # number of observations within a group
ungroup()
# Count listings of each room type in every neighborhood
property_sum <- zeroVar.data %>%
filter(room_type %in% c("Private room", "Entire home/apt", "Shared room", "Hotel room")) %>%
group_by(neighbourhood_cleansed) %>%
summarize(sum = n())
# merge data
merge_mean_neighbor <- merge (property_mean, property_sum, by= "neighbourhood_cleansed") %>%
arrange(-sum, -mean_room) %>%
head(20)
# Sort the neighborhood from the most listing
merge_mean_neighbor$sorts <- factor(merge_mean_neighbor$room_type, levels=c('Hotel room','Entire home/apt','Shared room','Private room'))
# a factor variable in the same order as the color table
merge_mean_neighbor$neighbourhood_cleansed <- factor(merge_mean_neighbor$neighbourhood_cleansed,
levels = color_table$neighbourhood_cleansed)
merge_mean_neighbor$room_type <- factor(merge_mean_neighbor$room_type,
levels = ct_room$room_type)
# Plot the data by the room type
merge_mean_neighbor %>%
ggplot()+
geom_col(aes(x = reorder(neighbourhood_cleansed, mean_room),
y = mean_room,
fill = neighbourhood_cleansed),
position = "dodge") +
scale_fill_manual(values = color_table$color) +
theme(legend.position = 'none') +
ggtitle("Average Price by Room Type") +
xlab("Neighbourhood") +
ylab("Mean Price") +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank()) +
facet_wrap(~ sorts, ncol = 2) +
coord_flip()
Observation findings :
Rumour has it that the host often increase the listing price based on their will. Do customers’ reviews have an impact on increasing the listings price? The vertical lines represent the mean price for each neighbourhood.
# Find the mean price for each neighbour
df1 <- zeroVar.data %>%
filter(neighbourhood_cleansed == "Westminster" | neighbourhood_cleansed == "Hackney") %>%
group_by(neighbourhood_cleansed) %>%
summarise(meanprice2 = mean(price))
# Plot the interaction between Price and Number or Reviews
zeroVar.data %>%
filter(neighbourhood_cleansed == "Westminster" | neighbourhood_cleansed == "Hackney") %>%
ggplot(aes(price, number_of_reviews)) +
geom_jitter(aes(col = number_of_reviews)) +
scale_x_log10() +
facet_wrap(~neighbourhood_cleansed) +
labs(title = "Number of Reviews Vs Price", x="Price ($)", y= "Number of Reviews",
color='Degree of Reviews') +
theme(plot.title = element_text(hjust = 0.5)) +
geom_vline(data=filter(df1, neighbourhood_cleansed == "Westminster"),
aes(xintercept = meanprice2), colour="salmon") +
geom_vline(data=filter(df1, neighbourhood_cleansed == "Hackney"),
aes(xintercept = meanprice2), colour="springgreen3")
Observation findings :
Airbnb is an emerging unicorn company since 2010. Numerous people have been listed as a host for so many years. Let’s take a look at the growth of hosts in London!
case4 <- zeroVar.data %>%
mutate(host_year = year(host_since)) %>%
group_by(host_year,) %>%
summarise(num_listing_year = n(),
avg_price_list = round(sum(price)/num_listing_year,2)) %>%
ungroup() %>%
mutate(text = glue("Year: {host_year}
Number of Listings: {number(num_listing_year, big.mark = ',', accuracy = 1)}
Avg. Price: ${avg_price_list}"
)) %>%
arrange(host_year) %>%
ggplot(aes(host_year, num_listing_year)) +
geom_point(aes(text = text), color = "maroon", size = 2) +
geom_line(aes(text = text), group = 1) +
scale_x_continuous(breaks = seq(2008, 2020, 2)) +
theme_bw() +
labs(title = "Number of Listings per Year",
x="Year ", y= "Count")
ggplotly(case4, tooltip = "text")%>%
config(displaylogo = FALSE)
Observation findings :
When the guests feel satisfied, they will give a great review on the listing page. One of the reasons the guests can feel pleased is by replying their inquiries.
zeroVar.data %>%
filter(neighbourhood_cleansed %in% c('Westminster','Tower Hamlets','Hackney')) %>%
filter(host_response_time %in% c('within an hour','within a few hours','within a day','a few days or more')) %>%
mutate(year_revs = year(last_review)) %>%
group_by(neighbourhood_cleansed, host_response_time, year_revs) %>%
summarise(sums_rev = sum(number_of_reviews)) %>%
ungroup() %>%
na.omit() %>%
filter(year_revs >= 2017) %>%
ggplot(aes(x = year_revs, y = sums_rev)) +
geom_col(aes(fill = year_revs), position = "dodge") +
scale_fill_viridis_b() +
theme(legend.position = 'none') +
ggtitle("Number of Reviews vs Response Time ") +
xlab("Year") +
ylab("Mean Number of Reviews") +
theme(legend.position = "none",
plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
axis.title.y = element_text(),
axis.title.x = element_text(),
axis.ticks = element_blank()) +
facet_grid(neighbourhood_cleansed ~ host_response_time, scales = "free")
Observation findings :
Count only those listings that available at least every weekend throughout the year. The first chart shows the distribution of Listings’ availability days during a year.
case6 <- zeroVar.data %>%
filter(!is.na(availability_365)) %>%
group_by(availability_365) %>%
mutate(count_avb = n(),
mean_avb = round(sum(price)/count_avb,2)) %>%
arrange(count_avb) %>%
filter(availability_365 >= 50) %>%
mutate(text = glue("{availability_365} Days per year
Number of Listings: {number(count_avb, big.mark = ',', accuracy = 1)}
Avg. Price: ${number(mean_avb, big.mark = ',', accuracy = 1)}"
)) %>%
ggplot(aes(availability_365, count_avb), text = text) +
geom_point(aes(text = text), alpha = 0.5, color = "violetred") +
geom_line() +
labs(title = " Availability During a Year", x =" Availability Days", y= "Count")+
scale_x_continuous(breaks = seq(50, 365, 45)) +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(case6, tooltip = "text")%>%
config(displaylogo = FALSE)
Observation findings :
The second chart shows us the relationship between Listings’ availability days and its price. The graph shows the maximum price for its corresponding days.
case6b <- zeroVar.data %>%
filter(!is.na(availability_365)) %>%
group_by(availability_365) %>%
mutate(count_avb = n(),
max_avb = max(price),
mean_avb = round(sum(price)/count_avb,2),
min_avb = min(price)) %>%
mutate(text =glue("{availability_365} Days per year
There are {count_avb} Listings
Max. Price: ${number(max_avb, big.mark = ',', accuracy = 1)}
Avg. Price: ${mean_avb}
Min. Price: ${min_avb}"
)) %>%
ggplot(aes(availability_365, price)) +
geom_point(data = . %>% group_by(availability_365) %>% filter(price == max(price)),
aes(text = text),
alpha = 0.2, color = "forestgreen") +
geom_segment(aes(x=availability_365, xend=availability_365, y=0, yend=price)) +
labs(title = " Availability Vs Price", x ="Availability Days", y= "Price")+
scale_x_continuous(breaks = seq(0, 365, 45)) +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(case6b, tooltip = "text")%>%
config(displaylogo = FALSE)
Observation findings :
zeroVar.data %>%
leaflet() %>%
addProviderTiles(providers$Esri) %>%
# setView(-0.103894, 51.503971, zoom = 13) %>%
addMarkers(lng = ~longitude,
lat = ~latitude,
clusterOptions = markerClusterOptions(),
popup = paste0("Host Name: ", zeroVar.data$host_name,
"<br>Room Type: ", zeroVar.data$room_type,
"<br>Price: $", zeroVar.data$price,
"<br>Rating: ", zeroVar.data$review_scores_rating))