Synopsis

Airbnb is one of the world's most innovative online marketplaces for offering and arranging lodging (primarily homestays) and tourism experiences. Founded in 2008 by Brian Chesky, Joe Gebbia, and Nathan Blecharczyk, Airbnb has helped millions of hospitality entrepreneurs monetize their spaces and their passions while keeping the financial benefits of tourism in their communities. The Airbnb market is thriving in London, which had more than 80,000 listings as of June 2020.

The objective of this project is to understand the factors that influence the prices of Airbnb listings in London. It stems from curiosity about the specific factors that cause short-term rental prices to change so readily over time.

Packages Required

We begin by loading the packages that will be required throughout the analysis.

library(tidyverse)
# library(tidyr)
# library(dplyr)
# library(ggplot2)
library(lubridate)
library(caret)
library(data.table)
library(DT)
library(kableExtra)
library(knitr)
library(glue)
library(scales)
library(plotly)
library(leaflet)
library(rgdal)

The descriptions of the packages are in the table below.

Packages Description
tidyverse Collection of R packages (including tidyr, dplyr, and ggplot2) designed for data science that work harmoniously together
tidyr Reshaping data sets into a tidy format
dplyr Data manipulation
ggplot2 Customizable graphical representation
lubridate Parsing and formatting dates
caret Data pre-processing and feature selection
data.table Data manipulation that can reduce programming and compute time tremendously
DT An R interface to the DataTables JavaScript library
kableExtra Styling tables within R Markdown
knitr A general-purpose tool for dynamic report generation
glue String interpolation for labels and tooltips
scales Formatting numbers and axis labels
plotly Interactive plots
leaflet Interactive maps
rgdal Bindings to the Geospatial Data Abstraction Library for reading spatial data

Data Preparation

Now, let’s load the dataset into the R-Environment.

Importing Data

Data Source

This project uses data from the "Get the Data" section of the Inside Airbnb website. Inside Airbnb is a data project built by Murray Cox and designed by John Morris that compiles data scraped from the Airbnb website.

This project uses the dataset of Airbnb listings in London, United Kingdom. The dataset records information on more than 80,000 listings as of 11 June 2020 (downloaded in August 2020).

Read the Data

The dataset is in .csv format, so we use the read_csv() function (from readr, loaded with the tidyverse) to read it.

raw.data <- read_csv("det_listings.csv")
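Since data.table is already loaded, fread() could serve as a faster alternative for a file of this size; a minimal sketch, kept commented out like the optional library calls above (note that fread() returns a data.table rather than a tibble, and its column type guessing can differ slightly from read_csv()).

# Optional alternative: data.table::fread() is typically faster for large CSVs
# raw.data <- fread("det_listings.csv")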

Glimpse of the Data

After importing the dataset, let's take a peek at it!

The dataset has 83,711 rows and 106 columns.

# DATA INPUT: RAW DATASET
glimpse(raw.data)
## Rows: 83,711
## Columns: 106
## $ id                                           <dbl> 11551, 13913, 15400, 1...
## $ listing_url                                  <chr> "https://www.airbnb.co...
## $ scrape_id                                    <dbl> 2.020061e+13, 2.020061...
## $ last_scraped                                 <date> 2020-06-16, 2020-06-1...
## $ name                                         <chr> "Arty and Bright Londo...
## $ summary                                      <chr> "Unlike most rental ap...
## $ space                                        <chr> "Amenities Bedding: 1 ...
## $ description                                  <chr> "Unlike most rental ap...
## $ experiences_offered                          <chr> "family", "business", ...
## $ neighborhood_overview                        <chr> "Not even 10 minutes b...
## $ notes                                        <chr> "No Smoking (very stri...
## $ transit                                      <chr> "Tons of buses (24hrs)...
## $ access                                       <chr> "Guest will have acces...
## $ interaction                                  <chr> "No interaction with g...
## $ house_rules                                  <chr> "No Smoking (very stri...
## $ thumbnail_url                                <lgl> NA, NA, NA, NA, NA, NA...
## $ medium_url                                   <lgl> NA, NA, NA, NA, NA, NA...
## $ picture_url                                  <chr> "https://a0.muscache.c...
## $ xl_picture_url                               <lgl> NA, NA, NA, NA, NA, NA...
## $ host_id                                      <dbl> 43039, 54730, 60302, 6...
## $ host_url                                     <chr> "https://www.airbnb.co...
## $ host_name                                    <chr> "Adriano", "Alina", "P...
## $ host_since                                   <date> 2009-10-03, 2009-11-1...
## $ host_location                                <chr> "London, England, Unit...
## $ host_about                                   <chr> "Hello, I'm a friendly...
## $ host_response_time                           <chr> "within an hour", "N/A...
## $ host_response_rate                           <chr> "100%", "N/A", "100%",...
## $ host_acceptance_rate                         <chr> "100%", "89%", "83%", ...
## $ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, F...
## $ host_thumbnail_url                           <chr> "https://a0.muscache.c...
## $ host_picture_url                             <chr> "https://a0.muscache.c...
## $ host_neighbourhood                           <chr> "Brixton", "LB of Isli...
## $ host_listings_count                          <dbl> 3, 4, 1, 18, 3, 3, 2, ...
## $ host_total_listings_count                    <dbl> 3, 4, 1, 18, 3, 3, 2, ...
## $ host_verifications                           <chr> "['email', 'phone', 'r...
## $ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE...
## $ host_identity_verified                       <lgl> TRUE, TRUE, TRUE, TRUE...
## $ street                                       <chr> "London, United Kingdo...
## $ neighbourhood                                <chr> "Brixton", "LB of Isli...
## $ neighbourhood_cleansed                       <chr> "Lambeth", "Islington"...
## $ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA...
## $ city                                         <chr> "London", "Islington",...
## $ state                                        <chr> NA, "Greater London", ...
## $ zipcode                                      <chr> "SW9 8DG", "N4 3", "SW...
## $ market                                       <chr> "London", "London", "L...
## $ smart_location                               <chr> "London, United Kingdo...
## $ country_code                                 <chr> "GB", "GB", "GB", "GB"...
## $ country                                      <chr> "United Kingdom", "Uni...
## $ latitude                                     <dbl> 51.46225, 51.56802, 51...
## $ longitude                                    <dbl> -0.11732, -0.11121, -0...
## $ is_location_exact                            <lgl> TRUE, TRUE, TRUE, FALS...
## $ property_type                                <chr> "Apartment", "Apartmen...
## $ room_type                                    <chr> "Entire home/apt", "Pr...
## $ accommodates                                 <dbl> 4, 2, 2, 6, 2, 2, 2, 5...
## $ bathrooms                                    <dbl> 1.0, 1.0, 1.0, 2.0, 1....
## $ bedrooms                                     <dbl> 1, 1, 1, 3, 1, 1, 1, 3...
## $ beds                                         <dbl> 3, 0, 1, 3, 1, 0, 1, 3...
## $ bed_type                                     <chr> "Real Bed", "Real Bed"...
## $ amenities                                    <chr> "{TV,\"Cable TV\",Inte...
## $ square_feet                                  <dbl> NA, 538, NA, NA, NA, N...
## $ price                                        <chr> "$88.00", "$65.00", "$...
## $ weekly_price                                 <chr> "$645.00", "$333.00", ...
## $ monthly_price                                <chr> "$2,350.00", "$1,176.0...
## $ security_deposit                             <chr> "$400.00", "$100.00", ...
## $ cleaning_fee                                 <chr> "$35.00", "$15.00", "$...
## $ guests_included                              <dbl> 2, 1, 2, 3, 1, 1, 1, 1...
## $ extra_people                                 <chr> "$20.00", "$15.00", "$...
## $ minimum_nights                               <dbl> 2, 1, 10, 3, 3, 10, 1,...
## $ maximum_nights                               <dbl> 180, 29, 50, 365, 21, ...
## $ minimum_minimum_nights                       <dbl> 2, 1, 10, 3, 2, 10, 1,...
## $ maximum_minimum_nights                       <dbl> 2, 1, 10, 4, 3, 10, 1,...
## $ minimum_maximum_nights                       <dbl> 1125, 29, 50, 60, 21, ...
## $ maximum_maximum_nights                       <dbl> 1125, 29, 50, 365, 21,...
## $ minimum_nights_avg_ntm                       <dbl> 2.0, 1.0, 10.0, 3.2, 3...
## $ maximum_nights_avg_ntm                       <dbl> 1125.0, 29.0, 50.0, 29...
## $ calendar_updated                             <chr> "3 months ago", "14 mo...
## $ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE...
## $ availability_30                              <dbl> 12, 12, 0, 9, 11, 13, ...
## $ availability_60                              <dbl> 42, 42, 0, 39, 41, 43,...
## $ availability_90                              <dbl> 72, 72, 13, 69, 71, 73...
## $ availability_365                             <dbl> 347, 347, 288, 326, 34...
## $ calendar_last_scraped                        <date> 2020-06-16, 2020-06-1...
## $ number_of_reviews                            <dbl> 192, 21, 89, 42, 0, 12...
## $ number_of_reviews_ltm                        <dbl> 9, 5, 4, 2, 0, 8, 1, 1...
## $ first_review                                 <date> 2010-03-21, 2010-08-1...
## $ last_review                                  <date> 2020-03-26, 2020-02-2...
## $ review_scores_rating                         <dbl> 91, 97, 96, 94, NA, 96...
## $ review_scores_accuracy                       <dbl> 9, 10, 10, 10, NA, 10,...
## $ review_scores_cleanliness                    <dbl> 9, 10, 10, 9, NA, 10, ...
## $ review_scores_checkin                        <dbl> 10, 10, 10, 9, NA, 10,...
## $ review_scores_communication                  <dbl> 10, 10, 10, 9, NA, 10,...
## $ review_scores_location                       <dbl> 9, 9, 10, 10, NA, 9, 1...
## $ review_scores_value                          <dbl> 9, 9, 9, 9, NA, 10, 8,...
## $ requires_license                             <lgl> FALSE, FALSE, FALSE, F...
## $ license                                      <lgl> NA, NA, NA, NA, NA, NA...
## $ jurisdiction_names                           <lgl> NA, NA, NA, NA, NA, NA...
## $ instant_bookable                             <lgl> TRUE, FALSE, TRUE, TRU...
## $ is_business_travel_ready                     <lgl> FALSE, FALSE, FALSE, F...
## $ cancellation_policy                          <chr> "strict_14_with_grace_...
## $ require_guest_profile_picture                <lgl> FALSE, FALSE, TRUE, FA...
## $ require_guest_phone_verification             <lgl> TRUE, FALSE, TRUE, FAL...
## $ calculated_host_listings_count               <dbl> 2, 3, 1, 15, 2, 3, 2, ...
## $ calculated_host_listings_count_entire_homes  <dbl> 2, 1, 1, 15, 0, 0, 1, ...
## $ calculated_host_listings_count_private_rooms <dbl> 0, 2, 0, 0, 2, 3, 1, 0...
## $ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0...
## $ reviews_per_month                            <dbl> 1.54, 0.18, 0.70, 0.37...

Data Cleaning

The dataset we just imported is raw.

For exploratory data analysis, we need to remove or modify data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.

Quality Check

Missing Values

Let's see how many missing values there are in each column.

# Counting missing values in each column
numMissVal <- sapply(raw.data, function(x) sum(is.na(x)))

# Result table
kable(as.data.frame(numMissVal)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width = "100%", height = "250px")
numMissVal
id 0
listing_url 0
scrape_id 0
last_scraped 0
name 25
summary 4864
space 25948
description 3014
experiences_offered 0
neighborhood_overview 30119
notes 52472
transit 30845
access 40186
interaction 35013
house_rules 35960
thumbnail_url 83711
medium_url 83711
picture_url 0
xl_picture_url 83711
host_id 0
host_url 0
host_name 11
host_since 11
host_location 208
host_about 36128
host_response_time 11
host_response_rate 11
host_acceptance_rate 11
host_is_superhost 11
host_thumbnail_url 11
host_picture_url 11
host_neighbourhood 19231
host_listings_count 11
host_total_listings_count 11
host_verifications 0
host_has_profile_pic 11
host_identity_verified 11
street 0
neighbourhood 0
neighbourhood_cleansed 0
neighbourhood_group_cleansed 83711
city 453
state 14854
zipcode 1899
market 726
smart_location 0
country_code 0
country 0
latitude 0
longitude 0
is_location_exact 0
property_type 0
room_type 0
accommodates 0
bathrooms 133
bedrooms 164
beds 984
bed_type 3
amenities 0
square_feet 83333
price 0
weekly_price 77281
monthly_price 79131
security_deposit 29532
cleaning_fee 22104
guests_included 0
extra_people 0
minimum_nights 0
maximum_nights 0
minimum_minimum_nights 0
maximum_minimum_nights 0
minimum_maximum_nights 0
maximum_maximum_nights 0
minimum_nights_avg_ntm 0
maximum_nights_avg_ntm 0
calendar_updated 0
has_availability 0
availability_30 0
availability_60 0
availability_90 0
availability_365 0
calendar_last_scraped 0
number_of_reviews 0
number_of_reviews_ltm 0
first_review 21568
last_review 21568
review_scores_rating 23226
review_scores_accuracy 23286
review_scores_cleanliness 23276
review_scores_checkin 23335
review_scores_communication 23284
review_scores_location 23333
review_scores_value 23334
requires_license 0
license 83711
jurisdiction_names 83711
instant_bookable 0
is_business_travel_ready 0
cancellation_policy 0
require_guest_profile_picture 0
require_guest_phone_verification 0
calculated_host_listings_count 0
calculated_host_listings_count_entire_homes 0
calculated_host_listings_count_private_rooms 0
calculated_host_listings_count_shared_rooms 0
reviews_per_month 21568


Observation findings :

  1. Most of the URL-, host-, review-, and legal-related fields are largely blank.
  2. Relatively few listings offer weekly or monthly rates.

Dollar Signs

There are still dollar signs ($) and thousands separators in price, weekly_price, monthly_price, security_deposit, extra_people, and cleaning_fee, so these columns were read as character rather than numeric. We need to strip the "$" and "," and convert the columns to numeric before any analysis.

# The updated columns stored in `clean.data` data frame
clean.data <- copy(raw.data)

# remove the "$" tag
clean.data$price <- as.numeric(gsub('[$,]','', raw.data$price))
clean.data$weekly_price <- as.numeric(gsub('[$,]','', raw.data$weekly_price))
clean.data$monthly_price <- as.numeric(gsub('[$,]','', raw.data$monthly_price))
clean.data$security_deposit <- as.numeric(gsub('[$,]','', raw.data$security_deposit))
clean.data$extra_people <- as.numeric(gsub('[$,]','', raw.data$extra_people))
clean.data$cleaning_fee <- as.numeric(gsub('[$,]','', raw.data$cleaning_fee))

# result table
head(clean.data[c("price","weekly_price","monthly_price","security_deposit","cleaning_fee")])
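A more compact alternative worth noting is readr::parse_number() (readr ships with the tidyverse); it drops the currency symbol and the thousands separators in one call. A minimal sketch:

# Sketch: parse_number() strips "$" and "," and returns a numeric value
parse_number("$2,350.00")   # 2350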

Duplicated Rows

No duplicated rows are found.

# Check duplicated rows
clean.data[which(duplicated(clean.data) ==T), ] 
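A quick count gives the same answer; this one-liner is just a sanity check, not part of the original pipeline.

# Number of fully duplicated rows (expected to be 0)
sum(duplicated(clean.data))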

Filtering

Some columns add little or no value to our analysis, so we need to remove these columns.

Zero Variance Columns

Variance measures how spread out a set of values is. Zero variance means that all of the values in a column are identical, so the column carries no information for our analysis. These columns (together with near-zero-variance columns, which are dominated by a single value) are identified and removed using the nearZeroVar() function from the caret package.

Columns Removed:

# Calculate variance metrics for each column
zvCols <- nearZeroVar(clean.data, saveMetrics = TRUE)

# Extract the names of columns flagged as (near-)zero variance
ZVnames <- rownames(subset(zvCols, nzv == TRUE))
# Remove those columns
clean.data <- clean.data[ , !(names(clean.data) %in% ZVnames)]

# result
ZVnames
##  [1] "scrape_id"                                  
##  [2] "experiences_offered"                        
##  [3] "thumbnail_url"                              
##  [4] "medium_url"                                 
##  [5] "xl_picture_url"                             
##  [6] "host_has_profile_pic"                       
##  [7] "neighbourhood_group_cleansed"               
##  [8] "state"                                      
##  [9] "market"                                     
## [10] "country_code"                               
## [11] "country"                                    
## [12] "bed_type"                                   
## [13] "has_availability"                           
## [14] "requires_license"                           
## [15] "license"                                    
## [16] "jurisdiction_names"                         
## [17] "is_business_travel_ready"                   
## [18] "require_guest_profile_picture"              
## [19] "require_guest_phone_verification"           
## [20] "calculated_host_listings_count_shared_rooms"

Based on Unique Values - Character Columns

For character columns we look at uniqueness rather than variance. A character column in which most values are distinct (URLs, free-text descriptions, and so on) behaves like an identifier and provides no group-level information for the analysis, so character columns where more than 20% of the values are unique are removed.

Columns Removed:

# Start from the cleaned data; the filtered result is stored in `zeroVar.data`
zeroVar.data <- clean.data

# Count the distinct values in each character column
dist.df <-
  zeroVar.data %>% 
  select_if(is.character) %>% 
  summarise_all(list(~n_distinct(.)))

# Reshape to one row per column
dist.df <-
  dist.df %>% 
  gather(key = var_name, value = value, 1:ncol(dist.df))

# Proportion of unique values in each column
dist.df$percentUnique <- round(dist.df$value/nrow(zeroVar.data),2)

# Columns where more than 20% of the values are unique
unique.vars <- dist.df %>% 
  filter(percentUnique > 0.2) %>% 
  pull(var_name)

# Remove those columns
zeroVar.data <- zeroVar.data[,!(names(zeroVar.data) %in% unique.vars)]

# result
unique.vars
##  [1] "listing_url"           "name"                  "summary"              
##  [4] "space"                 "description"           "neighborhood_overview"
##  [7] "notes"                 "transit"               "access"               
## [10] "interaction"           "house_rules"           "picture_url"          
## [13] "host_url"              "host_about"            "host_thumbnail_url"   
## [16] "host_picture_url"      "zipcode"               "amenities"

Based on NA Values

Columns with more than 50% NA values are removed.

Columns Removed:

# Count NA values in each column
na_df <- zeroVar.data %>% 
  summarise_all(list(~sum(is.na(.))))

# Reshape to one row per column
na_df <- na_df %>% 
  gather(key = var_name, value = value, 1:ncol(na_df))

# Proportion of NA values in each column
na_df$numNa <- round(na_df$value/nrow(zeroVar.data),2)

# Columns with more than 50% NA values
na_var <- na_df %>% 
  filter(numNa > 0.5) %>% pull(var_name)

# Remove columns with more than 50% NA values
zeroVar.data <- zeroVar.data[,!(names(zeroVar.data) %in% na_var)]

# result
na_var
## [1] "square_feet"   "weekly_price"  "monthly_price"

Pattern Matching

Columns whose names contain the strings “night” or “location” are removed, except minimum_nights and review_scores_location, because the other matching columns are irrelevant to our analysis.

Columns Removed:

# Make the pattern
pattern <- colnames(
  zeroVar.data %>%
  select(contains("night"),
         contains("location")
         )
)
  


# Excluding "minimun_nights" and "review_scores_location" in the pattern
pattern <- pattern[!(pattern == "minimum_nights" | pattern == "review_scores_location")]

# Remove columns that has similar names with columns in the pattern
zeroVar.data <- zeroVar.data[,!(names(zeroVar.data) %in% pattern)]

# result
pattern
##  [1] "maximum_nights"         "minimum_minimum_nights" "maximum_minimum_nights"
##  [4] "minimum_maximum_nights" "maximum_maximum_nights" "minimum_nights_avg_ntm"
##  [7] "maximum_nights_avg_ntm" "host_location"          "smart_location"        
## [10] "is_location_exact"

Manual Removal

No other useful naming patterns remain, so the rest of the irrelevant columns are removed manually.

Columns Removed:

# List of columns that will be removed manually
dropCol <-
  c(
  "host_neighbourhood",
  "host_listings_count",
  "host_verifications",
  "host_identity_verified",
  "street",
  "neighbourhood",
  "city",
  "property_type",
  "calendar_updated",
  "calendar_last_scraped",
  "number_of_reviews_ltm",
  "first_review",
  "calculated_host_listings_count",
  "calculated_host_listings_count_entire_homes",
  "calculated_host_listings_count_private_rooms",
  "guests_included",
  "extra_people"
  )

zeroVar.data <- zeroVar.data[,!(names(zeroVar.data) %in% dropCol)]


# result
dropCol
##  [1] "host_neighbourhood"                          
##  [2] "host_listings_count"                         
##  [3] "host_verifications"                          
##  [4] "host_identity_verified"                      
##  [5] "street"                                      
##  [6] "neighbourhood"                               
##  [7] "city"                                        
##  [8] "property_type"                               
##  [9] "calendar_updated"                            
## [10] "calendar_last_scraped"                       
## [11] "number_of_reviews_ltm"                       
## [12] "first_review"                                
## [13] "calculated_host_listings_count"              
## [14] "calculated_host_listings_count_entire_homes" 
## [15] "calculated_host_listings_count_private_rooms"
## [16] "guests_included"                             
## [17] "extra_people"

Data Formatting

Some columns need to be converted to factors.

# data format : factor
zeroVar.data$host_name <- as.factor(zeroVar.data$host_name)
zeroVar.data$host_response_time <-  as.factor(zeroVar.data$host_response_time)
zeroVar.data$host_response_rate <- as.factor(zeroVar.data$host_response_rate)
zeroVar.data$host_acceptance_rate <- as.factor(zeroVar.data$host_acceptance_rate)
zeroVar.data$host_total_listings_count <- as.factor(zeroVar.data$host_total_listings_count)

zeroVar.data$neighbourhood_cleansed <- as.factor(zeroVar.data$neighbourhood_cleansed)
zeroVar.data$room_type <- as.factor(zeroVar.data$room_type)
zeroVar.data$cancellation_policy <- as.factor(zeroVar.data$cancellation_policy)


zeroVar.data$review_scores_rating <-  as.factor(zeroVar.data$review_scores_rating)
zeroVar.data$review_scores_accuracy <-  as.factor(zeroVar.data$review_scores_accuracy)
zeroVar.data$review_scores_cleanliness <-  as.factor(zeroVar.data$review_scores_cleanliness)
zeroVar.data$review_scores_checkin <-  as.factor(zeroVar.data$review_scores_checkin)
zeroVar.data$review_scores_communication <-  as.factor(zeroVar.data$review_scores_communication)
zeroVar.data$review_scores_location <-  as.factor(zeroVar.data$review_scores_location)
zeroVar.data$review_scores_value <-  as.factor(zeroVar.data$review_scores_value)

last_scraped, host_since, and last_review are converted to dates (ymd format) using the lubridate package.

zeroVar.data$last_scraped <- ymd(zeroVar.data$last_scraped)
zeroVar.data$host_since <- ymd(zeroVar.data$host_since)
zeroVar.data$last_review <- ymd(zeroVar.data$last_review)
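The same conversions could be written more compactly with dplyr's mutate() and across(); this is an equivalent sketch, not part of the original pipeline.

# Sketch: the same factor and date conversions via across()
factor_cols <- c("host_name", "host_response_time", "host_response_rate",
                 "host_acceptance_rate", "host_total_listings_count",
                 "neighbourhood_cleansed", "room_type", "cancellation_policy",
                 "review_scores_rating", "review_scores_accuracy",
                 "review_scores_cleanliness", "review_scores_checkin",
                 "review_scores_communication", "review_scores_location",
                 "review_scores_value")
date_cols <- c("last_scraped", "host_since", "last_review")

zeroVar.data <- zeroVar.data %>%
  mutate(across(all_of(factor_cols), as.factor),
         across(all_of(date_cols), ymd))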

NA Values Treatment

Problem Statement

After filtering the dataset, some columns still have missing values.

# Counting missing values in each column
clean_NAval <- sapply(zeroVar.data, function(x) sum(is.na(x)))

# Result table
kable(as.data.frame(clean_NAval)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width = "100%", height = "250px")
clean_NAval
id 0
last_scraped 0
host_id 0
host_name 11
host_since 11
host_response_time 11
host_response_rate 11
host_acceptance_rate 11
host_is_superhost 11
host_total_listings_count 11
neighbourhood_cleansed 0
latitude 0
longitude 0
room_type 0
accommodates 0
bathrooms 133
bedrooms 164
beds 984
price 0
security_deposit 29532
cleaning_fee 22104
minimum_nights 0
availability_30 0
availability_60 0
availability_90 0
availability_365 0
number_of_reviews 0
last_review 21568
review_scores_rating 23226
review_scores_accuracy 23286
review_scores_cleanliness 23276
review_scores_checkin 23335
review_scores_communication 23284
review_scores_location 23333
review_scores_value 23334
instant_bookable 0
cancellation_policy 0
reviews_per_month 21568


Missing values can severely distort the distribution of the data, and there is no single best way to deal with them. Removing columns or rows with missing values can bias the analysis, and imputation does not necessarily give better results either.


Alvira Swalin gives a good overview of how to handle missing data on towardsdatascience.com; the flowchart from that article guides the approach taken below.

Problem Solving

Based on the flowchart, our dataset has missing values of both kinds: continuous (numeric) and categorical (factor). We apply imputation only to columns in numeric and factor format.


Continuous

The numeric columns that have missing values are bathrooms, bedrooms, beds, security_deposit, cleaning_fee, and reviews_per_month. We replace the missing values in these columns with the mean of the corresponding column.

First, get the averages of each numeric column.

# Find average value 
baths_mean <- mean(zeroVar.data$bathrooms,  na.rm = TRUE)
bedrms_mean <- mean(zeroVar.data$bedrooms,  na.rm = TRUE) 
beds_mean <- mean(zeroVar.data$beds,  na.rm = TRUE) 
seqdep_mean <- mean(zeroVar.data$security_deposit,  na.rm = TRUE) 
cleanfee_mean <- mean(zeroVar.data$cleaning_fee,  na.rm = TRUE)
monthlyreviews_mean <- mean(zeroVar.data$reviews_per_month, na.rm = TRUE)

Then replace the missing values with the averages of the corresponding columns.

zeroVar.data$bathrooms[is.na(zeroVar.data$bathrooms)] <- baths_mean
zeroVar.data$bedrooms[is.na(zeroVar.data$bedrooms)] <- bedrms_mean
zeroVar.data$beds[is.na(zeroVar.data$beds)] <- beds_mean
zeroVar.data$security_deposit[is.na(zeroVar.data$security_deposit)] <- seqdep_mean
zeroVar.data$cleaning_fee[is.na(zeroVar.data$cleaning_fee)] <- cleanfee_mean
zeroVar.data$reviews_per_month[is.na(zeroVar.data$reviews_per_month)] <- monthlyreviews_mean
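The replace-with-mean step could also be wrapped in a small helper to avoid repeating the pattern; impute_mean() below is a hypothetical name, shown only as a sketch equivalent to the code above.

# Sketch: a reusable mean-imputation helper (hypothetical)
impute_mean <- function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x)
zeroVar.data$bathrooms <- impute_mean(zeroVar.data$bathrooms)
zeroVar.data$bedrooms  <- impute_mean(zeroVar.data$bedrooms)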

Categorical

The factor columns that still have missing values are host_name, host_response_time, host_response_rate, host_acceptance_rate, host_total_listings_count, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, and review_scores_value.

We can treat the missing values in these categorical columns as levels of their own. As an example, look at the levels attribute of host_response_time: missing responses, recorded as the string "N/A" in the scrape, already appear as one of its levels.

levels(zeroVar.data$host_response_time)
## [1] "a few days or more" "N/A"                "within a day"      
## [4] "within a few hours" "within an hour"

Cleaned Dataset

Here’s the cleaned data set:

datatable(head(zeroVar.data, 50), 
          style = 'bootstrap', 
          class = 'table-bordered table-condensed')

Summary of Variables

A summary of the variables in the cleaned dataset is in the table below.

No.  Variable Class Description
1 id numeric Unique listing ID
2 last_scraped date Date the listing was last scraped
3 host_id numeric Host ID
4 host_name factor Name of the host
5 host_since date The date the user became a host
6 host_response_time factor How long the host takes to respond to a request
7 host_response_rate factor Percentage of requests the host responds to
8 host_acceptance_rate factor Percentage of requests the host accepts
9 host_is_superhost logical Whether the host is a superhost
10 host_total_listings_count factor Number of listings the host has
11 neighbourhood_cleansed factor Neighbourhood (borough) in London
12 latitude numeric Latitude coordinate
13 longitude numeric Longitude coordinate
14 room_type factor The type of room offered (Entire home/apt, Private room, Hotel room, or Shared room)
15 accommodates numeric The number of people the listing can accommodate
16 bathrooms numeric The number of bathrooms in the listing
17 bedrooms numeric The number of bedrooms in the listing
18 beds numeric The number of beds in the listing
19 price numeric The price to stay in the listing for one night
20 security_deposit numeric The security deposit for the listing
21 cleaning_fee numeric Additional one-off charge for cleaning, set by the host
22 minimum_nights numeric Minimum number of nights per booking
23 availability_30 numeric Number of days the listing is available for booking in the next 30 days
24 availability_60 numeric Number of days the listing is available for booking in the next 60 days
25 availability_90 numeric Number of days the listing is available for booking in the next 90 days
26 availability_365 numeric Number of days the listing is available for booking in the next 365 days
27 number_of_reviews numeric Number of reviews the listing has received
28 last_review date Date of the latest review
29 review_scores_rating factor Rating for the overall experience
30 review_scores_accuracy factor Rating for accuracy
31 review_scores_cleanliness factor Rating for cleanliness
32 review_scores_checkin factor Rating for check-in
33 review_scores_communication factor Rating for host communication
34 review_scores_location factor Rating for the listing location
35 review_scores_value factor Rating for value
36 instant_bookable logical Whether the listing is instantly bookable
37 cancellation_policy factor The type of cancellation policy
38 reviews_per_month numeric Number of reviews the listing receives per month

Exploratory Data Analysis & Data Visualization

Case 1

The Top 5 Neighbourhoods with the Most Listings in London

# Count the number of listings of each room type in each neighbourhood
property_df <-  zeroVar.data %>% 
  group_by(neighbourhood_cleansed, room_type) %>% 
  summarize(Freq = n()) %>% #  number of observations within a group
  ungroup()

# Count the total listing of each neighborhood
total_property <-  zeroVar.data %>% 
  filter(room_type %in% c("Private room", "Entire home/apt", "Shared room", "Hotel room")) %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarize(sum = n())
  
# Merge the data and keep the top 5 neighbourhoods (with their room types)
merge_prop <- merge (property_df, total_property, by= "neighbourhood_cleansed") %>% 
  arrange(-sum, -Freq) %>% 
  head(20)


# Sort the neighborhood from the most listing
merge_prop$size_f <- factor(merge_prop$neighbourhood_cleansed, levels=c('Westminster','Tower Hamlets','Hackney','Camden','Kensington and Chelsea'))


# Assign each neighborhood and room type with specific color
color_table <- tibble(
  neighbourhood_cleansed = c('Westminster','Tower Hamlets','Hackney','Camden','Kensington and Chelsea'),
  color = c("salmon", "goldenrod1", "springgreen3", "lightslateblue", "violetred")
  )

ct_room <- tibble(
  room_type = c('Entire home/apt','Private room','Hotel room','Shared room'),
  color = c("yellow", "blue4", "lightblue", "darkgreen")
  )


# a factor variable in the same order as the color table
merge_prop$neighbourhood_cleansed <- factor(merge_prop$neighbourhood_cleansed, levels = color_table$neighbourhood_cleansed)
merge_prop$room_type <- factor(merge_prop$room_type, levels = ct_room$room_type)

The top 5 neighbourhoods with the most listings are:

# Plot the data by the neighborhood

merge_prop %>% 
  ggplot() +
  geom_bar(aes(reorder(as.factor(neighbourhood_cleansed), Freq), Freq, fill=neighbourhood_cleansed), stat = 'identity') +
  geom_text(aes(neighbourhood_cleansed, sum, label = sum), hjust = 2.0,  color = "white") +
  labs(x="Room Type", y="Count") +
  scale_fill_manual(values = color_table$color) +
  theme(legend.position = 'none') +
  ggtitle("The Most Listing in London by Neighborhood") + 
  xlab("Neighbourhood") + 
  ylab("Number of Listings") +
  theme(legend.position = 'none',
        plot.title = element_text(color = 'black', size = 14, face = 'bold', hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text()) +
  coord_flip()


Now, let's break down each neighbourhood by room type!

# Plot the data by the room type
merge_prop %>% 
  ggplot() +
  geom_bar(aes(reorder(as.factor(room_type), Freq), Freq, fill=room_type), stat = 'identity') +
  scale_fill_manual(values = ct_room$color) +
  theme(legend.position = "none")+
  labs(title = 'Number of Room Type by Neighbourhood',
       x='Room Type', 
       y='Number of Listings') +
  facet_wrap(~size_f, ncol = 2) +
  theme(legend.position = 'none',
        plot.title = element_text(color = 'black', size = 14, face = 'bold', hjust = 0.5)) +
  coord_flip()


Observation findings :

  1. Entire home/apartment and private room listings account for most of the Airbnb listings in London.
  2. Hotel rooms and shared rooms account for the fewest listings.

Case 2

Mean Price Comparison of the Top 5 Neighbourhoods with the Most Listings in London


Let's find out how the mean price relates to the number of listings!

# Count the mean price for every neighborhood
mean_price_neighbor <-  zeroVar.data %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarize(mean_neighbor = mean(price, na.rm = TRUE)) %>% # mean listing price within each neighbourhood
  ungroup()

# Count the total listing of each neighborhood
listing_sum <-  zeroVar.data %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarize(sum = n())
  
# Merge the data and select the top 5 neighbourhoods with the most listings
merge_mean <- merge (mean_price_neighbor, listing_sum, by= "neighbourhood_cleansed") %>% 
  arrange(-sum, -mean_neighbor) %>% 
  head(5)

# Assign each neighborhood and room type with specific color
color_table <- tibble(
  neighbourhood_cleansed = c('Westminster','Tower Hamlets','Hackney','Camden','Kensington and Chelsea'),
  color = c("salmon", "goldenrod1", "springgreen3", "lightslateblue", "violetred")
  )

ct_room <- tibble(
  room_type = c('Entire home/apt','Private room','Hotel room','Shared room'),
  color = c("yellow", "blue4", "lightblue", "darkgreen")
  )


# a factor variable in the same order as the color table
merge_mean$neighbourhood_cleansed <- factor(merge_mean$neighbourhood_cleansed, levels = color_table$neighbourhood_cleansed)

The neighbourhood with the highest mean price is:

# Plot the data
merge_mean %>% 
  ggplot(aes(x = reorder(neighbourhood_cleansed, mean_neighbor), y = mean_neighbor, fill = neighbourhood_cleansed)) +
  geom_col() +
  scale_fill_manual(values = color_table$color) +
  coord_flip() +
  theme_gray() +
  labs(x = "Neighbourhood Group", y = "Price") +
  geom_text(aes(label = round(mean_neighbor,digit = 2)), hjust = 2.0, color = "white", size = 3.5) +
  ggtitle("Mean Price comparison for each Neighbourhood", subtitle = "Price vs Neighbourhood") + 
  xlab("Neighbourhood") + 
  ylab("Mean Price ($)") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Well, what about the interaction between the mean price in those neighbourhoods and their room types?

# Count mean price for each room type in every neighborhood
property_mean <-  zeroVar.data %>% 
  group_by(neighbourhood_cleansed, room_type) %>% 
  summarize(mean_room = mean(price, na.rm = TRUE)) %>% # mean price of each room type within a neighbourhood
  ungroup()


# Count the total listings in each neighbourhood
property_sum <-  zeroVar.data %>% 
  filter(room_type %in% c("Private room", "Entire home/apt", "Shared room", "Hotel room")) %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarize(sum = n())  


# merge data
merge_mean_neighbor <- merge (property_mean, property_sum, by= "neighbourhood_cleansed") %>% 
  arrange(-sum, -mean_room) %>% 
  head(20)


# Sort the neighborhood from the most listing
merge_mean_neighbor$sorts <- factor(merge_mean_neighbor$room_type, levels=c('Hotel room','Entire home/apt','Shared room','Private room'))


# a factor variable in the same order as the color table
merge_mean_neighbor$neighbourhood_cleansed <- factor(merge_mean_neighbor$neighbourhood_cleansed, 
                                                     levels = color_table$neighbourhood_cleansed)

merge_mean_neighbor$room_type <- factor(merge_mean_neighbor$room_type, 
                                        levels = ct_room$room_type)


# Plot the data by the room type
merge_mean_neighbor %>% 
  ggplot()+
  geom_col(aes(x = reorder(neighbourhood_cleansed, mean_room), 
               y = mean_room,
               fill = neighbourhood_cleansed), 
           position = "dodge") +
  scale_fill_manual(values = color_table$color) +
  theme(legend.position = 'none') +
  ggtitle("Average Price by Room Type") +
  xlab("Neighbourhood") + 
  ylab("Mean Price") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank()) +
  facet_wrap(~ sorts, ncol = 2) +
  coord_flip()


Observation findings :

  1. Westminster has the highest mean price because it has the highest hotel room mean price.
  2. Hotel room prices have the largest impact on the average price in each neighbourhood.
  3. Private room prices have the least impact on the average price in each neighbourhood.

Case 3

The Interaction between Price and Reviews in Westminster and Hackney


Rumour has it that hosts often raise their listing prices at will. Do customers' reviews have any relationship with the listing price? The vertical lines in the plots below represent the mean price for each neighbourhood.

# Find the mean price for each neighbour
df1 <- zeroVar.data %>% 
  filter(neighbourhood_cleansed == "Westminster" | neighbourhood_cleansed == "Hackney") %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarise(meanprice2 = mean(price))


# Plot the interaction between Price and Number or Reviews
zeroVar.data %>% 
  filter(neighbourhood_cleansed == "Westminster" | neighbourhood_cleansed == "Hackney") %>% 
  ggplot(aes(price, number_of_reviews)) +
  geom_jitter(aes(col = number_of_reviews)) +
  scale_x_log10() +
  facet_wrap(~neighbourhood_cleansed) +
  labs(title = "Number of Reviews Vs Price", x="Price ($)", y= "Number of Reviews",
       color='Degree of Reviews') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_vline(data=filter(df1, neighbourhood_cleansed == "Westminster"), 
             aes(xintercept = meanprice2), colour="salmon") + 
  geom_vline(data=filter(df1, neighbourhood_cleansed == "Hackney"), 
             aes(xintercept = meanprice2), colour="springgreen3")
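As a rough numeric complement to the plots (a sketch, not part of the original analysis), the monotonic association between price and review count in the two neighbourhoods could be checked with a Spearman rank correlation:

# Sketch: Spearman rank correlation between price and number of reviews
zeroVar.data %>% 
  filter(neighbourhood_cleansed %in% c("Westminster", "Hackney")) %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarise(spearman_rho = cor(price, number_of_reviews, method = "spearman"))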


Observation findings :

  1. In both neighbourhoods, listings with a high number of reviews tend to be priced at or below the neighbourhood mean price.
  2. In both neighbourhoods, the higher the listing price, the fewer reviews it tends to have.
  3. Both plots show that customers leave more reviews for listings priced around or below their neighbourhood's mean price.

Case 4

Number of Listings over the Years in London


Airbnb has been an emerging unicorn company since around 2010, and numerous people have signed up as hosts over the years. Let's take a look at the growth of hosts in London!

case4 <- zeroVar.data %>% 
  mutate(host_year = year(host_since)) %>% 
  group_by(host_year) %>%
  summarise(num_listing_year = n(),
         avg_price_list = round(sum(price)/num_listing_year,2)) %>% 
  ungroup() %>% 
  mutate(text = glue("Year: {host_year}
                      Number of Listings: {number(num_listing_year, big.mark = ',', accuracy = 1)}
                      Avg. Price: ${avg_price_list}"
                      )) %>% 
  arrange(host_year) %>% 
  ggplot(aes(host_year, num_listing_year)) +
  geom_point(aes(text = text), color = "maroon", size = 2) +
  geom_line(aes(text = text), group = 1) +
  scale_x_continuous(breaks = seq(2008, 2020, 2)) +
  theme_bw() +
  labs(title = "Number of Listings per Year", 
       x="Year ", y= "Count")

ggplotly(case4, tooltip = "text")%>%
  config(displaylogo = FALSE)


Observation findings :

  1. There was steep growth in new hosts from 2010 to 2015.
  2. In 2015, the number of new listing hosts in London peaked.
  3. Unfortunately, in 2020 the number of new hosts in London dropped sharply, partly due to Covid-19 (and partly because the snapshot only covers listings up to June 2020).

Case 5

Reviews vs Response Time for the Three Neighbourhoods with the Most Listings

When guests feel satisfied, they tend to leave great reviews on the listing page, and one way to keep guests pleased is to reply promptly to their inquiries.

zeroVar.data %>% 
  filter(neighbourhood_cleansed %in% c('Westminster','Tower Hamlets','Hackney')) %>%
  filter(host_response_time %in% c('within an hour','within a few hours','within a day','a few days or more')) %>% 
  mutate(year_revs = year(last_review)) %>% 
  group_by(neighbourhood_cleansed, host_response_time, year_revs) %>%
  summarise(sums_rev = sum(number_of_reviews)) %>% 
  ungroup() %>% 
  na.omit() %>% 
  filter(year_revs >= 2017) %>%
  ggplot(aes(x = year_revs, y = sums_rev)) +
  geom_col(aes(fill = year_revs), position = "dodge")  +
  scale_fill_viridis_b() +
  theme(legend.position = 'none') +
  ggtitle("Number of Reviews vs Response Time ") +
  xlab("Year") + 
  ylab("Mean Number of Reviews") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank()) +
  facet_grid(neighbourhood_cleansed ~ host_response_time, scales = "free")


Observation findings :

  1. Over the years, the more promptly a host responds to guests' inquiries, the more reviews the host receives.
  2. In 2020, guests left more reviews on London listings than in any other year.

Case 6

Price vs Availability


We count only listings that are available for at least 50 days of the year. The first chart shows the distribution of listings' availability days during a year.

case6 <- zeroVar.data %>% 
  filter(!is.na(availability_365)) %>% 
  group_by(availability_365) %>% 
  mutate(count_avb = n(),
         mean_avb = round(sum(price)/count_avb,2)) %>% 
  arrange(count_avb) %>% 
  filter(availability_365 >= 50) %>% 
  mutate(text = glue("{availability_365} Days per year
                      Number of Listings: {number(count_avb, big.mark = ',', accuracy = 1)}
                      Avg. Price: ${number(mean_avb, big.mark = ',', accuracy = 1)}"
                      )) %>% 
  ggplot(aes(availability_365, count_avb)) +
  geom_point(aes(text = text), alpha = 0.5, color = "violetred") + 
  geom_line() +
  labs(title = " Availability During a Year", x =" Availability Days", y= "Count")+
  scale_x_continuous(breaks = seq(50, 365, 45)) +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(case6, tooltip = "text")%>%
  config(displaylogo = FALSE)


Observation findings :

  1. 347 availability days is the most common availability value among listings.
  2. London hosts also commonly list their properties with 72, 163, or 247 availability days.


The second chart shows the relationship between listings' availability days and their price. The points mark the maximum price for each number of availability days.

case6b <- zeroVar.data %>% 
  filter(!is.na(availability_365)) %>% 
  group_by(availability_365) %>% 
  mutate(count_avb = n(),
         max_avb = max(price),
         mean_avb = round(sum(price)/count_avb,2),
         min_avb = min(price)) %>% 
  mutate(text =glue("{availability_365} Days per year
                     There are {count_avb} Listings
                     
                     Max. Price: ${number(max_avb, big.mark = ',', accuracy = 1)}
                     Avg. Price: ${mean_avb}
                     Min. Price: ${min_avb}"
                      )) %>% 
  ggplot(aes(availability_365, price)) +
  geom_point(data = . %>% group_by(availability_365) %>% filter(price == max(price)), 
             aes(text = text),
             alpha = 0.2, color = "forestgreen") +
  geom_segment(aes(x=availability_365, xend=availability_365, y=0, yend=price)) +
  labs(title = " Availability Vs Price", x ="Availability Days", y= "Price")+
  scale_x_continuous(breaks = seq(0, 365, 45)) +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(case6b, tooltip = "text")%>%
  config(displaylogo = FALSE)


Observation findings :

  1. Although there is a significant gap between the maximum price and the average price, the average price still fluctuates across availability days.
  2. The listings with 344, 346, and 347 availability days have the highest listed prices.

Case 7

Listings Distribution in London

zeroVar.data %>% 
  leaflet() %>% 
  addProviderTiles(providers$Esri) %>% 
  # setView(-0.103894, 51.503971, zoom = 13) %>%
  addMarkers(lng = ~longitude, 
             lat = ~latitude,
             clusterOptions = markerClusterOptions(),
             popup = paste0("Host Name: ", zeroVar.data$host_name, 
                            "<br>Room Type: ", zeroVar.data$room_type,
                            "<br>Price: $", zeroVar.data$price,
                            "<br>Rating: ", zeroVar.data$review_scores_rating))