Synopsis

Airbnb is one of the world's most innovative online marketplaces for offering and arranging lodging (primarily homestays) and tourism experiences. Founded in 2008 by Brian Chesky, Joe Gebbia, and Nathan Blecharczyk, Airbnb has helped millions of hospitality entrepreneurs monetize their spaces and their passions while keeping the financial benefits of tourism in their communities. The Airbnb market is thriving in London, which had more than 80,000 listings as of June 2020.

The objective of this project is to understand the factors that influence the prices of Airbnb listings in London. It stems from curiosity about the specific factors that cause short-term rental prices to change so readily over time.

Packages Required

We begin by loading the packages that will be required throughout the analysis.

library(tidyverse)
# library(tidyr)
# library(dplyr)
# library(ggplot2)
library(lubridate)
library(caret)
library(data.table)
library(DT)
library(kableExtra)
library(knitr)
library(glue)
library(scales)
library(plotly)
library(leaflet)
library(rgdal)

The descriptions of the packages are in the table below.

Packages Description
tidyverse Collection of R packages (including tidyr, dplyr, and ggplot2) designed for data science that work harmoniously together
tidyr Reshaping data sets into a tidy format
dplyr Data manipulation
ggplot2 Customizable graphical representation
lubridate Parsing and formatting dates
caret Data pre-processing and feature selection
data.table Data manipulation that can reduce programming and compute time tremendously
DT An R interface to the DataTables JavaScript library
kableExtra Styling tables within R Markdown
knitr A general-purpose tool for dynamic report generation
glue String interpolation for labels and tooltips
scales Formatting numbers and axis labels
plotly Interactive plots
leaflet Interactive maps
rgdal Bindings to the Geospatial Data Abstraction Library for reading spatial data

Data Preparation

Now, let’s load the dataset into the R-Environment.

Importing Data

Data Source

This project uses data from the "Get the Data" section of the Inside Airbnb website. Inside Airbnb is a data project built by Murray Cox and designed by John Morris that compiles data scraped from the Airbnb website.

This project uses the dataset of Airbnb listings in London, United Kingdom. The dataset records information on more than 80,000 listings as of 11 June 2020 (downloaded in August 2020).

Read the Data

The dataset is in .csv format, so we use the read_csv() function (from readr, loaded with the tidyverse) to read it.

raw.data <- read_csv("det_listings.csv")
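Since data.table is already loaded, fread() could serve as a faster alternative for a file of this size; a minimal sketch, kept commented out like the optional library calls above (note that fread() returns a data.table rather than a tibble, and its column type guessing can differ slightly from read_csv()).

# Optional alternative: data.table::fread() is typically faster for large CSVs
# raw.data <- fread("det_listings.csv")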

Glimpse of the Data

After importing the dataset, let's take a peek at it!

The dataset has 83,711 rows and 106 columns.

# DATA INPUT: RAW DATASET
glimpse(raw.data)
## Rows: 83,711
## Columns: 106
## $ id                                           <dbl> 11551, 13913, 15400, 1...
## $ listing_url                                  <chr> "https://www.airbnb.co...
## $ scrape_id                                    <dbl> 2.020061e+13, 2.020061...
## $ last_scraped                                 <date> 2020-06-16, 2020-06-1...
## $ name                                         <chr> "Arty and Bright Londo...
## $ summary                                      <chr> "Unlike most rental ap...
## $ space                                        <chr> "Amenities Bedding: 1 ...
## $ description                                  <chr> "Unlike most rental ap...
## $ experiences_offered                          <chr> "family", "business", ...
## $ neighborhood_overview                        <chr> "Not even 10 minutes b...
## $ notes                                        <chr> "No Smoking (very stri...
## $ transit                                      <chr> "Tons of buses (24hrs)...
## $ access                                       <chr> "Guest will have acces...
## $ interaction                                  <chr> "No interaction with g...
## $ house_rules                                  <chr> "No Smoking (very stri...
## $ thumbnail_url                                <lgl> NA, NA, NA, NA, NA, NA...
## $ medium_url                                   <lgl> NA, NA, NA, NA, NA, NA...
## $ picture_url                                  <chr> "https://a0.muscache.c...
## $ xl_picture_url                               <lgl> NA, NA, NA, NA, NA, NA...
## $ host_id                                      <dbl> 43039, 54730, 60302, 6...
## $ host_url                                     <chr> "https://www.airbnb.co...
## $ host_name                                    <chr> "Adriano", "Alina", "P...
## $ host_since                                   <date> 2009-10-03, 2009-11-1...
## $ host_location                                <chr> "London, England, Unit...
## $ host_about                                   <chr> "Hello, I'm a friendly...
## $ host_response_time                           <chr> "within an hour", "N/A...
## $ host_response_rate                           <chr> "100%", "N/A", "100%",...
## $ host_acceptance_rate                         <chr> "100%", "89%", "83%", ...
## $ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, F...
## $ host_thumbnail_url                           <chr> "https://a0.muscache.c...
## $ host_picture_url                             <chr> "https://a0.muscache.c...
## $ host_neighbourhood                           <chr> "Brixton", "LB of Isli...
## $ host_listings_count                          <dbl> 3, 4, 1, 18, 3, 3, 2, ...
## $ host_total_listings_count                    <dbl> 3, 4, 1, 18, 3, 3, 2, ...
## $ host_verifications                           <chr> "['email', 'phone', 'r...
## $ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE...
## $ host_identity_verified                       <lgl> TRUE, TRUE, TRUE, TRUE...
## $ street                                       <chr> "London, United Kingdo...
## $ neighbourhood                                <chr> "Brixton", "LB of Isli...
## $ neighbourhood_cleansed                       <chr> "Lambeth", "Islington"...
## $ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA...
## $ city                                         <chr> "London", "Islington",...
## $ state                                        <chr> NA, "Greater London", ...
## $ zipcode                                      <chr> "SW9 8DG", "N4 3", "SW...
## $ market                                       <chr> "London", "London", "L...
## $ smart_location                               <chr> "London, United Kingdo...
## $ country_code                                 <chr> "GB", "GB", "GB", "GB"...
## $ country                                      <chr> "United Kingdom", "Uni...
## $ latitude                                     <dbl> 51.46225, 51.56802, 51...
## $ longitude                                    <dbl> -0.11732, -0.11121, -0...
## $ is_location_exact                            <lgl> TRUE, TRUE, TRUE, FALS...
## $ property_type                                <chr> "Apartment", "Apartmen...
## $ room_type                                    <chr> "Entire home/apt", "Pr...
## $ accommodates                                 <dbl> 4, 2, 2, 6, 2, 2, 2, 5...
## $ bathrooms                                    <dbl> 1.0, 1.0, 1.0, 2.0, 1....
## $ bedrooms                                     <dbl> 1, 1, 1, 3, 1, 1, 1, 3...
## $ beds                                         <dbl> 3, 0, 1, 3, 1, 0, 1, 3...
## $ bed_type                                     <chr> "Real Bed", "Real Bed"...
## $ amenities                                    <chr> "{TV,\"Cable TV\",Inte...
## $ square_feet                                  <dbl> NA, 538, NA, NA, NA, N...
## $ price                                        <chr> "$88.00", "$65.00", "$...
## $ weekly_price                                 <chr> "$645.00", "$333.00", ...
## $ monthly_price                                <chr> "$2,350.00", "$1,176.0...
## $ security_deposit                             <chr> "$400.00", "$100.00", ...
## $ cleaning_fee                                 <chr> "$35.00", "$15.00", "$...
## $ guests_included                              <dbl> 2, 1, 2, 3, 1, 1, 1, 1...
## $ extra_people                                 <chr> "$20.00", "$15.00", "$...
## $ minimum_nights                               <dbl> 2, 1, 10, 3, 3, 10, 1,...
## $ maximum_nights                               <dbl> 180, 29, 50, 365, 21, ...
## $ minimum_minimum_nights                       <dbl> 2, 1, 10, 3, 2, 10, 1,...
## $ maximum_minimum_nights                       <dbl> 2, 1, 10, 4, 3, 10, 1,...
## $ minimum_maximum_nights                       <dbl> 1125, 29, 50, 60, 21, ...
## $ maximum_maximum_nights                       <dbl> 1125, 29, 50, 365, 21,...
## $ minimum_nights_avg_ntm                       <dbl> 2.0, 1.0, 10.0, 3.2, 3...
## $ maximum_nights_avg_ntm                       <dbl> 1125.0, 29.0, 50.0, 29...
## $ calendar_updated                             <chr> "3 months ago", "14 mo...
## $ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE...
## $ availability_30                              <dbl> 12, 12, 0, 9, 11, 13, ...
## $ availability_60                              <dbl> 42, 42, 0, 39, 41, 43,...
## $ availability_90                              <dbl> 72, 72, 13, 69, 71, 73...
## $ availability_365                             <dbl> 347, 347, 288, 326, 34...
## $ calendar_last_scraped                        <date> 2020-06-16, 2020-06-1...
## $ number_of_reviews                            <dbl> 192, 21, 89, 42, 0, 12...
## $ number_of_reviews_ltm                        <dbl> 9, 5, 4, 2, 0, 8, 1, 1...
## $ first_review                                 <date> 2010-03-21, 2010-08-1...
## $ last_review                                  <date> 2020-03-26, 2020-02-2...
## $ review_scores_rating                         <dbl> 91, 97, 96, 94, NA, 96...
## $ review_scores_accuracy                       <dbl> 9, 10, 10, 10, NA, 10,...
## $ review_scores_cleanliness                    <dbl> 9, 10, 10, 9, NA, 10, ...
## $ review_scores_checkin                        <dbl> 10, 10, 10, 9, NA, 10,...
## $ review_scores_communication                  <dbl> 10, 10, 10, 9, NA, 10,...
## $ review_scores_location                       <dbl> 9, 9, 10, 10, NA, 9, 1...
## $ review_scores_value                          <dbl> 9, 9, 9, 9, NA, 10, 8,...
## $ requires_license                             <lgl> FALSE, FALSE, FALSE, F...
## $ license                                      <lgl> NA, NA, NA, NA, NA, NA...
## $ jurisdiction_names                           <lgl> NA, NA, NA, NA, NA, NA...
## $ instant_bookable                             <lgl> TRUE, FALSE, TRUE, TRU...
## $ is_business_travel_ready                     <lgl> FALSE, FALSE, FALSE, F...
## $ cancellation_policy                          <chr> "strict_14_with_grace_...
## $ require_guest_profile_picture                <lgl> FALSE, FALSE, TRUE, FA...
## $ require_guest_phone_verification             <lgl> TRUE, FALSE, TRUE, FAL...
## $ calculated_host_listings_count               <dbl> 2, 3, 1, 15, 2, 3, 2, ...
## $ calculated_host_listings_count_entire_homes  <dbl> 2, 1, 1, 15, 0, 0, 1, ...
## $ calculated_host_listings_count_private_rooms <dbl> 0, 2, 0, 0, 2, 3, 1, 0...
## $ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0...
## $ reviews_per_month                            <dbl> 1.54, 0.18, 0.70, 0.37...

Data Cleaning

The dataset we just imported is raw.

For exploratory data analysis, we need to remove or modify data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.

Quality Check

Missing Values

Let's see how many missing values there are in each column.

# Counting missing values in each column
numMissVal <- sapply(raw.data, function(x) sum(is.na(x)))

# Result table
kable(as.data.frame(numMissVal)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width = "100%", height = "250px")
numMissVal
id 0
listing_url 0
scrape_id 0
last_scraped 0
name 25
summary 4864
space 25948
description 3014
experiences_offered 0
neighborhood_overview 30119
notes 52472
transit 30845
access 40186
interaction 35013
house_rules 35960
thumbnail_url 83711
medium_url 83711
picture_url 0
xl_picture_url 83711
host_id 0
host_url 0
host_name 11
host_since 11
host_location 208
host_about 36128
host_response_time 11
host_response_rate 11
host_acceptance_rate 11
host_is_superhost 11
host_thumbnail_url 11
host_picture_url 11
host_neighbourhood 19231
host_listings_count 11
host_total_listings_count 11
host_verifications 0
host_has_profile_pic 11
host_identity_verified 11
street 0
neighbourhood 0
neighbourhood_cleansed 0
neighbourhood_group_cleansed 83711
city 453
state 14854
zipcode 1899
market 726
smart_location 0
country_code 0
country 0
latitude 0
longitude 0
is_location_exact 0
property_type 0
room_type 0
accommodates 0
bathrooms 133
bedrooms 164
beds 984
bed_type 3
amenities 0
square_feet 83333
price 0
weekly_price 77281
monthly_price 79131
security_deposit 29532
cleaning_fee 22104
guests_included 0
extra_people 0
minimum_nights 0
maximum_nights 0
minimum_minimum_nights 0
maximum_minimum_nights 0
minimum_maximum_nights 0
maximum_maximum_nights 0
minimum_nights_avg_ntm 0
maximum_nights_avg_ntm 0
calendar_updated 0
has_availability 0
availability_30 0
availability_60 0
availability_90 0
availability_365 0
calendar_last_scraped 0
number_of_reviews 0
number_of_reviews_ltm 0
first_review 21568
last_review 21568
review_scores_rating 23226
review_scores_accuracy 23286
review_scores_cleanliness 23276
review_scores_checkin 23335
review_scores_communication 23284
review_scores_location 23333
review_scores_value 23334
requires_license 0
license 83711
jurisdiction_names 83711
instant_bookable 0
is_business_travel_ready 0
cancellation_policy 0
require_guest_profile_picture 0
require_guest_phone_verification 0
calculated_host_listings_count 0
calculated_host_listings_count_entire_homes 0
calculated_host_listings_count_private_rooms 0
calculated_host_listings_count_shared_rooms 0
reviews_per_month 21568


Observation findings :

  1. Most of the URL-, host-, review-, and legal-related fields are largely blank.
  2. Relatively few listings offer weekly or monthly rates.

Dollar Signs

There are still dollar signs ($) and thousands separators in price, weekly_price, monthly_price, security_deposit, extra_people, and cleaning_fee, so these columns were read as character rather than numeric. We need to strip the "$" and "," and convert the columns to numeric before any analysis.

# The updated columns stored in `clean.data` data frame
clean.data <- copy(raw.data)

# remove the "$" tag
clean.data$price <- as.numeric(gsub('[$,]','', raw.data$price))
clean.data$weekly_price <- as.numeric(gsub('[$,]','', raw.data$weekly_price))
clean.data$monthly_price <- as.numeric(gsub('[$,]','', raw.data$monthly_price))
clean.data$security_deposit <- as.numeric(gsub('[$,]','', raw.data$security_deposit))
clean.data$extra_people <- as.numeric(gsub('[$,]','', raw.data$extra_people))
clean.data$cleaning_fee <- as.numeric(gsub('[$,]','', raw.data$cleaning_fee))

# result table
head(clean.data[c("price","weekly_price","monthly_price","security_deposit","cleaning_fee")])
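A more compact alternative worth noting is readr::parse_number() (readr ships with the tidyverse); it drops the currency symbol and the thousands separators in one call. A minimal sketch:

# Sketch: parse_number() strips "$" and "," and returns a numeric value
parse_number("$2,350.00")   # 2350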

Duplicated Rows

No duplicated rows are found.

# Check duplicated rows
clean.data[which(duplicated(clean.data) ==T), ] 
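A quick count gives the same answer; this one-liner is just a sanity check, not part of the original pipeline.

# Number of fully duplicated rows (expected to be 0)
sum(duplicated(clean.data))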

Filtering

Some columns add little or no value to our analysis, so we need to remove these columns.

Zero Variance Columns

Variance measures how spread out a set of values is. Zero variance means that all of the values in a column are identical, so the column carries no information for our analysis. These columns (together with near-zero-variance columns, which are dominated by a single value) are identified and removed using the nearZeroVar() function from the caret package.

Columns Removed:

# Calculate variance metrics for each column
zvCols <- nearZeroVar(clean.data, saveMetrics = TRUE)

# Extract the names of columns flagged as (near-)zero variance
ZVnames <- rownames(subset(zvCols, nzv == TRUE))
# Remove those columns
clean.data <- clean.data[ , !(names(clean.data) %in% ZVnames)]

# result
ZVnames
##  [1] "scrape_id"                                  
##  [2] "experiences_offered"                        
##  [3] "thumbnail_url"                              
##  [4] "medium_url"                                 
##  [5] "xl_picture_url"                             
##  [6] "host_has_profile_pic"                       
##  [7] "neighbourhood_group_cleansed"               
##  [8] "state"                                      
##  [9] "market"                                     
## [10] "country_code"                               
## [11] "country"                                    
## [12] "bed_type"                                   
## [13] "has_availability"                           
## [14] "requires_license"                           
## [15] "license"                                    
## [16] "jurisdiction_names"                         
## [17] "is_business_travel_ready"                   
## [18] "require_guest_profile_picture"              
## [19] "require_guest_phone_verification"           
## [20] "calculated_host_listings_count_shared_rooms"

Based on Unique Values - Character Columns

For character columns we look at uniqueness rather than variance. A character column in which most values are distinct (URLs, free-text descriptions, and so on) behaves like an identifier and provides no group-level information for the analysis, so character columns where more than 20% of the values are unique are removed.

Columns Removed:

# Start from the cleaned data; the filtered result is stored in `zeroVar.data`
zeroVar.data <- clean.data

# Count the distinct values in each character column
dist.df <-
  zeroVar.data %>% 
  select_if(is.character) %>% 
  summarise_all(list(~n_distinct(.)))

# Reshape to one row per column
dist.df <-
  dist.df %>% 
  gather(key = var_name, value = value, 1:ncol(dist.df))

# Proportion of unique values in each column
dist.df$percentUnique <- round(dist.df$value/nrow(zeroVar.data),2)

# Columns where more than 20% of the values are unique
unique.vars <- dist.df %>% 
  filter(percentUnique > 0.2) %>% 
  pull(var_name)

# Remove those columns
zeroVar.data <- zeroVar.data[,!(names(zeroVar.data) %in% unique.vars)]

# result
unique.vars
##  [1] "listing_url"           "name"                  "summary"              
##  [4] "space"                 "description"           "neighborhood_overview"
##  [7] "notes"                 "transit"               "access"               
## [10] "interaction"           "house_rules"           "picture_url"          
## [13] "host_url"              "host_about"            "host_thumbnail_url"   
## [16] "host_picture_url"      "zipcode"               "amenities"

Based on NA Values

Columns with more than 50% NA values are removed.

Columns Removed:

# Count NA values in each column
na_df <- zeroVar.data %>% 
  summarise_all(list(~sum(is.na(.))))

# Reshape to one row per column
na_df <- na_df %>% 
  gather(key = var_name, value = value, 1:ncol(na_df))

# Proportion of NA values in each column
na_df$numNa <- round(na_df$value/nrow(zeroVar.data),2)

# Columns with more than 50% NA values
na_var <- na_df %>% 
  filter(numNa > 0.5) %>% pull(var_name)

# Remove columns with more than 50% NA values
zeroVar.data <- zeroVar.data[,!(names(zeroVar.data) %in% na_var)]

# result
na_var
## [1] "square_feet"   "weekly_price"  "monthly_price"

Pattern Matching

Columns whose names contain the strings “night” or “location” are removed, except minimum_nights and review_scores_location, because the other matching columns are irrelevant to our analysis.

Columns Removed:

# Make the pattern
pattern <- colnames(
  zeroVar.data %>%
  select(contains("night"),
         contains("location")
         )
)
  


# Excluding "minimun_nights" and "review_scores_location" in the pattern
pattern <- pattern[!(pattern == "minimum_nights" | pattern == "review_scores_location")]

# Remove columns that has similar names with columns in the pattern
zeroVar.data <- zeroVar.data[,!(names(zeroVar.data) %in% pattern)]

# result
pattern
##  [1] "maximum_nights"         "minimum_minimum_nights" "maximum_minimum_nights"
##  [4] "minimum_maximum_nights" "maximum_maximum_nights" "minimum_nights_avg_ntm"
##  [7] "maximum_nights_avg_ntm" "host_location"          "smart_location"        
## [10] "is_location_exact"

Manual Removal

No other useful naming patterns remain, so the rest of the irrelevant columns are removed manually.

Columns Removed:

# List of columns that will be removed manually
dropCol <-
  c(
  "host_neighbourhood",
  "host_listings_count",
  "host_verifications",
  "host_identity_verified",
  "street",
  "neighbourhood",
  "city",
  "property_type",
  "calendar_updated",
  "calendar_last_scraped",
  "number_of_reviews_ltm",
  "first_review",
  "calculated_host_listings_count",
  "calculated_host_listings_count_entire_homes",
  "calculated_host_listings_count_private_rooms",
  "guests_included",
  "extra_people"
  )

zeroVar.data <- zeroVar.data[,!(names(zeroVar.data) %in% dropCol)]


# result
dropCol
##  [1] "host_neighbourhood"                          
##  [2] "host_listings_count"                         
##  [3] "host_verifications"                          
##  [4] "host_identity_verified"                      
##  [5] "street"                                      
##  [6] "neighbourhood"                               
##  [7] "city"                                        
##  [8] "property_type"                               
##  [9] "calendar_updated"                            
## [10] "calendar_last_scraped"                       
## [11] "number_of_reviews_ltm"                       
## [12] "first_review"                                
## [13] "calculated_host_listings_count"              
## [14] "calculated_host_listings_count_entire_homes" 
## [15] "calculated_host_listings_count_private_rooms"
## [16] "guests_included"                             
## [17] "extra_people"

Data Formatting

Some columns need to be converted to factors.

# data format : factor
zeroVar.data$host_name <- as.factor(zeroVar.data$host_name)
zeroVar.data$host_response_time <-  as.factor(zeroVar.data$host_response_time)
zeroVar.data$host_response_rate <- as.factor(zeroVar.data$host_response_rate)
zeroVar.data$host_acceptance_rate <- as.factor(zeroVar.data$host_acceptance_rate)
zeroVar.data$host_total_listings_count <- as.factor(zeroVar.data$host_total_listings_count)

zeroVar.data$neighbourhood_cleansed <- as.factor(zeroVar.data$neighbourhood_cleansed)
zeroVar.data$room_type <- as.factor(zeroVar.data$room_type)
zeroVar.data$cancellation_policy <- as.factor(zeroVar.data$cancellation_policy)


zeroVar.data$review_scores_rating <-  as.factor(zeroVar.data$review_scores_rating)
zeroVar.data$review_scores_accuracy <-  as.factor(zeroVar.data$review_scores_accuracy)
zeroVar.data$review_scores_cleanliness <-  as.factor(zeroVar.data$review_scores_cleanliness)
zeroVar.data$review_scores_checkin <-  as.factor(zeroVar.data$review_scores_checkin)
zeroVar.data$review_scores_communication <-  as.factor(zeroVar.data$review_scores_communication)
zeroVar.data$review_scores_location <-  as.factor(zeroVar.data$review_scores_location)
zeroVar.data$review_scores_value <-  as.factor(zeroVar.data$review_scores_value)

last_scraped, host_since, and last_review are converted to dates (ymd format) using the lubridate package.

zeroVar.data$last_scraped <- ymd(zeroVar.data$last_scraped)
zeroVar.data$host_since <- ymd(zeroVar.data$host_since)
zeroVar.data$last_review <- ymd(zeroVar.data$last_review)
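The same conversions could be written more compactly with dplyr's mutate() and across(); this is an equivalent sketch, not part of the original pipeline.

# Sketch: the same factor and date conversions via across()
factor_cols <- c("host_name", "host_response_time", "host_response_rate",
                 "host_acceptance_rate", "host_total_listings_count",
                 "neighbourhood_cleansed", "room_type", "cancellation_policy",
                 "review_scores_rating", "review_scores_accuracy",
                 "review_scores_cleanliness", "review_scores_checkin",
                 "review_scores_communication", "review_scores_location",
                 "review_scores_value")
date_cols <- c("last_scraped", "host_since", "last_review")

zeroVar.data <- zeroVar.data %>%
  mutate(across(all_of(factor_cols), as.factor),
         across(all_of(date_cols), ymd))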

NA Values Treatment

Problem Statement

After filtering the dataset, some columns still have missing values.

# Counting missing values in each column
clean_NAval <- sapply(zeroVar.data, function(x) sum(is.na(x)))

# Result table
kable(as.data.frame(clean_NAval)) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% 
  scroll_box(width = "100%", height = "250px")
clean_NAval
id 0
last_scraped 0
host_id 0
host_name 11
host_since 11
host_response_time 11
host_response_rate 11
host_acceptance_rate 11
host_is_superhost 11
host_total_listings_count 11
neighbourhood_cleansed 0
latitude 0
longitude 0
room_type 0
accommodates 0
bathrooms 133
bedrooms 164
beds 984
price 0
security_deposit 29532
cleaning_fee 22104
minimum_nights 0
availability_30 0
availability_60 0
availability_90 0
availability_365 0
number_of_reviews 0
last_review 21568
review_scores_rating 23226
review_scores_accuracy 23286
review_scores_cleanliness 23276
review_scores_checkin 23335
review_scores_communication 23284
review_scores_location 23333
review_scores_value 23334
instant_bookable 0
cancellation_policy 0
reviews_per_month 21568


Missing values can severely distort the distribution of the data, and there is no single best way to deal with them. Removing columns or rows with missing values can bias the analysis, and imputation does not necessarily give better results either.


Alvira Swalin gives a good overview of how to handle missing data on towardsdatascience.com; the flowchart from that article guides the approach taken below.

Problem Solving

Based on the flowchart, our dataset has missing values of both kinds: continuous (numeric) and categorical (factor). We apply imputation only to columns in numeric and factor format.


Continuous

The numeric columns that have missing values are bathrooms, bedrooms, beds, security_deposit, cleaning_fee, and reviews_per_month. We replace the missing values in these columns with the mean of the corresponding column.

First, get the averages of each numeric column.

# Find average value 
baths_mean <- mean(zeroVar.data$bathrooms,  na.rm = TRUE)
bedrms_mean <- mean(zeroVar.data$bedrooms,  na.rm = TRUE) 
beds_mean <- mean(zeroVar.data$beds,  na.rm = TRUE) 
seqdep_mean <- mean(zeroVar.data$security_deposit,  na.rm = TRUE) 
cleanfee_mean <- mean(zeroVar.data$cleaning_fee,  na.rm = TRUE)
monthlyreviews_mean <- mean(zeroVar.data$reviews_per_month, na.rm = TRUE)

Then replace the missing values with the averages of the corresponding columns.

zeroVar.data$bathrooms[is.na(zeroVar.data$bathrooms)] <- baths_mean
zeroVar.data$bedrooms[is.na(zeroVar.data$bedrooms)] <- bedrms_mean
zeroVar.data$beds[is.na(zeroVar.data$beds)] <- beds_mean
zeroVar.data$security_deposit[is.na(zeroVar.data$security_deposit)] <- seqdep_mean
zeroVar.data$cleaning_fee[is.na(zeroVar.data$cleaning_fee)] <- cleanfee_mean
zeroVar.data$reviews_per_month[is.na(zeroVar.data$reviews_per_month)] <- monthlyreviews_mean
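The replace-with-mean step could also be wrapped in a small helper to avoid repeating the pattern; impute_mean() below is a hypothetical name, shown only as a sketch equivalent to the code above.

# Sketch: a reusable mean-imputation helper (hypothetical)
impute_mean <- function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x)
zeroVar.data$bathrooms <- impute_mean(zeroVar.data$bathrooms)
zeroVar.data$bedrooms  <- impute_mean(zeroVar.data$bedrooms)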

Categorical

The factor columns that still have missing values are host_name, host_response_time, host_response_rate, host_acceptance_rate, host_total_listings_count, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, and review_scores_value.

We can treat the missing values in these categorical columns as levels of their own. As an example, look at the levels attribute of host_response_time: missing responses, recorded as the string "N/A" in the scrape, already appear as one of its levels.

levels(zeroVar.data$host_response_time)
## [1] "a few days or more" "N/A"                "within a day"      
## [4] "within a few hours" "within an hour"

Cleaned Dataset

Here’s the cleaned data set:

datatable(head(zeroVar.data, 50), 
          style = 'bootstrap', 
          class = 'table-bordered table-condensed')

Summary of Variables

A summary of the variables in the cleaned dataset is in the table below.

No.  Variable Class Description
1 id numeric Unique listing ID
2 last_scraped date Date the listing was last scraped
3 host_id numeric Host ID
4 host_name factor Name of the host
5 host_since date The date the user became a host
6 host_response_time factor How long the host takes to respond to a request
7 host_response_rate factor Percentage of requests the host responds to
8 host_acceptance_rate factor Percentage of requests the host accepts
9 host_is_superhost logical Whether the host is a superhost
10 host_total_listings_count factor Number of listings the host has
11 neighbourhood_cleansed factor Neighbourhood (borough) in London
12 latitude numeric Latitude coordinate
13 longitude numeric Longitude coordinate
14 room_type factor The type of room offered (Entire home/apt, Private room, Hotel room, or Shared room)
15 accommodates numeric The number of people the listing can accommodate
16 bathrooms numeric The number of bathrooms in the listing
17 bedrooms numeric The number of bedrooms in the listing
18 beds numeric The number of beds in the listing
19 price numeric The price to stay in the listing for one night
20 security_deposit numeric The security deposit for the listing
21 cleaning_fee numeric Additional one-off charge for cleaning, set by the host
22 minimum_nights numeric Minimum number of nights per booking
23 availability_30 numeric Number of days the listing is available for booking in the next 30 days
24 availability_60 numeric Number of days the listing is available for booking in the next 60 days
25 availability_90 numeric Number of days the listing is available for booking in the next 90 days
26 availability_365 numeric Number of days the listing is available for booking in the next 365 days
27 number_of_reviews numeric Number of reviews the listing has received
28 last_review date Date of the latest review
29 review_scores_rating factor Rating for the overall experience
30 review_scores_accuracy factor Rating for accuracy
31 review_scores_cleanliness factor Rating for cleanliness
32 review_scores_checkin factor Rating for check-in
33 review_scores_communication factor Rating for host communication
34 review_scores_location factor Rating for the listing location
35 review_scores_value factor Rating for value
36 instant_bookable logical Whether the listing is instantly bookable
37 cancellation_policy factor The type of cancellation policy
38 reviews_per_month numeric Number of reviews the listing receives per month

Exploratory Data Analysis & Data Visualization

Case 1

The Top 5 Neighbourhoods with the Most Listings in London

# Count the number of listings of each room type in each neighbourhood
property_df <-  zeroVar.data %>% 
  group_by(neighbourhood_cleansed, room_type) %>% 
  summarize(Freq = n()) %>% #  number of observations within a group
  ungroup()

# Count the total listing of each neighborhood
total_property <-  zeroVar.data %>% 
  filter(room_type %in% c("Private room", "Entire home/apt", "Shared room", "Hotel room")) %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarize(sum = n())
  
# Merge the data and keep the top 5 neighbourhoods (with their room types)
merge_prop <- merge (property_df, total_property, by= "neighbourhood_cleansed") %>% 
  arrange(-sum, -Freq) %>% 
  head(20)


# Sort the neighborhood from the most listing
merge_prop$size_f <- factor(merge_prop$neighbourhood_cleansed, levels=c('Westminster','Tower Hamlets','Hackney','Camden','Kensington and Chelsea'))


# Assign each neighborhood and room type with specific color
color_table <- tibble(
  neighbourhood_cleansed = c('Westminster','Tower Hamlets','Hackney','Camden','Kensington and Chelsea'),
  color = c("salmon", "goldenrod1", "springgreen3", "lightslateblue", "violetred")
  )

ct_room <- tibble(
  room_type = c('Entire home/apt','Private room','Hotel room','Shared room'),
  color = c("yellow", "blue4", "lightblue", "darkgreen")
  )


# a factor variable in the same order as the color table
merge_prop$neighbourhood_cleansed <- factor(merge_prop$neighbourhood_cleansed, levels = color_table$neighbourhood_cleansed)
merge_prop$room_type <- factor(merge_prop$room_type, levels = ct_room$room_type)

The top 5 neighbourhoods with the most listings are:

# Plot the data by the neighborhood

merge_prop %>% 
  ggplot() +
  geom_bar(aes(reorder(as.factor(neighbourhood_cleansed), Freq), Freq, fill=neighbourhood_cleansed), stat = 'identity') +
  geom_text(aes(neighbourhood_cleansed, sum, label = sum), hjust = 2.0,  color = "white") +
  labs(x="Room Type", y="Count") +
  scale_fill_manual(values = color_table$color) +
  theme(legend.position = 'none') +
  ggtitle("The Most Listing in London by Neighborhood") + 
  xlab("Neighbourhood") + 
  ylab("Number of Listings") +
  theme(legend.position = 'none',
        plot.title = element_text(color = 'black', size = 14, face = 'bold', hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text()) +
  coord_flip()


Now, let's break down each neighbourhood by room type!

# Plot the data by the room type
merge_prop %>% 
  ggplot() +
  geom_bar(aes(reorder(as.factor(room_type), Freq), Freq, fill=room_type), stat = 'identity') +
  scale_fill_manual(values = ct_room$color) +
  theme(legend.position = "none")+
  labs(title = 'Number of Room Type by Neighbourhood',
       x='Room Type', 
       y='Number of Listings') +
  facet_wrap(~size_f, ncol = 2) +
  theme(legend.position = 'none',
        plot.title = element_text(color = 'black', size = 14, face = 'bold', hjust = 0.5)) +
  coord_flip()


Observation findings :

  1. Entire home/apartment and private room listings account for most of the Airbnb listings in London.
  2. Hotel rooms and shared rooms account for the fewest listings.

Case 2

Mean Price Comparison of the Top 5 Neighbourhoods with the Most Listings in London


Let's find out how the mean price relates to the number of listings!

# Count the mean price for every neighborhood
mean_price_neighbor <-  zeroVar.data %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarize(mean_neighbor = mean(price, na.rm = TRUE)) %>% # mean listing price within each neighbourhood
  ungroup()

# Count the total listing of each neighborhood
listing_sum <-  zeroVar.data %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarize(sum = n())
  
# Merge the data and select the top 5 neighbourhoods with the most listings
merge_mean <- merge (mean_price_neighbor, listing_sum, by= "neighbourhood_cleansed") %>% 
  arrange(-sum, -mean_neighbor) %>% 
  head(5)

# Assign each neighborhood and room type with specific color
color_table <- tibble(
  neighbourhood_cleansed = c('Westminster','Tower Hamlets','Hackney','Camden','Kensington and Chelsea'),
  color = c("salmon", "goldenrod1", "springgreen3", "lightslateblue", "violetred")
  )

ct_room <- tibble(
  room_type = c('Entire home/apt','Private room','Hotel room','Shared room'),
  color = c("yellow", "blue4", "lightblue", "darkgreen")
  )


# a factor variable in the same order as the color table
merge_mean$neighbourhood_cleansed <- factor(merge_mean$neighbourhood_cleansed, levels = color_table$neighbourhood_cleansed)

The neighbourhood with the highest mean price is:

# Plot the data
merge_mean %>% 
  ggplot(aes(x = reorder(neighbourhood_cleansed, mean_neighbor), y = mean_neighbor, fill = neighbourhood_cleansed)) +
  geom_col() +
  scale_fill_manual(values = color_table$color) +
  coord_flip() +
  theme_gray() +
  labs(x = "Neighbourhood Group", y = "Price") +
  geom_text(aes(label = round(mean_neighbor,digit = 2)), hjust = 2.0, color = "white", size = 3.5) +
  ggtitle("Mean Price comparison for each Neighbourhood", subtitle = "Price vs Neighbourhood") + 
  xlab("Neighbourhood") + 
  ylab("Mean Price ($)") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank())

Well, what about the interaction between the mean price in those neighbourhoods and their room types?

# Count mean price for each room type in every neighborhood
property_mean <-  zeroVar.data %>% 
  group_by(neighbourhood_cleansed, room_type) %>% 
  summarize(mean_room = mean(price, na.rm = TRUE)) %>% # mean price of each room type within a neighbourhood
  ungroup()


# Count the total listings in each neighbourhood
property_sum <-  zeroVar.data %>% 
  filter(room_type %in% c("Private room", "Entire home/apt", "Shared room", "Hotel room")) %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarize(sum = n())  


# merge data
merge_mean_neighbor <- merge (property_mean, property_sum, by= "neighbourhood_cleansed") %>% 
  arrange(-sum, -mean_room) %>% 
  head(20)


# Sort the neighborhood from the most listing
merge_mean_neighbor$sorts <- factor(merge_mean_neighbor$room_type, levels=c('Hotel room','Entire home/apt','Shared room','Private room'))


# a factor variable in the same order as the color table
merge_mean_neighbor$neighbourhood_cleansed <- factor(merge_mean_neighbor$neighbourhood_cleansed, 
                                                     levels = color_table$neighbourhood_cleansed)

merge_mean_neighbor$room_type <- factor(merge_mean_neighbor$room_type, 
                                        levels = ct_room$room_type)


# Plot the data by the room type
merge_mean_neighbor %>% 
  ggplot()+
  geom_col(aes(x = reorder(neighbourhood_cleansed, mean_room), 
               y = mean_room,
               fill = neighbourhood_cleansed), 
           position = "dodge") +
  scale_fill_manual(values = color_table$color) +
  theme(legend.position = 'none') +
  ggtitle("Average Price by Room Type") +
  xlab("Neighbourhood") + 
  ylab("Mean Price") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank()) +
  facet_wrap(~ sorts, ncol = 2) +
  coord_flip()


Observation findings :

  1. Westminster has the highest mean price because it has the highest hotel room mean price.
  2. Hotel room prices have the largest impact on the average price in each neighbourhood.
  3. Private room prices have the least impact on the average price in each neighbourhood.

Case 3

The Interaction between Price and Reviews in Westminster and Hackney


Rumour has it that hosts often raise their listing prices at will. Do customers' reviews have any relationship with the listing price? The vertical lines in the plots below represent the mean price for each neighbourhood.

# Find the mean price for each neighbour
df1 <- zeroVar.data %>% 
  filter(neighbourhood_cleansed == "Westminster" | neighbourhood_cleansed == "Hackney") %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarise(meanprice2 = mean(price))


# Plot the interaction between Price and Number or Reviews
zeroVar.data %>% 
  filter(neighbourhood_cleansed == "Westminster" | neighbourhood_cleansed == "Hackney") %>% 
  ggplot(aes(price, number_of_reviews)) +
  geom_jitter(aes(col = number_of_reviews)) +
  scale_x_log10() +
  facet_wrap(~neighbourhood_cleansed) +
  labs(title = "Number of Reviews Vs Price", x="Price ($)", y= "Number of Reviews",
       color='Degree of Reviews') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_vline(data=filter(df1, neighbourhood_cleansed == "Westminster"), 
             aes(xintercept = meanprice2), colour="salmon") + 
  geom_vline(data=filter(df1, neighbourhood_cleansed == "Hackney"), 
             aes(xintercept = meanprice2), colour="springgreen3")
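As a rough numeric complement to the plots (a sketch, not part of the original analysis), the monotonic association between price and review count in the two neighbourhoods could be checked with a Spearman rank correlation:

# Sketch: Spearman rank correlation between price and number of reviews
zeroVar.data %>% 
  filter(neighbourhood_cleansed %in% c("Westminster", "Hackney")) %>% 
  group_by(neighbourhood_cleansed) %>% 
  summarise(spearman_rho = cor(price, number_of_reviews, method = "spearman"))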


Observation findings :

  1. In both neighbourhoods, listings with a high number of reviews tend to be priced at or below the neighbourhood mean price.
  2. In both neighbourhoods, the higher the listing price, the fewer reviews it tends to have.
  3. Both plots show that customers leave more reviews for listings priced around or below their neighbourhood's mean price.

Case 4

Number of Listings over the Years in London


Airbnb has been an emerging unicorn company since around 2010, and numerous people have signed up as hosts over the years. Let's take a look at the growth of hosts in London!

case4 <- zeroVar.data %>% 
  mutate(host_year = year(host_since)) %>% 
  group_by(host_year) %>%
  summarise(num_listing_year = n(),
         avg_price_list = round(sum(price)/num_listing_year,2)) %>% 
  ungroup() %>% 
  mutate(text = glue("Year: {host_year}
                      Number of Listings: {number(num_listing_year, big.mark = ',', accuracy = 1)}
                      Avg. Price: ${avg_price_list}"
                      )) %>% 
  arrange(host_year) %>% 
  ggplot(aes(host_year, num_listing_year)) +
  geom_point(aes(text = text), color = "maroon", size = 2) +
  geom_line(aes(text = text), group = 1) +
  scale_x_continuous(breaks = seq(2008, 2020, 2)) +
  theme_bw() +
  labs(title = "Number of Listings per Year", 
       x="Year ", y= "Count")

ggplotly(case4, tooltip = "text")%>%
  config(displaylogo = FALSE)


Observation findings :

  1. There was steep growth in new hosts from 2010 to 2015.
  2. In 2015, the number of new listing hosts in London peaked.
  3. Unfortunately, in 2020 the number of new hosts in London dropped sharply, partly due to Covid-19 (and partly because the snapshot only covers listings up to June 2020).

Case 5

Reviews vs Response Time for the Three Neighbourhoods with the Most Listings

When guests feel satisfied, they tend to leave great reviews on the listing page, and one way to keep guests pleased is to reply promptly to their inquiries.

zeroVar.data %>% 
  filter(neighbourhood_cleansed %in% c('Westminster','Tower Hamlets','Hackney')) %>%
  filter(host_response_time %in% c('within an hour','within a few hours','within a day','a few days or more')) %>% 
  mutate(year_revs = year(last_review)) %>% 
  group_by(neighbourhood_cleansed, host_response_time, year_revs) %>%
  summarise(sums_rev = sum(number_of_reviews)) %>% 
  ungroup() %>% 
  na.omit() %>% 
  filter(year_revs >= 2017) %>%
  ggplot(aes(x = year_revs, y = sums_rev)) +
  geom_col(aes(fill = year_revs), position = "dodge")  +
  scale_fill_viridis_b() +
  theme(legend.position = 'none') +
  ggtitle("Number of Reviews vs Response Time ") +
  xlab("Year") + 
  ylab("Mean Number of Reviews") +
  theme(legend.position = "none",
        plot.title = element_text(color = "black", size = 14, face = "bold", hjust = 0.5),
        plot.subtitle = element_text(color = "darkblue", hjust = 0.5),
        axis.title.y = element_text(),
        axis.title.x = element_text(),
        axis.ticks = element_blank()) +
  facet_grid(neighbourhood_cleansed ~ host_response_time, scales = "free")


Observation findings :

  1. Over the years, the more promptly a host responds to guests' inquiries, the more reviews the host receives.
  2. In 2020, guests left more reviews on London listings than in any other year.

Case 6

Price vs Availability


We count only listings that are available for at least 50 days of the year. The first chart shows the distribution of listings' availability days during a year.

case6 <- zeroVar.data %>% 
  filter(!is.na(availability_365)) %>% 
  group_by(availability_365) %>% 
  mutate(count_avb = n(),
         mean_avb = round(sum(price)/count_avb,2)) %>% 
  arrange(count_avb) %>% 
  filter(availability_365 >= 50) %>% 
  mutate(text = glue("{availability_365} Days per year
                      Number of Listings: {number(count_avb, big.mark = ',', accuracy = 1)}
                      Avg. Price: ${number(mean_avb, big.mark = ',', accuracy = 1)}"
                      )) %>% 
  ggplot(aes(availability_365, count_avb)) +
  geom_point(aes(text = text), alpha = 0.5, color = "violetred") + 
  geom_line() +
  labs(title = " Availability During a Year", x =" Availability Days", y= "Count")+
  scale_x_continuous(breaks = seq(50, 365, 45)) +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(case6, tooltip = "text")%>%
  config(displaylogo = FALSE)


Observation findings :

  1. 347 availability days is the most common availability value among listings.
  2. London hosts also commonly list their properties with 72, 163, or 247 availability days.


The second chart shows the relationship between listings' availability days and their price. The points mark the maximum price for each number of availability days.

case6b <- zeroVar.data %>% 
  filter(!is.na(availability_365)) %>% 
  group_by(availability_365) %>% 
  mutate(count_avb = n(),
         max_avb = max(price),
         mean_avb = round(sum(price)/count_avb,2),
         min_avb = min(price)) %>% 
  mutate(text =glue("{availability_365} Days per year
                     There are {count_avb} Listings
                     
                     Max. Price: ${number(max_avb, big.mark = ',', accuracy = 1)}
                     Avg. Price: ${mean_avb}
                     Min. Price: ${min_avb}"
                      )) %>% 
  ggplot(aes(availability_365, price)) +
  geom_point(data = . %>% group_by(availability_365) %>% filter(price == max(price)), 
             aes(text = text),
             alpha = 0.2, color = "forestgreen") +
  geom_segment(aes(x=availability_365, xend=availability_365, y=0, yend=price)) +
  labs(title = " Availability Vs Price", x ="Availability Days", y= "Price")+
  scale_x_continuous(breaks = seq(0, 365, 45)) +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(case6b, tooltip = "text")%>%
  config(displaylogo = FALSE)


Observation findings :

  1. Although there is a significant gap between the maximum price and the average price, the average price still fluctuates across availability days.
  2. The listings with 344, 346, and 347 availability days have the highest listed prices.

Case 7

Listings Distribution in London

zeroVar.data %>% 
  leaflet() %>% 
  addProviderTiles(providers$Esri) %>% 
  # setView(-0.103894, 51.503971, zoom = 13) %>%
  addMarkers(lng = ~longitude, 
             lat = ~latitude,
             clusterOptions = markerClusterOptions(),
             popup = paste0("Host Name: ", zeroVar.data$host_name, 
                            "<br>Room Type: ", zeroVar.data$room_type,
                            "<br>Price: $", zeroVar.data$price,
                            "<br>Rating: ", zeroVar.data$review_scores_rating))