For this section, we demonstrate how to separate, clean and upload host and listing data taken from Airbnb for Singapore.
The following libraries are used for data cleaning and database construction
# Data cleaning libraries
library(readr)
library(tidyr)
library(stringr)
library(tibble)
library(dplyr)
# Database libraries
library(DBI)
library(RSQLite)
# ER Diagram libraries
library(dm)
Begin by loading the downloaded data into R using the
read_csv()
command from the readr
library.
data <- read_csv("../data/listings.csv.gz")
We also need to form a database connection object:
con <- dbConnect(RSQLite::SQLite(), "../db/airdb.SQLite")
Before moving on to the data processing stage, we first define a few useful functions.
remove_live_database()
functionWhile we are actively working on our database construction, we may
find it useful to periodically disconnect and rebuild our database from
scratch. Having to close R
, delete the
airdb.SQLite
file and reopen our session repeatedly can be
tedious. To work around this, we use the following
remove_live_database()
function to disconnect and delete
any current working database while R
is still open:
remove_live_database <- function(con){
if(file.exists("../db/airdb.SQLite")){
if (exists("con")){
dbDisconnect(con)
}
file.remove("../db/airdb.SQLite")
}
}
insert_to_sql()
functionIn order to facilitate the insertion of data into our
RSQLite
database, we use the following
insert_to_sql()
function. This function performs a number
of operations prior to insertion:
data
argument.NA
values are in the correct form.
For all non NA
values, replace the double quotes with
single quotes and bookend all strings with double quotes.NA
values to NULL
and trim all additional whitespace.\\
’s. Finally, use the paste0()
function
to convert our string into a query prior to insertion in our database’s
corresponding relevant table.insert_to_sql <- function(con, table, data){
# (1)
column_name <- paste(names(data), collapse = ", ")
# (2)
data_new <- data %>%
mutate_if(is.character, function(x) ifelse(is.na(x), NA, x %>%
str_replace_all('"', "'") %>% # Replace " with '
paste0('"', . , '"') # Add " before and after string
)
)
value_data <- apply(data_new, MARGIN = 1,
function(x) x %>%
paste(collapse = ",") %>% # Join all column into single string
str_replace_all("\\bNA\\b", "NULL") %>% # Create NULL from NA
str_trim() # remove unnecessary whitespace
)
# (4)
query_value <- paste(value_data) %>%
paste0("(", ., ")") %>% # Add bracket before and after string
paste(collapse = ", ") %>% # Join all values into single string
str_remove_all("\\\\") %>% # Remove \\ from string
paste0("INSERT INTO ", table, "(", column_name, ") VALUES ", .)
# (5)
dbSendQuery(con, query_value)
}
The data used in for this project is provided by Airbnb and can be
found here. To
get started, first download the listings.csv.gz
from the
provided link. In order to help streamline the process of constructing
our database, it is beneficial to split the .csv
file into
a listing
table and a host_info
table. The E/R
(Entity/Relationship) diagram in the next section shows how these two
tables are related and the attributes they each hold.
We can examine our initial data using the glimpse()
command from the tibble
library.
data %>% glimpse()
Rows: 4,099
Columns: 74
$ id <dbl> 5.955499e+17, 5.064600e+0…
$ listing_url <chr> "https://www.airbnb.com/r…
$ scrape_id <dbl> 2.022062e+13, 2.022062e+1…
$ last_scraped <date> 2022-06-22, 2022-06-22, …
$ name <chr> "HCK[R&F] 2BR 2-4pax City…
$ description <chr> "The location is quite co…
$ neighborhood_overview <chr> "Located at Bespoke @ JB …
$ picture_url <chr> "https://a0.muscache.com/…
$ host_id <dbl> 95993346, 227796, 367042,…
$ host_url <chr> "https://www.airbnb.com/u…
$ host_name <chr> "Kylie", "Sujatha", "Beli…
$ host_since <date> 2016-09-20, 2010-09-08, …
$ host_location <chr> "Johor, Malaysia", "Singa…
$ host_about <chr> NA, "I am a working profe…
$ host_response_time <chr> "within an hour", "a few …
$ host_response_rate <chr> "100%", "0%", "100%", "10…
$ host_acceptance_rate <chr> "100%", "N/A", "100%", "1…
$ host_is_superhost <lgl> FALSE, FALSE, FALSE, FALS…
$ host_thumbnail_url <chr> "https://a0.muscache.com/…
$ host_picture_url <chr> "https://a0.muscache.com/…
$ host_neighbourhood <chr> NA, "Bukit Timah", "Tampi…
$ host_listings_count <dbl> 1, 1, 6, 6, 6, 30, 30, 6,…
$ host_total_listings_count <dbl> 1, 1, 6, 6, 6, 30, 30, 6,…
$ host_verifications <chr> "['email', 'phone']", "['…
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ host_identity_verified <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ neighbourhood <chr> "Johor Bahru, Johor, Mala…
$ neighbourhood_cleansed <chr> "Woodlands", "Bukit Timah…
$ neighbourhood_group_cleansed <chr> "North Region", "Central …
$ latitude <dbl> 1.45838, 1.33432, 1.34537…
$ longitude <dbl> 103.7703, 103.7852, 103.9…
$ property_type <chr> "Entire rental unit", "Pr…
$ room_type <chr> "Entire home/apt", "Priva…
$ accommodates <dbl> 4, 2, 6, 2, 3, 1, 1, 6, 2…
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N…
$ bathrooms_text <chr> "1 bath", "1 bath", "1 pr…
$ bedrooms <dbl> 2, 1, 2, 1, 1, 1, 1, 3, 1…
$ beds <dbl> 2, 1, 3, 1, 2, 1, 1, 5, 1…
$ amenities <chr> "[\"Free washer \\u2013 I…
$ price <chr> "$58.00", "$80.00", "$143…
$ minimum_nights <dbl> 1, 92, 92, 92, 92, 14, 14…
$ maximum_nights <dbl> 1125, 730, 1125, 1125, 11…
$ minimum_minimum_nights <dbl> 1, 92, 92, 92, 92, 14, 14…
$ maximum_minimum_nights <dbl> 1, 92, 92, 92, 92, 14, 14…
$ minimum_maximum_nights <dbl> 1125, 730, 1125, 1125, 11…
$ maximum_maximum_nights <dbl> 1125, 730, 1125, 1125, 11…
$ minimum_nights_avg_ntm <dbl> 1, 92, 92, 92, 92, 14, 14…
$ maximum_nights_avg_ntm <dbl> 1125, 730, 1125, 1125, 11…
$ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N…
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ availability_30 <dbl> 0, 30, 30, 30, 30, 5, 13,…
$ availability_60 <dbl> 0, 60, 60, 60, 60, 13, 43…
$ availability_90 <dbl> 17, 90, 90, 90, 90, 43, 7…
$ availability_365 <dbl> 18, 365, 365, 365, 365, 3…
$ calendar_last_scraped <date> 2022-06-22, 2022-06-22, …
$ number_of_reviews <dbl> 8, 18, 20, 24, 47, 21, 13…
$ number_of_reviews_ltm <dbl> 8, 0, 0, 0, 0, 1, 0, 0, 0…
$ number_of_reviews_l30d <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ first_review <date> 2022-04-14, 2014-04-18, …
$ last_review <date> 2022-05-01, 2014-12-26, …
$ review_scores_rating <dbl> 5.00, 4.56, 4.44, 4.16, 4…
$ review_scores_accuracy <dbl> 4.88, 4.72, 4.37, 4.22, 4…
$ review_scores_cleanliness <dbl> 4.88, 4.78, 4.00, 4.09, 4…
$ review_scores_checkin <dbl> 5.00, 4.78, 4.63, 4.43, 4…
$ review_scores_communication <dbl> 5.00, 4.94, 4.78, 4.43, 4…
$ review_scores_location <dbl> 4.88, 4.72, 4.26, 4.17, 4…
$ review_scores_value <dbl> 4.88, 4.50, 4.32, 4.04, 4…
$ license <chr> NA, NA, NA, NA, NA, "S039…
$ instant_bookable <lgl> TRUE, FALSE, FALSE, TRUE,…
$ calculated_host_listings_count <dbl> 1, 1, 6, 6, 6, 50, 50, 6,…
$ calculated_host_listings_count_entire_homes <dbl> 1, 0, 0, 0, 0, 2, 2, 0, 1…
$ calculated_host_listings_count_private_rooms <dbl> 0, 1, 6, 6, 6, 48, 48, 6,…
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ reviews_per_month <dbl> 3.43, 0.18, 0.16, 0.18, 0…
From here, we will work on separating and cleaning the two respective tables prior to insertion in our database.
Before we separate the data into two separate tables, we perform the following preprocessing steps on all entries:
date
to the type
character
to help process NA
values.NA
representations
(blanks, None, N/A, NA) to NA
.data
.data <- data %>%
# (1) Convert dates to characters for NA values
mutate(last_scraped = as.character(last_scraped),
host_since = as.character(host_since),
calendar_last_scraped = as.character(calendar_last_scraped),
first_review = as.character(first_review),
last_review = as.character(last_review),
) %>%
# (2) Homogenize NA values
#*# Taken from: https://rpubs.com/Argaadya/create_table_sql
mutate_all(function(x) ifelse(x == "" | x == "None" | x == "N/A", NA, x)) %>% #*#
# mutate_all(function(x) ifelse(is.na(x), "NULL", x)) %>%
# (3) Convert character strings back to date type
mutate(last_scraped = as.Date(last_scraped),
host_since = as.Date(host_since),
calendar_last_scraped = as.Date(calendar_last_scraped),
first_review = as.Date(first_review),
last_review = as.Date(last_review))
We are now ready to move on to constructing our
host_info
table.
Since a host can have many listings, it’s beneficial to split our
initial data into two tables, one containing the info related to
hosts
and the other containing the info related to
listings
, before inserting into our database. The first
table that we create, is the host_info
table with the same
attributes as shown in the E/R Diagram above.
We are now ready to extract and clean the data from our initial
data
table in order to construct our host_data
table. This is done and the following four steps:
data
table. Note that we use the :
syntax to grab many columns
at once. This syntax is inclusive.distinct()
function
from the dplyr
library.host_since
column back to the type
character
. This is required since RSQLite
does
not support data of the type date
. We will convert this
column back to the correct type when performing queries later.str_remove_all()
function from the
stringr
library to convert the
host_verifications
sublists into simple strings. # (1) Extract host data
host_data <- data %>%
select(host_id:host_identity_verified,
calculated_host_listings_count:calculated_host_listings_count_shared_rooms)
# (2) Remove duplicate values
host_data <- host_data %>% distinct()
# (3) Convert dates
# Note that this will need to converted back to type = date for analysis
host_data <- host_data %>% mutate(host_since = as.character(host_since))
# (4) Clean host verification column
host_data <-
host_data %>%
mutate(host_verifications = str_remove_all(host_verifications, "[\\'\\[\\]]"))
We can now view our clean data:
rmarkdown::paged_table(host_data)
Now that our host_data
table is clean. We can create the
equivalent table as a query, initially as a string listing the table’s
columns, before creating the empty table in our database using our
con
object:
# Create host info query
query <- "CREATE TABLE host_info(
host_id INT,
host_url VARCHAR(50),
host_name VARCHAR(100),
host_since VARCHAR(50),
host_location VARCHAR(500),
host_about VARCHAR(10000),
host_response_time VARCHAR(50),
host_response_rate VARCHAR(50),
host_acceptance_rate VARCHAR(50),
host_is_superhost BOOLEAN,
host_thumbnail_url VARCHAR(500),
host_picture_url VARCHAR(500),
host_neighbourhood VARCHAR(50),
host_listings_count INT,
host_total_listings_count INT,
host_verifications VARCHAR(500),
host_has_profile_pic BOOLEAN,
host_identity_verified BOOLEAN,
calculated_host_listings_count INT,
calculated_host_listings_count_entire_homes INT,
calculated_host_listings_count_private_rooms INT,
calculated_host_listings_count_shared_rooms INT,
PRIMARY KEY(host_id)
)"
Now create the empty table in our database.
dbRemoveTable(con, "host_info")
dbSendQuery(con, query)
<SQLiteResult>
SQL CREATE TABLE host_info(
host_id INT,
host_url VARCHAR(50),
host_name VARCHAR(100),
host_since VARCHAR(50),
host_location VARCHAR(500),
host_about VARCHAR(10000),
host_response_time VARCHAR(50),
host_response_rate VARCHAR(50),
host_acceptance_rate VARCHAR(50),
host_is_superhost BOOLEAN,
host_thumbnail_url VARCHAR(500),
host_picture_url VARCHAR(500),
host_neighbourhood VARCHAR(50),
host_listings_count INT,
host_total_listings_count INT,
host_verifications VARCHAR(500),
host_has_profile_pic BOOLEAN,
host_identity_verified BOOLEAN,
calculated_host_listings_count INT,
calculated_host_listings_count_entire_homes INT,
calculated_host_listings_count_private_rooms INT,
calculated_host_listings_count_shared_rooms INT,
PRIMARY KEY(host_id)
)
ROWS Fetched: 0 [complete]
Changed: 0
Next, we check the schema of our database so far to ensure that it was loaded correctly.
res <- dbSendQuery(con, "PRAGMA table_info([host_info]);")
fetch(res)
cid name type notnull
1 0 host_id INT 0
2 1 host_url VARCHAR(50) 0
3 2 host_name VARCHAR(100) 0
4 3 host_since VARCHAR(50) 0
5 4 host_location VARCHAR(500) 0
6 5 host_about VARCHAR(10000) 0
7 6 host_response_time VARCHAR(50) 0
8 7 host_response_rate VARCHAR(50) 0
9 8 host_acceptance_rate VARCHAR(50) 0
10 9 host_is_superhost BOOLEAN 0
11 10 host_thumbnail_url VARCHAR(500) 0
12 11 host_picture_url VARCHAR(500) 0
13 12 host_neighbourhood VARCHAR(50) 0
14 13 host_listings_count INT 0
15 14 host_total_listings_count INT 0
16 15 host_verifications VARCHAR(500) 0
17 16 host_has_profile_pic BOOLEAN 0
18 17 host_identity_verified BOOLEAN 0
19 18 calculated_host_listings_count INT 0
20 19 calculated_host_listings_count_entire_homes INT 0
21 20 calculated_host_listings_count_private_rooms INT 0
22 21 calculated_host_listings_count_shared_rooms INT 0
dflt_value pk
1 NA 1
2 NA 0
3 NA 0
4 NA 0
5 NA 0
6 NA 0
7 NA 0
8 NA 0
9 NA 0
10 NA 0
11 NA 0
12 NA 0
13 NA 0
14 NA 0
15 NA 0
16 NA 0
17 NA 0
18 NA 0
19 NA 0
20 NA 0
21 NA 0
22 NA 0
dbClearResult(res)
Finally, insert our host_data
table into the equivalent
table in our RSQLite
database using the
insert_to_sql()
function as defined above.
insert_to_sql(con, "host_info", host_data)
We can verify the contents of our newly created and populated
RSQLite
database table host_info
by viewing
the first 10 rows of each column as follows:
res <- dbSendQuery(con, "SELECT * FROM host_info LIMIT 10")
out_db <- fetch(res)
dbClearResult(res)
rmarkdown::paged_table(out_db)
We are now ready to move on to constructing our listing
table. This table contains information related to the types of listings
available and their relevant attributes. The process of constructing the
listing
table is very similar to the process of
constructing the host_info
table.
In order to prepare the relevant columns related to listing information we perform the following:
host_data
.price
column.host_verifications
column to the amenities
column.date
to the type
character
as is required for insertion into our
database.# (1) Remove host_data columns
listing_data <- data %>%
select( - names(host_data)[-1])
# (2) Remove extraneous columns
listing_data <- listing_data %>%
select(-c(license, calendar_updated, bathrooms, scrape_id))
# (3) Remove dollar signs from price column
listing_data <- listing_data %>%
mutate(price = str_remove_all(price, "[$,]") %>%
as.numeric()
)
# (4) Transform amenities and host verification column
listing_data <- listing_data %>%
mutate(amenities = str_remove_all(amenities, "[\"\\'\\[\\]]"))
# (5) Convert dates to character
listing_data <-
listing_data %>%
mutate(last_scraped = as.character(last_scraped),
calendar_last_scraped = as.character(calendar_last_scraped),
first_review = as.character(first_review),
last_review = as.character(last_review))
Just as before, create our listing
database table
initially as a string to be inserted as a query. Except here, listing’s
id column has very long length which SQLite cannot handle, thus changing
it to string first.
#Converts listing id to string
#listing id was too long to store in SQLite as a INT hence change to string to store as VARCHAR
listing_data$id <- as.character(listing_data$id)
# Create listing table query
query_2 <- "CREATE TABLE listing (
id VARCHAR(2000),
listing_url VARCHAR(100),
last_scraped VARCHAR(50),
name VARCHAR(500),
description VARCHAR(2000),
neighborhood_overview VARCHAR(2000),
neighbourhood VARCHAR(100),
neighbourhood_cleansed VARCHAR(100),
neighbourhood_group_cleansed VARCHAR(100),
latitude DECIMAL(25,18),
longitude DECIMAL(25, 18),
property_type VARCHAR(100),
room_type VARCHAR(100),
picture_url VARCHAR(500),
host_id INT,
accommodates INT,
bathrooms_text VARCHAR(100),
bedrooms INT,
beds INT,
amenities VARCHAR(2000),
price DECIMAL(15, 5),
minimum_nights INT,
maximum_nights INT,
minimum_minimum_nights INT,
maximum_minimum_nights INT,
minimum_maximum_nights INT,
maximum_maximum_nights INT,
minimum_nights_avg_ntm DECIMAL(16, 5),
maximum_nights_avg_ntm DECIMAL(16, 5),
has_availability BOOLEAN,
availability_30 INT,
availability_60 INT,
availability_90 INT,
availability_365 INT,
calendar_last_scraped VARCHAR(50),
number_of_reviews INT,
number_of_reviews_ltm INT,
number_of_reviews_l30d INT,
first_review VARCHAR(50),
last_review VARCHAR(50),
review_scores_rating DECIMAL(10, 5),
review_scores_accuracy DECIMAL(10, 5),
review_scores_cleanliness DECIMAL(10, 5),
review_scores_checkin DECIMAL(10, 5),
review_scores_communication DECIMAL(10, 5),
review_scores_location DECIMAL(10, 5),
review_scores_value DECIMAL(10, 5),
instant_bookable BOOLEAN,
reviews_per_month DECIMAL(10, 5),
PRIMARY KEY(id),
FOREIGN KEY(host_id) REFERENCES host_info(host_id)
)"
Insert the listing
table into our database.
dbRemoveTable(con, "listing")
dbSendQuery(con, query_2)
<SQLiteResult>
SQL CREATE TABLE listing (
id VARCHAR(2000),
listing_url VARCHAR(100),
last_scraped VARCHAR(50),
name VARCHAR(500),
description VARCHAR(2000),
neighborhood_overview VARCHAR(2000),
neighbourhood VARCHAR(100),
neighbourhood_cleansed VARCHAR(100),
neighbourhood_group_cleansed VARCHAR(100),
latitude DECIMAL(25,18),
longitude DECIMAL(25, 18),
property_type VARCHAR(100),
room_type VARCHAR(100),
picture_url VARCHAR(500),
host_id INT,
accommodates INT,
bathrooms_text VARCHAR(100),
bedrooms INT,
beds INT,
amenities VARCHAR(2000),
price DECIMAL(15, 5),
minimum_nights INT,
maximum_nights INT,
minimum_minimum_nights INT,
maximum_minimum_nights INT,
minimum_maximum_nights INT,
maximum_maximum_nights INT,
minimum_nights_avg_ntm DECIMAL(16, 5),
maximum_nights_avg_ntm DECIMAL(16, 5),
has_availability BOOLEAN,
availability_30 INT,
availability_60 INT,
availability_90 INT,
availability_365 INT,
calendar_last_scraped VARCHAR(50),
number_of_reviews INT,
number_of_reviews_ltm INT,
number_of_reviews_l30d INT,
first_review VARCHAR(50),
last_review VARCHAR(50),
review_scores_rating DECIMAL(10, 5),
review_scores_accuracy DECIMAL(10, 5),
review_scores_cleanliness DECIMAL(10, 5),
review_scores_checkin DECIMAL(10, 5),
review_scores_communication DECIMAL(10, 5),
review_scores_location DECIMAL(10, 5),
review_scores_value DECIMAL(10, 5),
instant_bookable BOOLEAN,
reviews_per_month DECIMAL(10, 5),
PRIMARY KEY(id),
FOREIGN KEY(host_id) REFERENCES host_info(host_id)
)
ROWS Fetched: 0 [complete]
Changed: 0
Insert the listing
data into our database using the same
insert_to_sql()
function as before.
insert_to_sql(con, "listing", listing_data)
Confirm that the data was indeed inserted correctly.
res <- dbSendQuery(con, "SELECT * FROM listing LIMIT 10")
out_db <- fetch(res)
dbClearResult(res)
rmarkdown::paged_table(out_db)
Finally, before exiting our program, disconnect from the database
dbDisconnect(con)
We have provided an example of how to pull real-world data from an
actual source, then separate, clean and load the data into a
RSQLite
database. Real-world data is often times initially
messy, but that does not mean it cannot be wrangled into a form from
which we can gain deeper insights. In our next post, we demonstrate how
to answer some interesting questions given our newly created
database.
In this section, we demonstrate how to answer a number of questions related to room listing type and host info using our newly created RSQLite database.
We answer the following \(11\) questions:
First, we must load the necessary libraries required for our data analysis and connect to our database.
# Libraries
library(DBI)
library(RSQLite)
library(tidyr)
library(tibble)
library(stringr)
library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(leaflet)
library(lubridate)
library(ggplot2)
library(gridExtra)
library(tidyquant)
library(zoo)
Next, we connect to our database that we set up in the previous post
con <- dbConnect(RSQLite::SQLite(), "../db/airdb.SQLite")
build_airbnb_database()
FunctionIf you have not yet set up the require database, you can use the
build_airbnb_database()
function to do so. Note that this
function leverages the remove_old_database()
and
insert_to_sql()
functions defined in the previous post.
build_airbnb_database <- function(con, listing_data, remove_old_database = FALSE){
#################### Remove Existing database
if(remove_old_database == TRUE){
remove_live_database(con)
}
#################### Deal with NA values
listing_data <-
listing_data %>%
# Convert dates to characters for NA values
mutate(last_scraped = as.character(last_scraped),
host_since = as.character(host_since),
calendar_last_scraped = as.character(calendar_last_scraped),
first_review = as.character(first_review),
last_review = as.character(last_review),
) %>%
# Homogenize NA values
#*# Taken from: https://rpubs.com/Argaadya/create_table_sql
mutate_all(function(x) ifelse(x == "" | x == "None" | x == "N/A", NA, x)) %>% #*#
# mutate_all(function(x) ifelse(is.na(x), "NULL", x)) %>%
# Convert character strings back to date type
mutate(last_scraped = as.Date(last_scraped),
host_since = as.Date(host_since),
calendar_last_scraped = as.Date(calendar_last_scraped),
first_review = as.Date(first_review),
last_review = as.Date(last_review))
#################### Extract host data
host_data <- listing_data %>%
select(host_id:host_identity_verified,
calculated_host_listings_count:calculated_host_listings_count_shared_rooms)
#################### Remove duplicate values
host_data <- host_data %>% distinct()
#################### Convert dates
# Note that this will need to converted back to type = date for analysis
host_data <- host_data %>% mutate(host_since = as.character(host_since))
#################### Clean host verification column
host_data <-
host_data %>%
mutate(host_verifications = str_remove_all(host_verifications, "[\\'\\[\\]]"))
#################### Create table for host info
query <- "CREATE TABLE host_info(
host_id INT,
host_url VARCHAR(50),
host_name VARCHAR(100),
host_since VARCHAR(50),
host_location VARCHAR(500),
host_about VARCHAR(10000),
host_response_time VARCHAR(50),
host_response_rate VARCHAR(50),
host_acceptance_rate VARCHAR(50),
host_is_superhost BOOLEAN,
host_thumbnail_url VARCHAR(500),
host_picture_url VARCHAR(500),
host_neighbourhood VARCHAR(50),
host_listings_count INT,
host_total_listings_count INT,
host_verifications VARCHAR(500),
host_has_profile_pic BOOLEAN,
host_identity_verified BOOLEAN,
calculated_host_listings_count INT,
calculated_host_listings_count_entire_homes INT,
calculated_host_listings_count_private_rooms INT,
calculated_host_listings_count_shared_rooms INT,
PRIMARY KEY(host_id)
)"
#################### Load host_info table
dbSendQuery(con, query)
#################### Check schema
res <- dbSendQuery(con, "PRAGMA table_info([host_info]);")
fetch(res)
dbClearResult(res)
#################### Insert data into host_info table
insert_to_sql(con, "host_info", host_data)
####################Listing table Processing####################
# listing_data %>% view()
listing_data %>% glimpse()
#################### Remove host_data columns
listing_data <- listing_data %>%
select( - names(host_data)[-1])
#################### Remove extraneous columns
listing_data <- listing_data %>%
select(-c(license, calendar_updated, bathrooms, scrape_id))
#################### Remove dollar signs from price column
listing_data <- listing_data %>%
mutate(price = str_remove_all(price, "[$,]") %>%
as.numeric()
)
#################### Transform amenities and host verification column
listing_data <- listing_data %>%
mutate(amenities = str_remove_all(amenities, "[\"\\'\\[\\]]"))
listing_data %>% glimpse()
#################### Convert dates to character
listing_data <-
listing_data %>%
mutate(last_scraped = as.character(last_scraped),
calendar_last_scraped = as.character(calendar_last_scraped),
first_review = as.character(first_review),
last_review = as.character(last_review))
#################### Converts listing id to string
# listing id was too long to store in SQLite as a INT hence change to string to store as VARCHAR
listing_data$id <- as.character(listing_data$id)
#################### Create listing table
query_2 <- "CREATE TABLE listing (
id VARCHAR(2000)
listing_url VARCHAR(100),
last_scraped VARCHAR(50),
name VARCHAR(500),
description VARCHAR(2000),
neighborhood_overview VARCHAR(2000),
picture_url VARCHAR(500),
host_id INT,
neighbourhood VARCHAR(100),
neighbourhood_cleansed VARCHAR(100),
neighbourhood_group_cleansed VARCHAR(100),
latitude DECIMAL(25,18),
longitude DECIMAL(25, 18),
property_type VARCHAR(100),
room_type VARCHAR(100),
accommodates INT,
bathrooms_text VARCHAR(100),
bedrooms INT,
beds INT,
amenities VARCHAR(2000),
price DECIMAL(15, 5),
minimum_nights INT,
maximum_nights INT,
minimum_minimum_nights INT,
maximum_minimum_nights INT,
minimum_maximum_nights INT,
maximum_maximum_nights INT,
minimum_nights_avg_ntm DECIMAL(16, 5),
maximum_nights_avg_ntm DECIMAL(16, 5),
has_availability BOOLEAN,
availability_30 INT,
availability_60 INT,
availability_90 INT,
availability_365 INT,
calendar_last_scraped VARCHAR(50),
number_of_reviews INT,
number_of_reviews_ltm INT,
number_of_reviews_l30d INT,
first_review VARCHAR(50),
last_review VARCHAR(50),
review_scores_rating DECIMAL(10, 5),
review_scores_accuracy DECIMAL(10, 5),
review_scores_cleanliness DECIMAL(10, 5),
review_scores_checkin DECIMAL(10, 5),
review_scores_communication DECIMAL(10, 5),
review_scores_location DECIMAL(10, 5),
review_scores_value DECIMAL(10, 5),
instant_bookable BOOLEAN,
reviews_per_month DECIMAL(10, 5),
PRIMARY KEY(id),
FOREIGN KEY(host_id) REFERENCES host_info(host_id)
)"
#################### Insert listing table into database
dbSendQuery(con, query_2)
#################### Insert data into listing table
insert_to_sql(con, "listing", listing_data)
#################### Extract tables from database
}
We can load the data from our database in either of the following ways:
host_info <- tbl(con, "host_info") %>% as.data.frame()
listing <- tbl(con, "listing") %>% as.data.frame()
# load host_info table
res_host_info <- dbSendQuery(con, "select * from host_info")
host_info <- fetch(res_host_info)
dbClearResult(res_host_info)
# load listing table
res_listing <- dbSendQuery(con, "select * from listing")
listing <- fetch(res_listing)
dbClearResult(res_listing)
To find the most common room type available, we start by selecting
room_type
and has_availability
, the latter
being a logical indicator of unit availability. Then we filter the rooms
that have availability, group them by room_type
and rank
them by availability.
q1<- listing%>%
select(room_type,has_availability)%>%
group_by(room_type)%>%
filter(has_availability==1)%>%
summarise(availability=n())%>%
arrange(desc(availability))
#Using kable for table format
most_common_room_type_available <-q1%>%
knitr::kable(align = c("l", "c"),
format.args = list(big.mark = ","),
digits = 2)
most_common_room_type_available
room_type | availability |
---|---|
Entire home/apt | 2,218 |
Private room | 1,495 |
Hotel room | 150 |
Shared room | 117 |
For a clearer visual representation of room type availability, we include the following bar graph of the same data.
q1_plot<-q1%>%
ggplot(aes(x = availability, y = room_type %>% reorder(availability))) +
geom_col(fill = "Aquamarine4") +
scale_x_continuous(labels = scales::number_format(big.mark = ",")) +
labs(
title = "Most Common Room Type Available",
x = "Availability",
y = "Room Type"
) +
theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
q1_plot
From the plots, we can see that Entire home/apt
is the
room type with the most available units overall with 2218 available
units. However, this is affected by the amount of
Entire home/apt
that are listed on Airbnb, which is 2232.
This is similarly observed across all room types, with the availability
being proportional to the number of listings of that room type.
For further analysis, we utilised other information in the data, particularly the information about availability of room types over periods of 30, 60, 90 and 365 days. This allows us to understand the average amount of nights available per room type across the different selection periods.
#Create list columns to retrieve data
availability_periods <- c('availability_30', 'availability_60',
'availability_90', 'availability_365')
#Looping the results per period
for(col in availability_periods){
tables<- paste('q1',col, sep='_')
assign(tables,listing%>%
select(room_type,col)%>%
group_by(room_type)%>%
summarise(mean=mean(.data[[col]])))
}
####################LOOP THE PLOTS
q1_availability_30_plot<-q1_availability_30%>%
ggplot(aes(x = mean, y = room_type )) +
geom_col(fill = "Aquamarine4") +
scale_x_continuous(labels = scales::number_format(big.mark = ",")) +
labs(
title = "Most Common 30 Days",
x = "Availability",
y = "Room Type"
) +
theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
q1_availability_60_plot<-q1_availability_60%>%
ggplot(aes(x = mean, y = room_type )) +
geom_col(fill = "Aquamarine4") +
scale_x_continuous(labels = scales::number_format(big.mark = ",")) +
labs(
title = "Most Common 60 Days",
x = "Availability",
y = "Room Type"
) +
theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
q1_availability_90_plot<-q1_availability_90%>%
ggplot(aes(x = mean, y = room_type )) +
geom_col(fill = "Aquamarine4") +
scale_x_continuous(labels = scales::number_format(big.mark = ",")) +
labs(
title = "Most Common 90 Days",
x = "Availability",
y = "Room Type"
) +
theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
q1_availability_365_plot<-q1_availability_365%>%
ggplot(aes(x = mean, y = room_type )) +
geom_col(fill = "Aquamarine4") +
scale_x_continuous(labels = scales::number_format(big.mark = ",")) +
labs(
title = "Most Common 365 Days",
x = "Availability",
y = "Room Type"
) +
theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
grid.arrange(q1_availability_30_plot,
q1_availability_60_plot,
q1_availability_90_plot,
q1_availability_365_plot, ncol = 2)
From these 4 plots we can see that across all periods
Entire home/apt
is still the room type with highest average
availability, followed by Private room
,
Shared room
and Hotel room
. We can infer that
the demand for Entire home/apt
is not excessive since its
overall availability and average availability across periods are
relatively high. Something interesting to note is that although there
are more hotel rooms available than shared rooms, the average
availability of hotel rooms is lower than that of shared rooms. This may
suggest that in Singapore, the demand for hotel rooms is higher than
shared rooms.
First we start by selecting property_type
and
price
from all available data, then we group by property
type and summarise
by average price. Then, we create two
separate rankings:
top_property_type_average_price
bottom_property_type_average_price
q2<- listing%>%
select(property_type,price)%>%
group_by(property_type)%>%
summarise(price = mean(price))%>%
arrange(desc(price))
top_property_type_average_price <-q2 %>%
arrange(desc(price))%>%
top_n(10)%>%
knitr::kable(align = c("l", "c"),
format.args = list(big.mark = ","),
digits = 2)
top_property_type_average_price
property_type | price |
---|---|
Houseboat | 2,162.00 |
Boat | 1,083.50 |
Entire bungalow | 886.50 |
Entire villa | 858.00 |
Entire home | 519.90 |
Entire chalet | 504.00 |
Private room in chalet | 477.00 |
Farm stay | 417.00 |
Private room in tent | 363.50 |
Room in hotel | 324.51 |
bottom_property_type_average_price <-q2 %>%
arrange(price)%>%
top_n(-10)%>%
knitr::kable(align = c("l", "c"),
format.args = list(big.mark = ","),
digits = 2)
bottom_property_type_average_price
property_type | price |
---|---|
Shared room in guesthouse | 27.00 |
Shared room in home | 37.50 |
Shared room | 43.33 |
Shared room in bed and breakfast | 44.31 |
Private room in casa particular | 47.00 |
Shared room in townhouse | 47.00 |
Private room in earthen home | 50.00 |
Shared room in riad | 50.00 |
Entire vacation home | 56.50 |
Shared room in hostel | 58.75 |
First we plot the top properties based on
average price
.
top_q2<- q2%>%
arrange(desc(price))%>%
top_n(10)%>%
ggplot(aes(x = price, y = property_type %>% reorder(price))) +
geom_col(fill = "Aquamarine4") +
scale_x_continuous(labels = scales::number_format(big.mark = ",")) +
labs(
title = "Top Property Type by Average price",
x = "Average price",
y = "Property Type"
) +
theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
top_q2
Finally, we view the bottom properties based on
average price
.
bottom_q2<- q2%>%
arrange((price))%>%
top_n(-10)%>%
ggplot(aes(x = price, y = property_type %>% reorder(price))) +
geom_col(fill = "Skyblue3") +
scale_x_continuous(labels = scales::number_format(big.mark = ",")) +
labs(
title = "Bottom Property Type by Average price",
x = "Average price",
y = "Property Type"
) +
theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
bottom_q2
From the plots, we can see that across most of the properties, larger
properties tend to have higher average prices. Large properties like
Houseboat
, Boat
, Entire bungalow
,
Entire villa
have high average prices, whereas smaller
properties like shared room in guesthouse
has the lowest
average price.
For this query, we used the same logic as in the previous section but
instead of selecting property_type
and price
we select property_type
and
review_scores_rating
.
q3<- listing%>%
select(property_type,review_scores_rating)%>%
group_by(property_type)%>%
summarise(review_scores_rating = mean(review_scores_rating,na.rm=TRUE))%>%
arrange(desc(review_scores_rating))
top_property_type_review_scores_rating <-q3 %>%
arrange(desc(review_scores_rating))%>%
top_n(10)%>%
knitr::kable(align = c("l", "c"),
format.args = list(big.mark = ","),
digits = 2)
top_property_type_review_scores_rating
property_type | review_scores_rating |
---|---|
Boat | 4.91 |
Entire guest suite | 4.89 |
Entire loft | 4.88 |
Shared room in boutique hotel | 4.88 |
Tiny home | 4.87 |
Private room in villa | 4.84 |
Entire place | 4.83 |
Entire guesthouse | 4.81 |
Shared room in condo | 4.75 |
Private room in loft | 4.72 |
bottom_property_type_review_scores_rating <-q3 %>%
arrange(review_scores_rating)%>%
top_n(-10)%>%
knitr::kable(align = c("l", "c"),
format.args = list(big.mark = ","),
digits = 2)
bottom_property_type_review_scores_rating
property_type | review_scores_rating |
---|---|
Tent | 1.00 |
Private room in guesthouse | 3.04 |
Private room | 3.62 |
Room in aparthotel | 3.73 |
Entire home | 3.83 |
Private room in tent | 4.00 |
Shared room in hostel | 4.06 |
Shared room in rental unit | 4.06 |
Private room in hostel | 4.21 |
Room in hostel | 4.21 |
Next, we plot the top 10 property types based on
review score
.
top_q3<- q3%>%
arrange(desc(review_scores_rating))%>%
top_n(10)%>%
ggplot(aes(x = review_scores_rating, y = property_type %>% reorder(review_scores_rating))) +
geom_col(fill = "Aquamarine4") +
scale_x_continuous(labels = scales::number_format(big.mark = ",")) +
labs(
title = "Top Property Type by Review Score Rating",
x = "Review Score Rating",
y = "Property Type"
) +
theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
top_q3
Finally, we plot the of the bottom 10 property types based on
review score
.
bottom_q3<- q3%>%
arrange((review_scores_rating))%>%
top_n(-10)%>%
ggplot(aes(x = review_scores_rating, y = property_type %>% reorder(review_scores_rating))) +
geom_col(fill = "Skyblue3") +
scale_x_continuous(labels = scales::number_format(big.mark = ",")) +
labs(
title = "Bottom Property Type by Review Score Rating",
x = "Review Score Rating",
y = "Property Type"
) +
theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
bottom_q3
From the plots and the tables, we can see that Boat
has
the highest average review score rating of 4.91, with other property
types such as Entire guest suite
,
Entire loft
, Shared room in boutique hotel
and
Tiny home
having high scores. The top performing property
types in terms of review score ratings are more premium and fancy, which
is understandable since they result in more memorable experiences. As
for the bottom property types, Tent
has the lowest rating
of 1,
Tent
has the lowest average review score rating of 1,
which presents itself as an outlier as the next lowest average review
score rating was 3.04 for Private room in guesthouse
. Let’s
take a look at the Tent
listings to understand the reason
behind the low rating. We shall also take a look at the count of all
listings by proporty type for a better understanding of the review
rating scores.
tent_listing <- listing %>% filter(property_type == "Tent") %>% select(property_type, review_scores_rating)
property_count <- listing%>% group_by(property_type) %>% summarise(count = n()) %>% select(property_type, count) %>% arrange(count)
tent_listing
property_type review_scores_rating
1 Tent 1
property_count
# A tibble: 56 × 2
property_type count
<chr> <int>
1 Campsite 1
2 Entire chalet 1
3 Entire home/apt 1
4 Farm stay 1
5 Private room in casa particular 1
6 Private room in earthen home 1
7 Private room in guest suite 1
8 Private room in vacation home 1
9 Shared room in riad 1
10 Shared room in townhouse 1
# … with 46 more rows
From tent_listing
, we can see that there was only one
entry for Tent
property type, thus the review score of 1
was only from 1 entry. From property_count
we can also see
that many listings have very few entries, hence the average ratings may
be skewed and not as representative.
By looking at the amenities information in the listing data we
realize that all amenities are listed in one column. In order to
quantify and group the amenities, we first get the max amount of
amenities a property can have using the str_count()
function of the stringr
library.
ncols_q4 <- max(stringr::str_count(listing$amenities, ",")) + 1
Thereafter, we create multiple columns according to the max amount of
amenities a property can have and populate the columns with all the
amenities a property can have. We do this by using separate
function from the tidyr
library.
#Create list of column names
colmn_q4 <- paste("col", 1:ncols_q4)
#Create columns filled with the amenities all properties have
q4 <- listing%>%
select(amenities)%>%
tidyr::separate(
col = amenities,
sep= ",",
into=colmn_q4,
remove = FALSE)
By using pivot_longer
, we prepare our data into data
base format which allows for the occurrence of amenity types to be
counted.
q4 <- pivot_longer(data=q4,
cols = 'col 1':paste('col',length(q4)-1), #pivot to total number of columns
names_to = "col_number",
values_to = "separated_amenities")
Next, we tabulate all occurrences of the available amenity types.
q4 <- q4 %>%
select(separated_amenities) %>%
group_by(separated_amenities) %>%
summarise(amenities_count=n()) %>%
na.omit() %>%
arrange(desc(amenities_count)) %>%
top_n (10)
q4_plot <- q4 %>%
ggplot(aes(x = amenities_count, y = separated_amenities %>% reorder(amenities_count))) +
geom_col(fill = "Aquamarine4") +
scale_x_continuous(labels = scales::number_format(big.mark = ",")) +
labs(
title = "Most Common Amenities",
x = "Count",
y = "Amenities"
) +
theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
most_common_amenities <- q4%>%
knitr::kable(align = c("l", "c"),
format.args = list(big.mark = ","),
digits = 2)
q4_plot
most_common_amenities
separated_amenities | amenities_count |
---|---|
Long term stays allowed | 3,940 |
Air conditioning | 3,894 |
Essentials | 3,464 |
Kitchen | 3,325 |
Hangers | 3,242 |
Washer | 3,223 |
Iron | 3,069 |
Hair dryer | 2,896 |
Hot water | 2,745 |
Wifi | 2,733 |
We notice that while an accommodation may only have one score in terms of price, it is reviewed across several different dimensions:
We start by selecting the price
and review related
columns from the listing
table, before dropping any rows
that are missing review scores. This leaves us with roughly 10,000
observations remaining, more than enough to theoretically examine
correlation.
q5 <-
listing %>% select(price, review_scores_accuracy,
review_scores_cleanliness, review_scores_checkin,
review_scores_communication, review_scores_location,
review_scores_rating, review_scores_value) %>%
drop_na()
We defined the following function c_plot()
to handle the
repetitive plotting of price versus our eight different review
dimensions.
# Function defining correlation plot
c_plot <- function(df, y_val, y_name, plot_title = y_name, clr = "dodgerblue4"){
c_plot <- df %>%
ggplot(aes(x = price,
y = y_val)) +
geom_jitter(color = clr, alpha = 0.5) +
scale_x_log10(label = scales::number_format(big.mark = ",")) +
labs(x = "Price",
y = y_name,
title = plot_title) + theme_tq()
return(c_plot)
}
We then construct our correlation graphs using the above defined
c_plot()
function. We also use the
grid.arrange()
function from the gridExtra
library to help align our multiple plots for parallel examination.
# Colours for correlation plot
c <- c("Aquamarine4", "Sienna3")
# Build correlation plots
q5_1 <- c_plot(q5, q5$review_scores_rating, "Rating", plot_title = "Rating vs Price")
q5_2 <- c_plot(q5, q5$review_scores_accuracy, "Accuracy", clr = c[1])
q5_3 <- c_plot(q5, q5$review_scores_cleanliness, "Cleanliness", clr = c[1])
q5_4 <- c_plot(q5, q5$review_scores_checkin, "Check-in", clr = c[1])
q5_5 <- c_plot(q5, q5$review_scores_communication, "Communication", clr = c[2])
q5_6 <- c_plot(q5, q5$review_scores_location, "Location", clr = c[2])
q5_7 <- c_plot(q5, q5$review_scores_value, "Value", clr = c[2])
# Output correlation plots
q5_1
grid.arrange(q5_2, q5_3, q5_4, ncol = 3)
grid.arrange(q5_5, q5_6, q5_7, ncol = 3)
Our plots appear noisy, but the trend seems to be that properties
receiving lower ratings along each of the dimensions tend to be those
with lower rental prices. Among the few properties falling into the
higher rental price range (i.e. > 1,000), almost all of the ratings
lie above 4 out of the possible 5 points. However, there are still
plenty of properties at the lower price range that receive excellent
review scores. Unfortunately, due to the asymmetric, non-normal
distributed nature of the data, we are unable to apply the
cor.test()
function to determine if the correlation between
price and review rating is statistically significant.
To study the geographical distribution of available accommodations to
rent, we use the leaflet
library to create an interactive
map
q6 <- listing %>%
left_join(host_info, by = "host_id") %>%
select(host_id, host_name, listing_url, latitude, longitude, price,
review_scores_rating, number_of_reviews, neighbourhood_cleansed) %>%
replace_na(list(name = "No Name", host_name = "No Host Name"))
popup <- paste0("<b>", q6$name, "</b><br>",
"Listing ID: ", q6$id, "<br>",
"Host Name: ", q6$host_name, "<br>",
"Price: ", q6$price, "<br>",
"Review Scores Rating: ", ifelse(is.na(q6$review_scores_rating),
"No Review Yet", q6$review_scores_rating) , "<br>",
"Number of Reviews: ", q6$number_of_reviews, "<br>",
"<a href=", q6$listing_url, "> Click for more info</a>"
)
leaflet(data = q6) %>%
addTiles() %>%
addMarkers(lng = ~longitude,
lat = ~latitude,
popup = popup,
clusterOptions = markerClusterOptions())
We begin by joining together the two tables on the column
host_id
. We then select the necessary columns and create a
new column called total_earnings
which consists of the
formula:
We then remove the columns containing NA
values and
perform a count
after grouping by the attributes
host_id
and host_name
. At the same time, we
calculate the average price and then finally, select the columns we want
and arrange in descending order by the total_earnings
.
q7 <- listing %>%
left_join(host_info, by = "host_id") %>%
select(host_id, host_name, price,
review_scores_rating, minimum_nights, number_of_reviews) %>%
mutate(total_earnings = price * review_scores_rating * minimum_nights) %>%
drop_na() %>%
group_by(host_id, host_name) %>%
mutate(number_of_listing = n(),
average_price = mean(price)) %>%
ungroup() %>%
select(host_id, host_name, total_earnings, number_of_listing, average_price) %>%
arrange(desc(total_earnings))
We create two plots instead of just one to examine the top posts by revenue. The first plot examines the top 10 hosts by the number of listings they have. The second plot, depicts the top 10 hosts by their total earnings.
We plot the results using a similar process for both plots with the
main difference being that the y-axis for the
top_host_by_listing
plot is ordered by the
number_of_listing
column, While the
top_host_by_earning
is ordered by the
total_earnings
column
top_host_by_listing <-
q7 %>%
arrange(desc(number_of_listing)) %>%
select(host_name, number_of_listing) %>%
distinct() %>%
head(10) %>%
ggplot(aes(x = number_of_listing, y = host_name %>% reorder(number_of_listing))) +
geom_col(fill = "Skyblue3") +
labs(
title = "Top 10 Host by # of Listings",
x = "Number of Listing",
y = "Host Name"
) +
theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
top_host_by_earning <-
q7 %>%
select(host_name, total_earnings) %>%
arrange(desc(total_earnings)) %>%
filter(total_earnings != 16242500) %>%
head(10) %>%
ggplot(aes(x = total_earnings, y = host_name %>% reorder(total_earnings))) +
geom_col(fill = "Aquamarine4") +
scale_x_continuous(labels = scales::number_format(big.mark = ",")) +
labs(
title = "Top 10 Host by Total Earning",
x = "Total Earning (in SGD)",
y = "Host Name"
) +
theme_tq() +
theme(axis.text.x = element_text(angle = 45, face = "bold",
vjust = 0.85, hjust = 0.89),
axis.text.y = element_text(face = "bold"))
We once again use the grid.arrange()
function from the
GridExtra
library to view the two plots side-by-side to aid
in direct comparison.
grid.arrange(top_host_by_listing, top_host_by_earning, ncol = 2)
From the two graphs, we noticed that there are no common host that appears in both Top 10 lists.
Certain hosts receive the designation of superhost which can be achieved by meeting the following criteria:
In order to determine the difference in review score between
superhosts and regular hosts, we use the mutate()
function
to create a new logical
column
host_is_superhost
.
q8 <- listing %>%
left_join(host_info, by = "host_id") %>%
select(host_id, host_name, review_scores_rating, host_is_superhost) %>%
drop_na() %>%
mutate(host_is_superhost = as.logical(host_is_superhost)) %>%
select(review_scores_rating, host_is_superhost)
We then create two separate boxplots after isolating only the
observations that match the respective TRUE/FALSE
condition
for the host_is_superhost
column.
q8_1 <-
q8[q8$host_is_superhost == FALSE, ] %>%
ggplot(aes(y = review_scores_rating, group = host_is_superhost)) +
geom_boxplot(fill = "Skyblue3") +
labs(
title = "Host Ratings",
subtitle = "Ratings Distribution",
x = "Host",
y = "Rating"
) + theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
q8_2 <-
q8[q8$host_is_superhost == TRUE, ] %>%
ggplot(aes(y = review_scores_rating, group = host_is_superhost)) +
geom_boxplot(fill = "Aquamarine3") +
labs(
title = "Superhost Ratings",
subtitle = "Ratings Distribution",
x = "Superhost",
y = "Rating"
) + theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
Finally, we once again use the grid.arrange()
function
from the GridExtra
library to view the two plots
side-by-side to aid in direct comparison.
grid.arrange(q8_1, q8_2, ncol = 2)
There does appear to be a difference in response rate, especially in
terms of variance, between superhosts and regular hosts based off visual
inspection with the average superhost rating also ranking slightly
higher. The IQR for Host ratings are much larger compared to Superhost
Ratings. Unfortunately, due to the non-normality of the data, we are
unable to rely on a t.test()
to verify statistically if our
visual assumptions are correct.
We repeat the same process as the previous query to determine the
difference in response rate
q9 <- listing %>%
left_join(host_info, by = "host_id") %>%
select(host_id, host_name, host_response_rate, host_acceptance_rate, host_is_superhost) %>%
drop_na() %>%
mutate(host_is_superhost = as.logical(host_is_superhost),
# Transform acceptance rate and response rate
host_response_rate = host_response_rate %>%
str_remove("[%]") %>%
as.numeric(),
host_acceptance_rate = host_acceptance_rate %>%
str_remove("[%]") %>%
as.numeric()
)
q9_1 <-
q9[q9$host_is_superhost == FALSE, ] %>%
ggplot(aes(y = host_response_rate, group = host_is_superhost)) +
geom_boxplot(fill = "Skyblue3") +
labs(
title = "Host Response Rate",
subtitle = "Ratings Distribution",
x = "Host",
y = "Rating"
) + theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
q9_2 <-
q9[q9$host_is_superhost == TRUE, ] %>%
ggplot(aes(y = host_response_rate, group = host_is_superhost)) +
geom_boxplot(fill = "Aquamarine3") +
labs(
title = "Superhost Response Rate",
subtitle = "Ratings Distribution",
x = "Superhost",
y = "Rating"
) + theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
grid.arrange(q9_1, q9_2, ncol = 2)
We see here that regular hosts have a noticeably lower first quartile
for response rate
than the superhosts. Although, the set of
superhosts is not without its outliers raising the question if some of
these hosts may soon lose their superhost status due to their lacklustre
response rate.
Here we will answer this question applying same logic as we did on question #4 for most common amenities.
#Max amount of verified informations per host
ncols_q10 <- max(stringr::str_count(na.omit(host_info$host_verifications), ",")) + 1
#Create list of column names
colmn_q10 <- paste("col", 1:ncols_q10)
#Create columns filled with all the verified informations
q10<- host_info%>%
select(host_verifications)%>%
tidyr::separate(
col = host_verifications,
sep= ",",
into=colmn_q10,
remove = FALSE)
#Pivot longer to prepare data into Data Base format
q10<- pivot_longer(data=q10,
cols = 'col 1':paste('col',length(q10)-1), #pivot to total number of columns
names_to = "col_number",
values_to = "separated_host_verifications")
#There are some rows in q10 in separated host verifications that have blank strings, hence changing them to NA so that we can remove them.
q10$separated_host_verifications[q10$separated_host_verifications ==""] <- NA
#Pipe to answer the question
#Trimmed separated_host_verifications to consolidate 'phone' method
q10<- q10%>%
mutate(separated_host_verifications = str_trim(separated_host_verifications))%>%
select(separated_host_verifications)%>%
group_by(separated_host_verifications)%>%
summarise(host_verifications_count=n())%>%
na.omit()%>%
arrange(desc(host_verifications_count))%>%
top_n (10)
#plotting the answer for visual reference
q10_plot<-q10%>%
ggplot(aes(x = host_verifications_count,
y = separated_host_verifications %>%
reorder(host_verifications_count))) +
geom_col(fill = "Aquamarine4") +
scale_x_continuous(labels = scales::number_format(big.mark = ",")) +
labs(
title = "Most Common Verified Information",
x = "Count",
y = "Verified Information"
) +
theme_tq() +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"))
q10_plot
q10 <-most_common_host_verifications<- q10%>%
knitr::kable(align = c("l", "c"),
format.args = list(big.mark = ","),
digits = 2)
most_common_host_verifications
separated_host_verifications | host_verifications_count |
---|---|
phone | 1,167 |
1,002 | |
work_email | 130 |
We begin by extracting and isolating the necessary data using the following steps:
host_since_date
to the type date
using the as.Date()
function.host_since
column into three separate
columns for Year
, month
and day
respectively.day
column.Year
and Month
and use the
count()
function to tabulate the results.NA
valuesq11 <-
host_info %>% # 1
left_join(listing, by = "host_id") %>%
select(host_id, host_since) %>% # 2
mutate(host_since_date = as.Date(host_since)) %>% # 3
separate("host_since", c("Year", "Month", "Day"), sep = "-") %>% # 4
select(-Day) %>% # 5
group_by(Year, Month) %>% # 6
count(Year, Month) %>%
ungroup() %>%
mutate(year_month = paste0(Year, "-", Month, "-", "01"), # 7
year_month_2 = paste0(Year, "-", Month),
joined = n) %>%
select(year_month, year_month_2, joined) %>%
mutate(year_month = as.Date(year_month)) %>%
drop_na() # 8
Next, we use the ggplot
library’s
geom_line()
function to plot the data as a time series.
q11 %>%
ggplot(aes(x = year_month, y = joined)) +
geom_line(size = 1.2, colour = "Aquamarine4") +
scale_x_date(breaks = waiver(), date_breaks = "6 months") + theme_tq() +
theme(axis.text.x = element_text(angle = 45, face = "bold", vjust = 0.65),
axis.text.y = element_text(face = "bold")) +
labs(
title = "Number of hosts joined",
subtitle = "Shows the frequency rate at which new posts sign up for airbnb",
caption = "",
x = "Joined",
y = "Year/Month")
Since we are dealing with data over a number of years, it is helpful to also compile a list of the top 10 most active months in terms of new hosts
q11 %>%
select(-year_month) %>%
mutate(year_month = as.yearmon(year_month_2)) %>%
select(-year_month_2) %>%
select(year_month, joined) %>%
arrange(desc(joined)) %>%
head(10) %>% knitr::kable(align = c("c", "c"))
year_month | joined |
---|---|
Jul 2017 | 237 |
Apr 2016 | 224 |
Oct 2017 | 220 |
Oct 2015 | 129 |
May 2012 | 108 |
Sep 2013 | 103 |
Jan 2019 | 103 |
Oct 2016 | 101 |
Aug 2015 | 95 |
Dec 2014 | 87 |
We see that the second half of the year (Jul-Dec), makeup most of the busiest months in terms of new hosts joining the service. Jul 2017, Apr 2016 and Oct 2017 make up the top 3 busiest months.
Finally, before exiting our program, disconnect from the database
dbDisconnect(con)
We have demonstrated how to pull in real-world data, divide and clean
the data into usable tables, insert the data into a database and then
use that database to answer interesting questions that may help provide
useful and actionable insights on which to base future decisions.
R
and its many libraries, specifically the
tidyverse
, provide a powerful framework with which to
answer many interesting questions, often in only a few lines of code. We
encourage you to come up with your own questions and see if you can
answer them using the provided data.