1 Database

For this section, we demonstrate how to separate, clean and load host and listing data for Singapore, taken from Airbnb, into a database.

1.1 Libraries

The following libraries are used for data cleaning and database construction:

# Data cleaning libraries
library(readr)
library(tidyr)
library(stringr)
library(tibble)
library(dplyr) 

# Database libraries
library(DBI)
library(RSQLite)

# ER Diagram libraries
library(dm)

1.2 Data/Database Connection

Begin by loading the downloaded data into R using the read_csv() command from the readr library.

data <- read_csv("../data/listings.csv.gz")

We also need to form a database connection object:

con <- dbConnect(RSQLite::SQLite(), "../db/airdb.SQLite")

1.2.1 Useful Functions

Before moving on to the data processing stage, we first define a few useful functions.

1.2.1.1 The remove_live_database() function

While we are actively working on our database construction, we may find it useful to periodically disconnect and rebuild our database from scratch. Having to close R, delete the airdb.SQLite file and reopen our session repeatedly can be tedious. To work around this, we use the following remove_live_database() function to disconnect and delete any current working database while R is still open:

remove_live_database <- function(con){
    # If the database file exists, close the open connection and delete the file
    if(file.exists("../db/airdb.SQLite")){
        if (exists("con")){
            dbDisconnect(con)   
        }
        file.remove("../db/airdb.SQLite")
    }
}
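
For example, a rebuild during development might look like the following sketch (ours, not from the original post), which tears down the current database file and then opens a fresh connection:

remove_live_database(con)
con <- dbConnect(RSQLite::SQLite(), "../db/airdb.SQLite")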

1.2.1.2 The insert_to_sql() function

In order to facilitate the insertion of data into our RSQLite database, we use the following insert_to_sql() function. This function performs a number of operations prior to insertion:

  1. Create a vector of the column names in the provided data argument.
  2. Make sure that all NA values are in the correct form. For all non-NA values, replace double quotes with single quotes and bookend each string with double quotes.
  3. Join each row’s column values into a single string, transform NA values into NULL and trim any additional whitespace.
  4. Wrap each row’s string in parentheses, join all rows into a single string, remove any stray backslashes and use the paste0() function to assemble the final INSERT query for the relevant table in our database.
  5. Insert the now prepared data into the database.
insert_to_sql <- function(con, table, data){
    
    
    # (1)
    column_name <- paste(names(data), collapse = ", ")
    
    # (2)
    data_new <- data %>% 
        mutate_if(is.character, function(x) ifelse(is.na(x), NA,  x %>% 
                                str_replace_all('"', "'") %>% # Replace " with '
                                paste0('"', . , '"') # Add " before and after string
        )
        )
    
    # (3)
    value_data <- apply(data_new, MARGIN = 1,
                        function(x) x %>%
                            paste(collapse = ",") %>% # Join all columns into a single string
                            str_replace_all("\\bNA\\b", "NULL")  %>% # Convert NA values to NULL
                            str_trim() # Remove unnecessary whitespace
    )
    # (4)
    query_value <- paste(value_data) %>% 
        paste0("(", ., ")") %>% # Add bracket before and after string
        paste(collapse = ", ") %>% # Join all values into single string
        str_remove_all("\\\\") %>% # Remove \\ from string
        paste0("INSERT INTO ", table, "(", column_name, ") VALUES ", .)
    
    # (5)
    dbSendQuery(con, query_value)
}
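
To make the string assembly concrete, here is a toy illustration of ours (not part of the original post) that runs insert_to_sql() against a throwaway in-memory SQLite database; the toy table and values are chosen purely for the example:

# Toy illustration (ours): exercise insert_to_sql() on an in-memory database
toy_con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbSendQuery(toy_con, "CREATE TABLE host_info(host_id INT, host_name VARCHAR(100))")
toy <- tibble(host_id = c(1, 2), host_name = c("Alice", NA))
insert_to_sql(toy_con, "host_info", toy)
# The query built internally is roughly:
# INSERT INTO host_info(host_id, host_name) VALUES (1,"Alice"), (2,NULL)
dbReadTable(toy_con, "host_info")
dbDisconnect(toy_con)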

1.3 Data

The data used for this project is provided by Airbnb and can be found here. To get started, first download the listings.csv.gz file from the provided link. In order to help streamline the process of constructing our database, it is beneficial to split the .csv file into a listing table and a host_info table. The E/R (Entity/Relationship) diagram in the next section shows how these two tables are related and the attributes they each hold.

We can examine our initial data using the glimpse() command from the tibble library.

data %>% glimpse()
Rows: 4,099
Columns: 74
$ id                                           <dbl> 5.955499e+17, 5.064600e+0…
$ listing_url                                  <chr> "https://www.airbnb.com/r…
$ scrape_id                                    <dbl> 2.022062e+13, 2.022062e+1…
$ last_scraped                                 <date> 2022-06-22, 2022-06-22, …
$ name                                         <chr> "HCK[R&F] 2BR 2-4pax City…
$ description                                  <chr> "The location is quite co…
$ neighborhood_overview                        <chr> "Located at Bespoke @ JB …
$ picture_url                                  <chr> "https://a0.muscache.com/…
$ host_id                                      <dbl> 95993346, 227796, 367042,…
$ host_url                                     <chr> "https://www.airbnb.com/u…
$ host_name                                    <chr> "Kylie", "Sujatha", "Beli…
$ host_since                                   <date> 2016-09-20, 2010-09-08, …
$ host_location                                <chr> "Johor, Malaysia", "Singa…
$ host_about                                   <chr> NA, "I am a working profe…
$ host_response_time                           <chr> "within an hour", "a few …
$ host_response_rate                           <chr> "100%", "0%", "100%", "10…
$ host_acceptance_rate                         <chr> "100%", "N/A", "100%", "1…
$ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, FALS…
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/…
$ host_picture_url                             <chr> "https://a0.muscache.com/…
$ host_neighbourhood                           <chr> NA, "Bukit Timah", "Tampi…
$ host_listings_count                          <dbl> 1, 1, 6, 6, 6, 30, 30, 6,…
$ host_total_listings_count                    <dbl> 1, 1, 6, 6, 6, 30, 30, 6,…
$ host_verifications                           <chr> "['email', 'phone']", "['…
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ host_identity_verified                       <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ neighbourhood                                <chr> "Johor Bahru, Johor, Mala…
$ neighbourhood_cleansed                       <chr> "Woodlands", "Bukit Timah…
$ neighbourhood_group_cleansed                 <chr> "North Region", "Central …
$ latitude                                     <dbl> 1.45838, 1.33432, 1.34537…
$ longitude                                    <dbl> 103.7703, 103.7852, 103.9…
$ property_type                                <chr> "Entire rental unit", "Pr…
$ room_type                                    <chr> "Entire home/apt", "Priva…
$ accommodates                                 <dbl> 4, 2, 6, 2, 3, 1, 1, 6, 2…
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N…
$ bathrooms_text                               <chr> "1 bath", "1 bath", "1 pr…
$ bedrooms                                     <dbl> 2, 1, 2, 1, 1, 1, 1, 3, 1…
$ beds                                         <dbl> 2, 1, 3, 1, 2, 1, 1, 5, 1…
$ amenities                                    <chr> "[\"Free washer \\u2013 I…
$ price                                        <chr> "$58.00", "$80.00", "$143…
$ minimum_nights                               <dbl> 1, 92, 92, 92, 92, 14, 14…
$ maximum_nights                               <dbl> 1125, 730, 1125, 1125, 11…
$ minimum_minimum_nights                       <dbl> 1, 92, 92, 92, 92, 14, 14…
$ maximum_minimum_nights                       <dbl> 1, 92, 92, 92, 92, 14, 14…
$ minimum_maximum_nights                       <dbl> 1125, 730, 1125, 1125, 11…
$ maximum_maximum_nights                       <dbl> 1125, 730, 1125, 1125, 11…
$ minimum_nights_avg_ntm                       <dbl> 1, 92, 92, 92, 92, 14, 14…
$ maximum_nights_avg_ntm                       <dbl> 1125, 730, 1125, 1125, 11…
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N…
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ availability_30                              <dbl> 0, 30, 30, 30, 30, 5, 13,…
$ availability_60                              <dbl> 0, 60, 60, 60, 60, 13, 43…
$ availability_90                              <dbl> 17, 90, 90, 90, 90, 43, 7…
$ availability_365                             <dbl> 18, 365, 365, 365, 365, 3…
$ calendar_last_scraped                        <date> 2022-06-22, 2022-06-22, …
$ number_of_reviews                            <dbl> 8, 18, 20, 24, 47, 21, 13…
$ number_of_reviews_ltm                        <dbl> 8, 0, 0, 0, 0, 1, 0, 0, 0…
$ number_of_reviews_l30d                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ first_review                                 <date> 2022-04-14, 2014-04-18, …
$ last_review                                  <date> 2022-05-01, 2014-12-26, …
$ review_scores_rating                         <dbl> 5.00, 4.56, 4.44, 4.16, 4…
$ review_scores_accuracy                       <dbl> 4.88, 4.72, 4.37, 4.22, 4…
$ review_scores_cleanliness                    <dbl> 4.88, 4.78, 4.00, 4.09, 4…
$ review_scores_checkin                        <dbl> 5.00, 4.78, 4.63, 4.43, 4…
$ review_scores_communication                  <dbl> 5.00, 4.94, 4.78, 4.43, 4…
$ review_scores_location                       <dbl> 4.88, 4.72, 4.26, 4.17, 4…
$ review_scores_value                          <dbl> 4.88, 4.50, 4.32, 4.04, 4…
$ license                                      <chr> NA, NA, NA, NA, NA, "S039…
$ instant_bookable                             <lgl> TRUE, FALSE, FALSE, TRUE,…
$ calculated_host_listings_count               <dbl> 1, 1, 6, 6, 6, 50, 50, 6,…
$ calculated_host_listings_count_entire_homes  <dbl> 1, 0, 0, 0, 0, 2, 2, 0, 1…
$ calculated_host_listings_count_private_rooms <dbl> 0, 1, 6, 6, 6, 48, 48, 6,…
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ reviews_per_month                            <dbl> 3.43, 0.18, 0.16, 0.18, 0…

1.3.1 E/R Diagram

From here, we will work on separating and cleaning the two respective tables prior to insertion in our database.
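
The diagram itself can be rendered with the dm library loaded in section 1.1. As a rough sketch on our side (assuming the cleaned host_data and listing_data data frames constructed in the sections below), it could be drawn along these lines:

# Sketch (ours): draw the E/R diagram once both cleaned tables exist
er_dm <- dm(host_info = host_data, listing = listing_data) %>%
    dm_add_pk(host_info, host_id) %>%
    dm_add_pk(listing, id) %>%
    dm_add_fk(listing, host_id, host_info)

dm_draw(er_dm)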

1.3.2 Initial Preprocessing

Before we separate the data into two separate tables, we perform the following preprocessing steps on all entries:

  1. Convert all columns of the type date to the type character to help process NA values.
  2. Homogenize all of the different NA representations (blanks, None, N/A, NA) to NA.
  3. Convert all columns that contain dates back to the type date.
data <- data %>% 
        # (1) Convert dates to characters for NA values
        mutate(last_scraped = as.character(last_scraped),
               host_since = as.character(host_since),
               calendar_last_scraped = as.character(calendar_last_scraped),
               first_review = as.character(first_review),
               last_review = as.character(last_review),
               ) %>% 
        
        # (2) Homogenize NA values
        #*# Taken from: https://rpubs.com/Argaadya/create_table_sql
        mutate_all(function(x) ifelse(x == "" | x == "None" | x == "N/A", NA, x)) %>%  #*#
        # mutate_all(function(x) ifelse(is.na(x), "NULL", x)) %>% 
        
        # (3) Convert character strings back to date type
        mutate(last_scraped = as.Date(last_scraped),
               host_since = as.Date(host_since),
               calendar_last_scraped = as.Date(calendar_last_scraped),
               first_review = as.Date(first_review),
               last_review = as.Date(last_review))
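
As a quick sanity check of ours (not part of the original post), we can count the missing values in a few columns to confirm that the different NA representations were indeed homogenized:

# Quick check (ours): NA counts for a few columns after homogenizing
data %>% 
    summarise(across(c(host_about, neighbourhood, license), ~ sum(is.na(.x))))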

We are now ready to move on to constructing our host_info table.

1.3.3 Host Table

Since a host can have many listings, it’s beneficial to split our initial data into two tables, one containing the info related to hosts and the other containing the info related to listings, before inserting them into our database. The first table that we create is the host_info table, with the attributes shown in the E/R diagram above.

1.3.3.1 Data Cleaning

We are now ready to extract and clean the data from our initial data table in order to construct our host_data table. This is done in the following four steps:

  1. Extract the relevant columns from our initial data table. Note that we use the : syntax to grab many columns at once. This syntax is inclusive.
  2. Remove duplicate rows using the distinct() function from the dplyr library.
  3. Convert the host_since column back to the type character. This is required since RSQLite does not support data of the type date. We will convert this column back to the correct type when performing queries later.
  4. Use the str_remove_all() function from the stringr library to convert the host_verifications sublists into simple strings.
e.g. “[‘email’, ‘phone’]” \(\rightarrow\) “email, phone”.
    # (1) Extract host data 
    host_data <- data %>% 
        select(host_id:host_identity_verified, 
               calculated_host_listings_count:calculated_host_listings_count_shared_rooms)
    
    
    # (2) Remove duplicate values
    host_data <- host_data %>% distinct()
    
    
    # (3) Convert dates
    # Note that this will need to converted back to type = date for analysis
    host_data  <- host_data %>% mutate(host_since = as.character(host_since)) 
    
    
    # (4) Clean host verification column
    host_data <- 
        host_data %>% 
        mutate(host_verifications = str_remove_all(host_verifications, "[\\'\\[\\]]"))

We can now view our clean data:

rmarkdown::paged_table(host_data)

1.3.3.2 Host_info Table Creation

Now that our host_data table is clean, we can create the equivalent table as a query, written first as a string listing the table’s columns, before creating the empty table in our database using our con object:

# Create host info query
    query <- "CREATE TABLE host_info(
        host_id INT, 
        host_url VARCHAR(50), 
        host_name VARCHAR(100), 
        host_since VARCHAR(50),
        host_location VARCHAR(500), 
        host_about VARCHAR(10000),
        host_response_time VARCHAR(50),
        host_response_rate VARCHAR(50),
        host_acceptance_rate VARCHAR(50),
        host_is_superhost BOOLEAN,
        host_thumbnail_url VARCHAR(500),
        host_picture_url VARCHAR(500),
        host_neighbourhood VARCHAR(50),
        host_listings_count INT,
        host_total_listings_count INT,
        host_verifications VARCHAR(500),
        host_has_profile_pic BOOLEAN,
        host_identity_verified BOOLEAN,
        calculated_host_listings_count INT, 
        calculated_host_listings_count_entire_homes INT,
        calculated_host_listings_count_private_rooms INT,
        calculated_host_listings_count_shared_rooms INT,
        PRIMARY KEY(host_id)
        )"

1.3.3.3 Data Insertion

Now create the empty table in our database.

dbRemoveTable(con, "host_info")
dbSendQuery(con, query)
<SQLiteResult>
  SQL  CREATE TABLE host_info(
        host_id INT, 
        host_url VARCHAR(50), 
        host_name VARCHAR(100), 
        host_since VARCHAR(50),
        host_location VARCHAR(500), 
        host_about VARCHAR(10000),
        host_response_time VARCHAR(50),
        host_response_rate VARCHAR(50),
        host_acceptance_rate VARCHAR(50),
        host_is_superhost BOOLEAN,
        host_thumbnail_url VARCHAR(500),
        host_picture_url VARCHAR(500),
        host_neighbourhood VARCHAR(50),
        host_listings_count INT,
        host_total_listings_count INT,
        host_verifications VARCHAR(500),
        host_has_profile_pic BOOLEAN,
        host_identity_verified BOOLEAN,
        calculated_host_listings_count INT, 
        calculated_host_listings_count_entire_homes INT,
        calculated_host_listings_count_private_rooms INT,
        calculated_host_listings_count_shared_rooms INT,
        PRIMARY KEY(host_id)
        )
  ROWS Fetched: 0 [complete]
       Changed: 0

1.3.3.4 Schema Verification

Next, we check the schema of our database so far to ensure that it was loaded correctly.

res <- dbSendQuery(con, "PRAGMA table_info([host_info]);")
fetch(res)
   cid                                         name           type notnull
1    0                                      host_id            INT       0
2    1                                     host_url    VARCHAR(50)       0
3    2                                    host_name   VARCHAR(100)       0
4    3                                   host_since    VARCHAR(50)       0
5    4                                host_location   VARCHAR(500)       0
6    5                                   host_about VARCHAR(10000)       0
7    6                           host_response_time    VARCHAR(50)       0
8    7                           host_response_rate    VARCHAR(50)       0
9    8                         host_acceptance_rate    VARCHAR(50)       0
10   9                            host_is_superhost        BOOLEAN       0
11  10                           host_thumbnail_url   VARCHAR(500)       0
12  11                             host_picture_url   VARCHAR(500)       0
13  12                           host_neighbourhood    VARCHAR(50)       0
14  13                          host_listings_count            INT       0
15  14                    host_total_listings_count            INT       0
16  15                           host_verifications   VARCHAR(500)       0
17  16                         host_has_profile_pic        BOOLEAN       0
18  17                       host_identity_verified        BOOLEAN       0
19  18               calculated_host_listings_count            INT       0
20  19  calculated_host_listings_count_entire_homes            INT       0
21  20 calculated_host_listings_count_private_rooms            INT       0
22  21  calculated_host_listings_count_shared_rooms            INT       0
   dflt_value pk
1          NA  1
2          NA  0
3          NA  0
4          NA  0
5          NA  0
6          NA  0
7          NA  0
8          NA  0
9          NA  0
10         NA  0
11         NA  0
12         NA  0
13         NA  0
14         NA  0
15         NA  0
16         NA  0
17         NA  0
18         NA  0
19         NA  0
20         NA  0
21         NA  0
22         NA  0
dbClearResult(res)

1.3.3.5 Host Data Insertion

Finally, insert our host_data table into the equivalent table in our RSQLite database using the insert_to_sql() function as defined above.

insert_to_sql(con, "host_info", host_data)

1.3.3.6 Database Verification

We can verify the contents of our newly created and populated RSQLite database table host_info by viewing the first 10 rows of each column as follows:

res <- dbSendQuery(con, "SELECT * FROM host_info LIMIT 10") 

out_db <- fetch(res) 
dbClearResult(res)

rmarkdown::paged_table(out_db)
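
As noted in step 3 above, host_since comes back from SQLite as a character column. A one-line sketch of ours for restoring it to a Date whenever it is needed for analysis:

# Sketch (ours): restore the host_since column to a Date after fetching
out_db %>% mutate(host_since = as.Date(host_since))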

1.3.4 Listing Table

We are now ready to move on to constructing our listing table. This table contains information related to the types of listings available and their relevant attributes. The process of constructing the listing table is very similar to the process of constructing the host_info table.

1.3.4.1 Data Cleaning

In order to prepare the relevant columns related to listing information we perform the following:

  1. Remove all columns related to host_data.
  2. Remove additional unnecessary columns.
  3. Remove the dollar signs in the price column.
  4. Apply the same transformation that we used on the host_verifications column above to the amenities column.
  5. Convert all columns of the type date to the type character as is required for insertion into our database.
# (1) Remove host_data columns
listing_data <- data %>% 
    select( - names(host_data)[-1])


# (2) Remove extraneous columns 
listing_data <- listing_data %>% 
    select(-c(license, calendar_updated, bathrooms, scrape_id))


# (3) Remove dollar signs from price column
listing_data <- listing_data %>% 
    mutate(price = str_remove_all(price, "[$,]") %>% 
               as.numeric()
    )

# (4) Transform amenities and host verification column
listing_data <- listing_data %>% 
    mutate(amenities = str_remove_all(amenities, "[\"\\'\\[\\]]"))


# (5)  Convert dates to character
listing_data <- 
    listing_data %>% 
    mutate(last_scraped          = as.character(last_scraped), 
           calendar_last_scraped = as.character(calendar_last_scraped),
           first_review          = as.character(first_review),
           last_review           = as.character(last_review))

1.3.4.2 Listing Table Creation

Just as before, we define our listing database table as a string to be sent as a query. The one difference is that the listing id values are too long to be stored reliably as SQLite integers, so we first convert them to strings.

#Converts listing id to string
#listing id was too long to store in SQLite as a INT hence change to string to store as VARCHAR
listing_data$id <- as.character(listing_data$id)
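
One practical reason, shown here as an illustration of ours (not from the original post), is that the ids were parsed into R as doubles and some are larger than \(2^{53}\), the point beyond which a double can no longer represent every integer exactly, so round-tripping them as numbers risks silent precision loss.

# Illustration (ours): the raw ids were parsed as doubles and some exceed 2^53
max(data$id) > 2^53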

# Create listing table query
    query_2 <- "CREATE TABLE listing (
        id VARCHAR(2000),
        listing_url VARCHAR(100),
        last_scraped VARCHAR(50),
        name VARCHAR(500),
        description VARCHAR(2000),
        neighborhood_overview VARCHAR(2000),
        neighbourhood VARCHAR(100),
        neighbourhood_cleansed VARCHAR(100),
        neighbourhood_group_cleansed VARCHAR(100),
        latitude DECIMAL(25,18),
        longitude DECIMAL(25, 18),
        property_type VARCHAR(100),
        room_type VARCHAR(100),
        picture_url VARCHAR(500),
        host_id INT,
        accommodates INT,
        bathrooms_text VARCHAR(100),
        bedrooms INT,
        beds INT,
        amenities VARCHAR(2000),
        price DECIMAL(15, 5),
        minimum_nights INT,
        maximum_nights INT,
        minimum_minimum_nights INT,
        maximum_minimum_nights INT,
        minimum_maximum_nights INT,
        maximum_maximum_nights INT,
        minimum_nights_avg_ntm DECIMAL(16, 5),
        maximum_nights_avg_ntm DECIMAL(16, 5),
        has_availability BOOLEAN,
        availability_30 INT,
        availability_60 INT,
        availability_90 INT,
        availability_365 INT,
        calendar_last_scraped VARCHAR(50),
        number_of_reviews INT,
        number_of_reviews_ltm INT,
        number_of_reviews_l30d INT,
        first_review VARCHAR(50),
        last_review VARCHAR(50),
        review_scores_rating DECIMAL(10, 5),
        review_scores_accuracy DECIMAL(10, 5),
        review_scores_cleanliness DECIMAL(10, 5),
        review_scores_checkin DECIMAL(10, 5),
        review_scores_communication DECIMAL(10, 5),
        review_scores_location DECIMAL(10, 5),
        review_scores_value DECIMAL(10, 5),
        instant_bookable BOOLEAN,
        reviews_per_month DECIMAL(10, 5),
        PRIMARY KEY(id),
        FOREIGN KEY(host_id) REFERENCES host_info(host_id)
    )"

1.3.4.3 Table Insertion

Create the empty listing table in our database.

dbRemoveTable(con, "listing")
dbSendQuery(con, query_2)
<SQLiteResult>
  SQL  CREATE TABLE listing (
        id VARCHAR(2000),
        listing_url VARCHAR(100),
        last_scraped VARCHAR(50),
        name VARCHAR(500),
        description VARCHAR(2000),
        neighborhood_overview VARCHAR(2000),
        neighbourhood VARCHAR(100),
        neighbourhood_cleansed VARCHAR(100),
        neighbourhood_group_cleansed VARCHAR(100),
        latitude DECIMAL(25,18),
        longitude DECIMAL(25, 18),
        property_type VARCHAR(100),
        room_type VARCHAR(100),
        picture_url VARCHAR(500),
        host_id INT,
        accommodates INT,
        bathrooms_text VARCHAR(100),
        bedrooms INT,
        beds INT,
        amenities VARCHAR(2000),
        price DECIMAL(15, 5),
        minimum_nights INT,
        maximum_nights INT,
        minimum_minimum_nights INT,
        maximum_minimum_nights INT,
        minimum_maximum_nights INT,
        maximum_maximum_nights INT,
        minimum_nights_avg_ntm DECIMAL(16, 5),
        maximum_nights_avg_ntm DECIMAL(16, 5),
        has_availability BOOLEAN,
        availability_30 INT,
        availability_60 INT,
        availability_90 INT,
        availability_365 INT,
        calendar_last_scraped VARCHAR(50),
        number_of_reviews INT,
        number_of_reviews_ltm INT,
        number_of_reviews_l30d INT,
        first_review VARCHAR(50),
        last_review VARCHAR(50),
        review_scores_rating DECIMAL(10, 5),
        review_scores_accuracy DECIMAL(10, 5),
        review_scores_cleanliness DECIMAL(10, 5),
        review_scores_checkin DECIMAL(10, 5),
        review_scores_communication DECIMAL(10, 5),
        review_scores_location DECIMAL(10, 5),
        review_scores_value DECIMAL(10, 5),
        instant_bookable BOOLEAN,
        reviews_per_month DECIMAL(10, 5),
        PRIMARY KEY(id),
        FOREIGN KEY(host_id) REFERENCES host_info(host_id)
    )
  ROWS Fetched: 0 [complete]
       Changed: 0

1.3.4.4 Data Insertion

Insert the listing data into our database using the same insert_to_sql() function as before.

insert_to_sql(con, "listing", listing_data)

1.3.4.5 Database Verification

Confirm that the data was indeed inserted correctly.

res <- dbSendQuery(con, "SELECT * FROM listing LIMIT 10")

out_db <- fetch(res)
dbClearResult(res)

rmarkdown::paged_table(out_db)

1.3.4.6 Database Disconnect

Finally, before exiting our program, disconnect from the database:

dbDisconnect(con)

1.4 Conclusion

We have provided an example of how to pull real-world data from an actual source, then separate, clean and load the data into an RSQLite database. Real-world data is oftentimes initially messy, but that does not mean it cannot be wrangled into a form from which we can gain deeper insights. In our next post, we demonstrate how to answer some interesting questions given our newly created database.

2 Analysis

In this section, we demonstrate how to answer a number of questions related to room listing type and host info using our newly created RSQLite database.

2.1 Questions

We answer the following \(11\) questions:

2.1.1 Room Listing

  1. What is the most common room type available?
  2. What are the top and bottom 10 property types based on average price?
  3. What are the top and bottom 10 property types based on review score?
  4. What are the most common amenities provided?
  5. Is there any correlation between room price and the review score?
  6. How are room listings distributed geographically?

2.1.2 Host

  1. Who are the top 10 hosts based on revenue?
  2. Is there any difference in review score between superhosts and normal hosts?
  3. Is there any difference in response rate between superhosts and normal hosts?
  4. What is the most commonly verified host information?
  5. How has the number of hosts joining Airbnb increased/decreased over time?

2.2 Libraries, Database Connection and Functions

2.2.1 Libraries

First, we must load the necessary libraries required for our data analysis and connect to our database.

# Libraries
library(DBI) 
library(RSQLite)
library(tidyr)
library(tibble)
library(stringr)
library(dbplyr)
library(dplyr, warn.conflicts = FALSE)
library(leaflet)
library(lubridate)
library(ggplot2)
library(gridExtra)
library(tidyquant)
library(zoo)

2.2.2 Database Connection

Next, we connect to the database that we set up in the previous post:

con <- dbConnect(RSQLite::SQLite(), "../db/airdb.SQLite")

2.2.3 The build_airbnb_database() Function

If you have not yet set up the required database, you can use the build_airbnb_database() function to do so. Note that this function leverages the remove_live_database() and insert_to_sql() functions defined in the previous post.

build_airbnb_database <- function(con, listing_data, remove_old_database = FALSE){
    
    #################### Remove existing database
    if(remove_old_database == TRUE){
        remove_live_database(con)
        # Re-open the connection, since remove_live_database() disconnects it
        con <- dbConnect(RSQLite::SQLite(), "../db/airdb.SQLite")
    }

    #################### Deal with NA values 
    listing_data <- 
        listing_data %>% 
        
        # Convert dates to characters for NA values
        mutate(last_scraped = as.character(last_scraped),
               host_since = as.character(host_since),
               calendar_last_scraped = as.character(calendar_last_scraped),
               first_review = as.character(first_review),
               last_review = as.character(last_review),
               ) %>% 
        
        # Homogenize NA values
        #*# Taken from: https://rpubs.com/Argaadya/create_table_sql
        mutate_all(function(x) ifelse(x == "" | x == "None" | x == "N/A", NA, x)) %>%  #*#
        # mutate_all(function(x) ifelse(is.na(x), "NULL", x)) %>% 
        
        # Convert character strings back to date type
        mutate(last_scraped = as.Date(last_scraped),
               host_since = as.Date(host_since),
               calendar_last_scraped = as.Date(calendar_last_scraped),
               first_review = as.Date(first_review),
               last_review = as.Date(last_review)) 
    
    
    #################### Extract host data
    host_data <- listing_data %>% 
        select(host_id:host_identity_verified, 
               calculated_host_listings_count:calculated_host_listings_count_shared_rooms)
    
    
    #################### Remove duplicate values
    host_data <- host_data %>% distinct()
    
    
    #################### Convert dates
    # Note that this will need to converted back to type = date for analysis
    host_data  <- host_data %>% mutate(host_since = as.character(host_since)) 
    
    
    #################### Clean host verification column
    host_data <- 
        host_data %>% 
        mutate(host_verifications = str_remove_all(host_verifications, "[\\'\\[\\]]"))
    
    
    #################### Create table for host info
    query <- "CREATE TABLE host_info(
        host_id INT, 
        host_url VARCHAR(50), 
        host_name VARCHAR(100), 
        host_since VARCHAR(50),
        host_location VARCHAR(500), 
        host_about VARCHAR(10000),
        host_response_time VARCHAR(50),
        host_response_rate VARCHAR(50),
        host_acceptance_rate VARCHAR(50),
        host_is_superhost BOOLEAN,
        host_thumbnail_url VARCHAR(500),
        host_picture_url VARCHAR(500),
        host_neighbourhood VARCHAR(50),
        host_listings_count INT,
        host_total_listings_count INT,
        host_verifications VARCHAR(500),
        host_has_profile_pic BOOLEAN,
        host_identity_verified BOOLEAN,
        calculated_host_listings_count INT, 
        calculated_host_listings_count_entire_homes INT,
        calculated_host_listings_count_private_rooms INT,
        calculated_host_listings_count_shared_rooms INT,
        PRIMARY KEY(host_id)
        )"
    
    
    #################### Load host_info table
    dbSendQuery(con, query)
    
    
    #################### Check schema
    res <- dbSendQuery(con, "PRAGMA table_info([host_info]);")
    fetch(res)  
    dbClearResult(res)
    
    
    #################### Insert data into host_info table
    insert_to_sql(con, "host_info", host_data)
    
    
    ####################Listing table Processing####################
    
    # listing_data %>% view() 
    listing_data %>% glimpse()
    
    #################### Remove host_data columns
    listing_data <- listing_data %>% 
        select( - names(host_data)[-1])
    
    
    #################### Remove extraneous columns 
    listing_data <- listing_data %>% 
        select(-c(license, calendar_updated, bathrooms, scrape_id))
    
    
    #################### Remove dollar signs from price column
    listing_data <- listing_data %>% 
        mutate(price = str_remove_all(price, "[$,]") %>% 
                   as.numeric()
        )
    
    
    #################### Transform amenities and host verification column
    listing_data <- listing_data %>% 
        mutate(amenities = str_remove_all(amenities, "[\"\\'\\[\\]]"))
    
    
    listing_data %>% glimpse()
    
    #################### Convert dates to character
    listing_data <- 
        listing_data %>% 
        mutate(last_scraped          = as.character(last_scraped), 
               calendar_last_scraped = as.character(calendar_last_scraped),
               first_review          = as.character(first_review),
               last_review           = as.character(last_review))
    
    #################### Converts listing id to string
    # listing id was too long to store in SQLite as a INT hence change to string to store as VARCHAR
    listing_data$id <- as.character(listing_data$id)
    
    #################### Create listing table
    query_2 <- "CREATE TABLE listing (
        id VARCHAR(2000),
        listing_url VARCHAR(100),
        last_scraped VARCHAR(50),
        name VARCHAR(500),
        description VARCHAR(2000),
        neighborhood_overview VARCHAR(2000),
        picture_url VARCHAR(500),
        host_id INT,
        neighbourhood VARCHAR(100),
        neighbourhood_cleansed VARCHAR(100),
        neighbourhood_group_cleansed VARCHAR(100),
        latitude DECIMAL(25,18),
        longitude DECIMAL(25, 18),
        property_type VARCHAR(100),
        room_type VARCHAR(100),
        accommodates INT,
        bathrooms_text VARCHAR(100),
        bedrooms INT,
        beds INT,
        amenities VARCHAR(2000),
        price DECIMAL(15, 5),
        minimum_nights INT,
        maximum_nights INT,
        minimum_minimum_nights INT,
        maximum_minimum_nights INT,
        minimum_maximum_nights INT,
        maximum_maximum_nights INT,
        minimum_nights_avg_ntm DECIMAL(16, 5),
        maximum_nights_avg_ntm DECIMAL(16, 5),
        has_availability BOOLEAN,
        availability_30 INT,
        availability_60 INT,
        availability_90 INT,
        availability_365 INT,
        calendar_last_scraped VARCHAR(50),
        number_of_reviews INT,
        number_of_reviews_ltm INT,
        number_of_reviews_l30d INT,
        first_review VARCHAR(50),
        last_review VARCHAR(50),
        review_scores_rating DECIMAL(10, 5),
        review_scores_accuracy DECIMAL(10, 5),
        review_scores_cleanliness DECIMAL(10, 5),
        review_scores_checkin DECIMAL(10, 5),
        review_scores_communication DECIMAL(10, 5),
        review_scores_location DECIMAL(10, 5),
        review_scores_value DECIMAL(10, 5),
        instant_bookable BOOLEAN,
        reviews_per_month DECIMAL(10, 5),
        PRIMARY KEY(id),
        FOREIGN KEY(host_id) REFERENCES host_info(host_id)
    )"
    
    
    #################### Insert listing table into database
    dbSendQuery(con, query_2)
    
    
    #################### Insert data into listing table
    insert_to_sql(con, "listing", listing_data)
    
    
    #################### Extract tables from database

}

2.2.4 Data Loading

We can load the data from our database in either of the following ways:

  1. As data frames
host_info <- tbl(con, "host_info") %>% as.data.frame()
listing <- tbl(con, "listing") %>% as.data.frame()
  2. By querying the database
# load host_info table 
res_host_info <- dbSendQuery(con, "select * from host_info")  
host_info <- fetch(res_host_info) 
dbClearResult(res_host_info)

# load listing table
res_listing <- dbSendQuery(con, "select * from listing")
listing <- fetch(res_listing)
dbClearResult(res_listing)
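
Since the dbplyr library is also loaded, a third option, shown here only as a sketch of ours, is to keep the table lazy and let the filtering run inside SQLite, pulling rows into R only once collect() is called:

# Sketch (ours): lazy table backed by SQLite; the filter is translated to SQL
available_listings <- tbl(con, "listing") %>%
    filter(has_availability == 1) %>%
    select(id, room_type, price) %>%
    collect() # only now are the rows pulled into R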

2.3 Room Listing Queries

2.3.1 Most Common Room Type Available

What is the most common room type available?

To find the most common room type available, we start by selecting room_type and has_availability, the latter being a logical indicator of unit availability. We then keep only the listings that have availability, group them by room_type, count the available listings for each type and sort in descending order of that count.


q1<- listing%>%
    select(room_type,has_availability)%>%
    group_by(room_type)%>%
    filter(has_availability==1)%>%
    summarise(availability=n())%>%
    arrange(desc(availability))

#Using kable for table format
most_common_room_type_available <-q1%>% 
    knitr::kable(align = c("l", "c"),
                 format.args = list(big.mark = ","),
                 digits = 2)

most_common_room_type_available
room_type         availability
Entire home/apt          2,218
Private room             1,495
Hotel room                 150
Shared room                117

For a clearer visual representation of room type availability, we include the following bar graph of the same data.

q1_plot<-q1%>%
    ggplot(aes(x = availability, y = room_type %>% reorder(availability))) + 
    geom_col(fill = "Aquamarine4") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Most Common Room Type Available",
        x        = "Availability",
        y        = "Room Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))

q1_plot

From the table and plot, we can see that Entire home/apt is the room type with the most available units overall, with 2,218 available units. However, this is driven by the number of Entire home/apt listings on Airbnb, which is 2,232. The same pattern holds across all room types, with availability being roughly proportional to the number of listings of that room type.

For further analysis, we utilise other information in the data, particularly the availability of each room type over periods of 30, 60, 90 and 365 days. This allows us to understand the average number of nights available per room type across the different periods.

# Availability-period columns to summarise

availability_periods <- c('availability_30', 'availability_60', 
                          'availability_90', 'availability_365')

# Loop over the periods, computing the mean availability per room type
for(col in availability_periods){
    tables <- paste('q1', col, sep = '_')
    assign(tables, listing %>%
               select(room_type, all_of(col)) %>%
               group_by(room_type) %>%
               summarise(mean = mean(.data[[col]])))
}


#################### Plot the mean availability for each period


q1_availability_30_plot<-q1_availability_30%>%
    ggplot(aes(x = mean, y = room_type )) + 
    geom_col(fill = "Aquamarine4") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Most Common 30 Days",
        x        = "Availability",
        y        = "Room Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))


q1_availability_60_plot<-q1_availability_60%>%
    ggplot(aes(x = mean, y = room_type )) + 
    geom_col(fill = "Aquamarine4") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Most Common 60 Days",
        x        = "Availability",
        y        = "Room Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))


q1_availability_90_plot<-q1_availability_90%>%
    ggplot(aes(x = mean, y = room_type )) + 
    geom_col(fill = "Aquamarine4") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Most Common 90 Days",
        x        = "Availability",
        y        = "Room Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))



q1_availability_365_plot<-q1_availability_365%>%
    ggplot(aes(x = mean, y = room_type )) + 
    geom_col(fill = "Aquamarine4") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Most Common 365 Days",
        x        = "Availability",
        y        = "Room Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))


 
grid.arrange(q1_availability_30_plot, 
             q1_availability_60_plot, 
             q1_availability_90_plot,
             q1_availability_365_plot, ncol = 2)
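
The four plotting blocks above differ only in their input table and title. As a sketch on our part (the plot_avail helper name is ours), the same four panels could be produced with a single helper function and Map():

# Sketch (ours): build the four availability panels with one helper function
plot_avail <- function(df, period_label){
    df %>%
        ggplot(aes(x = mean, y = room_type)) + 
        geom_col(fill = "Aquamarine4") + 
        scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
        labs(title = paste("Most Common", period_label, "Days"),
             x = "Availability", y = "Room Type") + 
        theme_tq() + 
        theme(axis.text.x = element_text(face = "bold"),
              axis.text.y = element_text(face = "bold"))
}

avail_plots <- Map(plot_avail,
                   list(q1_availability_30, q1_availability_60,
                        q1_availability_90, q1_availability_365),
                   c("30", "60", "90", "365"))

do.call(grid.arrange, c(avail_plots, ncol = 2))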

From these four plots we can see that, across all periods, Entire home/apt is still the room type with the highest average availability, followed by Private room, Shared room and Hotel room. We can infer that the demand for Entire home/apt is not excessive, since both its overall availability and its average availability across periods are relatively high. Interestingly, although there are more hotel rooms available than shared rooms, the average availability of hotel rooms is lower than that of shared rooms. This may suggest that in Singapore the demand for hotel rooms is higher than for shared rooms.

2.3.2 Top/Bottom Property Types Average Price

Which are the top 10 most expensive and the top 10 cheapest property types?

First we start by selecting property_type and price from all available data, then we group by property type and summarise by average price. Then, we create two separate rankings:

  1. top_property_type_average_price
  2. bottom_property_type_average_price

q2<- listing%>%
    select(property_type,price)%>%
    group_by(property_type)%>%
    summarise(price = mean(price))%>%
    arrange(desc(price))


top_property_type_average_price <-q2 %>%
    arrange(desc(price))%>%
    top_n(10)%>% 
    knitr::kable(align = c("l", "c"),
    format.args = list(big.mark = ","),
    digits = 2)

top_property_type_average_price
property_type              price
Houseboat               2,162.00
Boat                    1,083.50
Entire bungalow           886.50
Entire villa              858.00
Entire home               519.90
Entire chalet             504.00
Private room in chalet    477.00
Farm stay                 417.00
Private room in tent      363.50
Room in hotel             324.51

bottom_property_type_average_price <-q2 %>%
    arrange(price)%>%
    top_n(-10)%>% 
    knitr::kable(align = c("l", "c"),
                 format.args = list(big.mark = ","),
                 digits = 2)

bottom_property_type_average_price
property_type                     price
Shared room in guesthouse         27.00
Shared room in home               37.50
Shared room                       43.33
Shared room in bed and breakfast  44.31
Private room in casa particular   47.00
Shared room in townhouse          47.00
Private room in earthen home      50.00
Shared room in riad               50.00
Entire vacation home              56.50
Shared room in hostel             58.75

First we plot the top properties based on average price.

top_q2<- q2%>%
    arrange(desc(price))%>%
    top_n(10)%>% 
    ggplot(aes(x = price, y = property_type %>% reorder(price))) + 
    geom_col(fill = "Aquamarine4") + 
   
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Top Property Type by Average price",
        x        = "Average price",
        y        = "Property Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))

top_q2

Finally, we view the bottom properties based on average price.

bottom_q2<- q2%>%
    arrange((price))%>%
    top_n(-10)%>% 
    ggplot(aes(x = price, y = property_type %>% reorder(price))) + 
    geom_col(fill = "Skyblue3") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Bottom Property Type by Average price",
        x        = "Average price",
        y        = "Property Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold")) 

bottom_q2

From the plots, we can see that larger properties tend to have higher average prices. Large properties like Houseboat, Boat, Entire bungalow and Entire villa have high average prices, whereas smaller properties like Shared room in guesthouse have the lowest average prices.

2.3.3 Top/Bottom Property Types Review Score

Which are the 10 best and 10 worst reviewed property types?

For this query, we use the same logic as in the previous section, but instead of selecting property_type and price we select property_type and review_scores_rating.

q3<- listing%>%
    select(property_type,review_scores_rating)%>%
    group_by(property_type)%>%
    summarise(review_scores_rating = mean(review_scores_rating,na.rm=TRUE))%>%
    arrange(desc(review_scores_rating))


top_property_type_review_scores_rating <-q3 %>%
    arrange(desc(review_scores_rating))%>%
    top_n(10)%>% 
    knitr::kable(align = c("l", "c"),
                 format.args = list(big.mark = ","),
                 digits = 2)

top_property_type_review_scores_rating
property_type                  review_scores_rating
Boat                                           4.91
Entire guest suite                             4.89
Entire loft                                    4.88
Shared room in boutique hotel                  4.88
Tiny home                                      4.87
Private room in villa                          4.84
Entire place                                   4.83
Entire guesthouse                              4.81
Shared room in condo                           4.75
Private room in loft                           4.72


bottom_property_type_review_scores_rating <-q3 %>%
    arrange(review_scores_rating)%>%
    top_n(-10)%>% 
    knitr::kable(align = c("l", "c"),
                 format.args = list(big.mark = ","),
                 digits = 2)

bottom_property_type_review_scores_rating
property_type               review_scores_rating
Tent                                        1.00
Private room in guesthouse                  3.04
Private room                                3.62
Room in aparthotel                          3.73
Entire home                                 3.83
Private room in tent                        4.00
Shared room in hostel                       4.06
Shared room in rental unit                  4.06
Private room in hostel                      4.21
Room in hostel                              4.21

Next, we plot the top 10 property types based on review score.

top_q3<- q3%>%
    arrange(desc(review_scores_rating))%>%
    top_n(10)%>% 
    ggplot(aes(x = review_scores_rating, y = property_type %>% reorder(review_scores_rating))) + 
    geom_col(fill = "Aquamarine4") + 
    
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Top Property Type by Review Score Rating",
        x        = "Review Score Rating",
        y        = "Property Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))

top_q3

Finally, we plot the bottom 10 property types based on review score.

bottom_q3<- q3%>%
    arrange((review_scores_rating))%>%
    top_n(-10)%>% 
    ggplot(aes(x = review_scores_rating, y = property_type %>% reorder(review_scores_rating))) + 
    geom_col(fill = "Skyblue3") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Bottom Property Type by Review Score Rating",
        x        = "Review Score Rating",
        y        = "Property Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold")) 

bottom_q3

From the plots and the tables, we can see that Boat has the highest average review score rating of 4.91, with other property types such as Entire guest suite, Entire loft, Shared room in boutique hotel and Tiny home also scoring highly. The top-performing property types tend to be the more premium and distinctive ones, which is understandable since they result in more memorable stays.

As for the bottom property types, Tent has the lowest average review score rating of 1, which presents itself as an outlier, as the next lowest average review score rating is 3.04 for Private room in guesthouse. Let’s take a look at the Tent listings to understand the reason behind the low rating. We shall also take a look at the count of all listings by property type for a better understanding of the review score ratings.

tent_listing <- listing %>% 
    filter(property_type == "Tent") %>% 
    select(property_type, review_scores_rating)

property_count <- listing %>% 
    group_by(property_type) %>% 
    summarise(count = n()) %>% 
    select(property_type, count) %>% 
    arrange(count)

tent_listing
  property_type review_scores_rating
1          Tent                    1
property_count
# A tibble: 56 × 2
   property_type                   count
   <chr>                           <int>
 1 Campsite                            1
 2 Entire chalet                       1
 3 Entire home/apt                     1
 4 Farm stay                           1
 5 Private room in casa particular     1
 6 Private room in earthen home        1
 7 Private room in guest suite         1
 8 Private room in vacation home       1
 9 Shared room in riad                 1
10 Shared room in townhouse            1
# … with 46 more rows

From tent_listing, we can see that there is only one entry for the Tent property type, so the review score of 1 comes from a single listing. From property_count we can also see that many property types have very few listings, hence their average ratings may be skewed and not as representative.

2.3.4 Most Common Amenities

Which are the most common amenities offered on the properties?

Looking at the amenities information in the listing data, we see that all of a listing’s amenities are stored in a single column. In order to quantify and group the amenities, we first get the maximum number of amenities a listing has by counting the separators with the str_count() function from the stringr library.


ncols_q4 <- max(stringr::str_count(listing$amenities, ",")) + 1

Thereafter, we split the amenities string into that many columns, one amenity per column, using the separate() function from the tidyr library.

#Create list of column names
colmn_q4 <- paste("col", 1:ncols_q4)

#Create columns filled with the amenities all properties have
q4 <- listing%>%
    select(amenities)%>%
    tidyr::separate(
    col = amenities, 
    sep= ",", 
    into=colmn_q4,
    remove = FALSE)

Using pivot_longer(), we reshape the data into long format, which allows the occurrences of each amenity type to be counted.

q4 <- pivot_longer(data=q4,
                  cols = 'col 1':paste('col',length(q4)-1), #pivot to total number of columns
                  names_to = "col_number",
                  values_to = "separated_amenities")

Next, we tabulate all occurrences of the available amenity types.

    
q4 <- q4 %>%
    select(separated_amenities) %>%
    group_by(separated_amenities) %>%
    summarise(amenities_count=n()) %>%
    na.omit() %>%
    arrange(desc(amenities_count)) %>%
    top_n (10)

 
q4_plot <- q4 %>%
    ggplot(aes(x = amenities_count, y = separated_amenities %>% reorder(amenities_count))) + 
    geom_col(fill = "Aquamarine4") + 
    
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Most Common Amenities",
        x        = "Count",
        y        = "Amenities"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))

most_common_amenities <- q4%>%
    knitr::kable(align = c("l", "c"),
                 format.args = list(big.mark = ","),
                 digits = 2)

q4_plot

most_common_amenities
separated_amenities       amenities_count
Long term stays allowed             3,940
Air conditioning                    3,894
Essentials                          3,464
Kitchen                             3,325
Hangers                             3,242
Washer                              3,223
Iron                                3,069
Hair dryer                          2,896
Hot water                           2,745
Wifi                                2,733
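
As an aside, the separate/pivot_longer/count pipeline above could also be written more compactly with tidyr::separate_rows(). This is only a sketch of ours (the q4_alt name is ours, and the added str_trim() call means whitespace around amenity names is handled slightly differently than above):

# Sketch (ours): count amenities by splitting each string directly into rows
q4_alt <- listing %>%
    select(amenities) %>%
    tidyr::separate_rows(amenities, sep = ",") %>%
    mutate(amenities = str_trim(amenities)) %>%
    count(amenities, name = "amenities_count", sort = TRUE) %>%
    slice_head(n = 10)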

2.3.5 Room Price Review Score Correlation

Does paying a higher price for ‘nicer’ accommodations correlate with an increased satisfaction level?

We notice that while an accommodation may only have one score in terms of price, it is reviewed across several different dimensions:

  1. Accuracy
  2. Cleanliness
  3. Check-in
  4. Communication
  5. Location
  6. Value
  7. Overall rating

We start by selecting the price and review-related columns from the listing table, before dropping any rows that are missing review scores. This leaves us with more than enough observations to examine correlation.

q5 <- 
    listing %>% select(price, review_scores_accuracy, 
                       review_scores_cleanliness, review_scores_checkin, 
                       review_scores_communication, review_scores_location, 
                       review_scores_rating, review_scores_value) %>% 
                drop_na()

We define the following function c_plot() to handle the repetitive plotting of price against our seven different review dimensions.

# Function defining correlation plot
c_plot <- function(df, y_val, y_name, plot_title = y_name, clr = "dodgerblue4"){
    c_plot <- df %>% 
        ggplot(aes(x = price,
                   y = y_val)) +
            geom_jitter(color = clr, alpha = 0.5) + 
            scale_x_log10(label = scales::number_format(big.mark = ",")) +
            labs(x = "Price",
                 y = y_name,
                 title = plot_title) + theme_tq()
    return(c_plot)
}

We then construct our correlation graphs using the above defined c_plot() function. We also use the grid.arrange() function from the gridExtra library to help align our multiple plots for parallel examination.

# Colours for correlation plot
c <- c("Aquamarine4", "Sienna3")

# Build correlation plots
q5_1 <- c_plot(q5, q5$review_scores_rating, "Rating", plot_title = "Rating vs Price") 
q5_2 <- c_plot(q5, q5$review_scores_accuracy, "Accuracy", clr = c[1]) 
q5_3 <- c_plot(q5, q5$review_scores_cleanliness, "Cleanliness", clr = c[1]) 
q5_4 <- c_plot(q5, q5$review_scores_checkin, "Check-in", clr = c[1]) 
q5_5 <- c_plot(q5, q5$review_scores_communication, "Communication", clr = c[2]) 
q5_6 <- c_plot(q5, q5$review_scores_location, "Location", clr = c[2]) 
q5_7 <- c_plot(q5, q5$review_scores_value, "Value", clr = c[2]) 

# Output correlation plots
q5_1

grid.arrange(q5_2, q5_3, q5_4, ncol = 3)

grid.arrange(q5_5, q5_6, q5_7, ncol = 3)

Our plots appear noisy, but the trend seems to be that properties receiving lower ratings along each of the dimensions tend to be those with lower rental prices. Among the few properties falling into the higher rental price range (i.e. > 1,000), almost all of the ratings lie above 4 out of the possible 5 points. However, there are still plenty of properties at the lower price range that receive excellent review scores. Unfortunately, due to the asymmetric, non-normally distributed nature of the data, we cannot rely on the default (Pearson) cor.test() to determine whether the correlation between price and review rating is statistically significant.
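
One distribution-free option, sketched here on our side rather than taken from the original analysis, is a rank-based measure such as Spearman’s rho, which makes no normality assumption:

# Sketch (ours): Spearman's rank correlation does not assume normality.
# Price vs the overall rating; the other dimensions can be checked the same way.
cor.test(q5$price, q5$review_scores_rating,
         method = "spearman", exact = FALSE)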

2.3.6 Room Listing Geographical Distribution

How are rental properties distributed by geographical location?

To study the geographical distribution of the accommodations available to rent, we use the leaflet library to create an interactive map:

q6 <- listing %>%
    left_join(host_info, by = "host_id") %>%
    # id and name are included here since the popup below refers to them
    select(id, name, host_id, host_name, listing_url, latitude, longitude, price,
           review_scores_rating, number_of_reviews, neighbourhood_cleansed) %>%
    replace_na(list(name = "No Name", host_name = "No Host Name"))


popup <- paste0("<b>", q6$name, "</b><br>",
                "Listing ID: ", q6$id, "<br>",
                "Host Name: ", q6$host_name, "<br>",
                "Price: ", q6$price, "<br>",
                "Review Scores Rating: ", ifelse(is.na(q6$review_scores_rating), 
                "No Review Yet", q6$review_scores_rating) , "<br>",
                "Number of Reviews: ", q6$number_of_reviews, "<br>",
                "<a href=", q6$listing_url, "> Click for more info</a>"
                )
leaflet(data = q6) %>% 
    addTiles() %>% 
    addMarkers(lng = ~longitude,
               lat = ~latitude, 
               popup = popup, 
               clusterOptions = markerClusterOptions())

2.4 Host Queries

2.4.1 Top hosts

Who are the top 10 hosts based on revenue?

We begin by joining together the two tables on the column host_id. We then select the necessary columns and create a new column called total_earnings which consists of the formula:

total_earnings = price * review_scores_rating * minimum_nights

We then remove the rows containing NA values and, after grouping by the attributes host_id and host_name, count each host’s listings and calculate the average price. Finally, we select the columns we want and arrange in descending order by total_earnings.

q7 <- listing %>% 
    left_join(host_info, by = "host_id") %>% 
    select(host_id, host_name, price, 
           review_scores_rating, minimum_nights, number_of_reviews) %>%
    mutate(total_earnings = price * review_scores_rating * minimum_nights) %>% 
    drop_na() %>% 
    group_by(host_id, host_name) %>% 
    mutate(number_of_listing = n(),
           average_price = mean(price)) %>% 
    ungroup() %>% 
    select(host_id, host_name, total_earnings, number_of_listing, average_price) %>% 
    arrange(desc(total_earnings))

We create two plots instead of just one to examine the top hosts by revenue. The first plot examines the top 10 hosts by the number of listings they have. The second plot depicts the top 10 hosts by their total earnings.

We plot the results using a similar process for both plots, with the main difference being that the y-axis for the top_host_by_listing plot is ordered by the number_of_listing column, while the top_host_by_earning plot is ordered by the total_earnings column.

top_host_by_listing <- 
    q7 %>% 
    arrange(desc(number_of_listing)) %>% 
    select(host_name, number_of_listing) %>% 
    distinct() %>% 
    head(10) %>% 
    ggplot(aes(x = number_of_listing, y = host_name %>% reorder(number_of_listing))) + 
    geom_col(fill = "Skyblue3") +
    labs(
        title    = "Top 10 Host by # of Listings",
        x        = "Number of Listing",
        y        = "Host Name"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))

top_host_by_earning <- 
    q7 %>% 
    select(host_name, total_earnings) %>%
    arrange(desc(total_earnings)) %>% 
    filter(total_earnings != 16242500) %>% # exclude a single extreme value that dominates the scale
    head(10) %>% 
    ggplot(aes(x = total_earnings, y = host_name %>% reorder(total_earnings))) + 
    geom_col(fill = "Aquamarine4") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Top 10 Host by Total Earning",
        x        = "Total Earning (in SGD)",
        y        = "Host Name"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(angle = 45, face = "bold", 
                                     vjust = 0.85, hjust = 0.89),
          axis.text.y = element_text(face = "bold"))

We once again use the grid.arrange() function from the gridExtra library to view the two plots side-by-side to aid direct comparison.

grid.arrange(top_host_by_listing, top_host_by_earning, ncol = 2)

From the two graphs, we notice that no host appears in both Top 10 lists.
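As a quick programmatic check of this observation (a sketch rather than part of the original analysis), we can rebuild the two Top 10 name lists, mirroring the filtering used in each plot, and intersect them:

# Hedged sketch: confirm that the two Top 10 lists share no host names
top10_by_listing_names <- q7 %>%
    arrange(desc(number_of_listing)) %>%
    select(host_name, number_of_listing) %>%
    distinct() %>%
    head(10) %>%
    pull(host_name)

top10_by_earning_names <- q7 %>%
    select(host_name, total_earnings) %>%
    arrange(desc(total_earnings)) %>%
    filter(total_earnings != 16242500) %>% # same outlier filter as the plot
    head(10) %>%
    pull(host_name)

intersect(top10_by_listing_names, top10_by_earning_names) # expect character(0)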

2.4.2 Superhost vs Host Review Score

Is there any difference in review score between superhost and normal host?

Certain hosts receive the designation of superhost, which is awarded for meeting the following criteria (a rough check of the criteria visible in our data is sketched after the list):

  1. 12 months of experience as a host
  2. A minimum of 100 nights booked
  3. A response rate of 90% or higher
  4. A consistent overall rating of 4.8 or higher
  5. A cancellation rate of less than 1%
  6. At least 80% of all reviews are 5-star
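Only two of these criteria, the response rate and the overall rating, are directly observable in our data. As a rough illustration (a sketch rather than part of the original analysis, assuming host_response_rate is stored as a percentage string such as "95%"), we could cross-tabulate superhost status against these two observable criteria:

# Hedged sketch: cross-tabulate superhost status against the two
# criteria observable in our data (response rate >= 90%, rating >= 4.8)
listing %>%
    left_join(host_info, by = "host_id") %>%
    select(host_id, host_is_superhost, host_response_rate, review_scores_rating) %>%
    drop_na() %>%
    mutate(host_response_rate = as.numeric(str_remove(host_response_rate, "%")),
           meets_visible_criteria = host_response_rate >= 90 &
               review_scores_rating >= 4.8) %>%
    count(host_is_superhost, meets_visible_criteria)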

In order to compare review scores between superhosts and regular hosts, we use the mutate() function to convert the existing host_is_superhost column into a logical value.

q8 <- listing %>% 
    left_join(host_info, by = "host_id") %>% 
    select(host_id, host_name, review_scores_rating, host_is_superhost) %>% 
    drop_na() %>% 
    mutate(host_is_superhost = as.logical(host_is_superhost)) %>% 
    select(review_scores_rating, host_is_superhost)

We then create two separate boxplots after isolating only the observations that match the respective TRUE/FALSE condition for the host_is_superhost column.

q8_1 <- 
    q8[q8$host_is_superhost == FALSE, ] %>% 
    ggplot(aes(y = review_scores_rating, group = host_is_superhost)) + 
    geom_boxplot(fill = "Skyblue3") + 
    labs(
        title    = "Host Ratings",
        subtitle = "Ratings Distribution",
        x        = "Host",
        y        = "Rating"
    ) + theme_tq()  + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold")) 

q8_2 <- 
    q8[q8$host_is_superhost == TRUE, ] %>% 
    ggplot(aes(y = review_scores_rating, group = host_is_superhost)) + 
    geom_boxplot(fill = "Aquamarine3") + 
    labs(
        title    = "Superhost Ratings",
        subtitle = "Ratings Distribution",
        x        = "Superhost",
        y        = "Rating"
    ) + theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))

Finally, we once again use the grid.arrange() function from the gridExtra library to view the two plots side-by-side to aid direct comparison.

grid.arrange(q8_1, q8_2, ncol = 2)

Based on visual inspection, there does appear to be a difference in review scores between superhosts and regular hosts, especially in terms of variance, with the typical superhost rating also sitting slightly higher. The IQR for host ratings is much larger than that for superhost ratings. Unfortunately, due to the non-normality of the data, we are unable to rely on a t.test() to verify statistically whether our visual impressions are correct.
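One non-parametric alternative that does not assume normality is the Wilcoxon rank-sum (Mann-Whitney) test. The sketch below is not part of the original analysis, but it illustrates how such a comparison could be run on the q8 data:

# Hedged sketch: compare the two rating distributions without relying
# on a normality assumption
wilcox.test(review_scores_rating ~ host_is_superhost, data = q8)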

2.4.3 Superhost vs Host Response Rate

Is there any difference in response rate between superhost and normal host?

We repeat the same process as the previous query, this time to determine whether there is a difference in response rate.

q9 <- listing %>% 
    left_join(host_info, by = "host_id") %>% 
    select(host_id, host_name, host_response_rate, host_acceptance_rate, host_is_superhost) %>%
    drop_na() %>% 
    mutate(host_is_superhost = as.logical(host_is_superhost), 
           # Transform acceptance rate and response rate
           host_response_rate = host_response_rate %>% 
               str_remove("[%]") %>% 
               as.numeric(),
           host_acceptance_rate = host_acceptance_rate %>% 
               str_remove("[%]") %>% 
               as.numeric()
    )
q9_1 <- 
    q9[q9$host_is_superhost == FALSE, ] %>% 
    ggplot(aes(y = host_response_rate, group = host_is_superhost)) + 
    geom_boxplot(fill = "Skyblue3") + 
    labs(
        title    = "Host Response Rate",
        subtitle = "Ratings Distribution",
        x        = "Host",
        y        = "Rating"
    ) + theme_tq()  + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold")) 

q9_2 <- 
    q9[q9$host_is_superhost == TRUE, ] %>% 
    ggplot(aes(y = host_response_rate, group = host_is_superhost)) + 
    geom_boxplot(fill = "Aquamarine3") + 
    labs(
        title    = "Superhost Response Rate",
        subtitle = "Ratings Distribution",
        x        = "Superhost",
        y        = "Rating"
    ) + theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))
grid.arrange(q9_1, q9_2, ncol = 2)

We see here that regular hosts have a noticeably lower first quartile for response rate than superhosts. However, the set of superhosts is not without its outliers, raising the question of whether some of these hosts may soon lose their superhost status due to their lacklustre response rates.

2.4.4 Most Commonly Verified Host Information

What is the most commonly verified host information?

Here we answer this question by applying the same logic used in question #4 for the most common amenities.


# Maximum number of verification methods listed by any single host
ncols_q10 <- max(stringr::str_count(na.omit(host_info$host_verifications), ",")) + 1
# Create a vector of column names
colmn_q10 <- paste("col", 1:ncols_q10)


# Split the verification methods into separate columns
q10 <- host_info %>%
    select(host_verifications) %>%
    tidyr::separate(
        col    = host_verifications,
        sep    = ",",
        into   = colmn_q10,
        remove = FALSE)
# Pivot longer to reshape the data into long (database) format
q10 <- pivot_longer(data = q10,
                    cols = 'col 1':paste('col', length(q10) - 1), # pivot across all of the split columns
                    names_to = "col_number",
                    values_to = "separated_host_verifications")

# Some rows of separated_host_verifications contain blank strings; convert
# them to NA so that they can be removed
q10$separated_host_verifications[q10$separated_host_verifications == ""] <- NA

# Pipe to answer the question
# (separated_host_verifications is trimmed to consolidate the 'phone' method)
q10 <- q10 %>%
    mutate(separated_host_verifications = str_trim(separated_host_verifications)) %>%
    select(separated_host_verifications) %>%
    group_by(separated_host_verifications) %>%
    summarise(host_verifications_count = n()) %>%
    na.omit() %>%
    arrange(desc(host_verifications_count)) %>%
    top_n(10)

# Plot the answer for visual reference
q10_plot <- q10 %>%
    ggplot(aes(x = host_verifications_count,
               y = separated_host_verifications %>%
                   reorder(host_verifications_count))) +
    geom_col(fill = "Aquamarine4") +
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) +
    labs(
        title    = "Most Common Verified Information",
        x        = "Count",
        y        = "Verified Information"
    ) +
    theme_tq() +
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))

q10_plot



most_common_host_verifications <- q10 %>%
    knitr::kable(align = c("l", "c"),
                 format.args = list(big.mark = ","),
                 digits = 2)

most_common_host_verifications
separated_host_verifications    host_verifications_count
phone                                              1,167
email                                              1,002
work_email                                           130

2.4.5 Most Active Months for New Host Sign-Up

What months of the year have historically seen the most activity in terms of new hosts signing up for the service?

We begin by extracting and isolating the necessary data using the following steps:

  1. Join together the two tables.
  2. Select the relevant columns.
  3. Create a host_since_date column by converting host_since to the Date type using the as.Date() function.
  4. Separate the host_since column into three separate columns for Year, Month and Day respectively.
  5. Discard the unnecessary Day column.
  6. Group by Year and Month and use the count() function to tabulate the results.
  7. Construct the columns needed for the plot and table in the next step.
  8. Discard any rows containing NA values.

q11 <- 
    host_info %>%                                                     # 1
    left_join(listing, by = "host_id") %>% 
    select(host_id, host_since) %>%                                   # 2
    mutate(host_since_date = as.Date(host_since)) %>%                 # 3
    separate("host_since", c("Year", "Month", "Day"), sep = "-") %>%  # 4
    select(-Day) %>%                                                  # 5
    group_by(Year, Month) %>%                                         # 6
    count(Year, Month) %>% 
    ungroup() %>% 
    mutate(year_month = paste0(Year, "-", Month, "-", "01"),          # 7
           year_month_2 = paste0(Year, "-", Month), 
           joined = n) %>% 
    select(year_month, year_month_2, joined) %>% 
    mutate(year_month = as.Date(year_month)) %>% 
    drop_na()                                                         # 8

Next, we use the ggplot2 library's geom_line() function to plot the data as a time series.

q11 %>%
    ggplot(aes(x = year_month, y = joined)) +
    geom_line(size = 1.2, colour = "Aquamarine4") + 
    scale_x_date(breaks = waiver(), date_breaks = "6 months") + theme_tq() + 
    theme(axis.text.x = element_text(angle = 45, face = "bold", vjust = 0.65),
          axis.text.y = element_text(face = "bold")) + 
    labs(
        title    = "Number of hosts joined",
        subtitle = "Shows the rate at which new hosts sign up for Airbnb",
        caption  = "",
        x        = "Year/Month",
        y        = "Hosts Joined")

Since we are dealing with data spanning a number of years, it is also helpful to compile a list of the top 10 most active months in terms of new hosts. Note that the as.yearmon() function used below comes from the zoo library.

q11 %>% 
    select(-year_month) %>% 
    mutate(year_month = as.yearmon(year_month_2)) %>% 
    select(-year_month_2) %>% 
    select(year_month, joined) %>% 
    arrange(desc(joined)) %>% 
    head(10) %>% knitr::kable(align = c("c", "c"))
year_month    joined
Jul 2017         237
Apr 2016         224
Oct 2017         220
Oct 2015         129
May 2012         108
Sep 2013         103
Jan 2019         103
Oct 2016         101
Aug 2015          95
Dec 2014          87

We see that the second half of the year (Jul-Dec) makes up most of the busiest months in terms of new hosts joining the service. Jul 2017, Apr 2016 and Oct 2017 are the three busiest months overall.
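As a quick check of this observation (a sketch rather than part of the original analysis), we can count how many of the ten busiest months fall in July through December:

# Hedged sketch: how many of the 10 busiest months fall in Jul-Dec?
q11 %>%
    arrange(desc(joined)) %>%
    head(10) %>%
    mutate(month_num = as.integer(format(year_month, "%m"))) %>%
    summarise(months_in_jul_dec = sum(month_num >= 7))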

2.4.6 Database Disconnect

Finally, before exiting our program, we disconnect from the database:

dbDisconnect(con)

2.5 Conclusion

We have demonstrated how to pull in real-world data, divide and clean the data into usable tables, insert the data into a database and then use that database to answer interesting questions that may help provide useful and actionable insights on which to base future decisions. R and its many libraries, specifically the tidyverse, provide a powerful framework with which to answer many interesting questions, often in only a few lines of code. We encourage you to come up with your own questions and see if you can answer them using the provided data.