In a world of increasing data availability, it has become the norm to store and retrieve data from a cloud database instead of a local file, whether you are part of a start-up or a large corporation. Several services, such as Google Cloud Platform and Microsoft Azure, provide cloud storage for large datasets. Most current database systems use SQL to store and retrieve this data. Therefore, knowing how to write a command or query in SQL is currently one of the top skills required in data-related jobs, especially for data analysts and data scientists. The following graph shows the result of the 2020 Data Science and Machine Learning Survey.
To learn how to run SQL queries, we will use one of the most common database management systems: MySQL. In this post, I will guide you through the following things with MySQL:
If you are not familiar with SQL, you can visit this website for a quick introduction and just read the welcome page, or try some of the practice exercises and come back here later.
There are many options for creating a database: you can use a common cloud service such as Google Cloud Platform or Microsoft Azure, or set up a local MySQL server on your device. However, setting these up can be a fairly long process. Since our goal is to focus on preparing data and storing it with SQL, we will use a free hosting service, db4free. You can also use SQLite to store a small database.
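If you prefer a fully local alternative, a minimal sketch with SQLite through the same DBI interface could look like the code below. This assumes the RSQLite package is installed, and the file name airbnb.sqlite is only illustrative.

# Minimal local alternative (assumes the RSQLite package is installed)
library(DBI)
library(RSQLite)

con_lite <- dbConnect(RSQLite::SQLite(), "airbnb.sqlite") # creates the file if it does not exist
dbListTables(con_lite) # empty for a fresh database
dbDisconnect(con_lite)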
db4free sets up a free, small MySQL server for us to practice on. You just need to register with your email and enter the following information:
The following are the libraries that we will use throughout this post.
# Data Wrangling
library(tidyverse)
# SQL Driver
library(RMySQL)
After you have set up the MySQL server, you can connect to it with RMySQL. You need the following information from the server:
mydb <- dbConnect(MySQL(),
                  host = "db4free.net", # write "localhost" if you use a local server
                  port = 3306,
                  user = "your_name",
                  password = "your_pass",
                  dbname = "your_dbname"
                  )
Let’s check the connection and start with the basics. We will create a simple table of customer information with 3 columns:

customer_id: ID of the customer, with integer data type (INT)
name: name of the customer, with character data type (VARCHAR(100))
city: city location of the customer, with character data type (VARCHAR(100))

Trivia
VARCHAR(100) means that the column stores character data with a maximum length of 100 characters.
The following is the general SQL command for creating a table:
CREATE TABLE table_name(column_1 data_type_1, column_2 data_type_2, ...)
query <- "CREATE TABLE dummy_customer(
customer_id INT,
name VARCHAR(100),
city VARCHAR(100)
)"
query
## [1] "CREATE TABLE dummy_customer(\ncustomer_id INT,\nname VARCHAR(100),\ncity VARCHAR(100)\n)"
You can send the query using dbSendQuery().
dbSendQuery(mydb,  # your connection
            query) # SQL query
## <MySQLResult:0,0,0>
You can check all the available tables in the database by using the following query. It will return all tables in your database in a data frame format.
SHOW TABLES
dbSendQuery(mydb, "SHOW TABLES") %>%
fetch()
If you want to check the name and data type of each column in a table, you can use the following command.
DESCRIBE table_name
dbSendQuery(mydb, "DESCRIBE dummy_customer") %>%
fetch()
If you want to delete the table, you can run the following query.
DROP TABLE table_name
dbSendQuery(mydb, "DROP TABLE dummy_customer")
## <MySQLResult:589867488,0,3>
We will use data on Thailand room listings from Airbnb. The dataset contains information about rooms or listings posted on the Airbnb site in Bangkok, Central Thailand, Thailand.
We will use the Listing dataset, which contains detailed information about each listing or room posted on Airbnb. One host can have multiple listings. For detailed information about each column, you can check the description in this spreadsheet.
The dataset can be broken down further into several parts. For example, the listing data has several columns that store information about the host rather than the room/listing. In practice, we can reduce the storage required for the database by separating the listing data from the host data, since a single host can have multiple listings.
To illustrate the relation between the tables, we will draw an Entity Relationship Diagram (ERD). An ERD is often used to help data engineers design a database and shows the relation between tables. Open the image in a new tab if you want to zoom in.
On each table, you can have a column that contains a unique ID to identify each row; this column is called the Primary Key (PK). For example, in the host_info table a unique host should be stored only once, with no duplicates, and each host is identified by the host_id column. A table can also have a Foreign Key (FK). For example, the listing table has id as its primary key and host_id as a foreign key. This means that we can join the listing table with the host_info table by matching host_id in the listing table with host_id in the host_info table.
The relation, or cardinality, between tables is illustrated by the symbols at both ends of the connecting line. There are different cardinalities in an ERD; you can read a detailed explanation here. A host can have zero (no listing) or many listings, so the cardinality on the listing side is drawn as zero or many. Naturally, a room listing can only be owned by a single host; there is no listing owned by different hosts. Therefore, the cardinality on the host_info side is drawn as one (and only one).
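To make the primary key/foreign key relation concrete, the following is a sketch of a query that joins the two tables on host_id. It assumes the listing and host_info tables have already been created and filled, which we only do later in this post, so treat it as an illustration rather than something to run right now.

# Illustrative only: join listing with host_info through the shared host_id
res <- dbSendQuery(mydb, "SELECT l.id, l.name, h.host_name
                          FROM listing AS l
                          JOIN host_info AS h ON l.host_id = h.host_id
                          LIMIT 5")
fetch(res)
dbClearResult(res)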
Since the only table without a foreign key is host_info, we will start by creating and cleaning a host table from the listing data. Let’s start by importing the listing data.
df_listing <- read.csv("data/listings.csv")
cat( paste("Number of rows:", nrow(df_listing), "\n"))
## Number of rows: 19289
cat( paste("Number of columns:", ncol(df_listing)))
## Number of columns: 74
Let’s check some of the content of the dataset.
head(df_listing)
Some missing values are written as an empty string (""), as "None", or as "N/A", so we will convert these values into explicit NA for R.
df_listing <- df_listing %>%
  mutate_all( function(x) ifelse(x == "" | x == "None" | x == "N/A", NA, x) )
Let’s check the information from the dataset.
glimpse(df_listing)
## Rows: 19,289
## Columns: 74
## $ id <int> 27934, 27942, 27979, 2835…
## $ listing_url <chr> "https://www.airbnb.com/r…
## $ scrape_id <dbl> 2.021032e+13, 2.021032e+1…
## $ last_scraped <chr> "2021-03-22", "2021-03-22…
## $ name <chr> "Nice room with superb ci…
## $ description <chr> "Our cool and comfortable…
## $ neighborhood_overview <chr> "It is very center of Ban…
## $ picture_url <chr> "https://a0.muscache.com/…
## $ host_id <int> 120437, 120462, 120541, 1…
## $ host_url <chr> "https://www.airbnb.com/u…
## $ host_name <chr> "Nuttee", "Donald", "Emy"…
## $ host_since <chr> "2010-05-08", "2010-05-08…
## $ host_location <chr> "Bangkok", "Thailand", "B…
## $ host_about <chr> "Hi All, I am nuttee patr…
## $ host_response_time <chr> "within a few hours", NA,…
## $ host_response_rate <chr> "100%", NA, NA, "100%", N…
## $ host_acceptance_rate <chr> NA, NA, NA, NA, NA, NA, N…
## $ host_is_superhost <chr> "f", "f", "f", "f", "f", …
## $ host_thumbnail_url <chr> "https://a0.muscache.com/…
## $ host_picture_url <chr> "https://a0.muscache.com/…
## $ host_neighbourhood <chr> "Victory Monument", NA, N…
## $ host_listings_count <int> 2, 1, 2, 1, 1, 1, 1, 1, 1…
## $ host_total_listings_count <int> 2, 1, 2, 1, 1, 1, 1, 1, 1…
## $ host_verifications <chr> "['email', 'phone', 'revi…
## $ host_has_profile_pic <chr> "t", "t", "t", "t", "t", …
## $ host_identity_verified <chr> "t", "f", "f", "t", "f", …
## $ neighbourhood <chr> "Samsen Nai, Bangkok, Tha…
## $ neighbourhood_cleansed <chr> "Ratchathewi", "Bangkok Y…
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, N…
## $ latitude <dbl> 13.76000, 13.73463, 13.66…
## $ longitude <dbl> 100.5427, 100.4754, 100.6…
## $ property_type <chr> "Entire condominium", "Pr…
## $ room_type <chr> "Entire home/apt", "Priva…
## $ accommodates <int> 3, 1, 2, 2, 2, 2, 2, 2, 3…
## $ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N…
## $ bathrooms_text <chr> "1.5 baths", "1 bath", "1…
## $ bedrooms <int> 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ beds <int> 1, 1, 2, 1, 1, 1, 1, 1, 1…
## $ amenities <chr> "[\"Microwave\", \"Pool\"…
## $ price <chr> "$1,694.00", "$924.00", "…
## $ minimum_nights <int> 3, 1, 1, 28, 60, 3, 5, 1,…
## $ maximum_nights <int> 90, 730, 730, 365, 730, 7…
## $ minimum_minimum_nights <int> 3, 1, 1, 28, 60, 3, 5, 1,…
## $ maximum_minimum_nights <int> 3, 1, 1, 28, 60, 3, 5, 1,…
## $ minimum_maximum_nights <int> 90, 730, 730, 1125, 730, …
## $ maximum_maximum_nights <int> 90, 730, 730, 1125, 730, …
## $ minimum_nights_avg_ntm <dbl> 3, 1, 1, 28, 60, 3, 5, 1,…
## $ maximum_nights_avg_ntm <dbl> 90, 730, 730, 1125, 730, …
## $ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N…
## $ has_availability <chr> "t", "f", "t", "t", "t", …
## $ availability_30 <int> 27, 0, 1, 29, 30, 0, 30, …
## $ availability_60 <int> 57, 0, 1, 59, 60, 0, 60, …
## $ availability_90 <int> 87, 0, 1, 89, 90, 0, 90, …
## $ availability_365 <int> 362, 0, 1, 364, 365, 0, 3…
## $ calendar_last_scraped <chr> "2021-03-22", "2021-03-22…
## $ number_of_reviews <int> 65, 0, 0, 52, 0, 0, 1, 0,…
## $ number_of_reviews_ltm <int> 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ number_of_reviews_l30d <int> 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ first_review <chr> "2012-04-07", NA, NA, "20…
## $ last_review <chr> "2020-01-06", NA, NA, "20…
## $ review_scores_rating <int> 97, NA, NA, 96, NA, NA, 8…
## $ review_scores_accuracy <int> 10, NA, NA, 10, NA, NA, 1…
## $ review_scores_cleanliness <int> 10, NA, NA, 10, NA, NA, 6…
## $ review_scores_checkin <int> 10, NA, NA, 10, NA, NA, 4…
## $ review_scores_communication <int> 10, NA, NA, 10, NA, NA, 8…
## $ review_scores_location <int> 9, NA, NA, 9, NA, NA, 8, …
## $ review_scores_value <int> 9, NA, NA, 9, NA, NA, 10,…
## $ license <lgl> NA, NA, NA, NA, NA, NA, N…
## $ instant_bookable <chr> "f", "f", "f", "t", "f", …
## $ calculated_host_listings_count <int> 2, 1, 2, 1, 1, 1, 1, 1, 1…
## $ calculated_host_listings_count_entire_homes <int> 2, 0, 1, 1, 0, 0, 0, 0, 0…
## $ calculated_host_listings_count_private_rooms <int> 0, 1, 1, 0, 1, 1, 1, 1, 1…
## $ calculated_host_listings_count_shared_rooms <int> 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ reviews_per_month <dbl> 0.60, NA, NA, 0.40, NA, N…
The listing dataset contains information about each room or listing posted on the Airbnb website by a host. A single host is identified by host_id and may have multiple rooms in the listing data. Let’s check this hypothesis.
# Count the number of listing for each host and get the top 10
df_listing %>%
  count(host_id) %>%
  arrange(-n) %>%
  head(10)
As we can see from the output above, several hosts have hundreds of listings. For efficient storage in the database, we will separate the host-related information, such as host_id, host_url, etc., from the information about the listing.
df_host <- df_listing %>%
  select(host_id:host_identity_verified,
         calculated_host_listings_count:calculated_host_listings_count_shared_rooms)
glimpse(df_host)
## Rows: 19,289
## Columns: 22
## $ host_id <int> 120437, 120462, 120541, 1…
## $ host_url <chr> "https://www.airbnb.com/u…
## $ host_name <chr> "Nuttee", "Donald", "Emy"…
## $ host_since <chr> "2010-05-08", "2010-05-08…
## $ host_location <chr> "Bangkok", "Thailand", "B…
## $ host_about <chr> "Hi All, I am nuttee patr…
## $ host_response_time <chr> "within a few hours", NA,…
## $ host_response_rate <chr> "100%", NA, NA, "100%", N…
## $ host_acceptance_rate <chr> NA, NA, NA, NA, NA, NA, N…
## $ host_is_superhost <chr> "f", "f", "f", "f", "f", …
## $ host_thumbnail_url <chr> "https://a0.muscache.com/…
## $ host_picture_url <chr> "https://a0.muscache.com/…
## $ host_neighbourhood <chr> "Victory Monument", NA, N…
## $ host_listings_count <int> 2, 1, 2, 1, 1, 1, 1, 1, 1…
## $ host_total_listings_count <int> 2, 1, 2, 1, 1, 1, 1, 1, 1…
## $ host_verifications <chr> "['email', 'phone', 'revi…
## $ host_has_profile_pic <chr> "t", "t", "t", "t", "t", …
## $ host_identity_verified <chr> "t", "f", "f", "t", "f", …
## $ calculated_host_listings_count <int> 2, 1, 2, 1, 1, 1, 1, 1, 1…
## $ calculated_host_listings_count_entire_homes <int> 2, 0, 1, 1, 0, 0, 0, 0, 0…
## $ calculated_host_listings_count_private_rooms <int> 0, 1, 1, 0, 1, 1, 1, 1, 1…
## $ calculated_host_listings_count_shared_rooms <int> 0, 0, 0, 0, 0, 0, 0, 0, 0…
Now we have 22 columns for the host, and the remaining columns relate to the individual listing. There should be no duplicate hosts, so we will remove the duplicated rows.
df_host <- df_host %>%
  distinct()
print( paste("Number of rows:", nrow(df_host)) )
## [1] "Number of rows: 8296"
Next, we prepare the dataset so the columns will have proper data types in the database. You may have noticed that some columns should have a boolean or logical data type, such as host_is_superhost, host_has_profile_pic, and host_identity_verified. They contain the string t if the value is true and f if the value is false. We will transform them to the proper data type.
df_host <- df_host %>%
  mutate_at(vars(host_is_superhost, host_has_profile_pic, host_identity_verified),
            function(x) case_when(x == "t" ~ TRUE,
                                  x == "f" ~ FALSE,
                                  T ~ NA)
            )
The next thing I want to clean up is the host_verifications column, which is stored in a list-like format, as in the following sample.
head(df_host$host_verifications)
## [1] "['email', 'phone', 'reviews', 'jumio', 'offline_government_id', 'selfie', 'government_id', 'identity_manual']"
## [2] "['email', 'phone']"
## [3] "['email', 'phone']"
## [4] "['email', 'phone', 'facebook', 'reviews', 'jumio', 'selfie', 'government_id', 'identity_manual']"
## [5] "['email', 'phone']"
## [6] "['email', 'phone', 'facebook', 'reviews', 'jumio', 'offline_government_id', 'government_id']"
We will clean them so they contain a simple string. For example, the first row will become email, phone, reviews, ….
df_host <- df_host %>%
  # remove ', [, and ]
  mutate(host_verifications = str_remove_all(host_verifications, "[\\'\\[\\]]"))
head(df_host$host_verifications)
## [1] "email, phone, reviews, jumio, offline_government_id, selfie, government_id, identity_manual"
## [2] "email, phone"
## [3] "email, phone"
## [4] "email, phone, facebook, reviews, jumio, selfie, government_id, identity_manual"
## [5] "email, phone"
## [6] "email, phone, facebook, reviews, jumio, offline_government_id, government_id"
Now we will start creating the table in the database. Since there are no numeric or decimal values in the host data, we will store all numeric values as integers. Next, we need to choose a proper data type for the string columns, so we will check the maximum length of each string column in the host dataset.
df_host %>%
  select_if(is.character) %>%
  lapply(FUN = function(x) max(nchar(x, keepNA = F))) %>%
  as.data.frame() %>%
  pivot_longer(cols = names(.),
               names_to = "column",
               values_to = "maximum_length")
Now we can create a table named host_info with the data types shown in the query below. Make sure the number of characters (n) in VARCHAR is larger than the maximum length of your data. For example, the maximum length of host_location is 125, so you can use VARCHAR(200) or VARCHAR(500) just to be safe. Don’t forget to assign host_id as the primary key for the table.
For more detailed information about the different data types allowed by MySQL, you can check the following manuals.
query <- "CREATE TABLE host_info(
host_id INT,
host_url VARCHAR(50),
host_name VARCHAR(100),
host_since DATE,
host_location VARCHAR(500),
host_about VARCHAR(10000),
host_response_time VARCHAR(50),
host_response_rate VARCHAR(50),
host_acceptance_rate VARCHAR(50),
host_is_superhost BOOLEAN,
host_thumbnail_url VARCHAR(500),
host_picture_url VARCHAR(500),
host_neighbourhood VARCHAR(50),
host_listings_count INT,
host_total_listings_count INT,
host_verifications VARCHAR(500),
host_has_profile_pic BOOLEAN,
host_identity_verified BOOLEAN,
calculated_host_listings_count INT,
calculated_host_listings_count_entire_homes INT,
calculated_host_listings_count_private_rooms INT,
calculated_host_listings_count_shared_rooms INT,
PRIMARY KEY(host_id)
)"
query
## [1] "CREATE TABLE host_info(\nhost_id INT,\nhost_url VARCHAR(50),\nhost_name VARCHAR(100),\nhost_since DATE,\nhost_location VARCHAR(500),\nhost_about VARCHAR(10000),\nhost_response_time VARCHAR(50),\nhost_response_rate VARCHAR(50),\nhost_acceptance_rate VARCHAR(50),\nhost_is_superhost BOOLEAN,\nhost_thumbnail_url VARCHAR(500),\nhost_picture_url VARCHAR(500),\nhost_neighbourhood VARCHAR(50),\nhost_listings_count INT,\nhost_total_listings_count INT,\nhost_verifications VARCHAR(500),\nhost_has_profile_pic BOOLEAN,\nhost_identity_verified BOOLEAN,\ncalculated_host_listings_count INT,\ncalculated_host_listings_count_entire_homes INT,\ncalculated_host_listings_count_private_rooms INT,\ncalculated_host_listings_count_shared_rooms INT,\nPRIMARY KEY(host_id)\n)"
After you have created the SQL query, you can execute it.
# Execute query
dbSendQuery(mydb, query)
You can also run a query and fetch the result. For example, you can check the structure of the host_info table.
# send query
res <- dbSendQuery(mydb, "DESCRIBE host_info")

# get the result
out_db <- fetch(res)
# free resource
dbClearResult(res)
## [1] TRUE
# print result
out_db
If you want to check the content of the host_info table, you can use the following query, although for now the table is still empty because we haven’t inserted anything into it yet.
res <- dbSendQuery(mydb, "SELECT * FROM host_info")

out_db <- fetch(res)
dbClearResult(res)
out_db
Let’s start inserting data into the MySQL server. The generic formula to insert data into a table is as follows:
INSERT INTO table_name (column_name) VALUES (value_for_each_column)
For example, if you have a table named customer that has reviewer_id and reviewer_name columns, you can insert a row with the following query:
INSERT INTO customer (reviewer_id, reviewer_name) VALUES (123, 'John Doe')
So we need to provide the column names and the corresponding value for each column. Let’s start by combining the column names into a single string.
column_name <- paste(names(df_host), collapse = ", ")

column_name
## [1] "host_id, host_url, host_name, host_since, host_location, host_about, host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count, host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms"
Next, we prepare the values as a single string for each row. For example, consider the input for the first row of the host data.
Just like when you read a CSV file, the values for the columns are separated by commas (,) by default. For example, the input for the first column, host_id, is 120437, the input for the second column is the host_url string, and so on. However, you may notice that in the middle of the string there are extra commas coming from host_verifications (email, phone, …), which should all go into a single column. We need to clean the string so that SQL reads the data correctly.
To get clean values, we process the string columns with the following rules: replace any double quotes inside a value with single quotes, then wrap every character value in double quotes so that internal commas are not read as column separators; missing values stay as NA for now and will be converted to NULL later.
The following builds the proper value string for a single row of data across all columns.
# Convert the values of a single row into one string
value_data <- df_host %>%
  slice(1) %>% # take the first row as a sample
  mutate_if(is.character, function(x) ifelse(is.na(x), NA, x %>%
                                               str_replace_all('"', "'") %>% # Replace " with ' inside the string before joining
                                               paste0('"', . , '"')
                                             )
            ) %>%
  unlist() %>%
  paste(collapse = ", ")

value_data
## [1] "120437, \"https://www.airbnb.com/users/show/120437\", \"Nuttee\", \"2010-05-08\", \"Bangkok\", \"Hi All, I am nuttee patranavik from Bangkok, Thailand.\nalways travel but easy to connect via airbnb..\", \"within a few hours\", \"100%\", NA, FALSE, \"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_small\", \"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_x_medium\", \"Victory Monument\", 2, 2, \"email, phone, reviews, jumio, offline_government_id, selfie, government_id, identity_manual\", TRUE, TRUE, 2, 2, 0, 0"
Now you can see that even if a column value contains commas, such as host_verifications, it will not cause a problem: because the string is wrapped in quotation marks, SQL will not read those commas as separators for the next column.
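As a side note, building quoted values by hand like this is easy to get wrong. DBI, the interface that RMySQL implements, ships a quoting helper that escapes quotes inside a value and maps NA to NULL for you. The snippet below is only a sketch of that helper and is not used in the rest of this post.

# Sketch only: dbQuoteString() escapes embedded quotes and turns NA into NULL
library(DBI)
dbQuoteString(ANSI(), c("Bangkok, Thailand", "it's fine", NA))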
Finally, we need to transform the missing value NA into an explicit NULL, since SQL only considers NULL a legitimate missing value.
# Replace missing value with explicit NULL
value_data <- str_replace_all(value_data, "\\bNA\\b", "NULL") %>% # Replace NA with NULL
  str_replace_all('\\"NULL\\"', "NULL") # Remove "" from NULL

value_data
## [1] "120437, \"https://www.airbnb.com/users/show/120437\", \"Nuttee\", \"2010-05-08\", \"Bangkok\", \"Hi All, I am nuttee patranavik from Bangkok, Thailand.\nalways travel but easy to connect via airbnb..\", \"within a few hours\", \"100%\", NULL, FALSE, \"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_small\", \"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_x_medium\", \"Victory Monument\", 2, 2, \"email, phone, reviews, jumio, offline_government_id, selfie, government_id, identity_manual\", TRUE, TRUE, 2, 2, 0, 0"
The full query for the first row is as follows:
paste0("INSERT INTO host_info (", column_name, ") VALUES (", value_data , ")")
## [1] "INSERT INTO host_info (host_id, host_url, host_name, host_since, host_location, host_about, host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count, host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms) VALUES (120437, \"https://www.airbnb.com/users/show/120437\", \"Nuttee\", \"2010-05-08\", \"Bangkok\", \"Hi All, I am nuttee patranavik from Bangkok, Thailand.\nalways travel but easy to connect via airbnb..\", \"within a few hours\", \"100%\", NULL, FALSE, \"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_small\", \"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_x_medium\", \"Victory Monument\", 2, 2, \"email, phone, reviews, jumio, offline_government_id, selfie, government_id, identity_manual\", TRUE, TRUE, 2, 2, 0, 0)"
Let’s prepare the data.
df_host_new <- df_host %>%
  mutate_if(is.character, function(x) ifelse(is.na(x), NA, x %>%
                                               str_replace_all('"', "'") %>% # Replace " with '
                                               paste0('"', . , '"') # add " before and after string
                                             )
            )
You can insert multiple rows at once with the following SQL format:
INSERT INTO customer (reviewer_id, reviewer_name) VALUES (123, 'John Doe'), (124, 'Jean Doe')
Let’s prepare the value for each observation with the following function.
value_data <- apply(df_host_new, MARGIN = 1,
                    function(x) x %>%
                      paste(collapse = ",") %>% # Join all columns into a single string
                      str_replace_all("\\bNA\\b", "NULL") %>% # Replace NA with NULL
                      str_trim() # Remove unnecessary whitespace
                    )

# Check the first 3 rows
value_data %>%
  head(3)
## [1] "120437,\"https://www.airbnb.com/users/show/120437\",\"Nuttee\",\"2010-05-08\",\"Bangkok\",\"Hi All, I am nuttee patranavik from Bangkok, Thailand.\nalways travel but easy to connect via airbnb..\",\"within a few hours\",\"100%\",NULL,FALSE,\"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_small\",\"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_x_medium\",\"Victory Monument\", 2, 2,\"email, phone, reviews, jumio, offline_government_id, selfie, government_id, identity_manual\",TRUE,TRUE, 2, 2, 0, 0"
## [2] "120462,\"https://www.airbnb.com/users/show/120462\",\"Donald\",\"2010-05-08\",\"Thailand\",NULL,NULL,NULL,NULL,FALSE,\"https://a0.muscache.com/im/users/120462/profile_pic/1273318678/original.jpg?aki_policy=profile_small\",\"https://a0.muscache.com/im/users/120462/profile_pic/1273318678/original.jpg?aki_policy=profile_x_medium\",NULL, 1, 1,\"email, phone\",TRUE,FALSE, 1, 0, 1, 0"
## [3] "120541,\"https://www.airbnb.com/users/show/120541\",\"Emy\",\"2010-05-08\",\"Bangkok, Thailand\",NULL,NULL,NULL,NULL,FALSE,\"https://a0.muscache.com/im/users/120541/profile_pic/1436431885/original.jpg?aki_policy=profile_small\",\"https://a0.muscache.com/im/users/120541/profile_pic/1436431885/original.jpg?aki_policy=profile_x_medium\",NULL, 2, 2,\"email, phone\",TRUE,FALSE, 2, 1, 1, 0"
Finally, we join all rows into a single, giant string. For example, the following is the query to insert the first 3 rows.
paste(value_data[1:3]) %>%
paste0("(", ., ")") %>% # Add bracket before and after string
paste(collapse = ", ") %>% # Join all values into single string
paste0("INSERT INTO host_info (", column_name, ") VALUES ", .)
## [1] "INSERT INTO host_info (host_id, host_url, host_name, host_since, host_location, host_about, host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count, host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms) VALUES (120437,\"https://www.airbnb.com/users/show/120437\",\"Nuttee\",\"2010-05-08\",\"Bangkok\",\"Hi All, I am nuttee patranavik from Bangkok, Thailand.\nalways travel but easy to connect via airbnb..\",\"within a few hours\",\"100%\",NULL,FALSE,\"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_small\",\"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_x_medium\",\"Victory Monument\", 2, 2,\"email, phone, reviews, jumio, offline_government_id, selfie, government_id, identity_manual\",TRUE,TRUE, 2, 2, 0, 0), (120462,\"https://www.airbnb.com/users/show/120462\",\"Donald\",\"2010-05-08\",\"Thailand\",NULL,NULL,NULL,NULL,FALSE,\"https://a0.muscache.com/im/users/120462/profile_pic/1273318678/original.jpg?aki_policy=profile_small\",\"https://a0.muscache.com/im/users/120462/profile_pic/1273318678/original.jpg?aki_policy=profile_x_medium\",NULL, 1, 1,\"email, phone\",TRUE,FALSE, 1, 0, 1, 0), (120541,\"https://www.airbnb.com/users/show/120541\",\"Emy\",\"2010-05-08\",\"Bangkok, Thailand\",NULL,NULL,NULL,NULL,FALSE,\"https://a0.muscache.com/im/users/120541/profile_pic/1436431885/original.jpg?aki_policy=profile_small\",\"https://a0.muscache.com/im/users/120541/profile_pic/1436431885/original.jpg?aki_policy=profile_x_medium\",NULL, 2, 2,\"email, phone\",TRUE,FALSE, 2, 1, 1, 0)"
We will create a function to simplify the process so that we can use the previous steps multiple times in a single line.
insert_to_sql <- function(conn, table, data){

  column_name <- paste(names(data), collapse = ", ")

  data_new <- data %>%
    mutate_if(is.character, function(x) ifelse(is.na(x), NA, x %>%
                                                 str_replace_all('"', "'") %>% # Replace " with '
                                                 paste0('"', . , '"') # Add " before and after string
                                               )
              )

  value_data <- apply(data_new, MARGIN = 1,
                      function(x) x %>%
                        paste(collapse = ",") %>% # Join all columns into a single string
                        str_replace_all("\\bNA\\b", "NULL") %>% # Create NULL from NA
                        str_trim() # remove unnecessary whitespace
                      )

  query_value <- paste(value_data) %>%
    paste0("(", ., ")") %>% # Add bracket before and after string
    paste(collapse = ", ") %>% # Join all values into a single string
    str_remove_all("\\\\") %>% # Remove \\ from string
    paste0("INSERT INTO ", table, "(", column_name, ") VALUES ", .)

  dbSendQuery(conn, query_value)

  cat("Data uploaded to MySQL\n")
}
Let’s insert the data into the MySQL server.
# Insert all data to SQL
insert_to_sql(mydb, "host_info", df_host)
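As an alternative worth knowing about (not used in this post), DBI also provides dbWriteTable(), which can append a data frame to an existing table without building the INSERT statement by hand. Whether it runs smoothly depends on the server configuration, so treat the following as a sketch.

# Sketch only: append df_host to the existing host_info table in a single call
dbWriteTable(mydb, name = "host_info", value = df_host,
             append = TRUE, row.names = FALSE)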
Now we will try running some queries against the host_info table, starting by connecting to the MySQL server.
Let’s get the first 10 observations from the host_info table. The LIMIT clause is similar to head() in R (or pandas in Python) and will give us the first n rows.
# send query
res <- dbSendQuery(mydb, "SELECT * FROM host_info LIMIT 10")

# get the result
out_db <- fetch(res)
# free resource
dbClearResult(res)
## [1] TRUE
out_db
You can also filter the data conditionally. For example, let’s get the first 5 hosts that are considered superhosts. To create a condition, you can use the WHERE clause.
# send query
res <- dbSendQuery(mydb, "SELECT * FROM host_info WHERE host_is_superhost = TRUE LIMIT 5")

# get the result
out_db <- fetch(res)
# free resource
dbClearResult(res)
## [1] TRUE
out_db
You can get the number of rows in the data using COUNT(*).
# send query
res <- dbSendQuery(mydb, "SELECT COUNT(*) as frequency FROM host_info")

# get the result
out_db <- fetch(res)
# free resource
dbClearResult(res)
## [1] TRUE
out_db
We will do a full analysis of the data and explore different kinds of SQL commands in another post. For now, we will finish inserting all the data into the database. Let’s continue processing the remaining columns of the listing dataset.
# Remove columns related to the host except host_id
df_new_listing <- df_listing %>%
  select(-names(df_host)[-1])
glimpse(df_new_listing)
## Rows: 19,289
## Columns: 53
## $ id <int> 27934, 27942, 27979, 28354, 28745, 35656,…
## $ listing_url <chr> "https://www.airbnb.com/rooms/27934", "ht…
## $ scrape_id <dbl> 2.021032e+13, 2.021032e+13, 2.021032e+13,…
## $ last_scraped <chr> "2021-03-22", "2021-03-22", "2021-03-31",…
## $ name <chr> "Nice room with superb city view", "Town …
## $ description <chr> "Our cool and comfortable one bedroom apa…
## $ neighborhood_overview <chr> "It is very center of Bangkok and easy ac…
## $ picture_url <chr> "https://a0.muscache.com/pictures/566374/…
## $ host_id <int> 120437, 120462, 120541, 121848, 123784, 1…
## $ neighbourhood <chr> "Samsen Nai, Bangkok, Thailand", NA, NA, …
## $ neighbourhood_cleansed <chr> "Ratchathewi", "Bangkok Yai", "Bang Na", …
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ latitude <dbl> 13.76000, 13.73463, 13.66818, 13.81152, 1…
## $ longitude <dbl> 100.5427, 100.4754, 100.6167, 100.5662, 1…
## $ property_type <chr> "Entire condominium", "Private room in ho…
## $ room_type <chr> "Entire home/apt", "Private room", "Priva…
## $ accommodates <int> 3, 1, 2, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2,…
## $ bathrooms <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ bathrooms_text <chr> "1.5 baths", "1 bath", "1 bath", "1 bath"…
## $ bedrooms <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, 1, 1, 1…
## $ beds <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1…
## $ amenities <chr> "[\"Microwave\", \"Pool\", \"Smoke alarm\…
## $ price <chr> "$1,694.00", "$924.00", "$1,188.00", "$1,…
## $ minimum_nights <int> 3, 1, 1, 28, 60, 3, 5, 1, 30, 1, 180, 3, …
## $ maximum_nights <int> 90, 730, 730, 365, 730, 730, 366, 730, 73…
## $ minimum_minimum_nights <int> 3, 1, 1, 28, 60, 3, 5, 1, 30, 1, 180, 3, …
## $ maximum_minimum_nights <int> 3, 1, 1, 28, 60, 3, 5, 1, 30, 1, 180, 3, …
## $ minimum_maximum_nights <int> 90, 730, 730, 1125, 730, 730, 366, 730, 7…
## $ maximum_maximum_nights <int> 90, 730, 730, 1125, 730, 730, 366, 730, 7…
## $ minimum_nights_avg_ntm <dbl> 3, 1, 1, 28, 60, 3, 5, 1, 30, 1, 180, 3, …
## $ maximum_nights_avg_ntm <dbl> 90, 730, 730, 1125, 730, 730, 366, 730, 7…
## $ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ has_availability <chr> "t", "f", "t", "t", "t", "f", "t", "t", "…
## $ availability_30 <int> 27, 0, 1, 29, 30, 0, 30, 30, 0, 0, 28, 30…
## $ availability_60 <int> 57, 0, 1, 59, 60, 0, 60, 60, 0, 0, 58, 60…
## $ availability_90 <int> 87, 0, 1, 89, 90, 0, 90, 90, 0, 0, 88, 90…
## $ availability_365 <int> 362, 0, 1, 364, 365, 0, 365, 365, 0, 0, 3…
## $ calendar_last_scraped <chr> "2021-03-22", "2021-03-22", "2021-03-31",…
## $ number_of_reviews <int> 65, 0, 0, 52, 0, 0, 1, 0, 0, 48, 0, 1, 0,…
## $ number_of_reviews_ltm <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ number_of_reviews_l30d <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ first_review <chr> "2012-04-07", NA, NA, "2010-09-08", NA, N…
## $ last_review <chr> "2020-01-06", NA, NA, "2019-12-09", NA, N…
## $ review_scores_rating <int> 97, NA, NA, 96, NA, NA, 80, NA, NA, 98, N…
## $ review_scores_accuracy <int> 10, NA, NA, 10, NA, NA, 10, NA, NA, 10, N…
## $ review_scores_cleanliness <int> 10, NA, NA, 10, NA, NA, 6, NA, NA, 10, NA…
## $ review_scores_checkin <int> 10, NA, NA, 10, NA, NA, 4, NA, NA, 10, NA…
## $ review_scores_communication <int> 10, NA, NA, 10, NA, NA, 8, NA, NA, 10, NA…
## $ review_scores_location <int> 9, NA, NA, 9, NA, NA, 8, NA, NA, 9, NA, 1…
## $ review_scores_value <int> 9, NA, NA, 9, NA, NA, 10, NA, NA, 10, NA,…
## $ license <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ instant_bookable <chr> "f", "f", "f", "t", "f", "f", "f", "f", "…
## $ reviews_per_month <dbl> 0.60, NA, NA, 0.40, NA, NA, 0.02, NA, NA,…
Some columns, such as calendar_updated, license, and bathrooms, consist only of missing values. The scrape_id column also has no apparent meaning for now, since we already have the last_scraped column to show the latest date of scraping. These columns are not informative, so we will drop them.
df_new_listing <- df_new_listing %>%
  select(-c(license, calendar_updated, bathrooms, scrape_id))
Next, we will convert the columns that should have a boolean/logical value, namely has_availability and instant_bookable.
df_new_listing <- df_new_listing %>%
  mutate_at(vars(has_availability, instant_bookable),
            function(x) case_when(x == "t" ~ TRUE,
                                  x == "f" ~ FALSE,
                                  T ~ NA)
            )
We will continue by transforming the price column into decimal/numeric values by removing the dollar sign and commas from the string.
df_new_listing <- df_new_listing %>%
  mutate(price = str_remove_all(price, "[$,]") %>%
           as.numeric()
         )
head(df_new_listing$price)
## [1] 1694 924 1188 1095 800 4619
Now we will transform the amenities column with the same treatment as the host_verifications column from the host dataset.
df_new_listing <- df_new_listing %>%
  mutate(amenities = str_remove_all(amenities, "[\"\\'\\[\\]]"))
head(df_new_listing$amenities)
## [1] "Microwave, Pool, Smoke alarm, Dryer, Long term stays allowed, Hair dryer, Stove, Iron, Free parking on premises, Essentials, Dishes and silverware, Washer, Hot water, Patio or balcony, Hangers, Air conditioning, Luggage dropoff allowed, Kitchen, Gym, Fire extinguisher, Elevator, TV, Dedicated workspace, Host greets you, Shampoo, Wifi, Refrigerator"
## [2] "Cable TV, Kitchen, Air conditioning, Free parking on premises, Breakfast, TV, Long term stays allowed"
## [3] "Cable TV, TV, Pool, Shampoo, Hot tub, Heating, First aid kit, Gym, Wifi, Long term stays allowed, Smoke alarm, Fire extinguisher, Kitchen, Elevator, Essentials, Washer, Free parking on premises"
## [4] "Dedicated workspace, Air conditioning, Hair dryer, Microwave, Essentials, Free parking on premises, Cable TV, Carbon monoxide alarm, Dishes and silverware, Long term stays allowed, Iron, TV, Elevator, Wifi, Shampoo, Pool, Kitchen, Refrigerator, Hangers, Smoke alarm, Gym, Washer, First aid kit, Fire extinguisher"
## [5] "Long term stays allowed"
## [6] "Cable TV, Kitchen, Air conditioning, Free parking on premises, TV, Long term stays allowed"
Now we will start creating the listing table on the MySQL server. First, let’s check the maximum length of each string column in the data.
df_new_listing %>%
  select_if(is.character) %>%
  lapply(FUN = function(x) max(nchar(x, keepNA = F))) %>%
  as.data.frame() %>%
  pivot_longer(cols = names(.))
Let’s start writing the query for creating the table. For numeric values such as price we will use the DECIMAL data type, while the rest follow the same pattern as the host table. The primary key for the listing table is id, which identifies the listing. The foreign key is host_id, which we can use to join the listing table with the host_info table for analysis.
query <- "CREATE TABLE listing (
id INT,
listing_url VARCHAR(100),
last_scraped DATE,
name VARCHAR(500),
description VARCHAR(2000),
neighborhood_overview VARCHAR(2000),
picture_url VARCHAR(500),
host_id INT,
neighbourhood VARCHAR(100),
neighbourhood_cleansed VARCHAR(100),
neighbourhood_group_cleansed VARCHAR(100),
latitude DECIMAL(25,18),
longitude DECIMAL(25, 18),
property_type VARCHAR(100),
room_type VARCHAR(100),
accommodates INT,
bathrooms_text VARCHAR(100),
bedrooms INT,
beds INT,
amenities VARCHAR(2000),
price DECIMAL(15, 5),
minimum_nights INT,
maximum_nights INT,
minimum_minimum_nights INT,
maximum_minimum_nights INT,
minimum_maximum_nights INT,
maximum_maximum_nights INT,
minimum_nights_avg_ntm DECIMAL(16, 5),
maximum_nights_avg_ntm DECIMAL(16, 5),
has_availability BOOLEAN,
availability_30 INT,
availability_60 INT,
availability_90 INT,
availability_365 INT,
calendar_last_scraped DATE,
number_of_reviews INT,
number_of_reviews_ltm INT,
number_of_reviews_l30d INT,
first_review DATE,
last_review DATE,
review_scores_rating DECIMAL(10, 5),
review_scores_accuracy DECIMAL(10, 5),
review_scores_cleanliness DECIMAL(10, 5),
review_scores_checkin DECIMAL(10, 5),
review_scores_communication DECIMAL(10, 5),
review_scores_location DECIMAL(10, 5),
review_scores_value DECIMAL(10, 5),
instant_bookable BOOLEAN,
reviews_per_month DECIMAL(10, 5),
PRIMARY KEY(id),
FOREIGN KEY(host_id) REFERENCES host_info(host_id)
)"
query
## [1] "CREATE TABLE listing (\nid INT,\nlisting_url VARCHAR(100),\nlast_scraped DATE,\nname VARCHAR(500),\ndescription VARCHAR(2000),\nneighborhood_overview VARCHAR(2000),\npicture_url VARCHAR(500),\nhost_id INT,\nneighbourhood VARCHAR(100),\nneighbourhood_cleansed VARCHAR(100),\nneighbourhood_group_cleansed VARCHAR(100),\nlatitude DECIMAL(25,18),\nlongitude DECIMAL(25, 18),\nproperty_type VARCHAR(100),\nroom_type VARCHAR(100),\naccommodates INT,\nbathrooms_text VARCHAR(100),\nbedrooms INT,\nbeds INT,\namenities VARCHAR(2000),\nprice DECIMAL(15, 5),\nminimum_nights INT,\nmaximum_nights INT,\nminimum_minimum_nights INT,\nmaximum_minimum_nights INT,\nminimum_maximum_nights INT,\nmaximum_maximum_nights INT,\nminimum_nights_avg_ntm DECIMAL(16, 5),\nmaximum_nights_avg_ntm DECIMAL(16, 5),\nhas_availability BOOLEAN,\navailability_30 INT,\navailability_60 INT,\navailability_90 INT,\navailability_365 INT,\ncalendar_last_scraped DATE,\nnumber_of_reviews INT,\nnumber_of_reviews_ltm INT,\nnumber_of_reviews_l30d INT,\nfirst_review DATE,\nlast_review DATE,\nreview_scores_rating DECIMAL(10, 5),\nreview_scores_accuracy DECIMAL(10, 5),\nreview_scores_cleanliness DECIMAL(10, 5),\nreview_scores_checkin DECIMAL(10, 5),\nreview_scores_communication DECIMAL(10, 5),\nreview_scores_location DECIMAL(10, 5),\nreview_scores_value DECIMAL(10, 5),\ninstant_bookable BOOLEAN,\nreviews_per_month DECIMAL(10, 5),\nPRIMARY KEY(id),\nFOREIGN KEY(host_id) REFERENCES host_info(host_id)\n)"
Let’s run the query.
dbSendQuery(mydb, query)
You can check the listing table with the following query.
# send query
res <- dbSendQuery(mydb, "DESCRIBE listing")

# get the result
out_db <- fetch(res)
# free resource
dbClearResult(res)
## [1] TRUE
out_db
Now we will insert the rows into the table using the same function that we defined earlier. Unfortunately, MySQL has a maximum allowed packet size for a single query. You can check it with the following query.
# send query
res <- dbSendQuery(mydb, "SHOW VARIABLES LIKE 'max_allowed_packet'")

# get the result
out_db <- fetch(res)
# free resource
dbClearResult(res)
## [1] TRUE
out_db
Since our data is larger than the allowed packet, we need to split it into smaller chunks. Here, we will generate 10 evenly spaced breakpoints and use them to upload the data chunk by chunk.
Read more about this problem here.
data_split <- seq(1, nrow(df_new_listing), length.out = 10) %>%
  round()
data_split
## [1] 1 2144 4287 6430 8573 10717 12860 15003 17146 19289
We will insert the data one chunk at a time with a loop.
for (i in 1:length(data_split)) {
  if (data_split[i] < nrow(df_new_listing)) {

    # If condition to properly select the data
    if (i == 1) {
      cat(paste("Uploading rows", data_split[i], "-", data_split[i+1], "\n"))
      selected_data <- df_new_listing %>%
        slice(data_split[i]:data_split[i+1])
    } else {
      cat(paste("Uploading rows", data_split[i]+1, "-", data_split[i+1], "\n"))
      selected_data <- df_new_listing %>%
        slice((data_split[i]+1):data_split[i+1])
    }

    insert_to_sql(mydb, "listing", selected_data)
  }
}
Let’s run some simple queries to check the data that we have inserted. First, let’s check how many rows there are in the listing table.
res <- dbSendQuery(mydb, "SELECT COUNT(*) as frequency FROM listing")

out_db <- fetch(res)
dbClearResult(res)
## [1] TRUE
out_db
I want to see the top 10 property types by frequency in the listing table. We can use COUNT to get the number of rows and GROUP BY to count the rows for each property type. To sort the values, we can use ORDER BY, followed by DESC to indicate descending order, and finally we take only the first 10 rows with LIMIT.
res <- dbSendQuery(mydb,
                   "SELECT COUNT(*) as frequency, property_type FROM listing GROUP BY property_type ORDER BY COUNT(*) DESC LIMIT 10"
                   )

out_db <- fetch(res)
dbClearResult(res)
## [1] TRUE
out_db
How about the top 10 most expensive listings? We can check by sorting the data on the price column.
res <- dbSendQuery(mydb,
                   "SELECT price, property_type, neighbourhood FROM listing ORDER BY price DESC LIMIT 10"
                   )

out_db <- fetch(res)
dbClearResult(res)
## [1] TRUE
out_db
When you are done querying your MySQL server, don’t forget to close the connection to the database.
# Closing connection to database
dbDisconnect(mydb)
## [1] TRUE
I hope this notebook has fulfilled the goals stated at the beginning of the post.
In my next post, I will write about how to do data analysis using SQL queries.