In a world of increasing data availability, it has become the norm to store and retrieve data from a cloud database instead of a local file, whether you are part of a start-up or a large corporation. Several services, such as Google Cloud Platform and Microsoft Azure, provide cloud storage for large datasets. Most current database systems use SQL to store and retrieve this data. Therefore, knowing how to write a command or query in SQL is currently one of the top skills required in data-related jobs, especially for data analysts and data scientists. The following graph shows the result of the 2020 Data Science and Machine Learning Survey.
To learn how to run SQL queries, we will use one of the most common database management systems: MySQL. In this post, I will guide you through the following things with MySQL:
If you are not familiar with SQL, you can visit this website for a quick introduction and just read the welcome page, or try some of the practice exercises and come back here later.
There are many options for creating a database: you can use a common cloud service such as Google Cloud Platform or Microsoft Azure, or set up a local MySQL server on your device. However, setting these up can be a fairly long process. Since our goal is to focus on preparing data and storing it with SQL, we will use a free hosting service, db4free. You can also use SQLite to store a small database.
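If you prefer a fully local alternative, a minimal sketch with SQLite through the same DBI interface could look like the code below. This assumes the RSQLite package is installed, and the file name airbnb.sqlite is only illustrative.

# Minimal local alternative (assumes the RSQLite package is installed)
library(DBI)
library(RSQLite)

con_lite <- dbConnect(RSQLite::SQLite(), "airbnb.sqlite") # creates the file if it does not exist
dbListTables(con_lite) # empty for a fresh database
dbDisconnect(con_lite)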
db4free sets up a free, small MySQL server for us to practice on. You just need to register with your email and enter the following information:
The following are the libraries that we will use throughout this post.
# Data Wrangling
library(tidyverse)
# SQL Driver
library(RMySQL)
After you have set up the MySQL server, you can connect to it with RMySQL. You need the following information from the server:
mydb <- dbConnect(MySQL(),
                  host = "db4free.net", # write "localhost" if you use a local server
                  port = 3306,
                  user = "your_name",
                  password = "your_pass",
                  dbname = "your_dbname"
                  )
Let’s check the connection and start with the basics. We will create a simple table of customer information with 3 columns:

customer_id: ID of the customer, with integer data type (INT)
name: name of the customer, with character data type (VARCHAR(100))
city: city location of the customer, with character data type (VARCHAR(100))

Trivia
VARCHAR(100) means that the column stores character data with a maximum length of 100 characters.
The following is the general SQL command for creating a table:
CREATE TABLE table_name(column_1 data_type_1, column_2 data_type_2, ...)
query <- "CREATE TABLE dummy_customer(
customer_id INT,
name VARCHAR(100),
city VARCHAR(100)
)"
query
## [1] "CREATE TABLE dummy_customer(\ncustomer_id INT,\nname VARCHAR(100),\ncity VARCHAR(100)\n)"
You can send the query using dbSendQuery().
dbSendQuery(mydb,  # your connection
            query) # SQL query
## <MySQLResult:0,0,0>
You can check all the available tables in the database by using the following query. It will return all tables in your database in a data frame format.
SHOW TABLES
dbSendQuery(mydb, "SHOW TABLES") %>%
fetch()
If you want to check the name and data type of each column in a table, you can use the following command.
DESCRIBE table_name
dbSendQuery(mydb, "DESCRIBE dummy_customer") %>%
fetch()
If you want to delete the table, you can run the following query.
DROP TABLE table_name
dbSendQuery(mydb, "DROP TABLE dummy_customer")
## <MySQLResult:589867488,0,3>
We will use data on Thailand room listings from Airbnb. The dataset contains information about rooms or listings posted on the Airbnb site in Bangkok, Central Thailand, Thailand.
We will use the Listing dataset, which contains detailed information about each listing or room posted on Airbnb. One host can have multiple listings. For detailed information about each column, you can check the description in this spreadsheet.
The dataset can be broken down further into several parts. For example, the listing data has several columns that store information about the host rather than the room/listing. In practice, we can reduce the storage required for the database by separating the listing data from the host data, since a single host can have multiple listings.
To illustrate the relation between the tables, we will draw an Entity Relationship Diagram (ERD). An ERD is often used to help data engineers design a database and shows the relation between tables. Open the image in a new tab if you want to zoom in.
On each table, you can have a column that contains a unique ID to identify each row; this column is called the Primary Key (PK). For example, in the host_info table a unique host should be stored only once, with no duplicates, and each host is identified by the host_id column. A table can also have a Foreign Key (FK). For example, the listing table has id as its primary key and host_id as a foreign key. This means that we can join the listing table with the host_info table by matching host_id in the listing table with host_id in the host_info table.
The relation, or cardinality, between tables is illustrated by the symbols at both ends of the connecting line. There are different cardinalities in an ERD; you can read a detailed explanation here. A host can have zero (no listing) or many listings, so the cardinality on the listing side is drawn as zero or many. Naturally, a room listing can only be owned by a single host; there is no listing owned by different hosts. Therefore, the cardinality on the host_info side is drawn as one (and only one).
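To make the primary key/foreign key relation concrete, the following is a sketch of a query that joins the two tables on host_id. It assumes the listing and host_info tables have already been created and filled, which we only do later in this post, so treat it as an illustration rather than something to run right now.

# Illustrative only: join listing with host_info through the shared host_id
res <- dbSendQuery(mydb, "SELECT l.id, l.name, h.host_name
                          FROM listing AS l
                          JOIN host_info AS h ON l.host_id = h.host_id
                          LIMIT 5")
fetch(res)
dbClearResult(res)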
Since the only table without a foreign key is host_info, we will start by creating and cleaning a host table from the listing data. Let’s start by importing the listing data.
df_listing <- read.csv("data/listings.csv")
cat( paste("Number of rows:", nrow(df_listing), "\n"))
## Number of rows: 19289
cat( paste("Number of columns:", ncol(df_listing)))
## Number of columns: 74
Let’s check some of the content of the dataset.
head(df_listing)
Some missing values are written as an empty string (""), as "None", or as "N/A", so we will convert these values into explicit NA for R.
df_listing <- df_listing %>%
  mutate_all( function(x) ifelse(x == "" | x == "None" | x == "N/A", NA, x) )
Let’s check the information from the dataset.
glimpse(df_listing)
## Rows: 19,289
## Columns: 74
## $ id <int> 27934, 27942, 27979, 2835…
## $ listing_url <chr> "https://www.airbnb.com/r…
## $ scrape_id <dbl> 2.021032e+13, 2.021032e+1…
## $ last_scraped <chr> "2021-03-22", "2021-03-22…
## $ name <chr> "Nice room with superb ci…
## $ description <chr> "Our cool and comfortable…
## $ neighborhood_overview <chr> "It is very center of Ban…
## $ picture_url <chr> "https://a0.muscache.com/…
## $ host_id <int> 120437, 120462, 120541, 1…
## $ host_url <chr> "https://www.airbnb.com/u…
## $ host_name <chr> "Nuttee", "Donald", "Emy"…
## $ host_since <chr> "2010-05-08", "2010-05-08…
## $ host_location <chr> "Bangkok", "Thailand", "B…
## $ host_about <chr> "Hi All, I am nuttee patr…
## $ host_response_time <chr> "within a few hours", NA,…
## $ host_response_rate <chr> "100%", NA, NA, "100%", N…
## $ host_acceptance_rate <chr> NA, NA, NA, NA, NA, NA, N…
## $ host_is_superhost <chr> "f", "f", "f", "f", "f", …
## $ host_thumbnail_url <chr> "https://a0.muscache.com/…
## $ host_picture_url <chr> "https://a0.muscache.com/…
## $ host_neighbourhood <chr> "Victory Monument", NA, N…
## $ host_listings_count <int> 2, 1, 2, 1, 1, 1, 1, 1, 1…
## $ host_total_listings_count <int> 2, 1, 2, 1, 1, 1, 1, 1, 1…
## $ host_verifications <chr> "['email', 'phone', 'revi…
## $ host_has_profile_pic <chr> "t", "t", "t", "t", "t", …
## $ host_identity_verified <chr> "t", "f", "f", "t", "f", …
## $ neighbourhood <chr> "Samsen Nai, Bangkok, Tha…
## $ neighbourhood_cleansed <chr> "Ratchathewi", "Bangkok Y…
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, N…
## $ latitude <dbl> 13.76000, 13.73463, 13.66…
## $ longitude <dbl> 100.5427, 100.4754, 100.6…
## $ property_type <chr> "Entire condominium", "Pr…
## $ room_type <chr> "Entire home/apt", "Priva…
## $ accommodates <int> 3, 1, 2, 2, 2, 2, 2, 2, 3…
## $ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N…
## $ bathrooms_text <chr> "1.5 baths", "1 bath", "1…
## $ bedrooms <int> 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ beds <int> 1, 1, 2, 1, 1, 1, 1, 1, 1…
## $ amenities <chr> "[\"Microwave\", \"Pool\"…
## $ price <chr> "$1,694.00", "$924.00", "…
## $ minimum_nights <int> 3, 1, 1, 28, 60, 3, 5, 1,…
## $ maximum_nights <int> 90, 730, 730, 365, 730, 7…
## $ minimum_minimum_nights <int> 3, 1, 1, 28, 60, 3, 5, 1,…
## $ maximum_minimum_nights <int> 3, 1, 1, 28, 60, 3, 5, 1,…
## $ minimum_maximum_nights <int> 90, 730, 730, 1125, 730, …
## $ maximum_maximum_nights <int> 90, 730, 730, 1125, 730, …
## $ minimum_nights_avg_ntm <dbl> 3, 1, 1, 28, 60, 3, 5, 1,…
## $ maximum_nights_avg_ntm <dbl> 90, 730, 730, 1125, 730, …
## $ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N…
## $ has_availability <chr> "t", "f", "t", "t", "t", …
## $ availability_30 <int> 27, 0, 1, 29, 30, 0, 30, …
## $ availability_60 <int> 57, 0, 1, 59, 60, 0, 60, …
## $ availability_90 <int> 87, 0, 1, 89, 90, 0, 90, …
## $ availability_365 <int> 362, 0, 1, 364, 365, 0, 3…
## $ calendar_last_scraped <chr> "2021-03-22", "2021-03-22…
## $ number_of_reviews <int> 65, 0, 0, 52, 0, 0, 1, 0,…
## $ number_of_reviews_ltm <int> 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ number_of_reviews_l30d <int> 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ first_review <chr> "2012-04-07", NA, NA, "20…
## $ last_review <chr> "2020-01-06", NA, NA, "20…
## $ review_scores_rating <int> 97, NA, NA, 96, NA, NA, 8…
## $ review_scores_accuracy <int> 10, NA, NA, 10, NA, NA, 1…
## $ review_scores_cleanliness <int> 10, NA, NA, 10, NA, NA, 6…
## $ review_scores_checkin <int> 10, NA, NA, 10, NA, NA, 4…
## $ review_scores_communication <int> 10, NA, NA, 10, NA, NA, 8…
## $ review_scores_location <int> 9, NA, NA, 9, NA, NA, 8, …
## $ review_scores_value <int> 9, NA, NA, 9, NA, NA, 10,…
## $ license <lgl> NA, NA, NA, NA, NA, NA, N…
## $ instant_bookable <chr> "f", "f", "f", "t", "f", …
## $ calculated_host_listings_count <int> 2, 1, 2, 1, 1, 1, 1, 1, 1…
## $ calculated_host_listings_count_entire_homes <int> 2, 0, 1, 1, 0, 0, 0, 0, 0…
## $ calculated_host_listings_count_private_rooms <int> 0, 1, 1, 0, 1, 1, 1, 1, 1…
## $ calculated_host_listings_count_shared_rooms <int> 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ reviews_per_month <dbl> 0.60, NA, NA, 0.40, NA, N…
The listing dataset contains information about each room or listing posted on the Airbnb website by a host. A single host is identified by host_id and may have multiple rooms in the listing data. Let’s check this hypothesis.
# Count the number of listing for each host and get the top 10
df_listing %>%
  count(host_id) %>%
  arrange(-n) %>%
  head(10)
As we can see from the output above, several hosts have hundreds of listings. For efficient storage in the database, we will separate the host-related information, such as host_id, host_url, etc., from the information about the listing.
df_host <- df_listing %>%
  select(host_id:host_identity_verified,
         calculated_host_listings_count:calculated_host_listings_count_shared_rooms)
glimpse(df_host)
## Rows: 19,289
## Columns: 22
## $ host_id <int> 120437, 120462, 120541, 1…
## $ host_url <chr> "https://www.airbnb.com/u…
## $ host_name <chr> "Nuttee", "Donald", "Emy"…
## $ host_since <chr> "2010-05-08", "2010-05-08…
## $ host_location <chr> "Bangkok", "Thailand", "B…
## $ host_about <chr> "Hi All, I am nuttee patr…
## $ host_response_time <chr> "within a few hours", NA,…
## $ host_response_rate <chr> "100%", NA, NA, "100%", N…
## $ host_acceptance_rate <chr> NA, NA, NA, NA, NA, NA, N…
## $ host_is_superhost <chr> "f", "f", "f", "f", "f", …
## $ host_thumbnail_url <chr> "https://a0.muscache.com/…
## $ host_picture_url <chr> "https://a0.muscache.com/…
## $ host_neighbourhood <chr> "Victory Monument", NA, N…
## $ host_listings_count <int> 2, 1, 2, 1, 1, 1, 1, 1, 1…
## $ host_total_listings_count <int> 2, 1, 2, 1, 1, 1, 1, 1, 1…
## $ host_verifications <chr> "['email', 'phone', 'revi…
## $ host_has_profile_pic <chr> "t", "t", "t", "t", "t", …
## $ host_identity_verified <chr> "t", "f", "f", "t", "f", …
## $ calculated_host_listings_count <int> 2, 1, 2, 1, 1, 1, 1, 1, 1…
## $ calculated_host_listings_count_entire_homes <int> 2, 0, 1, 1, 0, 0, 0, 0, 0…
## $ calculated_host_listings_count_private_rooms <int> 0, 1, 1, 0, 1, 1, 1, 1, 1…
## $ calculated_host_listings_count_shared_rooms <int> 0, 0, 0, 0, 0, 0, 0, 0, 0…
Now we have 22 columns for the host, and the remaining columns relate to the individual listing. There should be no duplicate hosts, so we will remove the duplicated rows.
df_host <- df_host %>%
  distinct()
print( paste("Number of rows:", nrow(df_host)) )
## [1] "Number of rows: 8296"
Next, we prepare the dataset so the columns will have proper data types in the database. You may have noticed that some columns should have a boolean or logical data type, such as host_is_superhost, host_has_profile_pic, and host_identity_verified. They contain the string t if the value is true and f if the value is false. We will transform them to the proper data type.
df_host <- df_host %>%
  mutate_at(vars(host_is_superhost, host_has_profile_pic, host_identity_verified),
            function(x) case_when(x == "t" ~ TRUE,
                                  x == "f" ~ FALSE,
                                  T ~ NA)
            )
The next thing I want to clean up is the host_verifications column, which is stored in a list-like format, as in the following sample.
head(df_host$host_verifications)
## [1] "['email', 'phone', 'reviews', 'jumio', 'offline_government_id', 'selfie', 'government_id', 'identity_manual']"
## [2] "['email', 'phone']"
## [3] "['email', 'phone']"
## [4] "['email', 'phone', 'facebook', 'reviews', 'jumio', 'selfie', 'government_id', 'identity_manual']"
## [5] "['email', 'phone']"
## [6] "['email', 'phone', 'facebook', 'reviews', 'jumio', 'offline_government_id', 'government_id']"
We will clean them so they contain a simple string. For example, the first row will become email, phone, reviews, ….
df_host <- df_host %>%
  # remove ', [, and ]
  mutate(host_verifications = str_remove_all(host_verifications, "[\\'\\[\\]]"))
head(df_host$host_verifications)
## [1] "email, phone, reviews, jumio, offline_government_id, selfie, government_id, identity_manual"
## [2] "email, phone"
## [3] "email, phone"
## [4] "email, phone, facebook, reviews, jumio, selfie, government_id, identity_manual"
## [5] "email, phone"
## [6] "email, phone, facebook, reviews, jumio, offline_government_id, government_id"
Now we will start creating the table in the database. Since there are no numeric or decimal values in the host data, we will store all numeric values as integers. Next, we need to choose a proper data type for the string columns, so we will check the maximum length of each string column in the host dataset.
df_host %>%
  select_if(is.character) %>%
  lapply(FUN = function(x) max(nchar(x, keepNA = F))) %>%
  as.data.frame() %>%
  pivot_longer(cols = names(.),
               names_to = "column",
               values_to = "maximum_length")
Now we can create a table named host_info with the data types shown in the query below. Make sure the number of characters (n) in VARCHAR is larger than the maximum length of your data. For example, the maximum length of host_location is 125, so you can use VARCHAR(200) or VARCHAR(500) just to be safe. Don’t forget to assign host_id as the primary key for the table.
For more detailed information about the different data types allowed by MySQL, you can check the following manuals.
query <- "CREATE TABLE host_info(
host_id INT,
host_url VARCHAR(50),
host_name VARCHAR(100),
host_since DATE,
host_location VARCHAR(500),
host_about VARCHAR(10000),
host_response_time VARCHAR(50),
host_response_rate VARCHAR(50),
host_acceptance_rate VARCHAR(50),
host_is_superhost BOOLEAN,
host_thumbnail_url VARCHAR(500),
host_picture_url VARCHAR(500),
host_neighbourhood VARCHAR(50),
host_listings_count INT,
host_total_listings_count INT,
host_verifications VARCHAR(500),
host_has_profile_pic BOOLEAN,
host_identity_verified BOOLEAN,
calculated_host_listings_count INT,
calculated_host_listings_count_entire_homes INT,
calculated_host_listings_count_private_rooms INT,
calculated_host_listings_count_shared_rooms INT,
PRIMARY KEY(host_id)
)"
query
## [1] "CREATE TABLE host_info(\nhost_id INT,\nhost_url VARCHAR(50),\nhost_name VARCHAR(100),\nhost_since DATE,\nhost_location VARCHAR(500),\nhost_about VARCHAR(10000),\nhost_response_time VARCHAR(50),\nhost_response_rate VARCHAR(50),\nhost_acceptance_rate VARCHAR(50),\nhost_is_superhost BOOLEAN,\nhost_thumbnail_url VARCHAR(500),\nhost_picture_url VARCHAR(500),\nhost_neighbourhood VARCHAR(50),\nhost_listings_count INT,\nhost_total_listings_count INT,\nhost_verifications VARCHAR(500),\nhost_has_profile_pic BOOLEAN,\nhost_identity_verified BOOLEAN,\ncalculated_host_listings_count INT,\ncalculated_host_listings_count_entire_homes INT,\ncalculated_host_listings_count_private_rooms INT,\ncalculated_host_listings_count_shared_rooms INT,\nPRIMARY KEY(host_id)\n)"
After you have created the SQL query, you can execute it.
# Execute query
dbSendQuery(mydb, query)
You can also run a query and fetch the result. For example, you can check the structure of the host_info table.
# send query
res <- dbSendQuery(mydb, "DESCRIBE host_info")

# get the result
out_db <- fetch(res)
# free resource
dbClearResult(res)
## [1] TRUE
# print result
out_db
If you want to check the content of the host_info table, you can use the following query, although for now the table is still empty because we haven’t inserted anything into it yet.
res <- dbSendQuery(mydb, "SELECT * FROM host_info")

out_db <- fetch(res)
dbClearResult(res)
out_db
Let’s start inserting data into the MySQL server. The generic formula to insert data into a table is as follows:
INSERT INTO table_name (column_name) VALUES (value_for_each_column)
For example, if you have a table named customer that has reviewer_id and reviewer_name columns, you can insert a row with the following query:
INSERT INTO customer (reviewer_id, reviewer_name) VALUES (123, 'John Doe')
So we need to provide the column names and the corresponding value for each column. Let’s start by combining the column names into a single string.
column_name <- paste(names(df_host), collapse = ", ")

column_name
## [1] "host_id, host_url, host_name, host_since, host_location, host_about, host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count, host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms"
Next, we prepare the values as a single string for each row. For example, consider the input for the first row of the host data.
Just like when you read a CSV file, the values for the columns are separated by commas (,) by default. For example, the input for the first column, host_id, is 120437, the input for the second column is the host_url string, and so on. However, you may notice that in the middle of the string there are extra commas coming from host_verifications (email, phone, …), which should all go into a single column. We need to clean the string so that SQL reads the data correctly.
To get clean values, we process the string columns with the following rules: replace any double quotes inside a value with single quotes, then wrap every character value in double quotes so that internal commas are not read as column separators; missing values stay as NA for now and will be converted to NULL later.
The following builds the proper value string for a single row of data across all columns.
# Convert the values of a single row into one string
value_data <- df_host %>%
  slice(1) %>% # take the first row as a sample
  mutate_if(is.character, function(x) ifelse(is.na(x), NA, x %>%
                                               str_replace_all('"', "'") %>% # Replace " with ' inside the string before joining
                                               paste0('"', . , '"')
                                             )
            ) %>%
  unlist() %>%
  paste(collapse = ", ")

value_data
## [1] "120437, \"https://www.airbnb.com/users/show/120437\", \"Nuttee\", \"2010-05-08\", \"Bangkok\", \"Hi All, I am nuttee patranavik from Bangkok, Thailand.\nalways travel but easy to connect via airbnb..\", \"within a few hours\", \"100%\", NA, FALSE, \"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_small\", \"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_x_medium\", \"Victory Monument\", 2, 2, \"email, phone, reviews, jumio, offline_government_id, selfie, government_id, identity_manual\", TRUE, TRUE, 2, 2, 0, 0"
Now you can see that even if a column value contains commas, such as host_verifications, it will not cause a problem: because the string is wrapped in quotation marks, SQL will not read those commas as separators for the next column.
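As a side note, building quoted values by hand like this is easy to get wrong. DBI, the interface that RMySQL implements, ships a quoting helper that escapes quotes inside a value and maps NA to NULL for you. The snippet below is only a sketch of that helper and is not used in the rest of this post.

# Sketch only: dbQuoteString() escapes embedded quotes and turns NA into NULL
library(DBI)
dbQuoteString(ANSI(), c("Bangkok, Thailand", "it's fine", NA))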
Finally, we need to transform the missing value NA into an explicit NULL, since SQL only considers NULL a legitimate missing value.
# Replace missing value with explicit NULL
value_data <- str_replace_all(value_data, "\\bNA\\b", "NULL") %>% # Replace NA with NULL
  str_replace_all('\\"NULL\\"', "NULL") # Remove "" from NULL

value_data
## [1] "120437, \"https://www.airbnb.com/users/show/120437\", \"Nuttee\", \"2010-05-08\", \"Bangkok\", \"Hi All, I am nuttee patranavik from Bangkok, Thailand.\nalways travel but easy to connect via airbnb..\", \"within a few hours\", \"100%\", NULL, FALSE, \"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_small\", \"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_x_medium\", \"Victory Monument\", 2, 2, \"email, phone, reviews, jumio, offline_government_id, selfie, government_id, identity_manual\", TRUE, TRUE, 2, 2, 0, 0"
The full query for the first row is as follows:
paste0("INSERT INTO host_info (", column_name, ") VALUES (", value_data , ")")
## [1] "INSERT INTO host_info (host_id, host_url, host_name, host_since, host_location, host_about, host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count, host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms) VALUES (120437, \"https://www.airbnb.com/users/show/120437\", \"Nuttee\", \"2010-05-08\", \"Bangkok\", \"Hi All, I am nuttee patranavik from Bangkok, Thailand.\nalways travel but easy to connect via airbnb..\", \"within a few hours\", \"100%\", NULL, FALSE, \"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_small\", \"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_x_medium\", \"Victory Monument\", 2, 2, \"email, phone, reviews, jumio, offline_government_id, selfie, government_id, identity_manual\", TRUE, TRUE, 2, 2, 0, 0)"
Let’s prepare the data.
df_host_new <- df_host %>%
  mutate_if(is.character, function(x) ifelse(is.na(x), NA, x %>%
                                               str_replace_all('"', "'") %>% # Replace " with '
                                               paste0('"', . , '"') # add " before and after string
                                             )
            )
You can insert multiple rows at once with the following SQL format:
INSERT INTO customer (reviewer_id, reviewer_name) VALUES (123, 'John Doe'), (124, 'Jean Doe')
Let’s prepare the value for each observation with the following function.
value_data <- apply(df_host_new, MARGIN = 1,
                    function(x) x %>%
                      paste(collapse = ",") %>% # Join all columns into a single string
                      str_replace_all("\\bNA\\b", "NULL") %>% # Replace NA with NULL
                      str_trim() # Remove unnecessary whitespace
                    )

# Check the first 3 rows
value_data %>%
  head(3)
## [1] "120437,\"https://www.airbnb.com/users/show/120437\",\"Nuttee\",\"2010-05-08\",\"Bangkok\",\"Hi All, I am nuttee patranavik from Bangkok, Thailand.\nalways travel but easy to connect via airbnb..\",\"within a few hours\",\"100%\",NULL,FALSE,\"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_small\",\"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_x_medium\",\"Victory Monument\", 2, 2,\"email, phone, reviews, jumio, offline_government_id, selfie, government_id, identity_manual\",TRUE,TRUE, 2, 2, 0, 0"
## [2] "120462,\"https://www.airbnb.com/users/show/120462\",\"Donald\",\"2010-05-08\",\"Thailand\",NULL,NULL,NULL,NULL,FALSE,\"https://a0.muscache.com/im/users/120462/profile_pic/1273318678/original.jpg?aki_policy=profile_small\",\"https://a0.muscache.com/im/users/120462/profile_pic/1273318678/original.jpg?aki_policy=profile_x_medium\",NULL, 1, 1,\"email, phone\",TRUE,FALSE, 1, 0, 1, 0"
## [3] "120541,\"https://www.airbnb.com/users/show/120541\",\"Emy\",\"2010-05-08\",\"Bangkok, Thailand\",NULL,NULL,NULL,NULL,FALSE,\"https://a0.muscache.com/im/users/120541/profile_pic/1436431885/original.jpg?aki_policy=profile_small\",\"https://a0.muscache.com/im/users/120541/profile_pic/1436431885/original.jpg?aki_policy=profile_x_medium\",NULL, 2, 2,\"email, phone\",TRUE,FALSE, 2, 1, 1, 0"
Finally, we join all rows into a single, giant string. For example, the following is the query to insert the first 3 rows.
paste(value_data[1:3]) %>%
paste0("(", ., ")") %>% # Add bracket before and after string
paste(collapse = ", ") %>% # Join all values into single string
paste0("INSERT INTO host_info (", column_name, ") VALUES ", .)
## [1] "INSERT INTO host_info (host_id, host_url, host_name, host_since, host_location, host_about, host_response_time, host_response_rate, host_acceptance_rate, host_is_superhost, host_thumbnail_url, host_picture_url, host_neighbourhood, host_listings_count, host_total_listings_count, host_verifications, host_has_profile_pic, host_identity_verified, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms) VALUES (120437,\"https://www.airbnb.com/users/show/120437\",\"Nuttee\",\"2010-05-08\",\"Bangkok\",\"Hi All, I am nuttee patranavik from Bangkok, Thailand.\nalways travel but easy to connect via airbnb..\",\"within a few hours\",\"100%\",NULL,FALSE,\"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_small\",\"https://a0.muscache.com/im/pictures/user/d05a6b51-b459-4eb8-a01e-8c55122b0132.jpg?aki_policy=profile_x_medium\",\"Victory Monument\", 2, 2,\"email, phone, reviews, jumio, offline_government_id, selfie, government_id, identity_manual\",TRUE,TRUE, 2, 2, 0, 0), (120462,\"https://www.airbnb.com/users/show/120462\",\"Donald\",\"2010-05-08\",\"Thailand\",NULL,NULL,NULL,NULL,FALSE,\"https://a0.muscache.com/im/users/120462/profile_pic/1273318678/original.jpg?aki_policy=profile_small\",\"https://a0.muscache.com/im/users/120462/profile_pic/1273318678/original.jpg?aki_policy=profile_x_medium\",NULL, 1, 1,\"email, phone\",TRUE,FALSE, 1, 0, 1, 0), (120541,\"https://www.airbnb.com/users/show/120541\",\"Emy\",\"2010-05-08\",\"Bangkok, Thailand\",NULL,NULL,NULL,NULL,FALSE,\"https://a0.muscache.com/im/users/120541/profile_pic/1436431885/original.jpg?aki_policy=profile_small\",\"https://a0.muscache.com/im/users/120541/profile_pic/1436431885/original.jpg?aki_policy=profile_x_medium\",NULL, 2, 2,\"email, phone\",TRUE,FALSE, 2, 1, 1, 0)"
We will create a function to simplify the process so that we can use the previous steps multiple times in a single line.
insert_to_sql <- function(conn, table, data){

  column_name <- paste(names(data), collapse = ", ")

  data_new <- data %>%
    mutate_if(is.character, function(x) ifelse(is.na(x), NA, x %>%
                                                 str_replace_all('"', "'") %>% # Replace " with '
                                                 paste0('"', . , '"') # Add " before and after string
                                               )
              )

  value_data <- apply(data_new, MARGIN = 1,
                      function(x) x %>%
                        paste(collapse = ",") %>% # Join all columns into a single string
                        str_replace_all("\\bNA\\b", "NULL") %>% # Create NULL from NA
                        str_trim() # remove unnecessary whitespace
                      )

  query_value <- paste(value_data) %>%
    paste0("(", ., ")") %>% # Add bracket before and after string
    paste(collapse = ", ") %>% # Join all values into a single string
    str_remove_all("\\\\") %>% # Remove \\ from string
    paste0("INSERT INTO ", table, "(", column_name, ") VALUES ", .)

  dbSendQuery(conn, query_value)

  cat("Data uploaded to MySQL\n")
}
Let’s insert the data into the MySQL server.
# Insert all data to SQL
insert_to_sql(mydb, "host_info", df_host)
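As an alternative worth knowing about (not used in this post), DBI also provides dbWriteTable(), which can append a data frame to an existing table without building the INSERT statement by hand. Whether it runs smoothly depends on the server configuration, so treat the following as a sketch.

# Sketch only: append df_host to the existing host_info table in a single call
dbWriteTable(mydb, name = "host_info", value = df_host,
             append = TRUE, row.names = FALSE)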
Now we will try running some queries against the host_info table, starting by connecting to the MySQL server.
Let’s get the first 10 observations from the host_info table. The LIMIT clause is similar to head() in R (or pandas in Python) and will give us the first n rows.
# send query
res <- dbSendQuery(mydb, "SELECT * FROM host_info LIMIT 10")

# get the result
out_db <- fetch(res)
# free resource
dbClearResult(res)
## [1] TRUE
out_db
You can also filter the data conditionally. For example, let’s get the first 5 hosts that are considered superhosts. To create a condition, you can use the WHERE clause.
# send query
res <- dbSendQuery(mydb, "SELECT * FROM host_info WHERE host_is_superhost = TRUE LIMIT 5")

# get the result
out_db <- fetch(res)
# free resource
dbClearResult(res)
## [1] TRUE
out_db
You can get the number of rows in the data using COUNT(*).
# send query
res <- dbSendQuery(mydb, "SELECT COUNT(*) as frequency FROM host_info")

# get the result
out_db <- fetch(res)
# free resource
dbClearResult(res)
## [1] TRUE
out_db
We will do a full analysis of the data and explore different kinds of SQL commands in another post. For now, we will finish inserting all the data into the database. Let’s continue processing the remaining columns of the listing dataset.
# Remove columns related to the host except host_id
df_new_listing <- df_listing %>%
  select(-names(df_host)[-1])
glimpse(df_new_listing)
## Rows: 19,289
## Columns: 53
## $ id <int> 27934, 27942, 27979, 28354, 28745, 35656,…
## $ listing_url <chr> "https://www.airbnb.com/rooms/27934", "ht…
## $ scrape_id <dbl> 2.021032e+13, 2.021032e+13, 2.021032e+13,…
## $ last_scraped <chr> "2021-03-22", "2021-03-22", "2021-03-31",…
## $ name <chr> "Nice room with superb city view", "Town …
## $ description <chr> "Our cool and comfortable one bedroom apa…
## $ neighborhood_overview <chr> "It is very center of Bangkok and easy ac…
## $ picture_url <chr> "https://a0.muscache.com/pictures/566374/…
## $ host_id <int> 120437, 120462, 120541, 121848, 123784, 1…
## $ neighbourhood <chr> "Samsen Nai, Bangkok, Thailand", NA, NA, …
## $ neighbourhood_cleansed <chr> "Ratchathewi", "Bangkok Yai", "Bang Na", …
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ latitude <dbl> 13.76000, 13.73463, 13.66818, 13.81152, 1…
## $ longitude <dbl> 100.5427, 100.4754, 100.6167, 100.5662, 1…
## $ property_type <chr> "Entire condominium", "Private room in ho…
## $ room_type <chr> "Entire home/apt", "Private room", "Priva…
## $ accommodates <int> 3, 1, 2, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 2,…
## $ bathrooms <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ bathrooms_text <chr> "1.5 baths", "1 bath", "1 bath", "1 bath"…
## $ bedrooms <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1, 1, 1, 1…
## $ beds <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, NA, 1…
## $ amenities <chr> "[\"Microwave\", \"Pool\", \"Smoke alarm\…
## $ price <chr> "$1,694.00", "$924.00", "$1,188.00", "$1,…
## $ minimum_nights <int> 3, 1, 1, 28, 60, 3, 5, 1, 30, 1, 180, 3, …
## $ maximum_nights <int> 90, 730, 730, 365, 730, 730, 366, 730, 73…
## $ minimum_minimum_nights <int> 3, 1, 1, 28, 60, 3, 5, 1, 30, 1, 180, 3, …
## $ maximum_minimum_nights <int> 3, 1, 1, 28, 60, 3, 5, 1, 30, 1, 180, 3, …
## $ minimum_maximum_nights <int> 90, 730, 730, 1125, 730, 730, 366, 730, 7…
## $ maximum_maximum_nights <int> 90, 730, 730, 1125, 730, 730, 366, 730, 7…
## $ minimum_nights_avg_ntm <dbl> 3, 1, 1, 28, 60, 3, 5, 1, 30, 1, 180, 3, …
## $ maximum_nights_avg_ntm <dbl> 90, 730, 730, 1125, 730, 730, 366, 730, 7…
## $ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ has_availability <chr> "t", "f", "t", "t", "t", "f", "t", "t", "…
## $ availability_30 <int> 27, 0, 1, 29, 30, 0, 30, 30, 0, 0, 28, 30…
## $ availability_60 <int> 57, 0, 1, 59, 60, 0, 60, 60, 0, 0, 58, 60…
## $ availability_90 <int> 87, 0, 1, 89, 90, 0, 90, 90, 0, 0, 88, 90…
## $ availability_365 <int> 362, 0, 1, 364, 365, 0, 365, 365, 0, 0, 3…
## $ calendar_last_scraped <chr> "2021-03-22", "2021-03-22", "2021-03-31",…
## $ number_of_reviews <int> 65, 0, 0, 52, 0, 0, 1, 0, 0, 48, 0, 1, 0,…
## $ number_of_reviews_ltm <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ number_of_reviews_l30d <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ first_review <chr> "2012-04-07", NA, NA, "2010-09-08", NA, N…
## $ last_review <chr> "2020-01-06", NA, NA, "2019-12-09", NA, N…
## $ review_scores_rating <int> 97, NA, NA, 96, NA, NA, 80, NA, NA, 98, N…
## $ review_scores_accuracy <int> 10, NA, NA, 10, NA, NA, 10, NA, NA, 10, N…
## $ review_scores_cleanliness <int> 10, NA, NA, 10, NA, NA, 6, NA, NA, 10, NA…
## $ review_scores_checkin <int> 10, NA, NA, 10, NA, NA, 4, NA, NA, 10, NA…
## $ review_scores_communication <int> 10, NA, NA, 10, NA, NA, 8, NA, NA, 10, NA…
## $ review_scores_location <int> 9, NA, NA, 9, NA, NA, 8, NA, NA, 9, NA, 1…
## $ review_scores_value <int> 9, NA, NA, 9, NA, NA, 10, NA, NA, 10, NA,…
## $ license <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ instant_bookable <chr> "f", "f", "f", "t", "f", "f", "f", "f", "…
## $ reviews_per_month <dbl> 0.60, NA, NA, 0.40, NA, NA, 0.02, NA, NA,…
Some columns, such as calendar_updated, license, and bathrooms, consist only of missing values. The scrape_id column also has no apparent meaning for now, since we already have the last_scraped column to show the latest date of scraping. These columns are not informative, so we will drop them.
df_new_listing <- df_new_listing %>%
  select(-c(license, calendar_updated, bathrooms, scrape_id))
Next, we will convert the columns that should have a boolean/logical value, namely has_availability and instant_bookable.
df_new_listing <- df_new_listing %>%
  mutate_at(vars(has_availability, instant_bookable),
            function(x) case_when(x == "t" ~ TRUE,
                                  x == "f" ~ FALSE,
                                  T ~ NA)
            )
We will continue by transforming the price column into decimal/numeric values by removing the dollar sign and commas from the string.
df_new_listing <- df_new_listing %>%
  mutate(price = str_remove_all(price, "[$,]") %>%
           as.numeric()
         )
head(df_new_listing$price)
## [1] 1694 924 1188 1095 800 4619
Now we will transform the amenities column with the same treatment as the host_verifications column from the host dataset.
df_new_listing <- df_new_listing %>%
  mutate(amenities = str_remove_all(amenities, "[\"\\'\\[\\]]"))
head(df_new_listing$amenities)
## [1] "Microwave, Pool, Smoke alarm, Dryer, Long term stays allowed, Hair dryer, Stove, Iron, Free parking on premises, Essentials, Dishes and silverware, Washer, Hot water, Patio or balcony, Hangers, Air conditioning, Luggage dropoff allowed, Kitchen, Gym, Fire extinguisher, Elevator, TV, Dedicated workspace, Host greets you, Shampoo, Wifi, Refrigerator"
## [2] "Cable TV, Kitchen, Air conditioning, Free parking on premises, Breakfast, TV, Long term stays allowed"
## [3] "Cable TV, TV, Pool, Shampoo, Hot tub, Heating, First aid kit, Gym, Wifi, Long term stays allowed, Smoke alarm, Fire extinguisher, Kitchen, Elevator, Essentials, Washer, Free parking on premises"
## [4] "Dedicated workspace, Air conditioning, Hair dryer, Microwave, Essentials, Free parking on premises, Cable TV, Carbon monoxide alarm, Dishes and silverware, Long term stays allowed, Iron, TV, Elevator, Wifi, Shampoo, Pool, Kitchen, Refrigerator, Hangers, Smoke alarm, Gym, Washer, First aid kit, Fire extinguisher"
## [5] "Long term stays allowed"
## [6] "Cable TV, Kitchen, Air conditioning, Free parking on premises, TV, Long term stays allowed"
Now we will start creating the listing table on the MySQL server. First, let’s check the maximum length of each string column in the data.
df_new_listing %>%
  select_if(is.character) %>%
  lapply(FUN = function(x) max(nchar(x, keepNA = F))) %>%
  as.data.frame() %>%
  pivot_longer(cols = names(.))
Let’s start writing the query for creating the table. For numeric values such as price we will use the DECIMAL data type, while the rest follow the same pattern as the host table. The primary key for the listing table is id, which identifies the listing. The foreign key is host_id, which we can use to join the listing table with the host_info table for analysis.
query <- "CREATE TABLE listing (
id INT,
listing_url VARCHAR(100),
last_scraped DATE,
name VARCHAR(500),
description VARCHAR(2000),
neighborhood_overview VARCHAR(2000),
picture_url VARCHAR(500),
host_id INT,
neighbourhood VARCHAR(100),
neighbourhood_cleansed VARCHAR(100),
neighbourhood_group_cleansed VARCHAR(100),
latitude DECIMAL(25,18),
longitude DECIMAL(25, 18),
property_type VARCHAR(100),
room_type VARCHAR(100),
accommodates INT,
bathrooms_text VARCHAR(100),
bedrooms INT,
beds INT,
amenities VARCHAR(2000),
price DECIMAL(15, 5),
minimum_nights INT,
maximum_nights INT,
minimum_minimum_nights INT,
maximum_minimum_nights INT,
minimum_maximum_nights INT,
maximum_maximum_nights INT,
minimum_nights_avg_ntm DECIMAL(16, 5),
maximum_nights_avg_ntm DECIMAL(16, 5),
has_availability BOOLEAN,
availability_30 INT,
availability_60 INT,
availability_90 INT,
availability_365 INT,
calendar_last_scraped DATE,
number_of_reviews INT,
number_of_reviews_ltm INT,
number_of_reviews_l30d INT,
first_review DATE,
last_review DATE,
review_scores_rating DECIMAL(10, 5),
review_scores_accuracy DECIMAL(10, 5),
review_scores_cleanliness DECIMAL(10, 5),
review_scores_checkin DECIMAL(10, 5),
review_scores_communication DECIMAL(10, 5),
review_scores_location DECIMAL(10, 5),
review_scores_value DECIMAL(10, 5),
instant_bookable BOOLEAN,
reviews_per_month DECIMAL(10, 5),
PRIMARY KEY(id),
FOREIGN KEY(host_id) REFERENCES host_info(host_id)
)"
query
## [1] "CREATE TABLE listing (\nid INT,\nlisting_url VARCHAR(100),\nlast_scraped DATE,\nname VARCHAR(500),\ndescription VARCHAR(2000),\nneighborhood_overview VARCHAR(2000),\npicture_url VARCHAR(500),\nhost_id INT,\nneighbourhood VARCHAR(100),\nneighbourhood_cleansed VARCHAR(100),\nneighbourhood_group_cleansed VARCHAR(100),\nlatitude DECIMAL(25,18),\nlongitude DECIMAL(25, 18),\nproperty_type VARCHAR(100),\nroom_type VARCHAR(100),\naccommodates INT,\nbathrooms_text VARCHAR(100),\nbedrooms INT,\nbeds INT,\namenities VARCHAR(2000),\nprice DECIMAL(15, 5),\nminimum_nights INT,\nmaximum_nights INT,\nminimum_minimum_nights INT,\nmaximum_minimum_nights INT,\nminimum_maximum_nights INT,\nmaximum_maximum_nights INT,\nminimum_nights_avg_ntm DECIMAL(16, 5),\nmaximum_nights_avg_ntm DECIMAL(16, 5),\nhas_availability BOOLEAN,\navailability_30 INT,\navailability_60 INT,\navailability_90 INT,\navailability_365 INT,\ncalendar_last_scraped DATE,\nnumber_of_reviews INT,\nnumber_of_reviews_ltm INT,\nnumber_of_reviews_l30d INT,\nfirst_review DATE,\nlast_review DATE,\nreview_scores_rating DECIMAL(10, 5),\nreview_scores_accuracy DECIMAL(10, 5),\nreview_scores_cleanliness DECIMAL(10, 5),\nreview_scores_checkin DECIMAL(10, 5),\nreview_scores_communication DECIMAL(10, 5),\nreview_scores_location DECIMAL(10, 5),\nreview_scores_value DECIMAL(10, 5),\ninstant_bookable BOOLEAN,\nreviews_per_month DECIMAL(10, 5),\nPRIMARY KEY(id),\nFOREIGN KEY(host_id) REFERENCES host_info(host_id)\n)"
Let’s run the query.
dbSendQuery(mydb, query)
You can check the listing table with the following query.
# send query
res <- dbSendQuery(mydb, "DESCRIBE listing")

# get the result
out_db <- fetch(res)
# free resource
dbClearResult(res)
## [1] TRUE
out_db
Now we will insert the rows into the table using the same function that we defined earlier. Unfortunately, MySQL has a maximum allowed packet size for a single query. You can check it with the following query.
# send query
res <- dbSendQuery(mydb, "SHOW VARIABLES LIKE 'max_allowed_packet'")

# get the result
out_db <- fetch(res)
# free resource
dbClearResult(res)
## [1] TRUE
out_db
Since our data is larger than the allowed packet, we need to split it into smaller chunks. Here, we will generate 10 evenly spaced breakpoints and use them to upload the data chunk by chunk.
Read more about this problem here.
data_split <- seq(1, nrow(df_new_listing), length.out = 10) %>%
  round()
data_split
## [1] 1 2144 4287 6430 8573 10717 12860 15003 17146 19289
We will insert the data one chunk at a time with a loop.
for (i in 1:length(data_split)) {
  if (data_split[i] < nrow(df_new_listing)) {

    # If condition to properly select the data
    if (i == 1) {
      cat(paste("Uploading rows", data_split[i], "-", data_split[i+1], "\n"))
      selected_data <- df_new_listing %>%
        slice(data_split[i]:data_split[i+1])
    } else {
      cat(paste("Uploading rows", data_split[i]+1, "-", data_split[i+1], "\n"))
      selected_data <- df_new_listing %>%
        slice((data_split[i]+1):data_split[i+1])
    }

    insert_to_sql(mydb, "listing", selected_data)
  }
}
Let’s run some simple queries to check the data that we have inserted. First, let’s check how many rows there are in the listing table.
res <- dbSendQuery(mydb, "SELECT COUNT(*) as frequency FROM listing")

out_db <- fetch(res)
dbClearResult(res)
## [1] TRUE
out_db
I want to see the top 10 property types by frequency in the listing table. We can use COUNT to get the number of rows and GROUP BY to count the rows for each property type. To sort the values, we can use ORDER BY, followed by DESC to indicate descending order, and finally we take only the first 10 rows with LIMIT.
res <- dbSendQuery(mydb,
                   "SELECT COUNT(*) as frequency, property_type FROM listing GROUP BY property_type ORDER BY COUNT(*) DESC LIMIT 10"
                   )

out_db <- fetch(res)
dbClearResult(res)
## [1] TRUE
out_db
How about the top 10 most expensive listings? We can check by sorting the data on the price column.
res <- dbSendQuery(mydb,
                   "SELECT price, property_type, neighbourhood FROM listing ORDER BY price DESC LIMIT 10"
                   )

out_db <- fetch(res)
dbClearResult(res)
## [1] TRUE
out_db
When you are done querying your MySQL server, don’t forget to close the connection to the database.
# Closing connection to database
dbDisconnect(mydb)
## [1] TRUE
I hope this notebook has fulfilled the goals stated at the beginning of the post.
In my next post, I will write about how to do data analysis using SQL queries.