Data Analysis
Run libraries
| library(devtools) |
| library(DBI) |
| library(tidyverse) |
| library(RSQLite) |
| library(DescTools) |
| library(DT) |
| library(Datatablethemes) |
airbnb_search_details <- read.csv(file = 'airbnb_search_details.csv')
airbnb_reviews <- read.csv(file = 'airbnb_reviews.csv')
airbnb_hosts <- read.csv(file = 'airbnb_hosts.csv')
airbnb_host_searches <- read.csv(file = 'airbnb_host_searches.csv')
product_pricing <- read.csv(file = 'product_pricing.csv')
product_sales <- read.csv(file = 'product_sales.csv')
orders <- read.csv(file = "orders.csv")
vendor_spend <- read.csv(file = 'Vendor_spend.csv')
candidateColleges <- read.csv(file = 'candidateColleges.csv')
candidateInterviews <- read.csv(file = 'candidateInterviews.csv')
3 Bed Minimum (easy)
Find the average number of beds in each neighborhood that has at
least 3 beds in total.
Output results along with the neighborhood name and sort the results
based on the number of average beds in descending order.
Table: airbnb_search_details
airbnb_search_details %>%
group_by(neighbourhood) %>%
summarise(SUM_beds = sum(beds), AVG_beds = mean(beds)) %>%
filter(SUM_beds >=3) %>%
arrange(desc(AVG_beds)) %>%
select(neighbourhood, AVG_beds) %>%
Datatablethemes::Datatable_green()
Favorite Host Nationality (Medium)
For each guest reviewer, find the nationality of the reviewer’s
favorite host based on the guest’s highest review score given to a host.
Output the user ID of the guest along with their favorite host’s
nationality. In case there is more than one favorite host from the same
country, list that country only once (remove duplicates).
Both the from_user and to_user columns are user IDs.
Tables: airbnb_reviews,
airbnb_hosts
airbnb_reviews %>%
filter(from_type == "guest") %>%
left_join(., airbnb_hosts %>%
group_by(host_id) %>%
summarise(nationality = max(nationality)),
by = c("to_user" = "host_id")) %>%
group_by(from_user) %>%
mutate(max_review_score = max(review_score)) %>%
select(1,6,5,7) %>%
arrange(from_user) %>%
filter(review_score == max_review_score) %>%
group_by(from_user, nationality) %>%
summarise() %>%
Datatablethemes::Datatable_blue()
Host Popularity Rental Prices (Hard)
You’re given a table of rental property searches by users. The table
consists of search results and outputs host information for searchers.
Find the minimum, average, maximum rental prices for each host’s
popularity rating. The host’s popularity rating is defined as
below:
0 reviews: New
1 to 5 reviews: Rising
6 to 15 reviews:
Trending Up
16 to 40 reviews: Popular
more than 40 reviews:
Hot
Tip: The id column in the table refers to the
search ID. You’ll need to create your own host_id by concating price,
room_type, host_since, zipcode, and number_of_reviews.
Output host popularity rating and their minimum, average and maximum
rental prices.
Table: airbnb_host_searches
airbnb_host_searches %>%
unite('Host_id', c(room_type, host_since, zipcode, number_of_reviews), remove = FALSE, sep = "") %>%
mutate(Popularity_Rating = case_when(
number_of_reviews == 0 | number_of_reviews == NA ~ "New",
number_of_reviews < 6 ~ "Rising",
number_of_reviews < 16 ~ "Trending Up",
number_of_reviews < 41 ~ "Popular",
TRUE ~ "Hot" ## Same as saying Else
)) %>%
group_by(Host_id, price, Popularity_Rating) %>%
summarise() %>%
group_by(Popularity_Rating) %>%
summarise(min_price = dollar(min(price)),
avg_price = dollar(mean(price)),
max_price = dollar(max(price))) %>%
Datatablethemes::Datatable_red()
Beverage Sales (Easy)
Suppose you’re provided two tables, as shown below. One table
contains the price for a given product in a given region, while another
contains sales for the past week. Given the two tables write a SQL query
to pull the total earnings for each product/region combination.
Table: product_pricing,
product_sales
product_pricing %>%
mutate(price = gsub(price, pattern = "$", fixed = T, replacement = "")) %>%
mutate(price = as.numeric(price)) %>%
left_join(., product_sales, by = c("region", "product")) %>%
mutate(revenue = dollar(price * num_sales)) %>%
select(1,2,5) %>%
Datatable_purple()
Monthly Revenue growth (Medium)
Given the table below, called ‘orders’, write a SQL query to show
the monthly revenue growth. To calculate the monthly revenue growth, you
can apply the following logic:
revenue growth = (current month’s revenue-prior month’s
revenue)/prior month’s revenue.
orders %>%
group_by(rev_month) %>%
summarise(monthly_revenue = sum(revenue)) %>%
mutate(lm_revenue = lag(monthly_revenue)) %>%
mutate(rev_growth = (monthly_revenue - lm_revenue) / lm_revenue) %>%
Datatable_green()
Ranking Vendors by spend (Easy)
Suppose you have a table called ‘vendor_spend’ containing vendor
spend data, as shown.
Table: vendor_spend
vendor_spend %>%
group_by(name) %>%
summarise(spend = sum(total_spend)) %>%
filter(spend > quantile(spend, .5)) %>%
arrange(desc(spend)) %>%
mutate(spend = dollar(spend)) %>%
Datatable_blue()
Top Candidates by college (Easy)
Suppose your team interviews undergraduate candidates across many
different colleges. You are looking to check which candidates scored the
highest from each college.
Given the below table, write a SQL query (using a window function)
to show which candidates scored the highest from each college.
Tables: candidateColleges,
candidateInterviews
candidateColleges %>%
left_join(., candidateInterviews, by = "candidate_name") %>%
group_by(college_id) %>%
mutate(Rank = rank(-Interview_score, ties.method = "min")) %>%
filter(Rank == 1) %>%
select(2,4,1,5) %>%
Datatable_red()
Twitch Content Creators (Hard)
You are working for a company like Twitch.tv. Twitch is a live
streaming platform, where content creators (e.g. the people creating
content on the live streams) can get donations from viewers for
producing content they support.
Your company is trying to launch a new product that will benefit
content creators that get a large amount of donations per streaming
session. See tables below
Given this, write an SQL query to find the top 10 content creators
in 2018 that have the highest average donations per viewer.
Tables: all_donations,
sessions_info, session_viewers
all_donations
| creator_id |
integer |
Unique id of content creator |
| viewer_id |
integer |
Unique id of viewer |
| session_id |
integer |
unique session id of stream |
| date |
string |
format is ‘YYYY-MM-DD’ |
| donation_amount |
integer |
amount donated in USD |
sessions_info
| creator_id |
integer |
Unique id of content creator |
| session_id |
integer |
unique session id of stream |
| date |
string |
format is ‘YYYY-MM-DD’ |
| length |
integer |
length of session |
session_viewers
| creator_id |
integer |
Unique id of content creator |
| viewer_id |
integer |
Unique id of viewer |
| session_id |
integer |
unique session id of stream |
| date |
string |
format is ‘YYYY-MM-DD’ |
| mins_viewed |
integer |
The stream |
all_donations %>%
filter(date >= '2018-01-01' & date <= '2019-01-01') %>%
group_by(creator_id) %>%
summarise(avg_donation = sum(donation_amount) / length(viewer_id)) %>%
arrange(desc(avg_donation)) %>%
head(10)