This is a 2-part assignment. In the first part, you are asked a series of questions that will help you profile and understand the data just like a data scientist would. For this first part of the assignment, you will be assessed both on the correctness of your findings, as well as the code you used to arrive at your answer. You will be graded on how easy your code is to read, so remember to use proper formatting and comments where necessary.
In the second part of the assignment, you are asked to come up with your own inferences and analysis of the data for a particular research question you want to answer. You will be required to prepare the dataset for the analysis you choose to do. As with the first part, you will be graded, in part, on how easy your code is to read, so use proper formatting and comments to illustrate and communicate your intent as required.
The database that was used is the CHINOOK database: click to access
##Part 1: Yelp Dataset Profiling and Understanding
Note: Primary Keys are denoted in the ER-Diagram with a yellow key icon.
3. Are there any columns with null values in the Users table? Indicate “yes,” or “no.”
Answer: NO
SQL code used to arrive at answer:
SELECT COUNT(*)
FROM user
WHERE id IS NULL OR
name IS NULL OR
review_count IS NULL OR
yelping_since IS NULL OR
useful IS NULL OR
funny IS NULL OR
cool IS NULL OR
fans IS NULL OR
average_stars IS NULL OR
compliment_hot IS NULL OR
compliment_more IS NULL OR
compliment_profile IS NULL OR
compliment_cute IS NULL OR
compliment_list IS NULL OR
compliment_note IS NULL OR
compliment_plain IS NULL OR
compliment_cool IS NULL OR
compliment_funny IS NULL OR
compliment_writer IS NULL OR
compliment_photos IS NULL
4. For each table and column listed below, display the smallest (minimum), largest (maximum), and average (mean) value for the following fields:
i. Table: Review, Column: Stars
min:1 max:5 avg:3.71
ii. Table: Business, Column: Stars
min:1 max:5 avg:3.65
iii. Table: Tip, Column: Likes
min:0 max:2 avg:0.0144
iv. Table: Checkin, Column: Count
min:1 max:53 avg:1.94
v. Table: User, Column: Review_count
min:0 max:2000 avg:24.3
List the cities with the most reviews in descending order:
SQL code used to arrive at answer:+—————–+————+
| city | nr_reviews |
+—————–+————+
| Las Vegas | 82854 |
| Phoenix | 34503 |
| Toronto | 24113 |
| Scottsdale | 20614 |
| Charlotte | 12523 |
| Henderson | 10871 |
| Tempe | 10504 |
| Pittsburgh | 9798 |
| Montr?al | 9448 |
| Chandler | 8112 |
| Mesa | 6875 |
| Gilbert | 6380 |
| Cleveland | 5593 |
| Madison | 5265 |
| Glendale | 4406 |
| Mississauga | 3814 |
| Edinburgh | 2792 |
| Peoria | 2624 |
| North Las Vegas | 2438 |
| Markham | 2352 |
| Champaign | 2029 |
| Stuttgart | 1849 |
| Surprise | 1520 |
| Lakewood | 1465 |
| Goodyear | 1155 |
+—————–+————+ (Output limit exceeded, 25 of 362 total rows shown)
6. Find the distribution of star ratings to the business in the following cities:
i. Avon
SQL code used to arrive at answer:
SELECT count(stars) as Nr_businesses, city, stars
FROM business
WHERE city=“Avon”
GROUP BY stars
Copy and Paste the Resulting Table Below (2 columns - star rating and count):
+———————+——-+
| count_nr_businesses | stars |
+———————+——-+
| 1 | 1.5 |
| 2 | 2.5 |
| 3 | 3.5 |
| 2 | 4.0 |
| 1 | 4.5 |
| 1 | 5.0 |
+———————+——-+
SQL code used to arrive at answer:
SELECT count(stars) as Nr_businesses, city, stars
FROM business
WHERE city=“Beachwood”
GROUP BY stars
Copy and Paste the Resulting Table Below (2 columns - star rating and count):
+———————+——-+
| count_nr_businesses | stars |
+———————+——-+
| 1 | 2.0 |
| 1 | 2.5 |
| 2 | 3.0 |
| 2 | 3.5 |
| 1 | 4.0 |
| 2 | 4.5 |
| 5 | 5.0 |
+———————+——-+
7. Find the top 3 users based on their total number of reviews:
SQL code used to arrive at answer:
SELECT name, sum(review_count) as total_nr_reviews
FROM user
GROUP BY name
ORDER BY by total_nr_reviews desc limit 3
Copy and Paste the Result Below:
+——–+——————+
| name | total_nr_reviews |
+——–+——————+
| Nicole | 2397 |
| Sara | 2253 |
| Gerald | 2034 |
+——–+——————+
SELECT id, name, sum(review_count) as reviews, sum(fans) as totfans
FROM user
GROUP by id
ORDER by reviews desc
Find the top 10 users with the most fans:
select id, name, sum(fans) as total_nr_fans
from user
group by id
order by total_nr_fans desc limit 10
Copy and Paste the Result Below:
+————————+———–+—————+
| id | name | total_nr_fans |
+————————+———–+—————+
| -9I98YbNQnLdAmcYfb324Q | Amy | 503 |
| -8EnCioUmDygAbsYZmTeRQ | Mimi | 497 |
| –2vR0DIsmQ6WfcSzKWigw | Harald | 311 |
| -G7Zkl1wIWBBmD0KRy_sCw | Gerald | 253 |
| -0IiMAZI2SsQ7VmyzJjokQ | Christine | 173 |
| -g3XIcCb2b-BD0QBCcq2Sw | Lisa | 159 |
| -9bbDysuiWeo2VShFJJtcw | Cat | 133 |
| -FZBTkAZEXoP7CYvRV2ZwQ | William | 126 |
| -9da1xk7zgnnfO1uTVYGkA | Fran | 124 |
| -lh59ko3dxChBSZ9U7LfUw | Lissa | 120 |
+————————+———–+—————+
1. Pick one city and category of your choice and group the businesses in that city or category by their overall star rating. Compare the businesses with 2-3 stars to the businesses with 4-5 stars and answer the following questions. Include your code.
I chose Phoenix & the ‘restaurants’ category. There are only 5 distinct business under this category though. Grouped 3 businesses under the 2 to 3.5 stars, the other 2 businesses were 4 & 4.5 stars respectively.
i. Do the two groups you chose to analyze have a different distribution of hours?
The different distribution is only significant between the 2-star restaurant (opens from early morning to midnight) versus the 4.5 stars restaurant that opens only from 10am to 6pm. There are no differences between the other restaurants.
ii. Do the two groups you chose to analyze have a different number of reviews?
It is difficult to analyze due to the low # of businesses in the dataset for this particular group. There appears to be a correlation between low number of reviews and low star rating (2-star restaurant only has 8 reviews).
iii. Are you able to infer anything from the location data provided between these two groups? Explain.
Not really. The zipcodes for all 5 businesses are all distinct. An additional analysis would need to place the GPS location in a map chart to see if the business are located in avenues or high-traffic streets.
SQL code used for analysis:
SELECT bus.id, bus.name, bus.city, cat.category, bus.stars, bus.review_count, hr.hours, bus.postal_code, bus.latitude, bus.longitude
FROM business bus
INNER JOIN category cat
ON bus.id = cat.business_id
INNER JOIN hours hr
ON bus.id = hr.business_id
WHERE bus.city = ‘Phoenix’ and cat.category=‘Restaurants’
ORDER BY by stars (/* order by “stars desc”, to see the bottom of the subset */)
2. Group business based on the ones that are open and the ones that are closed. What differences can you find between the ones that are still open and the ones that are closed? List at least two differences and the SQL code you used to arrive at your answer.
i. Difference 1:
The businesses that are OPEN tend to have > reviews than the ones that are CLOSED.
Open business: AVG(review_count) = 31.76
Closed business: AVG(review_count) = 23.2
ii. Difference 2:
The average star rating is higher for businesses that are OPEN than for businesses that are CLOSED.
Open: AVG(stars) = 3.68
Closed: AVG(stars) = 3.52
SQL code used for analysis:
SELECT COUNT(DISTINCT(id)), AVG(review_count), SUM(review_count), AVG(stars), is_open
FROM business
GROUP BY is_open
iv. Provide the SQL code you used to create your final dataset:
SELECT B.id as Business_ID,
B.name as Business_Name,
B.address,
B.city,
B.state,
B.postal_code,
B.latitude,
B.longitude,
B.review_count as Nr_reviews,
B.stars as Star_Rating,
MAX(CASE WHEN H.hours LIKE “%monday%” THEN TRIM(H.hours,‘%MondayTuesWednesThursFriSatSun|%’)
END) AS open_hours_Mon,
MAX(CASE wHEN H.hours LIKE “%tuesday%” THEN TRIM(H.hours,‘%MondayTuesWednesThursFriSatSun|%’)
END) AS open_hours_Tue,
MAX(CASE WHEN H.hours LIKE “%wednesday%” THEN TRIM(H.hours,‘%MondayTuesWednesThursFriSatSun|%’) END) AS open_hours_Wed,
MAX(CASE WHEN H.hours LIKE “%thursday%” THEN TRIM(H.hours,‘%MondayTuesWednesThursFriSatSun|%’)
END) AS open_hours_Thur, MAX(CASE WHEN H.hours LIKE “%friday%” THEN TRIM(H.hours,‘%MondayTuesWednesThursFriSatSun|%’)
END) AS open_hours_Fri, MAX(CASE WHEN H.hours LIKE “%saturday%” THEN TRIM(H.hours,‘%MondayTuesWednesThursFriSatSun|%’)
END) AS open_hours_Sat, MAX(CASE WHEN H.hours LIKE “%sunday%” THEN TRIM(H.hours,‘%MondayTuesWednesThursFriSatSun|%’)
END) AS open_hours_Sun,
GROUP_CONCAT(DISTINCT(C.category)) AS categories,
GROUP_CONCAT(DISTINCT(A.name)) AS attributes,
B.is_open
FROM business B
INNER JOIN hours H
ON B.id = H.business_id
INNER JOIN category C
ON B.id = C.business_id
INNER JOIN attribute A
ON B.id = A.business_id
GROUP BY B.id
| -PtTGvWsckUL8tTutHr6Ew | Snip-its Rocky River | 21609 Center Ridge Rd | Rocky River | OH | 44116 | 41.4595 | -81.8587 | 18 | 2.5 | 10:00-19:00 | 10:00-19:00 | 10:00-19:00 | 10:00-19:00 | 10:00-19:00 | 9:00-17:30 | 10:00-16:00 | Beauty & Spas,Hair Salons | BusinessAcceptsCreditCards,RestaurantsPriceRange2,GoodForKids,BusinessParking,ByAppointmentOnly | 1 |
| 1 |
+————————+————————-+—————–+—————+——-+—————–+———+
(Output limit exceeded, 25 of 70 total rows shown)