Yelp Rating Analysis - Data Cleaning & Summarization

Description

This project involved cleaning the data downloaded using YELP and Census API. The cleaning process involved deleting duplicate rows, flattening nested columns, eliminating NAs, filtering data to the study area. The data along with census data was further used for ratings analysis.

Data Cleaning

Installing required packages

library(tidyverse)
library(tmap)
library(sf)
library(dplyr)

List of Input Variables

yelp_sf : Business category 1 [Amusement Parks] in Orange County [unclean data]
yelp_sf1 : Business category 2 [Colleges/Universities] in Orange County [unclean data]
tract : Census tract polygons in Orange County Florida with census data

Viewing and Summarising Input Data

Viewing and Summarising data derived from Yelp for Mini Assignment 1. Through visual observation of the data it was found that the data is in a wide format table.

# Viewing and summarising data that has to be cleaned
load("yelp_sf.Rdata")
load("yelp_sf1.Rdata")

# Business category 1 - Amusement Parks 
view(yelp_sf) 
summary(yelp_sf)

# Business category 2 - Colleges/Universities
view(yelp_sf1)
summary(yelp_sf1)

Deleting Duplicate Rows

# Business category 1 - Amusement Parks 
yelp_unique <- yelp_sf %>% 
  distinct(id, .keep_all=T)
glue::glue("Before dropping duplicated rows, there were {nrow(yelp_sf)} rows. After dropping them, there are {nrow(yelp_unique)} rows") %>% 
  print()

## Before dropping duplicated rows, there were 1217 rows. After dropping them, there are 300 rows

# Business category 2 - Colleges/Universities
yelp_unique1 <- yelp_sf1 %>% 
  distinct(id, .keep_all=T)
glue::glue("Before dropping duplicated rows, there were {nrow(yelp_sf1)} rows. After dropping them, there are {nrow(yelp_unique1)} rows") %>% 
  print()

## Before dropping duplicated rows, there were 232 rows. After dropping them, there are 50 rows

Originally, business category - 1 [Amusement Parks] had 1217 rows that were reduced to 300 rows after dropping all duplicate records.

Originally, business category - 2 [Colleges/Universities] had 232 rows that were reduced to 50 rows after dropping all duplicate records.

Flattening Nested Columns

Upon observation it was found that the datasets had no nested columns. But, if some nested columns existed in the datasets they could be flattened using - jsonlite::flatten()

Eliminating NAs

# Finding whether there are any NAs in the datasets
# Business category 1 - Amusement Parks 
yelp_unique %>% 
  map_dbl(., function(x) sum(is.na(x)))

##            id         alias          name     image_url     is_closed 
##             0             0             0             0             0 
##           url  review_count    categories        rating   coordinates 
##             0             0             0             0             0 
##  transactions      location         phone display_phone      distance 
##             0           126             0             0             0 
##         price      geometry 
##           296             0

# Business category 2 - Colleges/Universities
yelp_unique1 %>% 
  map_dbl(., function(x) sum(is.na(x)))

##            id         alias          name     image_url     is_closed 
##             0             0             0             0             0 
##           url  review_count    categories        rating   coordinates 
##             0             0             0             0             0 
##  transactions      location         phone display_phone      distance 
##             0            28             0             0             0 
##      geometry 
##             0

It was found that there were very few NAs or missing values in both the datasets. There were no NAs or missing values in the coordinates variable. But if some NAs or missing values existed in the datasets they could be eliminated using - drop_na(coordinates)

Deleting Records outside the boundary of Orange County

There were many points that fell outside of Orange county.

# importing census tract file
load("tract.Rdata")
view(tract)
st_crs(tract) <- 4326 # making sure both the tracts and yelp data have the same CRS

# Business category 1 - Amusement Parks 
# Converting to sf 
yelp_sfc <- yelp_unique %>% 
  st_as_sf(coords=c("coordinates.longitude", "coordinates.latitude"), crs = 4326)

# subsetting
yelp_in <- yelp_sfc[tract %>% st_union(), ,op = st_intersects]

# Business category 2 - Colleges/Universities
# Converting to sf 
yelp_sfc1 <- yelp_unique1 %>% 
  st_as_sf(coords=c("coordinates.longitude", "coordinates.latitude"), crs = 4326)

# subsetting
yelp_in1 <- yelp_sfc1[tract %>% st_union(), ,op = st_intersects]

# visualising the subsets 
tmap_mode('view')

## tmap mode set to interactive viewing

tm_shape(yelp_in) + tm_dots(col = 'orange') + tm_shape(yelp_in1) + tm_dots(col = 'blue') + tm_shape(tract) + tm_borders()

Subsetting the yelp business data to the county boundary helped in eliminating records that belonged to the census tract buffers but were outside the county boundary. There were 283 and 44 records respectively in the two selected business categories.

Appending Census Data

# Business category 1 - Amusement Parks
# Converting census data to sf
tract_sf <- tract %>% st_sf()

# Spatial join
yelp_tract <- st_join(yelp_in, tract_sf, join = st_intersects)

# number of rows
cat('yelp_tract: ', nrow(yelp_tract))

## yelp_tract:  283

# View
yelp_tract %>% head()

## Simple feature collection with 6 features and 20 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -81.45621 ymin: 28.4435 xmax: -81.39174 ymax: 28.46855
## Geodetic CRS:  WGS 84
## # A tibble: 6 × 21
##   id        alias name  image_url is_closed url   review_count categories rating
##   <chr>     <chr> <chr> <chr>     <lgl>     <chr>        <int> <list>      <dbl>
## 1 smPOuy1C… fun-… Fun … https://… FALSE     http…          338 <df>          4  
## 2 zdik_Yi7… cray… Cray… https://… FALSE     http…          316 <df>          3.5
## 3 DAIJhOOo… amus… Amus… https://… FALSE     http…           15 <df>          4  
## 4 RJRmIaWp… deze… Deze… https://… FALSE     http…           58 <df>          3.5
## 5 pOlL77K3… k1-s… K1 S… https://… FALSE     http…           76 <df>          3  
## 6 rJ0IqoUU… inte… Inte… https://… FALSE     http…           15 <df>          3.5
## # ℹ 12 more variables: coordinates <df[,2]>, transactions <list>,
## #   location <df[,8]>, phone <chr>, display_phone <chr>, distance <dbl>,
## #   price <chr>, geometry <POINT [°]>, GEOID <chr>, NAME <chr>,
## #   hhincomeE <dbl>, hhincomeM <dbl>

# Visualising Map
tm_shape(yelp_tract) + tm_dots(col="hhincomeE") + tm_shape(tract) + tm_borders(alpha = 0.25)

# Business category 2 - Colleges/Universities
# Converting census data to sf
tract_sf <- tract %>% st_sf()

# Spatial join
yelp_tract1 <- st_join(yelp_in1, tract_sf, join = st_intersects)

# number of rows
cat('yelp_tract1: ', nrow(yelp_tract1))

## yelp_tract1:  44

# View
yelp_tract1 %>% head()

## Simple feature collection with 6 features and 19 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -81.44182 ymin: 28.42896 xmax: -81.39896 ymax: 28.48126
## Geodetic CRS:  WGS 84
## # A tibble: 6 × 20
##   id        alias name  image_url is_closed url   review_count categories rating
##   <chr>     <chr> <chr> <chr>     <lgl>     <chr>        <int> <list>      <dbl>
## 1 IasR2eyS… flor… Flor… https://… FALSE     http…            8 <df>          4.5
## 2 4c9GeRGL… nova… Nova… https://… FALSE     http…            3 <df>          4.5
## 3 ojScOMup… ucfs… UCF'… https://… FALSE     http…            8 <df>          3.5
## 4 ETfK0DRt… orla… Orla… https://… FALSE     http…            7 <df>          2  
## 5 UBs0893-… sout… Sout… https://… FALSE     http…            5 <df>          2.5
## 6 P_Q5apaV… inte… Inte… https://… FALSE     http…            1 <df>          1  
## # ℹ 11 more variables: coordinates <df[,2]>, transactions <list>,
## #   location <df[,8]>, phone <chr>, display_phone <chr>, distance <dbl>,
## #   geometry <POINT [°]>, GEOID <chr>, NAME <chr>, hhincomeE <dbl>,
## #   hhincomeM <dbl>

# Visualising Map
tm_shape(yelp_tract1) + tm_dots(col="hhincomeE") + tm_shape(tract) + tm_borders(alpha = 0.25)

Analysis Ratings and Review count Data

# Rating Frequency using count()
yelp_in %>% count(rating)

## Simple feature collection with 9 features and 2 fields
## Geometry type: MULTIPOINT
## Dimension:     XY
## Bounding box:  xmin: -81.60417 ymin: 28.35101 xmax: -81.20904 ymax: 28.62206
## Geodetic CRS:  WGS 84
## # A tibble: 9 × 3
##   rating     n                                                          geometry
## *  <dbl> <int>                                                  <MULTIPOINT [°]>
## 1    1       4 ((-81.45198 28.54056), (-81.37319 28.41593), (-81.45633 28.46259…
## 2    1.5     3 ((-81.54942 28.37594), (-81.45971 28.40973), (-81.46956 28.47804…
## 3    2       3   ((-81.59007 28.35496), (-81.5512 28.37281), (-81.47155 28.454))
## 4    2.5     3 ((-81.46873 28.44325), (-81.4475 28.46888), (-81.46782 28.47641))
## 5    3      17 ((-81.56646 28.38801), (-81.58863 28.35576), (-81.59209 28.35933…
## 6    3.5    34 ((-81.56387 28.38523), (-81.58753 28.35758), (-81.57511 28.35101…
## 7    4      81 ((-81.57932 28.40467), (-81.57559 28.41061), (-81.58826 28.35637…
## 8    4.5    99 ((-81.58077 28.40601), (-81.58704 28.35867), (-81.5924 28.36023)…
## 9    5      39 ((-81.60417 28.35328), (-81.57726 28.40405), (-81.5728 28.37811)…

yelp_in1 %>% count(rating)

## Simple feature collection with 9 features and 2 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -81.57868 ymin: 28.36538 xmax: -81.19555 ymax: 28.63514
## Geodetic CRS:  WGS 84
## # A tibble: 9 × 3
##   rating     n                                                          geometry
## *  <dbl> <int>                                                    <GEOMETRY [°]>
## 1    1       3 MULTIPOINT ((-81.388 28.61053), (-81.39941 28.46928), (-81.40134…
## 2    1.5     1                                        POINT (-81.30136 28.59656)
## 3    2       1                                        POINT (-81.39896 28.46649)
## 4    2.5     7 MULTIPOINT ((-81.31261 28.53901), (-81.30177 28.56697), (-81.195…
## 5    3       6 MULTIPOINT ((-81.30802 28.48221), (-81.25106 28.55384), (-81.196…
## 6    3.5     2           MULTIPOINT ((-81.20006 28.60244), (-81.44182 28.42896))
## 7    4       5 MULTIPOINT ((-81.28031 28.36722), (-81.24504 28.38473), (-81.309…
## 8    4.5     6 MULTIPOINT ((-81.20295 28.60392), (-81.34855 28.59226), (-81.385…
## 9    5      13 MULTIPOINT ((-81.50467 28.37024), (-81.28963 28.36538), (-81.280…

# Relation between ratings and review count [using group_by and summarise]
yelp_in %>%
    group_by(rating) %>%
    summarise(mean_count = mean(review_count))

## Simple feature collection with 9 features and 2 fields
## Geometry type: MULTIPOINT
## Dimension:     XY
## Bounding box:  xmin: -81.60417 ymin: 28.35101 xmax: -81.20904 ymax: 28.62206
## Geodetic CRS:  WGS 84
## # A tibble: 9 × 3
##   rating mean_count                                                     geometry
##    <dbl>      <dbl>                                             <MULTIPOINT [°]>
## 1    1         4.75 ((-81.45198 28.54056), (-81.37319 28.41593), (-81.45633 28.…
## 2    1.5      43.3  ((-81.54942 28.37594), (-81.45971 28.40973), (-81.46956 28.…
## 3    2       151    ((-81.59007 28.35496), (-81.5512 28.37281), (-81.47155 28.4…
## 4    2.5      58.7  ((-81.46873 28.44325), (-81.4475 28.46888), (-81.46782 28.4…
## 5    3        34.1  ((-81.56646 28.38801), (-81.58863 28.35576), (-81.59209 28.…
## 6    3.5     300.   ((-81.56387 28.38523), (-81.58753 28.35758), (-81.57511 28.…
## 7    4       154.   ((-81.57932 28.40467), (-81.57559 28.41061), (-81.58826 28.…
## 8    4.5      72.4  ((-81.58077 28.40601), (-81.58704 28.35867), (-81.5924 28.3…
## 9    5        11.7  ((-81.60417 28.35328), (-81.57726 28.40405), (-81.5728 28.3…

yelp_in1 %>%
    group_by(rating) %>%
    summarise(mean_count = mean(review_count))

## Simple feature collection with 9 features and 2 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -81.57868 ymin: 28.36538 xmax: -81.19555 ymax: 28.63514
## Geodetic CRS:  WGS 84
## # A tibble: 9 × 3
##   rating mean_count                                                     geometry
##    <dbl>      <dbl>                                               <GEOMETRY [°]>
## 1    1         1.33 MULTIPOINT ((-81.388 28.61053), (-81.39941 28.46928), (-81.…
## 2    1.5       5                                      POINT (-81.30136 28.59656)
## 3    2         7                                      POINT (-81.39896 28.46649)
## 4    2.5      24.7  MULTIPOINT ((-81.31261 28.53901), (-81.30177 28.56697), (-8…
## 5    3         9    MULTIPOINT ((-81.30802 28.48221), (-81.25106 28.55384), (-8…
## 6    3.5      43         MULTIPOINT ((-81.20006 28.60244), (-81.44182 28.42896))
## 7    4         3.6  MULTIPOINT ((-81.28031 28.36722), (-81.24504 28.38473), (-8…
## 8    4.5       6.17 MULTIPOINT ((-81.20295 28.60392), (-81.34855 28.59226), (-8…
## 9    5         1.77 MULTIPOINT ((-81.50467 28.37024), (-81.28963 28.36538), (-8…

# Relation between ratings and HH income [using group_by and summarise]
yelp_tract %>%
    group_by(rating) %>%
    summarise(mean_hhincome = mean(hhincomeE))

## Simple feature collection with 9 features and 2 fields
## Geometry type: MULTIPOINT
## Dimension:     XY
## Bounding box:  xmin: -81.60417 ymin: 28.35101 xmax: -81.20904 ymax: 28.62206
## Geodetic CRS:  WGS 84
## # A tibble: 9 × 3
##   rating mean_hhincome                                                  geometry
##    <dbl>         <dbl>                                          <MULTIPOINT [°]>
## 1    1          56550  ((-81.45198 28.54056), (-81.37319 28.41593), (-81.45633 …
## 2    1.5        56149. ((-81.54942 28.37594), (-81.45971 28.40973), (-81.46956 …
## 3    2          68052  ((-81.59007 28.35496), (-81.5512 28.37281), (-81.47155 2…
## 4    2.5        43569. ((-81.46873 28.44325), (-81.4475 28.46888), (-81.46782 2…
## 5    3          59748. ((-81.56646 28.38801), (-81.58863 28.35576), (-81.59209 …
## 6    3.5        61852. ((-81.56387 28.38523), (-81.58753 28.35758), (-81.57511 …
## 7    4          65449. ((-81.57932 28.40467), (-81.57559 28.41061), (-81.58826 …
## 8    4.5        66132. ((-81.58077 28.40601), (-81.58704 28.35867), (-81.5924 2…
## 9    5          71233. ((-81.60417 28.35328), (-81.57726 28.40405), (-81.5728 2…

yelp_tract1 %>%
    group_by(rating) %>%
    summarise(mean_hhincome = mean(hhincomeE))

## Simple feature collection with 9 features and 2 fields
## Geometry type: GEOMETRY
## Dimension:     XY
## Bounding box:  xmin: -81.57868 ymin: 28.36538 xmax: -81.19555 ymax: 28.63514
## Geodetic CRS:  WGS 84
## # A tibble: 9 × 3
##   rating mean_hhincome                                                  geometry
##    <dbl>         <dbl>                                            <GEOMETRY [°]>
## 1    1          39973. MULTIPOINT ((-81.388 28.61053), (-81.39941 28.46928), (-…
## 2    1.5        47946                                 POINT (-81.30136 28.59656)
## 3    2          40394                                 POINT (-81.39896 28.46649)
## 4    2.5        44846. MULTIPOINT ((-81.31261 28.53901), (-81.30177 28.56697), …
## 5    3          60947. MULTIPOINT ((-81.30802 28.48221), (-81.25106 28.55384), …
## 6    3.5        73162    MULTIPOINT ((-81.20006 28.60244), (-81.44182 28.42896))
## 7    4          69572. MULTIPOINT ((-81.28031 28.36722), (-81.24504 28.38473), …
## 8    4.5        63908. MULTIPOINT ((-81.20295 28.60392), (-81.34855 28.59226), …
## 9    5          71221. MULTIPOINT ((-81.50467 28.37024), (-81.28963 28.36538), …

Findings

There was a significant change that was seen after tidying the yelp data. The input dataset had 1217 amusement parks and 232 colleges/universities, which were reduced to 300 and 50 respectively after dropping all duplicate records. Further, the number was reduced to 283 amusement parks and 44 colleges/universities after eliminating all records outside Orange County, Fl. Apart from this the data was mostly tidy as it had no nested columns or missing coordinate values.

The most frequent rating score was 4.5 and 5 for amusement parks and colleges/universities respectively. While there seemed to be no distinct relation between number of reviews and rating for amusement parks, but for colleges/universities more number of reviews resulted in average rating score (ranging from 2.5 - 3.5). Similarly upon considering census data it was found that there was no relation between household income and ratings for amusement parks but, higher household income meant above average rating score for colleges/universities. The distributional pattern of expensive vs. cheap POIs could not be explored as the input yelp data did not have the price variable.