First, let’s call the same libraries we used last time.

library(tidycensus) #Lets us use Census api
library(sf) #allows us to read and write shapefifles
## Linking to GEOS 3.9.1, GDAL 3.4.3, PROJ 7.2.1; sf_use_s2() is TRUE
library(tmap) #visualizes simple maps 
library(jsonlite) #reads and writes json files
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag()     masks stats::lag()
library(httr) #lets us make api requests
library(jsonlite)
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(here) #stores paths
## here() starts at C:/Users/kwells65/OneDrive - Georgia Institute of Technology/Assignments
library(yelpr)
library(knitr)

Step One: Prepare the Yelp Data

I wrote my outputs from the script I created in the first mini assignment into an RDS file, which we will be using to wrangle the data.

#First, let's load the data 
dek_yelp <- read_rds(here("yelp_all_dek_data.rds"))
  
#now let's check it
dek_yelp
## # A tibble: 1,564 × 16
##    id           alias name  image…¹ is_cl…² url   revie…³ categ…⁴ rating coord…⁵
##    <chr>        <chr> <chr> <chr>   <lgl>   <chr>   <int> <list>   <dbl>   <dbl>
##  1 Sh7BBAHsDkN… firs… Firs… https:… FALSE   http…     268 <df>       3.5    33.9
##  2 C8SrEYsWjjG… j-ch… J Ch… https:… FALSE   http…      88 <df>       3      33.9
##  3 iJ0DwsHhE75… bell… Bell… https:… FALSE   http…       5 <df>       3.5    33.9
##  4 HW36mkQQdcX… park… Park… https:… FALSE   http…      35 <df>       5      34.0
##  5 aaajyKLRtLL… atla… Atla… https:… FALSE   http…      48 <df>       5      33.8
##  6 Cldc9nU5XRY… hydr… Hydr… https:… FALSE   http…      63 <df>       4      33.8
##  7 W4_ucUE30B4… hydr… Hydr… https:… FALSE   http…      37 <df>       4.5    33.8
##  8 Ns4Cu0YlZlv… b-ne… B Ne… https:… FALSE   http…      10 <df>       4.5    33.8
##  9 pLaH_zlvbF1… lux-… LUX … https:… FALSE   http…      16 <df>       4      33.9
## 10 5-OUTmlfQwB… hydr… Hydr… https:… FALSE   http…      11 <df>       4      33.9
## # … with 1,554 more rows, 7 more variables: coordinates$longitude <dbl>,
## #   transactions <list>, price <chr>, location <df[,8]>, phone <chr>,
## #   display_phone <chr>, distance <dbl>, and abbreviated variable names
## #   ¹​image_url, ²​is_closed, ³​review_count, ⁴​categories, ⁵​coordinates$latitude
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Everything looks okay, so let’s get to wrangling!

Step Two: Scrubbing the Data

We want to delete duplicates and flatten the dataset, check for and delete any null values, and get rid of points outside our bounding box.This can be done in one block of code, but since I’m still familiarizing myself with the dataset, I want the opportunity to check it at each stage.

The first step is to delete duplicate rows.

#The first step is to delete duplicated rows.
dek_yelp_unique <- dek_yelp %>% 
  distinct(id, .keep_all=T)

glue::glue("Before dropping the duplicates, there were {nrow(dek_yelp)} rows. After dropping them, there are {nrow(dek_yelp_unique)} rows") %>% 
  print()
## Before dropping the duplicates, there were 1564 rows. After dropping them, there are 240 rows
#double check
dek_yelp_unique
## # A tibble: 240 × 16
##    id           alias name  image…¹ is_cl…² url   revie…³ categ…⁴ rating coord…⁵
##    <chr>        <chr> <chr> <chr>   <lgl>   <chr>   <int> <list>   <dbl>   <dbl>
##  1 Sh7BBAHsDkN… firs… Firs… https:… FALSE   http…     268 <df>       3.5    33.9
##  2 C8SrEYsWjjG… j-ch… J Ch… https:… FALSE   http…      88 <df>       3      33.9
##  3 iJ0DwsHhE75… bell… Bell… https:… FALSE   http…       5 <df>       3.5    33.9
##  4 HW36mkQQdcX… park… Park… https:… FALSE   http…      35 <df>       5      34.0
##  5 aaajyKLRtLL… atla… Atla… https:… FALSE   http…      48 <df>       5      33.8
##  6 Cldc9nU5XRY… hydr… Hydr… https:… FALSE   http…      63 <df>       4      33.8
##  7 W4_ucUE30B4… hydr… Hydr… https:… FALSE   http…      37 <df>       4.5    33.8
##  8 Ns4Cu0YlZlv… b-ne… B Ne… https:… FALSE   http…      10 <df>       4.5    33.8
##  9 pLaH_zlvbF1… lux-… LUX … https:… FALSE   http…      16 <df>       4      33.9
## 10 5-OUTmlfQwB… hydr… Hydr… https:… FALSE   http…      11 <df>       4      33.9
## # … with 230 more rows, 7 more variables: coordinates$longitude <dbl>,
## #   transactions <list>, price <chr>, location <df[,8]>, phone <chr>,
## #   display_phone <chr>, distance <dbl>, and abbreviated variable names
## #   ¹​image_url, ²​is_closed, ³​review_count, ⁴​categories, ⁵​coordinates$latitude
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Hmmm. I AM a little concerned that so many of my points got deleted since it usually means either a.) something went wrong, or b.) most of our original dataset was unusable. I’m always a bit leery of analyses where lots of points were deleted and now would a time where I’d want to run a few sanity checks or do a QA/QC analysis.

For the time being, let’s move on.

Next, I’ll flatten nested columns that have multiple variables in one column. I’m going to take a look at the “category” column since interesting observations about the different types of businesses might help at later stages in my analysis.

This block of code uses a custom function. If you don’t already have it read into your environment, then use this:

concate_list <- function(x){
  # x is a data frame with columns "alias" and "title" from Yelp$categories
  # returns a character vector containing category concatenated titles 
  titles <- x[["title"]] %>% str_c(collapse = ", ")
  return(titles)
}

Now we can flatten our data!

dek_yelp_flat <- dek_yelp_unique %>% 
  #this line of code is flattening columns with data frame
  
  jsonlite::flatten() %>% 
  relocate(coordinates.latitude, coordinates.longitude, .after = rating) %>% 
  relocate(location.address1, location.address2, location.address3, location.city, 
           location.zip_code, location.country, location.state, location.display_address, .after = price) %>% 
  as_tibble() %>%
  # Flatten list-columns ("transactions" (empty), "location.display_address", and "categories")
  mutate(transactions = transactions %>% map_chr(function(x) str_c(x, collapse = ", ")),
         location.display_address = location.display_address %>% map_chr(function(x) str_c(x, collapse = ", ")),
         categories = categories %>% map_chr(concate_list))

#I'll take another look before we move on
dek_yelp_flat
## # A tibble: 240 × 24
##    id           alias name  image…¹ is_cl…² url   revie…³ categ…⁴ rating coord…⁵
##    <chr>        <chr> <chr> <chr>   <lgl>   <chr>   <int> <chr>    <dbl>   <dbl>
##  1 Sh7BBAHsDkN… firs… Firs… https:… FALSE   http…     268 Cafes,…    3.5    33.9
##  2 C8SrEYsWjjG… j-ch… J Ch… https:… FALSE   http…      88 Creper…    3      33.9
##  3 iJ0DwsHhE75… bell… Bell… https:… FALSE   http…       5 Breakf…    3.5    33.9
##  4 HW36mkQQdcX… park… Park… https:… FALSE   http…      35 Medica…    5      34.0
##  5 aaajyKLRtLL… atla… Atla… https:… FALSE   http…      48 Skin C…    5      33.8
##  6 Cldc9nU5XRY… hydr… Hydr… https:… FALSE   http…      63 Medica…    4      33.8
##  7 W4_ucUE30B4… hydr… Hydr… https:… FALSE   http…      37 IV Hyd…    4.5    33.8
##  8 Ns4Cu0YlZlv… b-ne… B Ne… https:… FALSE   http…      10 Skin C…    4.5    33.8
##  9 pLaH_zlvbF1… lux-… LUX … https:… FALSE   http…      16 Medica…    4      33.9
## 10 5-OUTmlfQwB… hydr… Hydr… https:… FALSE   http…      11 IV Hyd…    4      33.9
## # … with 230 more rows, 14 more variables: coordinates.longitude <dbl>,
## #   transactions <chr>, price <chr>, location.address1 <chr>,
## #   location.address2 <chr>, location.address3 <chr>, location.city <chr>,
## #   location.zip_code <chr>, location.country <chr>, location.state <chr>,
## #   location.display_address <chr>, phone <chr>, display_phone <chr>,
## #   distance <dbl>, and abbreviated variable names ¹​image_url, ²​is_closed,
## #   ³​review_count, ⁴​categories, ⁵​coordinates.latitude
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Next, let’s check for NA values. Here’s a good time to think about what variables you want to analyze as deleting them will be necessary. But the most important part is to check for null location-based data. The block of code below does just that.

dek_yelp_flat %>% 
  map_dbl(., function(x) sum(is.na(x))) 
##                       id                    alias                     name 
##                        0                        0                        0 
##                image_url                is_closed                      url 
##                        0                        0                        0 
##             review_count               categories                   rating 
##                        0                        0                        0 
##     coordinates.latitude    coordinates.longitude             transactions 
##                        0                        0                        0 
##                    price        location.address1        location.address2 
##                       86                        1                       38 
##        location.address3            location.city        location.zip_code 
##                       65                        0                        0 
##         location.country           location.state location.display_address 
##                        0                        0                        0 
##                    phone            display_phone                 distance 
##                        0                        0                        0
# map_dbl is a variant of map() which outputs numeric vector rather than a list.

It looks like there are no NAs in my location-based data, but I DO want to look at price, so let’s drop those points in the price variable with null fields.

dek_yelp_dropna <- dek_yelp_flat %>% 
  drop_na(price)

#Once again, let's take a look
dek_yelp_dropna
## # A tibble: 154 × 24
##    id           alias name  image…¹ is_cl…² url   revie…³ categ…⁴ rating coord…⁵
##    <chr>        <chr> <chr> <chr>   <lgl>   <chr>   <int> <chr>    <dbl>   <dbl>
##  1 Sh7BBAHsDkN… firs… Firs… https:… FALSE   http…     268 Cafes,…    3.5    33.9
##  2 C8SrEYsWjjG… j-ch… J Ch… https:… FALSE   http…      88 Creper…    3      33.9
##  3 HW36mkQQdcX… park… Park… https:… FALSE   http…      35 Medica…    5      34.0
##  4 Cldc9nU5XRY… hydr… Hydr… https:… FALSE   http…      63 Medica…    4      33.8
##  5 pLaH_zlvbF1… lux-… LUX … https:… FALSE   http…      16 Medica…    4      33.9
##  6 l5DR2U_9o4T… body… Body… https:… FALSE   http…       4 Medica…    5      33.6
##  7 C1FUOkvNBVn… mrs-… Mrs … https:… FALSE   http…      34 Chicke…    2.5    33.7
##  8 Tg8JUedupnx… waff… Waff… https:… FALSE   http…      25 Diners…    2      33.9
##  9 hK_GXfqAWQg… face… Face… https:… FALSE   http…      18 Waxing…    5      33.9
## 10 0ZNXgXXRhTe… duck… Duck… https:… FALSE   http…      58 Donuts…    4.5    33.8
## # … with 144 more rows, 14 more variables: coordinates.longitude <dbl>,
## #   transactions <chr>, price <chr>, location.address1 <chr>,
## #   location.address2 <chr>, location.address3 <chr>, location.city <chr>,
## #   location.zip_code <chr>, location.country <chr>, location.state <chr>,
## #   location.display_address <chr>, phone <chr>, display_phone <chr>,
## #   distance <dbl>, and abbreviated variable names ¹​image_url, ²​is_closed,
## #   ³​review_count, ⁴​categories, ⁵​coordinates.latitude
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

When I look at my map from last time, there are lots of points located outside the bounding box. Since these are errors, I’ll delete them.

census <- st_read("https://raw.githubusercontent.com/BonwooKoo/UrbanAnalytics2022/main/Lab/module_0/testdata.geojson") 
## Reading layer `testdata' from data source 
##   `https://raw.githubusercontent.com/BonwooKoo/UrbanAnalytics2022/main/Lab/module_0/testdata.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 519 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.85071 ymin: 33.35246 xmax: -84.02371 ymax: 34.18629
## Geodetic CRS:  WGS 84
#Then, I'll create an sf object and subset my data
dek_sf <- dek_yelp_dropna %>%
  st_as_sf(coords=c("coordinates.longitude", "coordinates.latitude"), crs = 4326)

#now for the subsets
dek_yelp_in <- dek_sf[census %>% 
                     filter(county %in%  "DeKalb County") %>% 
                     st_union(), ,op = st_intersects]

Finally, let’s compare our datasets

glue::glue("nrow before: {nrow(dek_yelp)} -> nrow after: {nrow(dek_yelp_in)} \n
            ncol before: {ncol(dek_yelp)} -> ncol after: {ncol(dek_yelp_in)} \n") %>% 
  print()
## nrow before: 1564 -> nrow after: 106 
## 
## ncol before: 16 -> ncol after: 23

Time to visualize it!

Step Three: Mapping the Data

Let’s start off taking a look at the dataset I just scrubbed.

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(dek_yelp_in) + tm_dots(col = "price")

Not too shabby, all things considered. But it’s kind of boring and doesn’t tell us much, so let’s supplement it with some data.

I want see if there’s a relationship between price, ratings, and household income, so let me bring in some of the Census tract data I used last time.

dek_tracts <- suppressMessages(
  get_acs(geography = "tract", #you can use other geographies like 'county' or 'state' here
          state = "GA",
          county = "Dekalb", 
          variables = c(hhincome = "B19019_001", population = "B01003_001"),
          year = 2019,
          survey = "acs5", 
          geometry = TRUE, 
          output = "wide"))
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |=======                                                               |  11%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |=========                                                             |  12%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |============                                                          |  16%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |============                                                          |  18%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |=================                                                     |  25%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=====================                                                 |  29%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |=======================                                               |  32%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==========================                                            |  38%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |============================                                          |  39%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |===============================                                       |  45%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================                                  |  51%
  |                                                                            
  |=====================================                                 |  52%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |=====================================================                 |  75%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |===================================================================   |  95%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================| 100%
#Now I need to select the variables I want. 

dek_tracts2 <- dek_tracts %>% 
  select(GEOID, 
         hhincome = hhincomeE, 
         population = populationE)

#let's check our CRS before we move on

head(dek_tracts2)
## Simple feature collection with 6 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.34782 ymin: 33.71426 xmax: -84.11111 ymax: 33.97001
## Geodetic CRS:  NAD83
##         GEOID hhincome population                       geometry
## 1 13089021213   154063       3526 MULTIPOLYGON (((-84.34783 3...
## 2 13089023506    45924       6465 MULTIPOLYGON (((-84.25237 3...
## 3 13089021305    55109       4970 MULTIPOLYGON (((-84.28811 3...
## 4 13089023313    55143       5294 MULTIPOLYGON (((-84.14593 3...
## 5 13089021604   159306       3237 MULTIPOLYGON (((-84.31051 3...
## 6 13089021913    32983       4450 MULTIPOLYGON (((-84.1905 33...
#Knew it! They're different. We already have dek_yelp_in set to the one we want,
#so let's convert our census data. 
dek_tract_geom <- st_transform(dek_tracts2, crs=4326)

Now that I’ve got everything ready, I can make my maps.

tm_shape(dek_tract_geom) + tm_polygons(col = "hhincome", title= 'Household Income (2019)',palette= "Greens", id= "hhincome") + 
  tm_shape(dek_yelp_in)  + tm_dots(col="price", palette= "YlOrBr", title = 'Brunch Restaurants and Spas by price', id = 'name')

I wanted to use colors that tell us something about the data we’re looking at, so picked a green color palette for income. For the price, I chose a gradient scale to show prices going from low to high. Most cool color ramps would blend in too much, so I picked orange as it is a complimentary color to green and is more colorblind-friendly.

Let’s look at ratings, next.

tm_shape(dek_tract_geom) + tm_polygons(col = "hhincome", title= 'Household Income (2019)',palette= "BuGn", id= "hhincome") + 
  tm_shape(dek_yelp_in)  + tm_dots(col="rating", palette= "-RdYlBu", title = 'Brunch Restaurants and Spas by Rating', id = 'name')

Having maps that are too identical can be confusing, so I tried to go with slightly different colors. For ratings, I chose a divergent palette because I wanted to show establishments people weren’t as impressed with (or are “cold” on) against places people really liked (a “hot” brunch spot). This color ramp is less colorblind friendly when combined with my Census data, so I chose a green palette with a blue tone in the hopes of minimizing that.

Something more sophisticated than just layering ratings and price over income might be useful–like looking at it as a proportion of another variable, or adding data related to waklability, density, and other income indicators–but I’d like to wait until I analyze things a bit more.

Step Four: Discussing the New Findings

There appears to be at least some interesting about the relationships between income, ratings and price, but until I can figure out why so many points were deleted and do some exploratory analyses, I’d hesitate to say anything more certain than that–especially because of the aforementioned concerns over how drastically my initial dataset changed. However, because I know the area well, I can pick out a few interesting observations:

The rectangular census tract in the southwest part of the county is the Eastlake golf course (for reference, the other empty tract is where the prison is), so it’s not surprising that it has highly rated restaurants around it.

The most expensive places are unsurprisingly spas offering medical grade treatments, which suggests I may have to separate my data by category to get better results in the future. (I suspected as much.) Many of the restaurants in the two-to-three dollar signs range are in and around ore affluent neighborhoods and/or private colleges (like Emory) where students might not be making high incomes, but may have money anyway. Moreover, some of the counties in the lower-income tracts also appear to have good restaurants and an inexpensive price point.