(Don’t need to show the libraries).
From Mini Assignment 1, I saved the data frames
yelp_bank, yelp_carpenters, and
yelp_final (the combined data) into one file under a new name.
Then I load that file into this project.
# Saved from Mini_Assignment_1.Rmd
# save(yelp_carpenters, yelp_bank, yelp_final, file = 'data_all.RData')
load('data_all.RData')
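As a minimal sketch of this save-and-load round trip, the example below writes two small stand-in data frames to a temporary .RData file and restores them. The object names toy_bank and toy_carpenters and the temporary path are hypothetical, not the assignment's real data.

```r
# Hypothetical stand-ins for the Yelp data frames
toy_bank <- data.frame(id = c("b1", "b2"), rating = c(2.5, 5))
toy_carpenters <- data.frame(id = "c1", rating = 4)

# save() stores several objects, together with their names, in one file
rdata_path <- tempfile(fileext = ".RData")
save(toy_bank, toy_carpenters, file = rdata_path)

# Remove the objects, then restore them under their original names;
# load() invisibly returns the names of the objects it restored
rm(toy_bank, toy_carpenters)
restored <- load(rdata_path)
print(restored)
```

Because load() recreates objects under the names they were saved with, it can silently overwrite objects of the same name already in the workspace.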
There are a few steps to clean up the way the data is presented, so it’s easier to work with and more reliable for visualizations.
Check for duplicated rows in yelp_final,
which is the combined data.
yelp_final %>% distinct(.) # returning no duplicates
## # A tibble: 428 × 16
## id alias name image_url is_closed url review_count categories rating
## <chr> <chr> <chr> <chr> <lgl> <chr> <int> <list> <dbl>
## 1 Rd5vXW3… citi… Citi… "https:/… FALSE http… 2 <df> 5
## 2 sFLTL1s… chas… Chas… "https:/… FALSE http… 2 <df> 2
## 3 8EzNyE_… well… Well… "https:/… FALSE http… 4 <df> 2
## 4 YFs1mwU… well… Well… "https:/… FALSE http… 22 <df> 2.5
## 5 tDmsP7e… bond… BOND… "https:/… FALSE http… 12 <df> 2.5
## 6 I4ajnYe… chas… Chas… "https:/… FALSE http… 12 <df> 2.5
## 7 C6iSJji… bank… Bank… "https:/… FALSE http… 26 <df> 2
## 8 aiSmo01… syno… Syno… "https:/… FALSE http… 5 <df> 3
## 9 Rd5vXW3… citi… Citi… "https:/… FALSE http… 2 <df> 5
## 10 FBDOrMp… bank… Bank… "" FALSE http… 3 <df> 2.5
## # ℹ 418 more rows
## # ℹ 7 more variables: coordinates <df[,2]>, transactions <list>,
## # location <df[,8]>, phone <chr>, display_phone <chr>, distance <dbl>,
## # business_type <chr>
# dupl_df <- data.frame(yelp_bank = c("id", "alias", "name"),
#                       yelp_carpenters = c("id", "alias", "name"))
# duplicated(dupl_df)
No duplicates turned up, but the commented line below is an example of how to drop rows that repeat a value in one column.
# Duplicates in column "location" removed.
# dupl_df[!duplicated(dupl_df$location),]
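To make the two kinds of de-duplication concrete, here is a small sketch on hypothetical toy data (the data frame toy and its values are invented for illustration). It contrasts dropping fully identical rows with keeping only the first row per value of a single column.

```r
library(dplyr)

# Hypothetical toy data: row 3 is an exact copy of row 1
toy <- data.frame(
  id = c("a", "b", "a", "c"),
  location = c("Atlanta", "Decatur", "Atlanta", "Decatur")
)

# Drop rows that are exact copies of an earlier row
toy_rows <- distinct(toy)

# Keep only the first row for each value of one column
toy_by_location <- distinct(toy, location, .keep_all = TRUE)

# Base R version of the single-column case
toy_by_location_base <- toy[!duplicated(toy$location), ]

nrow(toy_rows)
nrow(toy_by_location)
```

De-duplicating on a single column is much more aggressive than de-duplicating whole rows, so it is worth comparing row counts before choosing one.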
Flatten the nested columns.
yelp_flat <- yelp_final %>%
  unnest_wider(categories, names_sep = "_") %>% # as a new data frame using a new name
  unnest_wider(coordinates, names_sep = "_") %>% # use _ separator to replace the $ in original data set
  unnest_wider(location, names_sep = "_")
The new yelp_flat contains 25 variables, while
yelp_final contains 16. I use yelp_flat from here
forward.
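The effect of names_sep when flattening can be seen on a small example. This is only a sketch on a hypothetical tibble (toy, with an invented list-column of named lists), not the Yelp data itself.

```r
library(tidyr)

# Hypothetical toy data: one list-column holding named elements per row
toy <- tibble::tibble(
  id = c("a", "b"),
  coordinates = list(
    list(latitude = 33.77, longitude = -84.29),
    list(latitude = 33.81, longitude = -84.17)
  )
)

# Each named element becomes its own column, prefixed with the
# original column name and the separator
toy_flat <- unnest_wider(toy, coordinates, names_sep = "_")

names(toy_flat)
```

Prefixing with the parent column name keeps the new columns unambiguous when several nested columns share element names.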
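Drop rows with NA in the new, separated coordinate columns.
yelp_flat <- yelp_flat %>% # assign the result so the filter is kept
  filter(!is.na(coordinates_latitude)) %>%
  filter(!is.na(coordinates_longitude)) # the two coordinate columns can contain NA, from looking into yelp_flat data
yelp_flat # print the filtered data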
## # A tibble: 428 × 25
## id alias name image_url is_closed url review_count categories_alias
## <chr> <chr> <chr> <chr> <lgl> <chr> <int> <list<chr>>
## 1 Rd5vXW3G… citi… Citi… "https:/… FALSE http… 2 [1]
## 2 sFLTL1sr… chas… Chas… "https:/… FALSE http… 2 [1]
## 3 8EzNyE_7… well… Well… "https:/… FALSE http… 4 [1]
## 4 YFs1mwUA… well… Well… "https:/… FALSE http… 22 [1]
## 5 tDmsP7eT… bond… BOND… "https:/… FALSE http… 12 [2]
## 6 I4ajnYe2… chas… Chas… "https:/… FALSE http… 12 [1]
## 7 C6iSJji5… bank… Bank… "https:/… FALSE http… 26 [1]
## 8 aiSmo01K… syno… Syno… "https:/… FALSE http… 5 [1]
## 9 Rd5vXW3G… citi… Citi… "https:/… FALSE http… 2 [1]
## 10 FBDOrMpN… bank… Bank… "" FALSE http… 3 [1]
## # ℹ 418 more rows
## # ℹ 17 more variables: categories_title <list<chr>>, rating <dbl>,
## # coordinates_latitude <dbl>, coordinates_longitude <dbl>,
## # transactions <list>, location_address1 <chr>, location_address2 <chr>,
## # location_address3 <chr>, location_city <chr>, location_zip_code <chr>,
## # location_country <chr>, location_state <chr>,
## # location_display_address <list<list>>, phone <chr>, display_phone <chr>, …
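As a quick sanity check for this kind of NA filter, the sketch below runs the same pattern on hypothetical toy data (toy_flat and its values are invented) and compares row counts before and after. The filtered result is stored under a new name, since a pipeline that is only printed leaves its input unchanged.

```r
library(dplyr)

# Hypothetical toy data with one missing latitude and one missing longitude
toy_flat <- data.frame(
  name = c("Bank A", "Bank B", "Carpenter C"),
  coordinates_latitude = c(33.77, NA, 33.81),
  coordinates_longitude = c(-84.29, -84.20, NA)
)

# Keep only rows where both coordinates are present
toy_clean <- toy_flat %>%
  filter(!is.na(coordinates_latitude), !is.na(coordinates_longitude))

nrow(toy_flat)  # rows before filtering
nrow(toy_clean) # rows after filtering
```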
Finally, we need to remove rows that are not inside the census tract boundary. Instead of importing it from another .Rmd file, I’m just going to re-initialize the census tract here.
# need to get polygon data here, choose a different variable
tract <- tidycensus::get_acs(geography = "tract",
                             state = "GA",
                             county = "Dekalb",
                             variables = c(population = "B01003_001",
                                           medianincome = "B19013_001"),
                             year = 2019,
                             survey = "acs5",
                             geometry = TRUE, # returns sf objects
                             output = "wide")
## Getting data from the 2015-2019 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
# atlanta <- places('GA') %>%
# filter(NAME %in% c('Stone Mountain', 'Atlanta')) ##Dekalb county stretches into two cities
#
# tract <- tract[atlanta,]
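# A filter on NAME for 'Stone Mountain' or 'Atlanta' would match nothing here,
# because tract$NAME holds tract names (e.g. "Census Tract 212.13, DeKalb County, Georgia"),
# so the tract polygons themselves serve as the boundary.
# Make yelp_flat an sf object
yelp_sf <- yelp_flat %>%
  st_as_sf(coords = c("coordinates_longitude", "coordinates_latitude"), crs = st_crs(tract))
# Spatial join between yelp_sf and tract; left = FALSE drops businesses outside every tract
filtered_yelp <- st_join(yelp_sf, tract, left = FALSE)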
## View acs data
tract
## Simple feature collection with 145 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -84.35022 ymin: 33.61467 xmax: -84.02371 ymax: 33.97088
## Geodetic CRS: NAD83
## First 10 features:
## GEOID NAME populationE
## 1 13089021213 Census Tract 212.13, DeKalb County, Georgia 3526
## 2 13089023506 Census Tract 235.06, DeKalb County, Georgia 6465
## 3 13089021305 Census Tract 213.05, DeKalb County, Georgia 4970
## 4 13089023313 Census Tract 233.13, DeKalb County, Georgia 5294
## 5 13089021604 Census Tract 216.04, DeKalb County, Georgia 3237
## 6 13089021913 Census Tract 219.13, DeKalb County, Georgia 4450
## 7 13089021906 Census Tract 219.06, DeKalb County, Georgia 5572
## 8 13089021413 Census Tract 214.13, DeKalb County, Georgia 4081
## 9 13089021911 Census Tract 219.11, DeKalb County, Georgia 1569
## 10 13089023114 Census Tract 231.14, DeKalb County, Georgia 2901
## populationM medianincomeE medianincomeM geometry
## 1 204 154063 19674 MULTIPOLYGON (((-84.34783 3...
## 2 927 45924 13793 MULTIPOLYGON (((-84.25237 3...
## 3 391 55109 4607 MULTIPOLYGON (((-84.28811 3...
## 4 576 55143 5672 MULTIPOLYGON (((-84.14593 3...
## 5 254 159306 38073 MULTIPOLYGON (((-84.31051 3...
## 6 559 32983 3760 MULTIPOLYGON (((-84.1905 33...
## 7 570 46448 4613 MULTIPOLYGON (((-84.187 33....
## 8 481 47885 10004 MULTIPOLYGON (((-84.32911 3...
## 9 348 27835 5106 MULTIPOLYGON (((-84.19619 3...
## 10 322 51105 3293 MULTIPOLYGON (((-84.24137 3...
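The idea of keeping only the points that fall inside a boundary can be sketched with sf on hypothetical toy data. The square polygon, the three business points, and all names below are invented for illustration. The sketch relies on st_join() acting as an inner join when left = FALSE, so points that match no polygon are dropped.

```r
library(sf)

# Hypothetical boundary: one square polygon in longitude/latitude (EPSG:4326)
square <- st_polygon(list(rbind(
  c(-84.3, 33.7), c(-84.1, 33.7), c(-84.1, 33.9), c(-84.3, 33.9), c(-84.3, 33.7)
)))
boundary <- st_sf(tract_id = "toy_tract", geometry = st_sfc(square, crs = 4326))

# Hypothetical businesses: the third point lies east of the square
businesses <- data.frame(
  name = c("Bank A", "Carpenter B", "Bank C"),
  lon = c(-84.2, -84.15, -83.9),
  lat = c(33.8, 33.75, 33.8)
)
businesses_sf <- st_as_sf(businesses, coords = c("lon", "lat"), crs = 4326)

# Inner spatial join: keeps matched points and attaches the polygon attributes
inside <- st_join(businesses_sf, boundary, join = st_intersects, left = FALSE)

inside$name
```

With the default left = TRUE, every point would be kept and unmatched points would simply get NA in the polygon columns, so nothing would be filtered out.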
The original map included businesses outside the tract boundary of DeKalb County.
This is a static map that shows the business locations within the DeKalb County boundary.
The map itself is not rendering here, so this static image is shown instead.
knitr::include_graphics("assignment_2_map.png")
During the data preparation phase, I noticed that the banks seemed denser at the borders of the tract area (DeKalb County) and around major streets and interstates. These are assumptions I would want to validate if that information proved useful.
Looking at the new map, I notice that many of the banks are in the central-western portion of DeKalb County, which falls more within the city of Atlanta. The cities in that part of the map also seem smaller and more closely packed. The carpenters' businesses are sparser and more likely to be located in cities that cover larger portions of the county.
Carpenters do not outnumber banks in the tract area, but they tend to sit farther east than the dense clusters of banks, that is, farther from Atlanta. I don't see a trustworthy way to validate some of my hypotheses now, such as whether the carpenter listings are over-represented by self-owned businesses that self-report alternate addresses, so I did not check for that.
The data is now much tidier, and the pattern of dense bank clusters to the west and sparse carpenters becomes more obvious.