I’m working with the pizza data again this week, mostly because thinking of a different dataset overwhelmed me and also I like thinking about pizza. Here are the packages that I think I used:
library(tidyverse)
-- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.0.0 v purrr 0.2.5
v tibble 1.4.2 v dplyr 0.7.6
v tidyr 0.8.1 v stringr 1.3.1
v readr 1.1.1 v forcats 0.3.0
-- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(sp)
library(tmap)
library(ggplot2)
library(rgdal)
rgdal: version: 1.3-6, (SVN revision 773)
Geospatial Data Abstraction Library extensions to R successfully loaded
Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
Path to GDAL shared files: C:/Users/monica/Documents/R/win-library/3.5/rgdal/gdal
GDAL binary built with GEOS: TRUE
Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
Path to PROJ.4 shared files: C:/Users/monica/Documents/R/win-library/3.5/rgdal/proj
Linking to sp version: 1.3-1
library(tigris)
To enable
caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
Attaching package: <U+393C><U+3E31>tigris<U+393C><U+3E32>
The following object is masked from <U+393C><U+3E31>package:graphics<U+393C><U+3E32>:
plot
library(RColorBrewer)
library("tmaptools")
Let’s get the pizza data back in:
dsn <- "C:/Users/monica/Documents/Courses/Q7_Fall2018/GEOG 208/week6_project"
setwd(dsn)
column_names = c("submitted","name","age","gender","ethnicity","state","city","us_born","cntry_birth","pizza_freq","pizza_rating","pizza_1word","best_1","best_2","best_3","worst_1","worst_2","worst_3","add_ons","no_of_toppings","most_surprising","unacceptable","pineapple_q","diet","tacos","hamburgers","sushi","salad","bagels","steak","email")
untidy_pizza <- read_csv("the_pizza_survey.csv", skip = 1, col_names = column_names)
Parsed with column specification:
cols(
.default = col_character(),
age = col_integer(),
pizza_rating = col_integer(),
tacos = col_integer(),
hamburgers = col_integer(),
sushi = col_integer(),
salad = col_integer(),
bagels = col_integer(),
steak = col_integer()
)
See spec(...) for full column specifications.
head(untidy_pizza)
# A tibble: 6 x 31
submitted name age gender ethnicity state city us_born cntry_birth pizza_freq pizza_rating pizza_1word best_1
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
1 2018/10/~ <NA> 63 Male Asian Cali~ Sant~ No Philippines About onc~ 3 Acceptable Fresh~
2 2018/10/~ Cass~ 20 Female White Cali~ San ~ Yes N/A About onc~ 4 Delicious Spina~
3 2018/10/~ Moni~ 28 Female Asian Cali~ Los ~ Yes <NA> About onc~ 4 Delicious Mushr~
4 2018/10/~ Matt 25 Male Asian Cali~ Cost~ Yes <NA> Less than~ 4 Comfortabl~ Peppe~
5 2018/10/~ Soph~ 22 Female White;As~ Cali~ East~ Yes <NA> About onc~ 4 Delicious Mushr~
6 2018/10/~ ciro 29 Male Prefer n~ Cali~ Los ~ No Germany About onc~ 3 Delicious Sausa~
# ... with 18 more variables: best_2 <chr>, best_3 <chr>, worst_1 <chr>, worst_2 <chr>, worst_3 <chr>, add_ons <chr>,
# no_of_toppings <chr>, most_surprising <chr>, unacceptable <chr>, pineapple_q <chr>, diet <chr>, tacos <int>,
# hamburgers <int>, sushi <int>, salad <int>, bagels <int>, steak <int>, email <chr>
pizza <- untidy_pizza[c(2:30)] %>%
mutate(participant_id = row_number()) %>%
filter(age <= 120)
rm(untidy_pizza)
1. Where did people take the pizza survey?
I want to visualize how many respondents there were by state.
state_responses <- pizza %>%
group_by(state) %>%
summarise(responses = n())
summary(state_responses)
state responses
Length:21 Min. : 1.000
Class :character 1st Qu.: 1.000
Mode :character Median : 1.000
Mean : 6.952
3rd Qu.: 3.000
Max. :89.000
I’ll need some spatial data:
states_tig <- states(cb = TRUE)
|
| | 0%
|
|= | 0%
|
|= | 1%
|
|== | 2%
|
|=== | 3%
|
|==== | 3%
|
|===== | 4%
|
|====== | 5%
|
|======= | 6%
|
|======== | 7%
|
|========= | 8%
|
|========== | 9%
|
|=========== | 10%
|
|============ | 11%
|
|============= | 12%
|
|============== | 13%
|
|=============== | 14%
|
|================ | 15%
|
|================= | 15%
|
|================== | 16%
|
|=================== | 17%
|
|==================== | 18%
|
|===================== | 19%
|
|====================== | 20%
|
|======================= | 21%
|
|======================== | 22%
|
|========================= | 22%
|
|========================== | 23%
|
|=========================== | 24%
|
|============================ | 25%
|
|============================= | 26%
|
|============================== | 27%
|
|=============================== | 28%
|
|================================ | 28%
|
|================================= | 29%
|
|================================= | 30%
|
|================================== | 31%
|
|=================================== | 32%
|
|==================================== | 33%
|
|===================================== | 34%
|
|====================================== | 34%
|
|======================================= | 35%
|
|======================================== | 36%
|
|========================================= | 37%
|
|========================================== | 38%
|
|=========================================== | 39%
|
|============================================ | 40%
|
|============================================= | 41%
|
|============================================== | 41%
|
|=============================================== | 42%
|
|================================================ | 43%
|
|================================================= | 44%
|
|================================================== | 45%
|
|=================================================== | 46%
|
|==================================================== | 47%
|
|===================================================== | 47%
|
|====================================================== | 48%
|
|======================================================= | 49%
|
|======================================================= | 50%
|
|======================================================== | 51%
|
|========================================================= | 52%
|
|========================================================== | 53%
|
|=========================================================== | 53%
|
|============================================================ | 54%
|
|============================================================= | 55%
|
|============================================================== | 56%
|
|=============================================================== | 57%
|
|================================================================ | 58%
|
|================================================================= | 59%
|
|================================================================== | 59%
|
|=================================================================== | 60%
|
|==================================================================== | 61%
|
|===================================================================== | 62%
|
|====================================================================== | 63%
|
|======================================================================= | 64%
|
|======================================================================== | 65%
|
|========================================================================= | 66%
|
|========================================================================== | 66%
|
|=========================================================================== | 67%
|
|============================================================================ | 68%
|
|============================================================================= | 69%
|
|============================================================================== | 70%
|
|============================================================================== | 71%
|
|=============================================================================== | 72%
|
|================================================================================ | 72%
|
|================================================================================= | 73%
|
|================================================================================== | 74%
|
|=================================================================================== | 75%
|
|==================================================================================== | 76%
|
|===================================================================================== | 77%
|
|====================================================================================== | 78%
|
|======================================================================================= | 78%
|
|======================================================================================== | 79%
|
|========================================================================================= | 80%
|
|========================================================================================== | 81%
|
|=========================================================================================== | 82%
|
|============================================================================================ | 83%
|
|============================================================================================= | 84%
|
|============================================================================================== | 85%
|
|=============================================================================================== | 85%
|
|================================================================================================ | 86%
|
|================================================================================================= | 87%
|
|================================================================================================== | 88%
|
|=================================================================================================== | 89%
|
|==================================================================================================== | 90%
|
|===================================================================================================== | 91%
|
|====================================================================================================== | 92%
|
|======================================================================================================= | 93%
|
|======================================================================================================== | 94%
|
|========================================================================================================= | 95%
|
|========================================================================================================== | 96%
|
|=========================================================================================================== | 97%
|
|============================================================================================================ | 97%
|
|============================================================================================================= | 98%
|
|============================================================================================================== | 99%
|
|===============================================================================================================| 100%
And some Census 10 population data that I downloaded:
state_pop <- read_csv("state_pop_2010.csv", skip = 1, col_names = c("geoid10", "st_abbr", "state", "total_pop"))
Parsed with column specification:
cols(
geoid10 = col_character(),
st_abbr = col_character(),
state = col_character(),
total_pop = col_integer()
)
head(state_pop)
# A tibble: 6 x 4
geoid10 st_abbr state total_pop
<chr> <chr> <chr> <int>
1 56 WY Wyoming 563626
2 42 PA Pennsylvania 12702379
3 39 OH Ohio 11536504
4 35 NM New Mexico 2059179
5 24 MD Maryland 5773552
6 44 RI Rhode Island 1052567
I need to piece these three together, using a clumsy combination of join and merge. In the join step, I will also calculate the % of each state’s population that participated in the survey, because obviously that’s going to be very valuable data.
join_step <- state_pop %>%
left_join(state_responses, by = "state") %>%
mutate(perc_pizza_pop = responses / total_pop)
pizza_states <- merge(x = states_tig, y = join_step, by.x = "NAME", by.y = "state")
To keep things simple, and because no one from Alaska or Puerto Rico took my survey, I am going to stick to the CONUS:
state_list <- join_step$state
rm(join_step, states_tig)
pizza_conus <- subset(pizza_states, NAME %in% state_list)
pizza_conus <- pizza_conus[pizza_conus$NAME != "Hawaii",]
pizza_conus <- pizza_conus[pizza_conus$NAME != "Alaska",]
pizza_conus <- pizza_conus[pizza_conus$NAME != "Puerto Rico",]
summary(pizza_conus)
Object of class SpatialPolygonsDataFrame
Coordinates:
min max
x -124.7631 -66.94989
y 24.5231 49.38436
Is projected: FALSE
proj4string :
[+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0]
Data attributes:
NAME STATEFP STATENS AFFGEOID GEOID STUSPS
Length:49 Length:49 Length:49 Length:49 Length:49 Length:49
Class :character Class :character Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character
LSAD ALAND AWATER geoid10 st_abbr total_pop
Length:49 Length:49 Length:49 Length:49 Length:49 Min. : 563626
Class :character Class :character Class :character Class :character Class :character 1st Qu.: 1852994
Mode :character Mode :character Mode :character Mode :character Mode :character Median : 4533372
Mean : 6258674
3rd Qu.: 6724540
Max. :37253956
responses perc_pizza_pop
Min. : 1.000 Min. :0e+00
1st Qu.: 1.000 1st Qu.:0e+00
Median : 1.000 Median :0e+00
Mean : 7.474 Mean :1e-06
3rd Qu.: 3.500 3rd Qu.:1e-06
Max. :89.000 Max. :3e-06
NA's :30 NA's :30
And NOW! A simple choropleth with a dramatic title:
tmap_mode("plot")
tmap mode set to plotting
brew = brewer.pal(7, "OrRd")
tm_shape(pizza_conus, projection = 5070) +
tm_fill(col = "perc_pizza_pop", colorNA = "gray97", palette = brew, title = "Respondents\n(% of state pop.)", textNA = "No one cared") +
tm_text("st_abbr", size = 0.7, col = "gray35", remove.overlap = TRUE) +
tm_borders(col = "white", lwd = 2) +
tm_layout(main.title = "Who cared enough to take The Pizza Survey?",
frame = "gray60",
inner.margins = 0.06,
legend.title.size = 0.9)
Very legitimate analysis. You can’t seem them, but DC had the highest % of respondents because no one lives there except the people that I and my friend Popov know. California is of course next up, followed by Massachusetts, Texas, Pennsylvania, and Oregon.
2. What toppings do Californians like?
Most of the respondents were from California (n = 89), so I think we can take a closer look at favorite toppings in our state. First, I need to sort out the city data that I collected. I cheated and scrolled through the table preview to find that most folks spelled their city names right! But the main problem is capitalization, so we will fix that here too.
ca_pizza <- pizza %>%
filter(state == "California") %>%
drop_na(city) %>%
mutate(city = str_to_title(city))
summary(ca_pizza$city)
Length Class Mode
83 character character
We are left with 83 observations with city data. Let’s see if changing the city cases took care of any duplication issues:
ca_city_list <- ca_pizza %>%
group_by(city) %>%
summarise(respondents = n())
summary(ca_city_list)
city respondents
Length:38 Min. : 1.000
Class :character 1st Qu.: 1.000
Mode :character Median : 1.000
Mean : 2.184
3rd Qu.: 2.000
Max. :15.000
head(ca_city_list)
# A tibble: 6 x 2
city respondents
<chr> <int>
1 Berkeley 3
2 Campbell 1
3 Corona 2
4 Costa Mesa 15
5 Cupertino 1
6 Danville 1
My, that’s a lot of people from Costa Mesa. Can’t imagine why… but another cheap shot look through the table shows no duplicates due to spacing or anything, so now I will just hope that everyone used their official city names.
What are Californians favorite pizza toppings, by city? I’ll do something veerrry similar to what I did last week, and create a new data frame that lists the best pizza toppings as a single variable (scored)
ca_best_toppings <- ca_pizza %>%
gather("best_rank", "best_toppings", 12:14) %>%
mutate(best_toppings = sub("Chicken \\(BBQ, buffalo, etc.\\)", "Chicken", best_toppings)) %>%
mutate(best_toppings = sub("Olives \\(black, green, kalamata, etc.\\)", "Olives", best_toppings)) %>%
mutate(best_toppings = sub("N/A - I don't eat pizza", "N/A", best_toppings)) %>%
mutate(best_rank = gsub("best_1", 1, best_rank)) %>%
mutate(best_rank = gsub("best_2", 2, best_rank)) %>%
mutate(best_rank = gsub("best_3", 3, best_rank))
ca_best_toppings$score <- ifelse(
ca_best_toppings$best_rank == 1, 3, (ifelse(
ca_best_toppings$best_rank == 2, 2, ifelse(
ca_best_toppings$best_rank == 3, 1, NA))))
I need to figure out how many times each city voted for each topping (via group by and summarise), then select the topping that received the highest score in each city.
toppings_by_city <- ca_best_toppings %>%
subset(!(best_toppings == "N/A")) %>%
group_by(city, best_toppings) %>%
summarise(votes = n(),
score = sum(score)) %>%
ungroup()
head(toppings_by_city)
# A tibble: 6 x 4
city best_toppings votes score
<chr> <chr> <int> <dbl>
1 Berkeley Bacon 1 2
2 Berkeley Chicken 1 3
3 Berkeley Garlic 1 2
4 Berkeley Ham 1 1
5 Berkeley Meatballs 1 1
6 Berkeley Mushrooms 1 3
no1_topping <- toppings_by_city %>%
group_by(city) %>%
slice(which.max(score)) %>%
rename(Toppings = best_toppings)
head(no1_topping)
This would be fine, except that in a few cities, the #1 toppings were tied! I could not figure out how to get around this, short of changing the values of the “best_toppings” variable to include a list of the top toppings. But that would map terribly, so I did not do that…
Let’s get a shapefile for California cities and merge it with our cities data. I’m not using tigris this time (because the places features don’t include my home town!).
places <- readOGR("cities", "tl_2016_06_place")
OGR data source with driver: ESRI Shapefile
Source: "C:\Users\monica\Documents\Courses\Q7_Fall2018\GEOG 208\week6_project\cities", layer: "tl_2016_06_place"
with 1522 features
It has 17 fields
Integer64 fields read as strings: ALAND AWATER
pizza_places <- merge(x = places, y = no1_topping, by.x = "NAME", by.y = "city")
pizza_city_list <- ca_city_list$city
pizza_places <- subset(pizza_places, NAME %in% pizza_city_list)
summary(pizza_places)
Object of class SpatialPolygonsDataFrame
Coordinates:
min max
x -123.17382 -116.90572
y 32.53487 38.29933
Is projected: FALSE
proj4string :
[+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0]
Data attributes:
NAME STATEFP PLACEFP PLACENS GEOID NAMELSAD LSAD CLASSFP PCICBSA PCINECTA
Berkeley : 1 06:36 06000 : 1 02409837: 1 0606000: 1 Berkeley city : 1 25:34 C1:35 N:18 N:36
Campbell : 1 10345 : 1 02409971: 1 0610345: 1 Campbell city : 1 43: 1 M2: 0 Y:18
Corona : 1 16350 : 1 02410116: 1 0616350: 1 Corona city : 1 57: 1 U1: 1
Costa Mesa: 1 16532 : 1 02410232: 1 0616532: 1 Costa Mesa city: 1 U2: 0
Cupertino : 1 17610 : 1 02410239: 1 0617610: 1 Cupertino city : 1
Danville : 1 17988 : 1 02410278: 1 0617988: 1 Danville town : 1
(Other) :30 (Other):30 (Other) :30 (Other):30 (Other) :30
MTFCC FUNCSTAT ALAND AWATER INTPTLAT INTPTLON Region Toppings
G4110:35 A:35 102233537 : 1 0 : 8 +32.8152995: 1 -117.1349930: 1 North:14 Length:36
G4210: 1 S: 1 111418803 : 1 1134748 : 1 +33.5288675: 1 -117.2056845: 1 South:20 Class :character
1214027148: 1 122244773: 1 +33.5915227: 1 -117.4628880: 1 NA's : 2 Mode :character
121455687 : 1 133550 : 1 +33.5971325: 1 -117.5654945: 1
130317967 : 1 13438927 : 1 +33.6659055: 1 -117.6639923: 1
132800801 : 1 1685050 : 1 +33.6783986: 1 -117.6994074: 1
(Other) :30 (Other) :23 (Other) :30 (Other) :30
votes score
Min. :1.000 Min. : 3.000
1st Qu.:1.000 1st Qu.: 3.000
Median :1.000 Median : 3.000
Mean :1.611 Mean : 4.278
3rd Qu.:2.000 3rd Qu.: 4.000
Max. :7.000 Max. :19.000
And here is an interactive map of California, where places are labeled by their number one pizza topping (and the size of the label is based on the score)
tmap_mode("view")
tmap mode set to interactive viewing
toppings_col <- c("#aa530e",
"#535238",
"#ff8b6a",
"#df8931",
"#ffc97c",
"#e2434b",
"#d6c481",
"#857671",
"#726a95",
"#ea5f2d",
"#ffcd3c",
"#860f44",
"#5e8b6f")
tm_shape(pizza_places) +
tm_text("Toppings", size = "score", scale = 3, col = "Toppings", palette = toppings_col, fontface = "bold") +
tm_view(alpha = 1, text.size.variable = TRUE, view.legend.position = c("left", "bottom")) +
tm_layout(title = "Favorite Pizza Toppings by California Cities")
Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
We can zoom in to see which cities beat the Pepperoni-Sausage-Mushroom train - interestingly, Bay Area folks really dig anchovies!! And Orange County loves extra cheese.
I found it really difficult to edit map layouts when in “view” mode. While I could adjust the position of the legend in this map, I would’ve preferred to get rid of it - but couldn’t figure out how!
3. Anchovies…pineapples…olives, oh my…!
Last week, we found that anchovies, pineapple, and olives were respondents’ least favorite toppings. How often and where do they show up in California?
Because I am lazy, I can conveniently set up this data as I did for map #2:
ca_worst_toppings <- ca_pizza %>%
gather("worst_rank", "worst_toppings", 15:17) %>%
mutate(worst_toppings = sub("Chicken \\(BBQ, buffalo, etc.\\)", "Chicken", worst_toppings)) %>%
mutate(worst_toppings = sub("Olives \\(black, green, kalamata, etc.\\)", "Olives", worst_toppings)) %>%
mutate(worst_toppings = sub("N/A - I like it all", "N/A", worst_toppings)) %>%
mutate(worst_toppings = sub("Excessive cheese", "Extra cheese", worst_toppings)) %>%
mutate(worst_rank = gsub("worst_1", 1, worst_rank)) %>%
mutate(worst_rank = gsub("worst_2", 2, worst_rank)) %>%
mutate(worst_rank = gsub("worst_3", 3, worst_rank))
bad_toppings_by_city <- ca_worst_toppings %>%
group_by(city, worst_toppings) %>%
summarise(votes = n()) %>%
ungroup() %>%
spread(worst_toppings, votes)
bad_pizza_places <- merge(x = places, y = bad_toppings_by_city, by.x = "NAME", by.y = "city")
bad_pizza_places <- subset(bad_pizza_places, NAME %in% pizza_city_list)
Then create separate maps for each unloved topping:
anchovies <- tm_shape(bad_pizza_places) +
tm_bubbles("Anchovies", col = "#eec89f", alpha = 0.5, border.col = "#bb3939",
border.lwd = 3, border.alpha = 1, scale = 2) +
tm_view(alpha = 1, basemaps = "Stamen.TonerLite", symbol.size.fixed = TRUE) +
tm_layout(title = "Where Californians Hate Anchovies", title.position = c("left", "bottom"))
pineapple <- tm_shape(bad_pizza_places) +
tm_bubbles("Pineapple", col = "#ffdc76", alpha = 0.5, border.col = "#ff7657",
border.lwd = 3, border.alpha = 1, scale = 2) +
tm_view(alpha = 1, basemaps = "Stamen.TonerLite", symbol.size.fixed = TRUE) +
tm_layout(title = "Where Californians Hate Pineapple", title.position = c("left", "bottom"))
olives <- tm_shape(bad_pizza_places) +
tm_bubbles("Olives", col = "#dde0ab", alpha = 0.4, border.col = "#092a35",
border.lwd = 3, border.alpha = 1, scale = 2) +
tm_view(alpha = 1, basemaps = "Stamen.TonerLite", symbol.size.fixed = TRUE) +
tm_layout(title = "Where Californians Hate Olives", title.position = c("left", "bottom"))
And combine them by trying out tmap_arrange():
tmap_arrange(anchovies, pineapple, olives, ncol = 2, nrow = 2, sync = TRUE)
Legend for symbol sizes not available in view mode.
Legend for symbol sizes not available in view mode.
Legend for symbol sizes not available in view mode.
Those Central California respondents are the easiest to see, and they don’t mind pineapple as much as they do anchovies, which they don’t mind as much as they do olives! But is this a terribly useful map? No. It’s not. But I wanted bubbles, so here we are.
I do wish I were showing the % of respondents per city who voted for each terrible topping, but I felt it was also important to show the number of votes and showing both was not in the cards for me.
Here again, customizing the view-mode maps was really frustrating. I didn’t get a legend for the symbol sizes on, nor an overall map title. I’d love to see how folks worked with this mode!