Module 14: Building Your Portfolio
Reviewing Modeling and Visualization
Welcome to Workshop 14: Final Review!
In this workshop, we will review all key workshop concepts relevant to modeling, simulation, and geospatial visualization, through several review problems. Answers are posted below.
These review problems are optional, but I strongly suggest that you work on them together with your group.
0. Load Packages
library(tidyverse) # data wrangling
library(viridis) # color palettes
library(moderndive) # extra regression functions
library(Zelig) # simulation
library(sf) # mapping
library(ggspatial) # adding scales to maps
Dataset 1: Legislator Partisanship
Our Data
In the famous Voteview dataset, political scientist Jeffrey Lewis and his colleagues measured the partisanship of all 7,825 US members of Congress from 1987 to 2021. Each of our 37,911 rows is a legislator during a specific session of congress (eg. 2017-2019), where partisanship places them on a scale from -1.0 (liberal) to 1.0 (conservative), based on how they often they voted for or against key liberal or democratic economic and redistributive policies (eg. social welfare, health care, etc.).
Let’s examine a lightly curated version of the VoteView dataset, saved in your files as voteview.csv. Load it in and filter to just legislators in New England states using the code below.
congress <- read_csv("voteview.csv") %>%
# Filter to New England stats
filter(state %in% c("MA", "ME", "NH", "VT", "CT", "RI")) %>%
# Convert all categorical variables to factors
mutate(chamber = factor(chamber),
party = factor(party),
state = factor(state))Learn more!
- Citation: Lewis, Jeffrey B., Keith Poole, Howard Rosenthal, Adam Boche, Aaron Rudkin, and Luke Sonnet (2021). Voteview: Congressional Roll-Call Votes Database.
Task 1.1. Variation Explained
Question
Modeling: Use a regression model to predict the partisanship of each legislator (
partisanship), controlling for thechamberthey served in ("House"or"Senate"), theirparty("Republican","Democrat","Other"), theirage, andstate("MA","ME","NH","VT","CT","RI").Model Fit: What percentage of variation does your model explain? Is it a good or bad model? How do you know?
Answer
First, we make our model m1 with the lm() function, placing our outcome partisanship first, followed by our predictors, chamber, party, age, and state.
m1 <- congress %>%
lm(formula = partisanship ~ chamber + party + age + state)Next, we can use the moderndive package’s get_regression_summaries() function to extract the summary statistics for our model. (You can also view these with the summary() function.)
get_regression_summaries(m1)| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.825 | 0.824 | 0.0246135 | 0.1568869 | 0.157 | 1501.086 | 0 | 9 | 2876 |
- Percentage of Variation: Our model explains 82.5% of the variation in legislator partisanship. We know this from our R2 statistic (
r_squared). This is an excellent model; that’s very close to 100%, which would mean a perfect model.
Task 1.2. F-statistic
Question
- Evaluate your model’s F-statistic. Is it statistically significant? How do you know? What does it tell us about our model?
Answer
We can use the moderndive package’s get_regression_summaries() function to extract the summary statistics for our model. The statistic column lists the F-statistic. The p_value column lists the F-statistic’s p-value. (You can also view these with the summary() function.)
get_regression_summaries(m1) %>%
select(statistic, p_value)| statistic | p_value |
|---|---|
| 1501.086 | 0 |
- F-statistic: Our model has an F-statistic of 1501.086. This is really extreme! We know because we can look at the p-value, which is less than 0.001. This is statistically significant, at a level below p < 0.05, meaning it is more extreme than 95% of the F-statistics we would get if this were due to chance.
The F-statistic is a ratio of how much variation in our outcome that was explained by our model, compared to how much variation was not explained (taking into account the number of cases analyzed and variables used). An F-statistic of 0 means your model is not much better than it is bad, so to speak; in contrast, an F-statistic of 3, 4, 5, or as high as 1501.086 means you explain way, way, way more variation than you fail to explain. Check out this workshop for more info.
Task 1.3. Residuals
Using your model m1 from Task 1, please complete the following steps:
Question
Residuals: Gather your model’s regression model’s residuals in a data.frame, together with your observed outcome and predicted outcome. Save that data.frame as
m1data.Meaning: What does a residual mean in this model?
Answer
Residuals: Gather your model’s regression model’s residuals in a data.frame, together with your observed outcome and predicted outcome. Save that data.frame as m1data.
Using get_regression_points(), we can view these data for our model m1. Our observed outcome is partisanship; our predicted outcome is partisanship_hat; our residual is residual (partisanship - partisanship_hat).
m1data <- m1 %>% get_regression_points()# Let's view the first 3 rows
m1data %>% head(3)| ID | partisanship | chamber | party | age | state | partisanship_hat | residual |
|---|---|---|---|---|---|---|---|
| 1 | 0.304 | House | Republican | 53 | CT | 0.345 | -0.041 |
| 2 | 0.477 | House | Republican | 64 | CT | 0.359 | 0.118 |
| 3 | -0.208 | House | Democrat | 57 | CT | -0.340 | 0.132 |
| 4 | 0.437 | House | Republican | 68 | CT | 0.364 | 0.073 |
| 5 | 0.387 | House | Republican | 49 | ME | 0.332 | 0.055 |
| 6 | -0.187 | House | Other | 61 | ME | -0.295 | 0.108 |
- Meaning: A residual refers to how much greater is the observed outcome than the predicted outcome for a given case. Our model has as many predictions and residuals as it has observations (n = 2876). See below!
Task 1.4. Residuals Visualized
Question
- Visualize: Extract your model’s residuals and save it as a data.frame called
m1data. Visualize your model’s residuals as a histogram. Examine your plot. What is the average residual? What are relatively rare or extreme residuals for this model?
Answer
We can extract our model’s residuals using get_regression_points() from the moderndive package.
m1data <- m1 %>%
get_regression_points()Then, we can visualize them as a simple histogram.
m1data %>%
ggplot(mapping = aes(x = residual)) +
geom_histogram(color = "white", fill = "dodgerblue") +
theme_bw(base_size = 24)The average residual is about 0 - meaning most observations are nearly perfectly predicted (yay!). The most extreme or rarely occurring residuals are those above 0.40 or below -0.40. We might guess that 95% of residuals are within this range, meaning our predictions are almost never off by more than either amount.
Task 1.5. Sigma
Question
Sigma: What is sigma, and how does it relate to your residuals? What does it tell us about our model?
Answer
sigma is a single number represent the residual standard error of our model. It’s basically the standard deviation of our residuals, meaning how much do our predictions vary from our observed outcomes on average.
get_regression_summaries(m1) %>%
select(sigma)| sigma |
|---|
| 0.157 |
It tends to be very, very close to the standard deviation of residuals.
get_regression_points(m1) %>%
summarize(sd = sd(residual))| sd |
|---|
| 0.1569074 |
Our partisanship score ranges from -1 to 1, but our model’s predictions tend be off by just plus or minus 0.157! This is great news, and indicates a very, very good model. Check out this workshop for more info..
Simulation tie-in: We use sigma in simulation to tell Zelig how much error occurs in our model on average.
Task 1.6: Coefficients
Question
Finally, please examine the coefficient table for your model m1, and answer the following question:
- Model equation: What is a model equation, and what is the specific model equation for your model?
Answer
We can get the coefficient table using get_regression_table() from the moderndive package.
get_regression_table(m1)| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | -0.411 | 0.016 | -25.486 | 0.000 | -0.442 | -0.379 |
| chamber: Senate | -0.013 | 0.007 | -1.802 | 0.072 | -0.026 | 0.001 |
| party: Other | 0.049 | 0.029 | 1.703 | 0.089 | -0.007 | 0.105 |
| party: Republican | 0.690 | 0.006 | 108.228 | 0.000 | 0.678 | 0.703 |
| age | 0.001 | 0.000 | 4.381 | 0.000 | 0.001 | 0.002 |
| state: MA | -0.005 | 0.008 | -0.589 | 0.556 | -0.021 | 0.011 |
| state: ME | -0.009 | 0.011 | -0.819 | 0.413 | -0.030 | 0.012 |
| state: NH | 0.055 | 0.012 | 4.789 | 0.000 | 0.033 | 0.078 |
| state: RI | 0.021 | 0.011 | 1.878 | 0.060 | -0.001 | 0.043 |
| state: VT | -0.058 | 0.012 | -4.628 | 0.000 | -0.082 | -0.033 |
Our model equation is listed in the estimate column, and we can write it out like so:
$ Partisanship_{predicted} = -0.411 - 0.013 Chamber_{senate} + 0.049 Party_{Other} + $
$ 0.690 Party_{Republican} + 0.001 Age + $
\(-0.005 \times State_{MA} -0.009 \times State_{ME} + 0.055 \times State_{NH} +\)
$ 0.021 State_{RI} - 0.058 State_{VT} $
Task 1.7. Alpha
Question
- Alpha: What is the alpha coefficient in your model table? What specifically does it tell us about legislators’ partisanship? (Remember that there are categorical variables in this model.)
Answer
Alpha: The alpha coefficient (shown by the intercept row) describes how much of the outcome our model projects on average if all other predictors are set to 0. In a model with categorical predictors, it means the average outcome for a city with the baseline category for each categorical variable, and all other predictors are set to 0.
m1 %>%
get_regression_table() %>%
filter(term == "intercept")| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | -0.411 | 0.016 | -25.486 | 0 | -0.442 | -0.379 |
In our model, the baseline categories are "House" for chamber, "Democrat" for party, and "CT" for state. Our model projects an average partisanship score of for a Democratic member of the House from Connecticut, who is zero years old (age == 0)!
Task 1.8. Beta
Question
- Beta: What is the beta coefficient in your model table? Is there just one, or more than one? What specifically does it tell us about legislators’ partisanship?
Answer
A Beta coefficient shows the projected increase in the outcome, partisanship, as that coefficient’s predictor increases by 1, holding all else constant.
In other words, the beta coefficient is the slope of our regression model equation. But in multiple regression, we have not 1, but 9 beta coefficients.
Our beta coefficients are written in the estimate column.
m1 %>%
get_regression_table() %>%
filter(term != "intercept")| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| chamber: Senate | -0.013 | 0.007 | -1.802 | 0.072 | -0.026 | 0.001 |
| party: Other | 0.049 | 0.029 | 1.703 | 0.089 | -0.007 | 0.105 |
| party: Republican | 0.690 | 0.006 | 108.228 | 0.000 | 0.678 | 0.703 |
| age | 0.001 | 0.000 | 4.381 | 0.000 | 0.001 | 0.002 |
| state: MA | -0.005 | 0.008 | -0.589 | 0.556 | -0.021 | 0.011 |
| state: ME | -0.009 | 0.011 | -0.819 | 0.413 | -0.030 | 0.012 |
| state: NH | 0.055 | 0.012 | 4.789 | 0.000 | 0.033 | 0.078 |
| state: RI | 0.021 | 0.011 | 1.878 | 0.060 | -0.001 | 0.043 |
| state: VT | -0.058 | 0.012 | -4.628 | 0.000 | -0.082 | -0.033 |
Our model’s beta coefficients tell us that:
chamber: Senate: A Senator tends to be 0.013 points less conservative than a member of the House (baseline category).party: Other: A member of a third party tends to be 0.049 points more conservative than a Democrat (baseline category).party: Republican: A Republican tends to be 0.690 points more conservative than a Democrat (baseline category).age: As age increases by 1 year, legislators tend to become 0.001 points more conservative.state: MA: A legislator from Massachusetts tends to be 0.005 points less conservative than one from Connecticut.state: ME: A legislator from Maine tends to be 0.009 points less conservative than one from Connecticut.state: NH: A legislator from New Hampshire tends to be 0.055 points more conservative than one from Connecticut.state: RI: A legislator from Rhode Island tends to be 0.021 points more conservative than one from Connecticut.state: MA: A legislator from Vermont tends to be 0.058 points less conservative than one from Connecticut.
Task 1.9: Prediction
Question
- Using your model equation, please calculate by hand the predicted partisanship score for a 30 year old Massachusetts Democrat member of the House.
Answer
We can plug the following scores into our model to predict our partisanship score.
# Intercept (baseline conditions)
-0.411 +
# Chamber = House, so ChamberSenate = 0
-0.013 * 0 +
# Party = Democrat, so PartyOther = 0
0.049 * 0 +
# Party = Democrat, so PArtyRepublican = 0
0.690 * 0 +
# Age = 30
0.001 * 30 +
# State = MA, so StateMA = 1
-0.005 * 1 +
# State = MA, so StateME = 0
0.009 * 0 +
# State = MA, so StateNH = 0
0.055 * 0 +
# State = MA, so StateRI = 0
0.021 * 0 +
# State = MA, so StateVT = 0
0.058 * 0## [1] -0.386
Our model projects that a 30 year old Massachusetts Democrat member of the House would have a predicted partisanship score of -0.386, the value of a moderate-left liberal.
Task 1.10: Significance
Question
What does it mean for an alpha or beta coefficient to be statistically significant?
Which of our coefficients are statistically significant? How do you know?
Answer
Statistical significance refers to how extreme our statistic is compared to the range of statistics we would get if it were due to chance. If our statistic is really extreme, then we are pretty sure it didn’t happen due to chance. We can probably trust that result to reappear if we repeated the study.
Which of our coefficients are significant? We can look at the
p_valuecolumn and determine which have p-values below 0.05.
m1 %>%
get_regression_table()| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | -0.411 | 0.016 | -25.486 | 0.000 | -0.442 | -0.379 |
| chamber: Senate | -0.013 | 0.007 | -1.802 | 0.072 | -0.026 | 0.001 |
| party: Other | 0.049 | 0.029 | 1.703 | 0.089 | -0.007 | 0.105 |
| party: Republican | 0.690 | 0.006 | 108.228 | 0.000 | 0.678 | 0.703 |
| age | 0.001 | 0.000 | 4.381 | 0.000 | 0.001 | 0.002 |
| state: MA | -0.005 | 0.008 | -0.589 | 0.556 | -0.021 | 0.011 |
| state: ME | -0.009 | 0.011 | -0.819 | 0.413 | -0.030 | 0.012 |
| state: NH | 0.055 | 0.012 | 4.789 | 0.000 | 0.033 | 0.078 |
| state: RI | 0.021 | 0.011 | 1.878 | 0.060 | -0.001 | 0.043 |
| state: VT | -0.058 | 0.012 | -4.628 | 0.000 | -0.082 | -0.033 |
- Our significant effects include: (1) the intercept (p < 0.001), (2) being Republican (p < 0.001), (3) age (p < 0.001), (4) being from New Hampshire (p < 0.001), and (5) being from Vermont (p < 0.001). Other effects are also somewhat significant, but we’re not 95% confident or more about their significance.
Task 1.11: Standard Error
Question
- How do the
statistic,std_error, andp_valuecolumns help us determine statistical significance?
Answer
Our three terms of interest mean:
std_error: the standard deviation of the null distribution (for any statistic, the hypothetical distribution of statistics we would get if our statistic was off just a bit due to random chance.)statistic: a standardized version of our coefficient (theestimatecolumn), ranging from -Infinity to +Infinity. (Literally equals the estimate divided by the standard error.)p_value: the percentage (written as a decimal) of statistics in the null distribution greater than the statistic we actually got. If your percentage is high, your observed statistic is not very extreme. If your percentage is low, your observed statistic is pretty extreme compared to the rest. We use the threshold p < 0.05 to indicate 95% confidence that our statistic was not just due to chance.
Task 1.12: Confidence Intervals
Question
What do
upper_ciandlower_ciin yourget_regression_table()output mean, with regards to the effect of being Republican on a legislator’s partisanship?You can approximate a 95% confidence interval from just the
estimateand 1 other vector in yourget_regression_table()output, using a quick rule of thumb. What is that rule of thumb? Calculate by yourself the 95% confidence interval of the effect of being Republican on a legislator’s partisanship.
Answer
upper_ciandlower_cidepict the 2.5th and 97.5th percentiles of the null distribution for a statistic (in this case, for our coefficient). That is to say, even if our estimate was off due to chance, 95% of the time, it would be somewhere in that zone! We are 95% confident that the true effect of being Republican on a legislator’s partisanship is an increase of between 0.678 and 0.703 points.
m1 %>%
get_regression_table() %>%
filter(term == "party: Republican")| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| party: Republican | 0.69 | 0.006 | 108.228 | 0 | 0.678 | 0.703 |
- You can approximate a 95% confidence interval for any statistic using just the
estimateand thestd_error, if you multiply thestd_errorby about 2 (it’s ~1.96, depending on sample size).
m1 %>%
get_regression_table() %>%
filter(term == "party: Republican") %>%
select(term, estimate, std_error, lower_ci, upper_ci) %>%
# Compare
mutate(mylower = estimate - std_error * 1.96,
myupper = estimate + std_error * 1.96)| term | estimate | std_error | lower_ci | upper_ci | mylower | myupper |
|---|---|---|---|---|---|---|
| party: Republican | 0.69 | 0.006 | 0.678 | 0.703 | 0.67824 | 0.70176 |
Task 1.13: Simulation
Question
Using Zelig, please simulate the expected effect of age on legislator partisanship, as an average legislators’ age increases from 30 to 40 to 50 to 60 to 70 to 80 years old. Calculate 95% confidence intervals and visualize your simulations as a ribbon plot with ggplot(). (Note: You must have already converted your categorical variables in m1 to be factors for Zelig to work.)
Answer
Use the code below to simulate in Zelig!
mysim <- m1 %>%
# Convert model to zelig format
to_zelig() %>%
# For an average legislator who is 30, 40, 50... years old,
setx(age = c(30, 40, 50, 60, 70, 80)) %>%
# Simulate expected partisanship
sim() %>%
# Convert to data.frame
zelig_qi_to_df() %>%
# Grab just age and expected value columns
select(age, ev = expected_value) %>%
# For each age (30, 40, 50, 60, etc.)
group_by(age) %>%
# Give me confidence intervals and median expected value
summarize(lower_ci = quantile(ev, probs= 0.025),
median = quantile(ev, probs = 0.50),
upper_ci = quantile(ev, probs = 0.975))Then visualize it in ggplot like so!
mysim %>%
# Add aesthetics
ggplot(mapping = aes(x = age, y = median,
ymin = lower_ci, ymax = upper_ci)) +
# Plot confidence intervals
geom_ribbon(fill = "cyan", alpha = 0.5, color = "black") +
# Plot median expected value with a dashed line
geom_line(linetype = "dashed", size = 1) +
# Add a nice theme
theme_bw() +
# Add labels!
labs(x = "Age (years)",
y = "Expected Partisanship for Average Legislator\n(-1 = Liberal, +1 = Conservative)",
subtitle = "Expected Effect of Age on Partisanship",
caption = "Based on 1000 simulations in the Zelig package in R.") remove(congress, m1)
Dataset 2: Wine Rankings
Our Data
Zack Thoutt famously webscraped 130,000 wine reviews from WineEnthusiast magaine on November 22, 2017. He wondered if he could build a predictive model that would identify good wines as accurately as a master sommelier (wine tasting expert).
Drawing from his 96,420 complete reviews, I lightly curated the data to focus on the rating, price, variety, country, province, and taster. Each wine was rated on a scale from 0 (lowest) to 100 (highest). And, each wine is classified by the top 50 most frequently occurring provinces and varieties of wine. We’re going to use a random sample of 10,000 of these reviews to build that model and simulate the ranking habits of each sommelier! Please import these reviews from your project’s files, save as wind_reviews.csv.
mywines <- read_csv("wine_reviews.csv") %>%
# Convert categories to factor
mutate(country = factor(country),
variety = factor(variety),
taster = factor(taster))We can glimpse() the dataset to see its contents below!
mywines %>%
glimpse()## Rows: 10,000
## Columns: 7
## $ wine <chr> "Monte Novo e Figueirinha 2007 Herdade da Figueirinha Reserva…
## $ points <dbl> 81, 91, 87, 90, 87, 88, 89, 90, 94, 94, 85, 89, 87, 86, 88, 9…
## $ price <dbl> 9, 54, 20, 48, 34, 36, 15, 25, 50, 32, 14, 28, 40, 7, 13, 33,…
## $ variety <fct> Portuguese Red, Pinot Noir, Other, Chardonnay, Pinot Noir, Pi…
## $ country <fct> Portugal, US, Italy, US, US, US, US, US, US, US, South Africa…
## $ province <chr> "Alentejano", "California", "Northeastern Italy", "California…
## $ taster <fct> Roger Voss, Virginie Boone, Kerin O'Keefe, Matt Kettmann, Pau…
Task 2.1: Simulation with Categorical Variables
Question
Modeling: Use a regression model to predict how highly a wine is rated (
points), controlling for the wine’sprice, itscountryof origin, itsvariety, and thetasterwho rated it.Simulation: Then, use statistical simulation to answer the following question: Holding all else constant, does our model project higher expected ratings for Greek, Italian, Spanish, Portguese, French, or German wines? Use
geom_linerange()to visualize your simulation.
Answer
# First, make the model
m2 <- mywines %>%
lm(formula = points ~ price + country + variety + taster)
# Second, simulate these
mysim <- m2 %>%
to_zelig() %>%
# List European countries in sample
setx(country = c("Greece","Italy", "Spain",
"Portugal", "France", "Germany")) %>%
sim() %>%
zelig_qi_to_df() %>%
select(country, ev = expected_value) %>%
# For each country
group_by(country) %>%
# Give me confidence intervals and median expected value
summarize(lower_ci = quantile(ev, probs= 0.025),
median = quantile(ev, probs = 0.50),
upper_ci = quantile(ev, probs = 0.975))
# Visualize
mysim %>%
ggplot(mapping = aes(x = reorder(country, median), y = median,
ymin = lower_ci, ymax = upper_ci,
color = median)) +
# Giving a line for each country
geom_linerange(size = 1.25) +
geom_point(size = 5) +
scale_color_viridis(option = "plasma", begin = 0, end = 0.8) +
# And flipping the axes
coord_flip() +
theme_bw(base_size = 24) +
labs(y = "Median Expected Wine Rating (0-100)\nwith 95% Confidence Intervals)",
color = "Median\nExpected\nValue",
x = "Country")
Task 2.2: Simulation with Multiple Conditions
Question
- These wines were rated by 19 top sommeliers. Does personality matter? Gather the names of the 5 most prolific tasters (who tasted the most wines in your sample), and then simulate how highly they each rank an average Italian wine vs. an average Greek wine. At the end, you should produce a data.frame of confidence intervals with 10 rows, and visualize it in two facets, one for Italian wines and one for Greek wines.
Answer
The top 5 most prolific wine tasters in our sample are Roger Voss, Michael Schachner, Kerin O’Keefe, Virginie Boone, and Paul Gregutt.
mywines %>%
group_by(taster) %>%
count() %>%
arrange(desc(n))## # A tibble: 18 × 2
## # Groups: taster [18]
## taster n
## <fct> <int>
## 1 Roger Voss 2118
## 2 Michael Schachner 1564
## 3 Virginie Boone 1006
## 4 Paul Gregutt 989
## 5 Kerin O'Keefe 974
## 6 Matt Kettmann 627
## 7 Joe Czerwinski 503
## 8 Sean P. Sullivan 495
## 9 Anna Lee C. Iijima 454
## 10 Jim Gordon 448
## 11 Anne Krebiehl MW 371
## 12 Lauren Buzzeo 177
## 13 Susan Kostrzewa 99
## 14 Mike DeSimone 73
## 15 Alexander Peartree 49
## 16 Jeff Jenssen 37
## 17 Carrie Dykes 13
## 18 Fiona Adams 3
Then, we can simulate them like so!
# Second, simulate these
mysim <- m2 %>%
to_zelig() %>%
# List top 5 most prolific tasters
setx(taster = c("Roger Voss", "Michael Schachner",
"Kerin O'Keefe", "Virginie Boone", "Paul Gregutt",
"Roger Voss", "Michael Schachner",
"Kerin O'Keefe", "Virginie Boone", "Paul Gregutt"),
country = c("Italy", "Italy", "Italy", "Italy", "Italy",
"Greece","Greece","Greece","Greece","Greece")) %>%
sim() %>%
zelig_qi_to_df() %>%
select(taster, country, ev = expected_value) %>%
# For each country
group_by(taster, country) %>%
# Give me confidence intervals and median expected value
summarize(lower_ci = quantile(ev, probs= 0.025),
median = quantile(ev, probs = 0.50),
upper_ci = quantile(ev, probs = 0.975))
mysim %>%
ggplot(mapping = aes(x = reorder(taster, median), y = median,
ymin = lower_ci, ymax = upper_ci)) +
geom_errorbar(width = 0.5) +
geom_point(size = 5) +
facet_wrap(~country) +
coord_flip() +
theme_bw(base_size = 24) +
labs(x = "Top 5 Wine Tasters",
y = "Median Expected Wine Rating (0-100)\nwith 95% Confidence Intervals",
caption = "Based on 1000 simulations in the Zelig package in R.",
subtitle = "Who's the toughest judge?")Our results show that Paul Gregutt and Virginie Boone tend to give more favorable scores, while Michael Schachner tends to give lower scores. Notably, our model projects that Kerin O’Keefe would give a wide range of scores to Greek wines, but narrower scores to Italian wines.
Note: These results are just for fun and should not be used to evaluate any competitions. This model, after all, has a pretty low R2. But I am eager to see datasets like this come out for the Great British Baking Show and other pop competitions!
Task 2.3: Expected vs. Predicted Values
Question
What’s an expected value, and how is it different from a predicted value? When would we use an expected vs. predicted value? Compare the distributions of expected and predicted values generated for an otherwise average wine that costs $100.
Answer
Predicted values and expected values are two types of outputs you can get from a statistical simulation.
Predicted values and expected values both account for estimation uncertainty, but account for fundamental uncertainty in different ways.
Predicted values tend to have wide confidence intervals, while expected values have narrow confidence intervals.
Predicted values are great for forecasting the future conservatively, while expected values are generally better but should mostly be used for explaining the past.
m2 %>%
to_zelig() %>%
setx(price = 100) %>%
sim() %>%
zelig_qi_to_df() %>%
select(price, ev = expected_value, pv = predicted_value) %>%
pivot_longer(cols = c(ev, pv), names_to = "type", values_to = "value") %>%
ggplot(mapping = aes(x = type, y = value, color = type)) +
geom_jitter(alpha = 0.25) +
geom_violin() +
guides(color = "none") +
theme_bw(base_size = 24) +
labs(x = "Type of Output", y = "1000 Simulated Wine Ratings",
subtitle = "Simulated Wine Rating for a $100 wine",
caption = "Based on 1000 simulations in the Zelig package in R.")
Task 2.4: Uncertainty
Question
What’s the difference between estimation uncertainty and fundamental uncertainty? How do predicted and expected value account for each?
Answer
They both account for estimation uncertainty, the idea that our beta coefficients might be slightly off due to limited sample size, by drawing 1000 rows of slightly beta coefficients, each very close to our original coefficients, but just slightly off, and calculating a simulated outcome from each of them.
To get a predicted value, you take a simulated outcome, estimate the range of simulated outcomes you’d get due to fundamental uncertainty, and randomly pick 1 from that distribution.
To get an expected value, you take a simulated outcome, estimate the range of simulated outcomes you’d get due to fundamental uncertainty, randomly pick 1000 from that distribution, and average them to get 1 expected value.
remove(mywines, m2)
Dataset 3: Mapping Boston
Boston City Government has collected dozens of excellent geospatial data on analyze.boston.gov. Your project files contain many different .geojson files we can use to explore geographic trends in Boston in the next several tasks, to practice geographic visualization.
Below, you will be challenged to reverse engineer several maps, using your mapping toolkit.
Task 3.1 Donuts and Democracy
Question
Replicate the map of central Boston voting precincts below.
To start off, you’ll need the following data:
# Locations of Dunkin Donuts shops
mydonuts <- read_sf("dunkin.geojson")
# Train Lines
mytrains <- read_sf("boston_train_lines.geojson")
# Polygons for Boston Precincts, with unique identifier ward_precinct code
myprecincts <- read_sf("precincts.geojson")
# Data.frame of Voter Turnout for each Precinct,
# with unique identifer ward_precinct code
mydata <- read_csv("boston_votes_data.csv") %>%
filter(type == "BALLOTS CAST")And use the following code to get started:
# First, do [SOMETHING] before plotting to get
# the voter turnout data that you need
# Second, visualize!
ggplot() +
###########################
# ADD SEVERAL LAYERS HERE
###########################
# Then add on top a nice, clear theme
theme_void(base_size = 24) +
# And crop the map to the following box
coord_sf(xlim = c(-71.11, -71.05),
ylim = c(42.335, 42.37))
Answer
# Take your precinct polgyons
myvotes <- myprecincts %>%
# Join in the voter turnout using
# the shared unique identifier ward_precinct in mydata
left_join(by = "ward_precinct", y = mydata)
ggplot() +
# Give your precincts a nice, big grey outline
geom_sf(data = myvotes, color = "grey", size = 5) +
# Then overlap that with your precincts, shaded by voter turnout
geom_sf(data = myvotes, mapping = aes(fill = percent),
# With a nice thin white border for each precinct
color = "white", size = 0.25) +
# Shade the polygons using a viridis plasma color scale
scale_fill_viridis(option = "plasma") +
geom_sf(data = mytrains, size = 2, color = "white") +
geom_sf(data = mytrains, size = 1, color = "black") +
# Use shape = 21 to get points with fill AND color options
geom_sf(data = mydonuts, shape = 21,
# Make the inside white and the outline black
fill = "white", color = "black",
# and make each dot pretty big
size = 4) +
# Add a nice, clear theme
theme_void() +
# And crop the map to the following box
coord_sf(xlim = c(-71.11, -71.05),
ylim = c(42.335, 42.37)) +
# Add some labels!
labs(title = "Donuts and Democracy",
subtitle = "Need some caffeine before you head to the polls?",
fill = "% Voter\nTurnout",
caption = "Points show Dunkin Donuts shops. Lines show T lines.
Shapes show Boston voting precincts for 2020 Election")
Task 3.2 Wicked Hot Boston
Question
Researchers at the Boston Museum of Science and the Helmbuth Lab at Northeastern University measured temperatures in Boston neighborhoods to identify ‘heat islands,’ meaning areas disproportionately affected by high heat in the summer. Using their data, saved in the heat_index variable in heat_index.geojson, please replicate the map below.
To start off, you’ll need the following data:
# Neighborhood polygons
myareas <- read_sf("neighborhoods.geojson")
# Street lines in Boston
mystreets <- read_sf("streets.geojson")
# Hexagons measuring heat index
myhex <- read_sf("heat_index.geojson")
# Points for all universities in Boston
myuni <- read_sf("universities.geojson")And use the following code to get started:
# First, do [SOMETHING] before plotting to get
# the point data that you need
# Then visualize
ggplot() +
###########################
# ADD SEVERAL LAYERS HERE
###########################
# Then add on top a nice, clear theme
theme_void(base_size = 24) +
# And crop the map to the following box
coord_sf(xlim = c(-71.11, -71.05),
ylim = c(42.32, 42.36))
Answer
# Zoom into just the three schools shown in the visual
mythreeschools <- myuni %>%
filter(name %in% c("Northeastern University",
"Suffolk University",
"Berklee College of Music"))
# Then visualize
ggplot() +
# Add a nice thick black background based on our neighborhoods
geom_sf(data = myareas, color = "black", size = 10) +
# layer ontop our Heat Index hexagons!
geom_sf(data = myhex, mapping = aes(fill = heat_index), color = NA) +
# With a logged viridis color scale
scale_fill_viridis(option = "viridis", trans = "log") +
# layer over that thin white street lines
geom_sf(data = mystreets, color = "white", size = 0.2) +
# layer over that bold black neighborhood lines with a clear fill
geom_sf(data = myareas, fill = NA, color = "black", size = 1) +
# Plop our school points on top, as nice white points
geom_sf(data = mythreeschools, color = "white", size = 7) +
# and slightly smaller points, so we get an outline effect
geom_sf(data = mythreeschools, color = "black", size = 5) +
# Then add on top a nice, clear theme
theme_void() +
# Zoom into a good box
coord_sf(xlim = c(-71.11, -71.05),
ylim = c(42.32, 42.36)) +
# And add informative labels
labs(
title = "Urban Heat Islands in Boston",
color = "Universities",
fill = "Heat\nIndex",
caption = "Hexagons show mean air temperature at 3 pm, July 29-30, 2019.
White lines depict streets. Black lines depict neighborhoods.
Points depict schools (Northeastern, Suffolk, and Berkley).
Data from Wicked Hot Boston Project & Analyze.Boston.gov")
Task 3.3 Mapping Northeastern University
Thanks to the City of Boston and their immense GIS data archives, we can actually visualize every single building in Boston! Below, we zoom into the area north of Northeastern’s main campus, around Huntington Avenue.
Question
Please replicate the map below! Can you get Huntington Avenue to show up in the legend?
To start off, you’ll need the following data:
# Outlines of every street in Fneway
mystreets <- read_sf("streets_fenway.geojson")
# Shapes of every building in Feneway
mybuildings <- read_sf("buildings_fenway.geojson")
# Neighborhood polygon for Fenway
myfenway <- read_sf("neighborhoods.geojson") %>%
filter(name == "Fenway")And use the following code to get started:
# First, do [SOMETHING] before plotting to get
# the line data that you need
# Then visualize
ggplot() +
###########################
# ADD SEVERAL LAYERS HERE
###########################
# Then add on top a nice, clear theme
theme_void() +
# And crop the map to the following box
coord_sf(xlim = c(-71.092, -71.086),
ylim = c(42.339, 42.344))
Answer
# First, zoom into get just Huntington Avenue
huntington <- mystreets %>%
filter(street == "Huntington")
# Then visualize
ggplot() +
# Map a blue polygon with a thick black outline to show Fenway area
geom_sf(data = myfenway, fill = "steelblue", color = "black", size = 1) +
# Draw thick white street lines
geom_sf(data = mystreets, size = 2, color = "white") +
# Draw dark blue buildings with black outlines
geom_sf(data = mybuildings, fill = "darkslateblue", color = "black") +
# Add the Huntington avenue line over top,
# With it's color based on the street variable
geom_sf(data = huntington, mapping = aes(color = street), size = 1) +
# Telling R to give it an 'orchid' color
scale_color_manual(values = "orchid") +
# Add a simple theme
theme_void(base_size = 24) +
# Zoomed into this box
coord_sf(xlim = c(-71.092, -71.086),
ylim = c(42.339, 42.344)) +
# With a basic distance scale
annotation_scale() +
# And some labels
labs(color = "Center of\nCampus",
subtitle = "Exploring Northeastern's Campus",
caption = "Shapes show buildings, and
white lines show streets.")
Task 3.4: More practice!
Wanting more practice with mapping? Make your own! This project’s files also contain several different types of geospatial data, all sourced from analyze.boston.gov or of my own making. Below, I summarize them and encourage you to check them out and build familiarity with mapping different data.
Points of Interest
bluebikes_stations.geojson: locations of BlueBikes docking stations throughout Boston.food_trucks.geojson: schedule and points for food trucks in Boston.dunkin.geojson: dataset of Dunkin Donuts locations in Boston.walk_to_nu.geojson: path from North Station to Northeastern University that maximizes visits by dog parks.neighborhoods.geojson: basic Boston neighborhood polygons
Elections
precincts.geojson: Boston voting precinctsboston_votes_data.csv: Boston data by precinct for presidential election in 2020.polling_places.geojsonpoint data for Boston precinct polling places.
Infrastructure
buildings_fenway.geojson: building polygons in Fenwaystreets_fenway.geojson: street lines in Fenwaystreets.geojson: street lines for all of Bostonboston_train_lines.geojson: Boston train linesboston_train_stops.geojson: Boston train stops by nameuniversities.geojson: locations of all universities in Boston.
Environment
trees.geojson: locations of all public-managed trees in Boston.heat_index.geojson: hexagons estimating heat index for every neighborhood in Boston.
Census
block_groups.geojson: polygons for census block groupsblock_groups_census.csv: data for joining into census block groups. (Note: must convert geoid to character for successful join. Based on 2014-2019 American Community Survey averages. Some block groups lack data.)