Module 14: Building Your Portfolio

Reviewing Modeling and Visualization

Click here to run the code for this workshop yourself on RStudio.Cloud!

Welcome to Workshop 14: Final Review!

In this workshop, we will review all key workshop concepts relevant to modeling, simulation, and geospatial visualization, through several review problems. Answers are posted below.

These review problems are optional, but I strongly suggest that you work on them together with your group.

0. Load Packages

library(tidyverse) # data wrangling
library(viridis) # color palettes
library(moderndive) # extra regression functions
library(Zelig) # simulation
library(sf) # mapping
library(ggspatial) # adding scales to maps

Dataset 1: Legislator Partisanship

Our Data

In the famous Voteview dataset, political scientist Jeffrey Lewis and his colleagues measured the partisanship of all 7,825 US members of Congress from 1987 to 2021. Each of our 37,911 rows is a legislator during a specific session of congress (eg. 2017-2019), where partisanship places them on a scale from -1.0 (liberal) to 1.0 (conservative), based on how they often they voted for or against key liberal or democratic economic and redistributive policies (eg. social welfare, health care, etc.).

Let’s examine a lightly curated version of the VoteView dataset, saved in your files as voteview.csv. Load it in and filter to just legislators in New England states using the code below.

congress <- read_csv("voteview.csv") %>%
  # Filter to New England stats
  filter(state %in% c("MA", "ME", "NH", "VT", "CT", "RI")) %>%
  # Convert all categorical variables to factors
  mutate(chamber = factor(chamber), 
         party = factor(party),
         state = factor(state))

Learn more!

Citation: Lewis, Jeffrey B., Keith Poole, Howard Rosenthal, Adam Boche, Aaron Rudkin, and Luke Sonnet (2021). Voteview: Congressional Roll-Call Votes Database.

Task 1.1. Variation Explained

Question

Modeling: Use a regression model to predict the partisanship of each legislator (partisanship), controlling for the chamber they served in ("House" or "Senate"), their party ("Republican", "Democrat", "Other"), their age, and state ("MA", "ME", "NH", "VT", "CT", "RI").
Model Fit: What percentage of variation does your model explain? Is it a good or bad model? How do you know?

Answer

First, we make our model m1 with the lm() function, placing our outcome partisanship first, followed by our predictors, chamber, party, age, and state.

m1 <- congress %>%
  lm(formula = partisanship ~ chamber + party + age + state)

Next, we can use the moderndive package’s get_regression_summaries() function to extract the summary statistics for our model. (You can also view these with the summary() function.)

get_regression_summaries(m1)

r_squared	adj_r_squared	mse	rmse	sigma	statistic	p_value	df	nobs
0.825	0.824	0.0246135	0.1568869	0.157	1501.086	0	9	2876

Percentage of Variation: Our model explains 82.5% of the variation in legislator partisanship. We know this from our R² statistic (r_squared). This is an excellent model; that’s very close to 100%, which would mean a perfect model.

Task 1.2. F-statistic

Question

Evaluate your model’s F-statistic. Is it statistically significant? How do you know? What does it tell us about our model?

Answer

We can use the moderndive package’s get_regression_summaries() function to extract the summary statistics for our model. The statistic column lists the F-statistic. The p_value column lists the F-statistic’s p-value. (You can also view these with the summary() function.)

get_regression_summaries(m1) %>%
  select(statistic, p_value)

statistic	p_value
1501.086	0

F-statistic: Our model has an F-statistic of 1501.086. This is really extreme! We know because we can look at the p-value, which is less than 0.001. This is statistically significant, at a level below p < 0.05, meaning it is more extreme than 95% of the F-statistics we would get if this were due to chance.

The F-statistic is a ratio of how much variation in our outcome that was explained by our model, compared to how much variation was not explained (taking into account the number of cases analyzed and variables used). An F-statistic of 0 means your model is not much better than it is bad, so to speak; in contrast, an F-statistic of 3, 4, 5, or as high as 1501.086 means you explain way, way, way more variation than you fail to explain. Check out this workshop for more info.

Task 1.3. Residuals

Using your model m1 from Task 1, please complete the following steps:

Question

Residuals: Gather your model’s regression model’s residuals in a data.frame, together with your observed outcome and predicted outcome. Save that data.frame as m1data.
Meaning: What does a residual mean in this model?

Answer

Residuals: Gather your model’s regression model’s residuals in a data.frame, together with your observed outcome and predicted outcome. Save that data.frame as m1data.

Using get_regression_points(), we can view these data for our model m1. Our observed outcome is partisanship; our predicted outcome is partisanship_hat; our residual is residual (partisanship - partisanship_hat).

m1data <- m1 %>% get_regression_points()

# Let's view the first 3 rows
m1data %>% head(3)

ID	partisanship	chamber	party	age	state	partisanship_hat	residual
1	0.304	House	Republican	53	CT	0.345	-0.041
2	0.477	House	Republican	64	CT	0.359	0.118
3	-0.208	House	Democrat	57	CT	-0.340	0.132
4	0.437	House	Republican	68	CT	0.364	0.073
5	0.387	House	Republican	49	ME	0.332	0.055
6	-0.187	House	Other	61	ME	-0.295	0.108

Meaning: A residual refers to how much greater is the observed outcome than the predicted outcome for a given case. Our model has as many predictions and residuals as it has observations (n = 2876). See below!

Task 1.4. Residuals Visualized

Question

Visualize: Extract your model’s residuals and save it as a data.frame called m1data. Visualize your model’s residuals as a histogram. Examine your plot. What is the average residual? What are relatively rare or extreme residuals for this model?

Answer

We can extract our model’s residuals using get_regression_points() from the moderndive package.

m1data <- m1 %>% 
  get_regression_points()

Then, we can visualize them as a simple histogram.

m1data %>%
  ggplot(mapping = aes(x = residual)) +
  geom_histogram(color = "white", fill = "dodgerblue") +
  theme_bw(base_size = 24)

The average residual is about 0 - meaning most observations are nearly perfectly predicted (yay!). The most extreme or rarely occurring residuals are those above 0.40 or below -0.40. We might guess that 95% of residuals are within this range, meaning our predictions are almost never off by more than either amount.

Task 1.5. Sigma

Question

Sigma: What is sigma, and how does it relate to your residuals? What does it tell us about our model?

Answer

sigma is a single number represent the residual standard error of our model. It’s basically the standard deviation of our residuals, meaning how much do our predictions vary from our observed outcomes on average.

get_regression_summaries(m1) %>%
  select(sigma)

sigma
0.157

It tends to be very, very close to the standard deviation of residuals.

get_regression_points(m1) %>%
  summarize(sd = sd(residual))

sd
0.1569074

Our partisanship score ranges from -1 to 1, but our model’s predictions tend be off by just plus or minus 0.157! This is great news, and indicates a very, very good model. Check out this workshop for more info..

Simulation tie-in: We use sigma in simulation to tell Zelig how much error occurs in our model on average.

Task 1.6: Coefficients

Question

Finally, please examine the coefficient table for your model m1, and answer the following question:

Model equation: What is a model equation, and what is the specific model equation for your model?

Answer

We can get the coefficient table using get_regression_table() from the moderndive package.

get_regression_table(m1)

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	-0.411	0.016	-25.486	0.000	-0.442	-0.379
chamber: Senate	-0.013	0.007	-1.802	0.072	-0.026	0.001
party: Other	0.049	0.029	1.703	0.089	-0.007	0.105
party: Republican	0.690	0.006	108.228	0.000	0.678	0.703
age	0.001	0.000	4.381	0.000	0.001	0.002
state: MA	-0.005	0.008	-0.589	0.556	-0.021	0.011
state: ME	-0.009	0.011	-0.819	0.413	-0.030	0.012
state: NH	0.055	0.012	4.789	0.000	0.033	0.078
state: RI	0.021	0.011	1.878	0.060	-0.001	0.043
state: VT	-0.058	0.012	-4.628	0.000	-0.082	-0.033

Our model equation is listed in the estimate column, and we can write it out like so:

$ Partisanship_{predicted} = -0.411 - 0.013 Chamber_{senate} + 0.049 Party_{Other} + $

$ 0.690 Party_{Republican} + 0.001 Age + $

$-0.005 \times State_{MA} -0.009 \times State_{ME} + 0.055 \times State_{NH} +$

$ 0.021 State_{RI} - 0.058 State_{VT} $

Task 1.7. Alpha

Question

Alpha: What is the alpha coefficient in your model table? What specifically does it tell us about legislators’ partisanship? (Remember that there are categorical variables in this model.)

Answer

Alpha: The alpha coefficient (shown by the intercept row) describes how much of the outcome our model projects on average if all other predictors are set to 0. In a model with categorical predictors, it means the average outcome for a city with the baseline category for each categorical variable, and all other predictors are set to 0.

m1 %>% 
  get_regression_table() %>%
  filter(term == "intercept")

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	-0.411	0.016	-25.486	0	-0.442	-0.379

In our model, the baseline categories are "House" for chamber, "Democrat" for party, and "CT" for state. Our model projects an average partisanship score of for a Democratic member of the House from Connecticut, who is zero years old (age == 0)!

Task 1.8. Beta

Question

Beta: What is the beta coefficient in your model table? Is there just one, or more than one? What specifically does it tell us about legislators’ partisanship?

Answer

A Beta coefficient shows the projected increase in the outcome, partisanship, as that coefficient’s predictor increases by 1, holding all else constant.

In other words, the beta coefficient is the slope of our regression model equation. But in multiple regression, we have not 1, but 9 beta coefficients.

Our beta coefficients are written in the estimate column.

m1 %>% 
  get_regression_table() %>%
  filter(term != "intercept")

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
chamber: Senate	-0.013	0.007	-1.802	0.072	-0.026	0.001
party: Other	0.049	0.029	1.703	0.089	-0.007	0.105
party: Republican	0.690	0.006	108.228	0.000	0.678	0.703
age	0.001	0.000	4.381	0.000	0.001	0.002
state: MA	-0.005	0.008	-0.589	0.556	-0.021	0.011
state: ME	-0.009	0.011	-0.819	0.413	-0.030	0.012
state: NH	0.055	0.012	4.789	0.000	0.033	0.078
state: RI	0.021	0.011	1.878	0.060	-0.001	0.043
state: VT	-0.058	0.012	-4.628	0.000	-0.082	-0.033

Our model’s beta coefficients tell us that:

chamber: Senate: A Senator tends to be 0.013 points less conservative than a member of the House (baseline category).
party: Other: A member of a third party tends to be 0.049 points more conservative than a Democrat (baseline category).
party: Republican: A Republican tends to be 0.690 points more conservative than a Democrat (baseline category).
age: As age increases by 1 year, legislators tend to become 0.001 points more conservative.
state: MA: A legislator from Massachusetts tends to be 0.005 points less conservative than one from Connecticut.
state: ME: A legislator from Maine tends to be 0.009 points less conservative than one from Connecticut.
state: NH: A legislator from New Hampshire tends to be 0.055 points more conservative than one from Connecticut.
state: RI: A legislator from Rhode Island tends to be 0.021 points more conservative than one from Connecticut.
state: MA: A legislator from Vermont tends to be 0.058 points less conservative than one from Connecticut.

Task 1.9: Prediction

Question

Using your model equation, please calculate by hand the predicted partisanship score for a 30 year old Massachusetts Democrat member of the House.

Answer

We can plug the following scores into our model to predict our partisanship score.

# Intercept (baseline conditions)
-0.411 + 
  # Chamber = House, so ChamberSenate = 0
  -0.013 * 0 +
  # Party = Democrat, so PartyOther = 0
  0.049 * 0 + 
  # Party = Democrat, so PArtyRepublican = 0
  0.690 * 0  + 
  # Age = 30
  0.001 * 30 + 
  # State = MA, so StateMA = 1
  -0.005 * 1 + 
  # State = MA, so StateME = 0
  0.009 * 0 +
  # State = MA, so StateNH = 0
  0.055 * 0 + 
  # State = MA, so StateRI = 0
  0.021 * 0 + 
  # State = MA, so StateVT = 0
  0.058 * 0

## [1] -0.386

Our model projects that a 30 year old Massachusetts Democrat member of the House would have a predicted partisanship score of -0.386, the value of a moderate-left liberal.

Task 1.10: Significance

Question

What does it mean for an alpha or beta coefficient to be statistically significant?
Which of our coefficients are statistically significant? How do you know?

Answer

Statistical significance refers to how extreme our statistic is compared to the range of statistics we would get if it were due to chance. If our statistic is really extreme, then we are pretty sure it didn’t happen due to chance. We can probably trust that result to reappear if we repeated the study.
Which of our coefficients are significant? We can look at the p_value column and determine which have p-values below 0.05.

m1 %>%
  get_regression_table()

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	-0.411	0.016	-25.486	0.000	-0.442	-0.379
chamber: Senate	-0.013	0.007	-1.802	0.072	-0.026	0.001
party: Other	0.049	0.029	1.703	0.089	-0.007	0.105
party: Republican	0.690	0.006	108.228	0.000	0.678	0.703
age	0.001	0.000	4.381	0.000	0.001	0.002
state: MA	-0.005	0.008	-0.589	0.556	-0.021	0.011
state: ME	-0.009	0.011	-0.819	0.413	-0.030	0.012
state: NH	0.055	0.012	4.789	0.000	0.033	0.078
state: RI	0.021	0.011	1.878	0.060	-0.001	0.043
state: VT	-0.058	0.012	-4.628	0.000	-0.082	-0.033

Our significant effects include: (1) the intercept (p < 0.001), (2) being Republican (p < 0.001), (3) age (p < 0.001), (4) being from New Hampshire (p < 0.001), and (5) being from Vermont (p < 0.001). Other effects are also somewhat significant, but we’re not 95% confident or more about their significance.

Task 1.11: Standard Error

Question

How do the statistic, std_error, and p_value columns help us determine statistical significance?

Answer

Our three terms of interest mean:

std_error: the standard deviation of the null distribution (for any statistic, the hypothetical distribution of statistics we would get if our statistic was off just a bit due to random chance.)
statistic: a standardized version of our coefficient (the estimate column), ranging from -Infinity to +Infinity. (Literally equals the estimate divided by the standard error.)
p_value: the percentage (written as a decimal) of statistics in the null distribution greater than the statistic we actually got. If your percentage is high, your observed statistic is not very extreme. If your percentage is low, your observed statistic is pretty extreme compared to the rest. We use the threshold p < 0.05 to indicate 95% confidence that our statistic was not just due to chance.

Task 1.12: Confidence Intervals

Question

What do upper_ci and lower_ci in your get_regression_table() output mean, with regards to the effect of being Republican on a legislator’s partisanship?
You can approximate a 95% confidence interval from just the estimate and 1 other vector in your get_regression_table() output, using a quick rule of thumb. What is that rule of thumb? Calculate by yourself the 95% confidence interval of the effect of being Republican on a legislator’s partisanship.

Answer

upper_ci and lower_ci depict the 2.5th and 97.5th percentiles of the null distribution for a statistic (in this case, for our coefficient). That is to say, even if our estimate was off due to chance, 95% of the time, it would be somewhere in that zone! We are 95% confident that the true effect of being Republican on a legislator’s partisanship is an increase of between 0.678 and 0.703 points.

m1 %>%
  get_regression_table() %>%
  filter(term == "party: Republican")

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
party: Republican	0.69	0.006	108.228	0	0.678	0.703

You can approximate a 95% confidence interval for any statistic using just the estimate and the std_error, if you multiply the std_error by about 2 (it’s ~1.96, depending on sample size).

m1 %>%
  get_regression_table() %>%
  filter(term == "party: Republican") %>%
  select(term, estimate, std_error, lower_ci, upper_ci) %>%
  # Compare
  mutate(mylower = estimate - std_error * 1.96,
         myupper = estimate + std_error * 1.96)

term	estimate	std_error	lower_ci	upper_ci	mylower	myupper
party: Republican	0.69	0.006	0.678	0.703	0.67824	0.70176

Task 1.13: Simulation

Question

Using Zelig, please simulate the expected effect of age on legislator partisanship, as an average legislators’ age increases from 30 to 40 to 50 to 60 to 70 to 80 years old. Calculate 95% confidence intervals and visualize your simulations as a ribbon plot with ggplot(). (Note: You must have already converted your categorical variables in m1 to be factors for Zelig to work.)

Answer

Use the code below to simulate in Zelig!

mysim <- m1 %>%
  # Convert model to zelig format
  to_zelig() %>%
  # For an average legislator who is 30, 40, 50... years old,
  setx(age = c(30, 40, 50, 60, 70, 80)) %>%
  # Simulate expected partisanship
  sim() %>%
  # Convert to data.frame
  zelig_qi_to_df() %>%
  # Grab just age and expected value columns
  select(age, ev = expected_value) %>%
  # For each age (30, 40, 50, 60, etc.)
  group_by(age) %>%
  # Give me confidence intervals and median expected value
  summarize(lower_ci = quantile(ev, probs=  0.025),
            median = quantile(ev, probs = 0.50),
            upper_ci = quantile(ev, probs = 0.975))

Then visualize it in ggplot like so!

mysim %>%
  # Add aesthetics
  ggplot(mapping = aes(x = age, y = median, 
                       ymin = lower_ci, ymax = upper_ci)) +
  # Plot confidence intervals
  geom_ribbon(fill = "cyan", alpha = 0.5, color = "black") +
  # Plot median expected value with a dashed line
  geom_line(linetype = "dashed", size = 1) +
  # Add a nice theme
  theme_bw() +
  # Add labels!
  labs(x = "Age (years)",
       y = "Expected Partisanship for Average Legislator\n(-1 = Liberal, +1 = Conservative)",
       subtitle = "Expected Effect of Age on Partisanship",
       caption = "Based on 1000 simulations in the Zelig package in R.")

remove(congress, m1)

Dataset 2: Wine Rankings

Our Data

Zack Thoutt famously webscraped 130,000 wine reviews from WineEnthusiast magaine on November 22, 2017. He wondered if he could build a predictive model that would identify good wines as accurately as a master sommelier (wine tasting expert).

Drawing from his 96,420 complete reviews, I lightly curated the data to focus on the rating, price, variety, country, province, and taster. Each wine was rated on a scale from 0 (lowest) to 100 (highest). And, each wine is classified by the top 50 most frequently occurring provinces and varieties of wine. We’re going to use a random sample of 10,000 of these reviews to build that model and simulate the ranking habits of each sommelier! Please import these reviews from your project’s files, save as wind_reviews.csv.

mywines <- read_csv("wine_reviews.csv") %>%
  # Convert categories to factor
  mutate(country = factor(country),
         variety = factor(variety),
         taster = factor(taster))

We can glimpse() the dataset to see its contents below!

mywines %>% 
  glimpse()

## Rows: 10,000
## Columns: 7
## $ wine     <chr> "Monte Novo e Figueirinha 2007 Herdade da Figueirinha Reserva…
## $ points   <dbl> 81, 91, 87, 90, 87, 88, 89, 90, 94, 94, 85, 89, 87, 86, 88, 9…
## $ price    <dbl> 9, 54, 20, 48, 34, 36, 15, 25, 50, 32, 14, 28, 40, 7, 13, 33,…
## $ variety  <fct> Portuguese Red, Pinot Noir, Other, Chardonnay, Pinot Noir, Pi…
## $ country  <fct> Portugal, US, Italy, US, US, US, US, US, US, US, South Africa…
## $ province <chr> "Alentejano", "California", "Northeastern Italy", "California…
## $ taster   <fct> Roger Voss, Virginie Boone, Kerin O'Keefe, Matt Kettmann, Pau…

Task 2.1: Simulation with Categorical Variables

Question

Modeling: Use a regression model to predict how highly a wine is rated (points), controlling for the wine’s price, its country of origin, its variety, and the taster who rated it.
Simulation: Then, use statistical simulation to answer the following question: Holding all else constant, does our model project higher expected ratings for Greek, Italian, Spanish, Portguese, French, or German wines? Use geom_linerange() to visualize your simulation.

Answer

# First, make the model
m2 <- mywines %>%
  lm(formula = points ~ price + country + variety + taster)

# Second, simulate these
mysim <- m2 %>%
  to_zelig() %>%
  # List European countries in sample
  setx(country = c("Greece","Italy", "Spain", 
                   "Portugal", "France", "Germany")) %>%
  sim() %>%
  zelig_qi_to_df() %>%
  select(country, ev = expected_value) %>%
  # For each country
  group_by(country) %>%
  # Give me confidence intervals and median expected value
  summarize(lower_ci = quantile(ev, probs=  0.025),
            median = quantile(ev, probs = 0.50),
            upper_ci = quantile(ev, probs = 0.975))

# Visualize
mysim %>%
  ggplot(mapping = aes(x = reorder(country, median), y = median, 
                       ymin = lower_ci, ymax = upper_ci,
                       color = median)) +
  # Giving a line for each country
  geom_linerange(size = 1.25) +
  geom_point(size = 5) +
  scale_color_viridis(option = "plasma", begin = 0, end = 0.8) + 
  # And flipping the axes
  coord_flip() +
  theme_bw(base_size = 24) +
  labs(y = "Median Expected Wine Rating (0-100)\nwith 95% Confidence Intervals)",
       color = "Median\nExpected\nValue",
       x = "Country")

Task 2.2: Simulation with Multiple Conditions

Question

These wines were rated by 19 top sommeliers. Does personality matter? Gather the names of the 5 most prolific tasters (who tasted the most wines in your sample), and then simulate how highly they each rank an average Italian wine vs. an average Greek wine. At the end, you should produce a data.frame of confidence intervals with 10 rows, and visualize it in two facets, one for Italian wines and one for Greek wines.

Answer

The top 5 most prolific wine tasters in our sample are Roger Voss, Michael Schachner, Kerin O’Keefe, Virginie Boone, and Paul Gregutt.

mywines %>%
  group_by(taster) %>%
  count() %>%
  arrange(desc(n))

## # A tibble: 18 × 2
## # Groups:   taster [18]
##    taster                 n
##    <fct>              <int>
##  1 Roger Voss          2118
##  2 Michael Schachner   1564
##  3 Virginie Boone      1006
##  4 Paul Gregutt         989
##  5 Kerin O'Keefe        974
##  6 Matt Kettmann        627
##  7 Joe Czerwinski       503
##  8 Sean P. Sullivan     495
##  9 Anna Lee C. Iijima   454
## 10 Jim Gordon           448
## 11 Anne Krebiehl MW     371
## 12 Lauren Buzzeo        177
## 13 Susan Kostrzewa       99
## 14 Mike DeSimone         73
## 15 Alexander Peartree    49
## 16 Jeff Jenssen          37
## 17 Carrie Dykes          13
## 18 Fiona Adams            3

Then, we can simulate them like so!

# Second, simulate these
mysim <- m2 %>%
  to_zelig() %>%
  # List top 5 most prolific tasters
  setx(taster = c("Roger Voss", "Michael Schachner",
                   "Kerin O'Keefe", "Virginie Boone", "Paul Gregutt",
                  "Roger Voss", "Michael Schachner",
                   "Kerin O'Keefe", "Virginie Boone", "Paul Gregutt"),
       country = c("Italy", "Italy", "Italy", "Italy", "Italy",
                   "Greece","Greece","Greece","Greece","Greece")) %>%
  sim() %>%
  zelig_qi_to_df() %>%
  select(taster, country, ev = expected_value) %>%
  # For each country
  group_by(taster, country) %>%
  # Give me confidence intervals and median expected value
  summarize(lower_ci = quantile(ev, probs=  0.025),
            median = quantile(ev, probs = 0.50),
            upper_ci = quantile(ev, probs = 0.975))

mysim %>%
  ggplot(mapping = aes(x = reorder(taster, median), y = median, 
                       ymin = lower_ci, ymax = upper_ci)) +
  geom_errorbar(width = 0.5) +
  geom_point(size = 5) +
  facet_wrap(~country) +
  coord_flip() +
  theme_bw(base_size = 24) +
  labs(x = "Top 5 Wine Tasters",
       y = "Median Expected Wine Rating (0-100)\nwith 95% Confidence Intervals",
       caption = "Based on 1000 simulations in the Zelig package in R.",
       subtitle = "Who's the toughest judge?")

Our results show that Paul Gregutt and Virginie Boone tend to give more favorable scores, while Michael Schachner tends to give lower scores. Notably, our model projects that Kerin O’Keefe would give a wide range of scores to Greek wines, but narrower scores to Italian wines.

Note: These results are just for fun and should not be used to evaluate any competitions. This model, after all, has a pretty low R2. But I am eager to see datasets like this come out for the Great British Baking Show and other pop competitions!

Task 2.3: Expected vs. Predicted Values

Question

What’s an expected value, and how is it different from a predicted value? When would we use an expected vs. predicted value? Compare the distributions of expected and predicted values generated for an otherwise average wine that costs $100.

Answer

Predicted values and expected values are two types of outputs you can get from a statistical simulation.
Predicted values and expected values both account for estimation uncertainty, but account for fundamental uncertainty in different ways.
Predicted values tend to have wide confidence intervals, while expected values have narrow confidence intervals.
Predicted values are great for forecasting the future conservatively, while expected values are generally better but should mostly be used for explaining the past.

m2 %>%
  to_zelig() %>%
  setx(price = 100) %>%
  sim() %>%
  zelig_qi_to_df() %>%
  select(price, ev = expected_value, pv = predicted_value) %>%
  pivot_longer(cols = c(ev, pv), names_to = "type", values_to = "value") %>%
  ggplot(mapping = aes(x = type, y = value, color = type)) +
  geom_jitter(alpha = 0.25) + 
  geom_violin() +
  guides(color = "none") +
  theme_bw(base_size = 24) +
  labs(x = "Type of Output", y = "1000 Simulated Wine Ratings",
       subtitle = "Simulated Wine Rating for a $100 wine",
       caption = "Based on 1000 simulations in the Zelig package in R.")

Task 2.4: Uncertainty

Question

What’s the difference between estimation uncertainty and fundamental uncertainty? How do predicted and expected value account for each?

Answer

They both account for estimation uncertainty, the idea that our beta coefficients might be slightly off due to limited sample size, by drawing 1000 rows of slightly beta coefficients, each very close to our original coefficients, but just slightly off, and calculating a simulated outcome from each of them.
To get a predicted value, you take a simulated outcome, estimate the range of simulated outcomes you’d get due to fundamental uncertainty, and randomly pick 1 from that distribution.
To get an expected value, you take a simulated outcome, estimate the range of simulated outcomes you’d get due to fundamental uncertainty, randomly pick 1000 from that distribution, and average them to get 1 expected value.

remove(mywines, m2)

Dataset 3: Mapping Boston

Boston City Government has collected dozens of excellent geospatial data on analyze.boston.gov. Your project files contain many different .geojson files we can use to explore geographic trends in Boston in the next several tasks, to practice geographic visualization.

Below, you will be challenged to reverse engineer several maps, using your mapping toolkit.

Task 3.1 Donuts and Democracy

Question

Replicate the map of central Boston voting precincts below.

To start off, you’ll need the following data:

# Locations of Dunkin Donuts shops
mydonuts <- read_sf("dunkin.geojson")
# Train Lines
mytrains <- read_sf("boston_train_lines.geojson")
# Polygons for Boston Precincts, with unique identifier ward_precinct code
myprecincts <- read_sf("precincts.geojson")
# Data.frame of Voter Turnout for each Precinct, 
# with unique identifer ward_precinct code
mydata <- read_csv("boston_votes_data.csv") %>%
  filter(type == "BALLOTS CAST")

And use the following code to get started:

# First, do [SOMETHING] before plotting to get 
# the voter turnout data that you need

# Second, visualize!
ggplot() +
  ###########################
  # ADD SEVERAL LAYERS HERE
  ###########################  

  # Then add on top a nice, clear theme
  theme_void(base_size = 24) +
  # And crop the map to the following box
  coord_sf(xlim = c(-71.11, -71.05),
           ylim = c(42.335, 42.37))

Answer

# Take your precinct polgyons
myvotes <- myprecincts %>%
  # Join in the voter turnout using 
  # the shared unique identifier ward_precinct in mydata
  left_join(by = "ward_precinct", y = mydata)


ggplot() +
  # Give your precincts a nice, big grey outline
  geom_sf(data = myvotes, color = "grey", size = 5) +
  # Then overlap that with your precincts, shaded by voter turnout
  geom_sf(data = myvotes, mapping = aes(fill = percent), 
          # With a nice thin white border for each precinct
          color = "white", size = 0.25) +
  # Shade the polygons using a viridis plasma color scale
  scale_fill_viridis(option = "plasma") +
  geom_sf(data = mytrains, size = 2, color = "white") +
  geom_sf(data = mytrains, size = 1, color = "black") +
  # Use shape = 21 to get points with fill AND color options
  geom_sf(data = mydonuts, shape = 21, 
          # Make the inside white and the outline black
          fill = "white", color = "black", 
          # and make each dot pretty big
          size = 4) +
   # Add a nice, clear theme
  theme_void() +
  # And crop the map to the following box
  coord_sf(xlim = c(-71.11, -71.05),
           ylim = c(42.335, 42.37)) +
  # Add some labels!
  labs(title = "Donuts and Democracy",
       subtitle = "Need some caffeine before you head to the polls?",
       fill = "% Voter\nTurnout",
       caption = "Points show Dunkin Donuts shops. Lines show T lines. 
       Shapes show Boston voting precincts for 2020 Election")

Task 3.2 Wicked Hot Boston

Question

Researchers at the Boston Museum of Science and the Helmbuth Lab at Northeastern University measured temperatures in Boston neighborhoods to identify ‘heat islands,’ meaning areas disproportionately affected by high heat in the summer. Using their data, saved in the heat_index variable in heat_index.geojson, please replicate the map below.

To start off, you’ll need the following data:

# Neighborhood polygons
myareas <- read_sf("neighborhoods.geojson")
# Street lines in Boston
mystreets <- read_sf("streets.geojson")
# Hexagons measuring heat index
myhex <- read_sf("heat_index.geojson")
# Points for all universities in Boston
myuni <- read_sf("universities.geojson")

And use the following code to get started:

# First, do [SOMETHING] before plotting to get 
# the point data that you need

# Then visualize
ggplot() +
  ###########################
  # ADD SEVERAL LAYERS HERE
  ###########################  

  # Then add on top a nice, clear theme
  theme_void(base_size = 24) +
  # And crop the map to the following box
  coord_sf(xlim = c(-71.11, -71.05),
           ylim = c(42.32, 42.36))

Answer

# Zoom into just the three schools shown in the visual
mythreeschools <- myuni %>%
  filter(name %in% c("Northeastern University", 
                     "Suffolk University", 
                     "Berklee College of Music"))

# Then visualize
ggplot() +
  # Add a nice thick black background based on our neighborhoods
  geom_sf(data = myareas, color = "black", size = 10) +
  # layer ontop our Heat Index hexagons!
  geom_sf(data = myhex, mapping = aes(fill = heat_index), color = NA) +
  # With a logged viridis color scale
  scale_fill_viridis(option = "viridis", trans = "log") + 
  # layer over that thin white street lines
  geom_sf(data = mystreets, color = "white", size = 0.2) +
  # layer over that bold black neighborhood lines with a clear fill
  geom_sf(data = myareas, fill = NA, color = "black", size = 1) +
  # Plop our school points on top, as nice white points
  geom_sf(data = mythreeschools, color = "white", size = 7) +
  # and slightly smaller points, so we get an outline effect
  geom_sf(data = mythreeschools, color = "black", size = 5) +
  # Then add on top a nice, clear theme
  theme_void() +
  # Zoom into a good box
  coord_sf(xlim = c(-71.11, -71.05),
           ylim = c(42.32, 42.36)) +
  # And add informative labels
  labs(
    title = "Urban Heat Islands in Boston",
    color = "Universities",
    fill = "Heat\nIndex",
    caption = "Hexagons show mean air temperature at 3 pm, July 29-30, 2019. 
    White lines depict streets. Black lines depict neighborhoods.
    Points depict schools (Northeastern, Suffolk, and Berkley).
    Data from Wicked Hot Boston Project & Analyze.Boston.gov")

Task 3.3 Mapping Northeastern University

Thanks to the City of Boston and their immense GIS data archives, we can actually visualize every single building in Boston! Below, we zoom into the area north of Northeastern’s main campus, around Huntington Avenue.

Question

Please replicate the map below! Can you get Huntington Avenue to show up in the legend?

To start off, you’ll need the following data:

# Outlines of every street in Fneway
mystreets <- read_sf("streets_fenway.geojson")
# Shapes of every building in Feneway
mybuildings <- read_sf("buildings_fenway.geojson")
# Neighborhood polygon for Fenway
myfenway <- read_sf("neighborhoods.geojson") %>%
  filter(name == "Fenway")

And use the following code to get started:

# First, do [SOMETHING] before plotting to get 
# the line data that you need

# Then visualize
ggplot() +
  ###########################
  # ADD SEVERAL LAYERS HERE
  ###########################  

  # Then add on top a nice, clear theme
  theme_void() +
  # And crop the map to the following box
  coord_sf(xlim = c(-71.092, -71.086),
           ylim = c(42.339, 42.344))

Answer

# First, zoom into get just Huntington Avenue
huntington <- mystreets %>%
  filter(street == "Huntington")

# Then visualize
ggplot() +
  # Map a blue polygon with a thick black outline to show Fenway area
  geom_sf(data = myfenway, fill = "steelblue", color = "black", size = 1) +
  # Draw thick white street lines
  geom_sf(data = mystreets, size = 2, color = "white") +
  # Draw dark blue buildings with black outlines
  geom_sf(data = mybuildings, fill = "darkslateblue", color = "black") +
  # Add the Huntington avenue line over top,
  # With it's color based on the street variable
  geom_sf(data = huntington, mapping = aes(color = street), size = 1) +
  # Telling R to give it an 'orchid' color
  scale_color_manual(values = "orchid") +
  # Add a simple theme
  theme_void(base_size = 24) +
  # Zoomed into this box
  coord_sf(xlim = c(-71.092, -71.086),
           ylim = c(42.339, 42.344)) +
  # With a basic distance scale
  annotation_scale() +
  # And some labels
  labs(color = "Center of\nCampus",
       subtitle = "Exploring Northeastern's Campus",
       caption = "Shapes show buildings, and 
       white lines show streets.")

Task 3.4: More practice!

Wanting more practice with mapping? Make your own! This project’s files also contain several different types of geospatial data, all sourced from analyze.boston.gov or of my own making. Below, I summarize them and encourage you to check them out and build familiarity with mapping different data.

Points of Interest

bluebikes_stations.geojson: locations of BlueBikes docking stations throughout Boston.
food_trucks.geojson: schedule and points for food trucks in Boston.
dunkin.geojson: dataset of Dunkin Donuts locations in Boston.
walk_to_nu.geojson: path from North Station to Northeastern University that maximizes visits by dog parks.
neighborhoods.geojson: basic Boston neighborhood polygons

Elections

precincts.geojson: Boston voting precincts
boston_votes_data.csv: Boston data by precinct for presidential election in 2020.
polling_places.geojson point data for Boston precinct polling places.

Infrastructure

buildings_fenway.geojson: building polygons in Fenway
streets_fenway.geojson: street lines in Fenway
streets.geojson: street lines for all of Boston
boston_train_lines.geojson: Boston train lines
boston_train_stops.geojson: Boston train stops by name
universities.geojson: locations of all universities in Boston.

Environment

trees.geojson: locations of all public-managed trees in Boston.
heat_index.geojson: hexagons estimating heat index for every neighborhood in Boston.

Census

block_groups.geojson: polygons for census block groups
block_groups_census.csv: data for joining into census block groups. (Note: must convert geoid to character for successful join. Based on 2014-2019 American Community Survey averages. Some block groups lack data.)