API-209 Problem Set #5

Due Date

November 5, 2024

IDENTIFICATION

1 - Your information

Last Name: Kathryn                               
First Name: Bankart

2 - Group Members (please list below the classmates you worked with on this problem set):

Group members: Muskaan Malhotra, Muhannad Alramlawi, Iris Seunga Ryu, Sebastian Rodriguez Leon, Fateh Farhan

3 - Compliance with Harvard Kennedy School Academic Code: Do you certify that your work in this problem set complies with the Harvard Kennedy School Academic Code (mark with an X below)?

                             [ X  ] YES                [   ] NO

INSTRUCTIONS

  1. Render this file in your computer as it is right now. If it does not render correctly, address the rendering problem immediately. First, consult the “Rendering Hints and Troubleshooting” guide. If these hints and recommendations do not work, please reach out to the teaching team. We will try to help you promptly. It will be much easier to identify what’s causing the rendering problem(s) immediately after they occur. For this reason, render often (i.e., every time you finish a sub-question).

  2. You will be asked to type your responses to this problem set in two types of boxes: text boxes (called “answer box”) and code boxes (where you type and run your R code).

  3. In an effort to assess the workload involved in this course, you will be asked to report on the time spent on each problem set question. Please record this time as you go along. We recommend using the timer on your smartphone (and pausing it every time you take a break).

  4. If you are just learning to code and get stuck, try to figure out yourself with the techniques you learned in math camp and in API-209. For some of the R questions, you might have access to hints. To enable you to develop your coding skills, we strongly suggest you don’t look at these hints unless you are really stuck. In real life you won’t have hints!

  5. Before submitting, make sure you render your qmd file with your answers one last time. You will be asked to submit your qmd file and your html rendered file.

QUESTION 1 – ESTIMATING IMPACTS ON STEPS, PART II

In your previous problem set, you were asked to describe what you would do with a data set to estimate the effects of a program to increase the number of steps people take a day. Most answers included a comparison of means between the treatment and control groups or a bivariate regression, and concluded that the program had a statistically significant effect of 2,793 steps. The goal of this problem set question is to help you see the importance of examining and visualizing the data before running any regressions.

  1. If properly conducted, the RCT helps ensure that the treatment and control groups are equivalent at baseline. Is there evidence in the data that supports this claim? To answer this question, compare the average number of steps at baseline between treatment and control groups. Is the difference between the two groups statistically significant? What does this say about the credibility of this RCT?

steps <- read.csv("Steps.csv")
str(steps)
'data.frame':   11363 obs. of  4 variables:
 $ treatment    : int  0 0 0 0 1 0 1 1 0 1 ...
 $ BaselineSteps: int  21396 21340 23912 16862 147738 18780 41930 28993 78382 58500 ...
 $ PostSteps    : int  63511 37460 59136 59292 167895 49811 80323 63707 127817 81081 ...
 $ StepChange   : int  42115 16120 35224 42430 20157 31031 38393 34714 49435 22581 ...
summary(steps)
   treatment      BaselineSteps      PostSteps     
 Min.   :0.0000   Min.   : 10001   Min.   : 10800  
 1st Qu.:0.0000   1st Qu.: 37490   1st Qu.: 62609  
 Median :0.0000   Median : 64400   Median : 90069  
 Mean   :0.4713   Mean   : 72861   Mean   : 97813  
 3rd Qu.:1.0000   3rd Qu.:100000   3rd Qu.:127213  
 Max.   :1.0000   Max.   :199938   Max.   :245951  
   StepChange   
 Min.   :   10  
 1st Qu.:12299  
 Median :25108  
 Mean   :24952  
 3rd Qu.:37419  
 Max.   :49997  
baseline_means <- steps %>%
  group_by(treatment) %>%
  summarize(mean_baseline_steps = mean(BaselineSteps, na.rm = TRUE), 
            sd_baseline_steps = sd(BaselineSteps, na.rm = TRUE), 
            n = n())

print(baseline_means)
# A tibble: 2 × 4
  treatment mean_baseline_steps sd_baseline_steps     n
      <int>               <dbl>             <dbl> <int>
1         0              75512.            44921.  6008
2         1              69887.            43132.  5355
t_test_result <- t.test(BaselineSteps ~ treatment, data = steps)

print(t_test_result)
Welch Two Sample t-test

data:  BaselineSteps by treatment
t = 6.8048, df = 11298, p-value = 1.063e-11
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 4004.569 7245.137
sample estimates:
mean in group 0 mean in group 1 
       75511.95        69887.10 

Answer:

The difference in baseline steps between the treatment and control groups is statistically significant: the control group averages about 75,512 steps at baseline and the treatment group about 69,887, a gap of roughly 5,600 steps. The t-test printed above gives a t-statistic of 6.80 (far above 2) and a p-value of about 1e-11, so we can reject the null hypothesis that the two means are equal at the 99% confidence level. This undermines the credibility of the RCT, specifically the randomisation of treatment and control assignment: on average, less active people somehow ended up in the treatment group.
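As a cross-check, the t-statistic can be reproduced by hand from the group means, standard deviations, and sample sizes printed above. This is a minimal sketch: the variable names are introduced here for illustration, the standard deviations are the rounded values from the table, and the p-value uses a normal approximation rather than the exact t distribution.

```r
# Summary statistics copied from the baseline_means table and t-test output above
mean_c <- 75511.95; sd_c <- 44921; n_c <- 6008   # control (treatment == 0)
mean_t <- 69887.10; sd_t <- 43132; n_t <- 5355   # treatment (treatment == 1)

# Welch standard error: the two group variances are not pooled
se_diff <- sqrt(sd_c^2 / n_c + sd_t^2 / n_t)

# t statistic for the difference in means (control minus treatment)
t_stat <- (mean_c - mean_t) / se_diff

# Two-sided p-value; with roughly 11,000 degrees of freedom the normal
# approximation is used here instead of the exact t distribution
p_value <- 2 * pnorm(-abs(t_stat))

round(c(difference = mean_c - mean_t, se = se_diff, t = t_stat), 3)
p_value
```

The hand-computed statistic should agree with the `t.test()` output up to rounding of the inputs.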


  1. Now let’s focus on the outcome variable (StepChange), which measures the number of steps that people took in the final week of the program. What would you expect to be the distribution of this variable? Create a histogram of this variable. Is this what you expected? What does this say about the credibility of the outcome variable used in this RCT?

library(ggplot2)

ggplot(steps, aes(x = StepChange)) +
  geom_histogram(binwidth = 1000, color = "black", fill = "skyblue") +
  labs(title = "Distribution of Step Change",
       x = "Step Change (PostSteps - BaselineSteps)",
       y = "Frequency") +
  theme_minimal()

Answer:

I would have expected StepChange to be approximately normally distributed, with most observations concentrated around the mean. Instead, the histogram is almost flat: the values look close to uniformly distributed between roughly 0 and 50,000 (the summary above shows a minimum of 10 and a maximum of 49,997), with almost no tails. A pattern like this is hard to reconcile with real behaviour and suggests the results might be fabricated.
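One rough way to quantify this impression is to compare the quartiles reported by `summary(steps)` with the quartiles a uniform distribution would imply. The sketch below assumes bounds of 0 and 50,000 (chosen from the observed minimum and maximum) and, for contrast, a normal distribution with the same mean and the standard deviation of that uniform; both reference distributions are assumptions made for illustration.

```r
# Quartiles of StepChange copied from the summary(steps) output above
observed <- c(q25 = 12299, q50 = 25108, q75 = 37419)

# Quartiles implied by a Uniform(0, 50000) distribution
# (the bounds are an assumption based on the observed min and max)
uniform_q <- qunif(c(0.25, 0.50, 0.75), min = 0, max = 50000)

# Quartiles of a normal distribution with the observed mean and the
# standard deviation of that uniform distribution (also an assumption)
normal_q <- qnorm(c(0.25, 0.50, 0.75), mean = 24952, sd = 50000 / sqrt(12))

round(rbind(observed, uniform_q, normal_q))
```

The observed quartiles sit much closer to the uniform row than to the normal row, which is consistent with the flat histogram.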


  1. What lessons do you draw from this exercise as you think about the next time you are about to start running regressions on a data set? [one short paragraph]

Answer:

It’s very important to investigate the data before running a regression. It’s a matter of checking on the integrity of the dataset based on the claims about it (in this case, that it is a legitimate, well-run RCT, where treatment and control were adequately randomised). Without doing this, we run the risk of claiming statistically significant impacts, when the underlying data may be problematic to begin with.


  1. [Optional] It turns out the data set you analyzed is a masked version of a data set that was used to produce a highly influential paper that argued that dishonesty can be reduced by asking people to sign a statement of honest intent before providing information (i.e., at the top of a document) rather than after providing information (i.e., at the bottom of a document). The paper was based on a field experiment conducted by an auto insurance company in the southeastern United States. Customers were asked to report the current odometer reading of up to four cars covered by their policy. They were randomly assigned to sign a statement indicating, “I promise that the information I am providing is true” either at the top or bottom of the form. Customers assigned to the ‘sign-at-the-top’ condition reported driving 2,400 more miles than those assigned to the ‘sign-at-the-bottom’ condition. This was seen as evidence that signing at the top could be a cheap and effective way of reducing dishonesty. The data set you analyzed last week is the one used in this paper except that we referred to the outcome as steps taken rather than miles driven, and we focused on the odometer reading of the first car only (hence the impact you found was slightly different than the one reported in the paper). The analysis you conducted above plus some additional analyses provided compelling evidence that the findings from the paper were not real and were partially based on fake data. This posting goes over many of the details. The authors of the paper retracted the original publication, and several of them issued personal replies to the posting. This controversy raised some important issues about data analysis, reproducibility of research findings, detection of fake data, and admission of error. Feel free to share your views about any of these issues below and/or on our Slack workspace (using the #reflections channel).

Answer:

Please insert your answer here.


NOTE: Please remember to record the time it took you to complete this question.

QUESTION 2 – OMITTED VARIABLE BIAS

Friendly reminder: Please remember to record the time it takes you to complete each of the problem set questions.

As indicated in class, the sign of the bias of \(\hat{\beta_1}\) when omitting \(X_2\) in the estimation of \(Y=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + u\) can be summarized in the table below:
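For reference, the standard omitted-variable-bias result can be sketched as follows. The notation \(\tilde{\beta}_1\) (the coefficient on \(X_1\) when \(X_2\) is omitted) and \(\delta_1\) (the slope from regressing \(X_2\) on \(X_1\)) is introduced here for illustration and may differ from the notation used in class.

```latex
\[
E[\tilde{\beta}_1] = \beta_1 + \beta_2\,\delta_1,
\qquad
\delta_1 = \frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)}
\]

\[
\begin{array}{l|cc}
 & \mathrm{Corr}(X_1, X_2) > 0 & \mathrm{Corr}(X_1, X_2) < 0 \\ \hline
\beta_2 > 0 & \text{positive bias} & \text{negative bias} \\
\beta_2 < 0 & \text{negative bias} & \text{positive bias}
\end{array}
\]
```

The sign of the bias is the sign of the product \(\beta_2\,\delta_1\), and \(\delta_1\) has the same sign as the correlation between \(X_1\) and \(X_2\).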

Pick two of the boxes in the table above. Illustrate each of these 2 cases with an example. Assume the regression equation you would like to estimate is \(Y=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + u\), but for some reason (lack of data or knowledge, for example), you end up omitting \(X_2\) from the regression.

  1. Describe the variables \(Y, X_1, X_2\):

Answer:

Let Y be credit extension to the corporate sector. Let X1 be corporate investment (gross fixed capital formation) and let X2 be interest rates.


  1. Indicate the sign of the correlation between \(X_1\) and \(X_2\), and explain why you should expect such sign [1-2 sentences]:

Answer:

Corr(X1, X2) would be negative, as higher interest rates would disincentivise investment by making it more expensive.


  1. Indicate the sign of \(\beta_2\), and explain why you would expect such sign [1-2 sentences]:

Answer:

B2 would also be negative for the same reason - more expensive credit would reduce take-up.


  1. Indicate the sign of the bias if you were to omit \(X_2\) from the regression, and explain why you would expect such sign [1-2 sentences]:

Answer:

There would be a positive bias, since both the correlation between the independent variables and B2 are negative, and the product of two negative terms is positive.


  1. Indicate how your estimated \(\hat{\beta_1}\) is likely to change when omitting \(X_2\) from the regression, i.e., will it get larger or smaller relative to when you estimate the full regression (with both \(X_1\) and \(X_2\) as explanatory variables)? Explain your reasoning both in technical terms (using your answers to the previous sub questions) and in terms a policymaker can understand (i.e., explain whether you will be over or understating the importance of \(X_1\)).

Answer:

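Because the bias is positive, the estimated B1 would be larger (more positive) when X2 is omitted than when we estimate the full regression. In technical terms, the bias equals B2 multiplied by the slope from regressing X2 on X1; both are negative, so their product is positive and is added to B1. For a policymaker: if we leave interest rates out of the analysis, we will overstate the importance of investment as a driver of corporate credit, because periods of high investment also tend to be periods of low interest rates, and the investment coefficient picks up the boost to borrowing that is really due to cheaper credit.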


  1. Training and Wages: Suppose that wages depend on two factors: hours of training (hourstraining) and years of education (educ):

    \(wage_i=\beta_0 + \beta_1 hourstraining_i + \beta_2 educ_i + u_i\)

    Suppose that in a certain city, a large subsidy was offered to workers with low levels of schooling such that hourstraining and educ are negatively correlated. Suppose you had data on workers that live in the city where the program is operating, and you estimate:

    \(wage_i=\alpha_0 + \alpha_1 hourstraining_i + v_i\)

    Do you think \(\hat{\alpha_1}\) is an unbiased estimator of the causal effect of hours of training on wages? If yes, explain why. If not, explain whether omitting educ from the regression is likely to lead to over or understating the importance of hours of training for wages.


Answer:

No, a1 would not be an unbiased estimator. It would be negatively biased: it would understate the impact of hours of training on wages. Because of the subsidy, training and education are negatively correlated in this city, while education has a positive effect on wages. If we exclude education from the regression, workers with many hours of training are also, on average, workers with less education and therefore lower wages, so training appears to have less of an effect on wages than it really does.

We also cannot claim causality from the a1 correlation, whether biased or not.
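The direction of this bias can be illustrated with a small simulation. This is only a sketch: every number in the data-generating process below is made up for illustration (none comes from the problem set), and the true training effect is set to 0.2 so the two regressions can be compared against a known value.

```r
set.seed(209)
n <- 5000

# Hypothetical data-generating process (all numbers are made up for illustration)
educ          <- rnorm(n, mean = 12, sd = 2)
hourstraining <- 80 - 3 * educ + rnorm(n, sd = 5)   # negatively correlated with educ
wage          <- 5 + 0.2 * hourstraining + 1.5 * educ + rnorm(n, sd = 2)

# Long regression: includes educ, so the training coefficient should be near 0.2
long  <- lm(wage ~ hourstraining + educ)

# Short regression: omits educ, so the training coefficient absorbs part of its effect
short <- lm(wage ~ hourstraining)

# Bias implied by the omitted-variable formula: beta_2 * Cov(X1, X2) / Var(X1)
predicted_bias <- coef(long)[["educ"]] * cov(hourstraining, educ) / var(hourstraining)

c(long            = coef(long)[["hourstraining"]],
  short           = coef(short)[["hourstraining"]],
  predicted_short = coef(long)[["hourstraining"]] + predicted_bias)
```

In this made-up setting the short-regression coefficient is well below the long-regression coefficient, and it matches the long coefficient plus the predicted bias.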


NOTE: Please remember to record the time it took you to complete this question.

QUESTION 3 – FORECASTING US PRESIDENTIAL ELECTIONS[1]

This part of the problem set is designed to be completed as a group. As opposed to other problem set questions where you are asked to write your answers in your own words, all team members can submit identical answers for this question. But please submit answers individually (as part of the problem set you submit) to enable timely grading.

In this question, you will use analytical, statistical, and quantitative tools to predict the results of the 2024 U.S. Presidential election.

At the end of the question, you will be asked to submit your predictions for the number of electoral college votes Harris will win.[2]

The U.S. Presidential election is really a set of separate state-by-state elections that are then aggregated (through the “Electoral College”) to determine the winner.  If you are not familiar with the nuances of the U.S. Presidential election system, consider visiting these links:[3]

A key step in any election simulation model is the use of recent polls of voters to predict the probability that a candidate wins a voting jurisdiction (e.g., a U.S. state).  For example, a model might identify a set of recent polls in the pivotal state of Pennsylvania, average them, and use the average to predict the probability that Democratic candidate Kamala Harris ultimately wins Pennsylvania.  This is called Harris’s “win probability” for the state.

Generally speaking, if the simulation model estimates that Harris is ahead in the vote in a state based on recent polls, Harris’s win probability will be greater than 50 percent. The further ahead Harris is in the polls of a state, the larger her win probability will be. 

Import the dataset ‘US_Election_24.csv’ into R by writing the relevant commands into your R script and running it.

  1. Suppose that Nate Silver, a prominent election forecaster, suggests that Kamala Harris leads 54% to 46% in Pennsylvania, based on recent polls.

    True, False, or Uncertain?  This means that Nate Silver thinks that Kamala Harris has a 54% chance of winning Pennsylvania. Please briefly explain your answer. [2-3 sentences]


Answer: False. The wording “leads 54% to 46%” refers to the share of the vote that Nate Silver expects Harris to receive in Pennsylvania, not to her chance of winning the state. In other words, he expects Harris to win by 8 percentage points; he is not saying that Harris would win 54 times if the election were run 100 times.


We will now use the latest estimates of Kamala Harris’s probabilities of winning each state and relevant congressional district to simulate the election that will take place on November 5.[4]

The approach to simulating the election will be as follows. In each state or district, you will randomly select a number from the uniform distribution between 0 and 1.[5] If the random number is less than Harris’s win probability in that state or district, Harris wins the state. If the random number is greater than or equal to Harris’s win probability in that state or district, Harris does not win the state. Think about why this makes sense. If it does not, please come and see us as this is an important idea in the course.
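One way to see why this rule works: for a draw \(U\) from the uniform distribution on \([0, 1]\) and a win probability \(p\) (both symbols are introduced here for illustration), the chance that the draw falls below \(p\) is exactly \(p\).

```latex
\[
P(U < p) = \int_0^p 1 \, du = p, \qquad 0 \le p \le 1 .
\]
```

For example, with \(p = 0.6\), about 60% of draws fall below the threshold and count as a win.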

  1. Let’s begin by simulating the Pennsylvania election 1000 times. In how many of the 1,000 iterations does Harris win the state of Pennsylvania? Does your answer approximately correspond to Harris’s win probability for Pennsylvania? Explain briefly. [1-3 sentences]

# A tibble: 6 × 4
  State_ED   Region    Electoral_Votes Harris_Prob
  <chr>      <chr>               <dbl>       <dbl>
1 Alabama    South                   9        0   
2 Alaska     Mountain                3        0.21
3 Arizona    Southwest              11        0.3 
4 Arkansas   South                   6        0.01
5 California Pacific                54        1   
6 Colorado   Southwest              10        0.97
# A tibble: 1 × 4
  State_ED     Region    Electoral_Votes Harris_Prob
  <chr>        <chr>               <dbl>       <dbl>
1 Pennsylvania Rust Belt              19        0.46
[1] 467
[1] 46211

Answer: In 467 of the 1,000 iterations, Harris won the state of Pennsylvania. This roughly corresponds to Harris’s win probability for Pennsylvania: a win probability of 0.46 means that if we run the election 1,000 times, Harris should win approximately 460 of them. The more times the election is run, the closer the share of wins gets to 0.46. This is shown in the second simulation, where Harris won 46,211 times out of 100,000 (a share of 0.462).


  1. Now let’s simulate all 56 states/districts, 1000 times each. Generate a histogram of the number of electoral votes Harris wins in the 1,000 iterations of your simulation. Similar histograms are produced by venues such as FiveThirtyEight and The Economist. Feel free to adopt the design features you find effective from these sites in your histogram. (Suggested code to run the simulation loop is provided below. After you take a look at the code to make sure you understand what it is doing, write the code to produce the histogram and make sure the histogram shows up in your rendered html file.)

Answer: Insert your answer here


  1. Based on your simulation, what is the probability that Harris obtains at least 270 electoral votes and therefore wins the election?

[1] 0.459
[1] 269.47
[1] 209
[1] 350

Answer: The probability that Harris wins the election, by obtaining at least 270 electoral votes, is 45.9%.
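Because this probability comes from only 1,000 simulated elections, it carries some simulation noise. A rough sketch of that noise, using the binomial standard error of a proportion (the variable names are introduced here for illustration):

```r
# Values reported above: share of simulated elections with >= 270 electoral votes
p_hat <- 0.459
n_sim <- 1000

# Standard error of a simulated proportion (binomial approximation)
mc_se <- sqrt(p_hat * (1 - p_hat) / n_sim)

# Approximate 95% interval for the probability the simulation is estimating
ci_95 <- p_hat + c(-1, 1) * qnorm(0.975) * mc_se

round(mc_se, 4)
round(ci_95, 3)
```

This interval reflects only the randomness of the simulation draws. It says nothing about uncertainty in the state win probabilities themselves or about correlation between states.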


  1. [OPTIONAL] One big disadvantage of the model is that it does not account for correlation between states/districts.[6] There are many different approaches to incorporate correlation into an election simulation model. A relatively crude one is the following:

    Accounting for correlations across states

    This approach considers correlations for states within the same “political region” of the United States.[7] The Region variable in the dataset shows the political region that each state belongs to. For each of these regions, you should draw a random number (we suggest saving these in a separate table) from a uniform distribution between -x and +x (where 0 ≤ x ≤ 0.5 and x is a number you select based on how strongly you believe outcomes for states/districts in that region will be correlated; for example, if you believe the Mountain region will be highly correlated and the Southeast not correlated at all, you may choose x to be 0.5 for Mountain and 0 for Southeast).

    Next, run the simulation from Q2c again, except your random number for each state/district should be the sum of a state/district-specific random number (like in Q2c) and the common random number for the state/district’s region (from the correlations you compute).

    Feel free to use the above approach or a different one. Do the simulation accounting for the correlations across states and generate a histogram of the number of electoral votes Harris wins in the 1,000 iterations of your simulation. Based on your revised simulation, what is the probability that Harris obtains at least 270 electoral votes and therefore wins the election?
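    A minimal sketch of the regional-shock approach described above, run on a hypothetical three-jurisdiction toy table rather than the course dataset. The jurisdictions, regions, probabilities, the values of x, and the 23-vote threshold are all made up for illustration; only the column names mirror the dataset.

```r
set.seed(2138)

# Hypothetical toy inputs (NOT the course dataset): three jurisdictions, two regions
toy <- data.frame(
  State_ED        = c("A", "B", "C"),
  Region          = c("North", "North", "South"),
  Electoral_Votes = c(10, 20, 15),
  Harris_Prob     = c(0.55, 0.45, 0.50),
  stringsAsFactors = FALSE
)

n_sim   <- 1000
x       <- c(North = 0.3, South = 0)   # assumed half-widths of the regional shocks
regions <- names(x)

ec_votes <- numeric(n_sim)
for (i in seq_len(n_sim)) {
  # One common shock per region, drawn from Uniform(-x, +x)
  shock <- runif(length(regions), min = -x, max = x)
  names(shock) <- regions

  # State-specific draw plus the shock shared by the state's region
  draw <- runif(nrow(toy)) + shock[toy$Region]

  # Same decision rule as before: Harris wins when the draw is below her probability
  harris_wins <- draw < toy$Harris_Prob
  ec_votes[i] <- sum(toy$Electoral_Votes[harris_wins])
}

# Share of iterations in which Harris wins a majority of the 45 toy electoral votes
mean(ec_votes >= 23)
```

    Note that once the regional shock is added, the combined draw is no longer uniform on [0, 1], so each state's simulated win rate can drift away from its stated win probability when x is large.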


Answer: Insert your answer here


  1. [OPTIONAL] How does your probability that Harris wins the election compare to the probabilities predicted by election forecast models (e.g., 538, The Economist) and prediction markets? What could explain any differences? In which do you have more confidence? [2-3 sentences]

Answer: Insert your answer here


  1. OK, folks, it's time to get real. You are welcome to continue enhancing the simulation model you have developed in this question or use a different method. But it is now time to summon all your powers to predict what will happen on Election Day. You will not be marked on whether your prediction matches the result, but the best prediction will be awarded with bragging rights and some potential prizes. This is your moment to shine!

    Report your prediction of the number of electoral votes Kamala Harris will actually win on Election Day. Submit your answers on this Google Form here, and write DONE in the box below


Answer:

QUESTION 4 - FINAL EXERCISE

This part of the problem set is designed to be completed with your final exercise team. As opposed to other problem set questions where you are asked to write your answers in your own words, all team members can submit identical answers for this question. But please submit answers individually (as part of the problem set you submit) this time to enable timely grading.

Note: The goal of this question is to help you advance in the final exercise, so you increase your chances of producing a final product you are proud of. Don’t feel too constrained by the specific prompts you see below. You should try to answer each of the prompts, but your team should decide how much time it is worth spending on each of the items below at this stage. The ultimate goal is to nudge you in the direction of making progress.

Final Exercise is posted on Canvas.

The first task for your group is to decide which of the final exercise options you will pursue. Please indicate the members of your team and your final exercise option by filling in this brief survey.

NOTE: Only one member of the team should fill in the survey


Done


OPTION 1 - CONDUCTING MINI-GROWTH DIAGNOSTIC DOMINICAN REPUBLIC

  1. Skim background documents to familiarize yourself with the context.

  2. Download and read the seminal paper Growth Diagnostics (Hausmann, Rodrik and Velasco, 2005). Among your full set of group members, also read the expanded “Mindbook” of Growth Diagnostics (Hausmann, Klinger and Wagner, 2008) and examples of Growth Diagnostics. We recommend focusing on the following sections of this book:

    1. Identifying Constraints to Growth (p.4)

    2. Growth Diagnostics: theoretical considerations, GD and the HRV model (p.16 - p.21)

    3. Principles of a Differential Diagnosis (p.31 - p.47)

    4. From diagnostics to therapeutics (p.90-p.94)

  3. Brainstorm some possible “tests” or analysis that you would run to identify the most severe binding constraints. How do these tests help you answer the first question? What kind of data will you need to run these “tests”? If there is confusion over the concepts of Growth Diagnostics, consult an example growth diagnostic or the teaching team.

  4. Decide how you plan to organize the work. Who will do what? Establish some deadlines.


  • Characterize growth based on growth decomposition
  • Regress aspects of decomposed growth on potential constraint variables, to evaluate if there is an association and the degree to which they explain variation
  • Analyze the extent to which the potential constraint is differential. For example, we could compare values of potential constraint to comparator countries - Latin America and Caribbean, other upper middle income countries - to assess if the DR is significantly outside of the “normal” or comparator range. We could also compare values of potential constraint to historical precedents in-country, to assess if the potential issue is new or historically relevant.
  • Determine the extent to which the constraint is binding (i.e. high shadow price, movement in shadow price leading to movement in growth function, presence of mechanisms to avoid constraint, heterogeneous acuity). At this stage we would look to empirical and theoretical literature to understand the linkages between the hypothesized constraint, characterized syndrome, and potential solutions.
  1. We have had some delays in establishing our team, so we will need to spend more time delegating the work and establishing deadlines.

OPTION 2 - IDENTIFYING RISK PREGNANCIES IN SURAT (INDIA)

  1. Familiarize yourself with the data. Remember that you will be using the training dataset for your analysis. Do some summary statistics on the main variables, including complications at time of delivery (last column)

  2. Read/skim the final exercise document to familiarize yourself with the context.

  3. Start thinking about what variables in the dataset (also called predictors or independent variables) you might want to use to predict high-risk pregnancies in the dataset.

  4. Brainstorm some ideas for what hypotheses you might explore and how you might use the data to explore these hypotheses.

  5. Decide how you plan to organize the work. Who will do what? Establish some deadlines.


Please enter your answers here


TIME USE

Please enter in the form linked below the time you spent on each question.

This information will only be used for teaching improvements; please be candid and report the time (in MINUTES) spent in each question.

The form is available here:

https://forms.gle/9Q9EHnxavAqpKULx9


Please enter “Done” in this field once you have completed the form.


This is a copy of your code.

.answer-box {
  background-color: LemonChiffon;
}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(options(width = 60))
knitr::opts_chunk$set(class.output = "bg-warning")

packages <- c('haven','dplyr', 'ggplot2', 'reshape2', 'tidyverse', 'pracma',
              'lubridate', 'scales', 'ggthemes', 'gt', 'dineq', 'gglorenz')  
to_install <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(to_install)>0) install.packages(to_install, 
                                          repos='http://cran.us.r-project.org')
lapply(packages, require, character.only=TRUE)


steps <- read.csv("Steps.csv")
str(steps)
summary(steps)

baseline_means <- steps %>%
  group_by(treatment) %>%
  summarize(mean_baseline_steps = mean(BaselineSteps, na.rm = TRUE), 
            sd_baseline_steps = sd(BaselineSteps, na.rm = TRUE), 
            n = n())

print(baseline_means)




t_test_result <- t.test(BaselineSteps ~ treatment, data = steps)

print(t_test_result)
library(ggplot2)

ggplot(steps, aes(x = StepChange)) +
  geom_histogram(binwidth = 1000, color = "black", fill = "skyblue") +
  labs(title = "Distribution of Step Change",
       x = "Step Change (PostSteps - BaselineSteps)",
       y = "Frequency") +
  theme_minimal()
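# Import data
US_Election_24 <- read_csv("US_Election_24.csv")
head(US_Election_24)

# Look up Harris's win probability for Pennsylvania (0.46 in the dataset)
subset(US_Election_24, State_ED == "Pennsylvania")
Harris_Prob_Penn <- 0.46

# Simulate the Pennsylvania election 1,000 times:
# Harris wins an iteration when the uniform draw is below her win probability
set.seed(123)
n <- 1000

results <- runif(n, 0, 1) < Harris_Prob_Penn
harris_wins <- sum(results)

harris_wins

# Repeat with 100,000 iterations (continuing the same random number stream)
n_1 <- 100000

results <- runif(n_1, 0, 1) < Harris_Prob_Penn
harris_wins_1 <- sum(results)

harris_wins_1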

n_sim <- 1000
US_Election_Sim <- US_Election_24

#Harris final results table
Harris_Results <- tibble(EC_Votes = rep(NA, n_sim),
                         Election_Result = rep(NA, n_sim))

#Here's the loop that runs the simulation  
#It first creates the column for the simulation
#Then it goes state by state within each iteration to simulate a Harris victory or loss
#Don't forget to set.seed to keep your results consistent

set.seed(02138)

  for (i in 1:n_sim) {
  
    col_name <- paste0("E", i)
    US_Election_Sim[[col_name]] <- NA
  
    for (j in 1:nrow(US_Election_Sim)) {
    
      result <- runif(1,0,1)
      
      US_Election_Sim[j,col_name] <- ifelse(US_Election_Sim$Harris_Prob[j] > result, 1, 0)
    
    }
    
    #Computing the final result for the iteration
    #Sum up the electoral college votes
    #Determine whether Harris won the election
    Harris_Results$EC_Votes[i] = sum(US_Election_Sim[,col_name]*US_Election_Sim$Electoral_Votes)
    Harris_Results$Election_Result[i] = ifelse(Harris_Results$EC_Votes[i] >= 270, 1, 0)
  }

ggplot(Harris_Results, aes(x = EC_Votes)) +
  geom_histogram(binwidth = 5, fill = "lightblue", color = "black") +
  geom_vline(xintercept = 270, color = "red", linetype = "dashed", linewidth = 1) +
  labs(
    title = "Simulated Distribution of Harris's Electoral Votes",
    x = "Electoral Votes Won by Harris",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

sum(Harris_Results$EC_Votes >= 270, na.rm=TRUE)/n_sim
mean(Harris_Results$EC_Votes)
min(Harris_Results$EC_Votes)
max(Harris_Results$EC_Votes)

# Insert only code here.

Footnotes

  1. My full gratitude to my colleague Jonathan Borck for being the intellectual architect of the assignment question that served as a basis for this question, to Andrew Kidd for designing, improving and refining this assignment question, and to Vaibhav Parik for his help adapting it to API-209.

  2. Throughout this assignment, we will focus on explicitly forecasting the electoral votes won by the Democratic candidate, Kamala Harris, instead of the votes won by the Republican candidate, Donald Trump. This is an arbitrary choice made only to ensure consistency of student responses and thus facilitate efficient and equitable grading of your assignments.

  3. We suggest that you do not consider the non-traditional voting systems used in some jurisdictions such as Maine.

  4. For 48 of the 50 U.S. states as well as the District of Columbia, the winner of the popular vote in the state wins all of the state’s electoral votes.  The remaining two states (Maine and Nebraska) each award two electoral votes to the winner of the statewide popular vote and one additional electoral vote to the winner of each of the state’s districts.  Thus, the simulation will use 56 states or districts in total (48 states + Washington DC + Maine statewide + Maine District 1 + Maine District 2 + Nebraska statewide + Nebraska District 1 + Nebraska District 2 + Nebraska District 3).  There are 538 electoral votes to award on Election Day.  A candidate will win the race if they obtain 270 electoral votes or more.

  5. When drawing a random number from a uniform distribution, every value has the same probability of selection.

  6. Of course, there are many other factors that this model does not account for. We focus here on correlation between states/districts both because handling correlations between events is an important component of your analytical toolkit and because most errors in state/district polling at the end of a campaign are correlated with errors in other states (per Nate Silver).

  7. Dividing the United States into regions is a tricky and often-controversial exercise. However, states which are physically near each other often behave similarly in presidential elections (in part because geographical proximity is often a helpful proxy for shared history, culture, economic structures, and demographics). We will use the political regions developed by 538 (despite some concerns about their forecast model), because they are fully transparent about their methodology and its results.