Hello everyone! I have prepared an analysis of an article I found on FiveThirtyEight. “FiveThirtyEight, sometimes rendered as 538, is a website that focuses on opinion poll analysis, politics, economics, and sports blogging”, according to their Wikipedia page. I was interested in this website because I remember they like to use visuals, and I thought it would be fun to make some of my own based on their content.
The article I chose to analyze is about “Winning the Race for Congress”. Aaron Bycoffe and Dhrumil Mehta explain their work as “An updating estimate of the congressional generic ballot, based on polls that ask people which party they would support in an election.”
The link to the article is attached here: https://projects.fivethirtyeight.com/congress-generic-ballot-polls/
This page is updated occasionally as new polls are added. The results here are as of 2:00 PM on May 13, 2020.
Without further ado, let’s get started.
A few R packages were necessary for this analysis. Tidyverse is a versatile collection of packages used to read in, wrangle, and visualize the data. Stringr is a package used to manipulate character-based observations. XML is a package for pulling tables from websites. I also used rvest, RCurl, and magrittr to work some more witchcraft on these web pages.
## -- Attaching packages ---------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
##
## Attaching package: 'XML'
## The following object is masked from 'package:rvest':
##
## xml
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
Now we have scraped the table from the article’s page. However, this table needs a lot of adjustments before any analysis can be done. Many observations need to be systematically changed for easier manipulation, and the data types are a bit of a disaster at the moment, but there is no need to worry.
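As a rough sketch of what that clean-up can involve, the snippet below tidies a tiny stand-in table. The rows, the column names, and the "Democrat +7" format of the leader column are assumptions for illustration only; they are not taken from the scraped page.

```r
library(stringr)  # str_extract()

# Hypothetical stand-in for the freshly scraped table: everything is text.
polls <- data.frame(
  Pollster = c("Pollster A", "Pollster B"),
  Sample   = c("1,000", "3,012"),
  Leader   = c("Democrat +7", "Republican +2"),
  stringsAsFactors = FALSE
)

# "1,000" is text, so drop the comma before converting to a number.
polls$Sample <- as.numeric(gsub(",", "", polls$Sample))

# Split "Democrat +7" into a party label and a signed margin
# (positive when Democrats lead, negative when Republicans lead).
polls$Party  <- str_extract(polls$Leader, "^[A-Za-z]+")
points       <- as.numeric(str_extract(polls$Leader, "[0-9]+"))
polls$Margin <- ifelse(polls$Party == "Republican", -points, points)

print(polls)
```

Storing the margin as one signed number, rather than a party plus a number, makes later sums and averages by party straightforward.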
The table is finally ready for some analysis. Buckle your seatbelts.
Before I even began writing any code, I looked through the table out of curiosity. I noticed one column had what appeared to be the same answer for every observation. The column I am speaking of is the “Party Leader” column that reports ‘Republican’ or ‘Democrat’ and their margin of victory in a particular poll. I noticed that the Democrats had won nearly all of them.
I wanted to be sure, so I went through a search to see if I was missing something.
## Pollster Leader
## 1 McLaughlin & Associates Republican
## 2 McLaughlin & Associates Republican
## 3 McLaughlin & Associates Republican
As you can see above, the Republican party led three of these polls. All three are from McLaughlin & Associates. This inspired me to get more information on this pollster specifically.
## # A tibble: 2 x 4
## Leader total polls AvgWin
## <chr> <dbl> <int> <dbl>
## 1 Democrat 18 8 2.25
## 2 Republican -7 3 -2.33
McLaughlin had 11 polls, all of which had a sample size of 1000. The other eight polls were won by the Democrats. The two parties had similar average margins of victory in McLaughlin’s polls. Republicans had an average margin of victory of 2.33 points, while Democrats won by 2.25 points on average.
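A search and summary like the ones above can be sketched with dplyr. The data frame below is invented for illustration (it is not the scraped data), and the column names `Pollster`, `Leader`, and `Margin` are assumed; `Margin` is positive when Democrats lead and negative when Republicans lead.

```r
library(dplyr)

# Toy data standing in for the cleaned poll table (values are invented).
polls <- data.frame(
  Pollster = c("Pollster A", "Pollster A", "Pollster A", "Pollster B"),
  Leader   = c("Democrat", "Republican", "Democrat", "Democrat"),
  Margin   = c(3, -2, 1, 8),
  stringsAsFactors = FALSE
)

# Which polls did Republicans lead?
rep_led <- polls %>%
  filter(Leader == "Republican") %>%
  select(Pollster, Leader)
print(rep_led)

# For one pollster, summarise each leader's total margin, number of polls,
# and average margin of victory.
pollster_a <- polls %>%
  filter(Pollster == "Pollster A") %>%
  group_by(Leader) %>%
  summarise(total = sum(Margin), polls = n(), AvgWin = mean(Margin))
print(pollster_a)
```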
Next, we are going to look at which pollsters are most lopsided on the Democratic side, measured by each pollster’s total lead summed across its polls.
## # A tibble: 28 x 3
## # Groups: Pollster [27]
## Pollster Leader lead
## <chr> <chr> <dbl>
## 1 HarrisX Democrat 447
## 2 YouGov Democrat 352
## 3 Morning Consult Democrat 224
## 4 McLaughlin & Associates Democrat 50
## 5 Firehouse Strategies/Øptimus Democrat 37
## 6 RMG Research Democrat 24
## 7 NBC News/Wall Street Journal Democrat 20
## 8 Marist College Democrat 19
## 9 GQR Research Democrat 14
## 10 Emerson College Democrat 11
## # ... with 18 more rows
Three pollsters stand out from the rest: HarrisX, YouGov, and Morning Consult. It is important to look at sample sizes for these polls. HarrisX polls average 3,000 respondents, Morning Consult polls average about 2,000, and YouGov polls have over 1,000 respondents on average.
HarrisX, the pollster with the biggest samples, has the smallest average margin of victory: Democrats lead by about 6 points on average. Morning Consult and YouGov polls average around 7 and 8 points, respectively.
Among these three pollsters, the takeaway is a negative relationship between sample size and margin of victory: as sample size increases, the margin of victory decreases.
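One way to check a relationship like this across all pollsters, rather than only three, is to compare each pollster’s average sample size with its average margin. The sketch below uses invented numbers and assumed column names, so it illustrates the method only, not the result.

```r
library(dplyr)

# Invented polls for three hypothetical pollsters.
polls <- data.frame(
  Pollster = c("A", "A", "B", "B", "C", "C"),
  Sample   = c(3000, 3100, 2000, 1900, 1000, 1100),
  Margin   = c(6, 5, 7, 7, 9, 8),
  stringsAsFactors = FALSE
)

# Average sample size and average margin per pollster, largest samples first.
by_pollster <- polls %>%
  group_by(Pollster) %>%
  summarise(AvgSample = mean(Sample), AvgMargin = mean(Margin)) %>%
  arrange(desc(AvgSample))
print(by_pollster)

# Pearson correlation between the two averages; a negative value means
# bigger samples go with smaller margins.
r <- cor(by_pollster$AvgSample, by_pollster$AvgMargin)
print(r)
```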
A fair question to ask is whether the respondents are actually registered to vote. People may have responded to these polls without being registered. If they are not, we will have to reconsider how accurate the polls are as a predictor.
The two visuals above show the disparity between registered voters and likely voters. I went ahead and totalled the respondents across these polls. Over 370,000 respondents are registered voters, while only around 22,000 were deemed “Likely Voters”.
Out of the 205 polls at this time, 182 are composed of registered voters. The remaining 23 are made up of likely voters. Since roughly 89% of the polls survey registered rather than likely voters, this distinction is worth keeping in mind when reading the results.
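Counts like these can be produced with a grouped summary. In this sketch the `Population` codes ("RV" for registered voters, "LV" for likely voters), the column names, and every value are assumptions made for illustration.

```r
library(dplyr)

# Invented polls with a population type and a sample size.
polls <- data.frame(
  Population = c("RV", "RV", "LV", "RV", "LV"),
  Sample     = c(1000, 2000, 900, 1500, 1100),
  stringsAsFactors = FALSE
)

# Number of polls, total respondents, and share of polls per population type.
by_population <- polls %>%
  group_by(Population) %>%
  summarise(Polls = n(), Respondents = sum(Sample)) %>%
  mutate(ShareOfPolls = Polls / sum(Polls))
print(by_population)
```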
We have a variety of variables in this dataset. I want an explanation for the scores of these polls. I think the best way to do that is by running a linear regression model. This will include sample size, poll weight, and the proportions of votes for Republicans and Democrats, respectively. We are going to see if these variables are good predictors of poll scores.
##
## Call:
## lm(formula = LinearModel$Score ~ LinearModel$Sample + LinearModel$Weight +
## LinearModel$Republican + LinearModel$Democrat)
##
## Coefficients:
## (Intercept) LinearModel$Sample LinearModel$Weight
## -2.4187065 0.0002128 0.2144762
## LinearModel$Republican LinearModel$Democrat
## -0.6487733 0.7412880
##
## Call:
## lm(formula = LinearModel$Score ~ LinearModel$Sample + LinearModel$Weight +
## LinearModel$Republican + LinearModel$Democrat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7390 -0.4728 -0.0237 0.2458 3.1661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.4187065 1.9818187 -1.220 0.2237
## LinearModel$Sample 0.0002128 0.0001200 1.773 0.0777 .
## LinearModel$Weight 0.2144762 0.1618941 1.325 0.1868
## LinearModel$Republican -0.6487733 0.0312242 -20.778 <2e-16 ***
## LinearModel$Democrat 0.7412880 0.0340511 21.770 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8028 on 198 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.8038, Adjusted R-squared: 0.7999
## F-statistic: 202.8 on 4 and 198 DF, p-value: < 2.2e-16
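For readers who want to try a model of this shape, here is a minimal sketch on simulated data. The data frame reuses the name `LinearModel` and the column names from the output above, but every value is randomly generated, so the coefficients it prints will not match the ones shown.

```r
set.seed(538)  # make the simulated data reproducible
n <- 200

# Simulated stand-in for the poll table (not the real data).
LinearModel <- data.frame(
  Sample     = round(runif(n, 800, 3000)),
  Weight     = runif(n, 0.2, 1.5),
  Republican = round(rnorm(n, 40, 2)),
  Democrat   = round(rnorm(n, 47, 2))
)

# Assume the score is roughly the Democratic margin plus noise.
LinearModel$Score <- LinearModel$Democrat - LinearModel$Republican +
  rnorm(n, sd = 0.8)

# Passing the data frame through `data =` keeps the formula short and
# gives coefficient names like "Sample" instead of "LinearModel$Sample".
fit <- lm(Score ~ Sample + Weight + Republican + Democrat, data = LinearModel)
print(summary(fit))
```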
The regression as a whole is statistically significant, with a p-value below 2.2e-16. About 80% of the variation in scores is explained by these variables (adjusted R-squared of 0.7999). Using the coefficients from the output above, the regression equation is:
score = -2.419 + 0.0002128(sample) + 0.2145(weight) - 0.6488(Republican) + 0.7413(Democrat)
What does this equation mean?
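An increase in sample size is associated with a slightly higher score, though this effect is not significant at the 5% level (p = 0.078).
An increase in the poll’s weight is associated with a higher score, but this effect is not significant either (p = 0.187).
An increase in the Democrats’ margin of victory (a higher Democratic share or a lower Republican share) increases the score, and these are the only clearly significant effects (p < 2e-16).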
That makes sense since so many of these polls tab Democrats as the leader.
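To make the fitted coefficients concrete, the sketch below plugs one hypothetical poll into them. The poll’s values (1,000 respondents, weight 1, 40% Republican, 48% Democrat) are invented; the coefficients are copied from the `lm` output above.

```r
# Coefficients copied from the regression output.
coefs <- c(Intercept = -2.4187065, Sample = 0.0002128, Weight = 0.2144762,
           Republican = -0.6487733, Democrat = 0.7412880)

# Hypothetical poll; the leading 1 multiplies the intercept.
poll <- c(Intercept = 1, Sample = 1000, Weight = 1,
          Republican = 40, Democrat = 48)

# Predicted score is the sum of coefficient * value over all terms.
predicted_score <- sum(coefs * poll)
print(round(predicted_score, 2))  # 7.64
```

The prediction of about 7.6 sits close to this made-up poll’s 8-point Democratic margin, which fits the reading that the party shares drive the score.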
Thank you for checking out my project. Like I said at the beginning, this article does update. These results should not be too different if you decide to run some analysis of your own in the near future.