AHCA Polls

Data source

https://github.com/fivethirtyeight/data/blob/master/ahca-polls/ahca_polls.csv

Collection of AHCA polls extracted from different websites.

Load Libraries

#install.packages("ggplot2")
#install.packages("tidyverse")
#install.packages("factoextra")
#install.packages("reshape") 
library(reshape)
library(stringr)
library(XML)
library(RCurl)
library(RJSONIO)
library(tidyr)
library(dplyr)
library(ggplot2)
library(knitr)
options(knitr.table.format = "html")
library(tidyverse)  # data manipulation
library(cluster)    # clustering algorithms
library(factoextra)
options(stringsAsFactors = FALSE)

Introduction.

The data in GitHub contains summary of AHSA repeal polls.We want to do simple cleaning and analysis of the data.

Read CSV file.

  • Add “others” collumn (people who did not either oppose or favor the repeal. I assume these individuals had no opinion).

  • Fix typoes.

  • Fix dates.

  • Calculate days per poll.

  • There was a mistake in one of the record. Pollster name was recorded incorrectly.

  • Calculate length of poll question.

  • Run summary function for dataframe.

polls<-read.csv(file='https://raw.githubusercontent.com/fivethirtyeight/data/master/ahca-polls/ahca_polls.csv', header=TRUE, sep=",")

polls$Others<-100-(polls$Favor+polls$Oppose)

polls$Pollster[polls$Pollster == "Qunnipiac"] <- "Quinnipiac"

polls$Pollster[polls$Pollster == 'YouGov' & polls$End=='5/13/17'] <- 'Economist/YouGov'

polls$StartD<-as.Date(polls$Start,"%m/%d/%y")

polls$EndD<-as.Date(polls$End,"%m/%d/%y")

polls$days <-as.numeric(polls$EndD-polls$StartD+1)

polls$newpoll = str_wrap(polls$Pollster, width = 10)

polls$textlen<-str_length(polls$Text)

summary(polls)
##     Start               End              Pollster             Favor      
##  Length:15          Length:15          Length:15          Min.   :17.00  
##  Class :character   Class :character   Class :character   1st Qu.:23.50  
##  Mode  :character   Mode  :character   Mode  :character   Median :31.00  
##                                                           Mean   :28.33  
##                                                           3rd Qu.:31.50  
##                                                           Max.   :38.00  
##      Oppose          Url                Text               Others    
##  Min.   :44.00   Length:15          Length:15          Min.   :13.0  
##  1st Qu.:46.50   Class :character   Class :character   1st Qu.:19.5  
##  Median :48.00   Mode  :character   Mode  :character   Median :22.0  
##  Mean   :50.67                                         Mean   :21.0  
##  3rd Qu.:55.00                                         3rd Qu.:23.0  
##  Max.   :62.00                                         Max.   :29.0  
##      StartD                EndD                 days      
##  Min.   :2017-05-04   Min.   :2017-05-06   Min.   :1.000  
##  1st Qu.:2017-05-08   1st Qu.:2017-05-14   1st Qu.:4.500  
##  Median :2017-05-13   Median :2017-05-20   Median :6.000  
##  Mean   :2017-05-17   Mean   :2017-05-21   Mean   :5.467  
##  3rd Qu.:2017-05-25   3rd Qu.:2017-05-30   3rd Qu.:7.000  
##  Max.   :2017-06-09   Max.   :2017-06-11   Max.   :8.000  
##    newpoll             textlen     
##  Length:15          Min.   :115.0  
##  Class :character   1st Qu.:148.0  
##  Mode  :character   Median :181.0  
##                     Mean   :218.9  
##                     3rd Qu.:265.0  
##                     Max.   :540.0

Basic Plots.

  • Distribution of polls. Quinnipiac and YouGov are the most common - 20% of all polls each.

  • Favor/Oppose/Others/Spread/Days by Pollster. Favor - max Morning Consult(38%), while min Quinnipiac(19.333(3)). Oppose varries from 45.5 for Morning Consult to 58.333(3) for Quinnipiac. Others varries from 13% for Monmouth to 29% for NBC. Spread (difference between Oppose and Favor) varies from 7.5% for Morning Consult to 39% for Quinnipiac. Days per poll ranges from 4 to 8.

  • If we look at the polls by date, it is hard to see clear pattern. But we can see that results by different Pollster is consistent.

  • All our variables seem to be distributed not normally, except for non-responders.

  • There seems to be no strong correlation between length of poll question and favorable/unfavorable repsonse.

pollfreq<-prop.table(table(polls$newpoll))

barplot(pollfreq, xlab="Pollsters", ylab="% of Total", main="% of Polls by Pollster")

AveByPollster<-polls %>% group_by(newpoll) %>% summarise(avgF = mean(Favor),avgOpp= mean(Oppose),avgOth= mean(Others), avgD=mean(days),avgL=mean(textlen)) 

AveByPollster$spread<-AveByPollster$avgOpp-AveByPollster$avgF

ggplot(AveByPollster, aes(x=reorder(newpoll,avgF), y=avgF)) + geom_point()+ggtitle("Plot of Favorable Response") + labs(x = "Pollsters", y="Favor to repeal")

ggplot(AveByPollster, aes(x=reorder(newpoll,avgOpp), y=avgOpp)) + geom_point()+ggtitle("Plot of Opposing Response") + labs(x = "Pollsters", y="Oppose to repeal")

ggplot(AveByPollster, aes(x=reorder(newpoll,avgOth), y=avgOth)) + geom_point()+ggtitle("Plot of No Response") + labs(x = "Pollsters", y="No response")

ggplot(AveByPollster, aes(x=reorder(newpoll,spread), y=spread)) + geom_point()+ggtitle("Plot of DIfference between Favorable and Unfavorable Responses") + labs(x = "Pollsters", y="Difference between favor and oppose to repeal")

AveByPollster1<-select(AveByPollster,newpoll,avgF,avgOpp,avgOth)

AveByPollster1 <- as.data.frame(AveByPollster1)

AveByPollster2<-melt(AveByPollster1,id = c("newpoll"))

ggplot(AveByPollster2, aes(fill=variable, y=value, x=newpoll)) + geom_bar( stat="identity", position="fill")+ggtitle("Barchart of Reponses by Pollster") + labs(x = "Pollsters", y="Distribution of reposnses")

ggplot(AveByPollster, aes(x=reorder(newpoll,avgD), y=avgD)) + geom_point()+ggtitle("Plot of Days of Polling") + labs(x = "Pollsters", y="Days of polling")

ggplot(AveByPollster, aes(x=reorder(newpoll,avgL), y=avgL)) + geom_point()+ggtitle("Plot of Poll Question Length") + labs(x = "Pollsters", y="Length of question")

ggplot(AveByPollster, aes(x=avgF, y=avgL)) + geom_point()+ggtitle("Plot of Dependency of Poll Question Length and Favorable Response")  + labs(x = "Favorable response", y="Length of question")

qplot(polls$StartD, polls$Favor,geom=c("point", "line"),colour = polls$newpoll, main="Plot of Date of Polling vs Favorable Response", xlab="Date of Polling", ylab="Favorable Reponse")

x <- AveByPollster[2:4]

y <- AveByPollster[6]

cor(x, y)
##              avgL
## avgF    0.1044624
## avgOpp  0.2288785
## avgOth -0.3375974
library("ggpubr")
## Loading required package: magrittr
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
ggdensity(polls$Favor, 
          main = "Density plot of favorable response to a poll",
          xlab = "Favorable response to a poll")

library("ggpubr")
ggdensity(polls$Oppose, 
          main = "Density plot of opposing response to a poll",
          xlab = "Opposing response to a poll")

library("ggpubr")
ggdensity(polls$Others, 
          main = "Density plot of no response to a poll",
          xlab = "No response to a poll")

library("ggpubr")
ggdensity(polls$days, 
          main = "Density plot of length of a poll(days)",
          xlab = "Length of a poll(days)")

library("ggpubr")
ggdensity(polls$textlen, 
          main = "Density plot of length of question of a poll",
          xlab = "Length of question of a poll")

Cluster Analysis.

Morning consult, Keiser Family, and Monmouth got clustered together, while all other pollsters got clustered seperetly (I have only attempted to create 2 different clusters).

pollcluster<-select(polls,newpoll, Start, Favor, Oppose, Others, days, textlen)

pollcluster<-unite(pollcluster,ncol,c(1:2))

pollcluster
##                                  ncol Favor Oppose Others days textlen
## 1      Public\nPolicy\nPolling_6/9/17    24     55     21    3     116
## 2                       YouGov_6/4/17    33     45     22    3     145
## 3                  Quinnipiac_5/31/17    17     62     21    7     181
## 4                       IPSOS_5/26/17    30     46     24    5     119
## 5            Morning\nConsult_5/25/17    38     47     15    6     284
## 6                  Quinnipiac_5/17/17    20     57     23    7     181
## 7  Kaiser\nFamily\nFoundation_5/16/17    31     55     14    7     540
## 8                    Monmouth_5/13/17    32     55     13    8     246
## 9                      YouGov_5/13/17    31     47     22    4     170
## 10    Public\nPolicy\nPolling_5/12/17    25     52     23    5     115
## 11                    NBC/WSJ_5/11/17    23     48     29    5     400
## 12          Economist/\nYouGov_5/6/17    31     44     25    8     151
## 13                      YouGov_5/6/17    31     47     22    1     170
## 14                  Quinnipiac_5/4/17    21     56     23    7     181
## 15            Morning\nConsult_5/4/17    38     44     18    6     284
row.names(pollcluster)<-pollcluster[,1]

pollcluster1 <- scale(pollcluster[-1])

pollcluster1
##                                         Favor     Oppose     Others
## Public\nPolicy\nPolling_6/9/17     -0.6831301  0.7683224  0.0000000
## YouGov_6/4/17                       0.7356785 -1.0047293  0.2311604
## Quinnipiac_5/31/17                 -1.7866478  2.0094587  0.0000000
## IPSOS_5/26/17                       0.2627423 -0.8274242  0.6934811
## Morning\nConsult_5/25/17            1.5239055 -0.6501190 -1.3869622
## Quinnipiac_5/17/17                 -1.3137116  1.1229328  0.4623207
## Kaiser\nFamily\nFoundation_5/16/17  0.4203877  0.7683224 -1.6181226
## Monmouth_5/13/17                    0.5780331  0.7683224 -1.8492829
## YouGov_5/13/17                      0.4203877 -0.6501190  0.2311604
## Public\nPolicy\nPolling_5/12/17    -0.5254847  0.2364069  0.4623207
## NBC/WSJ_5/11/17                    -0.8407754 -0.4728138  1.8492829
## Economist/\nYouGov_5/6/17           0.4203877 -1.1820345  0.9246415
## YouGov_5/6/17                       0.4203877 -0.6501190  0.2311604
## Quinnipiac_5/4/17                  -1.1560662  0.9456276  0.4623207
## Morning\nConsult_5/4/17             1.5239055 -1.1820345 -0.6934811
##                                          days    textlen
## Public\nPolicy\nPolling_6/9/17     -1.2146780 -0.8701821
## YouGov_6/4/17                      -1.2146780 -0.6248618
## Quinnipiac_5/31/17                  0.7550701 -0.3203263
## IPSOS_5/26/17                      -0.2298039 -0.8448042
## Morning\nConsult_5/25/17            0.2626331  0.5509838
## Quinnipiac_5/17/17                  0.7550701 -0.3203263
## Kaiser\nFamily\nFoundation_5/16/17  0.7550701  2.7165698
## Monmouth_5/13/17                    1.2475071  0.2295296
## YouGov_5/13/17                     -0.7222410 -0.4133788
## Public\nPolicy\nPolling_5/12/17    -0.2298039 -0.8786414
## NBC/WSJ_5/11/17                    -0.2298039  1.5322649
## Economist/\nYouGov_5/6/17           1.2475071 -0.5741059
## YouGov_5/6/17                      -2.1995521 -0.4133788
## Quinnipiac_5/4/17                   0.7550701 -0.3203263
## Morning\nConsult_5/4/17             0.2626331  0.5509838
## attr(,"scaled:center")
##      Favor     Oppose     Others       days    textlen 
##  28.333333  50.666667  21.000000   5.466667 218.866667 
## attr(,"scaled:scale")
##      Favor     Oppose     Others       days    textlen 
##   6.343350   5.639993   4.326001   2.030717 118.212802
distance <- get_dist(pollcluster1)

fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

k2 <- kmeans(pollcluster1, centers = 2, nstart = 25)

str(k2)
## List of 9
##  $ cluster     : Named int [1:15] 2 2 2 2 1 2 1 1 2 2 ...
##   ..- attr(*, "names")= chr [1:15] "Public\nPolicy\nPolling_6/9/17" "YouGov_6/4/17" "Quinnipiac_5/31/17" "IPSOS_5/26/17" ...
##  $ centers     : num [1:2, 1:5] 1.0116 -0.3678 -0.0739 0.0269 -1.387 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "1" "2"
##   .. ..$ : chr [1:5] "Favor" "Oppose" "Others" "days" ...
##  $ totss       : num 70
##  $ withinss    : num [1:2] 9.4 36.7
##  $ tot.withinss: num 46.1
##  $ betweenss   : num 23.9
##  $ size        : int [1:2] 4 11
##  $ iter        : int 1
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
fviz_cluster(k2, data = pollcluster1)

Conclusion.

  • The main conclusion is obvious - majority of Americans oppose to the repeal of Obamacare.

  • However, data collected in our file is very limited. It does not have sample size. It does not have breakdown by different demographics. So, next step should be pulling additional data from the website provided in the file.

  • It would be also interesting to understand wide range of results coming from different pollsters. Would the reason for discrepency be the way poll question was asked or is it error due to poll size or the way sample population was chosen?