https://github.com/fivethirtyeight/data/blob/master/ahca-polls/ahca_polls.csv
Collection of AHCA polls extracted from different websites.
#install.packages("ggplot2")
#install.packages("tidyverse")
#install.packages("factoextra")
#install.packages("reshape")
library(reshape)
library(stringr)
library(XML)
library(RCurl)
library(RJSONIO)
library(tidyr)
library(dplyr)
library(ggplot2)
library(knitr)
options(knitr.table.format = "html")
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra)
options(stringsAsFactors = FALSE)
The data in GitHub contains summary of AHSA repeal polls.We want to do simple cleaning and analysis of the data.
Add “others” collumn (people who did not either oppose or favor the repeal. I assume these individuals had no opinion).
Fix typoes.
Fix dates.
Calculate days per poll.
There was a mistake in one of the record. Pollster name was recorded incorrectly.
Calculate length of poll question.
Run summary function for dataframe.
polls<-read.csv(file='https://raw.githubusercontent.com/fivethirtyeight/data/master/ahca-polls/ahca_polls.csv', header=TRUE, sep=",")
polls$Others<-100-(polls$Favor+polls$Oppose)
polls$Pollster[polls$Pollster == "Qunnipiac"] <- "Quinnipiac"
polls$Pollster[polls$Pollster == 'YouGov' & polls$End=='5/13/17'] <- 'Economist/YouGov'
polls$StartD<-as.Date(polls$Start,"%m/%d/%y")
polls$EndD<-as.Date(polls$End,"%m/%d/%y")
polls$days <-as.numeric(polls$EndD-polls$StartD+1)
polls$newpoll = str_wrap(polls$Pollster, width = 10)
polls$textlen<-str_length(polls$Text)
summary(polls)
## Start End Pollster Favor
## Length:15 Length:15 Length:15 Min. :17.00
## Class :character Class :character Class :character 1st Qu.:23.50
## Mode :character Mode :character Mode :character Median :31.00
## Mean :28.33
## 3rd Qu.:31.50
## Max. :38.00
## Oppose Url Text Others
## Min. :44.00 Length:15 Length:15 Min. :13.0
## 1st Qu.:46.50 Class :character Class :character 1st Qu.:19.5
## Median :48.00 Mode :character Mode :character Median :22.0
## Mean :50.67 Mean :21.0
## 3rd Qu.:55.00 3rd Qu.:23.0
## Max. :62.00 Max. :29.0
## StartD EndD days
## Min. :2017-05-04 Min. :2017-05-06 Min. :1.000
## 1st Qu.:2017-05-08 1st Qu.:2017-05-14 1st Qu.:4.500
## Median :2017-05-13 Median :2017-05-20 Median :6.000
## Mean :2017-05-17 Mean :2017-05-21 Mean :5.467
## 3rd Qu.:2017-05-25 3rd Qu.:2017-05-30 3rd Qu.:7.000
## Max. :2017-06-09 Max. :2017-06-11 Max. :8.000
## newpoll textlen
## Length:15 Min. :115.0
## Class :character 1st Qu.:148.0
## Mode :character Median :181.0
## Mean :218.9
## 3rd Qu.:265.0
## Max. :540.0
Distribution of polls. Quinnipiac and YouGov are the most common - 20% of all polls each.
Favor/Oppose/Others/Spread/Days by Pollster. Favor - max Morning Consult(38%), while min Quinnipiac(19.333(3)). Oppose varries from 45.5 for Morning Consult to 58.333(3) for Quinnipiac. Others varries from 13% for Monmouth to 29% for NBC. Spread (difference between Oppose and Favor) varies from 7.5% for Morning Consult to 39% for Quinnipiac. Days per poll ranges from 4 to 8.
If we look at the polls by date, it is hard to see clear pattern. But we can see that results by different Pollster is consistent.
All our variables seem to be distributed not normally, except for non-responders.
There seems to be no strong correlation between length of poll question and favorable/unfavorable repsonse.
pollfreq<-prop.table(table(polls$newpoll))
barplot(pollfreq, xlab="Pollsters", ylab="% of Total", main="% of Polls by Pollster")
AveByPollster<-polls %>% group_by(newpoll) %>% summarise(avgF = mean(Favor),avgOpp= mean(Oppose),avgOth= mean(Others), avgD=mean(days),avgL=mean(textlen))
AveByPollster$spread<-AveByPollster$avgOpp-AveByPollster$avgF
ggplot(AveByPollster, aes(x=reorder(newpoll,avgF), y=avgF)) + geom_point()+ggtitle("Plot of Favorable Response") + labs(x = "Pollsters", y="Favor to repeal")
ggplot(AveByPollster, aes(x=reorder(newpoll,avgOpp), y=avgOpp)) + geom_point()+ggtitle("Plot of Opposing Response") + labs(x = "Pollsters", y="Oppose to repeal")
ggplot(AveByPollster, aes(x=reorder(newpoll,avgOth), y=avgOth)) + geom_point()+ggtitle("Plot of No Response") + labs(x = "Pollsters", y="No response")
ggplot(AveByPollster, aes(x=reorder(newpoll,spread), y=spread)) + geom_point()+ggtitle("Plot of DIfference between Favorable and Unfavorable Responses") + labs(x = "Pollsters", y="Difference between favor and oppose to repeal")
AveByPollster1<-select(AveByPollster,newpoll,avgF,avgOpp,avgOth)
AveByPollster1 <- as.data.frame(AveByPollster1)
AveByPollster2<-melt(AveByPollster1,id = c("newpoll"))
ggplot(AveByPollster2, aes(fill=variable, y=value, x=newpoll)) + geom_bar( stat="identity", position="fill")+ggtitle("Barchart of Reponses by Pollster") + labs(x = "Pollsters", y="Distribution of reposnses")
ggplot(AveByPollster, aes(x=reorder(newpoll,avgD), y=avgD)) + geom_point()+ggtitle("Plot of Days of Polling") + labs(x = "Pollsters", y="Days of polling")
ggplot(AveByPollster, aes(x=reorder(newpoll,avgL), y=avgL)) + geom_point()+ggtitle("Plot of Poll Question Length") + labs(x = "Pollsters", y="Length of question")
ggplot(AveByPollster, aes(x=avgF, y=avgL)) + geom_point()+ggtitle("Plot of Dependency of Poll Question Length and Favorable Response") + labs(x = "Favorable response", y="Length of question")
qplot(polls$StartD, polls$Favor,geom=c("point", "line"),colour = polls$newpoll, main="Plot of Date of Polling vs Favorable Response", xlab="Date of Polling", ylab="Favorable Reponse")
x <- AveByPollster[2:4]
y <- AveByPollster[6]
cor(x, y)
## avgL
## avgF 0.1044624
## avgOpp 0.2288785
## avgOth -0.3375974
library("ggpubr")
## Loading required package: magrittr
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
ggdensity(polls$Favor,
main = "Density plot of favorable response to a poll",
xlab = "Favorable response to a poll")
library("ggpubr")
ggdensity(polls$Oppose,
main = "Density plot of opposing response to a poll",
xlab = "Opposing response to a poll")
library("ggpubr")
ggdensity(polls$Others,
main = "Density plot of no response to a poll",
xlab = "No response to a poll")
library("ggpubr")
ggdensity(polls$days,
main = "Density plot of length of a poll(days)",
xlab = "Length of a poll(days)")
library("ggpubr")
ggdensity(polls$textlen,
main = "Density plot of length of question of a poll",
xlab = "Length of question of a poll")
Morning consult, Keiser Family, and Monmouth got clustered together, while all other pollsters got clustered seperetly (I have only attempted to create 2 different clusters).
pollcluster<-select(polls,newpoll, Start, Favor, Oppose, Others, days, textlen)
pollcluster<-unite(pollcluster,ncol,c(1:2))
pollcluster
## ncol Favor Oppose Others days textlen
## 1 Public\nPolicy\nPolling_6/9/17 24 55 21 3 116
## 2 YouGov_6/4/17 33 45 22 3 145
## 3 Quinnipiac_5/31/17 17 62 21 7 181
## 4 IPSOS_5/26/17 30 46 24 5 119
## 5 Morning\nConsult_5/25/17 38 47 15 6 284
## 6 Quinnipiac_5/17/17 20 57 23 7 181
## 7 Kaiser\nFamily\nFoundation_5/16/17 31 55 14 7 540
## 8 Monmouth_5/13/17 32 55 13 8 246
## 9 YouGov_5/13/17 31 47 22 4 170
## 10 Public\nPolicy\nPolling_5/12/17 25 52 23 5 115
## 11 NBC/WSJ_5/11/17 23 48 29 5 400
## 12 Economist/\nYouGov_5/6/17 31 44 25 8 151
## 13 YouGov_5/6/17 31 47 22 1 170
## 14 Quinnipiac_5/4/17 21 56 23 7 181
## 15 Morning\nConsult_5/4/17 38 44 18 6 284
row.names(pollcluster)<-pollcluster[,1]
pollcluster1 <- scale(pollcluster[-1])
pollcluster1
## Favor Oppose Others
## Public\nPolicy\nPolling_6/9/17 -0.6831301 0.7683224 0.0000000
## YouGov_6/4/17 0.7356785 -1.0047293 0.2311604
## Quinnipiac_5/31/17 -1.7866478 2.0094587 0.0000000
## IPSOS_5/26/17 0.2627423 -0.8274242 0.6934811
## Morning\nConsult_5/25/17 1.5239055 -0.6501190 -1.3869622
## Quinnipiac_5/17/17 -1.3137116 1.1229328 0.4623207
## Kaiser\nFamily\nFoundation_5/16/17 0.4203877 0.7683224 -1.6181226
## Monmouth_5/13/17 0.5780331 0.7683224 -1.8492829
## YouGov_5/13/17 0.4203877 -0.6501190 0.2311604
## Public\nPolicy\nPolling_5/12/17 -0.5254847 0.2364069 0.4623207
## NBC/WSJ_5/11/17 -0.8407754 -0.4728138 1.8492829
## Economist/\nYouGov_5/6/17 0.4203877 -1.1820345 0.9246415
## YouGov_5/6/17 0.4203877 -0.6501190 0.2311604
## Quinnipiac_5/4/17 -1.1560662 0.9456276 0.4623207
## Morning\nConsult_5/4/17 1.5239055 -1.1820345 -0.6934811
## days textlen
## Public\nPolicy\nPolling_6/9/17 -1.2146780 -0.8701821
## YouGov_6/4/17 -1.2146780 -0.6248618
## Quinnipiac_5/31/17 0.7550701 -0.3203263
## IPSOS_5/26/17 -0.2298039 -0.8448042
## Morning\nConsult_5/25/17 0.2626331 0.5509838
## Quinnipiac_5/17/17 0.7550701 -0.3203263
## Kaiser\nFamily\nFoundation_5/16/17 0.7550701 2.7165698
## Monmouth_5/13/17 1.2475071 0.2295296
## YouGov_5/13/17 -0.7222410 -0.4133788
## Public\nPolicy\nPolling_5/12/17 -0.2298039 -0.8786414
## NBC/WSJ_5/11/17 -0.2298039 1.5322649
## Economist/\nYouGov_5/6/17 1.2475071 -0.5741059
## YouGov_5/6/17 -2.1995521 -0.4133788
## Quinnipiac_5/4/17 0.7550701 -0.3203263
## Morning\nConsult_5/4/17 0.2626331 0.5509838
## attr(,"scaled:center")
## Favor Oppose Others days textlen
## 28.333333 50.666667 21.000000 5.466667 218.866667
## attr(,"scaled:scale")
## Favor Oppose Others days textlen
## 6.343350 5.639993 4.326001 2.030717 118.212802
distance <- get_dist(pollcluster1)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
k2 <- kmeans(pollcluster1, centers = 2, nstart = 25)
str(k2)
## List of 9
## $ cluster : Named int [1:15] 2 2 2 2 1 2 1 1 2 2 ...
## ..- attr(*, "names")= chr [1:15] "Public\nPolicy\nPolling_6/9/17" "YouGov_6/4/17" "Quinnipiac_5/31/17" "IPSOS_5/26/17" ...
## $ centers : num [1:2, 1:5] 1.0116 -0.3678 -0.0739 0.0269 -1.387 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "1" "2"
## .. ..$ : chr [1:5] "Favor" "Oppose" "Others" "days" ...
## $ totss : num 70
## $ withinss : num [1:2] 9.4 36.7
## $ tot.withinss: num 46.1
## $ betweenss : num 23.9
## $ size : int [1:2] 4 11
## $ iter : int 1
## $ ifault : int 0
## - attr(*, "class")= chr "kmeans"
fviz_cluster(k2, data = pollcluster1)
The main conclusion is obvious - majority of Americans oppose to the repeal of Obamacare.
However, data collected in our file is very limited. It does not have sample size. It does not have breakdown by different demographics. So, next step should be pulling additional data from the website provided in the file.
It would be also interesting to understand wide range of results coming from different pollsters. Would the reason for discrepency be the way poll question was asked or is it error due to poll size or the way sample population was chosen?