Introduction

This dataset is from FiveThirtyEight, a website that collects data for statistical analysis. The dataset I chose from FiveThirtyEight is the Polls dataset, specifically the presidential approval polls. It aggregates polls from different polling organizations into one file.

x = read.csv(url("https://projects.fivethirtyeight.com/polls-page/president_approval_polls.csv"))

head(x)
str(x)
## 'data.frame':    10009 obs. of  27 variables:
##  $ question_id         : int  139304 139305 139306 139307 139298 139287 139262 139267 139261 139250 ...
##  $ poll_id             : int  74225 74225 74226 74226 74224 74221 74214 74218 74213 74210 ...
##  $ state               : logi  NA NA NA NA NA NA ...
##  $ politician_id       : int  11 11 11 11 11 11 11 11 11 11 ...
##  $ politician          : chr  "Donald Trump" "Donald Trump" "Donald Trump" "Donald Trump" ...
##  $ pollster_id         : int  568 568 23 23 1528 399 396 1189 399 568 ...
##  $ pollster            : chr  "YouGov" "YouGov" "American Research Group" "American Research Group" ...
##  $ sponsor_ids         : chr  "352" "352" "" "" ...
##  $ sponsors            : chr  "Economist" "Economist" "" "" ...
##  $ display_name        : chr  "YouGov" "YouGov" "American Research Group" "American Research Group" ...
##  $ pollster_rating_id  : int  391 391 9 9 546 277 267 218 277 391 ...
##  $ pollster_rating_name: chr  "YouGov" "YouGov" "American Research Group" "American Research Group" ...
##  $ fte_grade           : chr  "B" "B" "B" "B" ...
##  $ sample_size         : int  1500 1155 1100 990 5188 1500 1131 1993 1500 2166 ...
##  $ population          : chr  "a" "rv" "a" "rv" ...
##  $ population_full     : chr  "a" "rv" "a" "rv" ...
##  $ methodology         : chr  "Online" "Online" "Live Phone" "Live Phone" ...
##  $ start_date          : chr  "1/16/21" "1/16/21" "1/16/21" "1/16/21" ...
##  $ end_date            : chr  "1/19/21" "1/19/21" "1/19/21" "1/19/21" ...
##  $ sponsor_candidate   : logi  NA NA NA NA NA NA ...
##  $ tracking            : logi  NA NA NA NA NA TRUE ...
##  $ created_at          : chr  "1/20/21 10:18" "1/20/21 10:18" "1/20/21 10:18" "1/20/21 10:18" ...
##  $ notes               : chr  "" "" "" "" ...
##  $ url                 : chr  "https://docs.cdn.yougov.com/y9zsit5bzd/weeklytrackingreport.pdf" "https://docs.cdn.yougov.com/y9zsit5bzd/weeklytrackingreport.pdf" "https://americanresearchgroup.com/economy/" "https://americanresearchgroup.com/economy/" ...
##  $ source              : chr  "538" "538" "538" "538" ...
##  $ yes                 : num  42 44 30 29 44.6 51 34 39 48 41 ...
##  $ no                  : num  53 55 66 67 53.9 48 61 58 51 59 ...

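We can see that the data is not up to date, as Donald Trump is the only president in this data file. The unique values below confirm this, and they also show that state and sponsor_candidate contain only NA.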

unique(x$politician)
## [1] "Donald Trump"
unique(x$state)
## [1] NA
unique(x$tracking)
## [1]   NA TRUE
unique(x$sponsor_candidate)
## [1] NA

Creation of Subset

First, I am going to trim the dataset down to only the columns that are needed. Columns like pollster_rating_id, sponsor_candidate, and tracking are most likely not needed in the final subset, as they do not help further the exploration of the data. Columns like state contain no information and will also be removed.

subset = x[c("politician", "pollster", "fte_grade", "sample_size", "population", "methodology", "start_date", "end_date", "created_at")]
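The empty columns mentioned above can also be found programmatically rather than by inspecting each one with unique(). The following is a minimal sketch on a small toy data frame (the name `toy` and its values are made up for illustration and only mimic the shape of `x`); the same two expressions should work on `x` itself.

```r
# Toy stand-in for the polls data; the real data frame `x` has 27 columns.
toy = data.frame(
  politician  = c("Donald Trump", "Donald Trump", "Donald Trump"),
  state       = c(NA, NA, NA),
  tracking    = c(NA, TRUE, NA),
  sample_size = c(1500L, 1155L, 1100L),
  stringsAsFactors = FALSE
)

# A column is empty when it has zero non-NA values.
empty_cols = names(toy)[colSums(!is.na(toy)) == 0]
empty_cols

# Columns with at most one distinct non-NA value cannot distinguish
# one poll from another, so they are also candidates for removal.
is_constant = sapply(toy, function(col) length(unique(na.omit(col))) <= 1)
constant_cols = names(toy)[is_constant]
constant_cols
```

A constant column is not automatically useless (politician still documents whom the polls are about), so this list is a starting point for review rather than a rule.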

The population column stores short codes. A separate table explains what each code represents, and I will join it into the subset to make the codes easier to understand.

unique(subset$population)
## [1] "a"  "rv" "lv" "v"
population_key = data.frame(
  population = c("a", "rv", "v", "lv"),
  population_description = c("Adults", "Registered Voters", "Voters", "Likely Voters")
)

subset = merge(subset, population_key, by="population")

head(subset)
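Two default behaviours of merge() are worth keeping in mind for this join: it performs an inner join, so any row whose code is missing from the key table is dropped, and it sorts the result by the key column, so the original row order is lost. The sketch below shows both on toy data and a lookup with match() as one possible alternative. The data frame `polls`, the pollster "Example Poll", and the code "zz" are hypothetical and exist only to trigger the unmatched case; in the real data all four codes are present in the key.

```r
# Toy polls with one code ("zz") that is not in the key table.
polls = data.frame(
  pollster   = c("YouGov", "YouGov", "American Research Group", "Example Poll"),
  population = c("a", "rv", "lv", "zz"),
  stringsAsFactors = FALSE
)
population_key = data.frame(
  population = c("a", "rv", "v", "lv"),
  population_description = c("Adults", "Registered Voters", "Voters", "Likely Voters"),
  stringsAsFactors = FALSE
)

# Default merge(): inner join, the "zz" row disappears.
merged = merge(polls, population_key, by = "population")
nrow(merged)

# all.x = TRUE keeps every row of the left table and fills NA where no key matches.
merged_all = merge(polls, population_key, by = "population", all.x = TRUE)
nrow(merged_all)

# Lookup with match(): keeps all rows and the original row order.
idx = match(polls$population, population_key$population)
polls$population_description = population_key$population_description[idx]
polls
```

Comparing nrow() before and after the join is a cheap check that no polls were silently dropped.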

Conclusion

With this dataset, one can graph the approval ratings over time to see the president's historical approval. One can also apply filters for a certain sample size, population type, or pollster grade. Depending on the filters applied, the results can change drastically, which in turn can change public perception. I would also like to see the dataset include party affiliation, as that may have an effect: some pollsters may only interview certain groups of people.
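As a rough sketch of the filtering and graphing described above, the example below parses the date strings, keeps only polls of adults with a sample size of at least 1000, and plots approval over time. It assumes the approval percentage comes from the yes column of the full data frame, which the subset above does not keep. The toy data frame `polls`, its values, and the chosen thresholds are made up for illustration.

```r
# Toy stand-in using column names from the full data frame.
polls = data.frame(
  end_date    = c("1/19/21", "1/19/21", "12/15/20", "12/20/20", "11/30/20"),
  population  = c("a", "rv", "a", "lv", "a"),
  sample_size = c(1500L, 1155L, 1100L, 990L, 400L),
  yes         = c(42, 44, 30, 29, 45),
  stringsAsFactors = FALSE
)

# Dates are stored as month/day/two-digit-year strings, so they must be
# converted before they can be sorted or plotted on a time axis.
polls$end_date = as.Date(polls$end_date, format = "%m/%d/%y")

# Example filter: adults only, sample size of at least 1000.
filtered = polls[polls$population == "a" & polls$sample_size >= 1000, ]
filtered = filtered[order(filtered$end_date), ]
filtered

# Approval over time for the filtered polls.
plot(filtered$end_date, filtered$yes, type = "b",
     xlab = "Poll end date", ylab = "Approve (%)")
```

Changing the population code or the sample size threshold and re-running the plot is a simple way to see how much the picture depends on the filter.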