This dataset comes from FiveThirtyEight, a website that collects data for statistical analysis. The dataset I chose is the Polls dataset, specifically the presidential approval poll ratings. It aggregates polls from different polling organizations into a single file.
x = read.csv(url("https://projects.fivethirtyeight.com/polls-page/president_approval_polls.csv"))
head(x)
str(x)
## 'data.frame': 10009 obs. of 27 variables:
## $ question_id : int 139304 139305 139306 139307 139298 139287 139262 139267 139261 139250 ...
## $ poll_id : int 74225 74225 74226 74226 74224 74221 74214 74218 74213 74210 ...
## $ state : logi NA NA NA NA NA NA ...
## $ politician_id : int 11 11 11 11 11 11 11 11 11 11 ...
## $ politician : chr "Donald Trump" "Donald Trump" "Donald Trump" "Donald Trump" ...
## $ pollster_id : int 568 568 23 23 1528 399 396 1189 399 568 ...
## $ pollster : chr "YouGov" "YouGov" "American Research Group" "American Research Group" ...
## $ sponsor_ids : chr "352" "352" "" "" ...
## $ sponsors : chr "Economist" "Economist" "" "" ...
## $ display_name : chr "YouGov" "YouGov" "American Research Group" "American Research Group" ...
## $ pollster_rating_id : int 391 391 9 9 546 277 267 218 277 391 ...
## $ pollster_rating_name: chr "YouGov" "YouGov" "American Research Group" "American Research Group" ...
## $ fte_grade : chr "B" "B" "B" "B" ...
## $ sample_size : int 1500 1155 1100 990 5188 1500 1131 1993 1500 2166 ...
## $ population : chr "a" "rv" "a" "rv" ...
## $ population_full : chr "a" "rv" "a" "rv" ...
## $ methodology : chr "Online" "Online" "Live Phone" "Live Phone" ...
## $ start_date : chr "1/16/21" "1/16/21" "1/16/21" "1/16/21" ...
## $ end_date : chr "1/19/21" "1/19/21" "1/19/21" "1/19/21" ...
## $ sponsor_candidate : logi NA NA NA NA NA NA ...
## $ tracking : logi NA NA NA NA NA TRUE ...
## $ created_at : chr "1/20/21 10:18" "1/20/21 10:18" "1/20/21 10:18" "1/20/21 10:18" ...
## $ notes : chr "" "" "" "" ...
## $ url : chr "https://docs.cdn.yougov.com/y9zsit5bzd/weeklytrackingreport.pdf" "https://docs.cdn.yougov.com/y9zsit5bzd/weeklytrackingreport.pdf" "https://americanresearchgroup.com/economy/" "https://americanresearchgroup.com/economy/" ...
## $ source : chr "538" "538" "538" "538" ...
## $ yes : num 42 44 30 29 44.6 51 34 39 48 41 ...
## $ no : num 53 55 66 67 53.9 48 61 58 51 59 ...
We can see that the data is not up to date, as Donald Trump is the only president in this data file. The state and sponsor_candidate columns contain only NA values.
unique(x$politician)
## [1] "Donald Trump"
unique(x$state)
## [1] NA
unique(x$tracking)
## [1] NA TRUE
unique(x$sponsor_candidate)
## [1] NA
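One way to check how current the file is would be to parse the date columns, which are read in as character strings. The sketch below uses a small stand-in data frame (the hypothetical name `polls`) whose values are copied from the str() output above. It assumes every date follows the month/day/two-digit-year pattern seen there.

```r
# Stand-in for the first rows of x; date strings copied from the str() output
polls <- data.frame(
  start_date = c("1/16/21", "1/16/21", "1/16/21", "1/16/21"),
  end_date   = c("1/19/21", "1/19/21", "1/19/21", "1/19/21"),
  stringsAsFactors = FALSE
)

# The dates are month/day/two-digit-year strings, so pass the format explicitly
polls$start_date <- as.Date(polls$start_date, format = "%m/%d/%y")
polls$end_date   <- as.Date(polls$end_date,   format = "%m/%d/%y")

# Earliest and latest end dates in the data
date_range <- range(polls$end_date)
print(date_range)
```

Running the same two conversions on x itself and calling range(x$end_date) would show the period the full file covers.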
First, I am going to trim the dataset down to only the columns that are needed. Columns like pollster_rating_id, sponsor_candidate, and tracking are most likely not needed in the final subset, as they do not help further the exploration of the data. Columns like state do not have any information in them and will also be removed.
subset = x[c("politician", "pollster", "fte_grade", "sample_size", "population", "methodology", "start_date", "end_date", "created_at")]
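Instead of checking columns one at a time with unique(), the empty columns could also be found programmatically. This is a minimal sketch on a hypothetical miniature of x (named `polls` here) with a few of the columns from the str() output. The row values are illustrative stand-ins, not the real file.

```r
# Hypothetical miniature of x; column names and types follow the str() output
polls <- data.frame(
  state             = c(NA, NA, NA, NA),
  politician        = rep("Donald Trump", 4),
  sponsor_candidate = c(NA, NA, NA, NA),
  tracking          = c(NA, NA, NA, TRUE),
  sample_size       = c(1500L, 1155L, 1100L, 990L),
  stringsAsFactors  = FALSE
)

# A column counts as empty when every one of its values is NA
all_na <- vapply(polls, function(col) all(is.na(col)), logical(1))
empty_cols <- names(polls)[all_na]
print(empty_cols)

# Keep only the columns that hold at least one value
polls_trimmed <- polls[, !all_na, drop = FALSE]
```

This only catches NA values. Character columns filled with empty strings, such as some rows of sponsors, would need a separate check.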
The population column uses letter codes. A separate table explains what each code represents, and I will join it into the subset to make the codes easier to understand.
unique(subset$population)
## [1] "a" "rv" "lv" "v"
population_key = data.frame(population = c("a", "rv", "v", "lv"), population_description = c("Adults", "Registered Voters", "Voters", "Likely Voters"))
subset = merge(subset, population_key, by="population")
head(subset)
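A side effect of merge() is that it sorts the result by the key column, so the rows no longer appear in their original order, and rows whose code is missing from the key are dropped unless all.x = TRUE is set. If the original order matters, a lookup with match() is one alternative. The sketch below uses a small stand-in data frame (the hypothetical name `polls`) with values from the str() output.

```r
# Stand-in for a few rows of the subset
polls <- data.frame(
  pollster   = c("YouGov", "YouGov",
                 "American Research Group", "American Research Group"),
  population = c("a", "rv", "a", "rv"),
  stringsAsFactors = FALSE
)
population_key <- data.frame(
  population = c("a", "rv", "v", "lv"),
  population_description = c("Adults", "Registered Voters",
                             "Voters", "Likely Voters"),
  stringsAsFactors = FALSE
)

# merge() reorders rows by the key; all.x = TRUE keeps unmatched rows as NA
merged <- merge(polls, population_key, by = "population", all.x = TRUE)

# match() finds each code's row in the key and leaves the row order untouched
idx <- match(polls$population, population_key$population)
polls$population_description <- population_key$population_description[idx]
print(polls)
```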
With this dataset, one can graph the approval ratings over time to see the president's historical approval rating. One can also filter by sample size, population type, or pollster grade. Depending on which filters are applied, the results can change drastically, which in turn can change public perception. I would also like to see the dataset include party affiliation, as that may also have an effect: some pollsters may interview only certain groups of people.
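The effect of filtering can be sketched on a few rows. The example below assumes the yes column holds the approval percentage and that it is kept alongside the other columns (the subset built above does not include it). The data frame `polls` is a hypothetical stand-in with values copied from the str() output, and the filter thresholds are illustrative only.

```r
# Stand-in rows; values copied from the str() output
polls <- data.frame(
  pollster    = c("YouGov", "YouGov",
                  "American Research Group", "American Research Group"),
  fte_grade   = c("B", "B", "B", "B"),
  sample_size = c(1500L, 1155L, 1100L, 990L),
  population  = c("a", "rv", "a", "rv"),
  end_date    = c("1/19/21", "1/19/21", "1/19/21", "1/19/21"),
  yes         = c(42, 44, 30, 29),
  stringsAsFactors = FALSE
)
polls$end_date <- as.Date(polls$end_date, format = "%m/%d/%y")

# Illustrative filter: registered voters only, at least 1000 respondents
keep <- polls$population == "rv" & polls$sample_size >= 1000
filtered <- polls[keep, ]

# Average approval per end date, with and without the filter
daily_all <- aggregate(yes ~ end_date, data = polls, FUN = mean)
daily_rv  <- aggregate(yes ~ end_date, data = filtered, FUN = mean)
print(daily_all)
print(daily_rv)
```

On these four rows the unfiltered average is 36.25 while the filtered one is 44, which illustrates how much the choice of filter can move the result. With the full file, a call such as plot(daily_all$end_date, daily_all$yes, type = "l") would draw the approval trend over time.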