This is the assignment for Week 1 of the DATA 607 course for the Fall 2024 semester. The goal of the assignment is to establish a base level of understanding for working with cloud-hosted data sets within R. There are several steps to this assignment, and each one is broken down below.
Assuming this is the article that goes along with the dataset: https://projects.fivethirtyeight.com/polls/generic-ballot/. I think this data is interesting; I did not realize that 538 actually made their data publicly available like this. Additionally, I have never worked with polling data before. I'm unfamiliar with the pollster grading system that appears in the "grade" column, and I'm curious how that grade is obtained. Lastly, I assume that the dem/rep columns are the initial Democratic and Republican numbers from each poll, but I'm more curious about the adjusted versions of these columns. I didn't see any data dictionary or metadata that goes with this data.
Take the data, and create one or more code blocks. You should finish with a data frame that contains a subset of the columns in your selected dataset. If there is an obvious target (aka predictor or independent) variable, you should include this in your set of columns. You should include (or add if necessary) meaningful column names and replace (if necessary) any non-intuitive abbreviations used in the data that you selected. For example, if you had instead been tasked with working with the UCI mushroom dataset, you would include the target column for edible or poisonous, and transform “e” values to “edible.” Your deliverable is the R code to perform these transformation tasks.
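For reference, the kind of renaming and recoding the prompt describes might look something like the sketch below in base R. The data frame and values here are hypothetical, made up only to illustrate the pattern; they are not part of the 538 workflow that follows.

# Hypothetical mushroom-style data, for illustration only
mushrooms <- data.frame(class = c("e", "p", "e"), cap_shape = c("x", "b", "x"))
# Replace the non-intuitive abbreviations with readable labels
mushrooms$class[mushrooms$class == "e"] <- "edible"
mushrooms$class[mushrooms$class == "p"] <- "poisonous"
# Give the target column a more meaningful name
names(mushrooms)[names(mushrooms) == "class"] <- "edibility"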
First, I read in the data source, which I originally grabbed from FiveThirtyEight's data portal. I pulled the "Do Voters Want Democrats Or Republicans In Congress?" dataset.
# Importing the readr library (base R's read.csv() is what is actually used below)
library(readr)
# Reading the external CSV from GitHub into a data frame named x
x <- read.csv('https://raw.githubusercontent.com/jhnboyy/CUNYSPS_DATA607/main/Week1/538_generic_poll_list.csv')
# Printing the first six rows of the data frame
print(head(x))
## subgroup modeldate startdate enddate pollster grade
## 1 All polls 11/7/2022 11/21/2020 11/23/2020 McLaughlin & Associates C/D
## 2 All polls 11/7/2022 12/9/2020 12/13/2020 McLaughlin & Associates C/D
## 3 All polls 11/7/2022 1/20/2021 1/22/2021 Morning Consult B
## 4 All polls 11/7/2022 1/21/2021 1/23/2021 Morning Consult B
## 5 All polls 11/7/2022 1/22/2021 1/24/2021 Morning Consult B
## 6 All polls 11/7/2022 1/23/2021 1/25/2021 Morning Consult B
## samplesize population weight influence dem rep adjusted_dem adjusted_rep
## 1 1000 lv 0.43667210 0 45 47 44.37718 44.04876
## 2 1000 lv 0.41408315 0 46 48 45.37718 45.04876
## 3 10989 rv 0.06026821 0 45 41 44.14622 41.92001
## 4 10764 rv 0.05825534 0 45 40 44.14622 40.92001
## 5 11402 rv 0.05816801 0 45 41 44.14622 41.92001
## 6 13334 rv 0.06049161 0 45 40 44.14622 40.92001
## multiversions tracking
## 1 NA
## 2 NA
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
## url
## 1 https://mclaughlinonline.com/pols/wp-content/uploads/2020/11/Newsmax-National-PPT-11-24-20-.pdf
## 2 https://mclaughlinonline.com/pols/wp-content/uploads/2020/12/National-Monthly-December-For-Release-1.pdf
## 3 https://morningconsult.com/2022-midterm-elections-tracker/
## 4 https://morningconsult.com/2022-midterm-elections-tracker/
## 5 https://morningconsult.com/2022-midterm-elections-tracker/
## 6 https://morningconsult.com/2022-midterm-elections-tracker/
## poll_id question_id createddate timestamp
## 1 73339 139355 1/20/2021 11/7/2022 18:59
## 2 73755 139356 1/20/2021 11/7/2022 18:59
## 3 80108 160543 9/8/2022 11/7/2022 18:59
## 4 80109 160544 9/8/2022 11/7/2022 18:59
## 5 80110 160545 9/8/2022 11/7/2022 18:59
## 6 80111 160546 9/8/2022 11/7/2022 18:59
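Before subsetting, a quick optional check of two things raised in the Overview: how the pollster grades are distributed and roughly how far the adjusted numbers sit from the raw ones. This is only exploratory and is not one of the required transformation steps.

# Distribution of pollster grades in the dataset
table(x$grade)
# Rough size of the adjustment applied to the raw dem/rep numbers
summary(x$dem - x$adjusted_dem)
summary(x$rep - x$adjusted_rep)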
The next step is limiting the data frame to a subset of the original 21 columns.
# Subsetting for the assignment. The columns to keep are:
# poll_id, startdate, enddate, pollster, grade, samplesize, url, dem, rep, adjusted_dem, adjusted_rep
subset_df <- x[, c("poll_id", "startdate", "enddate", "pollster", "grade", "samplesize", "url", "dem", "rep", "adjusted_dem", "adjusted_rep")]
# Printing the first six rows
print(head(subset_df))
## poll_id startdate enddate pollster grade samplesize
## 1 73339 11/21/2020 11/23/2020 McLaughlin & Associates C/D 1000
## 2 73755 12/9/2020 12/13/2020 McLaughlin & Associates C/D 1000
## 3 80108 1/20/2021 1/22/2021 Morning Consult B 10989
## 4 80109 1/21/2021 1/23/2021 Morning Consult B 10764
## 5 80110 1/22/2021 1/24/2021 Morning Consult B 11402
## 6 80111 1/23/2021 1/25/2021 Morning Consult B 13334
## url
## 1 https://mclaughlinonline.com/pols/wp-content/uploads/2020/11/Newsmax-National-PPT-11-24-20-.pdf
## 2 https://mclaughlinonline.com/pols/wp-content/uploads/2020/12/National-Monthly-December-For-Release-1.pdf
## 3 https://morningconsult.com/2022-midterm-elections-tracker/
## 4 https://morningconsult.com/2022-midterm-elections-tracker/
## 5 https://morningconsult.com/2022-midterm-elections-tracker/
## 6 https://morningconsult.com/2022-midterm-elections-tracker/
## dem rep adjusted_dem adjusted_rep
## 1 45 47 44.37718 44.04876
## 2 46 48 45.37718 45.04876
## 3 45 41 44.14622 41.92001
## 4 45 40 44.14622 40.92001
## 5 45 41 44.14622 41.92001
## 6 45 40 44.14622 40.92001
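The prompt also asks for meaningful column names. Because the later steps reference the original names (for example samplesize), one way to do this without breaking the code below is to build a separately named copy. The new names here are my own suggestions, not names defined by 538.

# Optional copy of the subset with more descriptive column names.
# subset_df itself is left unchanged so the filtering steps below keep working.
renamed_df <- setNames(subset_df,
                       c("poll_id", "start_date", "end_date", "pollster", "grade",
                         "sample_size", "source_url", "pct_democrat", "pct_republican",
                         "adjusted_pct_democrat", "adjusted_pct_republican"))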
Then, just for the sake of this exercise, I want to limit the data to the polls that have a sample size larger than the median.
# Further limiting the data frame by filtering on sample size: grab the median, then keep the rows above it
sample_size_med <- median(subset_df$samplesize)
print(sample_size_med)
## [1] NA
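The NA result suggests there are missing values in the samplesize column, so a quick check of how many rows are affected may be worthwhile before recomputing the median.

# Counting the rows with a missing sample size
sum(is.na(subset_df$samplesize))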
# The NAs in samplesize need to be excluded for the median calculation to work
sample_size_med <- median(subset_df$samplesize, na.rm = TRUE)
print(sample_size_med)
## [1] 7947
# Limiting the data frame to the rows with a sample size above the median
grt_med <- subset_df[subset_df$samplesize > sample_size_med, ]
print(head(grt_med))
## poll_id startdate enddate pollster grade samplesize
## 3 80108 1/20/2021 1/22/2021 Morning Consult B 10989
## 4 80109 1/21/2021 1/23/2021 Morning Consult B 10764
## 5 80110 1/22/2021 1/24/2021 Morning Consult B 11402
## 6 80111 1/23/2021 1/25/2021 Morning Consult B 13334
## 7 80112 1/24/2021 1/26/2021 Morning Consult B 14159
## 8 80113 1/25/2021 1/27/2021 Morning Consult B 14390
## url dem rep
## 3 https://morningconsult.com/2022-midterm-elections-tracker/ 45 41
## 4 https://morningconsult.com/2022-midterm-elections-tracker/ 45 40
## 5 https://morningconsult.com/2022-midterm-elections-tracker/ 45 41
## 6 https://morningconsult.com/2022-midterm-elections-tracker/ 45 40
## 7 https://morningconsult.com/2022-midterm-elections-tracker/ 45 40
## 8 https://morningconsult.com/2022-midterm-elections-tracker/ 44 40
## adjusted_dem adjusted_rep
## 3 44.14622 41.92001
## 4 44.14622 40.92001
## 5 44.14622 41.92001
## 6 44.14622 40.92001
## 7 44.14622 40.92001
## 8 43.14622 40.92001
print(dim(grt_med))
## [1] 609 11
Make sure that the original data file is accessible through your code—for example, stored in a GitHub repository or AWS S3 bucket and referenced in your code. If the code references data on your local machine, then your work is not reproducible!
# This has been completed with the following line, which also appears above:
x <- read.csv('https://raw.githubusercontent.com/jhnboyy/CUNYSPS_DATA607/main/Week1/538_generic_poll_list.csv')
Start your R Markdown document with a two to three sentence “Overview” or “Introduction” description of what the article that you chose is about, and include a link to the article.
This has been completed; please reference the Overview section above.
Finish with a “Conclusions” or “Findings and Recommendations” text block that includes what you might do to extend, verify, or update the work from the selected article.
In conclusion, to extend this work I would weight the most recent polls more heavily than older ones, or perhaps give more weight to polls of registered voters. Additionally, I would want to cut the data to see how the final numbers shift when the higher-graded polls are compared against the lower-graded ones. Lastly, I would also want to compare large versus small polls.
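As a rough starting point for the grade comparison mentioned above, a sketch could look like the following. The split between "higher" and "lower" grades is my own assumption about which letter grades count as higher; 538's own grading tiers may differ.

# Sketch: compare the average adjusted Dem-minus-Rep margin for higher- vs. lower-graded polls.
# The grade grouping below is an assumption for illustration purposes.
higher_grades <- c("A+", "A", "A-", "A/B")
higher <- grt_med[grt_med$grade %in% higher_grades, ]
lower  <- grt_med[!(grt_med$grade %in% higher_grades), ]
mean(higher$adjusted_dem - higher$adjusted_rep, na.rm = TRUE)
mean(lower$adjusted_dem - lower$adjusted_rep, na.rm = TRUE)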
Each of your text blocks should minimally include at least one header, and additional non-header text.
The code has been commented throughout, and I also used headers for all of the step-by-step sections.
You’re of course welcome—but not required–to include additional information, such as exploratory data analysis graphics (which we will cover later in the course).
I did not do this part and decided to skip it.
Place your solution into a single R Markdown (.Rmd) file and publish your solution out to rpubs.com.
Completed. The solution has been uploaded to RPubs.
Post the .Rmd file in your GitHub repository, and provide the appropriate URLs to your GitHub repository and your rpubs.com file in your assignment link.
The .Rmd file has been pushed to the GitHub repository for this class (https://github.com/jhnboyy/CUNYSPS_DATA607).