This is the assignment for Week 1 of the DATA 607 course for the Fall 2024 semester. The goal of the assignment is to establish a base level of understanding for working with cloud-hosted data sets within R. There are several steps to this assignment, and each one is broken down below.
Assuming this is the article that goes along with the dataset: https://projects.fivethirtyeight.com/polls/generic-ballot/. I think this data is interesting; I did not realize that 538 actually made their data publicly available like this. Additionally, I have never worked with polling data before. I'm unfamiliar with the pollster grading system that appears in the "grade" column, and I'm curious how that grade is obtained. Lastly, I assume that the dem/rep columns are the initial Democratic and Republican numbers from each poll, but I'm more curious about the adjusted versions of these columns. I didn't see any data dictionary or metadata that goes with this data.
Take the data, and create one or more code blocks. You should finish with a data frame that contains a subset of the columns in your selected dataset. If there is an obvious target (aka predictor or independent) variable, you should include this in your set of columns. You should include (or add if necessary) meaningful column names and replace (if necessary) any non-intuitive abbreviations used in the data that you selected. For example, if you had instead been tasked with working with the UCI mushroom dataset, you would include the target column for edible or poisonous, and transform “e” values to “edible.” Your deliverable is the R code to perform these transformation tasks.
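For reference, the kind of renaming and recoding the prompt describes might look something like the sketch below in base R. The data frame and values here are hypothetical, made up only to illustrate the pattern; they are not part of the 538 workflow that follows.

# Hypothetical mushroom-style data, for illustration only
mushrooms <- data.frame(class = c("e", "p", "e"), cap_shape = c("x", "b", "x"))
# Replace the non-intuitive abbreviations with readable labels
mushrooms$class[mushrooms$class == "e"] <- "edible"
mushrooms$class[mushrooms$class == "p"] <- "poisonous"
# Give the target column a more meaningful name
names(mushrooms)[names(mushrooms) == "class"] <- "edibility"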
First, I read in the data source, which I originally grabbed from FiveThirtyEight's data portal. I pulled the "Do Voters Want Democrats Or Republicans In Congress?" dataset.
# Importing the readr library (base R's read.csv() is what is actually used below)
library(readr)
# Reading the external CSV from GitHub into a data frame named x
x <- read.csv('https://raw.githubusercontent.com/jhnboyy/CUNYSPS_DATA607/main/Week1/538_generic_poll_list.csv')
# Printing the first six rows of the data frame
print(head(x))
## subgroup modeldate startdate enddate pollster grade
## 1 All polls 11/7/2022 11/21/2020 11/23/2020 McLaughlin & Associates C/D
## 2 All polls 11/7/2022 12/9/2020 12/13/2020 McLaughlin & Associates C/D
## 3 All polls 11/7/2022 1/20/2021 1/22/2021 Morning Consult B
## 4 All polls 11/7/2022 1/21/2021 1/23/2021 Morning Consult B
## 5 All polls 11/7/2022 1/22/2021 1/24/2021 Morning Consult B
## 6 All polls 11/7/2022 1/23/2021 1/25/2021 Morning Consult B
## samplesize population weight influence dem rep adjusted_dem adjusted_rep
## 1 1000 lv 0.43667210 0 45 47 44.37718 44.04876
## 2 1000 lv 0.41408315 0 46 48 45.37718 45.04876
## 3 10989 rv 0.06026821 0 45 41 44.14622 41.92001
## 4 10764 rv 0.05825534 0 45 40 44.14622 40.92001
## 5 11402 rv 0.05816801 0 45 41 44.14622 41.92001
## 6 13334 rv 0.06049161 0 45 40 44.14622 40.92001
## multiversions tracking
## 1 NA
## 2 NA
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
## url
## 1 https://mclaughlinonline.com/pols/wp-content/uploads/2020/11/Newsmax-National-PPT-11-24-20-.pdf
## 2 https://mclaughlinonline.com/pols/wp-content/uploads/2020/12/National-Monthly-December-For-Release-1.pdf
## 3 https://morningconsult.com/2022-midterm-elections-tracker/
## 4 https://morningconsult.com/2022-midterm-elections-tracker/
## 5 https://morningconsult.com/2022-midterm-elections-tracker/
## 6 https://morningconsult.com/2022-midterm-elections-tracker/
## poll_id question_id createddate timestamp
## 1 73339 139355 1/20/2021 11/7/2022 18:59
## 2 73755 139356 1/20/2021 11/7/2022 18:59
## 3 80108 160543 9/8/2022 11/7/2022 18:59
## 4 80109 160544 9/8/2022 11/7/2022 18:59
## 5 80110 160545 9/8/2022 11/7/2022 18:59
## 6 80111 160546 9/8/2022 11/7/2022 18:59
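Before subsetting, a quick optional check of two things raised in the Overview: how the pollster grades are distributed and roughly how far the adjusted numbers sit from the raw ones. This is only exploratory and is not one of the required transformation steps.

# Distribution of pollster grades in the dataset
table(x$grade)
# Rough size of the adjustment applied to the raw dem/rep numbers
summary(x$dem - x$adjusted_dem)
summary(x$rep - x$adjusted_rep)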
The next step is limiting the data frame to a subset of the original 21 columns.
# Subsetting for the assignment. The columns to keep are:
# poll_id, startdate, enddate, pollster, grade, samplesize, url, dem, rep, adjusted_dem, adjusted_rep
subset_df <- x[, c("poll_id", "startdate", "enddate", "pollster", "grade", "samplesize", "url", "dem", "rep", "adjusted_dem", "adjusted_rep")]
# Printing the first six rows
print(head(subset_df))
## poll_id startdate enddate pollster grade samplesize
## 1 73339 11/21/2020 11/23/2020 McLaughlin & Associates C/D 1000
## 2 73755 12/9/2020 12/13/2020 McLaughlin & Associates C/D 1000
## 3 80108 1/20/2021 1/22/2021 Morning Consult B 10989
## 4 80109 1/21/2021 1/23/2021 Morning Consult B 10764
## 5 80110 1/22/2021 1/24/2021 Morning Consult B 11402
## 6 80111 1/23/2021 1/25/2021 Morning Consult B 13334
## url
## 1 https://mclaughlinonline.com/pols/wp-content/uploads/2020/11/Newsmax-National-PPT-11-24-20-.pdf
## 2 https://mclaughlinonline.com/pols/wp-content/uploads/2020/12/National-Monthly-December-For-Release-1.pdf
## 3 https://morningconsult.com/2022-midterm-elections-tracker/
## 4 https://morningconsult.com/2022-midterm-elections-tracker/
## 5 https://morningconsult.com/2022-midterm-elections-tracker/
## 6 https://morningconsult.com/2022-midterm-elections-tracker/
## dem rep adjusted_dem adjusted_rep
## 1 45 47 44.37718 44.04876
## 2 46 48 45.37718 45.04876
## 3 45 41 44.14622 41.92001
## 4 45 40 44.14622 40.92001
## 5 45 41 44.14622 41.92001
## 6 45 40 44.14622 40.92001
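The prompt also asks for meaningful column names. Because the later steps reference the original names (for example samplesize), one way to do this without breaking the code below is to build a separately named copy. The new names here are my own suggestions, not names defined by 538.

# Optional copy of the subset with more descriptive column names.
# subset_df itself is left unchanged so the filtering steps below keep working.
renamed_df <- setNames(subset_df,
                       c("poll_id", "start_date", "end_date", "pollster", "grade",
                         "sample_size", "source_url", "pct_democrat", "pct_republican",
                         "adjusted_pct_democrat", "adjusted_pct_republican"))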
Then, just for the sake of this exercise, I want to limit the data to the polls that have a sample size larger than the median.
# Further limiting the data frame by filtering on sample size: grab the median, then keep the rows above it
sample_size_med <- median(subset_df$samplesize)
print(sample_size_med)
## [1] NA
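The NA result suggests there are missing values in the samplesize column, so a quick check of how many rows are affected may be worthwhile before recomputing the median.

# Counting the rows with a missing sample size
sum(is.na(subset_df$samplesize))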
# The NAs in samplesize need to be excluded for the median calculation to work
sample_size_med <- median(subset_df$samplesize, na.rm = TRUE)
print(sample_size_med)
## [1] 7947
# Limiting the data frame to the rows with a sample size above the median
grt_med <- subset_df[subset_df$samplesize > sample_size_med, ]
print(head(grt_med))
## poll_id startdate enddate pollster grade samplesize
## 3 80108 1/20/2021 1/22/2021 Morning Consult B 10989
## 4 80109 1/21/2021 1/23/2021 Morning Consult B 10764
## 5 80110 1/22/2021 1/24/2021 Morning Consult B 11402
## 6 80111 1/23/2021 1/25/2021 Morning Consult B 13334
## 7 80112 1/24/2021 1/26/2021 Morning Consult B 14159
## 8 80113 1/25/2021 1/27/2021 Morning Consult B 14390
## url dem rep
## 3 https://morningconsult.com/2022-midterm-elections-tracker/ 45 41
## 4 https://morningconsult.com/2022-midterm-elections-tracker/ 45 40
## 5 https://morningconsult.com/2022-midterm-elections-tracker/ 45 41
## 6 https://morningconsult.com/2022-midterm-elections-tracker/ 45 40
## 7 https://morningconsult.com/2022-midterm-elections-tracker/ 45 40
## 8 https://morningconsult.com/2022-midterm-elections-tracker/ 44 40
## adjusted_dem adjusted_rep
## 3 44.14622 41.92001
## 4 44.14622 40.92001
## 5 44.14622 41.92001
## 6 44.14622 40.92001
## 7 44.14622 40.92001
## 8 43.14622 40.92001
print(dim(grt_med))
## [1] 609 11
Make sure that the original data file is accessible through your code—for example, stored in a GitHub repository or AWS S3 bucket and referenced in your code. If the code references data on your local machine, then your work is not reproducible!
# This has been completed with the following line, which also appears above:
x <- read.csv('https://raw.githubusercontent.com/jhnboyy/CUNYSPS_DATA607/main/Week1/538_generic_poll_list.csv')
Start your R Markdown document with a two to three sentence “Overview” or “Introduction” description of what the article that you chose is about, and include a link to the article.
This has been completed; please reference the Overview section above.
Finish with a “Conclusions” or “Findings and Recommendations” text block that includes what you might do to extend, verify, or update the work from the selected article.
In conclusion, to extend this work I would weight the most recent polls more heavily than older ones, or perhaps give more weight to polls of registered voters. Additionally, I would want to cut the data to see how the final numbers shift when the higher-graded polls are compared against the lower-graded ones. Lastly, I would also want to compare large versus small polls.
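As a rough starting point for the grade comparison mentioned above, a sketch could look like the following. The split between "higher" and "lower" grades is my own assumption about which letter grades count as higher; 538's own grading tiers may differ.

# Sketch: compare the average adjusted Dem-minus-Rep margin for higher- vs. lower-graded polls.
# The grade grouping below is an assumption for illustration purposes.
higher_grades <- c("A+", "A", "A-", "A/B")
higher <- grt_med[grt_med$grade %in% higher_grades, ]
lower  <- grt_med[!(grt_med$grade %in% higher_grades), ]
mean(higher$adjusted_dem - higher$adjusted_rep, na.rm = TRUE)
mean(lower$adjusted_dem - lower$adjusted_rep, na.rm = TRUE)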
Each of your text blocks should minimally include at least one header, and additional non-header text.
The code has been commented throughout, and I also used headers for all of the step-by-step sections.
You’re of course welcome—but not required–to include additional information, such as exploratory data analysis graphics (which we will cover later in the course).
I did not do this part and decided to skip it.
Place your solution into a single R Markdown (.Rmd) file and publish your solution out to rpubs.com.
Completed. The solution has been uploaded to RPubs.
Post the .Rmd file in your GitHub repository, and provide the appropriate URLs to your GitHub repository and your rpubs.com file in your assignment link.
The .Rmd file has been pushed to the GitHub repository for this class (https://github.com/jhnboyy/CUNYSPS_DATA607).