Overview

This folder contains an R script that shows how to read a CSV file, parse its columns, perform simple formatting tasks, and create subset data frames. The data comes from multiple polling firms that gauge voter preference between Democrats and Republicans in the United States.

This file contains links to the data behind [Do Voters Want Democrats or Republicans in Congress?](https://projects.fivethirtyeight.com/polls/generic-ballot/)

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.8     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.1
## v readr   2.1.2     v forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library("anytime")
## Warning: package 'anytime' was built under R version 4.1.3
library(RCurl)
## Warning: package 'RCurl' was built under R version 4.1.3
## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:tidyr':
## 
##     complete
#Load the data into a data frame and parse its columns
urlfile <- 'https://raw.githubusercontent.com/jlixander/DATA607/main/Assignment1/Data/congress-generic-ballot/generic_topline.csv'
polls <- read_csv(url(urlfile))
## Rows: 1529 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): subgroup, modeldate, timestamp
## dbl (6): dem_estimate, dem_hi, dem_lo, rep_estimate, rep_hi, rep_lo
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
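
The column-specification message above can be silenced by declaring the column types up front. The following is a minimal sketch, not part of the original script. It assumes readr 2.x, where a string wrapped in `I()` is treated as literal data, and it uses a two-row inline sample (`sample_csv`, a hypothetical name) with only four of the nine columns so that it runs without network access. The sample values are copied from the rows printed in this document.

```r
library(readr)

# Inline sample standing in for the remote CSV (hypothetical object name).
# Only four of the nine columns are included to keep the sketch short.
sample_csv <- I("subgroup,modeldate,dem_estimate,rep_estimate
All polls,9/18/2018,48.8,39.8
All polls,9/17/2018,49.0,39.9
")

# Declare the column types explicitly so read_csv() does not have to guess
# them and does not print the column-specification message.
sample_polls <- read_csv(
  sample_csv,
  col_types = cols(
    subgroup     = col_character(),
    modeldate    = col_character(),  # converted to a Date in a later step
    dem_estimate = col_double(),
    rep_estimate = col_double()
  )
)

str(sample_polls)
```

The same `col_types` argument should work when the first argument is the URL instead of the inline sample, provided all nine columns are listed or a `.default` type is given.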
#View df
polls
## # A tibble: 1,529 x 9
##    subgroup  modeldate dem_estimate dem_hi dem_lo rep_es~1 rep_hi rep_lo times~2
##    <chr>     <chr>            <dbl>  <dbl>  <dbl>    <dbl>  <dbl>  <dbl> <chr>  
##  1 All polls 9/18/2018         48.8   53.4   44.3     39.8   43.8   35.8 15:27:~
##  2 All polls 9/17/2018         49.0   53.5   44.5     39.9   43.9   35.9 12:43:~
##  3 All polls 9/16/2018         49.0   53.5   44.5     39.9   43.9   35.9 18:00:~
##  4 All polls 9/15/2018         49.0   53.5   44.5     39.9   43.9   35.9 23:19:~
##  5 All polls 9/14/2018         48.9   53.4   44.4     39.8   43.8   35.8 15:55:~
##  6 All polls 9/13/2018         48.8   53.3   44.3     39.7   43.7   35.7 12:35:~
##  7 All polls 9/12/2018         48.8   53.3   44.3     39.6   43.6   35.6 09:51:~
##  8 All polls 9/11/2018         48.5   53.0   44.0     39.9   43.9   35.8 18:31:~
##  9 All polls 9/10/2018         48.4   52.9   43.9     39.9   43.9   35.9 18:00:~
## 10 All polls 9/9/2018          48.4   52.9   43.9     39.9   43.9   35.9 18:00:~
## # ... with 1,519 more rows, and abbreviated variable names 1: rep_estimate,
## #   2: timestamp

Data wrangling performed:

#FORMATTING DATE COLUMN
#Set the LC_TIME locale to "C" so that date parsing does not return NA values (the previous locale is saved in lct)
lct <- Sys.getlocale("LC_TIME"); Sys.setlocale("LC_TIME", "C")
## [1] "C"
polls$modeldate <- anydate(polls$modeldate)
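
`anydate()` infers the date format from the input. An alternative sketch, assuming every `modeldate` value follows the month/day/year pattern seen in the printed rows, states the format explicitly with base R's `as.Date()`. Because the format contains only numeric fields, it should not depend on the `LC_TIME` setting. The vector name `raw_dates` is illustrative.

```r
# Sample values copied from the modeldate column printed earlier
raw_dates <- c("9/18/2018", "9/17/2018", "9/9/2018")

# %m/%d/%Y = month / day / four-digit year
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

print(parsed)
class(parsed)  # "Date"
```

A value that does not match the stated format comes back as `NA`, which makes unexpected input easy to spot with `anyNA(parsed)`.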

#Remove the timestamp and subgroup columns
polls <- subset(polls, select = -c(timestamp, subgroup))

#Rename columns
polls <- polls %>%
  rename(ModelDate = modeldate,
         PercDemEst = dem_estimate,
         PercDemHigh = dem_hi,
         PercDemLow = dem_lo,
         PercRepEst = rep_estimate,
         PercRepHigh = rep_hi,
         PercRepLow = rep_lo)


#Round all columns holding a percent value to 2 decimal places
#(read_csv already parsed these columns as doubles, so no conversion is needed)
cols.perc <- c("PercDemEst", "PercDemHigh", "PercDemLow", "PercRepEst", "PercRepHigh", "PercRepLow")

polls <- polls %>% 
  mutate(across(all_of(cols.perc), ~ round(.x, digits = 2)))
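
Rounding changes the stored values, not how a tibble prints them. Tibbles show about three significant digits by default, which is why the percent columns in the results appear to have one decimal place. The sketch below uses a made-up value (`demo` is a hypothetical object) and the `pillar.sigfig` option of the pillar package, which handles tibble printing.

```r
library(tibble)

# Made-up value used only to illustrate rounding versus display
demo <- tibble(PercDemEst = 48.8372)
demo$PercDemEst <- round(demo$PercDemEst, digits = 2)

demo$PercDemEst  # the stored value is 48.84
print(demo)      # default printing uses 3 significant digits

# Ask pillar for one more significant digit when printing
options(pillar.sigfig = 4)
print(demo)
```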

Subsetting performed:

#Split the data into rows dated before 2018 (polls_2017) and rows dated 2018 or later (polls_2018)
polls_2017 <- polls %>% 
  filter(ModelDate < as.Date('2018-01-01'))

polls_2018 <- polls %>% 
  filter(ModelDate >= as.Date('2018-01-01'))
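
The cutoff above sends every row dated before 2018 to `polls_2017`, which is fine as long as the data contains no earlier years. A sketch that matches on the calendar year explicitly is shown below. It uses a small made-up tibble (`demo` and its values are illustrative, not taken from the data set).

```r
library(dplyr)

# Illustrative data; the dates and estimates are made up
demo <- tibble(
  ModelDate  = as.Date(c("2016-12-31", "2017-06-15", "2017-12-31", "2018-01-01")),
  PercDemEst = c(45.0, 46.5, 47.0, 48.1)
)

# format(date, "%Y") returns the four-digit year as a string
demo_2017 <- demo %>%
  filter(format(ModelDate, "%Y") == "2017")

demo_2018 <- demo %>%
  filter(format(ModelDate, "%Y") == "2018")

nrow(demo_2017)  # 2: the 2016 row is excluded
nrow(demo_2018)  # 1
```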

Results

The final formatted data frame and its two subsets are shown below:

polls
## # A tibble: 1,529 x 7
##    ModelDate  PercDemEst PercDemHigh PercDemLow PercRepEst PercRepHigh PercRep~1
##    <date>          <dbl>       <dbl>      <dbl>      <dbl>       <dbl>     <dbl>
##  1 2018-09-18       48.8        53.4       44.2       39.8        43.8      35.8
##  2 2018-09-17       49          53.5       44.5       39.9        43.9      35.9
##  3 2018-09-16       49.0        53.5       44.5       39.9        43.9      35.9
##  4 2018-09-15       49.0        53.5       44.5       39.9        43.9      35.9
##  5 2018-09-14       48.9        53.4       44.4       39.8        43.8      35.8
##  6 2018-09-13       48.8        53.3       44.3       39.7        43.7      35.7
##  7 2018-09-12       48.8        53.3       44.3       39.6        43.6      35.6
##  8 2018-09-11       48.5        53.0       44.0       39.9        43.9      35.8
##  9 2018-09-10       48.4        52.9       43.9       39.9        43.9      35.9
## 10 2018-09-09       48.4        52.9       43.9       39.9        43.9      35.9
## # ... with 1,519 more rows, and abbreviated variable name 1: PercRepLow
polls_2017
## # A tibble: 772 x 7
##    ModelDate  PercDemEst PercDemHigh PercDemLow PercRepEst PercRepHigh PercRep~1
##    <date>          <dbl>       <dbl>      <dbl>      <dbl>       <dbl>     <dbl>
##  1 2017-12-31       47.0        54.8       39.2       32.6        39.0      26.2
##  2 2017-12-31       49.9        55.5       44.3       37.0        41.5      32.4
##  3 2017-12-31       49.9        55.4       44.3       37.0        41.6      32.5
##  4 2017-12-30       49.9        55.5       44.3       37.0        41.5      32.4
##  5 2017-12-30       47.0        54.8       39.2       32.6        39.0      26.2
##  6 2017-12-30       49.9        55.4       44.3       37.0        41.6      32.5
##  7 2017-12-29       47.0        54.8       39.2       32.6        39.0      26.2
##  8 2017-12-29       49.9        55.4       44.3       37.0        41.6      32.5
##  9 2017-12-29       49.9        55.5       44.3       37.0        41.5      32.4
## 10 2017-12-28       46.7        54.7       38.7       32.7        39.2      26.2
## # ... with 762 more rows, and abbreviated variable name 1: PercRepLow
polls_2018
## # A tibble: 757 x 7
##    ModelDate  PercDemEst PercDemHigh PercDemLow PercRepEst PercRepHigh PercRep~1
##    <date>          <dbl>       <dbl>      <dbl>      <dbl>       <dbl>     <dbl>
##  1 2018-09-18       48.8        53.4       44.2       39.8        43.8      35.8
##  2 2018-09-17       49          53.5       44.5       39.9        43.9      35.9
##  3 2018-09-16       49.0        53.5       44.5       39.9        43.9      35.9
##  4 2018-09-15       49.0        53.5       44.5       39.9        43.9      35.9
##  5 2018-09-14       48.9        53.4       44.4       39.8        43.8      35.8
##  6 2018-09-13       48.8        53.3       44.3       39.7        43.7      35.7
##  7 2018-09-12       48.8        53.3       44.3       39.6        43.6      35.6
##  8 2018-09-11       48.5        53.0       44.0       39.9        43.9      35.8
##  9 2018-09-10       48.4        52.9       43.9       39.9        43.9      35.9
## 10 2018-09-09       48.4        52.9       43.9       39.9        43.9      35.9
## # ... with 747 more rows, and abbreviated variable name 1: PercRepLow

Findings and recommendations

To improve the accuracy and integrity of these polls, it would be good practice to include a weighted score for geographic area. Voters who reside in rural, suburban, and urban areas have historically held sharply different opinions, so this score would provide insight into possible skew that could bias the polling data. Additionally, I would include other parties, such as the Libertarian or Green Party, to gauge whether the “other” category is gaining traction over time. The next dimension I would add is whether a poll is influenced by a politically affiliated organization. Lastly, it may be worthwhile to record whether the pollster is a for-profit or non-profit organization.
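
One possible reading of the geographic weighting idea is sketched below. All numbers are hypothetical and chosen only to make the arithmetic easy to follow. The sketch assumes a poll reports support separately for rural, suburban, and urban respondents, and that the true share of each area type in the electorate is known.

```r
# Hypothetical support (%) for one party among respondents in each area type
support <- c(rural = 40, suburban = 50, urban = 60)

# Hypothetical share of respondents in the sample vs. share of the electorate
sample_share     <- c(rural = 0.10, suburban = 0.40, urban = 0.50)
population_share <- c(rural = 0.20, suburban = 0.50, urban = 0.30)

# Estimate that reflects the sample's own geographic mix
raw_estimate <- weighted.mean(support, sample_share)

# Estimate reweighted to the electorate's geographic mix
weighted_estimate <- weighted.mean(support, population_share)

raw_estimate       # 54
weighted_estimate  # 51
```

In this made-up case, oversampling urban respondents inflates the raw estimate by three points. The gap between the two figures is one way to quantify geographic skew.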