Week 1 Assignment – Loading Data into a Data Frame

The goal of this project is to analyze the 2020 US election polls for Trump, Biden, and all of the other presidential candidates. We have a dataset that combines many different polls with the percentage each candidate received in those polls.

The original data is from https://data.fivethirtyeight.com/. I have added the raw data to a new repository on GitHub, available at https://raw.githubusercontent.com/akarimhammoud/Recen2020PollsUS/master/president_polls.csv

First, we get the data from GitHub.

url <- "https://raw.githubusercontent.com/akarimhammoud/Recen2020PollsUS/master/president_polls.csv"
Polls <- read.csv(file= url, header=TRUE)
head(Polls)
##   question_id poll_id cycle state pollster_id                       pollster
## 1      127807   68235  2020              1610 USC Dornsife/Los Angeles Times
## 2      127807   68235  2020              1610 USC Dornsife/Los Angeles Times
## 3      127808   68235  2020              1610 USC Dornsife/Los Angeles Times
## 4      127808   68235  2020              1610 USC Dornsife/Los Angeles Times
## 5      127825   68237  2020              1562   Redfield & Wilton Strategies
## 6      127825   68237  2020              1562   Redfield & Wilton Strategies
##   sponsor_ids sponsors                 display_name pollster_rating_id
## 1                                      USC Dornsife                343
## 2                                      USC Dornsife                343
## 3                                      USC Dornsife                343
## 4                                      USC Dornsife                343
## 5                      Redfield & Wilton Strategies                562
## 6                      Redfield & Wilton Strategies                562
##             pollster_rating_name fte_grade sample_size population
## 1 USC Dornsife/Los Angeles Times       B/C        2545         lv
## 2 USC Dornsife/Los Angeles Times       B/C        2545         lv
## 3 USC Dornsife/Los Angeles Times       B/C        2544         lv
## 4 USC Dornsife/Los Angeles Times       B/C        2544         lv
## 5   Redfield & Wilton Strategies                  1834         lv
## 6   Redfield & Wilton Strategies                  1834         lv
##   population_full methodology    office_type seat_number seat_name start_date
## 1              lv      Online U.S. President           0        NA    8/21/20
## 2              lv      Online U.S. President           0        NA    8/21/20
## 3              lv      Online U.S. President           0        NA    8/21/20
## 4              lv      Online U.S. President           0        NA    8/21/20
## 5              lv      Online U.S. President           0        NA    8/25/20
## 6              lv      Online U.S. President           0        NA    8/25/20
##   end_date election_date sponsor_candidate internal partisan tracking
## 1  8/27/20       11/3/20                      FALSE              TRUE
## 2  8/27/20       11/3/20                      FALSE              TRUE
## 3  8/27/20       11/3/20                      FALSE              TRUE
## 4  8/27/20       11/3/20                      FALSE              TRUE
## 5  8/26/20       11/3/20                      FALSE                NA
## 6  8/26/20       11/3/20                      FALSE                NA
##   nationwide_batch ranked_choice_reallocated    created_at
## 1            FALSE                     FALSE  8/28/20 6:02
## 2            FALSE                     FALSE  8/28/20 6:02
## 3            FALSE                     FALSE  8/28/20 6:02
## 4            FALSE                     FALSE  8/28/20 6:02
## 5            FALSE                     FALSE 8/28/20 10:08
## 6            FALSE                     FALSE 8/28/20 10:08
##                           notes
## 1 probabilistic voting question
## 2 probabilistic voting question
## 3   traditional voting question
## 4   traditional voting question
## 5                              
## 6                              
##                                                                                 url
## 1                                                         https://election.usc.edu/
## 2                                                         https://election.usc.edu/
## 3                                                         https://election.usc.edu/
## 4                                                         https://election.usc.edu/
## 5 https://redfieldandwiltonstrategies.com/latest-usa-voting-intention-august-25-26/
## 6 https://redfieldandwiltonstrategies.com/latest-usa-voting-intention-august-25-26/
##     stage race_id answer candidate_id      candidate_name candidate_party   pct
## 1 general    6210  Biden        13256 Joseph R. Biden Jr.             DEM 52.73
## 2 general    6210  Trump        13254        Donald Trump             REP 40.32
## 3 general    6210  Biden        13256 Joseph R. Biden Jr.             DEM 54.24
## 4 general    6210  Trump        13254        Donald Trump             REP 39.68
## 5 general    6210  Biden        13256 Joseph R. Biden Jr.             DEM 48.85
## 6 general    6210  Trump        13254        Donald Trump             REP 38.83
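The output above shows empty cells in columns such as `state` and `fte_grade`. A minimal sketch, assuming one would prefer those blanks to be read as `NA` so they can be counted: it uses a small inline CSV instead of the URL so it runs offline, and the third row is a hypothetical value added only to show a non-empty `state`.

```r
# Toy CSV modeled on the rows above; the last row is a hypothetical example.
csv_text <- "state,pollster,pct
,USC Dornsife/Los Angeles Times,52.73
,Redfield & Wilton Strategies,48.85
Florida,Example Pollster,47.00"

# na.strings lists the field values that read.csv should treat as missing.
toy <- read.csv(text = csv_text, header = TRUE, na.strings = c("", "NA"))

# Count the missing values in each column.
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)
```

The same `na.strings` argument could be passed to the `read.csv` call that reads from the URL.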

Summary of the data

summary(Polls)
##   question_id        poll_id          cycle         state          
##  Min.   : 92078   Min.   :57025   Min.   :2020   Length:6244       
##  1st Qu.:102947   1st Qu.:59595   1st Qu.:2020   Class :character  
##  Median :116839   Median :63470   Median :2020   Mode  :character  
##  Mean   :114085   Mean   :63307   Mean   :2020                     
##  3rd Qu.:124539   3rd Qu.:66696   3rd Qu.:2020                     
##  Max.   :127826   Max.   :68238   Max.   :2020                     
##                                                                    
##   pollster_id       pollster         sponsor_ids          sponsors        
##  Min.   :  11.0   Length:6244        Length:6244        Length:6244       
##  1st Qu.: 509.0   Class :character   Class :character   Class :character  
##  Median :1102.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 964.9                                                           
##  3rd Qu.:1416.0                                                           
##  Max.   :1610.0                                                           
##                                                                           
##  display_name       pollster_rating_id pollster_rating_name  fte_grade        
##  Length:6244        Min.   :  3.0      Length:6244          Length:6244       
##  Class :character   1st Qu.:133.0      Class :character     Class :character  
##  Mode  :character   Median :245.0      Mode  :character     Mode  :character  
##                     Mean   :264.3                                             
##                     3rd Qu.:391.0                                             
##                     Max.   :606.0                                             
##                     NA's   :2                                                 
##   sample_size     population        population_full    methodology       
##  Min.   :  140   Length:6244        Length:6244        Length:6244       
##  1st Qu.:  767   Class :character   Class :character   Class :character  
##  Median : 1000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 1900                                                           
##  3rd Qu.: 1279                                                           
##  Max.   :33549                                                           
##                                                                          
##  office_type         seat_number seat_name       start_date       
##  Length:6244        Min.   :0    Mode:logical   Length:6244       
##  Class :character   1st Qu.:0    NA's:6244      Class :character  
##  Mode  :character   Median :0                   Mode  :character  
##                     Mean   :0                                     
##                     3rd Qu.:0                                     
##                     Max.   :0                                     
##                                                                   
##    end_date         election_date      sponsor_candidate   internal      
##  Length:6244        Length:6244        Length:6244        Mode :logical  
##  Class :character   Class :character   Class :character   FALSE:6228     
##  Mode  :character   Mode  :character   Mode  :character   TRUE :16       
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##    partisan         tracking       nationwide_batch ranked_choice_reallocated
##  Length:6244        Mode:logical   Mode :logical    Mode :logical            
##  Class :character   TRUE:450       FALSE:6244       FALSE:6244               
##  Mode  :character   NA's:5794                                                
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##   created_at           notes               url               stage          
##  Length:6244        Length:6244        Length:6244        Length:6244       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     race_id        answer           candidate_id   candidate_name    
##  Min.   :6210   Length:6244        Min.   :13253   Length:6244       
##  1st Qu.:6210   Class :character   1st Qu.:13254   Class :character  
##  Median :6214   Mode  :character   Median :13256   Mode  :character  
##  Mean   :6238                      Mean   :13301                     
##  3rd Qu.:6238                      3rd Qu.:13257                     
##  Max.   :8718                      Max.   :15856                     
##                                                                      
##  candidate_party         pct       
##  Length:6244        Min.   : 0.00  
##  Class :character   1st Qu.:41.00  
##  Mode  :character   Median :45.00  
##                     Mean   :43.24  
##                     3rd Qu.:48.10  
##                     Max.   :68.80  
## 
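The summary reports `start_date`, `end_date`, and `election_date` as character columns, so R cannot yet do date arithmetic on them. A short sketch, assuming the month/day/two-digit-year layout seen in the `head()` output holds for every row, of how such strings might be converted:

```r
# Sample values copied from the start_date column shown earlier.
start_raw <- c("8/21/20", "8/25/20")

# %m/%d/%y means month/day/two-digit year.
start_date <- as.Date(start_raw, format = "%m/%d/%y")

# Once converted, differences between dates can be computed directly.
days_between <- as.numeric(diff(start_date))
print(start_date)
print(days_between)
```

Values that do not match the format come back as `NA`, which makes it easy to spot rows with a different layout.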

Create a data frame with a subset of a few columns from the data

myframe <- subset(Polls, select = c(state, pollster, sample_size, created_at, answer, pct))
head(myframe)
##   state                       pollster sample_size    created_at answer   pct
## 1       USC Dornsife/Los Angeles Times        2545  8/28/20 6:02  Biden 52.73
## 2       USC Dornsife/Los Angeles Times        2545  8/28/20 6:02  Trump 40.32
## 3       USC Dornsife/Los Angeles Times        2544  8/28/20 6:02  Biden 54.24
## 4       USC Dornsife/Los Angeles Times        2544  8/28/20 6:02  Trump 39.68
## 5         Redfield & Wilton Strategies        1834 8/28/20 10:08  Biden 48.85
## 6         Redfield & Wilton Strategies        1834 8/28/20 10:08  Trump 38.83
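As an alternative to `subset()`, the same columns can be picked with a character vector, which makes it possible to check that every requested column exists before selecting. This is only a sketch on a one-row toy data frame; the names `toy`, `wanted`, and `small` are illustrative and not part of the assignment.

```r
# One-row toy data frame with a few of the columns shown above.
toy <- data.frame(
  state = "",
  pollster = "USC Dornsife/Los Angeles Times",
  sample_size = 2545,
  methodology = "Online",
  pct = 52.73
)

# Columns we want to keep, given as strings.
wanted <- c("state", "pollster", "sample_size", "pct")

# Stop early if any requested column is not in the data.
missing_cols <- setdiff(wanted, names(toy))
stopifnot(length(missing_cols) == 0)

# drop = FALSE keeps the result a data frame even for a single column.
small <- toy[, wanted, drop = FALSE]
print(names(small))
```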

Change Column Names

colnames(myframe) <- c("state", "pollster", "size", "date", "candidate", "percentage")
head(myframe)
##   state                       pollster size          date candidate percentage
## 1       USC Dornsife/Los Angeles Times 2545  8/28/20 6:02     Biden      52.73
## 2       USC Dornsife/Los Angeles Times 2545  8/28/20 6:02     Trump      40.32
## 3       USC Dornsife/Los Angeles Times 2544  8/28/20 6:02     Biden      54.24
## 4       USC Dornsife/Los Angeles Times 2544  8/28/20 6:02     Trump      39.68
## 5         Redfield & Wilton Strategies 1834 8/28/20 10:08     Biden      48.85
## 6         Redfield & Wilton Strategies 1834 8/28/20 10:08     Trump      38.83
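Assigning to `colnames()` renames columns by position, so the new names must be listed in exactly the same order as the existing columns. A hedged sketch of renaming by name instead, using a toy data frame and a hypothetical `rename_map` lookup vector:

```r
# Toy data frame with some of the original column names.
toy <- data.frame(state = "", sample_size = 2545, pct = 52.73)

# Old name on the left, new name on the right.
rename_map <- c(sample_size = "size", pct = "percentage")

# Rename only the columns that appear in the lookup; others stay unchanged.
hits <- names(toy) %in% names(rename_map)
names(toy)[hits] <- unname(rename_map[names(toy)[hits]])
print(names(toy))
```

With this approach the column order in the data frame does not matter, and columns missing from the lookup keep their names.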

Create a file and call it 2020polls.csv.

write.csv(myframe, file="2020polls.csv", row.names=FALSE)
getwd()
## [1] "/Users/karimh/Documents/Google Drive/R"
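Before uploading the file, it can be worth confirming that it reads back unchanged. A minimal round-trip sketch on a toy data frame, writing to a temporary file (an assumption made here so that nothing is left in the working directory):

```r
# Toy data frame standing in for myframe.
toy <- data.frame(
  candidate = c("Biden", "Trump"),
  percentage = c(52.73, 40.32)
)

# Write to a temporary file; row.names = FALSE avoids an extra index column.
tmp <- tempfile(fileext = ".csv")
write.csv(toy, file = tmp, row.names = FALSE)

# Read the file back and compare it with the original.
back <- read.csv(tmp, header = TRUE)
same <- isTRUE(all.equal(toy, back))
print(same)

# Remove the temporary file.
unlink(tmp)
```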

After creating a repository on GitHub and uploading the file to it, we read the file back from its raw URL.

url <- "https://raw.githubusercontent.com/akarimhammoud/Recen2020PollsUS/master/2020polls.csv"
Pollsfile <- read.csv(file= url, header=TRUE)
head(Pollsfile)
##   state                       pollster size          date candidate percentage
## 1       USC Dornsife/Los Angeles Times 2545  8/28/20 6:02     Biden      52.73
## 2       USC Dornsife/Los Angeles Times 2545  8/28/20 6:02     Trump      40.32
## 3       USC Dornsife/Los Angeles Times 2544  8/28/20 6:02     Biden      54.24
## 4       USC Dornsife/Los Angeles Times 2544  8/28/20 6:02     Trump      39.68
## 5         Redfield & Wilton Strategies 1834 8/28/20 10:08     Biden      48.85
## 6         Redfield & Wilton Strategies 1834 8/28/20 10:08     Trump      38.83

Change “Biden” to “Joe Biden” and “Trump” to “Donald Trump” in the file

Pollsfile$candidate <- sub("Biden", "Joe Biden", Pollsfile$candidate)
Pollsfile$candidate <- sub("Trump", "Donald Trump", Pollsfile$candidate)
head(Pollsfile)
##   state                       pollster size          date    candidate
## 1       USC Dornsife/Los Angeles Times 2545  8/28/20 6:02    Joe Biden
## 2       USC Dornsife/Los Angeles Times 2545  8/28/20 6:02 Donald Trump
## 3       USC Dornsife/Los Angeles Times 2544  8/28/20 6:02    Joe Biden
## 4       USC Dornsife/Los Angeles Times 2544  8/28/20 6:02 Donald Trump
## 5         Redfield & Wilton Strategies 1834 8/28/20 10:08    Joe Biden
## 6         Redfield & Wilton Strategies 1834 8/28/20 10:08 Donald Trump
##   percentage
## 1      52.73
## 2      40.32
## 3      54.24
## 4      39.68
## 5      48.85
## 6      38.83
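Because `sub()` matches a pattern anywhere in the string, running the two lines above a second time would turn "Joe Biden" into "Joe Joe Biden". One possible alternative, sketched here with a hypothetical helper `relabel()` and a placeholder value "Other", replaces only exact matches through a lookup vector, so repeating it is harmless:

```r
# Short name on the left, full name on the right.
full_names <- c(Biden = "Joe Biden", Trump = "Donald Trump")

# Replace values that exactly match a short name; leave everything else alone.
relabel <- function(x, lookup) {
  known <- x %in% names(lookup)
  x[known] <- unname(lookup[x[known]])
  x
}

# "Other" is a placeholder for any candidate not in the lookup.
candidate <- c("Biden", "Trump", "Biden", "Other")
once <- relabel(candidate, full_names)
twice <- relabel(once, full_names)
print(once)
print(identical(once, twice))
```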

require(ggplot2)
## Loading required package: ggplot2

We present the presidential candidates on a bar chart, where each bar counts the number of rows in which a candidate appears.

barplot(table(Pollsfile$candidate ), main = "candidate")
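With many candidates the bars can be hard to compare and the labels may overlap. A small sketch, using a toy vector in place of `Pollsfile$candidate`, of sorting the counts and rotating the labels with base graphics:

```r
# Toy vector standing in for Pollsfile$candidate.
candidate <- c("Joe Biden", "Donald Trump", "Joe Biden",
               "Other", "Joe Biden", "Donald Trump")

# Count rows per candidate and sort from most to least frequent.
counts <- sort(table(candidate), decreasing = TRUE)

# las = 2 rotates the axis labels; cex.names shrinks them to fit.
barplot(counts, main = "Poll rows per candidate", las = 2, cex.names = 0.7)
```

Sorting puts the most frequently polled candidates first, which makes the difference between the major and minor candidates easier to see.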

Conclusion:

As the chart shows, some presidential candidates dropped out of the race earlier than others, which is why some of them appear in more polls than others. The two major candidates, who appear in most of the polls, are Biden and Trump.

Useful URLs:

The data is from: https://data.fivethirtyeight.com/

The original data was saved to: https://raw.githubusercontent.com/akarimhammoud/Recen2020PollsUS/master/president_polls.csv

The new data frame was saved at: https://raw.githubusercontent.com/akarimhammoud/Recen2020PollsUS/master/2020polls.csv

GitHub link for the assignment: https://github.com/akarimhammoud/Recen2020PollsUS/blob/master/CUNY%20SPS%20-%20607%20Week%201%20Assignment..Rmd