For this assignment, we were asked to choose a dataset from the FiveThirtyEight website. I downloaded the latest 2024 presidential general election polls and uploaded the raw CSV file to a GitHub repository so that it is openly accessible.
library(tidyverse)  # loads readr (read_csv), dplyr, and tidyr
polls_data <- read_csv("https://raw.githubusercontent.com/GullitNa/fivethirtyeight-polls2024/main/president_polls.csv")
## Rows: 10 Columns: 52
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (20): pollster, sponsors, display_name, pollster_rating_name, methodolog...
## dbl (11): poll_id, pollster_id, sponsor_ids, pollster_rating_id, question_id...
## lgl (21): numeric_grade, pollscore, transparency_score, state, sponsor_candi...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 52
## poll_id pollster_id pollster sponsor_ids sponsors display_name
## <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 89457 1890 SoCal Strategies 2152 On Point Politi… SoCal Strat…
## 2 89457 1890 SoCal Strategies 2152 On Point Politi… SoCal Strat…
## 3 89457 1890 SoCal Strategies 2152 On Point Politi… SoCal Strat…
## 4 89457 1890 SoCal Strategies 2152 On Point Politi… SoCal Strat…
## 5 89457 1890 SoCal Strategies 2152 On Point Politi… SoCal Strat…
## 6 89457 1890 SoCal Strategies 2152 On Point Politi… SoCal Strat…
## # ℹ 46 more variables: pollster_rating_id <dbl>, pollster_rating_name <chr>,
## # numeric_grade <lgl>, pollscore <lgl>, methodology <chr>,
## # transparency_score <lgl>, state <lgl>, start_date <chr>, end_date <chr>,
## # sponsor_candidate_id <lgl>, sponsor_candidate <lgl>,
## # sponsor_candidate_party <lgl>, endorsed_candidate_id <lgl>,
## # endorsed_candidate_name <lgl>, endorsed_candidate_party <lgl>,
## # question_id <dbl>, sample_size <dbl>, population <chr>, …
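Before cleaning, it can help to count how many values are missing in each column. The following is a minimal sketch and not part of the original analysis. It uses a small made-up tibble, `toy_polls`, in place of `polls_data` so that it runs without a network connection; the "Example Pollster" row and the ID 89458 are hypothetical, and the dplyr and tibble packages are assumed to be installed.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for polls_data: two populated columns, two all-NA columns.
toy_polls <- tibble(
  poll_id  = c(89457, 89457, 89458),
  pollster = c("SoCal Strategies", "SoCal Strategies", "Example Pollster"),
  state    = c(NA, NA, NA),
  notes    = c(NA, NA, NA)
)

# Count the missing values in each column.
na_counts <- toy_polls %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Names of the columns in which every value is missing.
all_na_cols <- names(toy_polls)[colSums(is.na(toy_polls)) == nrow(toy_polls)]

print(na_counts)
print(all_na_cols)
```

Running the same two steps on `polls_data` would show which of the 52 columns are entirely empty and which are only partly missing.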
Many columns in the data contain “N/A” or 0 values, which are effectively unavailable data taking up space. Additionally, for clarity as stated in step 1, I capitalize the column names and abbreviate them where needed. I clean the data using the following code:
cleaned_data <- polls_data %>%
  drop_na() %>%              # drop every row that contains at least one NA
  select(-seat_number) %>%   # remove the seat_number column
  rename(                    # syntax: New_Name = old_name
    Poll_ID = poll_id,
    Pollster_ID = pollster_id,
    Pollster_Name = pollster,
    Sponsor_IDs = sponsor_ids,
    Sponsors = sponsors,
    Pollster_Rating_ID = pollster_rating_id,
    Pollster_Rating_Name = pollster_rating_name,
    Pollster_Numeric_Grade = numeric_grade,
    Poll_Score = pollscore,
    Methodology = methodology,
    Transparency_Score = transparency_score,
    State = state
  )
For the sake of cleaning, I removed all the unavailable information in this dataset. However, I would extend the selected article by filling in these missing values with the information required for its completion. To verify the article, it is technically possible to look up each column's missing values independently using information that is already present, such as the poll ID.
## Rows: 0
## Columns: 51
## $ Poll_ID <dbl>
## $ Pollster_ID <dbl>
## $ Pollster_Name <chr>
## $ Sponsor_IDs <dbl>
## $ Sponsors <chr>
## $ display_name <chr>
## $ Pollster_Rating_ID <dbl>
## $ Pollster_Rating_Name <chr>
## $ Pollster_Numeric_Grade <lgl>
## $ Poll_Score <lgl>
## $ Methodology <chr>
## $ Transparency_Score <lgl>
## $ State <lgl>
## $ start_date <chr>
## $ end_date <chr>
## $ sponsor_candidate_id <lgl>
## $ sponsor_candidate <lgl>
## $ sponsor_candidate_party <lgl>
## $ endorsed_candidate_id <lgl>
## $ endorsed_candidate_name <lgl>
## $ endorsed_candidate_party <lgl>
## $ question_id <dbl>
## $ sample_size <dbl>
## $ population <chr>
## $ subpopulation <lgl>
## $ population_full <chr>
## $ tracking <lgl>
## $ created_at <chr>
## $ notes <lgl>
## $ url <chr>
## $ url_article <chr>
## $ url_topline <chr>
## $ url_crosstab <chr>
## $ source <lgl>
## $ internal <lgl>
## $ partisan <lgl>
## $ race_id <dbl>
## $ cycle <dbl>
## $ office_type <chr>
## $ seat_name <lgl>
## $ election_date <chr>
## $ stage <chr>
## $ nationwide_batch <lgl>
## $ ranked_choice_reallocated <lgl>
## $ ranked_choice_round <lgl>
## $ hypothetical <lgl>
## $ party <chr>
## $ answer <chr>
## $ candidate_id <dbl>
## $ candidate_name <chr>
## $ pct <dbl>
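The output above reports zero rows because `drop_na()` removes any row that has a missing value in any column. One possible alternative, sketched here as an assumption about the intent rather than as the original method, is to drop the columns that are entirely missing and keep the rows. The tibble `toy_polls`, the "Example Pollster" row, and the ID 89458 are hypothetical stand-ins, and the dplyr, tidyr, and tibble packages are assumed to be installed.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for polls_data; `state` and `notes` are entirely missing.
toy_polls <- tibble(
  poll_id  = c(89457, 89457, 89458),
  pollster = c("SoCal Strategies", "SoCal Strategies", "Example Pollster"),
  state    = c(NA, NA, NA),
  notes    = c(NA, NA, NA)
)

# Row-wise removal: every row has an NA somewhere, so no rows survive.
rows_dropped <- toy_polls %>% drop_na()

# Column-wise removal: keep only columns with at least one non-missing value,
# then apply the same style of renaming as in the cleaning step.
cols_dropped <- toy_polls %>%
  select(where(~ !all(is.na(.x)))) %>%
  rename(Poll_ID = poll_id, Pollster_Name = pollster)

# Look up a single poll by its ID, for example to check its details independently.
one_poll <- cols_dropped %>% filter(Poll_ID == 89458)

print(nrow(rows_dropped))
print(names(cols_dropped))
print(one_poll)
```

With this approach the rows stay available for lookup by `Poll_ID`, while partly missing columns are kept and could be filled in later.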