In this vignette, we’ll explore FiveThirtyEight’s presidential poll
averages dataset, which covers 1968 to 2016, and work mainly with its
2016 subset, polls_2016. Our objective is to showcase the
capabilities of tidyverse packages, specifically
dplyr and ggplot2, in analyzing and
visualizing average poll estimates for presidential candidates.
# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
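# dplyr and ggplot2 are already attached by library(tidyverse);
# loading them again is harmless but redundant
library(dplyr)
library(ggplot2)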
Here, we will highlight the utility of two essential functions from
the tidyverse suite: read_csv() for data loading and
filter() for data filtering.
# Read the CSV data
polls <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/polls/pres_pollaverages_1968-2016.csv")
## Rows: 217473 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): state, modeldate, candidate_name, timestamp, comment, election_dat...
## dbl (24): cycle, candidate_id, pct_estimate, pct_trend_adjusted, election_qd...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filtering rows where cycle is 2016
polls_2016 <- polls %>%
  filter(cycle == 2016)
# Display the first few rows of the dataset
head(polls_2016)
glimpse(polls_2016)
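As a minimal sketch of the same two steps on a tiny inline table, the example below reads literal CSV text and filters it. The three rows are invented illustration values, not FiveThirtyEight data, and the sketch assumes readr 2.0 or later, where read_csv() accepts literal text wrapped in I(). Setting show_col_types = FALSE quiets the column specification message shown above.

```r
library(readr)
library(dplyr)

# Hypothetical toy data standing in for the real CSV (values are invented)
toy_csv <- I("cycle,state,candidate_name,pct_estimate
2012,Ohio,Candidate A,48.5
2016,Ohio,Candidate B,45.1
2016,Iowa,Candidate B,47.3")

# show_col_types = FALSE suppresses the column specification message
toy_polls <- read_csv(toy_csv, show_col_types = FALSE)

# Keep only the rows for the 2016 cycle
toy_2016 <- toy_polls %>%
  filter(cycle == 2016)

toy_2016
```

The same filter() call accepts several conditions separated by commas, which are combined with a logical AND.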
In this section, we’ll visualize pct_estimate
for the presidential candidates in each state during the 2016 election.
The state column also contains congressional-district entries such as
ME-1 and NE-2, so we use the mutate() and str_replace_all()
functions to map those entries to their parent states before
plotting.
# Check state names
state_names <- unique(polls_2016$state)
print(state_names)
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Colorado" "Florida" "Georgia"
## [7] "Illinois" "Iowa" "Kansas"
## [10] "Michigan" "Minnesota" "Montana"
## [13] "National" "Nevada" "New Hampshire"
## [16] "New Jersey" "New York" "North Carolina"
## [19] "Ohio" "Pennsylvania" "South Carolina"
## [22] "Texas" "Utah" "Virginia"
## [25] "West Virginia" "Wisconsin" "Missouri"
## [28] "Maine" "California" "Mississippi"
## [31] "Maryland" "Massachusetts" "Connecticut"
## [34] "Idaho" "Indiana" "Oklahoma"
## [37] "Louisiana" "Oregon" "Tennessee"
## [40] "New Mexico" "Washington" "Arkansas"
## [43] "ME-1" "ME-2" "Vermont"
## [46] "Kentucky" "Delaware" "Hawaii"
## [49] "Nebraska" "North Dakota" "Rhode Island"
## [52] "South Dakota" "Wyoming" "NE-1"
## [55] "NE-2" "NE-3" "District of Columbia"
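One way to handle the district-level labels in this output (ME-1, ME-2, NE-1, and so on) is to pass str_replace_all() a named vector of regular-expression replacements. The following is a minimal sketch on a toy vector of labels, chosen here only for illustration:

```r
library(stringr)

# Toy labels: two plain state names and three district-level entries
labels <- c("Maine", "ME-1", "ME-2", "NE-3", "Nevada")

# Each name is a regex pattern; each value is its replacement
cleaned <- str_replace_all(
  labels,
  c("^ME-\\d" = "Maine", "^NE-\\d" = "Nebraska")
)

cleaned
# "Maine" "Maine" "Maine" "Nebraska" "Nevada"
```

Because the patterns are anchored with ^ and require a hyphen followed by a digit, ordinary names such as Nevada are left untouched.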
# Map congressional-district entries (ME-1, NE-2, ...) to their parent state
polls_2016 <- polls_2016 %>%
  mutate(state = str_replace_all(state, c("^ME-\\d" = "Maine", "^NE-\\d" = "Nebraska")))
# Plotting the data
polls_2016 %>%
  ggplot(aes(x = state, y = pct_estimate, fill = candidate_name)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Estimated Poll Percentage of Presidential Candidates by State in 2016",
       x = "State",
       y = "Estimated Poll Percentage",
       fill = "Candidate") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
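Because polls_2016 holds many rows per state and candidate, bars drawn with stat = "identity" can end up drawn over one another within each dodge position. One possible refinement, sketched here on invented toy values rather than the real data, is to average pct_estimate per state and candidate first so that each bar represents a single number. The names toy, state_avg, and avg_pct are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Invented toy values: two estimates per state and candidate
toy <- tibble(
  state = rep(c("Ohio", "Iowa"), each = 4),
  candidate_name = rep(c("Candidate A", "Candidate B"), times = 4),
  pct_estimate = c(44, 46, 46, 48, 40, 50, 42, 52)
)

# Collapse to one average per state and candidate
state_avg <- toy %>%
  group_by(state, candidate_name) %>%
  summarize(avg_pct = mean(pct_estimate, na.rm = TRUE), .groups = "drop")

# geom_col() is shorthand for geom_bar(stat = "identity")
p <- ggplot(state_avg, aes(x = state, y = avg_pct, fill = candidate_name)) +
  geom_col(position = "dodge") +
  labs(x = "State",
       y = "Average Estimated Poll Percentage",
       fill = "Candidate") +
  theme_minimal()

p
```

Whether an average, the latest estimate, or some other summary is the right value to plot depends on the question being asked.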
In this analysis, we identify the state where each of the top two presidential candidates of 2016 had the highest average poll estimate. This highlights each candidate’s stronghold; identifying battlegrounds would instead require comparing the margin between the candidates in each state.
# Determine the top 2 candidates based on overall average pct_estimate
top_2_candidates <- polls_2016 %>%
  group_by(candidate_name) %>%
  summarize(avg_pct_estimate = mean(pct_estimate, na.rm = TRUE)) %>%
  arrange(desc(avg_pct_estimate)) %>%
  slice(1:2) %>%
  pull(candidate_name)
top_2_candidates
## [1] "Donald Trump" "Hillary Rodham Clinton"
# Identify the state where each of the top 2 candidates is most popular
most_popular_states_for_each_candidate <- polls_2016 %>%
  filter(candidate_name %in% top_2_candidates) %>%
  group_by(candidate_name, state) %>%
  summarize(state_avg_pct = mean(pct_estimate, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(candidate_name, desc(state_avg_pct)) %>%
  group_by(candidate_name) %>%
  slice(1) %>% # Take the top state for each candidate
  ungroup()
## `summarise()` has grouped output by 'candidate_name'. You can override using
## the `.groups` argument.
most_popular_states_for_each_candidate
## # A tibble: 2 × 3
## candidate_name state state_avg_pct
## <chr> <chr> <dbl>
## 1 Donald Trump Alabama 66.3
## 2 Hillary Rodham Clinton District of Columbia 86.9
The dplyr functions used above:
- group_by(): Groups data by specified variables.
- summarize(): Aggregates data, like finding the mean of a group.
- arrange(): Orders rows by specific variables.
- slice(): Selects rows by their position.
- filter(): Filters rows based on conditions.
- pull(): Extracts a column as a vector.
By executing the above code, we can pinpoint the states where Donald Trump and Hillary Clinton had their strongest support in terms of average poll ratings during the 2016 elections.
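As a more compact variant of the same idea, the arrange(), group_by(), and slice(1) steps can be replaced by slice_max(). The sketch below uses invented toy values, and the names toy and top_states are hypothetical. Passing .groups = "drop_last" to summarize() keeps the result grouped by candidate_name and avoids the grouping message.

```r
library(dplyr)

# Invented toy values: two estimates per candidate and state
toy <- tibble(
  state = rep(c("Ohio", "Iowa"), each = 4),
  candidate_name = rep(c("Candidate A", "Candidate B"), times = 4),
  pct_estimate = c(44, 46, 46, 48, 40, 50, 42, 52)
)

top_states <- toy %>%
  group_by(candidate_name, state) %>%
  # "drop_last" leaves the result grouped by candidate_name only
  summarize(state_avg_pct = mean(pct_estimate, na.rm = TRUE),
            .groups = "drop_last") %>%
  # Within each candidate, keep the row with the highest average
  slice_max(state_avg_pct, n = 1, with_ties = FALSE) %>%
  ungroup()

top_states
```

With with_ties = FALSE, exactly one state is returned per candidate even if two states share the same average.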
The mutate() function is part of the dplyr
package and is used for creating or transforming columns in a data
frame. It’s a handy function when we want to perform operations on
existing columns to generate new ones.
Let’s say we want to know how far each raw poll estimate is from its
trend-adjusted value.
We can use the mutate() function to create a new column
that represents this difference.
The code creates a new dataset polls_2016_transformed by
adding a column diff_estimate_trend. This column captures
the difference between the raw poll estimate pct_estimate
and its trend-adjusted value pct_trend_adjusted, showing
how much the trend adjustment shifts each estimate.
polls_2016_transformed <- polls_2016 %>%
mutate(diff_estimate_trend = pct_estimate - pct_trend_adjusted)
# Displaying the first few rows of the modified dataset
head(polls_2016_transformed)
## # A tibble: 6 × 33
## cycle state modeldate candidate_name candidate_id pct_estimate
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 2016 Alabama 3/3/2016 Donald Trump 9849 70.1
## 2 2016 Alaska 3/3/2016 Donald Trump 9849 51.6
## 3 2016 Arizona 3/3/2016 Donald Trump 9849 44
## 4 2016 Colorado 3/3/2016 Donald Trump 9849 46.0
## 5 2016 Florida 3/3/2016 Donald Trump 9849 46.5
## 6 2016 Georgia 3/3/2016 Donald Trump 9849 50.0
## # ℹ 27 more variables: pct_trend_adjusted <dbl>, timestamp <chr>,
## # comment <chr>, election_date <chr>, election_qdate <dbl>, last_qdate <dbl>,
## # last_enddate <chr>, `_medpoly2` <dbl>, trend_medpoly2 <dbl>,
## # `_shortpoly0` <dbl>, trend_shortpoly0 <dbl>, sum_weight_medium <dbl>,
## # sum_weight_short <dbl>, sum_influence <dbl>, sum_nat_influence <dbl>,
## # `_minpoints` <dbl>, `_defaultbasetime` <dbl>, `_numloops` <dbl>,
## # `_state_houseeffects_weight` <dbl>, `_state_trendline_weight` <dbl>, …
In the above code:
- We pass polls_2016 to mutate(), which keeps every existing column and
appends a new one, so the tibble grows from 32 to 33 columns.
- The new column diff_estimate_trend is calculated row by row as
pct_estimate minus pct_trend_adjusted.
We’ll visualize the difference between the raw poll estimate and its trend-adjusted value for each candidate during the 2016 elections.
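polls_2016_transformed %>%
  # cycle is always 2016 in this subset, so use the model date on the x-axis;
  # modeldate is stored as text like "3/3/2016" and is parsed with lubridate's mdy()
  mutate(modeldate = mdy(modeldate)) %>%
  # Restrict to the national average (assumed to have one row per candidate per date)
  filter(state == "National") %>%
  ggplot(aes(x = modeldate, y = diff_estimate_trend, color = candidate_name)) +
  geom_line() +
  labs(title = "Difference Between Pct Estimate and Pct Trend Adjusted, National Polls, 2016",
       x = "Model Date",
       y = "Difference",
       color = "Candidate") +
  theme_minimal()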
By using the mutate() function, we were able to easily
create a new column in our dataset and derive additional insights from
our data. This function is a versatile tool for data manipulation and
can be used in a variety of scenarios to enhance our data analysis.
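mutate() also respects grouping, so a per-group summary can be broadcast back to every row of that group. As a sketch on invented toy values (the names toy, avg_pct, and diff_from_max are hypothetical), the following computes how far each candidate’s average is from the highest average within the same election cycle:

```r
library(dplyr)

# Invented toy values: two candidates in each of two cycles
toy <- tibble(
  cycle = c(2012, 2012, 2016, 2016),
  candidate_name = c("Candidate A", "Candidate B", "Candidate C", "Candidate D"),
  avg_pct = c(48, 51, 43, 46)
)

toy_diff <- toy %>%
  group_by(cycle) %>%
  # max(avg_pct) is evaluated separately within each cycle
  mutate(diff_from_max = max(avg_pct) - avg_pct) %>%
  ungroup()

toy_diff
```

The leading candidate in each cycle gets a diff_from_max of 0, and every other candidate gets a positive gap.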
In this vignette, we explored the polls_2016 dataset
using tidyverse tools. Our visualizations highlighted patterns
in presidential poll averages, and the use of functions like
mutate() showcased the ease of data manipulation with
dplyr. The tidyverse ecosystem proves to be a powerful ally
in understanding and visualizing complex datasets, enabling clear
insights into historical polling data.