Data607: Tidyverse Create

Author

Anthony Josue Roman

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RCurl)


Attaching package: 'RCurl'

The following object is masked from 'package:tidyr':

    complete

library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(lubridate)
library(ggthemes)
library(ggExtra)
library(ggpubr)
library(gganimate)
library(ggplot2movies)
library(scales)


Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor

library(maps)


Attaching package: 'maps'

The following object is masked from 'package:purrr':

    map

library(sf)

Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE

library(viridis)

Loading required package: viridisLite

Attaching package: 'viridis'

The following object is masked from 'package:maps':

    unemp

The following object is masked from 'package:scales':

    viridis_pal

library(ggsci)
library(mapproj)

Introduction

In this vignette, we use several tidyverse functions to clean and analyze a dataset of 2016 U.S. presidential general election polls. We load the data with readr, clean it with dplyr, reshape it with tidyr, and plot it with ggplot2 to explore how the poll percentages for each candidate change over time. The visualizations include line plots, smoothed lines, a stacked area plot, a grouped bar plot, faceted line, area, and bar plots, a heatmap, box, violin, and density plots, and a map showing the leading candidate by state.

Loading and Exploring the Data

First, we load the dataset and inspect its structure with head() and glimpse(). Cleaning (selecting the relevant columns and dropping rows with missing values) follows in the next section.

# Load the dataset; read_csv() can read directly from a URL
polls_url <- "https://raw.githubusercontent.com/spacerome/TidyVerseCREATE/refs/heads/main/president_general_polls_2016.csv"
polls_data <- read_csv(polls_url)

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 12624 Columns: 27
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): branch, type, matchup, forecastdate, state, startdate, enddate, po...
dbl (13): cycle, samplesize, poll_wt, rawpoll_clinton, rawpoll_trump, rawpol...
lgl  (1): multiversions

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Display the first few rows
head(polls_data)

# A tibble: 6 × 27
  cycle branch type  matchup forecastdate state startdate enddate pollster grade
  <dbl> <chr>  <chr> <chr>   <chr>        <chr> <chr>     <chr>   <chr>    <chr>
1  2016 Presi… poll… Clinto… 11/8/16      U.S.  11/3/2016 11/6/2… ABC New… A+   
2  2016 Presi… poll… Clinto… 11/8/16      U.S.  11/1/2016 11/7/2… Google … B    
3  2016 Presi… poll… Clinto… 11/8/16      U.S.  11/2/2016 11/6/2… Ipsos    A-   
4  2016 Presi… poll… Clinto… 11/8/16      U.S.  11/4/2016 11/7/2… YouGov   B    
5  2016 Presi… poll… Clinto… 11/8/16      U.S.  11/3/2016 11/6/2… Gravis … B-   
6  2016 Presi… poll… Clinto… 11/8/16      U.S.  11/3/2016 11/6/2… Fox New… A    
# ℹ 17 more variables: samplesize <dbl>, population <chr>, poll_wt <dbl>,
#   rawpoll_clinton <dbl>, rawpoll_trump <dbl>, rawpoll_johnson <dbl>,
#   rawpoll_mcmullin <dbl>, adjpoll_clinton <dbl>, adjpoll_trump <dbl>,
#   adjpoll_johnson <dbl>, adjpoll_mcmullin <dbl>, multiversions <lgl>,
#   url <chr>, poll_id <dbl>, question_id <dbl>, createddate <chr>,
#   timestamp <chr>

# Check the structure of the dataset
glimpse(polls_data)

Rows: 12,624
Columns: 27
$ cycle            <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,…
$ branch           <chr> "President", "President", "President", "President", "…
$ type             <chr> "polls-plus", "polls-plus", "polls-plus", "polls-plus…
$ matchup          <chr> "Clinton vs. Trump vs. Johnson", "Clinton vs. Trump v…
$ forecastdate     <chr> "11/8/16", "11/8/16", "11/8/16", "11/8/16", "11/8/16"…
$ state            <chr> "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.", "U.S.…
$ startdate        <chr> "11/3/2016", "11/1/2016", "11/2/2016", "11/4/2016", "…
$ enddate          <chr> "11/6/2016", "11/7/2016", "11/6/2016", "11/7/2016", "…
$ pollster         <chr> "ABC News/Washington Post", "Google Consumer Surveys"…
$ grade            <chr> "A+", "B", "A-", "B", "B-", "A", "A-", "A-", NA, "A-"…
$ samplesize       <dbl> 2220, 26574, 2195, 3677, 16639, 1295, 1426, 1282, 843…
$ population       <chr> "lv", "lv", "lv", "lv", "rv", "lv", "lv", "lv", "lv",…
$ poll_wt          <dbl> 8.720654, 7.628472, 6.424334, 6.087135, 5.316449, 5.2…
$ rawpoll_clinton  <dbl> 47.00, 38.03, 42.00, 45.00, 47.00, 48.00, 45.00, 44.0…
$ rawpoll_trump    <dbl> 43.00, 35.69, 39.00, 41.00, 43.00, 44.00, 41.00, 40.0…
$ rawpoll_johnson  <dbl> 4.00, 5.46, 6.00, 5.00, 3.00, 3.00, 5.00, 6.00, 6.00,…
$ rawpoll_mcmullin <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ adjpoll_clinton  <dbl> 45.20163, 43.34557, 42.02638, 45.65676, 46.84089, 49.…
$ adjpoll_trump    <dbl> 41.72430, 41.21439, 38.81620, 40.92004, 42.33184, 43.…
$ adjpoll_johnson  <dbl> 4.626221, 5.175792, 6.844734, 6.069454, 3.726098, 3.0…
$ adjpoll_mcmullin <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ multiversions    <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ url              <chr> "https://www.washingtonpost.com/news/the-fix/wp/2016/…
$ poll_id          <dbl> 48630, 48847, 48922, 48687, 48848, 48619, 48521, 4848…
$ question_id      <dbl> 76192, 76443, 76636, 76262, 76444, 76163, 76058, 7597…
$ createddate      <chr> "11/7/16", "11/7/16", "11/8/16", "11/7/16", "11/7/16"…
$ timestamp        <chr> "09:35:33  8 Nov 2016", "09:35:33  8 Nov 2016", "09:3…
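
The parsing warning printed when the file was loaded suggests calling problems() on the data frame. The following is a minimal sketch of that workflow on a small made-up CSV (the inline data and the name demo are illustrative assumptions, not part of the polls file). The same call on polls_data should list the cells that readr could not parse as expected.

```r
library(readr)

# Hypothetical two-row CSV; the second samplesize value is deliberately not numeric
csv_text <- I("cycle,samplesize\n2016,2220\n2016,unknown\n")

# Declare both columns as numeric so the bad value triggers a parsing problem
demo <- read_csv(
  csv_text,
  col_types = cols(cycle = col_double(), samplesize = col_double())
)

# One row per parsing problem, with its position, the expected type, and the actual text
problems(demo)
```

Values that cannot be parsed are read as NA, so checking problems() before filtering on missing values helps separate genuinely missing data from parsing failures.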

Data Cleaning

Next, we keep only the columns needed for the analysis and drop every poll that is missing a raw percentage for Clinton, Trump, or Johnson. This reduces the dataset from 12,624 rows to 8,397.

# Select relevant columns and remove any missing data
polls_clean <- polls_data %>%
  select(pollster, state, startdate, enddate, rawpoll_clinton, rawpoll_trump, rawpoll_johnson) %>%
  filter(!is.na(rawpoll_clinton) & !is.na(rawpoll_trump) & !is.na(rawpoll_johnson))

# Display a summary of the cleaned data
summary(polls_clean)

   pollster            state            startdate           enddate         
 Length:8397        Length:8397        Length:8397        Length:8397       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
 rawpoll_clinton rawpoll_trump   rawpoll_johnson 
 Min.   :11.04   Min.   : 4.00   Min.   : 0.000  
 1st Qu.:36.79   1st Qu.:34.00   1st Qu.: 5.400  
 Median :42.00   Median :39.00   Median : 7.000  
 Mean   :40.97   Mean   :38.62   Mean   : 7.382  
 3rd Qu.:45.50   3rd Qu.:44.00   3rd Qu.: 9.000  
 Max.   :88.00   Max.   :63.29   Max.   :25.000
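
The filter() call above tests each column for NA explicitly. As a hedged alternative, tidyr's drop_na() can express the same rule with a column selection. The sketch below compares the two on a made-up three-row table (toy_polls and its values are assumptions for illustration only).

```r
library(dplyr)
library(tidyr)

# Hypothetical mini version of the polls table
toy_polls <- tibble(
  pollster        = c("A", "B", "C"),
  rawpoll_clinton = c(45, NA, 42),
  rawpoll_trump   = c(41, 40, 39),
  rawpoll_johnson = c(5, 6, NA)
)

# Same logic as the filter() call used in this section
kept_filter <- toy_polls %>%
  filter(!is.na(rawpoll_clinton) & !is.na(rawpoll_trump) & !is.na(rawpoll_johnson))

# drop_na() removes rows that have NA in any of the selected columns
kept_drop_na <- toy_polls %>%
  drop_na(starts_with("rawpoll_"))

# Compare the two results
all.equal(kept_filter, kept_drop_na)
```

Note that starts_with("rawpoll_") on the real data would also match rawpoll_mcmullin, which is mostly NA, so the columns would need to be selected before calling drop_na(), as the pipeline above already does.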

Analyzing the Data

We now visualize how each candidate's poll percentage changes over time. The first step reshapes the cleaned data into long format, with one row per poll and candidate, so that ggplot2 can map the candidate to a color or fill. The first plot draws the raw poll percentages directly, and the sections that follow work from the same long-format data.

# Reshape the data for visualization
polls_long <- polls_clean %>%
  pivot_longer(cols = starts_with("rawpoll_"), names_to = "candidate", values_to = "poll_percentage") %>%
  mutate(candidate = str_replace(candidate, "rawpoll_", ""))  # Clean up candidate names

# Convert date columns to Date type for plotting
polls_long$startdate <- as.Date(polls_long$startdate, format = "%m/%d/%Y")

# Plot the trend of poll percentages over time for each candidate
ggplot(polls_long, aes(x = startdate, y = poll_percentage, color = candidate)) +
  geom_line(linewidth = 1.2) +  # linewidth replaces size for lines as of ggplot2 3.4.0
  theme_minimal() +
  scale_color_manual(values = c("clinton" = "blue", "trump" = "red", "johnson" = "gold")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +  # Format y-axis as percentages
  labs(title = "Poll Percentage Trend Over Time by Candidate",
       x = "Date",
       y = "Poll Percentage",
       color = "Candidate") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "top"
  )
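
The reshape step in this section strips the rawpoll_ prefix with str_replace() after pivoting. A possible shortcut, sketched here on made-up data, is the names_prefix argument of pivot_longer(), which removes the prefix during the pivot (toy_wide and its values are assumptions for illustration).

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table with one column per candidate
toy_wide <- tibble(
  pollster        = c("A", "B"),
  rawpoll_clinton = c(45, 42),
  rawpoll_trump   = c(41, 39)
)

# Pivot to one row per poll and candidate, dropping the "rawpoll_" prefix on the way
toy_long <- toy_wide %>%
  pivot_longer(
    cols = starts_with("rawpoll_"),
    names_to = "candidate",
    names_prefix = "rawpoll_",
    values_to = "poll_percentage"
  )

toy_long
```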

Analyzing the Data with Smoothed Lines

Individual polls are noisy, so we average each candidate's polls by week and overlay a dashed LOESS smoother on the weekly averages. This makes the overall trend for each candidate easier to see.

# polls_long (long format, with startdate converted to Date) is reused from the previous section

# Aggregate data by week to reduce clutter
polls_weekly <- polls_long %>%
  group_by(candidate, week = cut(startdate, "week")) %>%
  # .groups = "drop" returns an ungrouped result and silences the grouping message
  summarise(avg_poll_percentage = mean(poll_percentage, na.rm = TRUE), .groups = "drop") %>%
  mutate(week = as.Date(week))  # cut() returns a factor; convert the week labels back to Date

# Plot the smoothed trend of poll percentages over time for each candidate
ggplot(polls_weekly, aes(x = week, y = avg_poll_percentage, color = candidate)) +
  geom_line(linewidth = 1.2) +
  geom_smooth(se = FALSE, linetype = "dashed") +
  theme_minimal() +
  scale_color_manual(values = c("clinton" = "blue", "trump" = "red", "johnson" = "gold")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +  # Format y-axis as percentages
  labs(title = "Smoothed Poll Percentage Trend Over Time by Candidate",
       x = "Date",
       y = "Poll Percentage",
       color = "Candidate") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "top"
  )

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
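
The weekly aggregation uses cut(startdate, "week"), which returns a factor that then has to be converted back to Date. A possible alternative, sketched here on made-up data, is lubridate's floor_date() with week_start = 1, which keeps the Date type and, like cut() on dates, starts weeks on Monday. The table toy_long and its values are assumptions for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical long-format polls; 2016-10-31 and 2016-11-07 are both Mondays
toy_long <- tibble(
  candidate       = c("clinton", "clinton", "clinton", "trump"),
  startdate       = as.Date(c("2016-10-31", "2016-11-02", "2016-11-07", "2016-11-01")),
  poll_percentage = c(44, 46, 45, 42)
)

# Round each start date down to the Monday of its week, then average per candidate and week
toy_weekly <- toy_long %>%
  mutate(week = floor_date(startdate, unit = "week", week_start = 1)) %>%
  group_by(candidate, week) %>%
  summarise(avg_poll_percentage = mean(poll_percentage, na.rm = TRUE), .groups = "drop")

toy_weekly
```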

Analyzing the Data with a Stacked Area Plot

A stacked area plot layers the three candidates' weekly averages on top of one another. Because geom_area() stacks by default, the thickness of each band is that candidate's weekly average, and the top edge of the plot is the sum of the three averages for that week.

# Reuse polls_weekly, the weekly averages computed in the smoothed-lines section

# Create a stacked area plot of poll percentages over time for each candidate

ggplot(polls_weekly, aes(x = week, y = avg_poll_percentage, fill = candidate)) +
  geom_area() +
  theme_minimal() +
  scale_fill_manual(values = c("clinton" = "blue", "trump" = "red", "johnson" = "gold")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +  # Format y-axis as percentages
  labs(title = "Stacked Area Plot of Poll Percentage Over Time by Candidate",
       x = "Date",
       y = "Poll Percentage",
       fill = "Candidate") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "top"
  )
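
Since the stacked bands are weekly averages, the height of the stack is their sum and need not equal 100. The sketch below, on made-up weekly averages, computes that total and each candidate's share of it, which is roughly what geom_area(position = "fill") would display. The table toy_weekly and its values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical weekly averages for two weeks
toy_weekly <- tibble(
  week = as.Date(rep(c("2016-10-31", "2016-11-07"), each = 3)),
  candidate = rep(c("clinton", "trump", "johnson"), times = 2),
  avg_poll_percentage = c(45, 40, 5, 44, 42, 6)
)

# Total stack height per week and each candidate's share of that total
toy_shares <- toy_weekly %>%
  group_by(week) %>%
  mutate(
    stack_total = sum(avg_poll_percentage),
    share = avg_poll_percentage / stack_total
  ) %>%
  ungroup()

toy_shares
```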

Analyzing the Data with a Grouped Bar Plot

A grouped bar plot places the three candidates' weekly averages side by side (position = "dodge"), which makes the gap between candidates within a given week easy to compare.

# Reuse polls_weekly, the weekly averages computed in the smoothed-lines section

# Create a grouped bar plot of poll percentages over time for each candidate

ggplot(polls_weekly, aes(x = week, y = avg_poll_percentage, fill = candidate)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  scale_fill_manual(values = c("clinton" = "blue", "trump" = "red", "johnson" = "gold")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +  # Format y-axis as percentages
  labs(title = "Grouped Bar Plot of Poll Percentage Over Time by Candidate",
       x = "Date",
       y = "Poll Percentage",
       fill = "Candidate") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "top"
  )

Analyzing the Data with a Faceted Line Plot

A faceted line plot gives each candidate a separate panel. With scales = "free_y", each panel has its own y-axis, so the shape of each trend is easy to read, but heights are not directly comparable across panels.

# Reuse polls_weekly, the weekly averages computed in the smoothed-lines section

# Create a faceted line plot of poll percentages over time for each candidate

ggplot(polls_weekly, aes(x = week, y = avg_poll_percentage, color = candidate)) +
  geom_line(linewidth = 1.2) +
  facet_wrap(~candidate, scales = "free_y") +
  theme_minimal() +
  scale_color_manual(values = c("clinton" = "blue", "trump" = "red", "johnson" = "gold")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +  # Format y-axis as percentages
  labs(title = "Faceted Line Plot of Poll Percentage Over Time by Candidate",
       x = "Date",
       y = "Poll Percentage",
       color = "Candidate") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "top"
  )
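
The choice of scales in facet_wrap() changes how the panels can be read. The following sketch builds the same faceted plot twice on made-up data, once with scales = "free_y" as above and once with the default shared y-axis, so the two can be compared. The data frame toy_weekly and its values are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical weekly averages for a high-polling and a low-polling candidate
toy_weekly <- data.frame(
  week = rep(as.Date(c("2016-10-17", "2016-10-24", "2016-10-31")), times = 2),
  candidate = rep(c("clinton", "johnson"), each = 3),
  avg_poll_percentage = c(44, 45, 46, 7, 6, 5)
)

base_plot <- ggplot(toy_weekly, aes(x = week, y = avg_poll_percentage)) +
  geom_line(linewidth = 1.2)

# Independent y-axis per panel: small movements are visible, heights are not comparable
p_free <- base_plot + facet_wrap(~candidate, scales = "free_y")

# Shared y-axis (the facet_wrap() default): panels are directly comparable
p_fixed <- base_plot + facet_wrap(~candidate)

p_free
p_fixed
```

With a shared axis, a low-polling candidate's line appears nearly flat next to the others, which is the trade-off that free scales avoid.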

Analyzing the Data with a Faceted Area Plot

A faceted area plot also gives each candidate a separate panel, but fills the area under the weekly average instead of drawing a line. Here too, scales = "free_y" gives each panel its own y-axis.

# Reuse polls_weekly, the weekly averages computed in the smoothed-lines section

# Create a faceted area plot of poll percentages over time for each candidate

ggplot(polls_weekly, aes(x = week, y = avg_poll_percentage, fill = candidate)) +
  geom_area() +
  facet_wrap(~candidate, scales = "free_y") +
  theme_minimal() +
  scale_fill_manual(values = c("clinton" = "blue", "trump" = "red", "johnson" = "gold")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +  # Format y-axis as percentages
  labs(title = "Faceted Area Plot of Poll Percentage Over Time by Candidate",
       x = "Date",
       y = "Poll Percentage",
       fill = "Candidate") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "top"
  )

Analyzing the Data with a Faceted Bar Plot

A faceted bar plot draws one bar per week in a separate panel for each candidate, with an independent y-axis per panel (scales = "free_y").

# Reuse polls_weekly, the weekly averages computed in the smoothed-lines section

# Create a faceted bar plot of poll percentages over time for each candidate

ggplot(polls_weekly, aes(x = week, y = avg_poll_percentage, fill = candidate)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~candidate, scales = "free_y") +
  theme_minimal() +
  scale_fill_manual(values = c("clinton" = "blue", "trump" = "red", "johnson" = "gold")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +  # Format y-axis as percentages
  labs(title = "Faceted Bar Plot of Poll Percentage Over Time by Candidate",
       x = "Date",
       y = "Poll Percentage",
       fill = "Candidate") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "top"
  )

Analyzing the Data with a Heatmap

A heatmap shows the weekly averages as a grid, with one row per candidate and one tile per week. The fill color, on the viridis scale, encodes the average poll percentage.

# Reuse polls_weekly, the weekly averages computed in the smoothed-lines section

# Create a heatmap of poll percentages over time for each candidate

ggplot(polls_weekly, aes(x = week, y = candidate, fill = avg_poll_percentage)) +
  geom_tile() +
  theme_minimal() +
  scale_fill_viridis_c() +
  labs(title = "Heatmap of Poll Percentage Over Time by Candidate",
       x = "Date",
       y = "Candidate",
       fill = "Poll Percentage") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "right"
  )

Analyzing the Data with a Box Plot

A box plot summarizes the distribution of each candidate's weekly average poll percentage, showing the median, the quartiles, and any outlying weeks.

# Reuse polls_weekly, the weekly averages computed in the smoothed-lines section

# Create a box plot of poll percentages over time for each candidate

ggplot(polls_weekly, aes(x = candidate, y = avg_poll_percentage, fill = candidate)) +
  geom_boxplot() +
  theme_minimal() +
  scale_fill_manual(values = c("clinton" = "blue", "trump" = "red", "johnson" = "gold")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +  # Format y-axis as percentages
  labs(title = "Box Plot of Poll Percentage Over Time by Candidate",
       x = "Candidate",
       y = "Poll Percentage",
       fill = "Candidate") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "none"
  )
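
The box in each box plot spans the first and third quartiles, with a line at the median. As a rough numerical companion to the plot, the sketch below computes those three values per candidate with summarise() on made-up weekly averages (toy_weekly and its values are assumptions for illustration).

```r
library(dplyr)

# Hypothetical weekly averages, five weeks per candidate
toy_weekly <- tibble(
  candidate = rep(c("clinton", "trump"), each = 5),
  avg_poll_percentage = c(40, 42, 44, 46, 48, 36, 38, 39, 41, 45)
)

# First quartile, median, and third quartile for each candidate
box_stats <- toy_weekly %>%
  group_by(candidate) %>%
  summarise(
    q1  = quantile(avg_poll_percentage, 0.25),
    med = median(avg_poll_percentage),
    q3  = quantile(avg_poll_percentage, 0.75),
    .groups = "drop"
  )

box_stats
```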

Analyzing the Data with a Violin Plot

A violin plot shows the distribution of each candidate's weekly average poll percentage as a mirrored density curve, which reveals the shape of the distribution as well as its spread.

# Reuse polls_weekly, the weekly averages computed in the smoothed-lines section

# Create a violin plot of poll percentages over time for each candidate

ggplot(polls_weekly, aes(x = candidate, y = avg_poll_percentage, fill = candidate)) +
  geom_violin() +
  theme_minimal() +
  scale_fill_manual(values = c("clinton" = "blue", "trump" = "red", "johnson" = "gold")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +  # Format y-axis as percentages
  labs(title = "Violin Plot of Poll Percentage Over Time by Candidate",
       x = "Candidate",
       y = "Poll Percentage",
       fill = "Candidate") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "none"
  )

Analyzing the Data with a Density Plot

A density plot overlays the three candidates' distributions of weekly average poll percentages on a single axis. The semi-transparent fills (alpha = 0.5) keep overlapping regions visible.

# Reuse polls_weekly, the weekly averages computed in the smoothed-lines section

# Create a density plot of poll percentages over time for each candidate

ggplot(polls_weekly, aes(x = avg_poll_percentage, fill = candidate)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  scale_fill_manual(values = c("clinton" = "blue", "trump" = "red", "johnson" = "gold")) +
  labs(title = "Density Plot of Poll Percentage Over Time by Candidate",
       x = "Poll Percentage",
       fill = "Candidate") +
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "top"
  )

Analyzing the States with the Highest Poll Percentages

We now identify which candidate leads in each state. We average the raw poll percentages for Clinton and Trump by state, label each state with the candidate whose average is higher, and join the result to map data for plotting.

# Summarize the poll data to get the candidate leading in each state
polls_state_summary <- polls_data %>%
  filter(!is.na(state)) %>%  # Ensure state data is not missing
  group_by(state) %>%
  summarise(
    clinton_avg = mean(rawpoll_clinton, na.rm = TRUE),
    trump_avg = mean(rawpoll_trump, na.rm = TRUE)
  ) %>%
  mutate(
    leading_candidate = case_when(
      clinton_avg > trump_avg ~ "Clinton",
      trump_avg > clinton_avg ~ "Trump",
      TRUE ~ "Tie"
    )
  )

# Get U.S. states map data
states_map <- map_data("state")

# Prepare the data for merging
polls_state_summary$state <- tolower(polls_state_summary$state)
states_map$region <- tolower(states_map$region)

# Merge the summarized poll data with map data
# (named state_polls_map so it does not shadow ggplot2's map_data() function)
state_polls_map <- left_join(states_map, polls_state_summary, by = c("region" = "state"))

# Plot the map
ggplot(state_polls_map, aes(x = long, y = lat, group = group, fill = leading_candidate)) +
  geom_polygon(color = "white") +
  scale_fill_manual(values = c("Clinton" = "blue", "Trump" = "red", "Tie" = "gray")) +
  theme_minimal() +
  labs(
    title = "Leading Candidate by State",
    fill = "Candidate"
  ) +
  coord_map()

This map colors each state in the contiguous U.S. by the candidate with the higher average raw poll percentage, with gray reserved for states where the two averages are exactly equal. The leading candidate varies by state, with some states favoring Clinton and others favoring Trump. Rows whose state value does not match a map region, such as the national "U.S." polls, drop out in the join.
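
The state-level summary can be checked on a small made-up table. The sketch below repeats the group_by(), summarise(), and case_when() steps and then uses anti_join() to list summary rows that have no matching map region, which is one way to see what the left_join() silently leaves out. The tables toy_polls and toy_regions and their values are assumptions for illustration, with toy_regions standing in for the region column of the map data.

```r
library(dplyr)

# Hypothetical polls: two states plus the national "U.S." rows
toy_polls <- tibble(
  state           = c("Ohio", "Ohio", "Vermont", "Vermont", "U.S."),
  rawpoll_clinton = c(44, 42, 50, 52, 45),
  rawpoll_trump   = c(46, 45, 30, 28, 41)
)

# Average by state and label the leader, as in the summary above
toy_summary <- toy_polls %>%
  group_by(state) %>%
  summarise(
    clinton_avg = mean(rawpoll_clinton, na.rm = TRUE),
    trump_avg   = mean(rawpoll_trump, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    state = tolower(state),
    leading_candidate = case_when(
      clinton_avg > trump_avg ~ "Clinton",
      trump_avg > clinton_avg ~ "Trump",
      TRUE ~ "Tie"
    )
  )

# Hypothetical stand-in for the map regions
toy_regions <- tibble(region = c("ohio", "vermont"))

# Summary rows with no matching map region
unmatched <- anti_join(toy_summary, toy_regions, by = c("state" = "region"))

toy_summary
unmatched
```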

Findings

The line, smoothed, area, and bar plots show how each candidate's poll percentage changes over time, and the faceted versions give a per-candidate view of the same weekly averages. The heatmap, box plot, violin plot, and density plot describe how those weekly averages are distributed for each candidate. The map shows that the leading candidate varies by state. The analysis can be extended by exploring additional variables, such as pollster grade, sample size, or poll weight, and by conducting more in-depth statistical analysis.

Conclusion

This example demonstrates how to use tidyverse functions to clean, summarize, and visualize data, relying mainly on the dplyr, tidyr, and ggplot2 packages. The process makes it straightforward to turn raw data into visualizations and summaries, and the flexibility of these tools makes them well suited to exploratory analysis. Unfortunately, we know that Clinton lost the election and that the forecasts were biased toward the Democrats. I believe that this data was not enough to predict the outcome of the election, and that future work should gather data similar to what RealClear Polling uses to project elections and polls.