Dslab Assignment

Author

Charlene Stephia

##Description for the Markdown/Quarto Document:

In this analysis, I am using the polls_2008 dataset from the dslabs package. This dataset contains polling data from the 2008 U.S. presidential election: the margin of Obama’s lead over McCain, recorded on different days leading up to the election. The margin is stored as a proportion (for example, 0.02 is a 2-point lead), and day is negative, counting down to the election.

For this visualization, I created a scatter plot to show the relationship between the number of days before the election (day) and the polling margin (margin). Each point represents a specific day’s polling data, with the margin showing how much Obama was leading by that particular day. A linear regression line has been added to the plot to visualize the trend over time, showing how Obama’s polling margin changed as the election day approached.

The x-axis represents the number of days before the election, and the y-axis represents Obama’s polling margin as a percentage. The plot uses a minimal theme to maintain clarity and focus on the data, with the title “Polling Margin in 2008 U.S. Presidential Election” placed at the top for context.

By examining this plot, we can see how Obama’s polling margin fluctuated and potentially identify any significant trends leading up to the election. I will provide two graphs.

##load necessary packages

library(dslabs)
Warning: package 'dslabs' was built under R version 4.4.3
library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.4.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
Warning: package 'ggthemes' was built under R version 4.4.3

##load the dataset

data("polls_2008")

##Check the structure of the dataset

str(polls_2008)
tibble [131 × 2] (S3: tbl_df/tbl/data.frame)
 $ day   : num [1:131] -155 -153 -152 -151 -150 -149 -147 -146 -145 -144 ...
 $ margin: num [1:131] 0.02 0.03 0.065 0.06 0.07 ...
summary(polls_2008)
      day              margin        
 Min.   :-155.00   Min.   :-0.05000  
 1st Qu.:-111.50   1st Qu.: 0.02417  
 Median : -72.00   Median : 0.04500  
 Mean   : -74.31   Mean   : 0.04223  
 3rd Qu.: -35.00   3rd Qu.: 0.06083  
 Max.   :  -1.00   Max.   : 0.12000  

##Check the missing data

colSums(is.na(polls_2008))
   day margin 
     0      0 

##Remove missing data

polls_2008 <- polls_2008 %>% drop_na()

##Remove duplicate entries

polls_2008 <- polls_2008 %>% distinct()
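
One way to confirm whether the two cleaning steps above actually removed anything is to compare row counts before and after. The sketch below uses a small made-up tibble named `toy` (hypothetical, not taken from polls_2008) so that the effect of `drop_na()` and `distinct()` is visible; the same pattern could be applied to the real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not from polls_2008): one row with NA, one duplicated row
toy <- tibble(
  day    = c(-3, -2, -2, -1),
  margin = c(0.05, 0.06, 0.06, NA)
)

n_before <- nrow(toy)

cleaned <- toy %>%
  drop_na() %>%   # drops the row with a missing margin
  distinct()      # drops the repeated (-2, 0.06) row

n_after <- nrow(cleaned)

# Report how many rows the cleaning steps removed
cat("Rows before:", n_before, "| rows after:", n_after, "\n")
```

If the two counts are equal on the real dataset, the cleaning steps were a no-op, which is what the zero counts from `colSums(is.na(...))` suggest for the missing-value step.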

##View the first few rows

head(polls_2008)
# A tibble: 6 × 2
    day margin
  <dbl>  <dbl>
1  -155 0.0200
2  -153 0.0300
3  -152 0.065 
4  -151 0.06  
5  -150 0.07  
6  -149 0.05  

##Converting ‘day’ into Dates Correctly

library(ggplot2)
library(dplyr)

polls_2008 <- polls_2008 %>%
  mutate(date = as.Date("2008-11-04") + day)  # day is negative, so adding it counts back from Election Day

##Checking if the new column is created correctly

head(polls_2008)
# A tibble: 6 × 3
    day margin date      
  <dbl>  <dbl> <date>    
1  -155 0.0200 2008-06-02
2  -153 0.0300 2008-06-04
3  -152 0.065  2008-06-05
4  -151 0.06   2008-06-06
5  -150 0.07   2008-06-07
6  -149 0.05   2008-06-08
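
Because `day` is negative, the direction of the date arithmetic matters: adding `day` to the election date moves backwards in time, while subtracting it moves forwards. A small base-R sanity check, using the first six `day` values from the output above and the illustrative names `election_day` and `poll_date`, might look like this:

```r
election_day <- as.Date("2008-11-04")

# The first six `day` values shown in the head() output above
day <- c(-155, -153, -152, -151, -150, -149)

# Adding a negative offset moves backwards from Election Day
poll_date <- election_day + day

# Every poll date should fall before Election Day
stopifnot(all(poll_date < election_day))

print(range(poll_date))
```

The `stopifnot()` call fails loudly if any converted date lands on or after Election Day, which is a cheap guard against a sign error.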

##Create a scatter plot with a regression line using ggplot

ggplot(polls_2008, aes(x = day, y = margin)) +
  geom_point(alpha = 0.6, color = "blue") +  # Plot individual points with some transparency
  geom_smooth(method = "lm", se = FALSE, color = "green") +  # Add a linear regression line
  scale_x_continuous(name = "Days Before Election") +  # Label for the x-axis
  scale_y_continuous(name = "Polling Margin (%)", labels = function(x) x * 100) +  # Label for the y-axis; margin is stored as a proportion, so show it in percent
  ggtitle("Polling Margin in 2008 U.S. Presidential Election") +  # Title of the plot
  theme_minimal() +  # Use a clean, minimal theme
  theme(plot.title = element_text(hjust = 0.5))  # Center the title
`geom_smooth()` using formula = 'y ~ x'

##First visualization interpretation

This graph shows Obama’s polling margin over McCain in the 2008 U.S. presidential election over the days leading up to the election. The x-axis represents the number of days before the election, and the y-axis shows the polling margin, which is the difference in support between the two candidates. The blue dots represent individual polling data points, and the green line is a linear regression fit showing the overall trend. The slope of the line indicates the direction of that trend: an upward slope means Obama’s lead grew as election day approached, and a downward slope means it shrank. Because the line is straight, it summarizes the general trend without following every small fluctuation. The graph uses a minimal theme for a clean look and centers the title for better readability.

Interpretation of the graph:

X-axis (Days Before Election): This axis shows the number of days remaining until the election. As you move from left to right, the days get closer to the election date.

Y-axis (Polling Margin %): The y-axis represents the polling margin, which shows the difference in support between Obama and McCain. Positive values indicate that Obama was leading, and negative values indicate that McCain was ahead.

Blue Dots (Individual Data Points): Each blue dot represents a polling result on a specific day leading up to the election. The transparency (alpha = 0.6) makes it easier to see overlapping points.

Green Line (Linear Regression): The green line shows the general trend of polling results over time. This line is calculated using linear regression, which helps us understand whether the polling margin was generally increasing or decreasing as the election day got closer.

Title: The title at the top of the graph (“Polling Margin in 2008 U.S. Presidential Election”) gives context to the graph and helps us know exactly what we are looking at.

Theme and Aesthetics: The minimal theme makes the graph look clean and simple. The title is centered to make it more readable and visually appealing.

In summary, the blue dots are individual polling results, and the green line is a straight-line (linear regression) fit that shows whether Obama’s margin was generally increasing or decreasing as the election day got closer.
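
To put a number on the direction of the trend, the regression can also be fitted directly with `lm()` and its slope read off. The sketch below is only an illustration: it uses the six rows shown in the `head()` output earlier (stored in a data frame named `polls_head`, a name chosen here), so the slope it reports describes those six rows and not the full dataset. Replacing `polls_head` with `polls_2008` would give the slope for all polls.

```r
# First six rows of polls_2008, copied from the head() output earlier in this document
polls_head <- data.frame(
  day    = c(-155, -153, -152, -151, -150, -149),
  margin = c(0.02, 0.03, 0.065, 0.06, 0.07, 0.05)
)

# Same kind of model that geom_smooth(method = "lm") fits: margin as a linear function of day
fit <- lm(margin ~ day, data = polls_head)

# Slope = change in margin (as a proportion) per day
slope <- unname(coef(fit)["day"])

# Multiply by 100 to express the slope in percentage points per day
slope_pct_points <- slope * 100

cat("Slope (percentage points per day):", round(slope_pct_points, 3), "\n")
```

A positive slope means the margin rises as `day` moves toward zero, that is, as the election approaches; a negative slope means it falls.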

##Create a line plot with points by date

ggplot(polls_2008, aes(x = date, y = margin)) +
  geom_line(color = "blue", linewidth = 1) +  # Line connecting consecutive polls (linewidth replaces the deprecated size)
  geom_point(color = "purple", alpha = 0.6) +  # Individual data points
  scale_y_continuous(labels = function(x) x * 100) +  # margin is a proportion, so show it in percent
  theme_minimal() +  # Clean theme
  labs(title = "Polling Margin in 2008 Election", 
       x = "Date", 
       y = "Obama Lead (%)") +
  theme(plot.title = element_text(hjust = 0.5))

##Second visualization interpretation

In this visualization, I created a plot to show the polling margin for Obama in the 2008 election as the election date approached. Here’s how I interpret it:

X-axis (Date): This shows the calendar date of each poll, derived from the number of days before Election Day (Nov 4, 2008). Moving from left to right, the dates get closer to the election, so we can see how the polling data changes.

Y-axis (Obama Lead %): This shows the margin or percentage by which Obama was leading in the polls. Positive values mean Obama was ahead, while negative values mean McCain was ahead.

Blue Line (Connecting line): The line connects consecutive polls in date order, so it traces the poll-to-poll movement of Obama’s lead rather than a fitted trend. It helps us see where the lead rose or fell over time.

Purple Points (Data Points): Each purple point represents the polling margin recorded for a specific day, showing Obama’s lead on that day.

Title: The title “Polling Margin in 2008 Election” tells us what the plot is about.

Labels: The X-axis and Y-axis are clearly labeled to tell us what the data represents.

In a nutshell, this plot helps us understand how Obama’s lead in the polls changed over time leading up to the election. The line traces the movement from one poll to the next, and the points represent daily polling results.
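
Note that `geom_line()` joins consecutive points and does not fit a trend. If a smoothed trend on the date axis is wanted, one option is to swap it for `geom_smooth()` with a loess fit. The sketch below is one possible version, not the plot used above: the object names `polls_dated` and `p` and the value `span = 0.3` are arbitrary choices made here for illustration.

```r
library(dslabs)
library(dplyr)
library(ggplot2)

data("polls_2008")

# day is negative, so adding it to Election Day counts backwards
polls_dated <- polls_2008 %>%
  mutate(date = as.Date("2008-11-04") + day)

p <- ggplot(polls_dated, aes(x = date, y = margin)) +
  geom_point(color = "purple", alpha = 0.6) +        # Individual polls
  geom_smooth(method = "loess", formula = y ~ x,     # Local regression smoother
              span = 0.3, se = FALSE,                # span = 0.3 is an assumption; tune as needed
              color = "blue") +
  theme_minimal() +
  labs(title = "Polling Margin in 2008 Election",
       x = "Date",
       y = "Obama Lead (proportion)") +
  theme(plot.title = element_text(hjust = 0.5))

print(p)
```

A smaller `span` follows short-term swings more closely, while a larger one gives a flatter curve closer to the straight line in the first plot.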