My Question: How has voter turnout changed over time in East Coast states overall?
The name of my dataset is “US Voter Turnout Data,” which I got from OpenIntro.org. It contains state-level data on federal elections held in November between 1980 and 2014, with 936 observations and 7 variables. The variables relevant to my question, which I will be using, are region, year, and percent_total_ballots_counted. I chose this dataset because I am currently taking SOCY105, a sociology class on social issues and problems. I specifically want to look at East Coast states because I live on the East Coast, and I want to see which states have higher voter turnout. Voter turnout can be read as one indicator of a state's overall well-being. Higher turnout suggests a more engaged and active citizenry, whereas lower turnout helps identify trends and areas needing help or improvement. This ties into my sociology class, as I can connect voter turnout to the roles of government, the global economy and economic inequities, and even immigration.
For my data analysis, I will first select the variables I will be using (year, region, and percent_total_ballots_counted). Then I will filter for the East Coast region, taking only 5 states: Maryland, Massachusetts, New Hampshire, New York, and Vermont. I picked these states because they best illustrate the range from low to high voter turnout. Next I will filter for the years 1994-2014 (which appear in increments of 2 in the dataset), which gives a more focused window of analysis (two decades of data). Lastly, I will create a grouped bar chart, as I think it will best display what I want the audience to see when comparing the East Coast states' voter turnout across 1994-2014.
library(tidyverse)
library(ggplot2)
library(dplyr)
# Set the working directory
setwd("C:/Users/Joanne G/OneDrive/Data101(Fall 2025)/Datasets")
# Read voter_count.csv into a data frame
voter_df <- read.csv("voter_count.csv")
Clean your dataset and perform exploratory data analysis (EDA) on the dataset (need minimum of two functions).
# EDA Dataset Chunk
#dimensions
dim(voter_df)
## [1] 936 7
#head
head(voter_df)
##   year        region voting_eligible_population total_ballots_counted
## 1 2014 United States                  227157964              83262122
## 2 2014       Alabama                    3588783               1191274
## 3 2014        Alaska                     520562                285431
## 4 2014       Arizona                    4510186               1537671
## 5 2014      Arkansas                    2117881                852642
## 6 2014    California                   24440416               7513972
##   highest_office percent_total_ballots_counted percent_highest_office
## 1       81687059                     0.3665384              0.3596046
## 2        1180413                     0.3319437              0.3289174
## 3         282382                     0.5483132              0.5424560
## 4        1506416                     0.3409329              0.3340031
## 5         848592                     0.4025920              0.4006797
## 6        7317581                     0.3074404              0.2994049
summary(voter_df)
##      year          region          voting_eligible_population
##  Min.   :1980   Length:936         Min.   :   270122
##  1st Qu.:1988   Class :character   1st Qu.:   999644
##  Median :1997   Mode  :character   Median :  2662524
##  Mean   :1997                      Mean   :  7277622
##  3rd Qu.:2006                      3rd Qu.:  4569632
##  Max.   :2014                      Max.   :227157964
##
##  total_ballots_counted highest_office      percent_total_ballots_counted
##  Min.   :   122356     Min.   :   117623   Min.   :0.2507
##  1st Qu.:   422851     1st Qu.:   488820   1st Qu.:0.4338
##  Median :  1170867     Median :  1236230   Median :0.5234
##  Mean   :  3074280     Mean   :  3509231   Mean   :0.5183
##  3rd Qu.:  2395791     3rd Qu.:  2336586   3rd Qu.:0.6047
##  Max.   :132609063     Max.   :131304731   Max.   :0.7877
##  NA's   :223           NA's   :1           NA's   :223
##  percent_highest_office
##  Min.   :0.2020
##  1st Qu.:0.4141
##  Median :0.5010
##  Mean   :0.4993
##  3rd Qu.:0.5839
##  Max.   :0.7837
##  NA's   :1
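The summary output reports 223 missing values in total_ballots_counted and percent_total_ballots_counted, so it may be worth checking where they fall before filtering. The following is a minimal sketch using a small stand-in data frame, toy_df, which is hypothetical: the column names come from the dataset, but the 1982 rows and their NA placement are invented purely for illustration and say nothing about where the real missing values are.

```r
# Hypothetical stand-in for voter_df; the 1982 rows and their NAs are made up
toy_df <- data.frame(
  year = c(2014, 2014, 1982, 1982),
  region = c("Maryland", "Vermont", "Maryland", "Vermont"),
  percent_total_ballots_counted = c(0.4200469, 0.4082507, NA, NA)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy_df))
na_counts

# List the years in which the turnout column is missing
missing_years <- unique(toy_df$year[is.na(toy_df$percent_total_ballots_counted)])
missing_years
```

Running the same two steps on voter_df would show whether any of the missing values fall inside the years and states chosen for the analysis.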
Utilize functions such as filter, select, mutate, summary, mean, max, etc., to create the dataset you need to answer your question (minimum of two functions).
# Select the variables I am using, then filter by region and year
my_filtered_df <- voter_df |>
  select(region, year, percent_total_ballots_counted) |>
  filter(region %in% c("Maryland", "Massachusetts", "New Hampshire", "New York", "Vermont")) |>
  filter(year %in% 1994:2014)
my_filtered_df
## region year percent_total_ballots_counted
## 1 Maryland 2014 0.4200469
## 2 Massachusetts 2014 0.4466211
## 3 New Hampshire 2014 0.4830718
## 4 New York 2014 0.2899865
## 5 Vermont 2014 0.4082507
## 6 Maryland 2012 0.6728207
## 7 Massachusetts 2012 0.6620397
## 8 New Hampshire 2012 0.7091828
## 9 New York 2012 0.5350341
## 10 Vermont 2012 0.6117157
## 11 Maryland 2010 0.4669240
## 12 Massachusetts 2010 0.4942748
## 13 New Hampshire 2010 0.4614608
## 14 New York 2010 0.3626861
## 15 Vermont 2010 0.4982534
## 16 Maryland 2008 0.6781721
## 17 Massachusetts 2008 0.6727288
## 18 New Hampshire 2008 0.7252813
## 19 New York 2008 0.5963191
## 20 Vermont 2008 0.6771029
## 21 Maryland 2006 0.4719729
## 22 Massachusetts 2006 0.4928479
## 23 New Hampshire 2006 0.4291584
## 24 New York 2006 0.3652430
## 25 Vermont 2006 0.5499773
## 26 Maryland 2004 0.6309256
## 27 Massachusetts 2004 0.6456873
## 28 New Hampshire 2004 0.7148225
## 29 New York 2004 0.5847255
## 30 Vermont 2004 0.6674824
## 31 Maryland 2002 0.4680435
## 32 Massachusetts 2002 0.4979571
## 33 New Hampshire 2002 0.4869911
## 34 New York 2002 0.3699541
## 35 Vermont 2002 0.4942544
## 36 Maryland 2000 0.5579893
## 37 Massachusetts 2000 0.6057061
## 38 New Hampshire 2000 0.6497212
## 39 New York 2000 0.5622050
## 40 Vermont 2000 0.6471316
## 41 Maryland 1998 0.4360000
## 42 Massachusetts 1998 0.4367645
## 43 New Hampshire 1998 0.3762686
## 44 New York 1998 0.4068933
## 45 Vermont 1998 0.5044877
## 46 Maryland 1996 0.5081914
## 47 Massachusetts 1996 0.5929576
## 48 New Hampshire 1996 0.6006171
## 49 New York 1996 0.5288443
## 50 Vermont 1996 0.6001304
## 51 Maryland 1994 0.4130283
## 52 Massachusetts 1994 0.5127588
## 53 New Hampshire 1994 0.3850613
## 54 New York 1994 0.4367770
## 55 Vermont 1994 0.5075851
# Convert turnout from a proportion (0-1) to a percentage (0-100)
my_filtered_df$percent_total_ballots_counted <- my_filtered_df$percent_total_ballots_counted * 100
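With turnout on a percentage scale, a per-state summary table is one possible complement to a chart. The sketch below is only an illustration: turnout_sample and state_summary are hypothetical names, and the ten rows are the 2012 and 2014 values copied from the printed table and multiplied by 100. In the real analysis, my_filtered_df would take the place of turnout_sample.

```r
library(dplyr)

# Ten rows copied from the printed table (2014 and 2012), already multiplied by 100
turnout_sample <- data.frame(
  region = rep(c("Maryland", "Massachusetts", "New Hampshire", "New York", "Vermont"), times = 2),
  year = rep(c(2014, 2012), each = 5),
  percent_total_ballots_counted = c(
    42.00469, 44.66211, 48.30718, 28.99865, 40.82507,
    67.28207, 66.20397, 70.91828, 53.50341, 61.17157
  )
)

# Mean, minimum, and maximum turnout per state, sorted from lowest to highest mean
state_summary <- turnout_sample |>
  group_by(region) |>
  summarise(
    mean_turnout = mean(percent_total_ballots_counted),
    min_turnout = min(percent_total_ballots_counted),
    max_turnout = max(percent_total_ballots_counted)
  ) |>
  arrange(mean_turnout)

state_summary
```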
Create a summary table or a visualization (e.g., histograms, scatter plots, etc.) to describe the distribution and relationships within the data. [If you chose to graph, you can use the help of AI or previous courses’ knowledge for graphing]
# Create grouped bar chart
ggplot(my_filtered_df, aes(fill = region, x = factor(year), y = percent_total_ballots_counted)) +
  geom_col(position = "dodge", width = 0.7) + # side-by-side bars (same as geom_bar with stat = "identity")
  theme_minimal() +
  labs(
    title = "Voter Turnout Percentage Trends: East Coast States (1994-2014)",
    x = "Election Year",
    y = "Voter Turnout (%)",
    fill = "Region"
  ) +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1) # angle the year labels
  ) +
  scale_fill_manual(values = c(
    "Maryland" = "bisque4", "Massachusetts" = "chocolate4", "New Hampshire" = "orange",
    "New York" = "deeppink2", "Vermont" = "antiquewhite3"
  ))
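Since the research question is about change over time, a line chart with one line per state is a possible alternative view. This is a sketch, not a replacement for the chart in the analysis: line_sample and turnout_lines are hypothetical names, and the eight rows are the 2008-2014 values for New Hampshire and New York copied from the printed table and multiplied by 100.

```r
library(ggplot2)

# Eight rows copied from the printed table, already multiplied by 100
line_sample <- data.frame(
  region = rep(c("New Hampshire", "New York"), each = 4),
  year = rep(c(2008, 2010, 2012, 2014), times = 2),
  percent_total_ballots_counted = c(
    72.52813, 46.14608, 70.91828, 48.30718,
    59.63191, 36.26861, 53.50341, 28.99865
  )
)

# One line per state, with a point at each election year
turnout_lines <- ggplot(line_sample, aes(x = year, y = percent_total_ballots_counted, color = region)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = seq(2008, 2014, by = 2)) +
  theme_minimal() +
  labs(x = "Election Year", y = "Voter Turnout (%)", color = "Region")

turnout_lines
```

Keeping year numeric here, rather than converting it with factor(), lets the lines connect consecutive elections.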
Summarize the key findings of your analysis.
Discuss the implications of your results and their relevance to the research question.
Suggest potential avenues for future research or further analysis.
Important findings based on my analysis are:
New York had the lowest voter turnout of the five states in every election from 2002 through 2014 (from 1994 to 2000 the lowest was either New Hampshire or Maryland). This had me questioning why, when that state is high in population, industry, government, etc. This data only covers 1994-2014, and it wasn't until 2019 that early in-person voting laws were implemented in New York. People in New York were also known not to vote when the outcomes of political races were too predictable. Other factors, such as demographics like age, may have played a role as well. Because the percentage is ballots counted divided by the voting-eligible population, a state with a large eligible population but relatively few ballots counted ends up with a low turnout percentage. Compare this with New Hampshire, which had the highest turnout of the five states in every presidential election year from 1996 to 2012 and again in 2014, although it was at or near the bottom in the 1994, 1998, and 2006 midterms. New Hampshire's more rural and small-town structure often means more face-to-face interaction, higher social stakes locally, and easier access to polling places. To conclude, these results are consistent with the idea that more rural states have higher voter turnout because they have more face-to-face interactions. Those interactions play a key role in government and voting, as people tend to feel more valued, which is something I learned in my sociology class. So my graph is consistent with what I learned in that class, and I could use it as an example to argue that if states with bigger populations want higher voter turnout, there need to be more face-to-face interactions to show that the people in power (the government) care.
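One way to check which state sits at the bottom or top in a given year is to compute it directly rather than read it off the chart. The sketch below assumes a small sample, check_sample, holding the 1994, 2000, and 2012 rows copied from the printed table and multiplied by 100. The names check_sample, extremes, and election_type are hypothetical, and the presidential/midterm label relies on U.S. presidential election years being divisible by 4.

```r
library(dplyr)

# Fifteen rows copied from the printed table (1994, 2000, 2012), multiplied by 100
check_sample <- data.frame(
  region = rep(c("Maryland", "Massachusetts", "New Hampshire", "New York", "Vermont"), times = 3),
  year = rep(c(1994, 2000, 2012), each = 5),
  percent_total_ballots_counted = c(
    41.30283, 51.27588, 38.50613, 43.67770, 50.75851,
    55.79893, 60.57061, 64.97212, 56.22050, 64.71316,
    67.28207, 66.20397, 70.91828, 53.50341, 61.17157
  )
)

# For each year, label the election type and find the lowest and highest state
extremes <- check_sample |>
  mutate(election_type = if_else(year %% 4 == 0, "presidential", "midterm")) |>
  group_by(year, election_type) |>
  summarise(
    lowest_state = region[which.min(percent_total_ballots_counted)],
    highest_state = region[which.max(percent_total_ballots_counted)],
    .groups = "drop"
  )

extremes
```

Applied to the full filtered data, the same steps give one row per election year, which makes it easy to see whether a state is consistently lowest or highest and whether the pattern differs between presidential and midterm years.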
link: https://www.openintro.org/data/index.php?data=voter_count