Data 101 Project 1

A. Introduction (1-2 paragraphs):

My question: How has voter turnout changed over time in East Coast states?

My dataset is “US Voter Turnout Data,” which I obtained from OpenIntro.org. It contains state-level data on federal elections held in November between 1980 and 2014, with 936 observations and 7 variables. The variables relevant to my question are region, year, and percent_total_ballots_counted. I chose this dataset because I am currently taking SOCY105, a sociology class on social issues and problems. I want to focus on East Coast states because I live on the East Coast, and I want to see which states have higher voter turnout. Voter turnout can serve as one indicator of a state's civic health: higher turnout suggests a more engaged and active citizenry, whereas lower turnout can point to areas needing help or improvement. This ties into my sociology class, where I can connect voter turnout to the role of government, the global economy and economic inequities, and even immigration.

B. Data Analysis (1 paragraph and 3-5 chunks of code): describe what type of data analysis you will perform and what kind of plot(s) you will generate to address your question. Your code should be used to:

For my data analysis, I will first select the variables I need (region, year, and percent_total_ballots_counted). Then I will filter for five East Coast states: Maryland, Massachusetts, New Hampshire, New York, and Vermont. I picked these states because together they range from low to high voter turnout. Next, I will filter for the years 1994-2014, which appear in two-year increments in the dataset; this gives a more focused window of analysis covering two decades of data. Lastly, I will create a grouped bar chart, as I think it best displays what I want the audience to see when comparing the East Coast states' voter turnout from 1994 to 2014.

library(tidyverse)

## Warning: package 'ggplot2' was built under R version 4.4.3

## Warning: package 'dplyr' was built under R version 4.4.3

## Warning: package 'forcats' was built under R version 4.4.3

## Warning: package 'lubridate' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

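# ggplot2 and dplyr are already attached by library(tidyverse);
# these calls are redundant but harmless
library(ggplot2)
library(dplyr)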

# Set the working directory
setwd("C:/Users/Joanne G/OneDrive/Data101(Fall 2025)/Datasets")

# Read voter_count.csv into a data frame
voter_df <- read.csv("voter_count.csv")

• Clean your dataset and perform exploratory data analysis (EDA) on the dataset (need minimum of two functions).

# EDA chunk

# Dimensions (rows, columns)
dim(voter_df)

## [1] 936   7

# First six rows
head(voter_df)

##   year        region voting_eligible_population total_ballots_counted
## 1 2014 United States                  227157964              83262122
## 2 2014       Alabama                    3588783               1191274
## 3 2014        Alaska                     520562                285431
## 4 2014       Arizona                    4510186               1537671
## 5 2014      Arkansas                    2117881                852642
## 6 2014    California                   24440416               7513972
##   highest_office percent_total_ballots_counted percent_highest_office
## 1       81687059                     0.3665384              0.3596046
## 2        1180413                     0.3319437              0.3289174
## 3         282382                     0.5483132              0.5424560
## 4        1506416                     0.3409329              0.3340031
## 5         848592                     0.4025920              0.4006797
## 6        7317581                     0.3074404              0.2994049

# Summary statistics for each column
summary(voter_df)

##       year         region          voting_eligible_population
##  Min.   :1980   Length:936         Min.   :   270122         
##  1st Qu.:1988   Class :character   1st Qu.:   999644         
##  Median :1997   Mode  :character   Median :  2662524         
##  Mean   :1997                      Mean   :  7277622         
##  3rd Qu.:2006                      3rd Qu.:  4569632         
##  Max.   :2014                      Max.   :227157964         
##                                                              
##  total_ballots_counted highest_office      percent_total_ballots_counted
##  Min.   :   122356     Min.   :   117623   Min.   :0.2507               
##  1st Qu.:   422851     1st Qu.:   488820   1st Qu.:0.4338               
##  Median :  1170867     Median :  1236230   Median :0.5234               
##  Mean   :  3074280     Mean   :  3509231   Mean   :0.5183               
##  3rd Qu.:  2395791     3rd Qu.:  2336586   3rd Qu.:0.6047               
##  Max.   :132609063     Max.   :131304731   Max.   :0.7877               
##  NA's   :223           NA's   :1           NA's   :223                  
##  percent_highest_office
##  Min.   :0.2020        
##  1st Qu.:0.4141        
##  Median :0.5010        
##  Mean   :0.4993        
##  3rd Qu.:0.5839        
##  Max.   :0.7837        
##  NA's   :1
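
The summary above reports 223 missing values in both total_ballots_counted and percent_total_ballots_counted. The following is a minimal sketch of how missing values could be counted per column and dropped before analysis. The data frame `toy_df` and all of its values are invented for illustration and are not taken from voter_count.csv.

```r
# Hypothetical toy data standing in for voter_df (all values are invented)
toy_df <- data.frame(
  year = c(1980, 1980, 1982, 1982),
  region = c("Maryland", "Vermont", "Maryland", "Vermont"),
  percent_total_ballots_counted = c(NA, 0.52, 0.41, NA)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy_df))
print(na_counts)

# Keep only the rows where the turnout percentage is present
toy_complete <- toy_df[!is.na(toy_df$percent_total_ballots_counted), ]
print(nrow(toy_complete))
```

Running the same two steps on voter_df would show whether any of the rows used in the analysis are affected by the missing values.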

• Utilize functions such as filter, select, mutate, summary, mean, max, etc., to create the dataset you need to answer your question (minimum of two functions).

# Select the variables I am using, then filter by region and year
my_filtered_df <- voter_df |>
  select(region, year, percent_total_ballots_counted) |>
  filter(region %in% c("Maryland", "Massachusetts", "New Hampshire", "New York", "Vermont")) |>
  filter(year >= 1994, year <= 2014)

my_filtered_df

##           region year percent_total_ballots_counted
## 1       Maryland 2014                     0.4200469
## 2  Massachusetts 2014                     0.4466211
## 3  New Hampshire 2014                     0.4830718
## 4       New York 2014                     0.2899865
## 5        Vermont 2014                     0.4082507
## 6       Maryland 2012                     0.6728207
## 7  Massachusetts 2012                     0.6620397
## 8  New Hampshire 2012                     0.7091828
## 9       New York 2012                     0.5350341
## 10       Vermont 2012                     0.6117157
## 11      Maryland 2010                     0.4669240
## 12 Massachusetts 2010                     0.4942748
## 13 New Hampshire 2010                     0.4614608
## 14      New York 2010                     0.3626861
## 15       Vermont 2010                     0.4982534
## 16      Maryland 2008                     0.6781721
## 17 Massachusetts 2008                     0.6727288
## 18 New Hampshire 2008                     0.7252813
## 19      New York 2008                     0.5963191
## 20       Vermont 2008                     0.6771029
## 21      Maryland 2006                     0.4719729
## 22 Massachusetts 2006                     0.4928479
## 23 New Hampshire 2006                     0.4291584
## 24      New York 2006                     0.3652430
## 25       Vermont 2006                     0.5499773
## 26      Maryland 2004                     0.6309256
## 27 Massachusetts 2004                     0.6456873
## 28 New Hampshire 2004                     0.7148225
## 29      New York 2004                     0.5847255
## 30       Vermont 2004                     0.6674824
## 31      Maryland 2002                     0.4680435
## 32 Massachusetts 2002                     0.4979571
## 33 New Hampshire 2002                     0.4869911
## 34      New York 2002                     0.3699541
## 35       Vermont 2002                     0.4942544
## 36      Maryland 2000                     0.5579893
## 37 Massachusetts 2000                     0.6057061
## 38 New Hampshire 2000                     0.6497212
## 39      New York 2000                     0.5622050
## 40       Vermont 2000                     0.6471316
## 41      Maryland 1998                     0.4360000
## 42 Massachusetts 1998                     0.4367645
## 43 New Hampshire 1998                     0.3762686
## 44      New York 1998                     0.4068933
## 45       Vermont 1998                     0.5044877
## 46      Maryland 1996                     0.5081914
## 47 Massachusetts 1996                     0.5929576
## 48 New Hampshire 1996                     0.6006171
## 49      New York 1996                     0.5288443
## 50       Vermont 1996                     0.6001304
## 51      Maryland 1994                     0.4130283
## 52 Massachusetts 1994                     0.5127588
## 53 New Hampshire 1994                     0.3850613
## 54      New York 1994                     0.4367770
## 55       Vermont 1994                     0.5075851

# Convert decimal values to percentages
my_filtered_df$percent_total_ballots_counted <- my_filtered_df$percent_total_ballots_counted * 100
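
A summary table can complement the chart. The sketch below computes the mean turnout per state over the 11 elections, using base R so it runs without extra packages. The values are the percentages printed for my_filtered_df, rounded to two decimals; the names `turnout`, `pct`, and `state_means` are made up for this example.

```r
# Turnout (%) by state for 2014, 2012, ..., 1994,
# rounded to two decimals from the printed my_filtered_df output
years <- seq(2014, 1994, by = -2)
states <- c("Maryland", "Massachusetts", "New Hampshire", "New York", "Vermont")

turnout <- data.frame(
  region = rep(states, each = length(years)),
  year = rep(years, times = length(states)),
  pct = c(
    42.00, 67.28, 46.69, 67.82, 47.20, 63.09, 46.80, 55.80, 43.60, 50.82, 41.30, # Maryland
    44.66, 66.20, 49.43, 67.27, 49.28, 64.57, 49.80, 60.57, 43.68, 59.30, 51.28, # Massachusetts
    48.31, 70.92, 46.15, 72.53, 42.92, 71.48, 48.70, 64.97, 37.63, 60.06, 38.51, # New Hampshire
    29.00, 53.50, 36.27, 59.63, 36.52, 58.47, 37.00, 56.22, 40.69, 52.88, 43.68, # New York
    40.83, 61.17, 49.83, 67.71, 55.00, 66.75, 49.43, 64.71, 50.45, 60.01, 50.76  # Vermont
  )
)

# Mean turnout per state, sorted from lowest to highest
state_means <- aggregate(pct ~ region, data = turnout, FUN = mean)
state_means <- state_means[order(state_means$pct), ]
print(state_means)
```

Sorting by the mean puts the state with the lowest average turnout in the first row and the highest in the last.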

• Create a summary table or a visualization (e.g., histograms, scatter plots, etc.) to describe the distribution and relationships within the data. [If you choose to graph, you can use the help of AI or previous courses’ knowledge for graphing]

# Create grouped bar chart
ggplot(my_filtered_df, aes(x = factor(year), y = percent_total_ballots_counted, fill = region)) +
  geom_col(position = "dodge", width = 0.7) +  # side-by-side bars
  theme_minimal() +
  labs(
    title = "Voter Turnout Percentage Trends: East Coast States (1994-2014)",
    x = "Election Year",
    y = "Voter Turnout (%)",
    fill = "Region"
  ) +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)  # angle the year labels
  ) +
  scale_fill_manual(values = c(
    "Maryland" = "bisque4",
    "Massachusetts" = "chocolate4",
    "New Hampshire" = "orange",
    "New York" = "deeppink2",
    "Vermont" = "antiquewhite3"
  ))
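
Bars of similar height can be hard to rank by eye, so the chart can be checked against the numbers. The sketch below finds the lowest- and highest-turnout state in each election using base R. The values are the percentages printed for my_filtered_df, rounded to two decimals; the names `pct`, `lowest`, `highest`, and `extremes` are made up for this example.

```r
# Turnout (%) per state; columns are the elections 2014, 2012, ..., 1994,
# rounded to two decimals from the printed my_filtered_df output
years <- seq(2014, 1994, by = -2)
pct <- rbind(
  "Maryland"      = c(42.00, 67.28, 46.69, 67.82, 47.20, 63.09, 46.80, 55.80, 43.60, 50.82, 41.30),
  "Massachusetts" = c(44.66, 66.20, 49.43, 67.27, 49.28, 64.57, 49.80, 60.57, 43.68, 59.30, 51.28),
  "New Hampshire" = c(48.31, 70.92, 46.15, 72.53, 42.92, 71.48, 48.70, 64.97, 37.63, 60.06, 38.51),
  "New York"      = c(29.00, 53.50, 36.27, 59.63, 36.52, 58.47, 37.00, 56.22, 40.69, 52.88, 43.68),
  "Vermont"       = c(40.83, 61.17, 49.83, 67.71, 55.00, 66.75, 49.43, 64.71, 50.45, 60.01, 50.76)
)
colnames(pct) <- years

# For each election (column), look up the state with the smallest and largest value
lowest <- rownames(pct)[apply(pct, 2, which.min)]
highest <- rownames(pct)[apply(pct, 2, which.max)]

# One row per election, oldest first
extremes <- data.frame(year = years, lowest = lowest, highest = highest)
extremes <- extremes[order(extremes$year), ]
print(extremes)
```

The resulting table lists, for each election year, which of the five states sat at the bottom and at the top of the chart.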

C. Conclusion and Future Directions (1-2 paragraphs):

• Summarize the key findings of your analysis.

• Discuss the implications of your results and their relevance to the research question.

• Suggest potential avenues for future research or further analysis.

Important findings based on my analysis:

New York had the lowest voter turnout of the five states in every election from 2002 through 2014, although Maryland or New Hampshire was lowest in the elections from 1994 to 2000. This had me questioning why, given that the state is high in population, industry, government, and so on. This data only covers 1994-2014, and it wasn’t until 2019 that New York implemented early in-person voting. People in New York were also known not to vote when the outcomes of political races were too predictable. Other factors, such as demographics like age, may have played a big role in the percentages as well (with a larger population, the number of ballots counted can be small relative to the number of eligible voters, which lowers the turnout percentage). By comparison, New Hampshire most often had the highest turnout: it led the five states in every presidential election year from 1996 to 2012 and again in 2014. New Hampshire’s more rural, small-town structure often means more face-to-face interaction, higher social stakes locally, and easier access to polling places. To conclude, these results are consistent with the idea that more rural states have higher voter turnout because they have more face-to-face interaction. Those interactions play a key role in government and voting because people tend to feel more valued, which is something I learned in my sociology class. So my graph is consistent with what I learned in that class, and I could use it as an example to argue that if states with bigger populations want higher voter turnout, there needs to be more face-to-face interaction to show that people in power (the government) care.

D. References: Provide citations for any datasets, literature, or resources referenced in your paper. (ChatGPT is not a source)

OpenIntro. “US Voter Turnout Data” (voter_count). https://www.openintro.org/data/index.php?data=voter_count


Joanne Gazmen

2025-10-10
