##Instructions

  1. For every data visualization you make, add a proper description to each axis. Even if the variable names are acceptable, add the labels explicitly. This is worth 1 point per axis per problem!
  2. For each question asking you to make a calculation, you must add a comment or a markdown cell explicitly answering the question. R output alone is insufficient.

###Loading Stuff

Load in the ‘dplyr’ and ‘ggplot2’ libraries.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(knitr)

###Data

Load in the dataset as a dataframe named “voters”. It can also be found on Brightspace with the assignment.

voters <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/voter-registration/new-voter-registrations.csv")
if (interactive()) View(voters)  # View() opens a data viewer, so run it only in interactive sessions

##Introduction (10 points)

You’ve been hired to work with a government organization interested in information about voter registration in the United States. They have asked you to come up with two research questions: one about national trends and one about a state trend of your choice.

###Introduction

Clearly state:

* the purpose of your project
* what the goal is
* why it is important for someone to read this notebook

In addition, state your two research questions.

# The purpose of the project is to use voter registration data from across the country to analyze and compare the 2016 and 2020 presidential election years.

# The goal is to examine the changes in voter registration across the states during the first five months of the 2016 and 2020 presidential election years and to compare the monthly registration trends by state.

# It is important for someone to read this notebook because it could help politicians and their parties identify the most important months to campaign and the best time to start promoting themselves.

###Data Processing (8 points)

Show the first five rows of data and explain what the variable names are. (5 points)

head(voters, 5)
# Jurisdiction is the state (or Washington, DC), Year is the presidential election year (2016 or 2020), Month is the month of the year (January to May), and New.registered.voters is the number of people who registered to vote in that jurisdiction during that month and year.

Check for any missing data. (1 point)

sum(is.na(voters))
## [1] 0
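
A total of zero means nothing is missing, but when the total is nonzero a per-column count shows where the gaps are. The following is a minimal sketch in base R; the data frame `toy` and its values are made up for illustration and are not part of the voters dataset.

```r
# Hypothetical toy data with two missing values (not the real voters data)
toy <- data.frame(
  NAME = c("A", "B", NA),
  registered_voters = c(100, NA, 300)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
missing_by_column <- colSums(is.na(toy))
print(missing_by_column)
```

The same call on the real data would be `colSums(is.na(voters))`.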

Check your data types. (1 point)

str(voters)
## 'data.frame':    106 obs. of  4 variables:
##  $ Jurisdiction         : chr  "Arizona" "Arizona" "Arizona" "Arizona" ...
##  $ Year                 : int  2016 2016 2016 2016 2020 2020 2020 2020 2016 2016 ...
##  $ Month                : chr  "Jan" "Feb" "Mar" "Apr" ...
##  $ New.registered.voters: int  25852 51155 48614 30668 33229 50853 31872 10249 87574 103377 ...
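
Because Month is stored as character data, sorting or plotting it directly follows alphabetical order rather than calendar order. One possible way to handle this, sketched here on a made-up vector rather than the real column, is to convert it to a factor whose levels come from R's built-in `month.abb` constant.

```r
# Hypothetical month values in arbitrary order (not taken from the voters data)
toy_months <- c("Mar", "Jan", "May", "Feb")

# month.abb is a built-in vector of three-letter month abbreviations
month_levels <- month.abb[1:5]

# A factor sorts by its level order, not alphabetically
toy_factor <- factor(toy_months, levels = month_levels)
sorted_months <- as.character(sort(toy_factor))
print(sorted_months)
```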

Rename the jurisdiction column to “NAME” and the “New registered voters” to “registered_voters” to use throughout the notebook. (2 points)

voters <- voters %>% rename(NAME = Jurisdiction, registered_voters = New.registered.voters)

Explain any potential limitations to your data (2 points)

# One potential limitation is that the data includes no political affiliations, so we do not know whether Democrats, Republicans, or independents are registering to vote. The data also does not account for voters who are removed from the rolls because they died or moved away.

###Exploratory Data Analysis

When was the highest amount of new voter registration? Show the state, month, year, and number of registered voters. (5 points)

voters %>% filter(registered_voters == max(registered_voters, na.rm = TRUE))
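
An alternative to filtering on `max()` is dplyr's `slice_max()`, which returns the row or rows with the largest value of a column. This is only a sketch on made-up data; the data frame `toy` is hypothetical and its numbers are not from the voters dataset.

```r
library(dplyr)

# Hypothetical toy data (not the real voters data)
toy <- data.frame(
  NAME = c("A", "A", "B"),
  Year = c(2016, 2020, 2020),
  Month = c("Jan", "Feb", "Feb"),
  registered_voters = c(100, 250, 175)
)

# slice_max() keeps the row(s) with the largest value; ties are kept by default
top_row <- toy %>% slice_max(registered_voters, n = 1)
print(top_row)
```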

What is the average voter registration? (1 point)

average_voter_registration <- mean(voters$registered_voters)
print(average_voter_registration)
## [1] 48223.46

Create a dataframe called “high_voters” showing only the times when voter registration was higher than the average you calculated above. How many times did this happen? (3 points)

high_voters <- voters %>% filter(registered_voters > average_voter_registration)
high_voter_count <- nrow(high_voters)
print(high_voter_count)
## [1] 36

How many times did each state go above the national average for new voter registration? (2 points)

voters %>% group_by(NAME) %>% summarise(state_count_above_average = sum(registered_voters > average_voter_registration))
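
A related way to get these counts, shown here as a sketch on made-up data, is to filter first and then use dplyr's `count()` with `sort = TRUE`. The data frame `toy` and the name `toy_average` are hypothetical. One difference from the `summarise()` approach is that states that never exceed the average drop out of the result instead of appearing with a count of zero.

```r
library(dplyr)

# Hypothetical toy data (not the real voters data)
toy <- data.frame(
  NAME = c("A", "A", "B", "B", "C"),
  registered_voters = c(10, 90, 80, 70, 5)
)
toy_average <- mean(toy$registered_voters)

# Keep only above-average rows, then count rows per state, largest first
above_counts <- toy %>%
  filter(registered_voters > toy_average) %>%
  count(NAME, sort = TRUE, name = "state_count_above_average")
print(above_counts)
```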

Which three states had the highest average voter registration overall? Show only the top three results. (5 points)

voters %>%
  group_by(NAME) %>%
  summarise(average_state_registration = mean(registered_voters)) %>%
  arrange(desc(average_state_registration)) %>%
  head(3)

###Data Visualization (30 points, as noted)

Create a plot showing the voter registration by state.

total_voters <- voters %>% group_by(NAME) %>% summarise(total_registered = sum(registered_voters))

total_voters %>% ggplot(aes(x = NAME, y = total_registered/1e6)) + geom_bar(stat = "identity", fill = "purple") +
  labs(title = "total voter registrations by state (2016 and 2020)", x = "us jurisdiction", y = "total voters registered (in millions)") + theme(legend.position = "none")

# States with larger populations tend to have more total voter registrations. For example, California and Texas have larger populations than Delaware and far more registrations.
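
When a bar chart has many jurisdiction names on the x axis, the labels can overlap. One possible remedy, sketched below on made-up totals, is to sort the bars with `reorder()` and flip the axes with `coord_flip()`. The data frame `toy_totals` is hypothetical, and `geom_col()` is used as the equivalent of `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# Hypothetical totals (not the real voters data)
toy_totals <- data.frame(
  NAME = c("A", "B", "C"),
  total_registered = c(3.2e6, 0.4e6, 1.1e6)
)

# reorder() sorts the bars by total; coord_flip() puts the names on the vertical axis
state_plot <- ggplot(
  toy_totals,
  aes(x = reorder(NAME, total_registered), y = total_registered / 1e6)
) +
  geom_col(fill = "purple") +
  coord_flip() +
  labs(
    title = "total voter registrations by state",
    x = "us jurisdiction",
    y = "total voters registered (in millions)"
  )
```

Printing `state_plot` draws the chart; the axis labels stay attached to their original variables after the flip.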

Produce a plot comparing voter registration in 2016 and 2020. (2 pts)

* Color the graph based on the month. (1 pt)
* Change the default color palette used. (1 pt)
* Comment on any trends you see. (2 pts)
* Add the appropriate labels and title. (1 pt)
* Comment on any trends you see. (2 pts)

(9 pts total)

order <- voters %>% group_by(Year, Month) %>% summarise(total = sum(registered_voters))
## `summarise()` has grouped output by 'Year'. You can override using the
## `.groups` argument.
order$Month <- factor(order$Month, levels = c("Jan","Feb","Mar","Apr","May"))
ggplot(order, aes(x = factor(Year), y = total/1e6, fill = Month)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("green","yellow","orange","pink","purple")) +
  labs(x = "year", y = "total voter registrations (in millions)", title = "total voter registrations by month and year", fill = "month")

# 2016 has more voter registrations than 2020, and the biggest month for new voter registrations was February of 2020. May tends to have the fewest voter registrations, as they
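
A year-over-year comparison of totals is only like-for-like if both years cover the same months. The sketch below, on made-up data, restricts the comparison to months present in both years and then computes a percent change. The data frame `toy` and the names `shared_months`, `yearly_totals`, and `percent_change` are hypothetical, and whether the real dataset has uneven month coverage is an assumption to check rather than a finding.

```r
library(dplyr)

# Hypothetical toy data in which 2020 has one extra month (not the real voters data)
toy <- data.frame(
  Year = c(2016, 2016, 2020, 2020, 2020),
  Month = c("Jan", "Feb", "Jan", "Feb", "Mar"),
  registered_voters = c(100, 200, 90, 150, 60)
)

# Months that appear in both years
shared_months <- base::intersect(
  toy$Month[toy$Year == 2016],
  toy$Month[toy$Year == 2020]
)

# Total registrations per year, using only the shared months
yearly_totals <- toy %>%
  filter(Month %in% shared_months) %>%
  group_by(Year) %>%
  summarise(total = sum(registered_voters), .groups = "drop")

total_2016 <- yearly_totals$total[yearly_totals$Year == 2016]
total_2020 <- yearly_totals$total[yearly_totals$Year == 2020]

# Percent change from 2016 to 2020
percent_change <- 100 * (total_2020 - total_2016) / total_2016
print(percent_change)
```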

Create a data visualization that relates to either your state level research question or your national level research question. If one of your questions was answered by the above graphs, you may not use that question. (12 points)

highest_2020_states <- voters %>%
  filter(Year == 2020) %>%
  group_by(NAME) %>%
  summarise(total_2020 = sum(registered_voters)) %>%
  arrange(desc(total_2020)) %>%
  head(3)
highest_2020_states %>% ggplot(aes(x = reorder(NAME, total_2020), y = total_2020/1e5)) + geom_bar(stat = "identity", fill = "gray") +
  labs(title = "top 3 states in 2020 with the highest voter registration", subtitle = "january to may totals", x = "US jurisdiction", y = "total new voters registered (hundred thousands)")

###Notebook Conclusion (15 points)

Write a conclusion section that includes:

* Insights: the insights/outcomes of your notebook (5 points)
* Suggestions: any suggestions or ideas you could offer your client (5 points)
* Possible next steps: at least one step that you would take if continuing to work with this project (5 points)

This should only be reflective of your data as presented here.

# Insights - 2016 had many more voter registrations than 2020. New voter registrations in both presidential election years (2016 and 2020) were concentrated in California, Texas, Florida, Illinois, and North Carolina, with the first three states having the highest total voter registration counts overall. In February of 2020, California had the highest new voter registration count, with over 238,000 people. The average voter registration count per month per state is approximately 48,000.
# Suggestions - Use strategies that guard against a repeat of the small new voter registration counts seen in April and May of 2020, following the start of the pandemic.
# Possible next steps - If continuing to work with this project, I would include voter registration data from June to December for both 2016 and 2020. Other possible next steps include making a graph showing the correlation between voter registrations and population by state. You could also include statistics on whether a state leans Democratic or Republican and analyze its voter registration rate.
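
The population comparison proposed as a next step could be sketched as follows. Every value below is a placeholder: the `population` column is not in the voters dataset and would have to come from an outside source, and the names `toy` and `registrations_per_1000` are hypothetical.

```r
# Hypothetical toy data; the population values are placeholders, not census figures
toy <- data.frame(
  NAME = c("A", "B", "C", "D"),
  total_registered = c(120, 450, 300, 80),
  population = c(1000, 4000, 2500, 900)
)

# Pearson correlation between population and total registrations
registration_population_cor <- cor(toy$population, toy$total_registered)

# A per-capita rate makes large and small states comparable
toy$registrations_per_1000 <- 1000 * toy$total_registered / toy$population
print(registration_population_cor)
print(toy)
```

A per-capita rate is included because raw totals mostly track population size, which can hide differences in how actively people register.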

Each section should be at least 3-4 complete sentences.

###Stakeholder Analysis (14 points, as noted)

Loosely adapted from the Cambridge Analytica/Facebook reveal in 2018. While it is not necessary to look this up, it’s always good to know.

All answers should be in full sentences and in a text box, not code.

You work for a social media company which is currently allowing third-party apps access to data through the use of an API; this means another company can create an app on your platform and access the data of users who consent.

One such company is a lobbying group. They create a third-party app called “Get Out and Vote!” that asks users about their voter registration and other habits. While users of the app must give consent, the app also allows the lobbying group to collect data on the social media habits of the friends of those who consented to the app. While only a few hundred thousand users take part in the original app, the company is still able to collect data from millions of users.

They use this data to create personality profiles of unregistered voters. That data is then sold to specific political campaigns who use the information to create political ads that encourage people in populated US cities not to vote while creating ads that tell people in swing states to vote.

While it is not necessary to do research on swing states, here is some information about their demographics if you are curious.

Identify a key ethical issue here, from the standpoint of the social media company (3 points).

# A key ethical issue here is that the social media company is allowing third-party access to the data of users who did not give consent.

What are at least three of our ethical standards that apply to this situation, and why are they relevant (5 points).

# Three of our ethical standards that apply to this situation are: be competent and act with integrity, be transparent, and uphold privacy and confidentiality. These are relevant because the social media company is not being honest if it allows user data to be used with ill intent. If it does not inform users of this, it is not being transparent about how their data is used. In addition, when that data is sold to political campaigns to create ads targeted at those users, users' privacy and confidentiality are at risk.

Your company wants to continue work with the lobbying firm and does not want to lose them as a client. Looking at this from a utilitarian ethics standpoint, propose an alternative action that allows you to work with them while still addressing the ethical issue you mentioned above (6 points)

# Utilitarian ethics aims to maximize benefits and minimize harms. To keep working with the lobbying group while addressing the ethical issue above, the company should let the group access only the data of individuals who explicitly gave consent. In addition, the data should not be used to create targeted ads that encourage people in populated US cities not to vote while encouraging people in swing states to vote.

###GitHub (3 points)

Post this to your GitHub and include the accessible link either here or on Brightspace.

# https://github.com/damecny/dida-325-midterm/blob/main/dida%20325%20midterm

###Academic Integrity Statement

By writing my name in the cell below, I certify that:

  1. I did not use resources other than:
    • the Python notebooks provided by the instructor,
    • links provided in this notebook,
    • the assigned readings, and
    • my own personal notes
  2. This means that I did not:
    • look up anything on Google, Stack Overflow, ChatGPT, etc.,
    • discuss the content of the exam with anyone other than the instructors or TAs, or
    • do anything that technically doesn’t break these rules but is against their spirit.
# Damian Chang