Introduction:

For this project I chose three data sources to clean and perform analysis. The data sources are student test scores, presidential voting data, and different religions’ income break downs. All three data sets are examples of wide data and in this project I will attempt to make the data more usable analysis and then go on to conduct said analysis.

Loading Necessary Packages

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(stringr)
library(ggplot2)

Loading Data Source 1: Student Testing Data (Alice Ding)

ds1_link <- 'https://gist.githubusercontent.com/Kimmirikwa/b69d0ea134820ea52f8481991ffae93e/raw/4db7b1698035ee29885d10e1a59bd902716ae168/student_results.csv'

ds1 <- read_csv(ds1_link)
## Rows: 10 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, sex and age, test number
## dbl (5): id, phone, term 1, term 2, term 3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning Data Source 1

ds1 <- ds1 %>%
  melt(id.vars = c('id', 'name', 'phone', 'sex and age','test number')) %>%
  rename(`term number` = variable, score = value)

ds1 <- ds1 %>%
  separate(`sex and age`, into = c("gender", "age"), sep = "_") %>%
  mutate(age = as.numeric(age))

knitr::kable(head(ds1))
id name phone gender age test number term number score
1 Mike 134 m 12 test 1 term 1 76
2 Linda 270 f 13 test 1 term 1 88
3 Sam 210 m 11 test 1 term 1 78
4 Esther 617 f 12 test 1 term 1 68
5 Mary 114 f 14 test 1 term 1 65
1 Mike 134 m 12 test 2 term 1 85

Converting ds1 to a csv

write.csv(ds1, 'Students_Test_Scores.csv')

The suggested analysis for Data Source One

The suggested analysis was, “We could find the average test score for each student or average test score overall. We could also count how many students are of each sex and find the average age as well.”

# Average Test Score
avg_by_student <- ds1 %>%
  group_by(name) %>%
  summarise(avg_test_score = mean(score)) %>%
  arrange(desc(avg_test_score))

knitr::kable(avg_by_student)
name avg_test_score
Linda 85.66667
Mike 83.66667
Sam 79.83333
Esther 73.33333
Mary 66.16667
# Counts per gender
gender_counts <- unique(ds1[c("id", "gender")]) %>%
  group_by(gender) %>%
  summarize(count = n())

knitr::kable(gender_counts)
gender count
f 3
m 2
# Averages by gender
avg_by_gender <- ds1 %>%
  group_by(gender) %>%
  summarise(avg_test_score = mean(score)) %>%
  arrange(desc(avg_test_score))

knitr::kable(avg_by_gender)
gender avg_test_score
m 81.75000
f 75.05556

Loading Data Source 2: Religion and Income

I should note that I did not realize that RStudio has a built in table relig_income, but it had more religions than the example I originally provided, so I just recreated the example I provided for consistency.

relig_inc <- read_csv('Raw_Religion_Income.csv')
## Rows: 5 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): religion
## dbl (10): <10k, 10-20k, 20-30k, 30-40k, 40-50k, 50-75k, 75-100k, 100-150k, >...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning Data Source 2

relig_inc <- relig_inc %>%
  melt(id.vars = 'religion') %>%
  mutate(income_level = variable, count = value) %>%
  select(-variable, -value)

knitr::kable(head(relig_inc))
religion income_level count
Agnostic <10k 27
Athiest <10k 12
Buddhist <10k 27
Catholic <10k 418
refused <10k 15
Agnostic 10-20k 34

Writing relig_inc to a csv

write.csv(relig_inc, "Religion_&_Income_Levels.csv")

The suggested analysis for Data Source Two

The suggested analysis was, “As far as analysis goes you could group by religion and see what the religious makeup of all respondents was by percentages.”

# Percent of respondents at each income level by religion

percents <- relig_inc %>%
  group_by(religion) %>%
  mutate(percent = round(100 * count/sum(count), 2)) %>%
  ungroup() %>%
  arrange(religion)

knitr::kable(head(percents))
religion income_level count percent
Agnostic <10k 27 3.27
Agnostic 10-20k 34 4.12
Agnostic 20-30k 60 7.26
Agnostic 30-40k 81 9.81
Agnostic 40-50k 76 9.20
Agnostic 50-75k 137 16.59

Loading Data Source 3:

votes <- read_csv("presidential_vote_tallies.csv")
## Rows: 4 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Candidate
## dbl (2): CA, FL
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning voter data and writing to a CSV

votes <- votes %>%
  melt(id.vars = 'Candidate') %>%
  rename(State = variable, Number_of_votes = value)

knitr::kable(head(votes))
Candidate State Number_of_votes
Hillary Clinton CA 5931283
Donald Trump CA 3184721
Gary Johnson CA 308392
Jill Stein CA 166311
Hillary Clinton FL 4485745
Donald Trump FL 4605515
write.csv(votes,"vote_totals_by_state.csv")

Analysis of voter data

The requested analysis was, “As per analysis, we can compare the vote counts in each state vs candidate.” So I decided to looks at the percentage results by state for each candidate.

voting_percents <- votes %>%
  group_by(State) %>%
  mutate(percent = round(100 * Number_of_votes/sum(Number_of_votes), 2)) %>%
  ungroup() %>%
  select(-Number_of_votes) 

knitr::kable(voting_percents)
Candidate State percent
Hillary Clinton CA 61.84
Donald Trump CA 33.21
Gary Johnson CA 3.22
Jill Stein CA 1.73
Hillary Clinton FL 47.92
Donald Trump FL 49.20
Gary Johnson FL 2.20
Jill Stein FL 0.68

Graphing voter data

ggplot(data = voting_percents, aes(x = State, y = percent, fill = Candidate)) +
  geom_bar(stat = "identity", position = position_dodge(), alpha = 0.75)  +
  ylim(0,65) + 
  geom_text(aes(label = percent), fontface = "bold", vjust = 1.5,
             position = position_dodge(.9), size = 4) +
  labs(x = "State", title = "Candidate Percentages by State")

Total Votes

I wanted to highlight the difference between popular opinion v.s. the electoral process.

vote_totals <-votes %>%
  group_by(Candidate) %>%
  summarise(total_votes = sum(Number_of_votes)) %>%
  mutate(total_percents = round(100 * total_votes/sum(total_votes),2)) %>%
  select(-total_votes)

knitr::kable(head(vote_totals))
Candidate total_percents
Donald Trump 41.11
Gary Johnson 2.71
Hillary Clinton 54.97
Jill Stein 1.22
ggplot(data = vote_totals, aes(x = Candidate, y = total_percents, fill = Candidate)) +
  geom_bar(stat = "identity", position = position_dodge(), alpha = 0.75)  +
  ylim(0,60) + 
  geom_text(aes(label = total_percents), fontface = "bold", vjust = 1.5,
             position = position_dodge(.9), size = 4) +
  labs(x = "Candidate", title = "Candidate Total Percentages")

While this analysis only represents two states out of all fifty, it hopefully illustrates the disconnect between popular opinion and the president who may be elected. Trump did not win the popular vote, meaning a minority of American’s stated using a specific single choice voting method that they wanted Trump to be president and yet Trump won.

Conclusion:

In this project I cleaned three data sources and performed analysis on each. I displayed students average grades by individual student and then by student gender. I went on to provide the percentage breakdowns of income by religion and concluded my project with an analysis of 2016 presidential election vote count from Florida and California. In my analysis, I found that tidyr, dplyr, and reshape2 were especially helpful packages. I had also used the function “melt” once before, and having used it in the past I relied heavily on it for the purpose of this project.