Introduction:
For this project I chose three data sources to clean and perform
analysis. The data sources are student test scores, presidential voting
data, and different religions’ income break downs. All three data sets
are examples of wide data and in this project I will attempt to make the
data more usable analysis and then go on to conduct said analysis.
Loading Necessary Packages
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(stringr)
library(ggplot2)
Loading Data Source 1: Student Testing Data (Alice Ding)
ds1_link <- 'https://gist.githubusercontent.com/Kimmirikwa/b69d0ea134820ea52f8481991ffae93e/raw/4db7b1698035ee29885d10e1a59bd902716ae168/student_results.csv'
ds1 <- read_csv(ds1_link)
## Rows: 10 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, sex and age, test number
## dbl (5): id, phone, term 1, term 2, term 3
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Cleaning Data Source 1
ds1 <- ds1 %>%
melt(id.vars = c('id', 'name', 'phone', 'sex and age','test number')) %>%
rename(`term number` = variable, score = value)
ds1 <- ds1 %>%
separate(`sex and age`, into = c("gender", "age"), sep = "_") %>%
mutate(age = as.numeric(age))
knitr::kable(head(ds1))
| 1 |
Mike |
134 |
m |
12 |
test 1 |
term 1 |
76 |
| 2 |
Linda |
270 |
f |
13 |
test 1 |
term 1 |
88 |
| 3 |
Sam |
210 |
m |
11 |
test 1 |
term 1 |
78 |
| 4 |
Esther |
617 |
f |
12 |
test 1 |
term 1 |
68 |
| 5 |
Mary |
114 |
f |
14 |
test 1 |
term 1 |
65 |
| 1 |
Mike |
134 |
m |
12 |
test 2 |
term 1 |
85 |
Converting ds1 to a csv
write.csv(ds1, 'Students_Test_Scores.csv')
The suggested analysis for Data Source One
The suggested analysis was, “We could find the average test score for
each student or average test score overall. We could also count how many
students are of each sex and find the average age as well.”
# Average Test Score
avg_by_student <- ds1 %>%
group_by(name) %>%
summarise(avg_test_score = mean(score)) %>%
arrange(desc(avg_test_score))
knitr::kable(avg_by_student)
| Linda |
85.66667 |
| Mike |
83.66667 |
| Sam |
79.83333 |
| Esther |
73.33333 |
| Mary |
66.16667 |
# Counts per gender
gender_counts <- unique(ds1[c("id", "gender")]) %>%
group_by(gender) %>%
summarize(count = n())
knitr::kable(gender_counts)
# Averages by gender
avg_by_gender <- ds1 %>%
group_by(gender) %>%
summarise(avg_test_score = mean(score)) %>%
arrange(desc(avg_test_score))
knitr::kable(avg_by_gender)
Loading Data Source 2: Religion and Income
I should note that I did not realize that RStudio has a built in
table relig_income, but it had more religions than the example I
originally provided, so I just recreated the example I provided for
consistency.
relig_inc <- read_csv('Raw_Religion_Income.csv')
## Rows: 5 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): religion
## dbl (10): <10k, 10-20k, 20-30k, 30-40k, 40-50k, 50-75k, 75-100k, 100-150k, >...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Cleaning Data Source 2
relig_inc <- relig_inc %>%
melt(id.vars = 'religion') %>%
mutate(income_level = variable, count = value) %>%
select(-variable, -value)
knitr::kable(head(relig_inc))
| Agnostic |
<10k |
27 |
| Athiest |
<10k |
12 |
| Buddhist |
<10k |
27 |
| Catholic |
<10k |
418 |
| refused |
<10k |
15 |
| Agnostic |
10-20k |
34 |
Writing relig_inc to a csv
write.csv(relig_inc, "Religion_&_Income_Levels.csv")
The suggested analysis for Data Source Two
The suggested analysis was, “As far as analysis goes you could group
by religion and see what the religious makeup of all respondents was by
percentages.”
# Percent of respondents at each income level by religion
percents <- relig_inc %>%
group_by(religion) %>%
mutate(percent = round(100 * count/sum(count), 2)) %>%
ungroup() %>%
arrange(religion)
knitr::kable(head(percents))
| Agnostic |
<10k |
27 |
3.27 |
| Agnostic |
10-20k |
34 |
4.12 |
| Agnostic |
20-30k |
60 |
7.26 |
| Agnostic |
30-40k |
81 |
9.81 |
| Agnostic |
40-50k |
76 |
9.20 |
| Agnostic |
50-75k |
137 |
16.59 |
Loading Data Source 3:
votes <- read_csv("presidential_vote_tallies.csv")
## Rows: 4 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Candidate
## dbl (2): CA, FL
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Cleaning voter data and writing to a CSV
votes <- votes %>%
melt(id.vars = 'Candidate') %>%
rename(State = variable, Number_of_votes = value)
knitr::kable(head(votes))
| Hillary Clinton |
CA |
5931283 |
| Donald Trump |
CA |
3184721 |
| Gary Johnson |
CA |
308392 |
| Jill Stein |
CA |
166311 |
| Hillary Clinton |
FL |
4485745 |
| Donald Trump |
FL |
4605515 |
write.csv(votes,"vote_totals_by_state.csv")
Analysis of voter data
The requested analysis was, “As per analysis, we can compare the vote
counts in each state vs candidate.” So I decided to looks at the
percentage results by state for each candidate.
voting_percents <- votes %>%
group_by(State) %>%
mutate(percent = round(100 * Number_of_votes/sum(Number_of_votes), 2)) %>%
ungroup() %>%
select(-Number_of_votes)
knitr::kable(voting_percents)
| Hillary Clinton |
CA |
61.84 |
| Donald Trump |
CA |
33.21 |
| Gary Johnson |
CA |
3.22 |
| Jill Stein |
CA |
1.73 |
| Hillary Clinton |
FL |
47.92 |
| Donald Trump |
FL |
49.20 |
| Gary Johnson |
FL |
2.20 |
| Jill Stein |
FL |
0.68 |
Graphing voter data
ggplot(data = voting_percents, aes(x = State, y = percent, fill = Candidate)) +
geom_bar(stat = "identity", position = position_dodge(), alpha = 0.75) +
ylim(0,65) +
geom_text(aes(label = percent), fontface = "bold", vjust = 1.5,
position = position_dodge(.9), size = 4) +
labs(x = "State", title = "Candidate Percentages by State")

Total Votes
I wanted to highlight the difference between popular opinion v.s. the
electoral process.
vote_totals <-votes %>%
group_by(Candidate) %>%
summarise(total_votes = sum(Number_of_votes)) %>%
mutate(total_percents = round(100 * total_votes/sum(total_votes),2)) %>%
select(-total_votes)
knitr::kable(head(vote_totals))
| Donald Trump |
41.11 |
| Gary Johnson |
2.71 |
| Hillary Clinton |
54.97 |
| Jill Stein |
1.22 |
ggplot(data = vote_totals, aes(x = Candidate, y = total_percents, fill = Candidate)) +
geom_bar(stat = "identity", position = position_dodge(), alpha = 0.75) +
ylim(0,60) +
geom_text(aes(label = total_percents), fontface = "bold", vjust = 1.5,
position = position_dodge(.9), size = 4) +
labs(x = "Candidate", title = "Candidate Total Percentages")

While this analysis only represents two states out of all fifty, it
hopefully illustrates the disconnect between popular opinion and the
president who may be elected. Trump did not win the popular vote,
meaning a minority of American’s stated using a specific single choice
voting method that they wanted Trump to be president and yet Trump
won.
Conclusion:
In this project I cleaned three data sources and performed analysis
on each. I displayed students average grades by individual student and
then by student gender. I went on to provide the percentage breakdowns
of income by religion and concluded my project with an analysis of 2016
presidential election vote count from Florida and California. In my
analysis, I found that tidyr, dplyr, and reshape2 were especially
helpful packages. I had also used the function “melt” once before, and
having used it in the past I relied heavily on it for the purpose of
this project.