Project 2

Introduction:

For this project I chose three data sources to clean and perform analysis. The data sources are student test scores, presidential voting data, and different religions’ income break downs. All three data sets are examples of wide data and in this project I will attempt to make the data more usable analysis and then go on to conduct said analysis.

Loading Necessary Packages

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

library(stringr)
library(ggplot2)

Loading Data Source 1: Student Testing Data (Alice Ding)

ds1_link <- 'https://gist.githubusercontent.com/Kimmirikwa/b69d0ea134820ea52f8481991ffae93e/raw/4db7b1698035ee29885d10e1a59bd902716ae168/student_results.csv'

ds1 <- read_csv(ds1_link)

## Rows: 10 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, sex and age, test number
## dbl (5): id, phone, term 1, term 2, term 3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning Data Source 1

ds1 <- ds1 %>%
  melt(id.vars = c('id', 'name', 'phone', 'sex and age','test number')) %>%
  rename(`term number` = variable, score = value)

ds1 <- ds1 %>%
  separate(`sex and age`, into = c("gender", "age"), sep = "_") %>%
  mutate(age = as.numeric(age))

knitr::kable(head(ds1))

id	name	phone	gender	age	test number	term number	score
1	Mike	134	m	12	test 1	term 1	76
2	Linda	270	f	13	test 1	term 1	88
3	Sam	210	m	11	test 1	term 1	78
4	Esther	617	f	12	test 1	term 1	68
5	Mary	114	f	14	test 1	term 1	65
1	Mike	134	m	12	test 2	term 1	85

Converting ds1 to a csv

write.csv(ds1, 'Students_Test_Scores.csv')

The suggested analysis for Data Source One

The suggested analysis was, “We could find the average test score for each student or average test score overall. We could also count how many students are of each sex and find the average age as well.”

# Average Test Score
avg_by_student <- ds1 %>%
  group_by(name) %>%
  summarise(avg_test_score = mean(score)) %>%
  arrange(desc(avg_test_score))

knitr::kable(avg_by_student)

name	avg_test_score
Linda	85.66667
Mike	83.66667
Sam	79.83333
Esther	73.33333
Mary	66.16667

# Counts per gender
gender_counts <- unique(ds1[c("id", "gender")]) %>%
  group_by(gender) %>%
  summarize(count = n())

knitr::kable(gender_counts)

gender	count
f	3
m	2

# Averages by gender
avg_by_gender <- ds1 %>%
  group_by(gender) %>%
  summarise(avg_test_score = mean(score)) %>%
  arrange(desc(avg_test_score))

knitr::kable(avg_by_gender)

gender	avg_test_score
m	81.75000
f	75.05556

Loading Data Source 2: Religion and Income

I should note that I did not realize that RStudio has a built in table relig_income, but it had more religions than the example I originally provided, so I just recreated the example I provided for consistency.

relig_inc <- read_csv('Raw_Religion_Income.csv')

## Rows: 5 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): religion
## dbl (10): <10k, 10-20k, 20-30k, 30-40k, 40-50k, 50-75k, 75-100k, 100-150k, >...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning Data Source 2

relig_inc <- relig_inc %>%
  melt(id.vars = 'religion') %>%
  mutate(income_level = variable, count = value) %>%
  select(-variable, -value)

knitr::kable(head(relig_inc))

religion	income_level	count
Agnostic	<10k	27
Athiest	<10k	12
Buddhist	<10k	27
Catholic	<10k	418
refused	<10k	15
Agnostic	10-20k	34

Writing relig_inc to a csv

write.csv(relig_inc, "Religion_&_Income_Levels.csv")

The suggested analysis for Data Source Two

The suggested analysis was, “As far as analysis goes you could group by religion and see what the religious makeup of all respondents was by percentages.”

# Percent of respondents at each income level by religion

percents <- relig_inc %>%
  group_by(religion) %>%
  mutate(percent = round(100 * count/sum(count), 2)) %>%
  ungroup() %>%
  arrange(religion)

knitr::kable(head(percents))

religion	income_level	count	percent
Agnostic	<10k	27	3.27
Agnostic	10-20k	34	4.12
Agnostic	20-30k	60	7.26
Agnostic	30-40k	81	9.81
Agnostic	40-50k	76	9.20
Agnostic	50-75k	137	16.59

Loading Data Source 3:

votes <- read_csv("presidential_vote_tallies.csv")

## Rows: 4 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Candidate
## dbl (2): CA, FL
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning voter data and writing to a CSV

votes <- votes %>%
  melt(id.vars = 'Candidate') %>%
  rename(State = variable, Number_of_votes = value)

knitr::kable(head(votes))

Candidate	State	Number_of_votes
Hillary Clinton	CA	5931283
Donald Trump	CA	3184721
Gary Johnson	CA	308392
Jill Stein	CA	166311
Hillary Clinton	FL	4485745
Donald Trump	FL	4605515

write.csv(votes,"vote_totals_by_state.csv")

Analysis of voter data

The requested analysis was, “As per analysis, we can compare the vote counts in each state vs candidate.” So I decided to looks at the percentage results by state for each candidate.

voting_percents <- votes %>%
  group_by(State) %>%
  mutate(percent = round(100 * Number_of_votes/sum(Number_of_votes), 2)) %>%
  ungroup() %>%
  select(-Number_of_votes) 

knitr::kable(voting_percents)

Candidate	State	percent
Hillary Clinton	CA	61.84
Donald Trump	CA	33.21
Gary Johnson	CA	3.22
Jill Stein	CA	1.73
Hillary Clinton	FL	47.92
Donald Trump	FL	49.20
Gary Johnson	FL	2.20
Jill Stein	FL	0.68

Graphing voter data

ggplot(data = voting_percents, aes(x = State, y = percent, fill = Candidate)) +
  geom_bar(stat = "identity", position = position_dodge(), alpha = 0.75)  +
  ylim(0,65) + 
  geom_text(aes(label = percent), fontface = "bold", vjust = 1.5,
             position = position_dodge(.9), size = 4) +
  labs(x = "State", title = "Candidate Percentages by State")

Total Votes

I wanted to highlight the difference between popular opinion v.s. the electoral process.

vote_totals <-votes %>%
  group_by(Candidate) %>%
  summarise(total_votes = sum(Number_of_votes)) %>%
  mutate(total_percents = round(100 * total_votes/sum(total_votes),2)) %>%
  select(-total_votes)

knitr::kable(head(vote_totals))

Candidate	total_percents
Donald Trump	41.11
Gary Johnson	2.71
Hillary Clinton	54.97
Jill Stein	1.22

ggplot(data = vote_totals, aes(x = Candidate, y = total_percents, fill = Candidate)) +
  geom_bar(stat = "identity", position = position_dodge(), alpha = 0.75)  +
  ylim(0,60) + 
  geom_text(aes(label = total_percents), fontface = "bold", vjust = 1.5,
             position = position_dodge(.9), size = 4) +
  labs(x = "Candidate", title = "Candidate Total Percentages")

While this analysis only represents two states out of all fifty, it hopefully illustrates the disconnect between popular opinion and the president who may be elected. Trump did not win the popular vote, meaning a minority of American’s stated using a specific single choice voting method that they wanted Trump to be president and yet Trump won.

Conclusion:

In this project I cleaned three data sources and performed analysis on each. I displayed students average grades by individual student and then by student gender. I went on to provide the percentage breakdowns of income by religion and concluded my project with an analysis of 2016 presidential election vote count from Florida and California. In my analysis, I found that tidyr, dplyr, and reshape2 were especially helpful packages. I had also used the function “melt” once before, and having used it in the past I relied heavily on it for the purpose of this project.