The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work.
Your task is to:
Choose any three of the “wide” datasets identified in the Week 5 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!) For each of the three chosen datasets:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Please include in your homework submission, for each of the three chosen datasets:
The URL to the .Rmd file in your GitHub repository, and
The URL for your rpubs.com web page.
R packages used in this project:
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(ggpubr)
library(usmap)
library(lubridate)
The first data set is provied by Mael Illien in Discussion 5. We will look at the preference of candies.
Firstly we load the data into R.
data <- read.csv('https://raw.githubusercontent.com/oggyluky11/-DATA607-Project-2/master/candyhierarchy2017.csv', stringsAsFactors = FALSE)
head(data)
Secondly, we do the following data cleaning process.
1. Select the gender column as well as the columns in Question 6 which present the preference of candies by each responser.
2. Rename the gender column.
3. gather the columns of candies into 2 columns, one named 'candy_names' with names of candies, the other named 'preference' with values of JOY, MEH or DESPAIR.
4. remove the pattern 'Q6...' from all candy names and trim the value of in the column 'preference'.
5. filter row to select gender column with values 'Male' or 'Female' as well as to remove empty values in column 'preference'.
6. group the dataset by gender, candy name and preference.
7. count the number of JOY, MEH and DESPAIR in each group.
8. spread the column 'preference'.
data1 <- data %>% select(starts_with('Q2'), starts_with('Q6')) %>%
rename(gender = Q2..GENDER) %>%
gather('candy_name','preference',2:104) %>%
mutate(candy_name= str_replace(candy_name, 'Q6\\.{3}(.+)','\\1'),
preference = str_trim(preference, side = 'both')) %>%
filter(!preference == '' & (gender == 'Male' | gender == 'Female')) %>%
group_by(gender,candy_name,preference) %>%
tally() %>%
spread(preference,n)
data1
Now we create some plots for the cleaned data.
1. Top 10 candies with preference ‘JOY’ selected by male / female.
From the plot, we see that
1) Both male and female's most popular choices of candy is any full sized candy bar.
2) There are 9 kinds of candies are selected by both male and female, which are any full sized candy bar,Reeseos Peanut Butter Cups, Kit Kat, Snickers, Cash or other forms of legal tender (funny), Twix, Peanut M Mos, Tolberone something or other and Lindt Truffle.
3) There are more male than female response to the survey.
p1=ggplot(arrange(filter(data1, gender == 'Male'),desc(JOY))[1:10,], aes(x=reorder(candy_name, JOY), y = JOY, fill = JOY))+
geom_bar(stat = 'identity')+
coord_flip()+
geom_text(aes(label=JOY),vjust=0.3, hjust=1.20, position = position_dodge(width = 1), color="white",size = 3)+
xlab("Candy Name")+ggtitle("Top 10 Candies with 'JOY' by Male")
p2=ggplot(arrange(filter(data1, gender == 'Female'),desc(JOY))[1:10,], aes(x=reorder(candy_name, JOY), y = JOY, fill = JOY))+
geom_bar(stat = 'identity')+
coord_flip()+
geom_text(aes(label=JOY),vjust=0.3, hjust=1.20, position = position_dodge(width = 1), color="white", size = 3)+
xlab("Candy Name")+ggtitle("Top 10 Candies with 'JOY' by Female")+
scale_fill_gradient(low = 'deeppink4', high = 'deeppink1')
ggarrange(p1,p2,nrow=2)
2. Top 10 Candies with preference ‘DESPAIR’ selected by male /female.
From the plot, we see that
1) Both male and female are most despaired with broken glow stick.
2) although the orders are different, exactly the same items are shown in the top 10 list of male and female.
3) combine with the observations in the first plot, we know both male and female and similar favor and similar 'despair' in candies.
p3=ggplot(arrange(filter(data1, gender == 'Male'),desc(DESPAIR))[1:10,], aes(x=reorder(candy_name, DESPAIR), y = DESPAIR, fill = DESPAIR))+
geom_bar(stat = 'identity')+
coord_flip()+
geom_text(aes(label=DESPAIR),vjust=0.3, hjust=1.20, position = position_dodge(width = 1), color="white",size = 3)+
xlab("Candy Name")+ggtitle("Top 10 Candies with 'DESPAIR' by Male")
p4=ggplot(arrange(filter(data1, gender == 'Female'),desc(DESPAIR))[1:10,], aes(x=reorder(candy_name, DESPAIR), y = DESPAIR, fill = DESPAIR))+
geom_bar(stat = 'identity')+
coord_flip()+
geom_text(aes(label=DESPAIR),vjust=0.3, hjust=1.20, position = position_dodge(width = 1), color="white", size = 3)+
xlab("Candy Name")+ggtitle("Top 10 Candies with 'DESPAIR' by Female")+
scale_fill_gradient(low = 'deeppink4', high = 'deeppink1')
ggarrange(p3,p4,nrow=2)
3. Top 10 Candies with preference ‘MEH’ selected by male /female.
1) Lollipops most commonly rated as 'Meh' by male and female.
2) There are 6 candies are commonly in the top 10 list of 'Meh' by male and female including Lollipops, Hard Candy, Bonkers the candy, Minibags of chips, Chiclets and Now n laters.
p5=ggplot(arrange(filter(data1, gender == 'Male'),desc(MEH))[1:10,], aes(x=reorder(candy_name, MEH), y = MEH, fill = MEH))+
geom_bar(stat = 'identity')+
coord_flip()+
geom_text(aes(label=MEH),vjust=0.3, hjust=1.20, position = position_dodge(width = 1), color="white",size = 3)+
xlab("Candy Name")+ggtitle("Top 10 Candies with 'MEH' by Male")
p6=ggplot(arrange(filter(data1, gender == 'Female'),desc(MEH))[1:10,], aes(x=reorder(candy_name, MEH), y = MEH, fill = MEH))+
geom_bar(stat = 'identity')+
coord_flip()+
geom_text(aes(label=MEH),vjust=0.3, hjust=1.20, position = position_dodge(width = 1), color="white", size = 3)+
xlab("Candy Name")+ggtitle("Top 10 Candies with 'MEH' by Female")+
scale_fill_gradient(low = 'deeppink4', high = 'deeppink1')
ggarrange(p5,p6,nrow=2)
The second data set is provied by Md Forhad Akbar in Discussion 5. The dataset is student grands and loans by location.
Firstly we load the dataset into R
data2 <- read.csv('https://raw.githubusercontent.com/oggyluky11/-DATA607-Project-2/master/AY18Disbursements-by-Location.csv', stringsAsFactors = FALSE)
head(data)
Secondly, we do the following data cleaning process.
1. remove the first 6 rows from data which are original titles and empty rows.
2. Rename the colums.
3. gather all columns with number values, remove '$' and ',' signs from the values, change the data type of the values from character to numeric, then spread the data back to original form.
4. add new columns grant.disbursements.per.recipient, loans.disbursements.per.recipient and total.disbursements.per.recipient
5. reorder the columns.
data2 <- data2 %>%
slice(7:n()) %>%
rename(state = X2017.2018.New.Disbursements.by.Location,
grant.disbursements = X,
grant.recipients = X.1,
loans.disbursements = X.2,
loans.recipients = X.3,
total.disbursements = X.4,
total.recipients = X.5) %>%
gather('column.name', 'column.value', 2:7) %>%
mutate(column.value= as.numeric(str_remove_all(column.value, '[$,]'))) %>%
spread(column.name,column.value) %>%
mutate(grant.disbursements.per.recipient = grant.disbursements/grant.recipients,
loans.disbursements.per.recipient = loans.disbursements/loans.recipients,
total.disbursements.per.recipient = total.disbursements/total.recipients) %>%
select(state,grant.disbursements,grant.recipients,grant.disbursements.per.recipient,
loans.disbursements,loans.recipients,loans.disbursements.per.recipient,
total.disbursements,total.recipients,total.disbursements.per.recipient)
data2
Plot the grant disbursements
1) from the bar plot and map plot below, we can see the top 5 states of grant disbursements are California, Texas, New York, Floirda and Georgia.
p7=ggplot(data2, aes(x=reorder(state,desc(grant.disbursements)), y=grant.disbursements, fill = grant.disbursements))+
geom_bar(stat = 'identity')+
xlab('State')+ylab('Disbursement')+
theme(axis.text.x = element_text(angle=90, hjust=1, size=6))+
scale_fill_gradient(low = 'white', high = 'deeppink4', name='disbursement')
p8=plot_usmap(data=data2, values= 'grant.disbursements', color= 'gray')+
scale_fill_gradient(low = 'white', high = 'deeppink4', name='disbursement')+
ggtitle('Grant Disbursements by State')+
theme(legend.position = 'right')
#p8=plot_usmap(data=data2, values= 'grant.recipients', color= 'gray')+scale_fill_gradient(low = 'white', high = 'navyblue')
ggarrange(p8,p7, nrow =2)
2) from the point plot below we see there is a strong positive relationship between grant disbursements amounts and number of recipients.
ggplot(data2, aes(x=grant.recipients, y = grant.disbursements,color='black', fill = grant.disbursements))+
geom_point(shape = 21)+
scale_fill_gradient(low = 'white', high = 'deeppink4')
3) By computing the average disbursement per recipient, we see that the top 5 states of grant disbursement per recipient are Puerto Rico, New York, Mississippi, West Virginia and New Jersey.
p9=ggplot(data2, aes(x=reorder(state,desc(grant.disbursements.per.recipient)), y=grant.disbursements.per.recipient, fill = grant.disbursements.per.recipient))+
geom_bar(stat = 'identity')+
xlab('State')+ylab('Disbursement')+
theme(axis.text.x = element_text(angle=90, hjust=1, size=6))+
scale_fill_gradient(low = 'white', high = 'deeppink4',name='disbursement per recipient')
p10=plot_usmap(data=data2, values= 'grant.disbursements.per.recipient', color= 'gray')+
scale_fill_gradient(low = 'white', high = 'deeppink4', name='disbursement per recipient')+
ggtitle('Grant Disbursements Per Recipient by State')+
theme(legend.position = 'right')
#p8=plot_usmap(data=data2, values= 'grant.recipients', color= 'gray')+scale_fill_gradient(low = 'white', high = 'navyblue')
ggarrange(p10,p9, nrow =2)
Plot the loan disbursements
1) from the bar plot and map plot below, we can see the top 5 states of loan disbursements are California, Texas, New York, Floirda and Pennsylvania.
p7=ggplot(data2, aes(x=reorder(state,desc(loans.disbursements)), y=loans.disbursements, fill = loans.disbursements))+
geom_bar(stat = 'identity')+
xlab('State')+ylab('Disbursement')+
theme(axis.text.x = element_text(angle=90, hjust=1, size=6))+
scale_fill_gradient(low = 'white', high = 'deeppink4', name='disbursement')
p8=plot_usmap(data=data2, values= 'loans.disbursements', color= 'gray')+
scale_fill_gradient(low = 'white', high = 'deeppink4', name='disbursement')+
ggtitle('Loan Disbursements by State')+
theme(legend.position = 'right')
ggarrange(p8,p7, nrow =2)
2) from the point plot below we see there is also a strong positive relationship between loan disbursements amounts and number of recipients.
ggplot(data2, aes(x=loans.recipients, y = loans.disbursements,color='black', fill = loans.disbursements))+
geom_point(shape = 21)+
scale_fill_gradient(low = 'white', high = 'deeppink4')
3) By computing the average disbursement per recipient, we see that the top 5 states of loans disbursement per recipient are District of Columbia, California, Maryland, Hawaii and New York.
p9=ggplot(data2, aes(x=reorder(state,desc(loans.disbursements.per.recipient)), y=loans.disbursements.per.recipient, fill = loans.disbursements.per.recipient))+
geom_bar(stat = 'identity')+
xlab('State')+ylab('Disbursement')+
theme(axis.text.x = element_text(angle=90, hjust=1, size=6))+
scale_fill_gradient(low = 'white', high = 'deeppink4',name='disbursement per recipient')
p10=plot_usmap(data=data2, values= 'loans.disbursements.per.recipient', color= 'gray')+
scale_fill_gradient(low = 'white', high = 'deeppink4', name='disbursement per recipient')+
ggtitle('loans Disbursements Per Recipient by State')+
theme(legend.position = 'right')
ggarrange(p10,p9, nrow =2)
Recipients of either grants / loans vs recipients of both
The total recipients in the data does not equal to the sum of grant recepients and loan recepients, we know that not all recipients received both grants and loans. Let’s see the portion of recipients who receive both.
data3 <- data2 %>%
mutate(recipient.both = grant.recipients + loans.recipients - total.recipients) %>%
mutate(recipient.grant.only = grant.recipients-recipient.both,
recipient.loans.only = loans.recipients-recipient.both) %>%
select(state,recipient.grant.only,recipient.loans.only,recipient.both) %>%
gather('recipient.type','recipient.cnt', 2:4)
data3
ggplot(data3,aes(x=recipient.type, y=recipient.cnt,fill=recipient.cnt))+
geom_bar(stat= 'identity')+
scale_fill_gradient(low = 'white', high = 'deeppink4')+
facet_wrap(~state)+
theme(axis.text.x = element_text(angle=90, hjust=1, size=6),
axis.text.y = element_text(size=6))
1) California looks has significantly more recipients who only receive grants compared to recipients who only receive loans or receive both.
2) Pennsylvania looks has significantly less recipients who only receive grants compared to recipients who only recive loans or receive both.
3) generally, the number of recipients who receive grants only, loans only or receive both are very close.
The second data set is provied by Uliana Plotnikova in Discussion 5. The dataset is numbers of birth in US during 1994 and 2003.
Firstly we load the data into R
data4 <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv')
data4
Secondly, we do the following data cleaning process.
We will prepare two sets of data. The first one the numbers of birth are group by year and month combo.
1. Concatnate the columns year, month and date_of_month into one column named 'date' with pattern xxxx-xx-xx.
2. Change the data type of column 'date' to DATE.
3. Group and sum the number of births by month.
The second one the numbers of birth are group by date of week.
1. group the data by day of week/
2. Summarize the dataset using mean function.
data5 <- data4 %>%
mutate(date = str_c(year,'-',str_pad(month,2,pad='0'),'-',str_pad(date_of_month,2,pad='0')),day_of_week,births) %>%
mutate(date = as.Date(date)) %>%
group_by(month=floor_date(date, 'month')) %>%
summarise(births = sum(births))
data5
data6 <- data4 %>%
group_by(day_of_week) %>%
summarise(births = mean(births))
data6
Plot the number of birth during 1994 to 2003
From the plot we see there is a seasonal effect in the number of birth each year.
ggplot(data5, aes(x=month, y = births))+
geom_line()+scale_x_date(date_breaks = "1 month", date_labels = '%b-%y')+
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 4))+
ggtitle("Number of Birth in US During 1994 to 2003")
Plot the number of birth in 2003
Take year 2003 as example, we always have the lowest number of birth in February, and the largest number in July.
ggplot(filter(data5,year(month)==2003), aes(x=month, y = births))+
geom_line()+scale_x_date(date_breaks = "1 month", date_labels = '%b-%y')+
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 10))+geom_smooth()+
ggtitle("Number of Birth in US in Year 2003")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Plot the average number of birth on each day of week
We see that week days always have larger number of births than weekends.
ggplot(data6, aes(x=reorder(day_of_week, desc(births)), y = births, fill=births))+
geom_bar(stat='identity')+
geom_text(aes(label=round(births)),vjust=1.5, hjust=0.5, position = position_dodge(width = 1), color="white",size = 3)+
ggtitle('Average Number of Birth by Day of Week')+
xlab('Day of Week')+ylab('Birth')