In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions. Resources: the GitHub repository (https://github.com/acatlin/SPRING2023TIDYVERSE), FiveThirtyEight.com datasets, and Kaggle datasets.
Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)
Later, you’ll be asked to extend an existing vignette. Using one of your classmates’ examples (as created above), you’ll then extend his or her example with additional annotated code. (15 points)
You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example. After you’ve created your vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded.
The tidyverse package is a collection of packages that includes ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats.
For this assignment, we will be using the following packages.
Package | Function |
---|---|
readr | read_csv() |
dplyr | glimpse(), group_by(), summarise(), mutate() |
ggplot2 | ggplot(), geom_bar(), scale_x_continuous(), scale_y_continuous(), labs(), xlab(), ylab(), ggtitle(), theme(), coord_flip() |
DT | datatable() |
We will analyze the college majors dataset from FiveThirtyEight. The data is stored in a CSV file in the FiveThirtyEight data repository (https://github.com/fivethirtyeight/data/tree/master/college-majors).
Use read_csv() from the readr package to read the CSV file as a data frame. Use <- to store the data as a variable in R. Below, we store the data as raw_data. You can use glimpse() from the dplyr package to get a glimpse of the data.
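library(readr)
library(dplyr)  # provides %>%, glimpse(), group_by(), summarise(), mutate()
library(DT)
raw_data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv")
# Print each column with its type and first few values
glimpse(raw_data)
# Show the data as an interactive, horizontally scrollable table
datatable(raw_data, options = list(scrollX = TRUE))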
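As a small offline check of the same read-and-inspect pattern, the sketch below reads CSV text from a literal string instead of a URL. The column names mirror a few of those used later in this vignette, but the three rows are made-up illustration values, not figures from the FiveThirtyEight file. Wrapping the text in I() and passing show_col_types assume readr 2.0 or later.

```r
library(readr)
library(dplyr)

# Made-up sample rows for illustration only (not real data).
csv_text <- "Major,Major_category,Employed,Unemployed
Major A,Category X,900,100
Major B,Category X,450,50
Major C,Category Y,300,
"

# I() marks the string as literal data rather than a file path (readr >= 2.0).
sample_data <- read_csv(I(csv_text), show_col_types = FALSE)

# glimpse() prints one line per column with its type and first values.
glimpse(sample_data)
```

The empty last field in the third row is read as NA, which is why the later sums in this vignette pass na.rm = TRUE.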
We need to create additional variables. Find the total number of recent graduates who were employed and unemployed. Find the total number of recent graduates whose jobs did or did not require a college degree.
To do this, we extend our previous code with additional sums inside summarise().
college_major <- raw_data %>%
  group_by(Major_category) %>%
  summarise(total = sum(Total, na.rm = TRUE),
            Men = sum(Men, na.rm = TRUE),
            Women = sum(Women, na.rm = TRUE),
            Employed = sum(Employed, na.rm = TRUE),
            Full_time = sum(Full_time, na.rm = TRUE),
            Part_time = sum(Part_time, na.rm = TRUE),
            Full_time_year_round = sum(Full_time_year_round, na.rm = TRUE),
            Unemployed = sum(Unemployed, na.rm = TRUE),
            College_jobs = sum(College_jobs, na.rm = TRUE),
            Non_college_jobs = sum(Non_college_jobs, na.rm = TRUE),
            Low_wage_jobs = sum(Low_wage_jobs, na.rm = TRUE))
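Because every column above gets the same sum(..., na.rm = TRUE) treatment, one possible shorthand is dplyr's across(), sketched here on a made-up three-row tibble (the values are illustrative, not from the dataset). This is an alternative to the code above rather than the approach this vignette uses, and it assumes dplyr 1.0 or later.

```r
library(dplyr)

# Made-up sample rows for illustration only (not real data).
toy <- tibble(
  Major_category = c("X", "X", "Y"),
  Employed       = c(900, 450, 300),
  Unemployed     = c(100, 50, NA)
)

# across() applies the same summary to every listed column,
# so sum(..., na.rm = TRUE) is written only once.
toy_totals <- toy %>%
  group_by(Major_category) %>%
  summarise(across(c(Employed, Unemployed), ~ sum(.x, na.rm = TRUE)))

toy_totals
```

Note that across() keeps the original column names, so a column such as Total would stay Total instead of being renamed to total as in the code above.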
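Use mutate() to create a new variable, Unemployment_rate. To find the unemployment rate, divide the number of unemployed graduates by the sum of unemployed and employed graduates: $\text{Unemployment.rate} = \dfrac{\text{Unemployed}}{\text{Unemployed} + \text{Employed}}$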
college_major <- raw_data %>%
  group_by(Major_category) %>%
  summarise(total = sum(Total, na.rm = TRUE),
            Men = sum(Men, na.rm = TRUE),
            Women = sum(Women, na.rm = TRUE),
            Employed = sum(Employed, na.rm = TRUE),
            Full_time = sum(Full_time, na.rm = TRUE),
            Part_time = sum(Part_time, na.rm = TRUE),
            Full_time_year_round = sum(Full_time_year_round, na.rm = TRUE),
            Unemployed = sum(Unemployed, na.rm = TRUE),
            College_jobs = sum(College_jobs, na.rm = TRUE),
            Non_college_jobs = sum(Non_college_jobs, na.rm = TRUE),
            Low_wage_jobs = sum(Low_wage_jobs, na.rm = TRUE)) %>%
  mutate(Unemployment_rate = Unemployed / (Unemployed + Employed))

datatable(college_major, options = list(scrollX = TRUE))
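The rate computed this way is a pooled rate: counts are summed within each category first, then divided. The sketch below, on two made-up majors in one hypothetical category, shows the arithmetic and contrasts it with a simple average of per-major rates, which weights small majors as heavily as large ones. The column Mean_of_rates is introduced here only for comparison and is not part of the vignette's data.

```r
library(dplyr)

# Made-up sample rows for illustration only (not real data).
toy <- tibble(
  Major_category = c("X", "X"),
  Employed       = c(900, 80),
  Unemployed     = c(100, 20)
)

toy_rate <- toy %>%
  group_by(Major_category) %>%
  summarise(
    # Computed first, while Employed/Unemployed still hold per-major values:
    # mean of 100/1000 and 20/100.
    Mean_of_rates = mean(Unemployed / (Unemployed + Employed)),
    Employed      = sum(Employed, na.rm = TRUE),
    Unemployed    = sum(Unemployed, na.rm = TRUE)
  ) %>%
  # Pooled rate, as in the vignette: 120 / (120 + 980).
  mutate(Unemployment_rate = Unemployed / (Unemployed + Employed))

toy_rate
```

Here the pooled rate is 120/1100 (about 10.9%), while the mean of the two per-major rates is 15%.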
Here is another approach to creating our bar graph. For the first question, we plotted Major_category on the y-axis. For this question, we can plot Major_category on the x-axis.
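However, a first version of the plot, produced by the code below, has some issues: the major category labels (Major_category) on the x-axis are illegible, the title is not centered, and the categories are not ordered by unemployment rate.
library(ggplot2)
ggplot(data = college_major, aes(x = Major_category, y = Unemployment_rate)) +
  geom_bar(stat = "identity") +
  labs(x = "Major Category", y = "Unemployment Rate") +
  ggtitle("Unemployment Rate") +
  scale_y_continuous(labels = scales::percent)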
Problem | Solution |
---|---|
Illegible Major Category | Keep Major_category on the x-axis and rotate the labels 45° using + theme(axis.text.x=element_text(angle=45,hjust=1)) |
Center the Title | Use + theme(plot.title=element_text(hjust=0.5)) |
Reorder Major Category | Use reorder() to reorder the major categories by unemployment rate |
ggplot(data = college_major, aes(x = reorder(Major_category, Unemployment_rate), y = Unemployment_rate)) +
  geom_bar(stat = "identity", fill = "seagreen", color = "black") +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Unemployment Rate") + xlab("Major Category") + ylab("Unemployment Rate") +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 45, hjust = 1))
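To see what reorder() does to the axis order, it can help to inspect the factor levels directly. The sketch below uses three made-up categories and rates (illustrative values only). Since ggplot2 draws a discrete axis in level order, changing the levels changes the order of the bars.

```r
# Made-up values for illustration only (not real data).
category <- c("X", "Y", "Z")
rate     <- c(0.09, 0.05, 0.07)

# Default factor levels are alphabetical.
levels(factor(category))          # "X" "Y" "Z"

# reorder() sorts the levels by the second argument, ascending.
levels(reorder(category, rate))   # "Y" "Z" "X"

# Negating the values reverses the order (largest rate first).
levels(reorder(category, -rate))  # "X" "Z" "Y"
```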
Problem | Solution |
---|---|
Illegible Major Category | Use coord_flip() to flip the axes |
Center the Title | Use + theme(plot.title=element_text(hjust=0.5)) |
Reorder Major Category | Use reorder() to reorder the major categories by unemployment rate |
ggplot(data = college_major, aes(x = reorder(Major_category, -Unemployment_rate), y = Unemployment_rate)) +
  geom_bar(stat = "identity", fill = "seagreen", color = "black") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Unemployment Rate") + xlab("Major Category") + ylab("Unemployment Rate") +
  theme(plot.title = element_text(hjust = 0.5))  # center the title, as listed in the table
The three major categories with the highest unemployment rates are Social Science, Arts, and Humanities & Liberal Arts. The three major categories with the lowest unemployment rates are Education, Physical Sciences, and Agriculture & Natural Resources.
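Rankings like these can also be read off programmatically rather than from the plot. The sketch below uses dplyr's slice_max() and slice_min() (available in dplyr 1.0 and later) on a made-up five-row table; the category names and rates are placeholders, not results from the dataset. Under the assumption that college_major has been built as above, the same two calls could be applied to it.

```r
library(dplyr)

# Made-up sample rows for illustration only (not real data).
toy <- tibble(
  Major_category    = c("A", "B", "C", "D", "E"),
  Unemployment_rate = c(0.04, 0.09, 0.06, 0.11, 0.05)
)

# Three categories with the highest rate, sorted from highest down.
highest <- toy %>% slice_max(Unemployment_rate, n = 3)

# Three categories with the lowest rate, sorted from lowest up.
lowest <- toy %>% slice_min(Unemployment_rate, n = 3)

highest$Major_category  # "D" "B" "C"
lowest$Major_category   # "A" "E" "C"
```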
We can also reuse the code from question 1 and plot Major_category on the y-axis: replace ShareWomen with Unemployment_rate and rename the axes accordingly.
ggplot(data = college_major, aes(y = reorder(Major_category, -Unemployment_rate), x = Unemployment_rate)) +
  geom_bar(stat = "identity", fill = "seagreen", color = "black") +
  labs(x = "Unemployment Rate", y = "Major Category") +
  ggtitle("Unemployment Rate") +
  scale_x_continuous(labels = scales::percent)
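For readers who want to try the y-axis layout without downloading the data, here is a minimal sketch on a made-up three-row data frame (illustrative values only). It swaps in geom_col(), which ggplot2 documents as the bar geom that uses the values in the data as bar heights, so it plays the same role here as geom_bar(stat = "identity"). This is one possible variant, not the vignette's own code.

```r
library(ggplot2)

# Made-up sample rows for illustration only (not real data).
toy <- data.frame(
  Major_category    = c("X", "Y", "Z"),
  Unemployment_rate = c(0.09, 0.05, 0.07)
)

# Categories on the y-axis, ordered by rate; rates shown as percentages.
p <- ggplot(toy, aes(x = Unemployment_rate,
                     y = reorder(Major_category, Unemployment_rate))) +
  geom_col(fill = "seagreen", color = "black") +
  scale_x_continuous(labels = scales::percent) +
  labs(x = "Unemployment Rate", y = "Major Category",
       title = "Unemployment Rate") +
  theme(plot.title = element_text(hjust = 0.5))

p
```

Mapping the category to y directly, instead of using coord_flip(), relies on ggplot2 3.3.0 or later.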
Question 1: Which major category has the highest share of women?
The three major categories with the highest share of women are Health, Education, and Psychology & Social Work.
Question 2: Which major category has the lowest unemployment rate?
The three major categories with the highest unemployment rates are Social Science, Arts, and Humanities & Liberal Arts. The three major categories with the lowest unemployment rates are Education, Physical Sciences, and Agriculture & Natural Resources.
https://github.com/fivethirtyeight/data/tree/master/college-majors
Below are links to get more information related to the functions in the mentioned packages.
https://www.tidyverse.org/packages/
https://dplyr.tidyverse.org/reference/index.html
https://ggplot2.tidyverse.org/reference/