In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions. GitHub repository:
https://github.com/acatlin/SPRING2023TIDYVERSE
FiveThirtyEight.com datasets.
Kaggle datasets.
Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)
Later, you’ll be asked to extend an existing vignette. Using one of your classmates’ examples (as created above), you’ll extend that example with additional annotated code. (15 points) You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository.
You should also update the README.md file with your example. After you’ve created your vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded. You should complete your submission on the schedule stated in the course syllabus.
We are going to use the following packages: dplyr, ggplot2, and readr.
We took a dataset from FiveThirtyEight on recent college graduates. For each major, it shows how many graduates are employed, unemployed, and so on.
library(ggplot2)
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/recent-grads.csv"
raw_data <- read_csv(url)
## Rows: 173 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Major, Major_category
## dbl (19): Rank, Major_code, Total, Men, Women, ShareWomen, Sample_size, Empl...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
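The column-specification message above can be silenced, or the types can be pinned down explicitly. The following is a minimal sketch, not the vignette's own code: it uses a tiny inline CSV with made-up rows (not the real file) so it runs offline, and it assumes readr 2.0 or later, where `I()` marks a string as literal data.

```r
library(readr)

# Hypothetical three-row CSV standing in for recent-grads.csv (values are made up)
csv_text <- "Major,Major_category,Sample_size,Unemployment_rate
MAJOR A,Engineering,36,0.018
MAJOR B,Arts,7,0.117
MAJOR C,Health,120,0.045"

# Option 1: keep type guessing but suppress the message
quiet_data <- read_csv(I(csv_text), show_col_types = FALSE)

# Option 2: declare the types explicitly so unexpected values
# are reported as parsing problems instead of silently changing a column's type
typed_data <- read_csv(
  I(csv_text),
  col_types = cols(
    Major = col_character(),
    Major_category = col_character(),
    Sample_size = col_integer(),
    Unemployment_rate = col_double()
  )
)
```

For the real file, the same `show_col_types` or `col_types` argument could be passed alongside `url`.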
# head(raw_data)
glimpse(raw_data)
## Rows: 173
## Columns: 21
## $ Rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ Major_code <dbl> 2419, 2416, 2415, 2417, 2405, 2418, 6202, 5001, 2…
## $ Major <chr> "PETROLEUM ENGINEERING", "MINING AND MINERAL ENGI…
## $ Total <dbl> 2339, 756, 856, 1258, 32260, 2573, 3777, 1792, 91…
## $ Men <dbl> 2057, 679, 725, 1123, 21239, 2200, 2110, 832, 803…
## $ Women <dbl> 282, 77, 131, 135, 11021, 373, 1667, 960, 10907, …
## $ Major_category <chr> "Engineering", "Engineering", "Engineering", "Eng…
## $ ShareWomen <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1073132, 0.341…
## $ Sample_size <dbl> 36, 7, 3, 16, 289, 17, 51, 10, 1029, 631, 399, 14…
## $ Employed <dbl> 1976, 640, 648, 758, 25694, 1857, 2912, 1526, 764…
## $ Full_time <dbl> 1849, 556, 558, 1069, 23170, 2038, 2924, 1085, 71…
## $ Part_time <dbl> 270, 170, 133, 150, 5180, 264, 296, 553, 13101, 1…
## $ Full_time_year_round <dbl> 1207, 388, 340, 692, 16697, 1449, 2482, 827, 5463…
## $ Unemployed <dbl> 37, 85, 16, 40, 1672, 400, 308, 33, 4650, 3895, 2…
## $ Unemployment_rate <dbl> 0.018380527, 0.117241379, 0.024096386, 0.05012531…
## $ Median <dbl> 110000, 75000, 73000, 70000, 65000, 65000, 62000,…
## $ P25th <dbl> 95000, 55000, 50000, 43000, 50000, 50000, 53000, …
## $ P75th <dbl> 125000, 90000, 105000, 80000, 75000, 102000, 7200…
## $ College_jobs <dbl> 1534, 350, 456, 529, 18314, 1142, 1768, 972, 5284…
## $ Non_college_jobs <dbl> 364, 257, 176, 102, 4440, 657, 314, 500, 16384, 1…
## $ Low_wage_jobs <dbl> 193, 50, 0, 0, 972, 244, 259, 220, 3253, 3170, 98…
We first sort raw_data by Unemployment_rate in descending order and keep the top 10 rows to see which majors stand out.
df1 <- raw_data %>%
  arrange(desc(Unemployment_rate)) %>%
  head(10)

# geom_col() plots values already in the data; `width` sets the bar thickness
# (geom_bar() has no `height` parameter, which is why it was being ignored)
ggplot(df1, aes(x = reorder(Major, Unemployment_rate), y = Unemployment_rate)) +
  geom_col(fill = "steelblue", width = 0.8) +
  labs(title = "Top 10 College Majors with \nHighest Unemployment Rates",
       x = "Major",
       y = "Unemployment Rate") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  coord_flip()
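The sort-then-take-the-top-rows step used to build df1 can also be written with dplyr's `slice_max()` (available since dplyr 1.0.0). This is a small sketch on made-up data, only to show that the two forms pick the same rows; the `toy` tibble and its values are hypothetical.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  Major = c("A", "B", "C", "D"),
  Unemployment_rate = c(0.05, 0.12, 0.02, 0.09)
)

# Form used in this vignette: sort descending, then keep the first rows
top2_arrange <- toy %>%
  arrange(desc(Unemployment_rate)) %>%
  head(2)

# Equivalent single step: keep the n rows with the largest values
top2_slice <- toy %>%
  slice_max(Unemployment_rate, n = 2)
```

One difference worth knowing: by default `slice_max()` keeps ties (`with_ties = TRUE`), so it can return more than `n` rows, whereas `head()` always cuts at exactly `n`.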
This seems a little odd, since Nuclear Engineering seems like a very employable major. To understand why the majors are ranked this way, we next look at how many people were sampled in each of the majors above.
# Same ordering as the previous plot, but the bars now show the sample size
ggplot(df1, aes(x = reorder(Major, Unemployment_rate), y = Sample_size)) +
  geom_col(fill = "steelblue", width = 0.8) +
  labs(title = "Sample Size of Majors",
       x = "Major",
       y = "Sample Size") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  coord_flip()
That makes more sense: fewer than 25 nuclear engineers were sampled, while Architecture had more than 350. With so small a sample, the unemployment rate reported for Nuclear Engineering may be unreliable.
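One way to act on this observation is to drop majors with very small samples before ranking them. The sketch below uses made-up rows, and the cutoff `min_sample` is a hypothetical choice for illustration, not a value taken from the dataset or its documentation.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- tibble(
  Major = c("SMALL SAMPLE MAJOR", "LARGE SAMPLE MAJOR", "MID SAMPLE MAJOR"),
  Sample_size = c(17, 362, 90),
  Unemployment_rate = c(0.18, 0.11, 0.06)
)

# Hypothetical cutoff: ignore majors with fewer respondents than this
min_sample <- 50

reliable_top <- toy %>%
  filter(Sample_size >= min_sample) %>%   # keep only well-sampled majors
  arrange(desc(Unemployment_rate))        # then rank by unemployment rate
```

Applied to raw_data, the same `filter()` call could be placed before the `arrange()` step that builds df1.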
Next we look at the majors with the most graduates in jobs that require a college degree (the College_jobs column).
df2 <- raw_data %>%
  arrange(desc(College_jobs)) %>%
  head(10)

ggplot(df2, aes(x = reorder(Major, College_jobs), y = College_jobs)) +
  geom_col(fill = "steelblue", width = 0.5) +
  labs(title = "Top 10 College Majors by Number of College Jobs",
       x = "Major",
       y = "Number of College Jobs") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  coord_flip()
Overall, Nursing graduates hold more than 150,000 jobs that require a college degree, and Computer Science graduates hold about 60,000.
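Raw counts like these favor majors with many graduates. As a possible follow-up, `mutate()` can turn the counts into a share so that majors of different sizes are comparable. This is only a sketch on made-up numbers; the column `college_share` is a hypothetical name, computed from the dataset's College_jobs and Non_college_jobs columns.

```r
library(dplyr)

# Made-up counts for illustration only
toy <- tibble(
  Major = c("BIG MAJOR", "SMALL MAJOR"),
  College_jobs = c(150000, 9000),
  Non_college_jobs = c(50000, 1000)
)

with_share <- toy %>%
  # share of graduates' jobs that require a college degree
  mutate(college_share = College_jobs / (College_jobs + Non_college_jobs)) %>%
  arrange(desc(college_share))
```

In this toy example the smaller major ranks first by share even though the larger one has far more college jobs in absolute terms.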
ggplot2, which is part of the Tidyverse, is a versatile tool for creating visualizations in R. Its grammar of graphics lets you build highly customizable, professional-quality plots by layering components such as geoms, labels, and themes, as the examples above show.