Assignment Description: In this assignment we will practice collaborating on a code project with GitHub. We will create an example using one or more TidyVerse packages and demonstrate how to use their capabilities. We will also extend an existing example from a classmate’s code with additional annotated code.
We will use a college majors dataset from ‘fivethirtyeight.com’ and we will load it from their GitHub repository:
# load data
all_ages <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv", header = TRUE)
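One caveat worth sketching: whether read.csv() returns text columns as factors depends on the R version (the default for stringsAsFactors changed to FALSE in R 4.0.0), and the <fct> column type that appears in the count() output further down suggests the text columns were read as factors in this run. The snippet below uses a hypothetical two-row inline CSV (csv_text is an illustrative name, not part of the assignment) to show both behaviours explicitly.

```r
# Hypothetical inline sample; the real file is loaded from GitHub above
csv_text <- "Major_code,Major,Major_category
1100,GENERAL AGRICULTURE,Agriculture & Natural Resources
1101,AGRICULTURE PRODUCTION AND MANAGEMENT,Agriculture & Natural Resources"

# Text columns become factors
as_factor <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

# Text columns stay character vectors
as_char <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

class(as_factor$Major_category)
class(as_char$Major_category)
```

Setting stringsAsFactors explicitly keeps the result the same across R versions.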
Let’s view the dataset before we start transforming it:
head(all_ages)
##   Major_code                                 Major
## 1       1100                   GENERAL AGRICULTURE
## 2       1101 AGRICULTURE PRODUCTION AND MANAGEMENT
## 3       1102                AGRICULTURAL ECONOMICS
## 4       1103                       ANIMAL SCIENCES
## 5       1104                          FOOD SCIENCE
## 6       1105            PLANT SCIENCE AND AGRONOMY
##                    Major_category  Total Employed
## 1 Agriculture & Natural Resources 128148    90245
## 2 Agriculture & Natural Resources  95326    76865
## 3 Agriculture & Natural Resources  33955    26321
## 4 Agriculture & Natural Resources 103549    81177
## 5 Agriculture & Natural Resources  24280    17281
## 6 Agriculture & Natural Resources  79409    63043
##   Employed_full_time_year_round Unemployed Unemployment_rate Median P25th
## 1                         74078       2423        0.02614711  50000 34000
## 2                         64240       2266        0.02863606  54000 36000
## 3                         22810        821        0.03024832  63000 40000
## 4                         64937       3619        0.04267890  46000 30000
## 5                         12722        894        0.04918845  62000 38500
## 6                         51077       2070        0.03179089  50000 35000
##   P75th
## 1 80000
## 2 80000
## 3 98000
## 4 72000
## 5 90000
## 6 75000
Now I will transform and plot the dataset using functions from the dplyr and ggplot2 packages:
select(): Extract columns as a table. Also select_if(). Example call: select(iris, Sepal.Length, Species)
Example
I will extract 7 of the 11 original columns from the dataset:
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
all <- select(all_ages, Major_code, Major, Major_category, Employed, Unemployed, Unemployment_rate, Median)
head(all)
##   Major_code                                 Major
## 1       1100                   GENERAL AGRICULTURE
## 2       1101 AGRICULTURE PRODUCTION AND MANAGEMENT
## 3       1102                AGRICULTURAL ECONOMICS
## 4       1103                       ANIMAL SCIENCES
## 5       1104                          FOOD SCIENCE
## 6       1105            PLANT SCIENCE AND AGRONOMY
##                    Major_category Employed Unemployed Unemployment_rate
## 1 Agriculture & Natural Resources    90245       2423        0.02614711
## 2 Agriculture & Natural Resources    76865       2266        0.02863606
## 3 Agriculture & Natural Resources    26321        821        0.03024832
## 4 Agriculture & Natural Resources    81177       3619        0.04267890
## 5 Agriculture & Natural Resources    17281        894        0.04918845
## 6 Agriculture & Natural Resources    63043       2070        0.03179089
##   Median
## 1  50000
## 2  54000
## 3  63000
## 4  46000
## 5  62000
## 6  50000
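Since the description above also mentions select_if(), here is a minimal sketch of three ways to pick columns: by name, by a name-matching helper, and by a predicate on the column contents. The data frame majors_sample is a hypothetical three-row stand-in that copies a few values from the head() output above; it is not the full dataset.

```r
library(dplyr)

# Hypothetical three-row sample mirroring a few columns of all_ages
majors_sample <- data.frame(
  Major_code = c(1100, 1101, 1102),
  Major = c("GENERAL AGRICULTURE",
            "AGRICULTURE PRODUCTION AND MANAGEMENT",
            "AGRICULTURAL ECONOMICS"),
  Employed = c(90245, 76865, 26321),
  Unemployed = c(2423, 2266, 821),
  stringsAsFactors = FALSE
)

# Select by name
by_name <- select(majors_sample, Major, Employed)

# Select with a helper: every column whose name ends in "mployed"
by_helper <- select(majors_sample, ends_with("mployed"))

# Select with a predicate: keep only the numeric columns
numeric_only <- select_if(majors_sample, is.numeric)

names(by_name)
names(by_helper)
names(numeric_only)
```

Helpers and predicates are convenient when the columns of interest share a naming pattern or a type, so they do not have to be listed one by one.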
mutate(): Compute new column(s). Vectorized functions take vectors as input and return vectors of the same length as output. Example call: mutate(mtcars, gpm = 1/mpg)
Example
With this function I create a new column (Employment_rate), calculated by dividing Employed by the sum of Employed and Unemployed:
all <- mutate(all, Employment_rate=Employed/(Employed+Unemployed))
head(all)
##   Major_code                                 Major
## 1       1100                   GENERAL AGRICULTURE
## 2       1101 AGRICULTURE PRODUCTION AND MANAGEMENT
## 3       1102                AGRICULTURAL ECONOMICS
## 4       1103                       ANIMAL SCIENCES
## 5       1104                          FOOD SCIENCE
## 6       1105            PLANT SCIENCE AND AGRONOMY
##                    Major_category Employed Unemployed Unemployment_rate
## 1 Agriculture & Natural Resources    90245       2423        0.02614711
## 2 Agriculture & Natural Resources    76865       2266        0.02863606
## 3 Agriculture & Natural Resources    26321        821        0.03024832
## 4 Agriculture & Natural Resources    81177       3619        0.04267890
## 5 Agriculture & Natural Resources    17281        894        0.04918845
## 6 Agriculture & Natural Resources    63043       2070        0.03179089
##   Median Employment_rate
## 1  50000       0.9738529
## 2  54000       0.9713639
## 3  63000       0.9697517
## 4  46000       0.9573211
## 5  62000       0.9508116
## 6  50000       0.9682091
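As a quick sanity check on the new column, the sketch below recomputes Employment_rate for the first three rows shown above and adds it to the Unemployment_rate already in the data. The data frame rates_sample and the column rate_sum are illustrative names, and the check only covers these three rows; it relies on mutate() letting a later expression use a column created earlier in the same call.

```r
library(dplyr)

# First three rows of the relevant columns, copied from the output above
rates_sample <- data.frame(
  Employed = c(90245, 76865, 26321),
  Unemployed = c(2423, 2266, 821),
  Unemployment_rate = c(0.02614711, 0.02863606, 0.03024832)
)

rates_sample <- mutate(
  rates_sample,
  # Same formula as in the example above
  Employment_rate = Employed / (Employed + Unemployed),
  # Uses the column created on the previous line
  rate_sum = Employment_rate + Unemployment_rate
)

rates_sample$rate_sum
```

For these rows the two rates add up to 1 within rounding, which is what one would expect if Unemployment_rate was computed as Unemployed / (Employed + Unemployed).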
count(): Count number of rows in each group defined by the variables in… Also tally(). Example call: count(iris, Species)
Example
This function will help me count how many majors there are in each major category:
all_count <- count(all, Major_category)
all_count
## # A tibble: 16 x 2
##    Major_category                          n
##    <fct>                               <int>
##  1 Agriculture & Natural Resources        10
##  2 Arts                                    8
##  3 Biology & Life Science                 14
##  4 Business                               13
##  5 Communications & Journalism             4
##  6 Computers & Mathematics                11
##  7 Education                              16
##  8 Engineering                            29
##  9 Health                                 12
## 10 Humanities & Liberal Arts              15
## 11 Industrial Arts & Consumer Services     7
## 12 Interdisciplinary                       1
## 13 Law & Public Policy                     5
## 14 Physical Sciences                      10
## 15 Psychology & Social Work                9
## 16 Social Science                          9
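The description above notes that tally() is a related function. As a minimal sketch, count() gives the same counts as grouping first and then calling tally(), and its sort argument orders the result by frequency. The data frame category_sample is a made-up five-row example, not taken from the dataset.

```r
library(dplyr)

# Hypothetical five-row sample: two Arts majors and three Health majors
category_sample <- data.frame(
  Major = c("A", "B", "C", "D", "E"),
  Major_category = c("Arts", "Arts", "Health", "Health", "Health"),
  stringsAsFactors = FALSE
)

# One step: group and count together
via_count <- count(category_sample, Major_category)

# Two steps: group explicitly, then tally the rows in each group
via_tally <- category_sample %>%
  group_by(Major_category) %>%
  tally()

# Largest groups first
sorted <- count(category_sample, Major_category, sort = TRUE)

via_count
via_tally
sorted
```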
ggplot(): Use a data frame as input for a graphic and specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden. For example, ggplot(data = mpg, aes(x = cty, y = hwy)) begins a plot that you finish by adding layers to.
Example
The ggplot() call below visualizes the number of majors in each major category. I have added layers for the bars and the count labels, set the title and axis names, and rotated the x-axis text so the category names stay readable:
# Bar chart of the number of majors per category, ordered from most to fewest
ggplot(all_count, aes(x = reorder(Major_category, -n), y = n)) +
  geom_bar(stat = "identity", width = 0.5, fill = "tomato2") +
  # Label each bar with its count; refer to the column n directly instead of all_count$n
  geom_label(aes(label = n), position = position_dodge(width = 0.1), size = 3, label.padding = unit(0.1, "lines"), label.size = 0.09) +
  labs(x = "Major Category", y = "Count", title = "Number of Majors per Major Category") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 8))
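As an alternative layout, long category names can be easier to read on a horizontal bar chart. The sketch below is one possible variation rather than part of the original example: it uses geom_col(), which plots the values as given, together with coord_flip(). The data frame count_sample is a hypothetical three-row subset whose counts are copied from the count() output above.

```r
library(ggplot2)

# Hypothetical subset of the category counts shown earlier
count_sample <- data.frame(
  Major_category = c("Engineering", "Education", "Interdisciplinary"),
  n = c(29, 16, 1)
)

p <- ggplot(count_sample, aes(x = reorder(Major_category, n), y = n)) +
  # geom_col() draws bar heights directly from the y values
  geom_col(fill = "tomato2", width = 0.5) +
  # Print each count just past the end of its bar
  geom_text(aes(label = n), hjust = -0.3, size = 3) +
  # Swap the axes so the category names read horizontally
  coord_flip() +
  labs(x = "Major Category", y = "Number of Majors", title = "Majors per Category")

print(p)
```

With the axes flipped, the rotated axis text used in the vertical version is no longer needed.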