CUNY SPS - Master of Science in Data Science

TidyVerse Assignment: Collaborating Around a Code Project with GitHub.

Assignment Description: In this assignment we will practice collaborating around a code project with GitHub. We will create an example using one or more TidyVerse packages and demonstrate how to use their capabilities. We will also extend an existing example from one of our classmates' code with additional annotated code.

Load Data

We will use a college majors dataset from ‘fivethirtyeight.com’ and we will load it from their GitHub repository:

# load data
all_ages <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv", header = TRUE)

View the Data

Let’s view the dataset before we start transforming it:

head(all_ages)

##   Major_code                                 Major
## 1       1100                   GENERAL AGRICULTURE
## 2       1101 AGRICULTURE PRODUCTION AND MANAGEMENT
## 3       1102                AGRICULTURAL ECONOMICS
## 4       1103                       ANIMAL SCIENCES
## 5       1104                          FOOD SCIENCE
## 6       1105            PLANT SCIENCE AND AGRONOMY
##                    Major_category  Total Employed
## 1 Agriculture & Natural Resources 128148    90245
## 2 Agriculture & Natural Resources  95326    76865
## 3 Agriculture & Natural Resources  33955    26321
## 4 Agriculture & Natural Resources 103549    81177
## 5 Agriculture & Natural Resources  24280    17281
## 6 Agriculture & Natural Resources  79409    63043
##   Employed_full_time_year_round Unemployed Unemployment_rate Median P25th
## 1                         74078       2423        0.02614711  50000 34000
## 2                         64240       2266        0.02863606  54000 36000
## 3                         22810        821        0.03024832  63000 40000
## 4                         64937       3619        0.04267890  46000 30000
## 5                         12722        894        0.04918845  62000 38500
## 6                         51077       2070        0.03179089  50000 35000
##   P75th
## 1 80000
## 2 80000
## 3 98000
## 4 72000
## 5 90000
## 6 75000

Now I will transform the dataset using functions from the dplyr and ggplot2 packages:

dplyr::select()

Usage: select(data, …)

Extract columns as a table. Also select_if(). Example: select(iris, Sepal.Length, Species)

Example

I will extract 7 of the 11 original columns from the dataset:

library(tidyverse)

## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

all <- select(all_ages, Major_code, Major, Major_category, Employed, Unemployed, Unemployment_rate, Median)
head(all)

##   Major_code                                 Major
## 1       1100                   GENERAL AGRICULTURE
## 2       1101 AGRICULTURE PRODUCTION AND MANAGEMENT
## 3       1102                AGRICULTURAL ECONOMICS
## 4       1103                       ANIMAL SCIENCES
## 5       1104                          FOOD SCIENCE
## 6       1105            PLANT SCIENCE AND AGRONOMY
##                    Major_category Employed Unemployed Unemployment_rate
## 1 Agriculture & Natural Resources    90245       2423        0.02614711
## 2 Agriculture & Natural Resources    76865       2266        0.02863606
## 3 Agriculture & Natural Resources    26321        821        0.03024832
## 4 Agriculture & Natural Resources    81177       3619        0.04267890
## 5 Agriculture & Natural Resources    17281        894        0.04918845
## 6 Agriculture & Natural Resources    63043       2070        0.03179089
##   Median
## 1  50000
## 2  54000
## 3  63000
## 4  46000
## 5  62000
## 6  50000
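The usage note for select() also mentions select_if(), which picks columns by a predicate rather than by name. Here is a minimal sketch, assuming a small hypothetical data frame (`majors`) built by hand instead of the full dataset:

```r
library(dplyr)

# Hypothetical toy data frame; the values mirror two rows of the dataset
majors <- data.frame(
  Major      = c("FOOD SCIENCE", "ANIMAL SCIENCES"),
  Employed   = c(17281, 81177),
  Unemployed = c(894, 3619),
  stringsAsFactors = FALSE
)

# select_if() keeps the columns for which the predicate returns TRUE
numeric_cols <- select_if(majors, is.numeric)
names(numeric_cols)

# select() with a minus sign drops a column instead of keeping it
without_unemployed <- select(majors, -Unemployed)
names(without_unemployed)
```

Selecting by name, as in the example above, is usually easier to read; a predicate helps when the columns of interest share a type rather than a naming pattern.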

dplyr::mutate()

Usage: mutate(data, …)

Compute new column(s). The functions used inside mutate() take vectors as input and return vectors of the same length as output. Example: mutate(mtcars, gpm = 1/mpg)

Example

With this function I create a new column (Employment_rate), calculated by dividing Employed by the sum of Employed and Unemployed:

all <- mutate(all, Employment_rate=Employed/(Employed+Unemployed))
head(all)

##   Major_code                                 Major
## 1       1100                   GENERAL AGRICULTURE
## 2       1101 AGRICULTURE PRODUCTION AND MANAGEMENT
## 3       1102                AGRICULTURAL ECONOMICS
## 4       1103                       ANIMAL SCIENCES
## 5       1104                          FOOD SCIENCE
## 6       1105            PLANT SCIENCE AND AGRONOMY
##                    Major_category Employed Unemployed Unemployment_rate
## 1 Agriculture & Natural Resources    90245       2423        0.02614711
## 2 Agriculture & Natural Resources    76865       2266        0.02863606
## 3 Agriculture & Natural Resources    26321        821        0.03024832
## 4 Agriculture & Natural Resources    81177       3619        0.04267890
## 5 Agriculture & Natural Resources    17281        894        0.04918845
## 6 Agriculture & Natural Resources    63043       2070        0.03179089
##   Median Employment_rate
## 1  50000       0.9738529
## 2  54000       0.9713639
## 3  63000       0.9697517
## 4  46000       0.9573211
## 5  62000       0.9508116
## 6  50000       0.9682091
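As a quick sanity check on the new column, Employment_rate and Unemployment_rate should add up to 1 if both use Employed + Unemployed as the denominator. The sketch below assumes a hypothetical data frame (`sample_rows`) holding the first three rows copied from the output above:

```r
library(dplyr)

# First three rows copied from the printed output (not re-downloaded)
sample_rows <- data.frame(
  Employed          = c(90245, 76865, 26321),
  Unemployed        = c(2423, 2266, 821),
  Unemployment_rate = c(0.02614711, 0.02863606, 0.03024832)
)

checked <- mutate(
  sample_rows,
  Employment_rate = Employed / (Employed + Unemployed),
  # A column created earlier in the same mutate() call can be reused here
  Rate_sum = Employment_rate + Unemployment_rate
)
checked
```

For these rows the sum is 1 up to the rounding of the printed rates, which suggests that Unemployment_rate in the dataset was computed with the same denominator.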

dplyr::count()

Usage: count(x, …, wt = NULL, sort = FALSE)

Count the number of rows in each group defined by the variables in …. Also tally(). Example: count(iris, Species)

Example

This function will help me count how many majors there are in each major category:

all_count <- count(all, Major_category)
all_count

## # A tibble: 16 x 2
##    Major_category                          n
##    <fct>                               <int>
##  1 Agriculture & Natural Resources        10
##  2 Arts                                    8
##  3 Biology & Life Science                 14
##  4 Business                               13
##  5 Communications & Journalism             4
##  6 Computers & Mathematics                11
##  7 Education                              16
##  8 Engineering                            29
##  9 Health                                 12
## 10 Humanities & Liberal Arts              15
## 11 Industrial Arts & Consumer Services     7
## 12 Interdisciplinary                       1
## 13 Law & Public Policy                     5
## 14 Physical Sciences                      10
## 15 Psychology & Social Work                9
## 16 Social Science                          9
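count() can be read as a shortcut for grouping and then summarising with n(). The sketch below compares the three spellings on a small hypothetical data frame (`toy`), so the names and values are illustrative only:

```r
library(dplyr)

# Hypothetical toy data: three majors in two categories
toy <- data.frame(
  Major          = c("FOOD SCIENCE", "ANIMAL SCIENCES", "FINE ARTS"),
  Major_category = c("Agriculture & Natural Resources",
                     "Agriculture & Natural Resources",
                     "Arts"),
  stringsAsFactors = FALSE
)

# Three ways to obtain the same per-category counts
by_count   <- count(toy, Major_category)
by_summary <- summarise(group_by(toy, Major_category), n = n())
by_tally   <- tally(group_by(toy, Major_category))

# sort = TRUE orders the result from the largest group to the smallest
count(toy, Major_category, sort = TRUE)
```

All three results hold one row per category with the count in column n; count() simply saves the explicit group_by() step.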

ggplot2::ggplot()

Usage: ggplot(data = DATA) + GEOM_FUNCTION(mapping = aes(MAPPINGS), stat = STAT, position = POSITION) + COORDINATE_FUNCTION + FACET_FUNCTION + SCALE_FUNCTION + THEME_FUNCTION

Uses a data frame as input for a graphic and specifies the set of plot aesthetics intended to be common to all subsequent layers unless specifically overridden. Example: ggplot(data = mpg, aes(x = cty, y = hwy)) begins a plot that you finish by adding layers to it.

Example

I used the ggplot() function below to visualize the number of majors in each major category. On top of the bars I added further layers for the title, the axis names, the rotated axis text, and a value label for each bar:

# bar chart of majors per category, ordered from the largest count to the smallest
ggplot(all_count, aes(x = reorder(Major_category, -n), y = n)) +
  geom_bar(stat = "identity", width = 0.5, fill = "tomato2") +
  labs(x = "Major Category", y = "Count", title = "Count of Major Categories") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 8)) +
  # label each bar with its count; refer to the column n directly instead of all_count$n
  geom_label(aes(label = n), position = position_dodge(width = 0.1), size = 3, label.padding = unit(0.1, "lines"), label.size = 0.09, inherit.aes = TRUE)
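To relate the usage template to concrete code, here is a minimal sketch that fills several of its slots one at a time. It assumes a hypothetical data frame (`toy_count`) holding the first three counts from the table above, and it uses geom_col(), which is the ggplot2 shorthand for geom_bar(stat = "identity"):

```r
library(ggplot2)

# Hypothetical toy counts (first three rows of the count table above)
toy_count <- data.frame(
  Major_category = c("Agriculture & Natural Resources", "Arts",
                     "Biology & Life Science"),
  n = c(10, 8, 14)
)

p <- ggplot(data = toy_count) +                        # DATA
  geom_col(mapping = aes(x = Major_category, y = n)) + # GEOM_FUNCTION + MAPPINGS
  coord_flip() +                                       # COORDINATE_FUNCTION
  scale_y_continuous(breaks = seq(0, 14, by = 2)) +    # SCALE_FUNCTION
  theme_minimal()                                      # THEME_FUNCTION
p
```

Flipping the coordinates is one possible alternative to rotating the axis text when category names are long; it is a design choice for this sketch, not part of the original example.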

CUNY SPS - Master of Science in Data Science - DATA607

Mario Pena

December 01, 2019
