TidyVerse: Worked Example

Objective: Create an Example

This vignette uses several TidyVerse packages (readr, dplyr, and ggplot2) with the college majors datasets from fivethirtyeight.com, starting with ‘majors-list’. It is a programming sample that demonstrates one or more capabilities of each selected package on those datasets.

Readr – Reading Files

The first TidyVerse package used here is ‘readr’, which reads the data files. It provides a fast way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV).

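The easiest way to get readr is to install the whole tidyverse with install.packages("tidyverse"); readr can also be installed on its own with install.packages("readr").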

readr supports the following file formats with these read_*() functions:

read_csv(): comma-separated values (CSV) files
read_tsv(): tab-separated values (TSV) files
read_delim(): delimited files (CSV and TSV are important special cases)
read_fwf(): fixed-width files
read_table(): whitespace-separated files
read_log(): web log files
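As a minimal sketch of how these functions are called, the block below writes two small made-up files to a temporary directory and reads them back. The file contents and the object names (majors_csv, majors_tsv, majors_delim) are assumptions for illustration and are not part of the fivethirtyeight data.

```r
library(readr)

# Write two tiny delimited files to temporary paths (made-up data)
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("code,major", "1100,GENERAL AGRICULTURE", "1101,"), csv_path)
writeLines(c("code\tmajor", "1100\tGENERAL AGRICULTURE"), tsv_path)

# col_types fixes the column classes instead of relying on guessing;
# na = "" treats empty fields as missing values
majors_csv <- read_csv(
  csv_path,
  col_types = cols(code = col_character(), major = col_character()),
  na = ""
)

# read_tsv() is the tab-delimited counterpart of read_csv()
majors_tsv <- read_tsv(tsv_path, col_types = cols(.default = col_character()))

# read_delim() is the general form: the delimiter is passed explicitly
majors_delim <- read_delim(
  csv_path,
  delim = ",",
  col_types = cols(.default = col_character()),
  na = ""
)
```

Reading the code column as character keeps any leading zeros intact, which is one reason to spell out col_types rather than accept the guessed types.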

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.1.2

## Warning: package 'stringr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(readr)
library(dplyr)

Fetching the ‘majors-list’ dataset, a lookup table of major codes, from GitHub

college_major <- read_csv("https://raw.githubusercontent.com/uzmabb182/CUNY-SPS-Assignments/main/data_607/tidyverse_assignment/majors-list.csv", na="")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   FOD1P = col_character(),
##   Major = col_character(),
##   Major_Category = col_character()
## )

head(college_major)

## # A tibble: 6 x 3
##   FOD1P Major                                 Major_Category                 
##   <chr> <chr>                                 <chr>                          
## 1 1100  GENERAL AGRICULTURE                   Agriculture & Natural Resources
## 2 1101  AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102  AGRICULTURAL ECONOMICS                Agriculture & Natural Resources
## 4 1103  ANIMAL SCIENCES                       Agriculture & Natural Resources
## 5 1104  FOOD SCIENCE                          Agriculture & Natural Resources
## 6 1105  PLANT SCIENCE AND AGRONOMY            Agriculture & Natural Resources

Using the ‘dplyr’ Package to Rename Columns

For rename(): Use new_name = old_name to rename selected variables.

college_major <- as_tibble(college_major) # no-op here: read_csv() already returns a tibble

college_major <- rename(college_major, Major_Id = FOD1P)

head(college_major)

## # A tibble: 6 x 3
##   Major_Id Major                                 Major_Category                 
##   <chr>    <chr>                                 <chr>                          
## 1 1100     GENERAL AGRICULTURE                   Agriculture & Natural Resources
## 2 1101     AGRICULTURE PRODUCTION AND MANAGEMENT Agriculture & Natural Resources
## 3 1102     AGRICULTURAL ECONOMICS                Agriculture & Natural Resources
## 4 1103     ANIMAL SCIENCES                       Agriculture & Natural Resources
## 5 1104     FOOD SCIENCE                          Agriculture & Natural Resources
## 6 1105     PLANT SCIENCE AND AGRONOMY            Agriculture & Natural Resources
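The same new_name = old_name pattern extends to several columns at once, and rename_with() applies a function to the names instead. The sketch below uses a made-up two-row tibble; the names majors, renamed, lowered, and Major_Name are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Toy tibble with made-up rows; the column names mimic the raw file
majors <- tibble(
  FOD1P = c("1100", "1101"),
  Major = c("GENERAL AGRICULTURE", "AGRICULTURE PRODUCTION AND MANAGEMENT")
)

# Several new_name = old_name pairs can be given in one call
renamed <- majors %>%
  rename(Major_Id = FOD1P, Major_Name = Major)

# rename_with() applies a function to the (selected) column names
lowered <- majors %>%
  rename_with(tolower)

names(renamed)
names(lowered)
```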

Reading the ‘all-ages’ dataset

all_ages <- read_csv("https://raw.githubusercontent.com/uzmabb182/CUNY-SPS-Assignments/main/data_607/tidyverse_assignment/all-ages.csv", na="")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Major_code = col_double(),
##   Major = col_character(),
##   Major_category = col_character(),
##   Total = col_double(),
##   Employed = col_double(),
##   Employed_full_time_year_round = col_double(),
##   Unemployed = col_double(),
##   Unemployment_rate = col_double(),
##   Median = col_double(),
##   P25th = col_double(),
##   P75th = col_double()
## )

head(all_ages)

## # A tibble: 6 x 11
##   Major_code Major   Major_category   Total Employed Employed_full_t~ Unemployed
##        <dbl> <chr>   <chr>            <dbl>    <dbl>            <dbl>      <dbl>
## 1       1100 GENERA~ Agriculture & ~ 128148    90245            74078       2423
## 2       1101 AGRICU~ Agriculture & ~  95326    76865            64240       2266
## 3       1102 AGRICU~ Agriculture & ~  33955    26321            22810        821
## 4       1103 ANIMAL~ Agriculture & ~ 103549    81177            64937       3619
## 5       1104 FOOD S~ Agriculture & ~  24280    17281            12722        894
## 6       1105 PLANT ~ Agriculture & ~  79409    63043            51077       2070
## # ... with 4 more variables: Unemployment_rate <dbl>, Median <dbl>,
## #   P25th <dbl>, P75th <dbl>

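The pipe operator (%>%) chains dplyr verbs to replicate a SQL query: group by major category, average the number employed, and sort in descending order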

summary <- all_ages %>% 
  group_by(Major_category) %>% 
  summarise(Employed = mean(Employed, na.rm = TRUE)) %>% 
  arrange(desc(Employed))

summary

## # A tibble: 16 x 2
##    Major_category                      Employed
##    <chr>                                  <dbl>
##  1 Business                             579219.
##  2 Communications & Journalism          355760.
##  3 Social Science                       207978.
##  4 Health                               182724.
##  5 Education                            177075.
##  6 Humanities & Liberal Arts            166612.
##  7 Arts                                 163587.
##  8 Psychology & Social Work             156887 
##  9 Law & Public Policy                  143785.
## 10 Computers & Mathematics              128237 
## 11 Industrial Arts & Consumer Services  107683.
## 12 Engineering                           90413.
## 13 Physical Sciences                     70713.
## 14 Biology & Life Science                67647 
## 15 Agriculture & Natural Resources       48042.
## 16 Interdisciplinary                     35706
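To make the SQL analogy concrete, here is a small sketch on made-up data with the rough SQL equivalent written as a comment. The table toy, its values, and the output column names n_majors and mean_employed are assumptions for illustration only.

```r
library(dplyr)

# Made-up employment counts for illustration only
toy <- tibble(
  Major_category = c("Arts", "Arts", "Business", "Business", "Business"),
  Employed       = c(100, 300, 1000, NA, 2000)
)

# Rough SQL equivalent:
#   SELECT Major_category, COUNT(*) AS n_majors, AVG(Employed) AS mean_employed
#   FROM toy
#   GROUP BY Major_category
#   ORDER BY mean_employed DESC;
toy_summary <- toy %>%
  group_by(Major_category) %>%
  summarise(
    n_majors      = n(),                          # rows per group
    mean_employed = mean(Employed, na.rm = TRUE)  # NA is skipped, like NULL in AVG()
  ) %>%
  arrange(desc(mean_employed))

toy_summary
```

Each verb maps onto one SQL clause: group_by() to GROUP BY, summarise() to the aggregate expressions in SELECT, and arrange(desc()) to ORDER BY ... DESC.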

Using ‘ggplot2’ to visualize which major category has the most employed people on average

ggplot(summary, aes(x = Employed, y = Major_category, fill = Major_category)) +
  geom_col(position = "dodge")
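A bar chart like this is often easier to read when the bars are sorted by value. The following is a sketch of one way to do that, using three rows copied (rounded) from the summary table above; the object names toy_summary and p and the axis labels are assumptions for illustration.

```r
library(ggplot2)

# Three rows taken (rounded) from the summary table, for illustration
toy_summary <- data.frame(
  Major_category = c("Arts", "Business", "Health"),
  Employed       = c(163587, 579219, 182724)
)

# reorder() sorts the categories by their Employed value, so the bars
# appear in ascending order from bottom to top
p <- ggplot(toy_summary,
            aes(x = Employed, y = reorder(Major_category, Employed))) +
  geom_col() +
  labs(x = "Mean number employed", y = "Major category")

p
```

Leaving out the fill aesthetic also removes the legend, which repeats the information already shown on the category axis.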

Reading the ‘grad-students’ dataset

grad_students <- read_csv("https://raw.githubusercontent.com/uzmabb182/CUNY-SPS-Assignments/main/data_607/tidyverse_assignment/grad-students.csv", na="")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   Major = col_character(),
##   Major_category = col_character()
## )
## i Use `spec()` for the full column specifications.

head(grad_students)

## # A tibble: 6 x 22
##   Major_code Major     Major_category  Grad_total Grad_sample_size Grad_employed
##        <dbl> <chr>     <chr>                <dbl>            <dbl>         <dbl>
## 1       5601 CONSTRUC~ Industrial Art~       9173              200          7098
## 2       6004 COMMERCI~ Arts                 53864              882         40492
## 3       6211 HOSPITAL~ Business             24417              437         18368
## 4       2201 COSMETOL~ Industrial Art~       5411               72          3590
## 5       2001 COMMUNIC~ Computers & Ma~       9109              171          7512
## 6       3201 COURT RE~ Law & Public P~       1542               22          1008
## # ... with 16 more variables: Grad_full_time_year_round <dbl>,
## #   Grad_unemployed <dbl>, Grad_unemployment_rate <dbl>, Grad_median <dbl>,
## #   Grad_P25 <dbl>, Grad_P75 <dbl>, Nongrad_total <dbl>,
## #   Nongrad_employed <dbl>, Nongrad_full_time_year_round <dbl>,
## #   Nongrad_unemployed <dbl>, Nongrad_unemployment_rate <dbl>,
## #   Nongrad_median <dbl>, Nongrad_P25 <dbl>, Nongrad_P75 <dbl>,
## #   Grad_share <dbl>, Grad_premium <dbl>

Mutating joins:

Mutating joins allow you to combine variables from multiple tables.

merge_df <- all_ages %>% left_join(grad_students, by = "Major_code")

merge_df

## # A tibble: 173 x 32
##    Major_code Major.x      Major_category.x    Total Employed Employed_full_tim~
##         <dbl> <chr>        <chr>               <dbl>    <dbl>              <dbl>
##  1       1100 GENERAL AGR~ Agriculture & Nat~ 128148    90245              74078
##  2       1101 AGRICULTURE~ Agriculture & Nat~  95326    76865              64240
##  3       1102 AGRICULTURA~ Agriculture & Nat~  33955    26321              22810
##  4       1103 ANIMAL SCIE~ Agriculture & Nat~ 103549    81177              64937
##  5       1104 FOOD SCIENCE Agriculture & Nat~  24280    17281              12722
##  6       1105 PLANT SCIEN~ Agriculture & Nat~  79409    63043              51077
##  7       1106 SOIL SCIENCE Agriculture & Nat~   6586     4926               4042
##  8       1199 MISCELLANEO~ Agriculture & Nat~   8549     6392               5074
##  9       1301 ENVIRONMENT~ Biology & Life Sc~ 106106    87602              65238
## 10       1302 FORESTRY     Agriculture & Nat~  69447    48228              39613
## # ... with 163 more rows, and 26 more variables: Unemployed <dbl>,
## #   Unemployment_rate <dbl>, Median <dbl>, P25th <dbl>, P75th <dbl>,
## #   Major.y <chr>, Major_category.y <chr>, Grad_total <dbl>,
## #   Grad_sample_size <dbl>, Grad_employed <dbl>,
## #   Grad_full_time_year_round <dbl>, Grad_unemployed <dbl>,
## #   Grad_unemployment_rate <dbl>, Grad_median <dbl>, Grad_P25 <dbl>,
## #   Grad_P75 <dbl>, Nongrad_total <dbl>, Nongrad_employed <dbl>,
## #   Nongrad_full_time_year_round <dbl>, Nongrad_unemployed <dbl>,
## #   Nongrad_unemployment_rate <dbl>, Nongrad_median <dbl>, Nongrad_P25 <dbl>,
## #   Nongrad_P75 <dbl>, Grad_share <dbl>, Grad_premium <dbl>
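The .x and .y suffixes in the output come from columns that exist in both tables but are not part of the join key. A minimal sketch with made-up tables shows where they come from and one way to avoid them; the names ages, grads, joined_default, and joined_clean are hypothetical.

```r
library(dplyr)

# Made-up tables that share the key Major_code and the non-key column Major
ages <- tibble(
  Major_code = c(1100, 1101, 1102),
  Major      = c("A", "B", "C"),
  Total      = c(10, 20, 30)
)
grads <- tibble(
  Major_code = c(1100, 1101),
  Major      = c("A", "B"),
  Grad_total = c(4, 5)
)

# Default behaviour: the shared non-key column becomes Major.x / Major.y,
# and rows of `ages` without a match get NA in the columns from `grads`
joined_default <- left_join(ages, grads, by = "Major_code")

# Dropping the duplicate column before joining avoids the suffixes
joined_clean <- left_join(ages, select(grads, -Major), by = "Major_code")

names(joined_default)
names(joined_clean)
```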

Removing multiple columns by name with dplyr select()

merge_df <- select(merge_df, -c(Major_category.y, Major.y))

head(merge_df)

## # A tibble: 6 x 30
##   Major_code Major.x       Major_category.x    Total Employed Employed_full_tim~
##        <dbl> <chr>         <chr>               <dbl>    <dbl>              <dbl>
## 1       1100 GENERAL AGRI~ Agriculture & Nat~ 128148    90245              74078
## 2       1101 AGRICULTURE ~ Agriculture & Nat~  95326    76865              64240
## 3       1102 AGRICULTURAL~ Agriculture & Nat~  33955    26321              22810
## 4       1103 ANIMAL SCIEN~ Agriculture & Nat~ 103549    81177              64937
## 5       1104 FOOD SCIENCE  Agriculture & Nat~  24280    17281              12722
## 6       1105 PLANT SCIENC~ Agriculture & Nat~  79409    63043              51077
## # ... with 24 more variables: Unemployed <dbl>, Unemployment_rate <dbl>,
## #   Median <dbl>, P25th <dbl>, P75th <dbl>, Grad_total <dbl>,
## #   Grad_sample_size <dbl>, Grad_employed <dbl>,
## #   Grad_full_time_year_round <dbl>, Grad_unemployed <dbl>,
## #   Grad_unemployment_rate <dbl>, Grad_median <dbl>, Grad_P25 <dbl>,
## #   Grad_P75 <dbl>, Nongrad_total <dbl>, Nongrad_employed <dbl>,
## #   Nongrad_full_time_year_round <dbl>, Nongrad_unemployed <dbl>,
## #   Nongrad_unemployment_rate <dbl>, Nongrad_median <dbl>, Nongrad_P25 <dbl>,
## #   Nongrad_P75 <dbl>, Grad_share <dbl>, Grad_premium <dbl>
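select() can drop columns by name, by position, or with a selection helper. The sketch below shows all three on a made-up one-row tibble; df, by_name, by_helper, and by_index are hypothetical names used only here.

```r
library(dplyr)

# Made-up one-row tibble with the kind of names a join produces
df <- tibble(
  Major_code       = 1100,
  Major.x          = "A",
  Major.y          = "A",
  Major_category.x = "Agri",
  Major_category.y = "Agri"
)

# By name
by_name <- select(df, -c(Major_category.y, Major.y))

# By helper: drop every column whose name ends in ".y"
by_helper <- select(df, -ends_with(".y"))

# By position: drop the third and fifth columns
by_index <- select(df, -c(3, 5))

names(by_name)
```

Dropping by position is the most fragile of the three, because it silently removes the wrong columns if the column order changes upstream.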

Change Name of Columns in R with dplyr rename()

merge_df <- merge_df %>%
  rename(Major_category = Major_category.x,
         Major = Major.x
         )

head(merge_df)

## # A tibble: 6 x 30
##   Major_code Major   Major_category   Total Employed Employed_full_t~ Unemployed
##        <dbl> <chr>   <chr>            <dbl>    <dbl>            <dbl>      <dbl>
## 1       1100 GENERA~ Agriculture & ~ 128148    90245            74078       2423
## 2       1101 AGRICU~ Agriculture & ~  95326    76865            64240       2266
## 3       1102 AGRICU~ Agriculture & ~  33955    26321            22810        821
## 4       1103 ANIMAL~ Agriculture & ~ 103549    81177            64937       3619
## 5       1104 FOOD S~ Agriculture & ~  24280    17281            12722        894
## 6       1105 PLANT ~ Agriculture & ~  79409    63043            51077       2070
## # ... with 23 more variables: Unemployment_rate <dbl>, Median <dbl>,
## #   P25th <dbl>, P75th <dbl>, Grad_total <dbl>, Grad_sample_size <dbl>,
## #   Grad_employed <dbl>, Grad_full_time_year_round <dbl>,
## #   Grad_unemployed <dbl>, Grad_unemployment_rate <dbl>, Grad_median <dbl>,
## #   Grad_P25 <dbl>, Grad_P75 <dbl>, Nongrad_total <dbl>,
## #   Nongrad_employed <dbl>, Nongrad_full_time_year_round <dbl>,
## #   Nongrad_unemployed <dbl>, Nongrad_unemployment_rate <dbl>,
## #   Nongrad_median <dbl>, Nongrad_P25 <dbl>, Nongrad_P75 <dbl>,
## #   Grad_share <dbl>, Grad_premium <dbl>

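Reading the ‘grad-students’ dataset again, this time into women_stem

Note that the URL below points to grad-students.csv, so women_stem holds the same data as grad_students; it is not used in the rest of this example.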

women_stem <- read_csv("https://raw.githubusercontent.com/uzmabb182/CUNY-SPS-Assignments/main/data_607/tidyverse_assignment/grad-students.csv", na="")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   Major = col_character(),
##   Major_category = col_character()
## )
## i Use `spec()` for the full column specifications.

head(women_stem)

## # A tibble: 6 x 22
##   Major_code Major     Major_category  Grad_total Grad_sample_size Grad_employed
##        <dbl> <chr>     <chr>                <dbl>            <dbl>         <dbl>
## 1       5601 CONSTRUC~ Industrial Art~       9173              200          7098
## 2       6004 COMMERCI~ Arts                 53864              882         40492
## 3       6211 HOSPITAL~ Business             24417              437         18368
## 4       2201 COSMETOL~ Industrial Art~       5411               72          3590
## 5       2001 COMMUNIC~ Computers & Ma~       9109              171          7512
## 6       3201 COURT RE~ Law & Public P~       1542               22          1008
## # ... with 16 more variables: Grad_full_time_year_round <dbl>,
## #   Grad_unemployed <dbl>, Grad_unemployment_rate <dbl>, Grad_median <dbl>,
## #   Grad_P25 <dbl>, Grad_P75 <dbl>, Nongrad_total <dbl>,
## #   Nongrad_employed <dbl>, Nongrad_full_time_year_round <dbl>,
## #   Nongrad_unemployed <dbl>, Nongrad_unemployment_rate <dbl>,
## #   Nongrad_median <dbl>, Nongrad_P25 <dbl>, Nongrad_P75 <dbl>,
## #   Grad_share <dbl>, Grad_premium <dbl>

Reading the ‘recent-grads’ dataset

recent_grads <- read_csv("https://raw.githubusercontent.com/uzmabb182/CUNY-SPS-Assignments/main/data_607/tidyverse_assignment/recent-grads.csv", na="")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   Major = col_character(),
##   Major_category = col_character()
## )
## i Use `spec()` for the full column specifications.

head(recent_grads)

## # A tibble: 6 x 21
##    Rank Major_code Major Total   Men Women Major_category ShareWomen Sample_size
##   <dbl>      <dbl> <chr> <dbl> <dbl> <dbl> <chr>               <dbl>       <dbl>
## 1     1       2419 PETR~  2339  2057   282 Engineering         0.121          36
## 2     2       2416 MINI~   756   679    77 Engineering         0.102           7
## 3     3       2415 META~   856   725   131 Engineering         0.153           3
## 4     4       2417 NAVA~  1258  1123   135 Engineering         0.107          16
## 5     5       2405 CHEM~ 32260 21239 11021 Engineering         0.342         289
## 6     6       2418 NUCL~  2573  2200   373 Engineering         0.145          17
## # ... with 12 more variables: Employed <dbl>, Full_time <dbl>, Part_time <dbl>,
## #   Full_time_year_round <dbl>, Unemployed <dbl>, Unemployment_rate <dbl>,
## #   Median <dbl>, P25th <dbl>, P75th <dbl>, College_jobs <dbl>,
## #   Non_college_jobs <dbl>, Low_wage_jobs <dbl>

Merging merge_df and recent_grads

merge_df <- merge_df %>% left_join(recent_grads, by = "Major_code")

merge_df

## # A tibble: 173 x 50
##    Major_code Major.x     Major_category.x  Total.x Employed.x Employed_full_ti~
##         <dbl> <chr>       <chr>               <dbl>      <dbl>             <dbl>
##  1       1100 GENERAL AG~ Agriculture & Na~  128148      90245             74078
##  2       1101 AGRICULTUR~ Agriculture & Na~   95326      76865             64240
##  3       1102 AGRICULTUR~ Agriculture & Na~   33955      26321             22810
##  4       1103 ANIMAL SCI~ Agriculture & Na~  103549      81177             64937
##  5       1104 FOOD SCIEN~ Agriculture & Na~   24280      17281             12722
##  6       1105 PLANT SCIE~ Agriculture & Na~   79409      63043             51077
##  7       1106 SOIL SCIEN~ Agriculture & Na~    6586       4926              4042
##  8       1199 MISCELLANE~ Agriculture & Na~    8549       6392              5074
##  9       1301 ENVIRONMEN~ Biology & Life S~  106106      87602             65238
## 10       1302 FORESTRY    Agriculture & Na~   69447      48228             39613
## # ... with 163 more rows, and 44 more variables: Unemployed.x <dbl>,
## #   Unemployment_rate.x <dbl>, Median.x <dbl>, P25th.x <dbl>, P75th.x <dbl>,
## #   Grad_total <dbl>, Grad_sample_size <dbl>, Grad_employed <dbl>,
## #   Grad_full_time_year_round <dbl>, Grad_unemployed <dbl>,
## #   Grad_unemployment_rate <dbl>, Grad_median <dbl>, Grad_P25 <dbl>,
## #   Grad_P75 <dbl>, Nongrad_total <dbl>, Nongrad_employed <dbl>,
## #   Nongrad_full_time_year_round <dbl>, Nongrad_unemployed <dbl>,
## #   Nongrad_unemployment_rate <dbl>, Nongrad_median <dbl>, Nongrad_P25 <dbl>,
## #   Nongrad_P75 <dbl>, Grad_share <dbl>, Grad_premium <dbl>, Rank <dbl>,
## #   Major.y <chr>, Total.y <dbl>, Men <dbl>, Women <dbl>,
## #   Major_category.y <chr>, ShareWomen <dbl>, Sample_size <dbl>,
## #   Employed.y <dbl>, Full_time <dbl>, Part_time <dbl>,
## #   Full_time_year_round <dbl>, Unemployed.y <dbl>, Unemployment_rate.y <dbl>,
## #   Median.y <dbl>, P25th.y <dbl>, P75th.y <dbl>, College_jobs <dbl>,
## #   Non_college_jobs <dbl>, Low_wage_jobs <dbl>
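Before chaining several joins, it can help to check that the key behaves as expected. This is a sketch on made-up tables, not a step from the original analysis; left, right, unmatched, and dup_keys are hypothetical names.

```r
library(dplyr)

# Made-up tables: 1102 has no match on the right, 1101 appears twice there
left  <- tibble(Major_code = c(1100, 1101, 1102))
right <- tibble(Major_code = c(1100, 1101, 1101), Rank = c(1, 2, 3))

# Rows of `left` with no match in `right` (these get NA after a left join)
unmatched <- anti_join(left, right, by = "Major_code")

# Keys that occur more than once in `right` (these duplicate rows after a join)
dup_keys <- right %>%
  count(Major_code) %>%
  filter(n > 1)

# The join result has more rows than `left` because of the duplicated key
joined <- left_join(left, right, by = "Major_code")
nrow(joined)
```

In the vignette the row count stays at 173 after each join, which suggests that Major_code is unique in the joined tables.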

Removing multiple columns by name with dplyr select()

merge_df <- select(merge_df, -c(Major_category.y, Major.y))

merge_df

## # A tibble: 173 x 48
##    Major_code Major.x     Major_category.x  Total.x Employed.x Employed_full_ti~
##         <dbl> <chr>       <chr>               <dbl>      <dbl>             <dbl>
##  1       1100 GENERAL AG~ Agriculture & Na~  128148      90245             74078
##  2       1101 AGRICULTUR~ Agriculture & Na~   95326      76865             64240
##  3       1102 AGRICULTUR~ Agriculture & Na~   33955      26321             22810
##  4       1103 ANIMAL SCI~ Agriculture & Na~  103549      81177             64937
##  5       1104 FOOD SCIEN~ Agriculture & Na~   24280      17281             12722
##  6       1105 PLANT SCIE~ Agriculture & Na~   79409      63043             51077
##  7       1106 SOIL SCIEN~ Agriculture & Na~    6586       4926              4042
##  8       1199 MISCELLANE~ Agriculture & Na~    8549       6392              5074
##  9       1301 ENVIRONMEN~ Biology & Life S~  106106      87602             65238
## 10       1302 FORESTRY    Agriculture & Na~   69447      48228             39613
## # ... with 163 more rows, and 42 more variables: Unemployed.x <dbl>,
## #   Unemployment_rate.x <dbl>, Median.x <dbl>, P25th.x <dbl>, P75th.x <dbl>,
## #   Grad_total <dbl>, Grad_sample_size <dbl>, Grad_employed <dbl>,
## #   Grad_full_time_year_round <dbl>, Grad_unemployed <dbl>,
## #   Grad_unemployment_rate <dbl>, Grad_median <dbl>, Grad_P25 <dbl>,
## #   Grad_P75 <dbl>, Nongrad_total <dbl>, Nongrad_employed <dbl>,
## #   Nongrad_full_time_year_round <dbl>, Nongrad_unemployed <dbl>,
## #   Nongrad_unemployment_rate <dbl>, Nongrad_median <dbl>, Nongrad_P25 <dbl>,
## #   Nongrad_P75 <dbl>, Grad_share <dbl>, Grad_premium <dbl>, Rank <dbl>,
## #   Total.y <dbl>, Men <dbl>, Women <dbl>, ShareWomen <dbl>, Sample_size <dbl>,
## #   Employed.y <dbl>, Full_time <dbl>, Part_time <dbl>,
## #   Full_time_year_round <dbl>, Unemployed.y <dbl>, Unemployment_rate.y <dbl>,
## #   Median.y <dbl>, P25th.y <dbl>, P75th.y <dbl>, College_jobs <dbl>,
## #   Non_college_jobs <dbl>, Low_wage_jobs <dbl>

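Removing the common ‘.x’ suffix from the column names. The dot is escaped and the pattern is anchored with $ so that only a literal ‘.x’ at the end of a name is removed.

colnames(merge_df) <- gsub("\\.x$", "", colnames(merge_df))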
merge_df

## # A tibble: 173 x 48
##    Major_code Major   Major_category  Total Employed Employed_full_t~ Unemployed
##         <dbl> <chr>   <chr>           <dbl>    <dbl>            <dbl>      <dbl>
##  1       1100 GENERA~ Agriculture &~ 128148    90245            74078       2423
##  2       1101 AGRICU~ Agriculture &~  95326    76865            64240       2266
##  3       1102 AGRICU~ Agriculture &~  33955    26321            22810        821
##  4       1103 ANIMAL~ Agriculture &~ 103549    81177            64937       3619
##  5       1104 FOOD S~ Agriculture &~  24280    17281            12722        894
##  6       1105 PLANT ~ Agriculture &~  79409    63043            51077       2070
##  7       1106 SOIL S~ Agriculture &~   6586     4926             4042        264
##  8       1199 MISCEL~ Agriculture &~   8549     6392             5074        261
##  9       1301 ENVIRO~ Biology & Lif~ 106106    87602            65238       4736
## 10       1302 FOREST~ Agriculture &~  69447    48228            39613       2144
## # ... with 163 more rows, and 41 more variables: Unemployment_rate <dbl>,
## #   Median <dbl>, P25th <dbl>, P75th <dbl>, Grad_total <dbl>,
## #   Grad_sample_size <dbl>, Grad_employed <dbl>,
## #   Grad_full_time_year_round <dbl>, Grad_unemployed <dbl>,
## #   Grad_unemployment_rate <dbl>, Grad_median <dbl>, Grad_P25 <dbl>,
## #   Grad_P75 <dbl>, Nongrad_total <dbl>, Nongrad_employed <dbl>,
## #   Nongrad_full_time_year_round <dbl>, Nongrad_unemployed <dbl>,
## #   Nongrad_unemployment_rate <dbl>, Nongrad_median <dbl>, Nongrad_P25 <dbl>,
## #   Nongrad_P75 <dbl>, Grad_share <dbl>, Grad_premium <dbl>, Rank <dbl>,
## #   Total.y <dbl>, Men <dbl>, Women <dbl>, ShareWomen <dbl>, Sample_size <dbl>,
## #   Employed.y <dbl>, Full_time <dbl>, Part_time <dbl>,
## #   Full_time_year_round <dbl>, Unemployed.y <dbl>, Unemployment_rate.y <dbl>,
## #   Median.y <dbl>, P25th.y <dbl>, P75th.y <dbl>, College_jobs <dbl>,
## #   Non_college_jobs <dbl>, Low_wage_jobs <dbl>
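When a suffix is stripped with a regular expression, escaping and anchoring matter, because an unescaped dot matches any character. The short base R sketch below compares the two patterns on made-up column names; cols, loose, and strict are hypothetical names.

```r
# Made-up column names: two carry the join suffix, two merely contain an "x"
cols <- c("Major.x", "Total.x", "Max_value", "Taxes")

# Unescaped pattern: "." matches any character, so "ax" is removed as well
loose <- gsub(".x", "", cols)

# Escaped and anchored pattern: only a literal ".x" at the end is removed
strict <- sub("\\.x$", "", cols)

loose
strict
```

None of the column names in this dataset contains an "x" apart from the suffix, so both patterns give the same result here, but the anchored form is the safer habit.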

Quantiles

Since dplyr 1.0.0, summarise() can return more than one row per group. quantile() was hard to use inside summarise() before that change because it returns multiple values; now it is straightforward. (In dplyr 1.1.0 and later, reframe() is the recommended function for multi-row results.)

merge_df %>% 
  group_by(Major_category) %>% 
  summarise(x = quantile(Grad_unemployment_rate, c(0.25, 0.5, 0.75)), q = c(0.25, 0.5, 0.75))

## `summarise()` has grouped output by 'Major_category'. You can override using the `.groups` argument.

## # A tibble: 48 x 3
## # Groups:   Major_category [16]
##    Major_category                       x     q
##    <chr>                            <dbl> <dbl>
##  1 Agriculture & Natural Resources 0.0223  0.25
##  2 Agriculture & Natural Resources 0.0304  0.5 
##  3 Agriculture & Natural Resources 0.0344  0.75
##  4 Arts                            0.0407  0.25
##  5 Arts                            0.0541  0.5 
##  6 Arts                            0.0627  0.75
##  7 Biology & Life Science          0.0210  0.25
##  8 Biology & Life Science          0.0250  0.5 
##  9 Biology & Life Science          0.0329  0.75
## 10 Business                        0.0419  0.25
## # ... with 38 more rows
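An alternative that keeps one row per group is to compute each quantile as its own column. This is a sketch on made-up data, offered as a different layout rather than a replacement for the approach above; toy, rate, and the q25/q50/q75 column names are assumptions.

```r
library(dplyr)

# Made-up rates for two groups
toy <- tibble(
  Major_category = c(rep("a", 5), rep("b", 3)),
  rate           = c(1, 2, 3, 4, 5, 10, 20, 30)
)

# One summary column per quantile; unname() drops the "25%"-style names
# that quantile() attaches to its result
wide_quantiles <- toy %>%
  group_by(Major_category) %>%
  summarise(
    q25 = unname(quantile(rate, 0.25)),
    q50 = unname(quantile(rate, 0.50)),
    q75 = unname(quantile(rate, 0.75))
  )

wide_quantiles
```

The wide layout is convenient for tables, while the long layout with a q column is usually easier to feed into ggplot2.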

Summarizing merge_df with the ‘summarise’ function on the ‘Grad_unemployed’ column

summary1 <- merge_df %>% 
  group_by(Major_category) %>% 
  summarise(Grad_unemployed = mean(Grad_unemployed, na.rm = TRUE)) %>% 
  arrange(desc(Grad_unemployed))

summary1

## # A tibble: 16 x 2
##    Major_category                      Grad_unemployed
##    <chr>                                         <dbl>
##  1 Business                                      7846.
##  2 Social Science                                6725.
##  3 Humanities & Liberal Arts                     5669.
##  4 Psychology & Social Work                      5492 
##  5 Communications & Journalism                   4433.
##  6 Education                                     4184.
##  7 Arts                                          3070.
##  8 Computers & Mathematics                       2642 
##  9 Physical Sciences                             2403 
## 10 Biology & Life Science                        2287.
## 11 Engineering                                   2244.
## 12 Health                                        2164.
## 13 Law & Public Policy                           2002.
## 14 Industrial Arts & Consumer Services           1283.
## 15 Agriculture & Natural Resources                500.
## 16 Interdisciplinary                              261

Visualizing mean Grad_unemployed by Major_category

ggplot(data = summary1) +
  geom_count(mapping = aes(x = Grad_unemployed, y = Major_category,  color = Major_category) )
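geom_count() sizes each point by the number of observations at that location. With one row per category every count is 1, so a plain geom_point() may be a simpler choice. The sketch below uses three rows copied (rounded) from the summary1 table; toy_summary1, p, and the axis labels are assumptions for illustration.

```r
library(ggplot2)

# Three rows taken (rounded) from the summary1 table, for illustration
toy_summary1 <- data.frame(
  Major_category  = c("Business", "Social Science", "Interdisciplinary"),
  Grad_unemployed = c(7846, 6725, 261)
)

# One point per category, with categories sorted by their value
p <- ggplot(toy_summary1,
            aes(x = Grad_unemployed,
                y = reorder(Major_category, Grad_unemployed))) +
  geom_point(size = 3) +
  labs(x = "Mean number of unemployed graduates", y = "Major category")

p
```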

Mubashira Qari

April 22, 2022
