TidyverseCreate

Source: Kagle data - Data Analysis Jobs, Based on NYC Jobs - October 2021

This dataset contains current job postings available on the City of New York’s official jobs site ( http://www.nyc.gov/html/careers/html/search/search.shtml ). Internal postings available to city employees and external postings available to the general public are included.

Load Packages

library("tidyverse")
#> Warning: package 'tidyverse' was built under R version 4.0.5
#> -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
#> v ggplot2 3.3.5     v purrr   0.3.4
#> v tibble  3.1.4     v dplyr   1.0.7
#> v tidyr   1.1.3     v stringr 1.4.0
#> v readr   2.0.1     v forcats 0.5.1
#> Warning: package 'ggplot2' was built under R version 4.0.5
#> Warning: package 'tibble' was built under R version 4.0.5
#> Warning: package 'tidyr' was built under R version 4.0.5
#> Warning: package 'readr' was built under R version 4.0.5
#> Warning: package 'dplyr' was built under R version 4.0.5
#> Warning: package 'forcats' was built under R version 4.0.5
#> -- Conflicts ------------------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
library("reactable")
#> Warning: package 'reactable' was built under R version 4.0.5
library("tidytext")
#> Warning: package 'tidytext' was built under R version 4.0.5

Data Set

The readr package in tidyverse library contains the function read_csv that will import data from a csv file. The csv file was downloaded from Kaggle.com dataset on Data Analysis Jobs in NYC for October 2021. The imported data was read in to RStudio as a dataframe, da.

da <- read_csv("https://github.com/candrewxs/Vignettes/blob/master/dadata/data_analysis_jobs.csv?raw=true")
#> Rows: 851 Columns: 23
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (18): Agency, Posting Type, Business Title, Civil Service Title, Title C...
#> dbl  (4): Job ID, # Of Positions, Salary Range From, Salary Range To
#> lgl  (1): Recruitment Contact
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Retrieve the names attribute of the da data set with the names() function.

names(da)
#>  [1] "Job ID"                    "Agency"                   
#>  [3] "Posting Type"              "# Of Positions"           
#>  [5] "Business Title"            "Civil Service Title"      
#>  [7] "Title Code No"             "Level"                    
#>  [9] "Salary Range From"         "Salary Range To"          
#> [11] "Salary Frequency"          "Work Location"            
#> [13] "Division/Work Unit"        "Job Description"          
#> [15] "Minimum Qual Requirements" "Preferred Skills"         
#> [17] "Additional Information"    "To Apply"                 
#> [19] "Hours/Shift"               "Work Location 1"          
#> [21] "Recruitment Contact"       "Residency Requirement"    
#> [23] "Post Until"

Group by Variables

Dataset Question: How many job postings are Internal and External? What are the Levels and how many are available?

Create a new data frame using existing data frame (da) by extracting columns: Level and Posting Type. The count() combines group_by and count rows in each group into a single function. Renamed the column name “n” to “Count” and created an interactive data table.

plot1 <- da %>%
   count(Level, `Posting Type`)  # dplyr package: group and count 

colnames(plot1)[3] <- "Count" 
# use function colnames() to rename column 
# and access individual column names with colnames(df)[index]

reactable(plot1) # interactive data table

Visualization

# discrete visualization with ggplot2
p1 <- ggplot(data = plot1) 
  p1 + geom_col(aes(Level, Count, fill = `Posting Type`))

Dataset Question: What are the Business Titles categories which NYC hires Data Analysis to perform? Which Business Title has the highest demand for Data Analysis?

Create a new data frame using existing data frame (da) by extracting column: Business Title. Tidy data frame column to upper case and build a contingency table of the counts at each combination of factor levels with the table() function. Renamed the column name “pl2” to “Business Title” and created an interactive data table showing the frequency.

plot2 <- as.data.frame(da[,5])

plot2$`Business Title` <- toupper(plot2$`Business Title`)

pl2 <- as.data.frame(table(plot2))

colnames(pl2)[1] <- "Business Title"

reactable(pl2)

summary(pl2) # calculates some statistics on the data
#>                                   Business Title      Freq       
#>  .NET DEVELOPER                          :  1    Min.   : 1.000  
#>  .NET DEVELOPER ANALYST                  :  1    1st Qu.: 2.000  
#>  .NET PROGRAMMER/ANALYST (TECHNICAL LEAD):  1    Median : 2.000  
#>  .NET/CRM DEVELOPER                      :  1    Mean   : 2.251  
#>  .NET/JAVASCRIPT DEVELOPER               :  1    3rd Qu.: 2.000  
#>  311 ASSISTANT BUSINESS SUPPORT ASSOCIATE:  1    Max.   :19.000  
#>  (Other)                                 :372

Filter Business Titles with a frequency greater or equal to 10.

pl_2 <- filter(pl2, Freq >= 10) 

pl_2
#>                          Business Title Freq
#> 1 ADMINISTRATIVE HOUSING SUPERINTENDENT   10
#> 2                      BUSINESS ANALYST   19
#> 3        COMPUTER SPECIALIST (SOFTWARE)   16
#> 4                          DATA ANALYST   11
#> 5                       PROJECT MANAGER   14

Visualization

p2 <- ggplot(pl_2) 
  p2 + geom_col(aes(`Business Title`, Freq)) +
    coord_flip()

Conclusion

Kagle data - Data Analysis Jobs, Based on NYC Jobs was loaded and analyzed using R base functions and packages from Tidyverse and Reactable. These are the Tidyverse packages that were utilized to load (readr) , perform data manipulation (dplyr) , create graphic representation of input data (ggplot2).

Links GitHub

TidyverseExtend

In the Extend chapter, we will take a look at the Preferred Skills field which is a free format english description of the requirements. We are going to do some lexicon analysis on this field.

Note there are some non ascii characters in it.


da[1,]$`Preferred Skills`
#> [1] "Ã¢Â\200Â¢\tExperience in the procurement of goods and services (professional and standardized preferred) Ã¢Â\200Â¢\tStrong knowledge of New York City Procurement Policy Board (PPB) Rules and local laws Ã¢Â\200Â¢\tStrong knowledge of City systems including Financial Management System (FMS); Procurement and Sourcing Solution Portal (Passport); and Office of Citywide Procurement (OCP) Ã¢Â\200Â¢\tExperience processing large procurements from start to completion  Ã¢Â\200Â¢\tKnowledge of the CityÃ¢Â\200Â\231s Minority/Women-owned Business Enterprise (M/WBE) Program Ã¢Â\200Â¢\tStrong analytical skills with commitment to detail and organization Ã¢Â\200Â¢\tProven ability to manage a large workload within tight time constraint Ã¢Â\200Â¢\tProven ability to work independently in a fast-paced environment Ã¢Â\200Â¢\tStrong Microsoft Excel and Word skills Ã¢Â\200Â¢\tExcellent writing, interpersonal, customer service and presentation skills"

stringr::str_replace_all()

To get rid of these we can use the str_replace_all() command to replace anything thats not between the hexadecimal codes that encompasses all ascii characters. Note str_replace_all() is the equivalent of the the base R gsub() function.



da$`Preferred Skills`<-str_replace_all(da$`Preferred Skills`,"[^\x20-\x7E]", "")

Note the non-ascci characters are removed.

da[1,]$`Preferred Skills`
#> [1] "Experience in the procurement of goods and services (professional and standardized preferred) Strong knowledge of New York City Procurement Policy Board (PPB) Rules and local laws Strong knowledge of City systems including Financial Management System (FMS); Procurement and Sourcing Solution Portal (Passport); and Office of Citywide Procurement (OCP) Experience processing large procurements from start to completion  Knowledge of the Citys Minority/Women-owned Business Enterprise (M/WBE) Program Strong analytical skills with commitment to detail and organization Proven ability to manage a large workload within tight time constraint Proven ability to work independently in a fast-paced environment Strong Microsoft Excel and Word skills Excellent writing, interpersonal, customer service and presentation skills"

tidyr::drop_na()

Use drop_na() to eliminate any null values.


pref_skills_df<-da[c("Job ID","Preferred Skills")]


names(pref_skills_df)<-c("jobid", "skills" )     # set col names

# drop the NA skills
pref_skills_df <- pref_skills_df %>%
    drop_na(skills)

tidytext::unnest_tokens()

Now we can use unnest_tokens() to seperate every word into a seperate row. The first parameter “word” is the token parameter, so you could specify token=word. Other tokens include sentence, line, and character.


pref_skills_df2 <- pref_skills_df %>%
  unnest_tokens(word, skills )

dplyr :: group_by() and summarize()

These are 2 very useful functions that often go together. Here we get the count of every word in the job requirements.


pref_skills_df3 <- pref_skills_df2 %>%
  group_by(word) %>%
  summarize(n = n())

dplyr :: arrange()

Use arrange() to sort by the count, descending, so we can see the most used words. Note arange() is the equivalent to the base R order()


 pref_skills_df3 %>%
      arrange(desc(n)) %>%
        head()
#> # A tibble: 6 x 2
#>   word           n
#>   <chr>      <int>
#> 1 and         7236
#> 2 to          2812
#> 3 of          2268
#> 4 with        2088
#> 5 experience  1984
#> 6 in          1944

dplyr::filter()

We are only interested in skills so lets create a corpus of skills and use the filter() function to isolate them.


  

skill_words<-c("management","communication","written","microsoft","excel","analysis","business","sql","interpersonal","software","research","powerpoint","database","quantitative","java","databases","statistical","leadership","communications","javascript","oracle","agile","visio","tableau","law","xml","manager","html5","spanish","economics","github","osha","qa")


pref_skills_df4<-pref_skills_df3 %>%
    filter(word %in% skill_words) %>%
    arrange(desc(n))

## dplyr::inner_join()

Now use inner_join() to merge the words back to the job ids. Note inner_join() is the equivalent of the base R merge() function.


pref_skills_df5= pref_skills_df2 %>% inner_join(pref_skills_df4,by="word")

reactable

Lastly, the reactable package is a pretty neat alternative to knitr and kable.

You can set up a filter, for example type “business” under word.

reactable(pref_skills_df5, filterable = TRUE, minRows = 10)

You can display in groups

reactable(pref_skills_df5, groupBy = "word")

See here for a great tutorial on reactable.

Kagle Data - Tidyverse

Coffy Andrews-Guo Tom Buonora

2021-11-06

TidyverseCreate

Load Packages

Data Set

Group by Variables

Dataset Question: How many job postings are Internal and External? What are the Levels and how many are available?

Visualization

Dataset Question: What are the Business Titles categories which NYC hires Data Analysis to perform? Which Business Title has the highest demand for Data Analysis?

Visualization

Conclusion

TidyverseExtend

stringr::str_replace_all()

tidyr::drop_na()

tidytext::unnest_tokens()

dplyr :: group_by() and summarize()

dplyr :: arrange()

dplyr::filter()

reactable