An attempt to determine the top skills for a data scientist, based on data scientist job listings. Different text mining techniques are used to identify skills from the description section of the job listings.
# Read in the CSV file of the 10,000 job listings for Data Scientist
jobs_df <- read.csv(file = 'data_scientist_united_states_job_postings_jobspikr.csv', stringsAsFactors = FALSE)
# For testing purposes, use only the first 100 rows
jobs_df <- jobs_df[1:100,]
# Output all the column names to confirm
names(jobs_df)
## [1] "crawl_timestamp" "url" "job_title"
## [4] "category" "company_name" "city"
## [7] "state" "country" "inferred_city"
## [10] "inferred_state" "inferred_country" "post_date"
## [13] "job_description" "job_type" "salary_offered"
## [16] "job_board" "geo" "cursor"
## [19] "contact_email" "contact_phone_number" "uniq_id"
## [22] "html_job_description"
# Just confirming the class is character for all these fields
#class(jobs_df$job_title)
#class(jobs_df$category)
#class(jobs_df$job_description)
#class(jobs_df$job_type)
#class(jobs_df$job_board)
Title is not that valuable: many of the titles are "Data Scientist", with some containing "Sr." or "I" or some other derivative, so it will be ignored in the analysis.
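As a quick check on that claim, the titles can be tallied directly; a minimal sketch in base R (dplyr is only loaded further down), output not shown:
# Sketch: most frequent job titles in the 100-row sample
head(sort(table(jobs_df$job_title), decreasing = TRUE), 10)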
# Following the NASA case study: https://www.tidytextmining.com/nasa.html
# Set up separate tidy data frames for job_title, category, job_description, job_type while keeping the uniq_id for each so we can connect later
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
jobs_title <- tibble(id = jobs_df$uniq_id, title = jobs_df$job_title)
jobs_title %>% print(n=10)
## # A tibble: 100 x 2
## id title
## <chr> <chr>
## 1 3b6c6acfcba6135a31c83b… Enterprise Data Scientist I
## 2 741727428839ae7ada852e… Data Scientist
## 3 cdc9ef9a1de327ccdc19cc… Data Scientist
## 4 1c8541cd2c2c924f9391c7… Data Scientist, Aladdin Wealth Tech, Associate …
## 5 445652a560a54410608578… Senior Data Scientist
## 6 9571ec617ba209fd9a4f84… CIB – Fixed Income Research – Machine Learning …
## 7 0ec629c03f3e82651711f2… Data Scientist, Licensing Operations
## 8 972e897473d65f34b8e7f1… Sr. Data Scientist (Can work on Xoriant W2)
## 9 80d64b46bc7c89602f63da… Data Scientist, Aladdin Wealth Tech, Associate
## 10 b772c6ef8ee7631895ab9a… Data Scientist
## # … with 90 more rows
The job description is the most important field in the dataset. It contains the complete write-up posted for the job listing.
jobs_desc <- tibble(id = jobs_df$uniq_id,
desc = jobs_df$job_description)
jobs_desc %>%
select(desc) %>%
sample_n(5)
## # A tibble: 5 x 1
## desc
## <chr>
## 1 "Read what people are saying about working here. \n\nSysco Associate App…
## 2 "SUMMARY\n\nThe Data Scientist will use expertise in programming databas…
## 3 "Read what people are saying about working here. \n\nData Scientist\n\nY…
## 4 Data Scientist Tampa, FL $110-130K Job Description - A highly respected …
## 5 Data Scientist Work location: El Segundo, CA Duration: 10 months+ Synthe…
Category is interesting info. It was used in the NASA case study; I'm not sure the results really tell us anything, but I'm leaving it in for now.
jobs_cat <- tibble(id = jobs_df$uniq_id,
category = jobs_df$category)
jobs_cat <- jobs_cat %>% filter(category != "") %>% print(n = 100)
## # A tibble: 57 x 2
## id category
## <chr> <chr>
## 1 3b6c6acfcba6135a31c83bd7ea493b18 Accounting/Finance
## 2 1c8541cd2c2c924f9391c7d3f526f64e Accounting/Finance
## 3 445652a560a5441060857853cf267470 biotech
## 4 9571ec617ba209fd9a4f842973a4e9c8 Accounting/Finance
## 5 0ec629c03f3e82651711f2626c23cadb Accounting/Finance
## 6 80d64b46bc7c89602f63daf06b9f1b4c Accounting/Finance
## 7 9cd3ed78e5cac9e516ea41173de2c25f Computer/Internet
## 8 53224da901548e7137bbb163d456ba6a Computer/Internet
## 9 010b5d671896d26eba50948c7c337f94 Computer/Internet
## 10 4a8b875d1acc4f560716561d699aa022 Computer/Internet
## 11 edcdf4dd38cca4bbccd5ba0b787d2c49 Computer/Internet
## 12 303c2bd509a4d2c2da619f68fe4b28f0 Arts/Entertainment/Publishing
## 13 2956d0023b0c4617c430adcedb1e38d9 Computer/Internet
## 14 e457f78180e72dd1ddee0efc042b0496 Computer/Internet
## 15 8d220fbda8e4ab36f10bba4ec8ff20ad military
## 16 b6e361550638f071a4985bf5f3d440ce business and financial operations
## 17 d33577ea9ae09c58d77e1fab2c012ba2 business and financial operations
## 18 09378ddde9b997b1acbf519c2b9ddf03 business and financial operations
## 19 eb3e1cdd65a86ef8b4cf28094f0c785b business and financial operations
## 20 1b148bd2e730d37c7afd2cbf0e7e7824 business and financial operations
## 21 6354c9293fd5867c41b9f0ca305cd163 business and financial operations
## 22 acfd50bf4d44eb476ec69b38348355be business and financial operations
## 23 674c331993a36bb28fd4cf302ce66e9d business and financial operations
## 24 87c7330ffd4a65d432eee4ca5955b971 Engineering/Architecture
## 25 2775ed23918410617dbc6cad9886e455 Engineering/Architecture
## 26 cbf0b6e8a77c62543a82b26fb06b34da Manufacturing/Mechanical
## 27 1f57eb4c560bbf2bbc780ec6274acaeb Engineering/Architecture
## 28 05116b62a369b96b2fa5d48c87e9310f Engineering/Architecture
## 29 6d5c369d0649aef334336592d5d502e0 Manufacturing/Mechanical
## 30 9c43f943b0210574243390fd774df9f3 life physical and social science
## 31 ea67f8f42a6fa593ce7acefddd8aef2e Computer/Internet
## 32 6393773aacaec4432bd091d3fda500b0 Computer/Internet
## 33 b0f7650425cbbbf4a8161af888c46359 Computer/Internet
## 34 7297f4499b50a131d8878f1bc99bfb0e Computer/Internet
## 35 d6c3c25cd9c3578291bceb0dfc3a5c68 Computer/Internet
## 36 e0f3ff447e5b10caf54a33e503d839e7 Arts/Entertainment/Publishing
## 37 de26a41473d841eb4f4e6c89a9c0c1fb Banking/Loans
## 38 33da39965c50a7044ab1da5586fc8454 agriculture and fishing
## 39 bfb0fdec92e9d40b47a26cec6b1fe457 business and financial operations
## 40 c8909fd5d13d781e4ad9d8708a877199 business and financial operations
## 41 78e276415437c529dd0a82c8d1b7df25 business and financial operations
## 42 f58d6adb2373f5349ef055e7855deafe business and financial operations
## 43 eab636ef77254c7aa93df150cfa55a19 business and financial operations
## 44 a7c13d21920e860ddaefbb3085eece14 business and financial operations
## 45 daab34b7684bdd0a27c3ccb991005d53 Education/Training
## 46 41083b020b17f0394c96ebc271095082 Engineering/Architecture
## 47 33a223e13f1902bc094a6be5e9aff6b6 Engineering/Architecture
## 48 6644ceb685aa5f434a907000d5cfbc8e Engineering/Architecture
## 49 c1bdfdd34b1b67bd435f83d87004fbe5 Engineering/Architecture
## 50 37c4c80252fb381734614164b6bf2c9b Engineering/Architecture
## 51 248276268f2f4779ae399f54962b9a84 Engineering/Architecture
## 52 30f9bad2109e8d12c0e42c90622fc2cb Engineering/Architecture
## 53 689e911a7b842c0942901f00924bd07c Manufacturing/Mechanical
## 54 8ec37358cbf6691fd30c21a5f1de1d8a science
## 55 e682d47f650976c7cfd182e10656003e Engineering/Architecture
## 56 d86cb7aad36562ab4a58c67fd37a4e95 Engineering/Architecture
## 57 99045ef17e641c9631cc2d5939e3fa0c Engineering/Architecture
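For reference, a quick tally of how the non-empty categories are distributed; a small sketch using the jobs_cat tibble above (output not shown):
# Sketch: distribution of the non-empty categories
jobs_cat %>% count(category, sort = TRUE)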
The job type info is not valuable, as it's just full-time, contract, etc.
jobs_type <- tibble(id = jobs_df$uniq_id,
category = jobs_df$job_type)
jobs_type %>% print(n=10)
## # A tibble: 100 x 2
## id category
## <chr> <chr>
## 1 3b6c6acfcba6135a31c83bd7ea493b18 Undefined
## 2 741727428839ae7ada852eebef29b0fe Undefined
## 3 cdc9ef9a1de327ccdc19cc0d07dbbb37 Full Time
## 4 1c8541cd2c2c924f9391c7d3f526f64e Undefined
## 5 445652a560a5441060857853cf267470 Full Time
## 6 9571ec617ba209fd9a4f842973a4e9c8 Undefined
## 7 0ec629c03f3e82651711f2626c23cadb Undefined
## 8 972e897473d65f34b8e7f1c1b4c74b1c Contract
## 9 80d64b46bc7c89602f63daf06b9f1b4c Undefined
## 10 b772c6ef8ee7631895ab9a59b5e8b2c1 Contract
## # … with 90 more rows
Attempted to make a list of keywords, which would essentially be the list of skills. The idea was to run this list against the job descriptions to see whether there is a way to count or correlate the two. Since the keyword list carries no ID from the dataset, it has not proven useful yet, but I'm leaving it in for now (one possible way to use it is sketched after the lowercasing step further down).
keywords <- c("python", "sql", "modeling", "statistics", "algorithms", "r", "visualization", "hadoop", "mining", "communication", "aws", "spark", "artificial intelligence", "machine learning", "sas", "cloud", "innovative", "driven", "optimization", "java", "databases", "leadership", "security", "tableau", "phd", "education", "degree", "hive", "ml", "scala", "ms", "economics", "neural", "verbal", "transformation", "culture", "tensorflow", "automation", "azure", "nlp", "architecture", "nosql", "scripting", "passionate", "agile", "bachelor\'s", "clustering", "pandas", "bs")
jobs_kw <- tibble(keyword = keywords)
jobs_kw
## # A tibble: 49 x 1
## keyword
## <chr>
## 1 python
## 2 sql
## 3 modeling
## 4 statistics
## 5 algorithms
## 6 r
## 7 visualization
## 8 hadoop
## 9 mining
## 10 communication
## # … with 39 more rows
From the job_description, tokenize all the words and remove "stop words", which are common English words, so the analysis can focus on the meaningful words in the job listing.
# Use tidytext’s unnest_tokens() for the description field so we can do the text analysis.
# unnest_tokens() will tokenize all the words in the description field and create a tidy data frame of word by identifier
library(tidytext)
jobs_desc <- jobs_desc %>%
unnest_tokens(word, desc) %>%
anti_join(stop_words)
## Joining, by = "word"
jobs_desc
## # A tibble: 23,310 x 2
## id word
## <chr> <chr>
## 1 3b6c6acfcba6135a31c83bd7ea493b18 read
## 2 3b6c6acfcba6135a31c83bd7ea493b18 people
## 3 3b6c6acfcba6135a31c83bd7ea493b18 farmers
## 4 3b6c6acfcba6135a31c83bd7ea493b18 join
## 5 3b6c6acfcba6135a31c83bd7ea493b18 team
## 6 3b6c6acfcba6135a31c83bd7ea493b18 diverse
## 7 3b6c6acfcba6135a31c83bd7ea493b18 professionals
## 8 3b6c6acfcba6135a31c83bd7ea493b18 farmers
## 9 3b6c6acfcba6135a31c83bd7ea493b18 acquire
## 10 3b6c6acfcba6135a31c83bd7ea493b18 skills
## # … with 23,300 more rows
Provide a count, in table form, of the most common words in the job descriptions.
# Most common words in the description field
jobs_desc %>%
count(word, sort = TRUE) %>% print(n=10)
## # A tibble: 3,824 x 2
## word n
## <chr> <int>
## 1 data 971
## 2 experience 455
## 3 business 200
## 4 learning 198
## 5 skills 183
## 6 team 178
## 7 science 177
## 8 models 163
## 9 machine 155
## 10 analysis 152
## # … with 3,814 more rows
Added more words to be removed. The list is certainly not exhaustive at this point, but I'm leaving it in so more can be added later.
# Add extra stop words to remove noise from the descriptions
extra_stopwords <- tibble(word = c(as.character(1:10),
"2", "job", "company", "e.g", "religion", "origin", "color", "gender", "2019", "1999"))
jobs_desc <- jobs_desc %>%
anti_join(extra_stopwords)
## Joining, by = "word"
jobs_desc %>%
count(word, sort = TRUE) %>% print(n=10)
## # A tibble: 3,806 x 2
## word n
## <chr> <int>
## 1 data 971
## 2 experience 455
## 3 business 200
## 4 learning 198
## 5 skills 183
## 6 team 178
## 7 science 177
## 8 models 163
## 9 machine 155
## 10 analysis 152
## # … with 3,796 more rows
Applied stemming to combine, in essence, words that share the same root, but I'm not sure it's valuable at this point. Leaving it in for now.
library(SnowballC)
# Stem the words and look at the top 10; the results are not too useful
# from https://abndistro.com/post/2019/02/10/tidy-text-mining-in-r/#stemming
jobs_desc %>%
mutate(word_stem = SnowballC::wordStem(word)) %>%
count(word_stem, sort = TRUE) %>% print(n=10)
## # A tibble: 2,676 x 2
## word_stem n
## <chr> <int>
## 1 data 971
## 2 experi 479
## 3 model 309
## 4 team 255
## 5 learn 233
## 6 develop 224
## 7 busi 212
## 8 skill 202
## 9 analyt 201
## 10 statist 192
## # … with 2,666 more rows
# Not very helpful as of yet; stemming works, but the results are too generic
Here is a list of skills based on the counts of common words in the job descriptions after removing the stop words:
Skills: python, sql, modeling, statistics, algorithms, r, visualization, hadoop, mining, communication, aws, spark, artificial intelligence, machine learning, sas, cloud, innovative, driven, optimization, java, databases, leadership, security, tableau, phd, education, degree, hive, ml, scala, ms, economics, neural, verbal, transformation, culture, tensorflow, automation, azure, nlp, architecture, nosql, scripting, passionate, agile, bachelor’s, clustering, pandas, bs
Apply lowercase to all the words to ensure differing cases aren't problematic (unnest_tokens() already lowercases by default, so this is mostly a safeguard).
# lowercase all the words just to make sure there's no redundancy
jobs_desc <- jobs_desc %>%
mutate(word = tolower(word))
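One way the keyword list from earlier could still be put to use: join it against the tokenized jobs_desc and count the matches. This is only a minimal sketch (output not shown), and multi-word skills such as "machine learning" would need the bigram matching done later.
# Sketch: how often does each single-word keyword appear across all descriptions?
jobs_desc %>%
  inner_join(jobs_kw, by = c("word" = "keyword")) %>%
  count(word, sort = TRUE)
# Sketch: in how many distinct listings does each keyword appear at least once?
jobs_desc %>%
  inner_join(jobs_kw, by = c("word" = "keyword")) %>%
  distinct(id, word) %>%
  count(word, sort = TRUE)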
Count how many times each pair of words appears together in a description field. This does not mean the two words are adjacent, just that they appear in the same description.
# Word co-occurrences and correlations
# Count how many times each pair of words occurs together in the description field.
library(widyr)
desc_word_pairs <- jobs_desc %>%
pairwise_count(word, id, sort = TRUE, upper = FALSE)
desc_word_pairs
## # A tibble: 944,637 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 data experience 91
## 2 data python 81
## 3 experience python 80
## 4 skills data 74
## 5 data learning 73
## 6 team data 72
## 7 skills experience 72
## 8 experience learning 72
## 9 data analysis 71
## 10 data scientist 70
## # … with 944,627 more rows
Plot a network of the co-occurring words.
# Plot networks of these co-occurring words
library(ggplot2)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
set.seed(1234)
desc_word_pairs %>%
filter(n >= 50) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void()
Output the correlation of the word pairs.
# Find correlation among the words in the description field
# This looks for those words that are more likely to occur together than with other words for a dataset.
desc_cors <- jobs_desc %>%
group_by(word) %>%
filter(n() >= 50) %>%
pairwise_cor(word, id, sort = TRUE, upper = FALSE)
desc_cors %>% print(n=10)
## # A tibble: 1,953 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 machine learning 0.905
## 2 science computer 0.570
## 3 advanced techniques 0.522
## 4 computer statistics 0.493
## 5 teams role 0.481
## 6 science statistics 0.481
## 7 develop responsibilities 0.477
## 8 preferred systems 0.472
## 9 analytics management 0.471
## 10 degree related 0.465
## # … with 1,943 more rows
# Skipping the rest of section 8.2
Calculating the term frequency times inverse document frequency.
# Calculating tf-idf for the description fields
# we can use tf-idf, the term frequency times inverse document frequency, to identify words that are especially important to a document within a collection of documents.
desc_tf_idf <- jobs_desc %>%
count(id, word, sort = TRUE) %>%
ungroup() %>%
bind_tf_idf(word, id, n)
desc_tf_idf %>% filter(n >= 10) %>%
arrange(-tf_idf)
## # A tibble: 73 x 6
## id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 0ec629c03f3e82651711f2626c23cadb licensing 14 0.0606 4.61 0.279
## 2 37c4c80252fb381734614164b6bf2c9b fedex 12 0.0381 4.61 0.175
## 3 6393773aacaec4432bd091d3fda500b0 dmi 11 0.0282 4.61 0.130
## 4 674c331993a36bb28fd4cf302ce66e9d clinical 17 0.0344 2.66 0.0915
## 5 a7c13d21920e860ddaefbb3085eece14 clinical 17 0.0344 2.66 0.0915
## 6 40ca3fab8f31010d1b5f35877508c0bf cloud 10 0.0325 2.12 0.0688
## 7 972e897473d65f34b8e7f1c1b4c74b1c cloud 10 0.0319 2.12 0.0677
## 8 674c331993a36bb28fd4cf302ce66e9d review 11 0.0223 2.66 0.0592
## 9 a7c13d21920e860ddaefbb3085eece14 review 11 0.0223 2.66 0.0592
## 10 85901569bb70a9ccb272208c19ef043a research 11 0.0447 1.17 0.0524
## # … with 63 more rows
# These are the most important words in the description fields as measured by tf-idf, meaning they are common but not too common.
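For reference, the same quantities can be computed by hand: tf is the word count divided by the document's total word count, and idf is the natural log of the number of documents divided by the number of documents containing the word. A minimal sketch, assuming the tokenized jobs_desc from above (output not shown):
# Sketch: tf-idf computed manually, for comparison with bind_tf_idf()
n_docs <- n_distinct(jobs_desc$id)
jobs_desc %>%
  count(id, word) %>%
  group_by(id) %>%
  mutate(tf = n / sum(n)) %>%
  ungroup() %>%
  group_by(word) %>%
  mutate(idf = log(n_docs / n_distinct(id)),
         tf_idf = tf * idf) %>%
  ungroup() %>%
  arrange(desc(tf_idf))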
Combine the tf-idf data frame from the job descriptions with the categories, joining on the unique ID as the key.
By category, this will identify the most important words in the job descriptions.
#8.3.2
# Try it with the category; not sure how much value that will have, but want to try it
desc_tf_idf <- full_join(desc_tf_idf, jobs_cat, by = "id")
desc_tf_idf %>%
filter(!near(tf, 1)) %>%
filter(category %in% c("Accounting/Finance", "biotech",
"Computer/Internet", "Arts/Entertainment/Publishing",
"military", "business and financial operations",
"Engineering/Architecture", "Manufacturing/Mechanical",
"life physical and social science", "Banking/Loans",
"agriculture and fishing ", "Education/Training",
"science")) %>%
arrange(desc(tf_idf)) %>%
group_by(category) %>%
distinct(word, category, .keep_all = TRUE) %>%
top_n(15, tf_idf) %>%
ungroup() %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
ggplot(aes(word, tf_idf, fill = category)) +
geom_col(show.legend = FALSE) +
facet_wrap(~category, ncol = 3, scales = "free") +
coord_flip() +
labs(title = "Highest tf-idf words in DS job listing description fields",
caption = "From jobpickr dataset",
x = NULL, y = "tf-idf")
For topic modeling, create a document term matrix.
Initially, the word count by document ID.
# 8.4 Topic Modeling
# Casting to a document-term matrix
word_counts <- jobs_desc %>%
count(id, word, sort = TRUE) %>%
ungroup()
word_counts %>% print(n=10)
## # A tibble: 16,114 x 3
## id word n
## <chr> <chr> <int>
## 1 674c331993a36bb28fd4cf302ce66e9d data 41
## 2 a7c13d21920e860ddaefbb3085eece14 data 41
## 3 625160c64fec3702a8c0c7ae97e77825 data 27
## 4 c10862d80f4bcf1b0b72bfb7f861776f data 27
## 5 99045ef17e641c9631cc2d5939e3fa0c data 25
## 6 de26a41473d841eb4f4e6c89a9c0c1fb data 25
## 7 eda91b88eb3096ed98bc1a5f6b5568df data 25
## 8 40ca3fab8f31010d1b5f35877508c0bf data 24
## 9 cbf0b6e8a77c62543a82b26fb06b34da data 23
## 10 972e897473d65f34b8e7f1c1b4c74b1c data 22
## # … with 1.61e+04 more rows
Now construct the document-term matrix, which has a high level of sparsity. From the case study: "Each non-zero entry corresponds to a certain word appearing in a certain document."
# Construct DTM
desc_dtm <- word_counts %>%
cast_dtm(id, word, n)
desc_dtm
## <<DocumentTermMatrix (documents: 100, terms: 3806)>>
## Non-/sparse entries: 16114/364486
## Sparsity : 96%
## Maximal term length: 23
## Weighting : term frequency (tf)
From Wikipedia: In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.
# 8.4.2 Topic modeling
library(topicmodels)
# be aware that running this model is time intensive
# Use 16 topics. There are 13 categories in the data, and 16 (2^4) is close to 13, so why not.
desc_lda <- LDA(desc_dtm, k = 16, control = list(seed = 1234))
desc_lda
## A LDA_VEM topic model with 16 topics.
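The choice of k = 16 is fairly arbitrary. One rough way to compare candidate values of k is model perplexity (lower is better); the sketch below refits the model for each k on the same DTM, so it is slow, and it is my own addition rather than part of the case study (output not shown):
# Sketch: compare perplexity across a few candidate values of k
ks <- c(4, 8, 13, 16)
perplexities <- sapply(ks, function(k) {
  lda_k <- LDA(desc_dtm, k = k, control = list(seed = 1234))
  perplexity(lda_k)
})
data.frame(k = ks, perplexity = perplexities)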
Tidy the resulting topics.
# Interpreting the data model
tidy_lda <- tidy(desc_lda)
tidy_lda
## # A tibble: 60,896 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 data 0.0303
## 2 2 data 0.0286
## 3 3 data 0.0189
## 4 4 data 0.0349
## 5 5 data 0.0363
## 6 6 data 0.0571
## 7 7 data 0.0382
## 8 8 data 0.0660
## 9 9 data 0.0620
## 10 10 data 0.0260
## # … with 60,886 more rows
Identify the top 10 terms for each topic.
# Top 10 Terms
top_terms <- tidy_lda %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms
## # A tibble: 166 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 data 0.0303
## 2 1 experience 0.0107
## 3 1 team 0.0101
## 4 1 business 0.00976
## 5 1 learning 0.00946
## 6 1 models 0.00931
## 7 1 skills 0.00908
## 8 1 science 0.00866
## 9 1 quantitative 0.00811
## 10 1 engineering 0.00737
## # … with 156 more rows
Plot the top 10 terms for each topic. It's interesting to see how the words break out, even though the topics are anonymous and identified only by number.
From case study: The topic modeling process has identified groupings of terms that we can understand as human readers of these description fields.
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(term, beta, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
coord_flip() +
scale_x_reordered() +
labs(title = "Top 10 terms in each LDA topic",
x = NULL, y = expression(beta)) +
facet_wrap(~ topic, ncol = 4, scales = "free")
Now calculate gamma, which gives the probability that each document belongs to each topic.
The graph below isn't very useful to me.
# LDA gamma
lda_gamma <- tidy(desc_lda, matrix = "gamma")
lda_gamma
## # A tibble: 1,600 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 674c331993a36bb28fd4cf302ce66e9d 1 0.0000279
## 2 a7c13d21920e860ddaefbb3085eece14 1 0.0000279
## 3 625160c64fec3702a8c0c7ae97e77825 1 0.0000496
## 4 c10862d80f4bcf1b0b72bfb7f861776f 1 0.0000327
## 5 99045ef17e641c9631cc2d5939e3fa0c 1 0.0000422
## 6 de26a41473d841eb4f4e6c89a9c0c1fb 1 0.0000289
## 7 eda91b88eb3096ed98bc1a5f6b5568df 1 0.0000226
## 8 40ca3fab8f31010d1b5f35877508c0bf 1 0.0000448
## 9 cbf0b6e8a77c62543a82b26fb06b34da 1 0.0000327
## 10 972e897473d65f34b8e7f1c1b4c74b1c 1 0.0000441
## # … with 1,590 more rows
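As a side note, gamma can also be used to assign each document its single most probable topic; a minimal sketch (output not shown):
# Sketch: most probable topic per document, and how many documents land in each topic
lda_gamma %>%
  group_by(document) %>%
  top_n(1, gamma) %>%
  ungroup() %>%
  count(topic, sort = TRUE)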
# Distribution of probabilities for all topics
ggplot(lda_gamma, aes(gamma)) +
geom_histogram() +
scale_y_log10() +
labs(title = "Distribution of probabilities for all topics",
y = "Number of documents", x = expression(gamma))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 4 rows containing missing values (geom_bar).
Below graph isn’t useful to me.
# Distribution of probability for each topic
ggplot(lda_gamma, aes(gamma, fill = as.factor(topic))) +
geom_histogram(show.legend = FALSE) +
facet_wrap(~ topic, ncol = 4) +
scale_y_log10() +
labs(title = "Distribution of probability for each topic",
y = "Number of documents", x = expression(gamma))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 410 rows containing missing values (geom_bar).
Discover which categories are associated with which topic.
# 8.4.4
lda_gamma <- full_join(lda_gamma, jobs_cat, by = c("document" = "id"))
lda_gamma
## # A tibble: 1,600 x 4
## document topic gamma category
## <chr> <int> <dbl> <chr>
## 1 674c331993a36bb28fd4cf302ce… 1 2.79e-5 business and financial oper…
## 2 a7c13d21920e860ddaefbb3085e… 1 2.79e-5 business and financial oper…
## 3 625160c64fec3702a8c0c7ae97e… 1 4.96e-5 <NA>
## 4 c10862d80f4bcf1b0b72bfb7f86… 1 3.27e-5 <NA>
## 5 99045ef17e641c9631cc2d5939e… 1 4.22e-5 Engineering/Architecture
## 6 de26a41473d841eb4f4e6c89a9c… 1 2.89e-5 Banking/Loans
## 7 eda91b88eb3096ed98bc1a5f6b5… 1 2.26e-5 <NA>
## 8 40ca3fab8f31010d1b5f3587750… 1 4.48e-5 <NA>
## 9 cbf0b6e8a77c62543a82b26fb06… 1 3.27e-5 Manufacturing/Mechanical
## 10 972e897473d65f34b8e7f1c1b4c… 1 4.41e-5 <NA>
## # … with 1,590 more rows
top_cats <- lda_gamma %>%
filter(gamma > 0.5) %>%
count(topic, category, sort = TRUE)
top_cats
## # A tibble: 58 x 3
## topic category n
## <int> <chr> <int>
## 1 16 <NA> 7
## 2 15 <NA> 6
## 3 3 <NA> 5
## 4 1 <NA> 4
## 5 15 business and financial operations 4
## 6 6 <NA> 3
## 7 11 <NA> 3
## 8 14 <NA> 3
## 9 2 Engineering/Architecture 2
## 10 4 Accounting/Finance 2
## # … with 48 more rows
A little hard to decipher, but I think connecting this plot to the previous plot of the topics would connect which words (skills) are meaningful by category (see the sketch after the plot below).
# One more graph from 8.4.4
top_cats %>%
group_by(topic) %>%
top_n(5, n) %>%
ungroup %>%
mutate(category = reorder_within(category, n, topic)) %>%
ggplot(aes(category, n, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
labs(title = "Top categories for each LDA topic",
x = NULL, y = "Number of documents") +
coord_flip() +
scale_x_reordered() +
facet_wrap(~ topic, ncol = 4, scales = "free")
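As a rough sketch of that idea (my own addition, not from the case study): join the per-topic top terms with the per-topic top categories on the shared topic number, giving candidate skill words by category (output not shown).
# Sketch: candidate skill words by category, linked through the shared topic number
top_cats %>%
  filter(!is.na(category)) %>%
  inner_join(top_terms, by = "topic") %>%
  select(category, topic, term, beta) %>%
  arrange(category, desc(beta))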
Re-read the CSV and create a data frame of the first 100 rows for the quanteda analysis.
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ tibble 2.1.3 ✔ purrr 0.3.3
## ✔ tidyr 1.0.2 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tibble::as_data_frame() masks igraph::as_data_frame(), dplyr::as_data_frame()
## ✖ purrr::compose() masks igraph::compose()
## ✖ tidyr::crossing() masks igraph::crossing()
## ✖ dplyr::filter() masks stats::filter()
## ✖ igraph::groups() masks dplyr::groups()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::simplify() masks igraph::simplify()
library(quanteda)
## Package version: 2.0.1
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:igraph':
##
## as.igraph
## The following object is masked from 'package:utils':
##
## View
# See below for how this works
# https://www.r-bloggers.com/advancing-text-mining-with-r-and-quanteda/
jobs_df <- read.csv(file = 'data_scientist_united_states_job_postings_jobspikr.csv', stringsAsFactors = FALSE)
# For testing purposes, use only the first 100 rows
jobs_df <- jobs_df[1:100,]
jobs_df <- jobs_df[,c('uniq_id', 'job_description')]
class(jobs_df)
## [1] "data.frame"
Create a corpus based on the unique ID and the job description
# Generate a corpus
my_corpus <- corpus(jobs_df, docid_field = "uniq_id", text_field = "job_description")
#my_corpus
mycorpus_stats <- summary(my_corpus)
head(mycorpus_stats)
## Text Types Tokens Sentences
## 1 3b6c6acfcba6135a31c83bd7ea493b18 263 476 11
## 2 741727428839ae7ada852eebef29b0fe 104 176 8
## 3 cdc9ef9a1de327ccdc19cc0d07dbbb37 65 83 1
## 4 1c8541cd2c2c924f9391c7d3f526f64e 312 592 14
## 5 445652a560a5441060857853cf267470 276 465 13
## 6 9571ec617ba209fd9a4f842973a4e9c8 331 686 23
Preprocess the text: remove numbers, punctuation, symbols, and URLs, and split hyphens. Because the blog entry included it, I also kept the step that cleans tokens created by OCR.
# Preprocess the text
# Create tokens
token <-
tokens(
my_corpus,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_url = TRUE,
split_hyphens = TRUE
)
# Don't think this is needed, but why not
# Clean tokens created by OCR
token_ungd <- tokens_select(
token,
c("[\\d-]", "[[:punct:]]", "^.{1,2}$"),
selection = "remove",
valuetype = "regex",
verbose = TRUE
)
## removed 299 features
Create a document-feature matrix (DFM), this time using quanteda.
Also, filter out words that appear in fewer than 7.5% or in more than 90% of the documents. This rather conservative approach is possible because we have a sufficiently large corpus.
# Document-feature matrix
my_dfm <- dfm(token_ungd,
tolower = TRUE,
stem = TRUE,
remove = stopwords("english")
)
my_dfm_trim <-
dfm_trim(
my_dfm,
min_docfreq = 0.075,
# min 7.5%
max_docfreq = 0.90,
# max 90%
docfreq_type = "prop"
)
head(dfm_sort(my_dfm_trim, decreasing = TRUE, margin = "both"),
n = 10,
nf = 10)
## Document-feature matrix of: 10 documents, 10 features (5.0% sparse).
## features
## docs work model team learn develop busi use
## eda91b88eb3096ed98bc1a5f6b5568df 5 9 1 16 5 1 9
## ea67f8f42a6fa593ce7acefddd8aef2e 17 4 20 5 1 2 4
## 674c331993a36bb28fd4cf302ce66e9d 4 0 4 2 2 2 5
## a7c13d21920e860ddaefbb3085eece14 4 0 4 2 2 2 5
## 8d220fbda8e4ab36f10bba4ec8ff20ad 9 14 10 5 10 0 2
## de26a41473d841eb4f4e6c89a9c0c1fb 8 6 9 4 5 7 2
## features
## docs skill analyt statist
## eda91b88eb3096ed98bc1a5f6b5568df 3 6 9
## ea67f8f42a6fa593ce7acefddd8aef2e 6 2 0
## 674c331993a36bb28fd4cf302ce66e9d 6 2 1
## a7c13d21920e860ddaefbb3085eece14 6 2 1
## 8d220fbda8e4ab36f10bba4ec8ff20ad 4 1 2
## de26a41473d841eb4f4e6c89a9c0c1fb 6 4 3
## [ reached max_ndoc ... 4 more documents ]
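To see what the 7.5%–90% band actually removes, the per-feature document frequency can be checked directly with quanteda's docfreq() and ndoc(); a small sketch (output not shown):
# Sketch: proportion of documents each feature appears in
prop_df <- docfreq(my_dfm) / ndoc(my_dfm)
head(sort(prop_df, decreasing = TRUE))  # most widespread features (candidates for the 90% cap)
sum(prop_df < 0.075)                    # features appearing in fewer than 7.5% of documents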
# https://quanteda.io/articles/pkgdown/examples/plotting.html
#corpus_subset(my_corpus) %>%
# dfm(remove = stopwords("english"), remove_punct = TRUE) %>%
# dfm_trim(min_termfreq = 1, verbose = FALSE) %>%
# textplot_wordcloud(comparison = TRUE)
textplot_wordcloud(my_dfm, min_count = 10,
color = c('red', 'pink', 'green', 'purple', 'orange', 'blue'))
kwic(my_corpus, pattern = "data") %>%
textplot_xray()
library("ggplot2")
theme_set(theme_bw())
g <- textplot_xray(
kwic(my_corpus, pattern = "python"),
kwic(my_corpus, pattern = "r")
)
g + aes(color = keyword) +
scale_color_manual(values = c("blue", "red")) +
theme(legend.position = "none")
# Frequency plots
features_dfm <- textstat_frequency(my_dfm, n = 50)
# Sort by reverse frequency order
features_dfm$feature <- with(features_dfm, reorder(feature, -frequency))
ggplot(features_dfm, aes(x = feature, y = frequency)) +
geom_point() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
From: https://www.tidytextmining.com/
library(tidytext)
library(dplyr)
library(ggplot2)
library(tidyr)
jobs_df <- read.csv(file = 'data_scientist_united_states_job_postings_jobspikr.csv', stringsAsFactors = FALSE)
# For testing purposes, use only the first 100 rows
jobs_df <- jobs_df[1:100,]
jobs_df <- jobs_df[,c('job_description', 'uniq_id')]
jobs_words <- jobs_df %>%
unnest_tokens(word, job_description) %>%
anti_join(stop_words) %>%
count(uniq_id, word, sort = TRUE)
## Joining, by = "word"
total_words <- jobs_words %>%
group_by(uniq_id) %>%
summarize(total = sum(n))
jobs_words <- left_join(jobs_words, total_words)
## Joining, by = "uniq_id"
jobs_words
## # A tibble: 16,453 x 4
## uniq_id word n total
## <chr> <chr> <int> <int>
## 1 674c331993a36bb28fd4cf302ce66e9d data 41 505
## 2 a7c13d21920e860ddaefbb3085eece14 data 41 505
## 3 625160c64fec3702a8c0c7ae97e77825 data 27 282
## 4 c10862d80f4bcf1b0b72bfb7f861776f data 27 431
## 5 99045ef17e641c9631cc2d5939e3fa0c data 25 328
## 6 de26a41473d841eb4f4e6c89a9c0c1fb data 25 483
## 7 eda91b88eb3096ed98bc1a5f6b5568df data 25 613
## 8 40ca3fab8f31010d1b5f35877508c0bf data 24 312
## 9 cbf0b6e8a77c62543a82b26fb06b34da data 23 431
## 10 972e897473d65f34b8e7f1c1b4c74b1c data 22 317
## # … with 16,443 more rows
Don’t see much value in this one. Probably just remove.
ggplot(jobs_words %>% filter(n > 20), aes(n / total, fill = uniq_id)) +
geom_histogram(show.legend = FALSE) +
xlim(0.0, 0.4) +
facet_wrap(~uniq_id, ncol = 2, scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 22 rows containing missing values (geom_bar).
Don’t see any value in this one. Probably just remove.
jobs_words <- jobs_words %>%
bind_tf_idf(word, uniq_id, n)
jobs_words
## # A tibble: 16,453 x 7
## uniq_id word n total tf idf tf_idf
## <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 674c331993a36bb28fd4cf302ce66e9d data 41 505 0.0812 0.0305 0.00247
## 2 a7c13d21920e860ddaefbb3085eece14 data 41 505 0.0812 0.0305 0.00247
## 3 625160c64fec3702a8c0c7ae97e77825 data 27 282 0.0957 0.0305 0.00292
## 4 c10862d80f4bcf1b0b72bfb7f861776f data 27 431 0.0626 0.0305 0.00191
## 5 99045ef17e641c9631cc2d5939e3fa0c data 25 328 0.0762 0.0305 0.00232
## 6 de26a41473d841eb4f4e6c89a9c0c1fb data 25 483 0.0518 0.0305 0.00158
## 7 eda91b88eb3096ed98bc1a5f6b5568df data 25 613 0.0408 0.0305 0.00124
## 8 40ca3fab8f31010d1b5f35877508c0bf data 24 312 0.0769 0.0305 0.00234
## 9 cbf0b6e8a77c62543a82b26fb06b34da data 23 431 0.0534 0.0305 0.00163
## 10 972e897473d65f34b8e7f1c1b4c74b1c data 22 317 0.0694 0.0305 0.00211
## # … with 16,443 more rows
jobs_words %>%
select(-total) %>%
arrange(desc(tf_idf))
## # A tibble: 16,453 x 6
## uniq_id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 0ec629c03f3e82651711f2626c23cadb licensing 14 0.0601 4.61 0.277
## 2 87c7330ffd4a65d432eee4ca5955b971 food 6 0.0561 3.51 0.197
## 3 96b331bc3cb323f9fc8c43a5029687a6 tekvalley.com 2 0.0426 4.61 0.196
## 4 87c7330ffd4a65d432eee4ca5955b971 heinz 4 0.0374 4.61 0.172
## 5 87c7330ffd4a65d432eee4ca5955b971 kraft 4 0.0374 4.61 0.172
## 6 37c4c80252fb381734614164b6bf2c9b fedex 12 0.0373 4.61 0.172
## 7 e682d47f650976c7cfd182e10656003e crypto 6 0.0357 4.61 0.164
## 8 77d0b5f4638619632040c228a561f031 person 4 0.0548 2.66 0.146
## 9 3b6c6acfcba6135a31c83bd7ea493b18 farmers 9 0.0363 3.91 0.142
## 10 bfb0fdec92e9d40b47a26cec6b1fe457 validation 2 0.0444 3.00 0.133
## # … with 16,443 more rows
Don’t see value in below graph. Remove
jobs_words %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(uniq_id) %>%
top_n(15) %>%
ungroup() %>%
ggplot(aes(word, tf_idf, fill = uniq_id)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~uniq_id, ncol = 2, scales = "free") +
coord_flip()
## Selecting by tf_idf
Use bigrams (n = 2) to find the true word pairs that occur most frequently.
jobs_bigrams <- jobs_df %>%
unnest_tokens(bigram, job_description, token = "ngrams", n = 2)
jobs_bigrams %>%
count(bigram, sort = TRUE)
## # A tibble: 20,584 x 2
## bigram n
## <chr> <int>
## 1 machine learning 153
## 2 experience with 143
## 3 data scientist 126
## 4 ability to 116
## 5 of the 112
## 6 data science 109
## 7 experience in 105
## 8 in a 81
## 9 in the 78
## 10 years of 78
## # … with 20,574 more rows
bigrams_separated <- jobs_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# new bigram counts:
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
# This result is valuable
bigram_counts %>% print(n = 500)
## # A tibble: 8,637 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 machine learning 153
## 2 data scientist 126
## 3 data science 109
## 4 computer science 48
## 5 data sets 46
## 6 data analysis 42
## 7 data mining 37
## 8 communication skills 33
## 9 job description 26
## 10 data analytics 25
## 11 data scientists 25
## 12 equal opportunity 25
## 13 neural networks 22
## 14 clinical data 21
## 15 data review 20
## 16 learning algorithms 20
## 17 data sources 19
## 18 data visualization 18
## 19 related field 18
## 20 deep learning 17
## 21 learning techniques 17
## 22 real world 17
## 23 sexual orientation 17
## 24 advanced analytics 16
## 25 national origin 16
## 26 opportunity employer 16
## 27 statistical analysis 16
## 28 veteran status 16
## 29 artificial intelligence 15
## 30 gender identity 15
## 31 quantitative field 15
## 32 software development 15
## 33 team player 15
## 34 orientation gender 14
## 35 unstructured data 14
## 36 data management 13
## 37 functional teams 13
## 38 operations research 13
## 39 scikit learn 13
## 40 senior data 13
## 41 solving skills 13
## 42 statistical modeling 13
## 43 complex data 12
## 44 decision trees 12
## 45 ideal candidate 12
## 46 predictive modeling 12
## 47 programming languages 12
## 48 qualified applicants 12
## 49 random forest 12
## 50 statistical methods 12
## 51 time series 12
## 52 affirmative action 11
## 53 analytical skills 11
## 54 cross functional 11
## 55 learning models 11
## 56 action employer 10
## 57 analyze data 10
## 58 bachelor’s degree 10
## 59 business intelligence 10
## 60 color religion 10
## 61 cutting edge 10
## 62 minimum qualifications 10
## 63 mining techniques 10
## 64 predictive analytics 10
## 65 statistical models 10
## 66 technical skills 10
## 67 business partners 9
## 68 data environment 9
## 69 development experience 9
## 70 engineering team 9
## 71 experience experience 9
## 72 feature engineering 9
## 73 master's degree 9
## 74 model development 9
## 75 product development 9
## 76 project management 9
## 77 race color 9
## 78 relevant field 9
## 79 scientist location 9
## 80 wide range 9
## 81 cloud resiliency 8
## 82 data engineering 8
## 83 data gathering 8
## 84 data models 8
## 85 data processing 8
## 86 data sciences 8
## 87 excellent written 8
## 88 experience developing 8
## 89 fast paced 8
## 90 highly collaborative 8
## 91 interpersonal skills 8
## 92 job summary 8
## 93 manipulating data 8
## 94 mathematics computer 8
## 95 mathematics statistics 8
## 96 operational data 8
## 97 preferred qualifications 8
## 98 public cloud 8
## 99 python sql 8
## 100 science mathematics 8
## 101 strong experience 8
## 102 successful candidate 8
## 103 technical solutions 8
## 104 advanced degree 7
## 105 advanced statistical 7
## 106 analysis modeling 7
## 107 applied mathematics 7
## 108 clustering decision 7
## 109 data driven 7
## 110 ensemble methods 7
## 111 global leader 7
## 112 java scala 7
## 113 natural language 7
## 114 northrop grumman 7
## 115 opportunity affirmative 7
## 116 programming skills 7
## 117 proven ability 7
## 118 python experience 7
## 119 python java 7
## 120 real time 7
## 121 required skills 7
## 122 san francisco 7
## 123 statistics mathematics 7
## 124 strong communication 7
## 125 strong knowledge 7
## 126 strongly preferred 7
## 127 unsupervised learning 7
## 128 version control 7
## 129 web services 7
## 130 anomaly detection 6
## 131 bachelor's degree 6
## 132 broad range 6
## 133 business decisions 6
## 134 computer languages 6
## 135 contract clinical 6
## 136 creating data 6
## 137 custom data 6
## 138 customer experience 6
## 139 data analyst 6
## 140 data modeling 6
## 141 data visualizations 6
## 142 engineer data 6
## 143 experience creating 6
## 144 hive pig 6
## 145 implement models 6
## 146 implementing models 6
## 147 job title 6
## 148 language processing 6
## 149 learning artificial 6
## 150 learning data 6
## 151 learning methods 6
## 152 logistic regression 6
## 153 marital status 6
## 154 math statistics 6
## 155 north america 6
## 156 preferred experience 6
## 157 presentation skills 6
## 158 quantitative discipline 6
## 159 relational databases 6
## 160 religion sex 6
## 161 science engineering 6
## 162 science statistics 6
## 163 scripting languages 6
## 164 sex sexual 6
## 165 spark scala 6
## 166 statistical computer 6
## 167 statistical techniques 6
## 168 strong data 6
## 169 strong understanding 6
## 170 title data 6
## 171 wholesale credit 6
## 172 3 5 5
## 173 5 7 5
## 174 actionable insights 5
## 175 advanced techniques 5
## 176 analysis methods 5
## 177 analytics team 5
## 178 application development 5
## 179 bachelors degree 5
## 180 build predictive 5
## 181 building models 5
## 182 business outcomes 5
## 183 ca duration 5
## 184 collaborative team 5
## 185 complex business 5
## 186 data ability 5
## 187 data architectures 5
## 188 data collection 5
## 189 data experience 5
## 190 data extraction 5
## 191 data infrastructure 5
## 192 data structures 5
## 193 decision tree 5
## 194 demonstrated experience 5
## 195 distributed data 5
## 196 distributions statistical 5
## 197 engineering math 5
## 198 engineering mathematics 5
## 199 environment ability 5
## 200 existing models 5
## 201 experience manipulating 5
## 202 extensive experience 5
## 203 field e.g 5
## 204 financial technology 5
## 205 gathering techniques 5
## 206 genetic information 5
## 207 gradient boosting 5
## 208 health outcomes 5
## 209 highly motivated 5
## 210 holding company 5
## 211 hypothesis testing 5
## 212 identify opportunities 5
## 213 improve business 5
## 214 informed decisions 5
## 215 intelligence systems 5
## 216 languages python 5
## 217 linux environment 5
## 218 mining data 5
## 219 oral communication 5
## 220 paced environment 5
## 221 physics engineering 5
## 222 position summary 5
## 223 primary focus 5
## 224 programming experience 5
## 225 protected veteran 5
## 226 python 2 5
## 227 querying databases 5
## 228 random forests 5
## 229 reporting tools 5
## 230 required experience 5
## 231 required qualifications 5
## 232 reusable code 5
## 233 scale data 5
## 234 science computer 5
## 235 science machine 5
## 236 sr data 5
## 237 statistical machine 5
## 238 statistics regression 5
## 239 status disability 5
## 240 subject matter 5
## 241 technical expertise 5
## 242 technology teams 5
## 243 text mining 5
## 244 top tier 5
## 245 trees neural 5
## 246 verbal communication 5
## 247 visualization tools 5
## 248 written communication 5
## 249 12 months 4
## 250 6 months 4
## 251 accredited college 4
## 252 ad hoc 4
## 253 additional information 4
## 254 advanced machine 4
## 255 amazon web 4
## 256 analyses e.g 4
## 257 analysis skills 4
## 258 analytical models 4
## 259 analytics platforms 4
## 260 apache spark 4
## 261 application process 4
## 262 apply statistical 4
## 263 artificial neural 4
## 264 banking subsidiaries 4
## 265 basic qualifications 4
## 266 biostatistics applied 4
## 267 bs ms 4
## 268 build data 4
## 269 building statistical 4
## 270 business analysts 4
## 271 business challenges 4
## 272 business leaders 4
## 273 characteristic protected 4
## 274 click stream 4
## 275 clinical trial 4
## 276 cloud data 4
## 277 color national 4
## 278 computer vision 4
## 279 contract employees 4
## 280 conviction records 4
## 281 creating algorithms 4
## 282 credit data 4
## 283 cross validation 4
## 284 customer behavior 4
## 285 d3 js 4
## 286 data engineer 4
## 287 data engineers 4
## 288 data insights 4
## 289 data pipelines 4
## 290 data technologies 4
## 291 data tools 4
## 292 databases data 4
## 293 deep expertise 4
## 294 deep knowledge 4
## 295 deep understanding 4
## 296 degree preferred 4
## 297 degree required 4
## 298 depth knowledge 4
## 299 develop custom 4
## 300 draw insights 4
## 301 drive business 4
## 302 drive cloud 4
## 303 e.g python 4
## 304 e.g tableau 4
## 305 effective manner 4
## 306 electronic data 4
## 307 employer minorities 4
## 308 employment qualified 4
## 309 epidemiology statistics 4
## 310 excellent communication 4
## 311 existing data 4
## 312 experience preferred 4
## 313 experience visualizing 4
## 314 expert level 4
## 315 exploratory data 4
## 316 external data 4
## 317 fair chance 4
## 318 fast growing 4
## 319 feature selection 4
## 320 field preferred 4
## 321 fixed income 4
## 322 force productivity 4
## 323 fortune 500 4
## 324 google analytics 4
## 325 hadoop spark 4
## 326 hadoop stack 4
## 327 hadoop streaming 4
## 328 hyper parameters 4
## 329 information management 4
## 330 information technology 4
## 331 intellectual curiosity 4
## 332 java perl 4
## 333 job posting 4
## 334 key business 4
## 335 key findings 4
## 336 kraft heinz 4
## 337 l3 ads 4
## 338 leadership teams 4
## 339 life cycle 4
## 340 live experiences 4
## 341 manipulate data 4
## 342 map reduce 4
## 343 master’s degree 4
## 344 mathematical statistical 4
## 345 mathematics operations 4
## 346 mclean va 4
## 347 methods gradient 4
## 348 military service 4
## 349 million live 4
## 350 model monitoring 4
## 351 modeling clustering 4
## 352 modeling machine 4
## 353 models experience 4
## 354 monitor outcomes 4
## 355 months contract 4
## 356 multiple sources 4
## 357 nosql databases 4
## 358 origin sex 4
## 359 paid time 4
## 360 pandas numpy 4
## 361 physical demands 4
## 362 pig hadoop 4
## 363 practical experience 4
## 364 predictive models 4
## 365 prior experience 4
## 366 product managers 4
## 367 production systems 4
## 368 programming language 4
## 369 protected veterans 4
## 370 qualifications 2 4
## 371 qualifications bachelor’s 4
## 372 qualifications data 4
## 373 qualifications master's 4
## 374 quantitative analysis 4
## 375 race sex 4
## 376 receive consideration 4
## 377 reduction techniques 4
## 378 reference code 4
## 379 regression simulation 4
## 380 related technical 4
## 381 required masters 4
## 382 risk management 4
## 383 role requires 4
## 384 sales force 4
## 385 scenario analysis 4
## 386 science math 4
## 387 science team 4
## 388 science technology 4
## 389 science toolkits 4
## 390 series analysis 4
## 391 sets experience 4
## 392 sex age 4
## 393 simulation scenario 4
## 394 skills strong 4
## 395 software applications 4
## 396 software engineer 4
## 397 software engineers 4
## 398 software tools 4
## 399 solve business 4
## 400 solve complex 4
## 401 sql hive 4
## 402 sql nosql 4
## 403 stack hive 4
## 404 statistics biostatistics 4
## 405 statistics data 4
## 406 statistics economics 4
## 407 status sexual 4
## 408 stream data 4
## 409 strong analytical 4
## 410 strong business 4
## 411 strong technical 4
## 412 success starts 4
## 413 summary location 4
## 414 support vector 4
## 415 taboola newsroom 4
## 416 team environment 4
## 417 techniques including 4
## 418 technological solutions 4
## 419 translate business 4
## 420 trees random 4
## 421 trust safety 4
## 422 walmart international 4
## 423 wide variety 4
## 424 world’s largest 4
## 425 1 2 3
## 426 813 437 3
## 427 accommodation options 3
## 428 advanced analytical 3
## 429 advanced data 3
## 430 advanced skill 3
## 431 age color 3
## 432 algorithm design 3
## 433 algorithm development 3
## 434 america europe 3
## 435 analysis data 3
## 436 analysis document 3
## 437 analytic models 3
## 438 analytics efforts 3
## 439 analytics techniques 3
## 440 analyze model 3
## 441 analyzing data 3
## 442 application links 3
## 443 applications knowledge 3
## 444 applications strong 3
## 445 applied statistics 3
## 446 background bs 3
## 447 based environment 3
## 448 based insights 3
## 449 based models 3
## 450 bash scripting 3
## 451 bi solutions 3
## 452 boosting trees 3
## 453 build decision 3
## 454 business insights 3
## 455 business metrics 3
## 456 business objects 3
## 457 business questions 3
## 458 business results 3
## 459 business strategies 3
## 460 business teams 3
## 461 call 888 3
## 462 changing landscape 3
## 463 citizenship status 3
## 464 classical statistical 3
## 465 clean performant 3
## 466 cleaning preparing 3
## 467 client facing 3
## 468 clients including 3
## 469 coding knowledge 3
## 470 collaborative environment 3
## 471 common data 3
## 472 communicate results 3
## 473 company engages 3
## 474 competitive advantage 3
## 475 complex analytical 3
## 476 computer engineering 3
## 477 computing tools 3
## 478 concepts regression 3
## 479 conduct cross 3
## 480 conduct validation 3
## 481 continuously seek 3
## 482 corridor duration 3
## 483 credit models 3
## 484 dashboards reports 3
## 485 data applications 3
## 486 data based 3
## 487 data computing 3
## 488 data excellent 3
## 489 data flow 3
## 490 data i.e 3
## 491 data platform 3
## 492 data platforms 3
## 493 data preparation 3
## 494 data products 3
## 495 data quality 3
## 496 data related 3
## 497 data set 3
## 498 data transformation 3
## 499 debugging data 3
## 500 deeply understand 3
## # … with 8,137 more rows
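Since multi-word skills such as "machine learning" can't be caught by the single-word keyword join earlier, the bigram counts can be matched against a hand-picked list of multi-word skills. The list below is only an assumed example (output not shown):
# Sketch: pull assumed multi-word skills out of the bigram counts
multiword_skills <- c("machine learning", "artificial intelligence", "deep learning",
                      "data mining", "neural networks", "data visualization")
bigram_counts %>%
  unite(bigram, word1, word2, sep = " ") %>%
  filter(bigram %in% multiword_skills)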
Trigrams (three words occurring together) to see if there is anything meaningful in them. This appears to identify boilerplate text that occurs in many listings but isn't specific to the job itself.
jobs_df %>%
unnest_tokens(trigram, job_description, token = "ngrams", n = 3) %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word) %>%
count(word1, word2, word3, sort = TRUE)
## # A tibble: 6,394 x 4
## word1 word2 word3 n
## <chr> <chr> <chr> <int>
## 1 machine learning algorithms 18
## 2 equal opportunity employer 16
## 3 machine learning techniques 16
## 4 sexual orientation gender 14
## 5 orientation gender identity 12
## 6 affirmative action employer 10
## 7 data mining techniques 10
## 8 senior data scientist 10
## 9 data scientist location 9
## 10 machine learning models 9
## # … with 6,384 more rows
Graph the bigrams
# Visualizing bigrams
library(igraph)
# filter for only relatively common combinations
bigram_graph <- bigram_counts %>%
filter(n > 10) %>%
graph_from_data_frame()
bigram_graph
## IGRAPH 8fcaf22 DN-- 79 55 --
## + attr: name (v/c), n (e/n)
## + edges from 8fcaf22 (vertex names):
## [1] machine ->learning data ->scientist
## [3] data ->science computer ->science
## [5] data ->sets data ->analysis
## [7] data ->mining communication->skills
## [9] job ->description data ->analytics
## [11] data ->scientists equal ->opportunity
## [13] neural ->networks clinical ->data
## [15] data ->review learning ->algorithms
## + ... omitted several edges
Good graph here. Keep
library(ggraph)
set.seed(2020)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
Good graph here. Keep
set.seed(2021)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
Just a generic plot of the most common words, those that occur more than 25 times.
jobs_words <- jobs_words %>%
anti_join(stop_words)
## Joining, by = "word"
jobs_words %>%
count(word, sort = TRUE)
## # A tibble: 3,824 x 2
## word n
## <chr> <int>
## 1 data 97
## 2 experience 94
## 3 python 82
## 4 learning 74
## 5 skills 74
## 6 team 72
## 7 analysis 71
## 8 machine 70
## 9 scientist 70
## 10 science 69
## # … with 3,814 more rows
jobs_words %>%
count(word, sort = TRUE) %>%
filter(n > 25) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_col() +
xlab(NULL) +
coord_flip()
=== Just notes below
Education, R programming, Python, coding, Hadoop platform, SQL, database/coding, Apache Spark, machine learning and AI, data visualization, unstructured data, intellectual curiosity, business acumen, communication skills, teamwork
Communication, business acumen, data-driven problem solving, data visualization, programming, R, Python, Tableau, Hadoop, SQL, Spark, statistics, mathematics