library("dplyr", warn.conflicts = FALSE)
unzip("job descriptors.csv.zip", exdir = "~/tmp/")
df_raw <- data.table::fread("~/tmp/job descriptors.csv", data.table = FALSE)
Group by worker ID (mem_id):
df_grouped <- df_raw %>%
  group_by(mem_id) %>%
  summarize(n_bids = n(),
            n_unique_job_title = length(unique(job_title)),
            job_title = head(job_title, 1))
`summarise()` ungrouping output (override with `.groups` argument)
# sort by descending number of bids
df_grouped <- arrange(df_grouped, desc(n_bids))
# verify that job_title is the same for each mem_id
stopifnot(all(select(df_grouped, n_unique_job_title) == 1))
# remove some intermediate variables
df_grouped <- select(df_grouped, -n_unique_job_title)
So we can see here that every unique mem_id (worker) had only one job title. (Good for us!)
Many workers placed only a single bid, but the median number of bids was 3.
summary(df_grouped$n_bids)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 1.00 3.00 22.37 11.00 14273.00
We can see the skew:
library("ggplot2")
ggplot(df_grouped, aes(x = n_bids)) +
  scale_x_continuous(breaks = c(1, 2, 5, 10, 25, 50, 100, 500, 1000, 10000),
                     trans = "log") +
  geom_histogram(aes(y = ..density..),
                 binwidth = .5, colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6655", colour = "blue")
So, four lunatics have apparently submitted more than 10,000 bids each:
# top 10 list
head(df_grouped, 10)
Also, six bidders had no description:
filter(df_grouped, is.na(job_title))
library("quanteda", warn.conflicts = FALSE)
Package version: 2.1.0
Parallel computing: 2 of 12 threads used.
See https://quanteda.io for tutorials and examples.
job_corpus <- filter(df_grouped, !is.na(job_title)) %>%
  corpus(text_field = "job_title", docid_field = "mem_id")
job_tokens <- tokens(job_corpus, remove_punct = TRUE, remove_symbols = TRUE,
                     padding = TRUE) %>%
  tokens_tolower()
# top terms
job_tokens %>%
  tokens_remove(c("", stopwords("en"))) %>%
  dfm() %>%
  textstat_frequency(n = 30)
We can refine this by finding some multi-word expressions:
job_mwe <- job_tokens %>%
  tokens_remove(stopwords("en"), padding = TRUE) %>%
  tokens_wordstem(language = "en") %>%
  textstat_collocations()
head(job_mwe, 30)
Now let’s turn the top 100 bigram collocations into single tokens:
job_tokens_mwe <- job_tokens %>%
  tokens_remove(stopwords("en"), padding = TRUE) %>%
  tokens_wordstem(language = "en") %>%
  tokens_compound(head(job_mwe, 100), concatenator = " ") %>%
  tokens_compound(phrase("word press"), concatenator = " ") %>%
  tokens_remove("")
job_dfmat <- dfm(job_tokens_mwe)
textstat_frequency(job_dfmat, n = 30)
Some of these occur very infrequently:
textstat_frequency(job_dfmat) %>%
  ggplot(aes(x = frequency)) +
  scale_x_continuous(breaks = 10^(0:6),
                     trans = "log") +
  geom_histogram(aes(y = ..density..),
                 binwidth = .5, colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6655", colour = "blue")
job_dfmat_trimmed <- dfm_trim(job_dfmat, min_termfreq = 10,
                              min_docfreq = .01, docfreq_type = "prop",
                              verbose = TRUE)
Removing features occurring:
- fewer than 10 times: 18,870
- in fewer than 2159.74 documents: 21,612
Total features removed: 21,612 (99.7%).
print(job_dfmat_trimmed, 0, 0)
Document-feature matrix of: 215,974 documents, 55 features (97.2% sparse) and 1 docvar.
library("ClusterR")
Loading required package: gtools
Optimal_Clusters_KMeans(as.matrix(job_dfmat_trimmed), 10, criterion = "AIC")
[1] 342019.5 322244.6 313601.9 300477.2 293531.6 275992.6 281111.3 270818.3 270928.3 268875.4
attr(,"class")
[1] "k-means clustering"
From this plot, it looks like k = 6 is the optimal number of clusters, so let’s fit that.
job_kmeans <- KMeans_arma(as.matrix(job_dfmat_trimmed), clusters = 6)
job_corpus$cluster <- job_dfmat$cluster <-
  predict_KMeans(as.matrix(job_dfmat_trimmed), job_kmeans)
Now we can look at the top terms in each cluster:
textstat_frequency(job_dfmat, n = 10, groups = "cluster") %>%
  as.data.frame() %>%
  select(-rank)
We could label the clusters like this:
cluster_label <- c("financial", "web development", "web designer",
                   "graphic designer", "developer", "admin/support")
tsf <- textstat_frequency(job_dfmat, n = 10, groups = "cluster")
tab <- data.frame(cluster = 1:6, cluster_label,
                  "Top terms" = sapply(split(tsf$feature, f = tsf$group),
                                       paste, collapse = ", "))
# knitr::kable(tab, format = "markdown")
tab
I used the stm package for this, which implements a faster and more modern version of LDA than the older packages. If you want to cite what it is, just call it “LDA”. (See https://www.structuraltopicmodel.com/ for details, citations, etc.)
Here I trimmed the (very) sparse dfm a bit less aggressively. Setting min_docfreq = .001 means that a term had to occur in at least 1/1000 of the documents to be retained, or in this case, at least 216 documents (since there are 215,974 documents).
job_dfmat_trimmed2 <- dfm_trim(job_dfmat, min_termfreq = 10,
                               min_docfreq = .001, docfreq_type = "prop",
                               verbose = TRUE)
Removing features occurring:
- fewer than 10 times: 18,870
- in fewer than 215.974 documents: 21,201
Total features removed: 21,201 (97.8%).
# remove the empty documents
job_dfmat_trimmed2 <- dfm_subset(job_dfmat_trimmed2, ntoken(job_dfmat_trimmed2) > 0)
print(job_dfmat_trimmed2, 0, 0)
Document-feature matrix of: 209,294 documents, 466 features (99.4% sparse) and 2 docvars.
Now we can fit a topic model.
library("stm")
stm v1.3.5 successfully loaded. See ?stm for help.
Papers, resources, and other materials at structuraltopicmodel.com
tmod <- stm(job_dfmat_trimmed2, K = 10, emtol = 1e-03)
Beginning Spectral Initialization
Calculating the gram matrix...
Finding anchor words...
Recovering initialization...
Initialization complete.
Completed E-Step (164 seconds).
Completed M-Step.
Completing Iteration 1 (approx. per word bound = -5.806)
Completed E-Step (99 seconds).
Completed M-Step.
Completing Iteration 2 (approx. per word bound = -5.678, relative change = 2.215e-02)
Completed E-Step (141 seconds).
Completed M-Step.
Completing Iteration 3 (approx. per word bound = -5.573, relative change = 1.839e-02)
Completed E-Step (130 seconds).
Completed M-Step.
Completing Iteration 4 (approx. per word bound = -5.498, relative change = 1.350e-02)
Completed E-Step (130 seconds).
Completed M-Step.
Completing Iteration 5 (approx. per word bound = -5.447, relative change = 9.316e-03)
Topic 1: writer, editor, photograph, softwar, data entri
Topic 2: web design, develop, translat, wordpress, research
Topic 3: develop, virtual assist, data entri, support, write
Topic 4: expert, manag, seo, php, data
Topic 5: illustr, profession, artist, account, translat
Topic 6: graphic design, copywrit, administr, blogger, produc
Topic 7: market, engin, anim, assist, pr
Topic 8: web develop, specialist, consult, senior, graphic
Topic 9: design, creativ, 3d, fashion, student
Topic 10: web, freelanc, consult, sale, softwar develop
Completed E-Step (98 seconds).
Completed M-Step.
Completing Iteration 6 (approx. per word bound = -5.409, relative change = 6.986e-03)
Completed E-Step (105 seconds).
Completed M-Step.
Completing Iteration 7 (approx. per word bound = -5.382, relative change = 4.927e-03)
Completed E-Step (95 seconds).
Completed M-Step.
Completing Iteration 8 (approx. per word bound = -5.360, relative change = 4.010e-03)
labelTopics(tmod)
Topic 1 Top Words:
Highest Prob: writer, editor, photograph, proofread, journalist, softwar, content writer
FREX: writer, photograph, editor, videograph, softwar, journalist, market consult
Lift: quantiti, charter account, cameraman, writer, data entri oper, videograph, photograph
Score: writer, quantiti, editor, photograph, proofread, journalist, softwar
Topic 2 Top Words:
Highest Prob: develop, web design, programm, websit, wordpress, director, logo design
FREX: web design, websit, project manag, interpret, programm, websit design, applic develop
Lift: interpret, en, busi analyst, self employ, project manag, law, websit
Score: interpret, develop, web design, programm, websit, project manag, websit design
Topic 3 Top Words:
Highest Prob: data entri, virtual assist, research, support, write, admin, excel
FREX: virtual assist, softwar engin, write, support, data entri, custom servic, copi
Lift: civil, electron, softwar engin, technic support, administr assist, structur, copi
Score: civil, virtual assist, data entri, research, support, write, softwar engin
Topic 4 Top Words:
Highest Prob: expert, manag, seo, php, data, analyst, architect
FREX: expert, mysql, architect, sem, experi, ppc, smm
Lift: surveyor, sem, codeignit, link build, smm, smo, cakephp
Score: surveyor, expert, seo, manag, php, mysql, data
Topic 5 Top Words:
Highest Prob: translat, illustr, profession, artist, account, english, digit
FREX: artist, english, french, spanish, account, bookkeep, teacher
Lift: concept artist, cartoonist, italian, french, spanish, chines, safeti
Score: concept artist, translat, artist, illustr, account, english, profession
Topic 6 Top Words:
Highest Prob: graphic design, copywrit, administr, blogger, produc, virtual, project
FREX: graphic design, copywrit, maker, administr, oper, founder, train
Lift: cutter, maker, founder, graphic design, copywrit, commerci, system administr
Score: graphic design, cutter, copywrit, administr, blogger, produc, maker
Topic 7 Top Words:
Highest Prob: market, engin, anim, assist, execut, pr, brand
FREX: market, assist, execut, anim, engin, pa, transcrib
Lift: technologist, assist, graduat, market, visualis, execut, mechan engin
Score: technologist, market, engin, assist, anim, execut, pr
Topic 8 Top Words:
Highest Prob: web develop, specialist, php, senior, wordpress, freelanc writer, product
FREX: senior, android, io, net, web develop, mobil, ui
Lift: front end, io, front-end, android, net, full, senior
Score: front end, web develop, net, php, io, specialist, android
Topic 9 Top Words:
Highest Prob: design, creativ, graphic, 3d, fashion, student, visual
FREX: fashion, 3d, cad, student, design, owner, logo
Lift: textil, industri, banner, 3d visualis, creativ director, cad, studio
Score: design, textil, 3d, creativ, fashion, cad, graphic
Topic 10 Top Words:
Highest Prob: freelanc, web, consult, sale, servic, social media, softwar develop
FREX: softwar develop, hr, web, solut, freelanc, technolog, social media
Lift: sap, hr, independ, internet, softwar develop, devlop, technolog
Score: sap, freelanc, web, consult, softwar develop, sale, social media
plot(tmod, n = 5)
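The df_raw2 written below is not created anywhere in this section. A hypothetical reconstruction of the missing step, assigning each worker the topic with the highest fitted proportion and merging it back onto the bid-level data (the topic_df name and the type.convert() call are my additions), might look like:
# attach each document's highest-probability topic and merge it back onto the
# raw bid-level data; workers dropped from the trimmed dfm get max_topic = NA
topic_df <- data.frame(mem_id = docnames(job_dfmat_trimmed2),
                       max_topic = max.col(tmod$theta))
topic_df$mem_id <- type.convert(topic_df$mem_id, as.is = TRUE)  # match the type of df_raw$mem_id
df_raw2 <- left_join(df_raw, topic_df, by = "mem_id")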
data.table::fwrite(df_raw2, "~/tmp/job descriptors LDA.csv")
Warning messages:
1: In readChar(file, size, TRUE) : truncating string with embedded nuls
2: In readChar(file, size, TRUE) : truncating string with embedded nuls
3: In readChar(file, size, TRUE) : truncating string with embedded nuls
The same analysis can't be done for proj_id (at least in the same way).
Why? Because mem_id and job_title are a 1:1 match, but proj_id to job_title is a 1:many match. job_title is an attribute of the bidder (worker), but proj_id seems to be an ID for the task for which workers are bidding. (Which suggests that job_title has a somewhat misleading name: it should be mem_description.)
Here’s the distribution of unique job_title values across proj_id:
df_grouped_projid <- df_raw %>%
  group_by(proj_id) %>%
  summarize(n_bids = n(),
            n_unique_job_title = length(unique(job_title)),
            job_title = head(job_title, 1))
`summarise()` ungrouping output (override with `.groups` argument)
summary(df_grouped_projid$n_unique_job_title)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 2.000 7.105 9.000 365.000
So while the median “project” had two distinct job_title values among its bidders, the mean was 7.1, and this went as high as 365! So the job_title text attached to a project varies across its bidders.
However, if you want, I could combine all of the job_title text for each proj_id and fit the model to that (a sketch of the combining step is below). There are 644,132 distinct projects. This would not cluster the projects into areas based on their own descriptions, but rather based on the self-descriptions of the workers who placed bids on each project.
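A minimal sketch of that combining step (I have not run this; df_by_proj and proj_corpus are names introduced here for illustration):
# concatenate the job_title text of every bidder on a project, one document per proj_id
df_by_proj <- df_raw %>%
  filter(!is.na(job_title)) %>%
  group_by(proj_id) %>%
  summarize(all_titles = paste(job_title, collapse = " "))
proj_corpus <- corpus(df_by_proj, text_field = "all_titles", docid_field = "proj_id")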
Should I do this?
I fit the k-means clusters as a robustness check, but the LDA fit is better, so I would use that. However, I chose K = 10 somewhat arbitrarily, and it would be better to choose it using the likelihood methods that you (Stephan) mentioned were used in a previous JM article (which was not attached to the email you sent).
So if you are OK with my preliminary steps, I will test the K value for the LDA (one way to do that is sketched below), and then fit the topics again with the optimal K.
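A minimal sketch of that K search (the candidate K values here are placeholders, not a recommendation; searchK() reports held-out likelihood and other diagnostics for each value):
# convert the trimmed dfm to stm's input format, then evaluate candidate K values
stm_input <- convert(job_dfmat_trimmed2, to = "stm")
k_search <- searchK(stm_input$documents, stm_input$vocab, K = c(5, 10, 15, 20))
plot(k_search)  # held-out likelihood, residuals, semantic coherence, lower bound by K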
Some job_title fields could not be predicted. There were 76,404 of these, and they are “missing” (NA) in the output .csv file.
table(df_raw2$max_topic, useNA = "ifany")
     1      2      3      4      5      6      7      8      9     10   <NA>
304882 910070 301355 580938 621102 284381 176788 732288 664459 179537  76404
Why not? Because they were junk. Here’s a glimpse of 50 of them: