I did the following:
- Loaded the file `job descriptors.csv.zip`, which contains 4,832,204 rows, including the text field `job_title`.
- Fit a topic model to the unique `mem_id` descriptions, so that each document was a single freelancer. `job_title` was always a single value for a unique `mem_id`, meaning no freelancer apparently changed their `job_title` for different jobs. This resulted in 215,980 unique job titles, although some of these could not be analyzed because, for instance, they contained only “.”.
- Saved the topic proportions by `mem_id` as `results/LDA_k12_job_title.csv`.
- Fit a second topic model to `job_title`, but aggregated by `proj_id`. Each document was the combined `job_title` text after grouping by `proj_id`. This resulted in 644,132 unique documents.
- Saved the topic proportions by `proj_id` as `results/LDA_k12_proj_id.csv`.
See `Fit_LDA.Rmd`, which has all of the code required to load the file `job descriptors.csv.zip`, process the text, fit the topic models, and output the results.
Because these files are so large, I wrote them to a temporary folder and zipped them before copying them to `results/` in the shared Dropbox folder.
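The two ways of forming documents described above can be sketched on a toy table. This is a minimal illustration, not the code in `Fit_LDA.Rmd`: the `bids` data frame and its values are invented, and it assumes that project-level documents are built by pasting together the `job_title` text of all bids on the same `proj_id`.

```r
# Toy stand-in for the bid-level data (values are invented for illustration).
bids <- data.frame(
  proj_id   = c(1L, 1L, 2L, 2L, 3L),
  mem_id    = c(10L, 11L, 10L, 12L, 11L),
  job_title = c("Web Designer", "PHP Developer", "Web Designer",
                "Translator", "PHP Developer"),
  stringsAsFactors = FALSE
)

# Freelancer-level documents: one row per unique mem_id.
# This relies on job_title being constant within each mem_id.
docs_by_member <- unique(bids[, c("mem_id", "job_title")])
stopifnot(!anyDuplicated(docs_by_member$mem_id))

# Project-level documents: paste together all job_title text within proj_id
# (assumed aggregation rule; the real preprocessing may differ).
docs_by_project <- aggregate(job_title ~ proj_id, data = bids,
                             FUN = paste, collapse = " ")
```

The `stopifnot()` call doubles as a check of the claim that no freelancer has more than one `job_title`; on the real data it would fail if that assumption did not hold.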
library("stm")
stm v1.3.5 successfully loaded. See ?stm for help.
Papers, resources, and other materials at structuraltopicmodel.com
load("fitted/k_search.rda")
plot(k_search)
Details on the K = 12 topics:
load("fitted/tmod12.rda")
plot(tmod12)
summary(tmod12)
A topic model with 12 topics, 209294 documents and a 466 word dictionary.
Topic 1 Top Words:
Highest Prob: expert, seo, sale, social media, admin, excel, work
FREX: seo, social media, work, retouch, digit market, compani, internet market
Lift: ad, advertis, adword, analyt, email market, googl adword, link build
Score: expert, seo, sale, social media, quantiti, admin, ppc
Topic 2 Top Words:
Highest Prob: writer, editor, copywrit, freelanc writer, content, journalist, experienc
FREX: writer, editor, copywrit, freelanc writer, content, journalist, experienc
Lift: cameraman, en, film, filmmak, law, writer, editor
Score: writer, editor, interpret, copywrit, freelanc writer, journalist, content
Topic 3 Top Words:
Highest Prob: consult, manag, engin, analyst, softwar, experi, databas
FREX: consult, manag, engin, analyst, softwar, databas, system
Lift: consult, internet, scienc, secur, system, analysi, analyst
Score: consult, manag, engin, analyst, civil, softwar, system
Topic 4 Top Words:
Highest Prob: translat, research, proofread, english, content writer, blogger, teacher
FREX: translat, research, proofread, english, content writer, teacher, articl writer
Lift: research, french, onlin, academ, arab, articl, articl writer
Score: translat, english, proofread, research, blogger, french, content writer
Topic 5 Top Words:
Highest Prob: design, artist, graphic, anim, 3d, video editor, print
FREX: design, artist, graphic, video editor, print, voic, game
Lift: actor, voic, 2d, 2d anim, 3d artist, 3d model, artist
Score: design, artist, graphic, 3d, anim, concept artist, video editor
Topic 6 Top Words:
Highest Prob: graphic design, develop, web design, websit, websit design, mobil, seo specialist
FREX: graphic design, web design, websit, seo specialist, founder, websit design, head
Lift: cutter, founder, graphic design, head, seo specialist, web design, websit
Score: develop, graphic design, web design, cutter, websit, websit design, mobil
Topic 7 Top Words:
Highest Prob: illustr, programm, assist, architect, execut, student, brand
FREX: illustr, programm, assist, architect, execut, student, brand
Lift: creativ director, illustr, master, technologist, architect, art, brand
Score: illustr, programm, architect, assist, architectur, visual, student
Topic 8 Top Words:
Highest Prob: php, wordpress, creativ, logo design, softwar develop, html, joomla
FREX: php, wordpress, softwar develop, html, joomla, magento, android
Lift: php, 5, access, android, app develop, applic develop, brochur
Score: php, wordpress, html, joomla, front end, css, mysql
Topic 9 Top Words:
Highest Prob: web develop, freelanc, senior, fashion, communic, ui, owner
FREX: freelanc, senior, owner, ux, sound, photographi, photo editor
Lift: art director, freelanc, industri, music compos, music produc, owner, photographi
Score: web develop, freelanc, textil, senior, fashion, ui, ux
Topic 10 Top Words:
Highest Prob: profession, account, busi, servic, director, support, project manag
FREX: profession, account, busi, servic, director, project manag, bookkeep
Lift: advisor, busi analyst, coach, coordin, corpor, event, event manag
Score: account, profession, busi, director, support, sap, servic
Topic 11 Top Words:
Highest Prob: web, market, photograph, specialist, video, audio, maker
FREX: web, market, photograph, specialist, video, video edit, audio
Lift: photograph, web, creation, generalist, make, market, mechan engin
Score: web, market, photograph, specialist, generalist, video, audio
Topic 12 Top Words:
Highest Prob: data entri, virtual assist, administr, write, data, offic, custom servic
FREX: virtual assist, administr, write, data, offic, custom servic, pa
Lift: offic, admin assist, admin support, administr assist, appoint, articl write, call
Score: virtual assist, data entri, administr, write, custom servic, clerk, offic
mem_id (by freelancer)
Load LDA_k12_job_title.csv. This file contains one set of topic proportions for each unique mem_id (209,294 rows in total, per the output below).
unzip("results/LDA_k12_job_title.zip", exdir = "~/tmp/")
LDA_k12_job_title <- data.table::fread("~/tmp/LDA_k12_job_title.csv", data.table = FALSE)
tibble::glimpse(LDA_k12_job_title, width = 90)
Rows: 209,294
Columns: 14
$ mem_id <int> 128342, 503693, 177138, 411986, 219778, 14128, 86906, 539230, 58975, …
$ theta_1 <dbl> 0.07308893, 0.10721996, 0.08948607, 0.20415807, 0.02732215, 0.1124007…
$ theta_2 <dbl> 0.075702142, 0.019463663, 0.080716024, 0.023394417, 0.003830533, 0.08…
$ theta_3 <dbl> 0.09336628, 0.03983987, 0.10311823, 0.04846664, 0.01433578, 0.0757593…
$ theta_4 <dbl> 0.061339687, 0.017868666, 0.072610576, 0.026829924, 0.003139403, 0.10…
$ theta_5 <dbl> 0.09302686, 0.04843737, 0.06836757, 0.03183601, 0.02363463, 0.0434636…
$ theta_6 <dbl> 0.09764756, 0.22147844, 0.07880496, 0.12536559, 0.13920830, 0.0597120…
$ theta_7 <dbl> 0.05943682, 0.03162716, 0.05138677, 0.02492144, 0.01204788, 0.0397462…
$ theta_8 <dbl> 0.06818606, 0.34183952, 0.05665602, 0.31244171, 0.71002458, 0.0478790…
$ theta_9 <dbl> 0.09897834, 0.05867481, 0.06594379, 0.04536958, 0.03635920, 0.0480426…
$ theta_10 <dbl> 0.15528114, 0.04166124, 0.18338070, 0.05595774, 0.01058258, 0.1266794…
$ theta_11 <dbl> 0.06931444, 0.04835895, 0.07024998, 0.05327263, 0.01529206, 0.0633435…
$ theta_12 <dbl> 0.054631748, 0.023530343, 0.079279321, 0.047986233, 0.004222914, 0.19…
$ max_topic <int> 10, 8, 10, 8, 8, 12, 8, 8, 6, 3, 12, 8, 8, 5, 8, 10, 8, 1, 8, 8, 8, 6…
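The relationship between the `theta_*` columns and `max_topic` can be checked on the first two rows shown above. The sketch below assumes that `max_topic` is simply the index of the largest topic proportion in each row; the report does not state this, but it is consistent with the values displayed. The `theta` matrix is a hand-copied stand-in (rounded to four decimals), not the real file.

```r
# First two rows of theta_1..theta_12 from the glimpse() output above,
# rounded to four decimals; a stand-in for the real LDA_k12_job_title table.
theta <- rbind(
  c(0.0731, 0.0757, 0.0934, 0.0613, 0.0930, 0.0976,
    0.0594, 0.0682, 0.0990, 0.1553, 0.0693, 0.0546),
  c(0.1072, 0.0195, 0.0398, 0.0179, 0.0484, 0.2215,
    0.0316, 0.3418, 0.0587, 0.0417, 0.0484, 0.0235)
)
colnames(theta) <- paste0("theta_", 1:12)

# Each row should be (approximately) a probability distribution over topics.
row_totals <- rowSums(theta)

# Index of the largest proportion in each row (ties go to the first topic).
max_topic <- max.col(theta, ties.method = "first")
max_topic
# 10 8
```

Both rows sum to 1 up to rounding, and the recovered indices match the first two `max_topic` values in the output (10 and 8).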
To merge this with the original data, just do a “left join” operation:
unzip("job descriptors.zip", exdir = "~/tmp/")
error 1 in extracting from zip file
Warning messages:
1: In readChar(file, size, TRUE) : truncating string with embedded nuls
2: In readChar(file, size, TRUE) : truncating string with embedded nuls
all_data <- data.table::fread("~/tmp/job descriptors.csv", data.table = FALSE)
all_data_LDA_job_title <- dplyr::left_join(all_data, LDA_k12_job_title, by = "mem_id")
dplyr::glimpse(all_data_LDA_job_title, width = 90)
Rows: 4,832,204
Columns: 17
$ bid_id <int> 404205, 416055, 699962, 823066, 824483, 1146456, 200210, 193618, 1812…
$ proj_id <int> 43215, 44698, 73522, 85889, 85967, 120260, 20982, 20168, 18487, 19720…
$ mem_id <int> 114272, 114272, 114272, 140195, 140195, 140195, 60468, 60468, 60468, …
$ job_title <chr> "Web/Graphic Design Adobe Photoshop cs3 ; cs4 ; cs5 ; Illustrator ; C…
$ theta_1 <dbl> 0.06715131, 0.06715131, 0.06715131, 0.04421541, 0.04421541, 0.0442154…
$ theta_2 <dbl> 0.04058958, 0.04058958, 0.04058958, 0.05566576, 0.05566576, 0.0556657…
$ theta_3 <dbl> 0.06431540, 0.06431540, 0.06431540, 0.05289841, 0.05289841, 0.0528984…
$ theta_4 <dbl> 0.03015566, 0.03015566, 0.03015566, 0.03772991, 0.03772991, 0.0377299…
$ theta_5 <dbl> 0.10882433, 0.10882433, 0.10882433, 0.22895669, 0.22895669, 0.2289566…
$ theta_6 <dbl> 0.16089062, 0.16089062, 0.16089062, 0.11142330, 0.11142330, 0.1114233…
$ theta_7 <dbl> 0.08201423, 0.08201423, 0.08201423, 0.13881545, 0.13881545, 0.1388154…
$ theta_8 <dbl> 0.14438100, 0.14438100, 0.14438100, 0.06390998, 0.06390998, 0.0639099…
$ theta_9 <dbl> 0.09925113, 0.09925113, 0.09925113, 0.09508420, 0.09508420, 0.0950842…
$ theta_10 <dbl> 0.09503692, 0.09503692, 0.09503692, 0.07527572, 0.07527572, 0.0752757…
$ theta_11 <dbl> 0.07636458, 0.07636458, 0.07636458, 0.06785218, 0.06785218, 0.0678521…
$ theta_12 <dbl> 0.03102524, 0.03102524, 0.03102524, 0.02817298, 0.02817298, 0.0281729…
$ max_topic <int> 6, 6, 6, 5, 5, 5, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,…
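The left join keeps every bid row, so any `mem_id` without fitted topic proportions ends up with `NA` in the `theta_*` and `max_topic` columns. Since the topic file has 209,294 rows while 215,980 unique job titles were reported, some such rows are presumably present in the merged data. The following is a small base-R sketch of that behavior on invented data; `bids` and `topics` are hypothetical stand-ins, and for a lookup table with unique keys the result should match `dplyr::left_join(bids, topics, by = "mem_id")`.

```r
# Toy bid-level table and a toy topic table (all values invented).
bids <- data.frame(
  bid_id = 1:5,
  mem_id = c(10L, 10L, 11L, 12L, 12L)
)
topics <- data.frame(
  mem_id    = c(10L, 11L),          # mem_id 12 is deliberately absent
  theta_1   = c(0.7, 0.2),
  theta_2   = c(0.3, 0.8),
  max_topic = c(1L, 2L)
)

# Left join on mem_id: keep every bid, in its original order.
idx    <- match(bids$mem_id, topics$mem_id)
joined <- cbind(bids, topics[idx, setdiff(names(topics), "mem_id")])
rownames(joined) <- NULL

# Bids whose freelancer has no fitted topics carry NA after the join.
n_unmatched <- sum(is.na(joined$max_topic))
```

Counting `n_unmatched` on the real merged table would show how many bids belong to freelancers whose titles could not be analyzed.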
proj_id (by project)
Load LDA_k12_proj_id.csv. This file contains one set of topic proportions for each unique proj_id (639,678 rows in total, per the output below).
unzip("results/LDA_k12_proj_id.zip", exdir = "~/tmp/")
LDA_k12_proj_id <- data.table::fread("~/tmp/LDA_k12_proj_id.csv", data.table = FALSE)
tibble::glimpse(LDA_k12_proj_id, width = 90)
Rows: 639,678
Columns: 14
$ proj_id <int> 1004, 1402, 1419, 1888, 2973, 3103, 3250, 3275, 5837, 6141, 6254, 649…
$ theta_1 <dbl> 0.02484302, 0.07454527, 0.10842039, 0.13170684, 0.15179779, 0.1022415…
$ theta_2 <dbl> 0.008886185, 0.066314688, 0.045267903, 0.042876092, 0.057985343, 0.05…
$ theta_3 <dbl> 0.03158616, 0.11964086, 0.19163514, 0.08205396, 0.12051238, 0.1104882…
$ theta_4 <dbl> 0.004995802, 0.058052048, 0.046520735, 0.044954254, 0.041141920, 0.10…
$ theta_5 <dbl> 0.11799575, 0.04937282, 0.02479040, 0.02938004, 0.03525608, 0.0224015…
$ theta_6 <dbl> 0.28167562, 0.05024897, 0.03793470, 0.03823611, 0.05463732, 0.0390909…
$ theta_7 <dbl> 0.04418074, 0.04041016, 0.02578611, 0.02733854, 0.02713814, 0.0223500…
$ theta_8 <dbl> 0.34311672, 0.03252080, 0.03162171, 0.02895976, 0.05523331, 0.0260378…
$ theta_9 <dbl> 0.07605664, 0.04853417, 0.03221491, 0.03755607, 0.04535646, 0.0414873…
$ theta_10 <dbl> 0.02739549, 0.31848332, 0.27610439, 0.32073269, 0.22182093, 0.1652494…
$ theta_11 <dbl> 0.03424946, 0.05759115, 0.04787108, 0.06026831, 0.10964360, 0.0563533…
$ theta_12 <dbl> 0.005018427, 0.084285755, 0.131832544, 0.155937332, 0.079476741, 0.25…
$ max_topic <int> 8, 10, 10, 10, 10, 12, 10, 12, 12, 10, 10, 10, 6, 8, 10, 10, 12, 8, 2…
To merge this with the original data, just do a “left join” operation:
all_data_LDA_proj_id <- dplyr::left_join(all_data, LDA_k12_proj_id, by = "proj_id")
dplyr::glimpse(all_data_LDA_proj_id, width = 90)
Rows: 4,832,204
Columns: 17
$ bid_id <int> 404205, 416055, 699962, 823066, 824483, 1146456, 200210, 193618, 1812…
$ proj_id <int> 43215, 44698, 73522, 85889, 85967, 120260, 20982, 20168, 18487, 19720…
$ mem_id <int> 114272, 114272, 114272, 140195, 140195, 140195, 60468, 60468, 60468, …
$ job_title <chr> "Web/Graphic Design Adobe Photoshop cs3 ; cs4 ; cs5 ; Illustrator ; C…
$ theta_1 <dbl> 0.031542269, 0.048840017, 0.091749654, 0.008246565, 0.029839193, 0.05…
$ theta_2 <dbl> 0.032127402, 0.007837322, 0.013990019, 0.046447252, 0.019700680, 0.01…
$ theta_3 <dbl> 0.023764586, 0.034962227, 0.032299515, 0.009772934, 0.028723194, 0.01…
$ theta_4 <dbl> 0.016029310, 0.005322921, 0.007722979, 0.017385866, 0.010632236, 0.00…
$ theta_5 <dbl> 0.245843668, 0.071529731, 0.067098941, 0.375921441, 0.363578252, 0.28…
$ theta_6 <dbl> 0.228765226, 0.245321525, 0.154724394, 0.079028017, 0.167235956, 0.11…
$ theta_7 <dbl> 0.078976426, 0.051136990, 0.029294372, 0.310453583, 0.140258542, 0.26…
$ theta_8 <dbl> 0.159050136, 0.374920646, 0.419285425, 0.018590715, 0.052087925, 0.04…
$ theta_9 <dbl> 0.056934308, 0.055500799, 0.089933186, 0.077227010, 0.069596126, 0.09…
$ theta_10 <dbl> 0.04420721, 0.05470092, 0.04604291, 0.02630976, 0.06065594, 0.0401919…
$ theta_11 <dbl> 0.07606562, 0.04019165, 0.03523336, 0.02612268, 0.04691210, 0.0437800…
$ theta_12 <dbl> 0.006693847, 0.009735248, 0.012625244, 0.004494180, 0.010779856, 0.01…
$ max_topic <int> 5, 8, 8, 5, 5, 5, 12, 12, 12, 12, 12, 2, 12, 2, 12, 12, 12, 12, 12, 1…