This project scrapes job data from Glassdoor and uses it to build a personal job recommendation system. A separate script runs Parts I, II and V below, and that script is scheduled to run daily or weekly using Windows Task Scheduler (or, alternatively, the “taskscheduleR” package).
Part I of the script covers web scraping. First, I identify the URL(s) (encoding search key terms, location, salary, date range, etc.) to use as pointers for scraping. Each URL returns multiple posted jobs along with their associated job ids. Second, I extract information for each of these jobs, including salary, review, company name, URL, job id, etc. I then take the company names from these jobs and query company information (headquarters, company size, revenue, etc.) by searching on Google. Next, I clean the unstructured data set(s) (mostly by removing incomplete records and performing some stringr operations) and put everything together. This yields two data frames, jobDf and companyDf, which I merge by jobListingId. Any job with incomplete data (such as missing salary or company info) is removed, because this project relies on a complete data set for classification/recommendation.
Part II performs a simple change-data-capture (CDC). The steps above describe a simple ETL (extract, transform & load) process. Whenever the script runs on the scheduler (say, daily), a txt file for jobDf and a txt file for companyDf are saved locally, each with a timestamp in its file name. In addition, an aggregated data frame (saved as a single object in an .RDS file) is updated on every run. I never lose data: I keep every txt file from each run and also fold the new rows into the aggregated table, but instead of simply appending to the existing aggregated table, I update, delete and insert based on jobListingId and the latest company info. In other words, I do change-data-capture (CDC) for two simple reasons: 1) it avoids inserting the same job multiple times into the latest (aggregated) data frame, and 2) it updates job postings and company info with the latest data (the same job can be posted multiple times, and company info can change over time). When I am ready to analyze the data, I merge the aggregated jobDf and companyDf into a single, complete data frame (named jpr, for job-posting-recommendation) in which one row represents a single posted job.
Part III is where the analysis begins. First, I manually review the data I have collected, 200 jobs in this case. Each job is reviewed and flagged (1/0) according to whether I would be interested in it. I then load the flags (jobListingId, flag) back and merge them with the jpr data frame to form a new data frame, newDf. I clean the job descriptions and turn them into a document term matrix (dtm), split newDf into train and test sets, and combine the dtm with the other quantitative variables collected for each job: estimated salary, company review (from 0 to 5), company size, and revenue. I drop “industry” because the data set is not big enough: some industries that appear in the test set are missing from the train set, so a classification model built on the train set would fail when applied to the test set. Finally, I build a logistic regression model using these quantitative variables along with the dtm of job descriptions (with the criterion that only words appearing 10 times or more in the corpus of job descriptions are included in the final classification model). The results indicate sensitivity of 60%, specificity of 70%, precision of 28.6% and total accuracy of 68.3%. Other algorithms were also tried, including support-vector-machine (SVM) and Naive Bayes, but both performed very poorly and are not displayed here. A larger data set is recommended.
Part IV visualizes the job descriptions using word clouds. I create a single word cloud featuring jobs that I am interested in, and a comparison cloud that contrasts jobs I am interested in with jobs I am not. The visualization is satisfying; in particular, the comparison cloud teases out keywords from the two separate corpora.
Part V, finally, demonstrates how to send an email notification from R. In theory, I can classify jobs that I am potentially interested in (by applying the model to the jobs that I scrape daily or weekly), and then send myself an email notification with attachment(s). That scoring step is omitted here because I have not yet collected a data set large enough to develop a useful recommendation model. An example of setting up an email with an attachment is shown (without username or password displayed). In addition, I need to create a simple task.bat file to tell Windows where to find and run my script when setting up the Task Scheduler.
PART I : WEB SCRAPING - Step 1 : get links - Step 2 : get job info for each link - Step 3 : jobDf - Step 4 : companyDf
PART II : CHANGE-DATA-CAPTURE (CDC) - Step 5.1 : update (delete), insert - Step 5.2 : merge
PART III : ANALYSIS BEGINS! - Step 6 : review and flag jobs - Step 7 : document term matrix - Step 8 : split into train, test - Step 9 : classification - logistic regression
PART IV : WORD CLOUD - Step 10.1 : code - Step 10.2 : overall word cloud - Step 10.3 : comparison cloud
PART V : SET UP - Step 11 : email notification - Step 12 : Windows Task Scheduler
knitr::opts_chunk$set(echo = TRUE)
packages <- c("tidyverse", "rvest", "RCurl", "plyr", "wrapr", "Hmisc", "sqldf", "kableExtra", "tidytext", "tm", "e1071", "pROC", "wordcloud")
invisible( lapply(packages, function(x) library(x, character.only = T, verbose = F, quietly = T)) )
# identify the url(s) with the necessary parameters, such as job title, salary, distance, etc.
# data analyst, New York, 100k USD, past 7 days
da <- "https://www.glassdoor.com/Job/jobs.htm?sc.keyword=data%20analyst&locT=C&locId=1132348&locKeyword=New%20York,%20NY&jobType=fulltime&fromAge=7&minSalary=100000&includeNoSalaryJobs=true&radius=-1&cityId=-1&minRating=0.0&industryId=-1&companyId=-1&applicationType=0&employerSizes=0&remoteWorkType=0"
# business intelligence, New York, 100k USD, past 7 days
bi <- "https://www.glassdoor.com/Job/jobs.htm?sc.keyword=business%20intelligence&locT=C&locId=1132348&locKeyword=New%20York,%20NY&jobType=fulltime&fromAge=7&minSalary=100000&includeNoSalaryJobs=true&radius=-1&cityId=-1&minRating=0.0&industryId=-1&companyId=-1&applicationType=0&employerSizes=0&remoteWorkType=0"
# get the job links and jobListingId from the above urls
da_links <- read_html(da) %>%
# html_nodes(".compactStars , .small , .empLoc div , .jobLink") %>%
html_nodes(".jobLink") %>%
html_attr("href") %>%
na.omit %>%
as.data.frame %>%
dplyr::rename("links" = ".") %>%
dplyr::mutate( links = paste0("https://www.glassdoor.com", as.character(links)),
jobListingId = stringr::str_extract_all(links, pattern = "jobListingId=[0-9]+") %.>%
gsub(pattern = "jobListingId=", replacement = "", x = .) ) %>%
arrange(., jobListingId, desc(links)) %>%
# many of these urls point to the same job posting, so we need to remove the duplicates
group_by(jobListingId) %>%
dplyr::mutate( id = row_number() ) %>%
ungroup %>%
dplyr::filter(id == 1) %>%
select(links, jobListingId)
bi_links <- read_html(bi) %>%
# html_nodes(".compactStars , .small , .empLoc div , .jobLink") %>%
html_nodes(".jobLink") %>%
html_attr("href") %>%
na.omit %>%
as.data.frame %>%
dplyr::rename("links" = ".") %>%
dplyr::mutate( links = paste0("https://www.glassdoor.com", as.character(links)),
jobListingId = stringr::str_extract_all(links, pattern = "jobListingId=[0-9]+") %.>%
gsub(pattern = "jobListingId=", replacement = "", x = .) ) %>%
arrange(., jobListingId, desc(links)) %>%
# many of these urls point to the same job posting, so we need to remove the duplicates
group_by(jobListingId) %>%
dplyr::mutate( id = row_number() ) %>%
ungroup %>%
dplyr::filter(id == 1) %>%
select(links, jobListingId)
# append the df
links <- rbind(da_links, bi_links) %>% distinct # a data frame with unique job listing id and associated url
# create extract function
extract_job_info <- function(link, jobListingId) {
read_html(link) %>%
html_nodes(".nowrap, .ratingNum , .salEst , .desc , #HeroHeaderModule .strong") %>%
html_text %>%
as.data.frame %>%
dplyr::rename("job" = ".") %>%
t %>%
as.data.frame %>%
dplyr::mutate(jobListingId = jobListingId)
}
# must use purrr::possibly - fail-safe approach
extract_job_info <- purrr::possibly(extract_job_info, otherwise = NA_real_)
# create an empty list
jobList <- vector(mode = "list", length = nrow(links))
# scrape the job info and write each result into the list before compiling into a data.frame
system.time( lapply(1:length(jobList), function(x){
jobList[[x]] <<- extract_job_info(links$links[x], links$jobListingId[x])
} ) %.>% invisible(.) )
# remove NA result first
na <- lapply(1:length(jobList), function(x) is.na(jobList[x])) %>% unlist %>% which
stay <- c(1:length(jobList))[1:length(jobList) %nin% na]
jobList <- jobList[stay]
# there's missing data in different job postings
# some have only 4 or 5 columns, while most have the complete set (8 scraped fields plus jobListingId, i.e. 9 columns)
jobListSummary <- data.frame(row = 1:length(jobList), length = NA)
lapply(1:length(jobList), function(x) {
jobListSummary$length[x] <<- length(jobList[[x]])
}) %.>% invisible(.)
uniqueLength <- unique(jobListSummary$length)
lengthUniqueLength <- length(uniqueLength)
uniqueList <- vector(mode = "list", length = lengthUniqueLength)
for(i in 1:lengthUniqueLength){
uniqueList[[i]] <- sqldf::sqldf(
sprintf('select row from jobListSummary where length = %d', uniqueLength[i])
)
}
uniqueList <- lapply(1:length(uniqueList), function(x){
unlist(uniqueList[[x]]) %>% as.vector
})
# store each set of data into a data.frame by the length of column
jobListDf <- vector(mode = "list", length = lengthUniqueLength)
for(i in 1:lengthUniqueLength){
jobListDf[[i]] <- jobList[c(uniqueList[[i]])]
}
# jobListDf is now a list of lists (by length of column) in which each list contains a data.frame for each job
# str(jobListDf); summary(jobListDf)
# let's reduce it to one list per column length and make each one a data.frame (combining all jobs inside that list)
putTogether <- function(x){
plyr::ldply(x, data.frame)
}
jobListDf <- lapply(jobListDf, putTogether)
### CAUTION : there are multiple lists - some have 9 columns, whereas some have only 5 or 6 ###
# create and save (locally) a data frame for jobs
# let's just look at the complete set (9 columns)
# because the others are missing critical information that we need, e.g. review, salary
extractListNumber <- lapply(1:length(jobListDf), function(x){
length(jobListDf[[x]]) == 9}) %>%
unlist %.>%
which(. == T)
jobDf <- jobListDf[[extractListNumber]] %>%
dplyr::mutate( title = as.character(V1),
company = as.character(V2) %>%
stringr::str_trim(.),
last_updated = stringr::str_extract_all(V3, pattern = "Today|[0-9]+") %>%
unlist %.>%
ifelse(. == "Today", 0, .) %>%
as.numeric %.>%
lapply(1:length(.), function(x){
Sys.Date() -.[x]
}) %>%
unlist %.>%
as.Date(., origin = "1970-01-01"),
glassdoor_est = as.character(V4),
review = stringr::str_extract_all(V6, pattern = "[0-9]\\.[0-9]") %>%
as.numeric,
salary_est = gsub(pattern = ",", replacement = "", V7) %>%
stringr::str_extract_all(., pattern = "[0-9]+") %>%
unlist %>%
as.numeric,
salary_est = salary_est / 1000,
jobListingId = as.character(jobListingId),
description = as.character(V8),
run_time = Sys.time() ) %>%
tidyr::separate(., glassdoor_est, into = c("salary_min_est", "salary_max_est"), sep = "-") %>%
mutate( salary_min_est = stringr::str_extract_all(salary_min_est, pattern = "[0-9]+") %>%
unlist %>%
as.numeric,
salary_max_est = stringr::str_extract_all(salary_max_est, pattern = "[0-9]+") %>%
unlist %>%
as.numeric ) %>%
# inner join it back with "links" to get url pairing with jobListingId
dplyr::inner_join(., links, by = "jobListingId") %>%
dplyr::rename(url = links) %>%
# pull together a list of following columns
select(jobListingId, title, company, last_updated,
salary_min_est, salary_max_est, salary_est,
review, url, run_time, description) %>%
arrange(company, title, last_updated)
# setwd to data/job
currentwd <- getwd()
setwd("../"); setwd("data/job")
# save as txt
run_time <- jobDf$run_time %>% unique %.>% gsub(pattern = ":", replacement = "", x = .)
filename <- paste0("jobDf_", run_time, ".txt", sep = "")
write.table(jobDf, file = filename, sep = "\t", row.names = F, append = F)
# reset wd
setwd(currentwd)
# create and save (locally) a data frame for companies
# write a function to search via google for the url pointing to the glassdoor overview database for a company
# wish I could just get an API to query the Glassdoor/Indeed database
overview <- function(company){
entity = company
search = "glassdoor company overview"
url = URLencode(paste0("https://www.google.com/search?q=", paste0(entity, search, sep = " ")))
# return top pages result
cite <- read_html(url) %>%
html_nodes("cite") %>% # change any node you like, e.g. cite
html_text() %>%
as.data.frame(., stringsAsFactors = F) %>%
dplyr::mutate(row = row_number()) %>%
dplyr::rename(url = ".")
# sql hack - return only the "lower(url) like '%glassdoor%overview/working-at%'"
overview_cite_select <- sqldf("select url
from (
select url, min(row) as flag
from ( select url, row
from cite
where lower(url) like '%glassdoor%overview/working-at%' ) x
group by url
) y") %>% as.character
# scrape the company overview
overview <- read_html(overview_cite_select) %>%
html_nodes(".value , .website") %>%
html_text %>%
# dplyr::rename("overview" = ".") %>%
t %>%
as.data.frame %>%
dplyr::mutate(V0 = company) %>%
select(., starts_with("V"))
return(overview)
}
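# example usage (a hypothetical illustration; the exact fields returned depend on what Google surfaces for the company):
# overview("SeatGeek")   # should return a one-row data frame with V0 = the company name plus the scraped overview fields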
# must use purrr::possibly - fail-safe approach
overview <- purrr::possibly(overview, otherwise = NA_real_)
# start scraping from the list of companies from jobDf
companies <- vector(mode = "list", length = length(unique(jobDf$company)))
names(companies) <- jobDf$company %>% unique
companies <- lapply(1:length(companies), function(x) companies[[x]] <- names(companies)[x])
# system.time( companies <- lapply(1:length(companies), function(x) overview(companies[[x]])) )
system.time( companies <- purrr::map(1:length(companies), function(x) overview(companies[[x]])) ) # a little faster
# remove NA result first
remove <- lapply(1:length(companies), function(x) is.na(companies[x])) %>% unlist %>% which
keep <- c(1:length(companies))[1:length(companies) %nin% remove]
# extract only V0:V7 columns
extract <- function(x){
x <- dplyr::select(x, V0, V1, V2, V3, V4, V5, V6, V7)
return(x)
}
# must use purrr::safely - fail-safe approach
extract_safely <- purrr::safely(extract, otherwise = NA_real_)
# extract, combine into a single df, and then rename columns
companyDf <- purrr::map(companies[keep], extract_safely) %.>%
purrr::transpose(.)
companyDf <- companyDf$result %>%
plyr::ldply(., data.frame)
names(companyDf) <- c("company", "website", "headquarters", "size",
"founded", "type", "industry", "revenue")
# change Factor to Chr, use str_trim to remove white space
companyDf <- companyDf[, 1:8] # have to subset first; dplyr::select does not allow NA column
companyDf <- companyDf[complete.cases(companyDf), ] %>%
dplyr::mutate( website = as.character(website) %>% stringr::str_trim(.),
headquarters = as.character(headquarters) %>% stringr::str_trim(.),
size = as.character(size) %>% stringr::str_trim(.),
founded = stringr::str_extract_all(founded, "[0-9]+") %>% as.numeric,
type = as.character(type) %.>%
gsub(".*[Pp]rivate.*", "private", .) %.>%
gsub(".*[Pp]ublic.*", "public", .) %>%
stringr::str_trim(.),
industry = as.character(industry) %>% stringr::str_trim(.),
revenue = as.character(revenue) %>% stringr::str_trim(.),
run_time = Sys.time() )
# run complete.cases() again, b/c of some misplaced columns
# some scraped company overviews return 10 columns, and there's some mismatch when we manually pull the V1:V7 columns
# for example, a "founded" value can end up misplaced in the "type" column
# an imperfect and temporary solution is to run complete.cases() one more time to drop the NAs
companyDf <- companyDf[complete.cases(companyDf), ]
# setwd to data/company
setwd("../"); setwd("data/company")
# save as txt
run_time2 <- companyDf$run_time %>% unique %.>% gsub(pattern = ":", replacement = "", x = .)
filename2 <- paste0("companyDf_", run_time2, ".txt", sep = "")
write.table(companyDf, file = filename2, sep = "\t", row.names = F, append = F)
# reset wd
setwd(currentwd)
setwd("../"); setwd("data")
# load the _agg tables
jobDf_agg <- readRDS(file = "jobDf_agg.RDS")
companyDf_agg <- readRDS(file = "companyDf_agg.RDS")
#######################
# delete, insert, save
# job
jobListingId_agg <- jobDf_agg$jobListingId
jobListingId_now <- jobDf$jobListingId
del <- jobListingId_agg[jobListingId_agg %in% jobListingId_now]
jobDf_agg <- jobDf_agg %>%
dplyr::filter(jobListingId %nin% del) %.>% # delete
dplyr::bind_rows(., jobDf) # append
saveRDS(jobDf_agg, file = "jobDf_agg.RDS") # save
# company
company_agg <- companyDf_agg$company
company_now <- companyDf$company
del2 <- company_agg[company_agg %in% company_now]
companyDf_agg <- companyDf_agg %>%
dplyr::filter(company %nin% del2) %.>% # delete
dplyr::bind_rows(., companyDf) # append
saveRDS(companyDf_agg, file = "companyDf_agg.RDS") # save
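# optional sanity check (a sketch, not part of the original script): after the upsert,
# each jobListingId / company should appear at most once in its aggregated table
stopifnot(!any(duplicated(jobDf_agg$jobListingId)))
stopifnot(!any(duplicated(companyDf_agg$company)))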
#######################
# reset wd
setwd(currentwd)
### CAUTION : there is always missing data in some companies' overviews ###
# let's merge (instead of left join) jobDf_agg and companyDf_agg together
jpr <- merge(jobDf_agg, companyDf_agg, by = "company") %>%
# filter job titles that are not relevant
dplyr::filter(title %in% title[!grepl("(architect|software|engineer|director)", title, ignore.case = T)]) %>%
# filter by job posting last_updated within past 14 days
# dplyr::filter(last_updated > Sys.Date() -14) %>%
# filter out duplicated jobListingId - this happens b/c same job with same Id can have different urls
group_by(jobListingId) %>%
dplyr::mutate(row = row_number()) %>%
ungroup %>%
dplyr::filter(row == 1) %>%
select(jobListingId, title, company, last_updated, salary_min_est, salary_max_est, salary_est, review, url,
website, headquarters, size, founded, type, industry, revenue, description) %>%
arrange(company, title, last_updated)
# jpr %>% arrange(., desc(last_updated)) %>% View
# output into txt for manual review
# write.table(jpr, file = "output.txt", row.names = F)
# manually go through the short list (200 jobs) - flag those that I am interested in
# merge it back and create a new data frame (newDf) for modeling
# I saved the flags and pushed them to my GitHub account - a csv file that contains 200 jobs
newDf <- read.table("https://raw.githubusercontent.com/myvioletrose/job_posting_recommendation/master/data/newDf.csv", header = T, stringsAsFactors = T)
newDf <- newDf %>% dplyr::mutate(description = as.character(description))
str(newDf)
## 'data.frame': 200 obs. of 8 variables:
## $ jobListingId: num 1.52e+09 2.20e+09 2.49e+09 2.50e+09 2.60e+09 ...
## $ flag : int 0 0 0 1 0 1 1 0 0 1 ...
## $ salary_est : num 141 142 112 89 131 ...
## $ review : num 4.6 4.7 3.6 4.8 2.6 3.3 3.4 4.1 4.1 4.4 ...
## $ size : Factor w/ 7 levels "1 to 50 employees",..: 4 4 5 4 7 3 3 3 3 2 ...
## $ industry : Factor w/ 37 levels "Accounting","Advertising & Marketing",..: 7 22 22 22 32 8 23 17 17 6 ...
## $ revenue : Factor w/ 12 levels "$1 to $2 billion (USD) per year",..: 10 12 5 12 10 4 12 12 12 3 ...
## $ description : chr "WHO WE LOOK FOR An SEI Consultant is a master communicator and active listener who understands how to navigate "| __truncated__ " SeatGeek operates a unique business model in a complicated, opaque market. Many of the hardest problems we fac"| __truncated__ " Ref#: R0007641 Additional Posting Location(s): Austin, Stamford Our mission. As the world’s number 1 job sit"| __truncated__ "Data-driven decision making has always been important at SeatGeek. Our Analytics team helps drive strategic dec"| __truncated__ ...
head(newDf) %>% kable %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
| jobListingId | flag | salary_est | review | size | industry | revenue | description |
|---|---|---|---|---|---|---|---|
| 1518289779 | 0 | 141.000 | 4.6 | 201 to 500 employees | Consulting | $50 to $100 million (USD) per year | WHO WE LOOK FOR An SEI Consultant is a master communicator and active listener who understands how to navigate an audience. Self-aware, almost to a fault, SEI consultants keenly understand how to adjust their approach based on the situation. Following a logical, fact-based approach, our consultants possess the superior ability to see correlations others may not, ask the right questions and drive solutions. As super-connectors, our consultants connect not only people, but data, trends and experiences. Mature, humble, and genuine, SEI Consultants frequently go above and beyond for both their clients and their colleagues. SEI Consultants are ethical and trustworthy individuals who do what they say. SEI Consultants have an insatiable curiosity and love to learn. These individuals are commonly tech savvy and early adopters. Their passion for learning is infectious and excites others. As every project is different, an SEI Consultant must be adaptable and comfortable with unexpected situations. An SEI Consultant must be at ease with ambiguity because although a client knows that a problem exists – they need SEI to figure it out and drive a solution. SEI Consultants define ambition differently. SEI Consultants are authentic, low-maintenance individuals who like to hang out with colleagues outside of work. Whether it be cooking, traveling, hiking, or volunteering, SEI Consultants enjoy working with genuine, thoughtful folks who want to steer clear of the traditional grind and share the joy of day-to-day life and activities with colleagues, friends and family. WHAT WE DO Our consultants work with clients at all levels of the organization, from the C-suite to the shop floor, helping them to deliver on their most strategic initiatives. We’re known for making realistic, data-driven decisions that deliver value in tangible ways to our clients. Our clients ask for us on projects that require a superior combination of technical and business capabilities, people and management skills, and a collaborative mindset. We excel in understanding complex programs and strategic initiatives and breaking them into actionable pieces. We work across a variety of industries and business functions and provide depth and breadth of experience across a core set of service areas: Agile Transformation and DeliveryAnalytics and Business IntelligenceBusiness Process & TransformationOrganizational Change ManagementInformation Management & Data GovernanceProgram & Project Leadership & ExecutionStrategic PlanningVendor & Technology SelectionUser Experience (UX) A career at SEI extends well beyond providing great service and thought leadership to our clients. Everyone takes an active role in building and managing our business, in an environment that runs counter to traditional consulting firms. Our consultants have a “seat at the table” and contribute to growing our business in ways that align to their interests such as growing business development opportunities, conducting interviews to support our hiring process, managing internal initiatives that build our brand or organizing trainings to share what you know with your colleagues. There is no telling what an SEI Consultant will be asked to do on a day-to-day basis – we do what it takes to get the job done. 
QUALIFICATIONSRequiredDemonstrated business acumenProven track record of delivering resultsExperience working with and/or leading a teamAbility to work independentlyAbility to work across industries, roles, functions & technologiesPositive can-do attitudeAuthorization for permanent employment in the United States (this position is not eligible for immigration sponsorship)PreferredBachelor’s degreeMinimum 8+ years professional experienceConsulting experienceExperience across our service offeringsProfessional certifications (e.g. PMP , Scrum, Agile) |
| 2204974404 | 0 | 141.757 | 4.7 | 201 to 500 employees | Internet | Unknown / Non-Applicable | SeatGeek operates a unique business model in a complicated, opaque market. Many of the hardest problems we face have never been tackled at scale and do not have clear questions, let alone answers. Moving forward requires critical thinking, rapid prototyping, and intellectual dexterity. Our team members have varied backgrounds including an expert on natural language processing, a neuroscientist, a former math teacher, and a mathematician who previously specialized in traffic flow optimization. We share common views on experimental rigor, pragmatism, and software quality. We want someone to join us who shares our excitement at providing data services to our colleagues and customers, someone proficient with at least one general-purpose programming language and who knows its scientific stack. We want someone who can take a messy dataset and make it clean and who can take a clean dataset and make it sing. Last but not least, we want someone who’s committed not only to bettering themselves but to bettering their team, someone who values and invests in knowledge share and open communication.What you’ll doAs a member of the SeatGeek data science team you will take the complex issues facing the business and make them simple. We aim to find meaning in the data we have, go out and get the data we don’t. We leverage technology whenever possible, and we aim to build systems that anticipate the needs of tomorrow as well as solving the problems of today. Here are some things you might work on: Design and implement statistical tests for new KPIs in our A/B testing frameworkMine user interaction data for operational efficiencies and product improvements that could save our customers time and boost our bottom lineDeliver a talk to engineers on your favorite Scala features or a class on epistemology to the rest of the businessEstimate how adding a hundred thousand tickets to our inventory or changing how we allocate marketing resources will impact revenueUse machine learning to identify when a user has a problem before they contact usBuild a model to predict the probability a particular ticket listing will sell at a given priceWhat we’re looking forYou have a passion for problem solving, experience working on open-ended projects, and a proven ability to come up with creative, elegant solutions to complex issues. Experience with specific tools is less important than aptitude and drive, but at a minimum we would expect: 3-5 years of academic or professional experience in a quantitative roleExperience translating business problems into data problems and solving themComfort turning ideas into code (bonus points for experience with Python or Scala)Commitment to creating and sharing reproducible analysisA passion for learning and teaching others Bonus points for candidates who have experience with or desire to learn any of the following: Streaming data (Reactive Extensions, Spark Streaming, Akka-streams, Kafka, RabbitMQ)AWS infrastructure (we use Redshift, S3, EMR, Kinesis, Lambda, and RDS)Building distributed software (especially on top of Spark, Hadoop, etc.) in a production environmentUtilizing applied statistics or machine learning on large, complex, noisy datasetsScaling and turning statistical models into production-ready applications such as recommender systems The Tools We UseWe do research and development work in a custom environment optimized for repeatability and collaboration. 
You absolutely do not need experience with all of these, but we thought you might be curious. Languages: Python for web services and product devlopment, R for analysis and prototypingDatastores: MySQL, Redshift, Elasticsearch, RedisMonitoring: Graphite/StatsDVersion control: GitPerks:A laid-back, fun workplace designed to facilitate collaboration and company wide events$120/mo to spend on live events ticketsA superb benefits package that supports health/dental/visionA focus on transparency. We have regular team lunches and Q&A panels where employees can chat openly with teams across SeatGeek, our co-founders, and external guests from the industryAnnual subscriptions to Citibike, Spotify, and meditation services Come Join Us!SeatGeek is committed to providing equal employment opportunities to all employees and applicants for employment regardless of race, color, religion, creed, age, national origin or ancestry, ethnicity, sex, sexual orientation, gender identity or expression, disability, military or veteran status, or any other category protected by federal, state, or local law. As an equal opportunities employer, we recognize that diversity is a positive attribute and we welcome the differences and benefits that a diverse culture brings. |
| 2494229336 | 0 | 112.000 | 3.6 | 5001 to 10000 employees | Internet | $2 to $5 billion (USD) per year | Ref#: R0007641 Additional Posting Location(s): Austin, Stamford Our mission. As the world’s number 1 job site, our mission is to help people get jobs. We need talented, passionate people working together to make this happen. We are looking to grow our teams with people who share our energy and enthusiasm for creating the best experience for job seekers. The team. We are builders, we are integrators. Business Intelligence creates and optimizes solutions for a rapidly growing business on a global scale. We work with distributed infrastructure, petabytes of data, and billions of transactions with no limitations on your creativity. With tech hubs in Seattle, San Francisco, Austin, New York, Dublin, Tokyo and Hyderabad, we are improving people’s lives all around the world, one job at a time. Your job.As the Business Intelligence Manager for Finance Strategy and Analysis at Indeed, you will have a direct impact on the business by partnering with Finance on their reporting and analytic initiatives to support all business functions. You know how Finance teams work and how to empower them with data and solutions. You understand that the best managers serve their teams, removing roadblocks and giving individual contributors autonomy and ownership. You have delivered challenging technical solutions. You have led business intelligence teams and earned the respect of talented analysts. You are equally happy talking algorithms and data structures as you are brainstorming about topics such as business opportunities and career development. You want to be in the mix technically while providing leadership to your teams. As our organization grows, you will help ensure we keep the talent bar high, and grow the skills and capabilities of our team. Responsibilities Provide technical and team leadership and management for Indeed’s Finance business intelligence team supporting the Finance Strategy and Analysis organizationLead business intelligence platform integration and maintenance, defining and building reports and dashboardsHelp boost company performance through revenue or efficiency gains, as a result of the BI Team’s effortsHave detailed knowledge of Indeed’s data, and processes - including forecastingWork closely with internal stakeholders to implement solutions that meet their needs, while making sure the priorities for your team are properly aligned to business needsCollaborate with BI leaders across Indeed to drive business performance locally and globallyEmpower members of your team to grow their skillset and advance their career goals About you. About you.Requirements 4+ years experience leading BI teamsExperience leading cross functional teams to deliver business value through dataExperience analyzing high volumes of data for multiple purposes in a dynamic environmentTrack record in driving a data driven culture across business unitsPrevious or current Python programming experienceStrong numeracy and algebraic skillsAbility to simultaneously manage large projects and manage rapid delivery of requestsSolid understanding of data structures and algorithmsStrong ability to coach analysts, helping them improve their skills and grow their careersExperience with business intelligence tools, e.g. TableauFinance Strategy, analysis and forecasting experience a distinct advantage |
| 2498217759 | 1 | 89.000 | 4.8 | 201 to 500 employees | Internet | Unknown / Non-Applicable | Data-driven decision making has always been important at SeatGeek. Our Analytics team helps drive strategic decision-making by providing meaningful business insights from data. We’re looking for a Data Analyst to join our growing team who can ensure the most relevant information is surfaced to each team to allow for well informed decisions. If you’re a well-rounded, business-savvy, technically capable analyst with SQL chops and a knack for figuring out complex systems fast, we’d love to talk with you.Who you are You love digging through mounds of data, but your understanding of what is relevant and your ability to communicate results in a concise way separates you from the pack. You look forward to open discussions around the “gray” area of business decisions, but aim to narrow that area with data driven insights as much as possible. You take pride in proactively providing data, analysis and findings in a collaborative manner. You also have many of the following: 2 to 4 years of relevant experience analyzing large and complex data sets, working with relational databases and writing SQLa degree in Mathematics, Economics, Statistics or another quantitative fieldadvanced working knowledge of SQL, perhaps with a side of Python or Ra keen sense of design, both in data architecture and dashboard presentation.knowledge of the transactional marketplace business model or consumer facing e-commerce products (bonus for mobile-first!)What you’ll dohelp drive strategic decision-making through detailed analysis, insights and reporting on key performance indicators (KPIs) and financial resultsinvestigate positive and negative trends in a wide array of business processescollaborate with departments across the company to understand needs while effectively prioritizing incoming requestswork with engineers and other source data experts to clean up messy data setscontinually improve visibility into benchmarks across the companycreate and maintain world class dashboards and data models in our BI platform, Looker Job Perks A culture that places the product first. We are a technology company at heart, and are proud of the idea that great technology drives great user experienceA laid-back, fun workplace designed to facilitate collaboration and company wide events$120/mo to spend on tickets to live eventsA superb benefits package, including full health/dental/visionA focus on transparency. We have regular team lunches and Q&A panels where employees can chat openly with teams across SeatGeek, our co-founders, and external guests from the industryHackathons: scheduled times when everyone drops what they’re doing and builds cool stuff in small groups Come Join Us!SeatGeek is committed to providing equal employment opportunities to all employees and applicants for employment regardless of race, color, religion, creed, age, national origin or ancestry, ethnicity, sex, sexual orientation, gender identity or expression, disability, military or veteran status, or any other category protected by federal, state, or local law. As an equal opportunities employer, we recognize that diversity is a positive attribute and we welcome the differences and benefits that a diverse culture brings. |
| 2597660817 | 0 | 131.000 | 2.6 | 51 to 200 employees | Publishing | $50 to $100 million (USD) per year | SheKnows Media is a top womens lifestyle digital media company that averages 60 million unique visitors per month (comScore) and reaches more than 300 million social media fans and followers. The company operates a family of leading media properties that include SheKnows.com, BlogHer.com, STYLECASTER.com and HelloFlo.com. The Job: This position will be a key resource in the companys growing Business Intelligence & Analytics team. This role will own and lead the building, updating, and maintaining of key weekly/monthly reports and dashboards supporting the CEO, CFO, and various executive stakeholders. The role will work closely with Business Intelligence team members and other internal stakeholders to collect, validate, and visualize data from various sources. The candidate is expected to be highly proficient in Excel, able to quickly cut and dice data, and comfortable learning how new data sets relate to business concepts and objectives. As a Reporting Analyst, your role requires great attention to detail and great comfort level working with large and complex quantitative datasets. The expectation is to produce accurate and meaningful metrics to deliver high quality and actionable insights to the business. The Work:Develop, reconcile, and distribute weekly, monthly, and ad hoc reports ensuring complete accuracy.Data extraction and transposition from web-based systems (e.g. web analytics, DFP, ad server, order management, CRM, Looker, Redshift, and various other internal and external tracking and reporting UI systems) to Microsoft Excel.Visualize and report on data in accessible and meaningful ways to internal business units.Take initiative to lead prototyping and testing of automated reports from SQL data warehouse.Stay on top of new features/developments for additional data gathering opportunities.Data mining and analysis. Drawing actionable and meaning business insights from large datasets.The Fine Print:BS/BA in a quantitative or analytical field3-5 years working with data in a similar business intelligence, reporting, or analytical role.Outstanding Excel skills required.Basic knowledge of relational database concepts, SQL, and financial statements is preferred.Prior experience with some of the following tools preferred: Tableau, Looker (or other dashboarding tools), Google or Adobe Analytics, SalesForce (or other CRM platform), DFP/OAS (or other ad servers), DSM or Operative, comScore/Nielsen/Quantcast, NetSuite or other accounting softwareAble to work in a fast-paced environment and manage multiple priorities and deadlines.Strong structured problem solving, attention to detail and organizational skills.Tech savvy and able to research and recommend tools and technologies to solve problems.Ability to quickly learn new tools, concepts and reporting processes is extremely important.About Us With a mission of women inspiring women, SheKnows Media is revolutionizing the publishing industry by forging a new kind of model that seamlessly integrates users, editors and content creators onto a single platform designed to empower all women to discover, share and create. Whether its parenting or pop culture, fashion or food, DIY or décor, our award-winning editorial team, Experts, bloggers and social media influencers produce authentic and on-trend content every day. We dig deep to learn what makes our audience tick, revealing unexpected insights on women and digital media. 
Our robust, end-to-end suite of premium branded content and influencer marketing solutions generate nearly 1 billion ad impressions per month (sources: DFP), allowing brands to distribute authentic content and integrated advertising at scale. SheKnows Media offers a competitive compensation and benefits package with an exciting, fresh and engaging work environment. Join us, and love where you work! |
| 2625550835 | 1 | 126.000 | 3.3 | 1001 to 5000 employees | Consumer Product Rental | $100 to $500 million (USD) per year | About Rent the Runway: Recently named #9 on CNBC’s Disruptor 50 list for 2018, Rent the Runway is transforming the $2.4 trillion global fashion industry by introducing clothing rental as a utility for women. We have pioneered the closet in the cloud and believe that every person globally will soon have a subscription to fashion. Since our launch in late 2009, the company has raised more than $210 million from top-tier investors and built one of the most beloved brands on earth. We are proud to be both a profitable and high-growth business, with a loyal 9 million members who believe that rental is the future. Our 1200+ employees have a revolutionary spirit that permeates our culture. We’ve built proprietary technology, a one-of-a-kind reverse logistics operation, stores of the future, a viral brand, relationships with hundreds of fashion brands - and we are obsessed with continuing to game change our customer experience. We are also trying to revolutionize entrepreneurship itself - proving that diverse teams produce outsized impact. The Rent the Runway Foundation, which our two co-founders launched together in 2015, helps thousands of female entrepreneurs build and scale their own businesses with the mission of increasing the number of high growth women-led companies. About the Position: Rent the Runway has a complex data ecosystem that spans web and infrastructure logs, transactions and recurring subscription behavior, and tracking each inventory unit and throughput of our home grown reverse logistics rental operations. Data systems power critical business decisions - from inventory assortment to operational efficacy, from marketing actions to personalized website features and customer experience enhancements - and we want to scale the data science function across all these vectors. In this role, you would:Mine the data ecosystem & find fruitful signalsDevelop and deploy performant algorithms powering applications Examples of applications include consumer-facing recommendation engines, inventory allocation systems, and internal apps for inventory buying and customer experienceDo bleeding-edge work including optimizing how products fit customers (unsolved as of yet), using AI techniques for inventory buying, and real-time streaming/model computations.You should have:Strong academic background in Computer Science, Mathematics, Physics or a related quantitative/engineering fieldExperience developing software or data products, particularly in Python, our go-to language. We also use packages like R for modeling efforts. Java is main language for our back-end systems.Passion for data and its ability to drive serious business impactAbility to work independently and take responsibility in a function that is growing in strength but is also core to the company’s growth |
# create a description data frame for text mining
description <- newDf %>% dplyr::select(jobListingId, flag, description)
table(description$flag) # I am interested in 60 out of 200 jobs scraped from glassdoor
##
## 0 1
## 140 60
### the goal is to combine quantitative variables (like estimated salary, company revenue) with the text description of a job to predict my potential interest in a job when it gets posted!
### cleanup steps ###
# first put the corpus in tm format
descriptionClean <- Corpus(VectorSource(description$description))
# clean up
descriptionClean <- tm_map(descriptionClean, content_transformer(tolower))
descriptionClean <- tm_map(descriptionClean, removeWords, stopwords())
descriptionClean <- tm_map(descriptionClean, stripWhitespace)
descriptionClean <- tm_map(descriptionClean, removePunctuation)
# convert it into a dtm (row per document, column per word)
dtm <- DocumentTermMatrix(descriptionClean)
# inspect(dtm)
# set frequency filter, i.e. only include words that appear f or more times in the whole corpus
f = 10
features <- findFreqTerms(dtm, f)
# set index : split by 70% vs 30%
set.seed(1234)
index <- sample(1:dim(description)[1], .7 * dim(description)[1])
# step 1 : split original corpus into train and test sets, each set contains the "flag" (dependent variable)
train_step_1 <- description[index, ]
test_step_1 <- description[-index, ]
# step 2 : dummify the "term" (or word) columns
train_step_2 <- descriptionClean[index] %>%
DocumentTermMatrix(., list(global = c(2, Inf), dictionary = features)) %>%
apply(MARGIN = 2, function(x) x <- ifelse(x >0, 1, 0)) %>%
as.data.frame
test_step_2 <- descriptionClean[-index] %>%
DocumentTermMatrix(., list(global = c(2, Inf), dictionary = features)) %>%
apply(MARGIN = 2, function(x) x <- ifelse(x >0, 1, 0)) %>%
as.data.frame
# step 3 : put step 1 and 2 together
train <- cbind(flag = factor(train_step_1$flag),
jobListingId = train_step_1$jobListingId,
train_step_2) %>% as.data.frame
test <- cbind(flag = factor(test_step_1$flag),
jobListingId = test_step_1$jobListingId,
test_step_2) %>% as.data.frame
# FINAL step : merge back with newDf
# drop "industry" because we haven't collected enough data
# the train data set is missing some industries that appear in the test data set
newDf_train_subset <- newDf[index, ] %>% select(-c(description, flag, industry))
newDf_test_subset <- newDf[-index, ] %>% select(-c(description, flag, industry))
train <- merge(train, newDf_train_subset, by = "jobListingId") %>% dplyr::select(-jobListingId)
test <- merge(test, newDf_test_subset, by = "jobListingId") %>% dplyr::select(-jobListingId)
# build a model
fit_lr <- glm(flag ~., train, family = "binomial") # summary(fit_lr)
# fit a prediction
fit_lr_pred <- predict(fit_lr, newdata = test[, -1], type = "response")
# classification outcome
ftable(test$flag, fit_lr_pred > 0.5) -> table_lr
table_lr
## FALSE TRUE
##
## 0 35 15
## 1 4 6
table_lr %>% prop.table(., margin = 1)*100 -> accuracy_lr
round(accuracy_lr, 1)
## FALSE TRUE
##
## 0 70 30
## 1 40 60
tp <- 6
fp <- 15
tn <- 35
fn <- 4
sensitivity <- tp / (tp + fn) # equivalent to recall
specificity <- tn / (tn + fp)
precision <- tp / (tp + fp)
total_accuracy <- (tp + tn) / sum(tp, fp, tn, fn)
metric <- c("sensitivity", "specificity", "precision", "total_accuracy")
purrr::map2(c(sensitivity, specificity, precision, total_accuracy), c(1:4), function(x, y) { paste0(metric[y], " is ", round(x * 100, 1), "%") }) %>% unlist
## [1] "sensitivity is 60%" "specificity is 70%"
## [3] "precision is 28.6%" "total_accuracy is 68.3%"
# ROC curve
fit_lr_pred_roc <- roc(flag ~ fit_lr_pred, data = test)
plot(fit_lr_pred_roc, main = "ROC curve of logistic regression model")
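# the plot alone does not report the area under the curve; an optional supplement (using the same pROC object as above) is:
pROC::auc(fit_lr_pred_roc)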
### let's visualize only the jobs that I am interested in by using a word cloud
####### overall word cloud #######
# clean text first
clean.text = function(x)
{
# tolower
x = tolower(x)
# remove the standalone token "rt" (retweet marker) without clipping letters inside other words
x = gsub("\\brt\\b", "", x)
# remove @mentions
x = gsub("@\\w+", "", x)
# remove punctuation
x = gsub("[[:punct:]]", "", x)
# remove numbers
x = gsub("[[:digit:]]", "", x)
# remove links http
x = gsub("http\\w+", "", x)
# remove tabs
x = gsub("[ |\t]{2,}", "", x)
# remove blank spaces at the beginning
x = gsub("^ ", "", x)
# remove blank spaces at the end
x = gsub(" $", "", x)
return(x)
}
overall <- tm::Corpus(VectorSource(description$description[description$flag == 1])) %>%
clean.text
set.seed(8321)
wordcloud(overall,
min.freq = 30,
colors = brewer.pal(8, "RdBu"))
####### comparison cloud #######
# clean, transform into tdm first
interested <- description %>% filter(flag == 1) %>% select(description) %>%
clean.text %>%
paste(., collapse = " ")
not_interested <- description %>% filter(flag == 0) %>% select(description) %>%
clean.text %>%
paste(., collapse = " ")
all <- c(interested, not_interested) %>%
removeWords(., c(stopwords("english"))) %>%
VectorSource %>%
Corpus
tdm <- TermDocumentMatrix(all) %>% as.matrix
colnames(tdm) <- c("Interested", "Not Interested")
# comparison cloud #
set.seed(8321)
comparison.cloud(tdm,
title.size = 1,
random.order = FALSE,
# colors = c("#00B2FF", "red", "#FF0099", "#6600CC"),
# colors = c("#00B2FF", "#6600CC"),
colors = c("green", "red"),
max.words = 200)
Jobs that I am interested in
The comparison cloud is pretty accurate. I am more interested in insights, marketing, analytics, python and visualization (Looker and Tableau are both displayed near the top), whereas I am less interested in finance, investment, banking or management (or manager roles), and it is correct that I have no experience in MicroStrategy, software implementation, design or integration.
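The scoring step that would feed this email is omitted above; a minimal sketch of what it might look like (assuming the trained fit_lr model, a feature data frame new_features built from freshly scraped postings in the same way as the test set, and the matching rows of jpr in new_jobs - these last two names are illustrative, not from the original script) is:
# hypothetical scoring step producing the "jobs_i_like" attachment picked up by the email code below
new_pred <- predict(fit_lr, newdata = new_features, type = "response")
jobs_i_like <- new_jobs[new_pred > 0.5, ]   # keep postings the model scores above the 0.5 cutoff
write.table(jobs_i_like, file = paste0("jobs_i_like_", Sys.Date(), ".txt"), sep = "\t", row.names = FALSE)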
library(mailR)
sender <- "myvioletrose@gmail.com"
recipients <- c("myvioletrose@gmail.com")
# username <- "xyz"
# password <- "xyz"
email <- mailR::send.mail(from = sender,
to = recipients,
subject = "Jobs that I am interested in",
body = "FYI",
smtp = list(host.name = "smtp.gmail.com",
port = 465,
user.name = username,
passwd = password,
ssl = T),
authenticate = T,
send = T,
# attach any file (txt or png)
attach.files = c(grep(pattern = "jobs_i_like", dir(), value = T)))
# create a .bat file (e.g. task.bat) that includes two lines of code
# it is used as a pointer to tell Windows where to find and run your script
@echo off
"C:\Program Files\R\R-3.4.0\bin\R.exe" CMD BATCH "C:\Users\traveler\Desktop\job_posting_recommendation\career_web_scrap.R"