Text Analysis of SimplyHired, ai-jobs.net, and Monster

After all of the data has been scraped from the job websites and saved as .csv files, the next step is to read the .csv files in, merge them together, and prepare them for text analysis.

#Load packages

require(dplyr)
require(rvest)
require(stringr)
require(tm)
require(SnowballC)
require(tidytext)
require(textdata)
require(tidyverse)
require(ggplot2)
require(wordcloud)
require(widyr)
require(igraph)
require(ggraph)
require(kableExtra)

Since the column names differed across the three files, they are renamed so that they match when we merge the three data sets.

Read in .csv of SimplyHired

From Leticia

ltcancel<-read.csv("https://raw.githubusercontent.com/ltcancel/Project3/master/SimplyHiredJobs.csv", stringsAsFactors = FALSE)
colnames(ltcancel)<-c("Position", "Company","Location","Salary","URL","Job_Description")
str(ltcancel)
## 'data.frame':    209 obs. of  6 variables:
##  $ Position       : chr  "Data Scientist, Marketplace – all levels" "Data Engineer" "Senior Data Scientist" "Ecological Wetland Scientist" ...
##  $ Company        : chr  "Spotify" "Noom Inc." "HVH Precision Analytics" "PS&S" ...
##  $ Location       : chr  "New York, NY" "New York, NY" "New York, NY" "Mineola, NY" ...
##  $ Salary         : chr  "Estimated: $110,000 - $160,000 a year" "Estimated: $73,000 - $96,000 a year" "Estimated: $110,000 - $140,000 a yearSimply Apply" "Estimated: $54,000 - $66,000 a year" ...
##  $ URL            : chr  "https://www.simplyhired.com/job/-q6yR-atece9p8LQvm2yP8xIX3VcYfRC9wsdPgSS0nWHIG3f2EZOxA?q=data+scientist" "https://www.simplyhired.com/job/pg680lk5W0WVpIIE7QQXRgEim6bP-NKuilb64EfQF80SDIp_X1ufSA?q=data+scientist" "https://www.simplyhired.com/job/YPL5f6DfJcFxTqZiKpIW3UWlZ0bqR5UKLLVHPAMbe3OnTxpIEdlCpg?q=data+scientist" "https://www.simplyhired.com/job/uip2DCa3k_ke2JyWB08F8Gm7MRFCNdXkFqw4nCdD5nq6tJiCPp7oZA?q=data+scientist" ...
##  $ Job_Description: chr  "Marketplace is the home for Spotify’s music industry products, such as Spotify for Artists, Spotify Label Analy"| __truncated__ "At Noom, we use scientifically proven methods to help our users create healthier lifestyles, and manage importa"| __truncated__ "Job Description: Data science professional to design, implement and deployadvanced machine learning / artificia"| __truncated__ "Overview\nPS&S is an award-winning “one-stop shop” of architecture and engineering excellence. The depth and br"| __truncated__ ...

Read in .csv of ai-jobs.net

From Salma

selshahawy<-read.csv("https://raw.githubusercontent.com/salma71/MSDS_2019/master/Fall2019/aquisition_management_607/project_3/jobs_detailsInfo.csv", stringsAsFactors = FALSE) 
colnames(selshahawy)<-c("Position", "Company","Location","URL","Job_Description")  
str(selshahawy)
## 'data.frame':    60 obs. of  5 variables:
##  $ Position       : chr  "Data EngineerMongoDB" "Machine Learning EngineerMedium" "Senior Data EngineerManulife" "Research ScientistAllen Institute for Artificial Intelligence (AI2)" ...
##  $ Company        : chr  "MongoDB" "Medium" "Manulife" "Allen Institute for Artificial Intelligence (AI2)" ...
##  $ Location       : chr  "New York City" "San Francisco" "Toronto, ON CA" "Seattle, WA" ...
##  $ URL            : chr  "https://ai-jobs.net/job/data-engineer-40/" "https://ai-jobs.net/job/machine-learning-engineer-64/" "https://ai-jobs.net/job/senior-data-engineer-8/" "https://ai-jobs.net/job/research-scientist-6/" ...
##  $ Job_Description: chr  "MongoDB is growing rapidly and seeking a Data Engineer to be a key contributor to the overall internal data pla"| __truncated__ "At Medium, words matter. We are building the best place for reading and writing on the internetâ\200”a place wh"| __truncated__ "Are you looking for unlimited opportunities to develop and succeed? With work that challenges and makes a diffe"| __truncated__ "The Allen Institute for Artificial Intelligence (AI2) is a non-profit research institute in Seattle founded by "| __truncated__ ...

Read in .csv of Monster

From Suwarman

ssufian<-read.csv("https://raw.githubusercontent.com/salma71/DataScience_skills/master/monsterjobs.csv", stringsAsFactors = FALSE) 
colnames(ssufian)<-c("Position", "Company","Location","Salary","URL","Job_Description")  
str(ssufian)
## 'data.frame':    26 obs. of  6 variables:
##  $ Position       : chr  "Principal Data Scientist (Facilities Analytics)" "Data Scientist" "Lead Data Scientist" "Data Scientist" ...
##  $ Company        : chr  "Northrop Grumman" "Eaton Corporation" "CenturyLink" "CBTS a Cincinnati Bell Company" ...
##  $ Location       : chr  "Redondo Beach, CA" "Eden Prairie, MN" "BROOMFIELD, CO" "Cincinnati, OH" ...
##  $ Salary         : logi  NA NA NA NA NA NA ...
##  $ URL            : chr  "https://job-openings.monster.com/principal-data-scientist-facilities-analytics-redondo-beach-ca-us-northrop-gru"| __truncated__ "https://job-openings.monster.com/data-scientist-eden-prairie-mn-us-eaton-corporation/2577f928-ed07-402e-bfd1-3e8a72ecb61e" "https://job-openings.monster.com/lead-data-scientist-broomfield-co-us-centurylink/8a9ed3aa-2fe3-48b4-86a7-4d3b7da8cd8a" "https://job-openings.monster.com/data-scientist-cincinnati-oh-us-cbts-a-cincinnati-bell-company/206896014" ...
##  $ Job_Description: chr  "At Northrop Grumman we develop cutting-edge technology that preserves freedom and advances human discovery. Our"| __truncated__ "Eaton’s Hydraulics division is currently seeking a DataScientist to join our team. This position is based at ou"| __truncated__ "CenturyLink (NYSE: CTL) is a global communications and IT services company focused on connecting its customers "| __truncated__ "CBTS is seeking a Senior Data Scientist to join our team. The initial focus of the role will be data classifica"| __truncated__ ...

Merge all the .csv files into one.

Since merge() can only combine two data frames at a time, we merge the first two .csv files into one. The shared column names are Position, Company, Location, URL, and Job_Description.

Because the column names and rows are not identical across the data frames, we set all = TRUE so that every row is kept in the merge regardless of mismatched columns.

twocsv<-merge(ltcancel,selshahawy,all= TRUE)
str(twocsv)
## 'data.frame':    269 obs. of  6 variables:
##  $ Position       : chr  "2020 Data Science Intern" "2020 Machine Learning Internship â\200“ Amazon SearchAmazon.com" "Administrative NP Coordinator - Stroke Program, Bellevue Hospital" "Administrative NP Coordinator - Stroke Program, Bellevue Hospital" ...
##  $ Company        : chr  "Guardian Life Insurance Company" "Amazon.com" "NYU Langone Health" "NYU Langone Medical Center" ...
##  $ Location       : chr  "New York, NY" "Berlin, Germany" "New York, NY" "New York, NY" ...
##  $ URL            : chr  "https://www.simplyhired.com/job/W0oTT9LF3Cfy9OZtnD5FBVyj1P52XNQnt_6JxZW83BvgYoo2zz2uJQ?q=data+scientist" "https://ai-jobs.net/job/2020-machine-learning-internship-amazon-search/" "https://www.simplyhired.com/job/vM0UOOBVm4K93DvY7BZORwTyO-0LDcZSRcQqg55e1Fvxf3gnaooqQw?q=data+scientist" "https://www.simplyhired.com/job/8EoewrddDdBqY2rnCl00C898-taFoO084O_BzpMdVtde014pojcrTQ?q=data+scientist" ...
##  $ Job_Description: chr  "2020 Data Science Intern - (19001927)\nDescription\n\nInternship Overview\nOur Internship Program is a paid 10-"| __truncated__ "We are looking for PhD students to join Amazon Search in Berlin for a 3-6 month internship in 2020.Hundreds of "| __truncated__ "NYU School of Medicine is one of the nation's top-ranked medical schools. For 175 years, NYU School of Medicine"| __truncated__ "NYU School of Medicine is one of the nation's top-ranked medical schools. For 175 years, NYU School of Medicine"| __truncated__ ...
##  $ Salary         : chr  "5d" NA "Estimated: $44,000 - $57,000 a year8d" "Estimated: $40,000 - $57,000 a year8d" ...

The 209 SimplyHired listings and the 60 ai-jobs.net listings have been merged into 269 job listings.

Here we merge in the third .csv so that all three files are combined.

allcsv<-merge(twocsv,ssufian, all=TRUE)
str(allcsv)
## 'data.frame':    295 obs. of  6 variables:
##  $ Position       : chr  "2020 Citizen Data Scientist Internship" "2020 Data Science Intern" "2020 Machine Learning Internship â\200“ Amazon SearchAmazon.com" "Administrative NP Coordinator - Stroke Program, Bellevue Hospital" ...
##  $ Company        : chr  "FCA" "Guardian Life Insurance Company" "Amazon.com" "NYU Langone Health" ...
##  $ Location       : chr  "Detroit, MI" "New York, NY" "Berlin, Germany" "New York, NY" ...
##  $ URL            : chr  "https://job-openings.monster.com/2020-citizen-data-scientist-internship-detroit-mi-us-fca/212893769" "https://www.simplyhired.com/job/W0oTT9LF3Cfy9OZtnD5FBVyj1P52XNQnt_6JxZW83BvgYoo2zz2uJQ?q=data+scientist" "https://ai-jobs.net/job/2020-machine-learning-internship-amazon-search/" "https://www.simplyhired.com/job/vM0UOOBVm4K93DvY7BZORwTyO-0LDcZSRcQqg55e1Fvxf3gnaooqQw?q=data+scientist" ...
##  $ Job_Description: chr  "FCA US LLC College Intern Program offers a unique opportunity for highly motivated, innovative, and inspired in"| __truncated__ "2020 Data Science Intern - (19001927)\nDescription\n\nInternship Overview\nOur Internship Program is a paid 10-"| __truncated__ "We are looking for PhD students to join Amazon Search in Berlin for a 3-6 month internship in 2020.Hundreds of "| __truncated__ "NYU School of Medicine is one of the nation's top-ranked medical schools. For 175 years, NYU School of Medicine"| __truncated__ ...
##  $ Salary         : chr  NA "5d" NA "Estimated: $44,000 - $57,000 a year8d" ...

Adding the 26 Monster listings brings the count from 269 to 295, giving us a total of 295 job listings for text mining.

Keep only the Position, Company, and Job_Description columns to make the analysis easier to run.

allcsv2 <-allcsv[c(1,2,5)]

Prepare the data for text analysis.

Using a Corpus from the tm package seemed to be the best way to prepare the data for text analysis. The Corpus transformations clean up the text and remove unnecessary words that we do not want in the analysis. After this step we move on to building the bag of words.

This step builds a corpus from the job descriptions and cleans the text.

descriptionofjobs = Corpus(VectorSource(allcsv2$Job_Description)) ## build the corpus
descriptionofjobs = tm_map(descriptionofjobs, content_transformer(tolower)) ## convert to lower case
descriptionofjobs = tm_map(descriptionofjobs, content_transformer(gsub), pattern = "\\W", replacement = " ") ## replace non-word characters with spaces
removeURL = function(x) gsub("http[^[:space:]]*", "", x) ## drop anything starting with http
descriptionofjobs = tm_map(descriptionofjobs, content_transformer(removeURL))
descriptionofjobs = tm_map(descriptionofjobs, removeNumbers) ## remove numbers
descriptionofjobs = tm_map(descriptionofjobs, removePunctuation) ## remove punctuation
descriptionofjobs = tm_map(descriptionofjobs, removeWords, stopwords(kind = "english")) ## remove English stop words
extraStopwords = c(setdiff(stopwords('english'), c("used", "will")), "time", "can", "sex", "role", "new","can", "job", "etc", "one", "looking", "well","use","best","also", "high", "real", "please", "key", "able", "must", "like", "full", "include", "good", "non", "need","plus","day","year", "com", "want", "age","using","sexual", "help","apply", "race", "orientation", "will", "work", "new")
descriptionofjobs = tm_map(descriptionofjobs, removeWords, extraStopwords) ## remove additional unwanted words
descriptionofjobs = tm_map(descriptionofjobs, stripWhitespace) ## collapse extra whitespace
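
To sanity-check the cleaning, one can print a cleaned description. This is a quick check of my own, not part of the original script, and assumes the descriptionofjobs corpus built above.

## Quick check (assumes the descriptionofjobs corpus built above): print the
## first cleaned job description to confirm that lowercasing, punctuation
## removal, and stop word removal behaved as intended.
writeLines(as.character(descriptionofjobs[[1]]))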

Creating the Bag of Words

This is an important step: after cleaning the text we store the words in a document-term matrix, which records how many times each term occurs in each document. We then remove sparse terms so that only the terms appearing most consistently across documents remain. I used a sparsity threshold of 0.45.

allwords2<-DocumentTermMatrix(descriptionofjobs)##Creating the DocumentTermMatrix
sparsewords = removeSparseTerms(allwords2,.45) ##removing the sparse terms. 
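
As a quick check (my own addition, assuming the allwords2 and sparsewords objects above), we can compare how many terms survive the 0.45 sparsity threshold.

dim(allwords2)     ## documents x all terms in the full matrix
dim(sparsewords)   ## documents x terms kept after removeSparseTerms
Terms(sparsewords) ## the retained terms themselves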

Begin the analysis

Now we can finally begin the analysis, following the guidance of Text Mining with R, of the data scraped from all the job boards, and see which skills occur most often without manually mutating a list of skills into the table. But before we can do that we have to convert the DocumentTermMatrix into a tidy table.

This is done here.

tidywords<-tidy(sparsewords)
tidywords
## # A tibble: 2,870 x 3
##    document term       count
##    <chr>    <chr>      <dbl>
##  1 1        business       9
##  2 1        data          13
##  3 1        experience     4
##  4 1        knowledge      2
##  5 1        learning       3
##  6 1        machine        2
##  7 1        science        1
##  8 1        skills         2
##  9 2        ability        1
## 10 2        business       6
## # ... with 2,860 more rows

This shows the number of job descriptions in which each term appears.

totalwords<-tidywords%>%
  count(term, sort= TRUE)
kable(totalwords) %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T, border_right = F) %>%
  column_spec(1, width = "15em", background = "lightgreen")
term n
data 287
experience 279
team 243
science 218
years 216
python 208
skills 193
business 189
learning 185
degree 175
knowledge 171
machine 171
ability 169
including 166

As you can see, data, experience, team, science, years, python, skills, business, learning, degree, knowledge, machine, and ability are the words that occur most often in the text, a mixture of hard and soft skills. Among the top terms you can already see machine learning and python.

tidywords %>%
  count(term, sort = TRUE) %>%
  filter(n > 180) %>%
  ggplot(aes(term, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

The top 3 most used terms are data, experience, and team.

## This is needed in order to calculate the frequency of each term. 
summaryofwords<-tidywords %>% 
  group_by(term) %>%
  summarize(total = n()) %>%   ## number of documents containing the term
  arrange(desc(total))
totalwords<-left_join(totalwords, summaryofwords)

This computes a frequency value for each term, inspired by Zipf’s law, which states that a term’s frequency is inversely proportional to its rank.

tfrequency <-summaryofwords %>%
  group_by(term)%>%
  mutate(rank= n(), 'frequencyofterm' = n()/total)%>%
  arrange(total)
kable(tfrequency) %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T, border_right = F) %>%
  column_spec(1, width = "15em", background = "cyan")
term total rank frequencyofterm
including 166 1 0.0060241
ability 169 1 0.0059172
knowledge 171 1 0.0058480
machine 171 1 0.0058480
degree 175 1 0.0057143
learning 185 1 0.0054054
business 189 1 0.0052910
skills 193 1 0.0051813
python 208 1 0.0048077
years 216 1 0.0046296
science 218 1 0.0045872
team 243 1 0.0041152
experience 279 1 0.0035842
data 287 1 0.0034843

This shows the frequency of each term in descending order.
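
As an illustration of the Zipf relationship (my own sketch, not part of the original analysis, assuming the totalwords table built above), one could rank the terms by how often they appear and plot rank against relative frequency on log-log axes, where Zipf’s law predicts a roughly straight, downward-sloping line.

## Zipf-style sketch (assumes totalwords from above): rank terms by count and
## plot rank vs. relative frequency on log-log axes.
zipf_check <- totalwords %>%
  arrange(desc(n)) %>%
  mutate(rank = row_number(),
         term_frequency = n / sum(n))
ggplot(zipf_check, aes(rank, term_frequency)) +
  geom_line() +
  scale_x_log10() +
  scale_y_log10()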

Groupings of the words

It is interesting to see which words appear together in the same job descriptions. The table below lists the terms and their counts per document, and the network plot after it links documents to the terms that occur in them at least 10 times; a pairwise co-occurrence sketch follows the plot.

group_word_pairs<-tidywords%>%
  group_by(term) %>%
  arrange(count)
kable(group_word_pairs [1:20, 1:3] ) %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T, border_right = F) %>%
  column_spec(2, bold = T, border_right = F)%>%
  column_spec(1, width = "3em", background = "yellow")%>%
  column_spec(2, width = "8em", background = "cyan")
document term count
1 science 1
2 ability 1
2 degree 1
2 including 1
2 knowledge 1
2 python 1
2 years 1
3 business 1
3 degree 1
3 python 1
4 business 1
4 including 1
4 knowledge 1
5 business 1
5 including 1
5 knowledge 1
6 including 1
6 python 1
6 years 1
7 data 1
#set.seed(1234)
group_word_pairs %>%
  filter(count >= 10) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = count, edge_width = count), edge_colour = "cyan3") +
  geom_node_point(size = 2) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.3, "lines")) +
  theme_void()
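
For an explicit count of how often two terms show up in the same job description, one could use widyr’s pairwise_count. This is my own sketch, not part of the original analysis, and assumes the tidywords table from above.

## Sketch (assumes tidywords from above): count how often two terms appear
## in the same document, using widyr's pairwise_count.
term_pairs <- tidywords %>%
  pairwise_count(term, document, sort = TRUE)
head(term_pairs)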

Correlation of the words

It is even better to see which words occur together and how strongly they correlate with one another. Here we compute the pairwise correlations of the top terms.

tidywords_cors <- tidywords %>% 
  group_by(term) %>%
  filter(n() >= 180) %>%
  pairwise_cor(term, count, sort = TRUE, upper = FALSE)
kable(tidywords_cors) %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T, border_right = F) %>%
  column_spec(2, bold = T, border_right = F)%>%
  column_spec(1, width = "10em", background = "cyan")%>%
  column_spec(2, width = "10em", background = "yellow")
item1 item2 correlation
learning science 0.9304842
learning team 0.9304842
science team 0.8523810
experience learning 0.8172253
skills years 0.8154762
science skills 0.7826238
science years 0.7826238
team years 0.7826238
experience science 0.7604152
business team 0.7604152
experience team 0.7604152
learning skills 0.7282191
learning years 0.7282191
skills python 0.7126966
python years 0.7126966
business learning 0.6817494
business science 0.6217513
skills team 0.6175807
business experience 0.6092437
experience skills 0.5951190
business years 0.5951190
experience years 0.5951190
science python 0.5577734
python team 0.5577734
learning python 0.5189993
business skills 0.4400880
business python 0.4241393
experience python 0.4241393
business data NaN
data experience NaN
data learning NaN
data science NaN
data skills NaN
data python NaN
data team NaN
data years NaN
set.seed(1234)
tidywords_cors %>%
  filter(correlation > .3) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "lightblue2") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.3, "lines")) +
  theme_void()

Here you can visually see the co-occurrences.
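
Note that the pairs involving data come back as NaN. As a variant (my own suggestion, not the original approach), one could correlate terms by their presence across documents, as Text Mining with R does, instead of across the count column.

## Variant sketch (assumes tidywords from above): correlate terms by their
## co-occurrence across documents rather than across counts.
doc_cors <- tidywords %>%
  group_by(term) %>%
  filter(n() >= 180) %>%
  pairwise_cor(term, document, sort = TRUE)
head(doc_cors)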

WordCloud

To make the wordcloud we use the cleaned corpus from earlier. We build a DocumentTermMatrix from it, remove sparse terms (this time with a sparsity of 0.65), and plot the most frequent terms. Here we can see the terms used most often across the job boards.

library(wordcloud)
dtm = DocumentTermMatrix(descriptionofjobs)
dtm = removeSparseTerms(dtm, 0.65)
dataset = as.matrix(dtm)
v = sort(colSums(dataset), decreasing = TRUE) ## total count of each term
myNames = names(v)
d = data.frame(word = myNames, freq = v)
wordcloud(d$word, d$freq, min.freq = 100, colors = c(1:4), random.color = TRUE)

Here you can see that data, experience, business, learning, machine, years, analytics, solutions, development, support, scientist, science, computer, teams, skills, technology, and information are among the words in the wordcloud that stand out the most. Of course, data is going to be one of the most prominent terms on a job board for data scientists. In the wordcloud you can also see several of the top skills found in the other analyses, such as research, machine learning, statistics, and communication, but not all of the top skills appear.

Conclusion

Doing this project gives one an idea of what companies are looking for when it comes to data science. With the top three most used terms being data, experience, and team, it shows what companies want and expect from a data scientist. First, they expect the applicant to know data. They also want the person to have experience working with data, and they want people who can work in a team while doing so. Some of the other words with the highest frequencies were ability, knowledge, machine, degree, learning, business, skills, and python. From these words one can tell that companies want someone who can work in a business environment and who has the knowledge, education, and skills needed for machine learning and Python. Looking at the correlation of the words, years sits in the middle and connects to many other terms, which suggests that companies care about the years of experience a person has working with data. So are the top skills that companies are looking for soft skills or hard skills? It appears to be a mixture of both: companies want people with the soft skills to work in a team and the hard skills to work with tools such as Python and SQL. To be a data scientist you need both hard skills and soft skills.

Another Way to merge all of the data tables using the Reduce Function

csv123 <- df_list <- list(ltcancel, selshahawy, ssufian) ## store the three data frames in a list
Reduce(function(d1, d2) merge(d1, d2, by = "Position", all.x = TRUE, all.y = FALSE), df_list) ## merge them pairwise by Position
summary(csv123)
##      Length Class      Mode
## [1,] 6      data.frame list
## [2,] 5      data.frame list
## [3,] 6      data.frame list

The summary() output confirms that the three data frames are held in a list; Reduce() then merges them pairwise by Position.
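
To keep the merged result around for further use (my own addition, assuming the df_list object above), one could assign the Reduce() output to a variable and check its dimensions; merged_by_position is a hypothetical name.

## Sketch (assumes df_list from above): store the Reduce() result in a
## hypothetical variable and check its size.
merged_by_position <- Reduce(function(d1, d2) merge(d1, d2, by = "Position", all.x = TRUE, all.y = FALSE),
                             df_list)
dim(merged_by_position)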

References

1. Text Mining with R: https://www.tidytextmining.com/

2. Corpus (tm): https://www.rdocumentation.org/packages/tm/versions/0.7-6/topics/Corpus

3. kableExtra: https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html#overview

4. R Reduce applys: https://blog.zhaw.ch/datascience/r-reduce-applys-lesser-known-brother/

5. Understanding and writing your first text mining script with R: https://towardsdatascience.com/understanding-and-writing-your-first-text-mining-script-with-r-c74a7efbe30f

RPubs Link http://rpubs.com/Luz917/project3data607Cruz