Text Analysis of SimplyHired, ai-jobs.net, and Monster

After all of the data has been scraped from the job websites and saved as .csv files, the next step is to read the .csv files in, merge them together, and prepare them for text analysis.

#Load packages

require(dplyr)
require(rvest)
require(stringr)
require(tm)
require(SnowballC)
require(tidytext)
require(textdata)
require(tidyverse)
require(ggplot2)
require(wordcloud)
require(widyr)
require(igraph)
require(ggraph)
require(kableExtra)

Since the column names differed across the three files, they are renamed so that they match when we merge the three data sets.

Read in .csv of SimplyHired

From Leticia

ltcancel<-read.csv("https://raw.githubusercontent.com/ltcancel/Project3/master/SimplyHiredJobs.csv", stringsAsFactors = FALSE)
colnames(ltcancel)<-c("Position", "Company","Location","Salary","URL","Job_Description")
str(ltcancel)
## 'data.frame':    209 obs. of  6 variables:
##  $ Position       : chr  "Data Scientist, Marketplace – all levels" "Data Engineer" "Senior Data Scientist" "Ecological Wetland Scientist" ...
##  $ Company        : chr  "Spotify" "Noom Inc." "HVH Precision Analytics" "PS&S" ...
##  $ Location       : chr  "New York, NY" "New York, NY" "New York, NY" "Mineola, NY" ...
##  $ Salary         : chr  "Estimated: $110,000 - $160,000 a year" "Estimated: $73,000 - $96,000 a year" "Estimated: $110,000 - $140,000 a yearSimply Apply" "Estimated: $54,000 - $66,000 a year" ...
##  $ URL            : chr  "https://www.simplyhired.com/job/-q6yR-atece9p8LQvm2yP8xIX3VcYfRC9wsdPgSS0nWHIG3f2EZOxA?q=data+scientist" "https://www.simplyhired.com/job/pg680lk5W0WVpIIE7QQXRgEim6bP-NKuilb64EfQF80SDIp_X1ufSA?q=data+scientist" "https://www.simplyhired.com/job/YPL5f6DfJcFxTqZiKpIW3UWlZ0bqR5UKLLVHPAMbe3OnTxpIEdlCpg?q=data+scientist" "https://www.simplyhired.com/job/uip2DCa3k_ke2JyWB08F8Gm7MRFCNdXkFqw4nCdD5nq6tJiCPp7oZA?q=data+scientist" ...
##  $ Job_Description: chr  "Marketplace is the home for Spotify’s music industry products, such as Spotify for Artists, Spotify Label Analy"| __truncated__ "At Noom, we use scientifically proven methods to help our users create healthier lifestyles, and manage importa"| __truncated__ "Job Description: Data science professional to design, implement and deployadvanced machine learning / artificia"| __truncated__ "Overview\nPS&S is an award-winning “one-stop shop” of architecture and engineering excellence. The depth and br"| __truncated__ ...

Read in .csv of ai-jobs.net

From Salma

selshahawy<-read.csv("https://raw.githubusercontent.com/salma71/MSDS_2019/master/Fall2019/aquisition_management_607/project_3/jobs_detailsInfo.csv", stringsAsFactors = FALSE) 
colnames(selshahawy)<-c("Position", "Company","Location","URL","Job_Description")  
str(selshahawy)
## 'data.frame':    60 obs. of  5 variables:
##  $ Position       : chr  "Data EngineerMongoDB" "Machine Learning EngineerMedium" "Senior Data EngineerManulife" "Research ScientistAllen Institute for Artificial Intelligence (AI2)" ...
##  $ Company        : chr  "MongoDB" "Medium" "Manulife" "Allen Institute for Artificial Intelligence (AI2)" ...
##  $ Location       : chr  "New York City" "San Francisco" "Toronto, ON CA" "Seattle, WA" ...
##  $ URL            : chr  "https://ai-jobs.net/job/data-engineer-40/" "https://ai-jobs.net/job/machine-learning-engineer-64/" "https://ai-jobs.net/job/senior-data-engineer-8/" "https://ai-jobs.net/job/research-scientist-6/" ...
##  $ Job_Description: chr  "MongoDB is growing rapidly and seeking a Data Engineer to be a key contributor to the overall internal data pla"| __truncated__ "At Medium, words matter. We are building the best place for reading and writing on the internetâ\200”a place wh"| __truncated__ "Are you looking for unlimited opportunities to develop and succeed? With work that challenges and makes a diffe"| __truncated__ "The Allen Institute for Artificial Intelligence (AI2) is a non-profit research institute in Seattle founded by "| __truncated__ ...

Read in .csv of Monster

From Suwarman

ssufian<-read.csv("https://raw.githubusercontent.com/salma71/DataScience_skills/master/monsterjobs.csv", stringsAsFactors = FALSE) 
colnames(ssufian)<-c("Position", "Company","Location","Salary","URL","Job_Description")  
str(ssufian)
## 'data.frame':    26 obs. of  6 variables:
##  $ Position       : chr  "Principal Data Scientist (Facilities Analytics)" "Data Scientist" "Lead Data Scientist" "Data Scientist" ...
##  $ Company        : chr  "Northrop Grumman" "Eaton Corporation" "CenturyLink" "CBTS a Cincinnati Bell Company" ...
##  $ Location       : chr  "Redondo Beach, CA" "Eden Prairie, MN" "BROOMFIELD, CO" "Cincinnati, OH" ...
##  $ Salary         : logi  NA NA NA NA NA NA ...
##  $ URL            : chr  "https://job-openings.monster.com/principal-data-scientist-facilities-analytics-redondo-beach-ca-us-northrop-gru"| __truncated__ "https://job-openings.monster.com/data-scientist-eden-prairie-mn-us-eaton-corporation/2577f928-ed07-402e-bfd1-3e8a72ecb61e" "https://job-openings.monster.com/lead-data-scientist-broomfield-co-us-centurylink/8a9ed3aa-2fe3-48b4-86a7-4d3b7da8cd8a" "https://job-openings.monster.com/data-scientist-cincinnati-oh-us-cbts-a-cincinnati-bell-company/206896014" ...
##  $ Job_Description: chr  "At Northrop Grumman we develop cutting-edge technology that preserves freedom and advances human discovery. Our"| __truncated__ "Eaton’s Hydraulics division is currently seeking a DataScientist to join our team. This position is based at ou"| __truncated__ "CenturyLink (NYSE: CTL) is a global communications and IT services company focused on connecting its customers "| __truncated__ "CBTS is seeking a Senior Data Scientist to join our team. The initial focus of the role will be data classifica"| __truncated__ ...

Merge all the .csv files into one.

Since merge() can only combine two data frames at a time, we merge the first two .csv files into one. The shared column names are Position, Company, Location, URL, and Job_Description.

Because the column names and rows are not identical across the data frames, we set all = TRUE so that every row is kept in the merge regardless of mismatched columns.

twocsv<-merge(ltcancel,selshahawy,all= TRUE)
str(twocsv)
## 'data.frame':    269 obs. of  6 variables:
##  $ Position       : chr  "2020 Data Science Intern" "2020 Machine Learning Internship â\200“ Amazon SearchAmazon.com" "Administrative NP Coordinator - Stroke Program, Bellevue Hospital" "Administrative NP Coordinator - Stroke Program, Bellevue Hospital" ...
##  $ Company        : chr  "Guardian Life Insurance Company" "Amazon.com" "NYU Langone Health" "NYU Langone Medical Center" ...
##  $ Location       : chr  "New York, NY" "Berlin, Germany" "New York, NY" "New York, NY" ...
##  $ URL            : chr  "https://www.simplyhired.com/job/W0oTT9LF3Cfy9OZtnD5FBVyj1P52XNQnt_6JxZW83BvgYoo2zz2uJQ?q=data+scientist" "https://ai-jobs.net/job/2020-machine-learning-internship-amazon-search/" "https://www.simplyhired.com/job/vM0UOOBVm4K93DvY7BZORwTyO-0LDcZSRcQqg55e1Fvxf3gnaooqQw?q=data+scientist" "https://www.simplyhired.com/job/8EoewrddDdBqY2rnCl00C898-taFoO084O_BzpMdVtde014pojcrTQ?q=data+scientist" ...
##  $ Job_Description: chr  "2020 Data Science Intern - (19001927)\nDescription\n\nInternship Overview\nOur Internship Program is a paid 10-"| __truncated__ "We are looking for PhD students to join Amazon Search in Berlin for a 3-6 month internship in 2020.Hundreds of "| __truncated__ "NYU School of Medicine is one of the nation's top-ranked medical schools. For 175 years, NYU School of Medicine"| __truncated__ "NYU School of Medicine is one of the nation's top-ranked medical schools. For 175 years, NYU School of Medicine"| __truncated__ ...
##  $ Salary         : chr  "5d" NA "Estimated: $44,000 - $57,000 a year8d" "Estimated: $40,000 - $57,000 a year8d" ...

The 209 SimplyHired listings and the 60 ai-jobs.net listings have been merged into 269 job listings.

Here we merge in the third .csv so that all three files are combined.

allcsv<-merge(twocsv,ssufian, all=TRUE)
str(allcsv)
## 'data.frame':    295 obs. of  6 variables:
##  $ Position       : chr  "2020 Citizen Data Scientist Internship" "2020 Data Science Intern" "2020 Machine Learning Internship â\200“ Amazon SearchAmazon.com" "Administrative NP Coordinator - Stroke Program, Bellevue Hospital" ...
##  $ Company        : chr  "FCA" "Guardian Life Insurance Company" "Amazon.com" "NYU Langone Health" ...
##  $ Location       : chr  "Detroit, MI" "New York, NY" "Berlin, Germany" "New York, NY" ...
##  $ URL            : chr  "https://job-openings.monster.com/2020-citizen-data-scientist-internship-detroit-mi-us-fca/212893769" "https://www.simplyhired.com/job/W0oTT9LF3Cfy9OZtnD5FBVyj1P52XNQnt_6JxZW83BvgYoo2zz2uJQ?q=data+scientist" "https://ai-jobs.net/job/2020-machine-learning-internship-amazon-search/" "https://www.simplyhired.com/job/vM0UOOBVm4K93DvY7BZORwTyO-0LDcZSRcQqg55e1Fvxf3gnaooqQw?q=data+scientist" ...
##  $ Job_Description: chr  "FCA US LLC College Intern Program offers a unique opportunity for highly motivated, innovative, and inspired in"| __truncated__ "2020 Data Science Intern - (19001927)\nDescription\n\nInternship Overview\nOur Internship Program is a paid 10-"| __truncated__ "We are looking for PhD students to join Amazon Search in Berlin for a 3-6 month internship in 2020.Hundreds of "| __truncated__ "NYU School of Medicine is one of the nation's top-ranked medical schools. For 175 years, NYU School of Medicine"| __truncated__ ...
##  $ Salary         : chr  NA "5d" NA "Estimated: $44,000 - $57,000 a year8d" ...

Adding the 26 Monster listings brings the count from 269 to 295, giving us a total of 295 job listings for text mining.

Keep only the Position, Company, and Job_Description columns to make the analysis easier to run.

allcsv2 <-allcsv[c(1,2,5)]

Prepare the data for text analysis.

Using a Corpus from the tm package seemed to be the best way to prepare the data for text analysis. The Corpus transformations clean up the text and remove unnecessary words that we do not want in the analysis. After this step we move on to building the bag of words.

This step builds a corpus from the job descriptions and cleans the text.

descriptionofjobs = Corpus(VectorSource(allcsv2$Job_Description)) ## build the corpus
descriptionofjobs = tm_map(descriptionofjobs, content_transformer(tolower)) ## convert to lower case
descriptionofjobs = tm_map(descriptionofjobs, content_transformer(gsub), pattern = "\\W", replacement = " ") ## replace non-word characters with spaces
removeURL = function(x) gsub("http[^[:space:]]*", "", x) ## drop anything starting with http
descriptionofjobs = tm_map(descriptionofjobs, content_transformer(removeURL))
descriptionofjobs = tm_map(descriptionofjobs, removeNumbers) ## remove numbers
descriptionofjobs = tm_map(descriptionofjobs, removePunctuation) ## remove punctuation
descriptionofjobs = tm_map(descriptionofjobs, removeWords, stopwords(kind = "english")) ## remove English stop words
extraStopwords = c(setdiff(stopwords('english'), c("used", "will")), "time", "can", "sex", "role", "new","can", "job", "etc", "one", "looking", "well","use","best","also", "high", "real", "please", "key", "able", "must", "like", "full", "include", "good", "non", "need","plus","day","year", "com", "want", "age","using","sexual", "help","apply", "race", "orientation", "will", "work", "new")
descriptionofjobs = tm_map(descriptionofjobs, removeWords, extraStopwords) ## remove additional unwanted words
descriptionofjobs = tm_map(descriptionofjobs, stripWhitespace) ## collapse extra whitespace
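
To sanity-check the cleaning, one can print a cleaned description. This is a quick check of my own, not part of the original script, and assumes the descriptionofjobs corpus built above.

## Quick check (assumes the descriptionofjobs corpus built above): print the
## first cleaned job description to confirm that lowercasing, punctuation
## removal, and stop word removal behaved as intended.
writeLines(as.character(descriptionofjobs[[1]]))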

Creating the Bag of Words

This is an important step: after cleaning the text we store the words in a document-term matrix, which records how many times each term occurs in each document. We then remove sparse terms so that only the terms appearing most consistently across documents remain. I used a sparsity threshold of 0.45.

allwords2<-DocumentTermMatrix(descriptionofjobs)##Creating the DocumentTermMatrix
sparsewords = removeSparseTerms(allwords2,.45) ##removing the sparse terms. 
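
As a quick check (my own addition, assuming the allwords2 and sparsewords objects above), we can compare how many terms survive the 0.45 sparsity threshold.

dim(allwords2)     ## documents x all terms in the full matrix
dim(sparsewords)   ## documents x terms kept after removeSparseTerms
Terms(sparsewords) ## the retained terms themselves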

Begin the analysis

Now we can finally begin the analysis, following the guidance of Text Mining with R, of the data scraped from all the job boards, and see which skills occur most often without manually mutating a list of skills into the table. But before we can do that we have to convert the DocumentTermMatrix into a tidy table.

This is done here.

tidywords<-tidy(sparsewords)
tidywords
## # A tibble: 2,870 x 3
##    document term       count
##    <chr>    <chr>      <dbl>
##  1 1        business       9
##  2 1        data          13
##  3 1        experience     4
##  4 1        knowledge      2
##  5 1        learning       3
##  6 1        machine        2
##  7 1        science        1
##  8 1        skills         2
##  9 2        ability        1
## 10 2        business       6
## # ... with 2,860 more rows

This shows the number of job descriptions in which each term appears.

totalwords<-tidywords%>%
  count(term, sort= TRUE)
kable(totalwords) %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T, border_right = F) %>%
  column_spec(1, width = "15em", background = "lightgreen")
term n
data 287
experience 279
team 243
science 218
years 216
python 208
skills 193
business 189
learning 185
degree 175
knowledge 171
machine 171
ability 169
including 166

As you can see, data, experience, team, science, years, python, skills, business, learning, degree, knowledge, machine, and ability are the words that occur most often in the text, a mixture of hard and soft skills. Among the top terms you can already see machine learning and python.

tidywords %>%
  count(term, sort = TRUE) %>%
  filter(n > 180) %>%
  ggplot(aes(term, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

The top 3 most used terms are data, experience, and team.

## This is needed in order to calculate the frequency of each term. 
summaryofwords<-tidywords %>% 
  group_by(term) %>%
  summarize(total = n()) %>%   ## number of documents containing the term
  arrange(desc(total))
totalwords<-left_join(totalwords, summaryofwords)

This computes a frequency value for each term, inspired by Zipf’s law, which states that a term’s frequency is inversely proportional to its rank.

tfrequency <-summaryofwords %>%
  group_by(term)%>%
  mutate(rank= n(), 'frequencyofterm' = n()/total)%>%
  arrange(total)
kable(tfrequency) %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T, border_right = F) %>%
  column_spec(1, width = "15em", background = "cyan")
term total rank frequencyofterm
including 166 1 0.0060241
ability 169 1 0.0059172
knowledge 171 1 0.0058480
machine 171 1 0.0058480
degree 175 1 0.0057143
learning 185 1 0.0054054
business 189 1 0.0052910
skills 193 1 0.0051813
python 208 1 0.0048077
years 216 1 0.0046296
science 218 1 0.0045872
team 243 1 0.0041152
experience 279 1 0.0035842
data 287 1 0.0034843

This shows the frequency of each term in descending order.
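
As an illustration of the Zipf relationship (my own sketch, not part of the original analysis, assuming the totalwords table built above), one could rank the terms by how often they appear and plot rank against relative frequency on log-log axes, where Zipf’s law predicts a roughly straight, downward-sloping line.

## Zipf-style sketch (assumes totalwords from above): rank terms by count and
## plot rank vs. relative frequency on log-log axes.
zipf_check <- totalwords %>%
  arrange(desc(n)) %>%
  mutate(rank = row_number(),
         term_frequency = n / sum(n))
ggplot(zipf_check, aes(rank, term_frequency)) +
  geom_line() +
  scale_x_log10() +
  scale_y_log10()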

Groupings of the words

It is interesting to see which words appear together in the same job descriptions. The table below lists the terms and their counts per document, and the network plot after it links documents to the terms that occur in them at least 10 times; a pairwise co-occurrence sketch follows the plot.

group_word_pairs<-tidywords%>%
  group_by(term) %>%
  arrange(count)
kable(group_word_pairs [1:20, 1:3] ) %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T, border_right = F) %>%
  column_spec(2, bold = T, border_right = F)%>%
  column_spec(1, width = "3em", background = "yellow")%>%
  column_spec(2, width = "8em", background = "cyan")
document term count
1 science 1
2 ability 1
2 degree 1
2 including 1
2 knowledge 1
2 python 1
2 years 1
3 business 1
3 degree 1
3 python 1
4 business 1
4 including 1
4 knowledge 1
5 business 1
5 including 1
5 knowledge 1
6 including 1
6 python 1
6 years 1
7 data 1
#set.seed(1234)
group_word_pairs %>%
  filter(count >= 10) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = count, edge_width = count), edge_colour = "cyan3") +
  geom_node_point(size = 2) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.3, "lines")) +
  theme_void()
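
For an explicit count of how often two terms show up in the same job description, one could use widyr’s pairwise_count. This is my own sketch, not part of the original analysis, and assumes the tidywords table from above.

## Sketch (assumes tidywords from above): count how often two terms appear
## in the same document, using widyr's pairwise_count.
term_pairs <- tidywords %>%
  pairwise_count(term, document, sort = TRUE)
head(term_pairs)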

Correlation of the words

It is even better to see which words occur together and how strongly they correlate with one another. Here we compute the pairwise correlations of the top terms.

tidywords_cors <- tidywords %>% 
  group_by(term) %>%
  filter(n() >= 180) %>%
  pairwise_cor(term, count, sort = TRUE, upper = FALSE)
kable(tidywords_cors) %>%
  kable_styling(full_width = F) %>%
  column_spec(1, bold = T, border_right = F) %>%
  column_spec(2, bold = T, border_right = F)%>%
  column_spec(1, width = "10em", background = "cyan")%>%
  column_spec(2, width = "10em", background = "yellow")
item1 item2 correlation
learning science 0.9304842
learning team 0.9304842
science team 0.8523810
experience learning 0.8172253
skills years 0.8154762
science skills 0.7826238
science years 0.7826238
team years 0.7826238
experience science 0.7604152
business team 0.7604152
experience team 0.7604152
learning skills 0.7282191
learning years 0.7282191
skills python 0.7126966
python years 0.7126966
business learning 0.6817494
business science 0.6217513
skills team 0.6175807
business experience 0.6092437
experience skills 0.5951190
business years 0.5951190
experience years 0.5951190
science python 0.5577734
python team 0.5577734
learning python 0.5189993
business skills 0.4400880
business python 0.4241393
experience python 0.4241393
business data NaN
data experience NaN
data learning NaN
data science NaN
data skills NaN
data python NaN
data team NaN
data years NaN
set.seed(1234)
tidywords_cors %>%
  filter(correlation > .3) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "lightblue2") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.3, "lines")) +
  theme_void()

Here you can visually see the co-occurrences.
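
Note that the pairs involving data come back as NaN. As a variant (my own suggestion, not the original approach), one could correlate terms by their presence across documents, as Text Mining with R does, instead of across the count column.

## Variant sketch (assumes tidywords from above): correlate terms by their
## co-occurrence across documents rather than across counts.
doc_cors <- tidywords %>%
  group_by(term) %>%
  filter(n() >= 180) %>%
  pairwise_cor(term, document, sort = TRUE)
head(doc_cors)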

WordCloud

To make the wordcloud we use the cleaned corpus from earlier. We build a DocumentTermMatrix from it, remove sparse terms (this time with a sparsity of 0.65), and plot the most frequent terms. Here we can see the terms used most often across the job boards.

library(wordcloud)
dtm = DocumentTermMatrix(descriptionofjobs)
dtm = removeSparseTerms(dtm, 0.65)
dataset = as.matrix(dtm)
v = sort(colSums(dataset), decreasing = TRUE) ## total count of each term
myNames = names(v)
d = data.frame(word = myNames, freq = v)
wordcloud(d$word, d$freq, min.freq = 100, colors = c(1:4), random.color = TRUE)

Here you can see that data, experience, business, learning, machine, years, analytics, solutions, development, support, scientist, science, computer, teams, skills, technology, and information are among the words in the wordcloud that stand out the most. Of course, data is going to be one of the most prominent terms on a job board for data scientists. In the wordcloud you can also see several of the top skills found in the other analyses, such as research, machine learning, statistics, and communication, but not all of the top skills appear.

Conclusion

Doing this project gives one an idea of what companies are looking for when it comes to data science. With the top three most used terms being data, experience, and team, it shows what companies want and expect from a data scientist. First, they expect the applicant to know data. They also want the person to have experience working with data, and they want people who can work in a team while doing so. Some of the other words with the highest frequencies were ability, knowledge, machine, degree, learning, business, skills, and python. From these words one can tell that companies want someone who can work in a business environment and who has the knowledge, education, and skills needed for machine learning and Python. Looking at the correlation of the words, years sits in the middle and connects to many other terms, which suggests that companies care about the years of experience a person has working with data. So are the top skills that companies are looking for soft skills or hard skills? It appears to be a mixture of both: companies want people with the soft skills to work in a team and the hard skills to work with tools such as Python and SQL. To be a data scientist you need both hard skills and soft skills.

Another Way to merge all of the data tables using the Reduce Function

csv123 <- df_list <- list(ltcancel, selshahawy, ssufian) ## store the three data frames in a list
Reduce(function(d1, d2) merge(d1, d2, by = "Position", all.x = TRUE, all.y = FALSE), df_list) ## merge them pairwise by Position
summary(csv123)
##      Length Class      Mode
## [1,] 6      data.frame list
## [2,] 5      data.frame list
## [3,] 6      data.frame list

The summary() output confirms that the three data frames are held in a list; Reduce() then merges them pairwise by Position.
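
To keep the merged result around for further use (my own addition, assuming the df_list object above), one could assign the Reduce() output to a variable and check its dimensions; merged_by_position is a hypothetical name.

## Sketch (assumes df_list from above): store the Reduce() result in a
## hypothetical variable and check its size.
merged_by_position <- Reduce(function(d1, d2) merge(d1, d2, by = "Position", all.x = TRUE, all.y = FALSE),
                             df_list)
dim(merged_by_position)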

References

1. Text Mining with R: https://www.tidytextmining.com/

2. Corpus (tm): https://www.rdocumentation.org/packages/tm/versions/0.7-6/topics/Corpus

3. kableExtra: https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html#overview

4. R Reduce applys: https://blog.zhaw.ch/datascience/r-reduce-applys-lesser-known-brother/

5. Understanding and writing your first text mining script with R: https://towardsdatascience.com/understanding-and-writing-your-first-text-mining-script-with-r-c74a7efbe30f

RPubs Link http://rpubs.com/Luz917/project3data607Cruz