Below, we use some NLP techniques to distill the job descriptions down to a few key skills. After reading in and cleaning the column, we convert the text to tokens, run the tokens against our stop-words list, and identify the most prevalent words in the job descriptions. Single words don’t provide much analytical insight on their own, so, after expanding our stop-words list, we determine the most common word pairings, which prove to have much more analytical value. Data Science and Machine Learning are the overwhelming leaders among in-demand skills. Since Data Science describes a group of skills rather than a single skill, we can safely conclude that Machine Learning is the most sought-after skill in London Data Scientist job postings.
Nevertheless, it’s useful to learn about other in-demand skills, so we filter Data Science and Machine Learning out of our dataset to better visualize other popular skills. “Computer Science,” “Data Analytics,” and the ever-important “Communication Skills” top the list of sought-after qualifications in London-area Data Scientist job descriptions. We’ve included several graphics below to illustrate our findings.
# First, let's read in the data extracted from the larger scraped dataset and clean up the job descriptions.
jobs_desc_file <- "https://raw.githubusercontent.com/evanmclaughlin/ECM607/master/Project1_Description-Data.csv"
# read.csv parses the quoted description field directly; splitting the raw lines
# on quote characters would mangle descriptions that contain embedded quotes
jobs_desc <- read.csv(jobs_desc_file, stringsAsFactors = FALSE)
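# Optional sanity check (commented out to keep the knitted output compact): the
# file should parse into one row per posting, with the text in the "desc" column
# str(jobs_desc)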
## Warning in rbind("row,desc", c("1,", " Vitality, Data Scientist,London,
## £Competitive + Benefits + Bonus We’re the UK insurer and investment provider
## that rewards people for positive lifestyle choices. With 1.25m+ UK members and
## more than 25m globally, we’re out to make the world a healthier, happier place.
## That applies as much to our people as it does to our members. So, as well as a
## highly competitive pay package, you’ll enjoy: complimentary breakfasts; regular
## onsite physical and mental wellness workshops; on-site health checks; annual flu
## jabs, and access to our full range of partners and rewards. It’s what we call
## offering shared value, because a healthy, happy team is good for us, good for
## our members, and good for you. As ourData Scientist, you’ll get the benefits
## our members enjoy, including: - Our award-winning private Vitality Health
## insurance + wellness incentive programme - Access to The Vitality Programme –
## Apple Watch, Waitrose and Partners, Garmin, Amazon Prime, Champneys Spa days,
## Rakuten TV and half-price gym memberships to name a few! - Personal health
## fund + Life Assurance - Stakeholder Pension Plan with employer contribution
## - 25 days annual leave + Bank holidays + option to buy and sell 5 more -
## Flexible benefits package - Internal incentives, competitions, and awards – a
## chance to win football and sports tickets or even be in with a chance to have
## a holiday of a lifetime - A genuine opportunity to grow and establish a long-
## term career As our Data Scientist, you will work on innovative AI applications
## across Vitality’s business, including health and wellness management, marketing,
## sales, retention, customer service and engagement, using data mining and machine
## learning techniques. You will enjoy working with rich datasets, cutting edge
## technology, advanced machine learning techniques, see your models used in real
## business applications and help shape new projects in an innovative company
## that helps people live healthier lives. Responsibilities as ourData Scientist
## will include: - Developing and implementing advanced predictive models and
## optimisation algorithms - Producing analytical work that is both customer
## and business-focused - Discovering trends, patterns and stories told by the
## data and presenting them to stakeholders - Leveraging new open data sources
## and extracting further value from existing company data - Producing creative
## data visualisations and intuitive graphics to present complex analytics -
## Leveraging state of the art data mining and machine learning algorithms to
## drive business value - Communicating analysis, findings and recommendations
## to various stakeholders and senior executives What we are looking for in our
## ideal Data Scientist: - An undergraduate degree in a numerical subject - Strong
## knowledge of Microsoft Office tools - Experience accessing and analysing data
## using language/tools/databases such as Python, R, SQL, etc - Experience using
## Gradient Boosting Machines, Random Forest, Neural Network or similar algorithms
## - Experience working as a data science or quantitative/statistical analyst
## - Practical experience of building and implementing machine learning models
## to solve business problems Closing Date: Friday 26th March 2021 Working for
## Vitality, you'll experience an exciting mix of creativity and innovation, within
## a framework of challenging objectives and a passion for delivering the best.
## Our people are chosen for their skills, knowledge, enthusiasm, and attitude but
## above all, their belief that anything can be achieved. If you feel you have the
## skills and experience to become our Data Scientist, thenplease click ‘apply’
## today. ": number of columns of result is not a multiple of vector length (arg
## 56)
# keep just the description text column
jobs_df <- jobs_desc$desc
# It's easier to manipulate this data the way we want by converting it to a tibble
jobs_tbl <- tibble(txt = jobs_df)
#jobs_tbl
# Next, let's tokenize the description text and run a word count to get an idea of the most prevalent words.
# We'll also run the result against a stop-words list to exclude words that add no value to our analysis, such as "the", "and", "that", etc.
token <- jobs_tbl %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words)
## Joining, by = "word"
token_count <- token %>%
  count(word) %>%
  arrange(desc(n))
head(token_count)
## # A tibble: 6 x 2
## word n
## <chr> <int>
## 1 data 3665
## 2 experience 1297
## 3 team 881
## 4 scientist 819
## 5 science 766
## 6 role 723
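# For additional context, these counts can be expressed as a share of all
# non-stop tokens (an optional sketch using the columns produced by count()):
# token_count %>% mutate(share = n / sum(n)) %>% head()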
# Looking at the output above, it will be more useful to examine the most common word pairs, since many of these words are more descriptive in combination with others.
token_pairs <- jobs_tbl %>%
  unnest_tokens(pairs, txt, token = "ngrams", n = 2)
token_pairs %>%
  count(pairs) %>%
  arrange(desc(n))
## # A tibble: 50,845 x 2
## pairs n
## <chr> <int>
## 1 data scientist 747
## 2 you will 728
## 3 data science 619
## 4 will be 576
## 5 machine learning 565
## 6 of the 521
## 7 in the 391
## 8 in a 366
## 9 as a 319
## 10 for a 310
## # ... with 50,835 more rows
# Now, let's run the pairs against the stop_words list by separating each pair and eliminating cases where either word appears in the list.
pairs_separated <- token_pairs %>%
  separate(pairs, c("word1", "word2"), sep = " ")
pairs_df <- pairs_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
pairs_count <- pairs_df %>%
  count(word1, word2, sort = TRUE)
head(pairs_count)
## # A tibble: 6 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 data scientist 747
## 2 data science 619
## 3 machine learning 565
## 4 data scientists 259
## 5 senior data 214
## 6 data engineer 125
# Before uniting these columns, let's quickly go through the prominent words and eliminate more terms that don't add much value by augmenting the stop_words list and running another filter pass.
# Some such words are job titles, recruiter names, job locations, salary information, contract lengths, etc.
# We can add to this list if we happen to see any additional words that aren't helpful to our analysis.
new_stop <- data.frame(word = c("apply", "london", "remote", "remotely", "interview", "salary",
                                "contract", "candidate", "scientist", "scientists", "team",
                                "analyst", "engineer", "engineers", "manager", "managers",
                                "senior", "employment", "experienced", "consultant", "junior",
                                "month", "months", "level", "masters", "rosie", "experience",
                                "orientation", "opportunity", "principal", "benefits", "nick",
                                "days", "day", "role", "francesca", "goldman", "luke", "anna",
                                "date", "charlotte", "driven"),
                       lexicon = "custom")
my_stopwords <- rbind(new_stop, stop_words)
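# Note: rbind() works here because new_stop has the same columns ("word",
# "lexicon") as tidytext's stop_words, so our custom terms simply stack on top of
# the standard lexicons.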
pairs_df <- pairs_separated %>%
  filter(!word1 %in% my_stopwords$word) %>%
  filter(!word2 %in% my_stopwords$word)
# Let's now reunite the word columns into a single pairs column for analysis.
pairs_united <- pairs_df %>%
  unite(term, word1, word2, sep = " ")
df_terms <- pairs_united$term
terms_tbl <- tibble(txt = df_terms)
united_count <- pairs_united %>%
  count(term, sort = TRUE)
head(united_count)
## # A tibble: 6 x 2
## term n
## <chr> <int>
## 1 data science 619
## 2 machine learning 565
## 3 computer science 92
## 4 data analytics 80
## 5 communication skills 63
## 6 data engineering 63
# To facilitate visualization, we can narrow down to the most relevant job skills employers are looking for by setting a floor on the number of instances and condensing our data frame.
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.4
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
min_n <- 30
Results <- dplyr::filter(united_count, n > min_n)
colnames(Results) <- c("term", "frequency")
ggplot2::ggplot(Results, aes(x = term, y = frequency, fill = term)) +
  geom_bar(width = 0.75, stat = "identity", colour = "black", size = 1) +
  coord_polar(theta = "x") +
  ggtitle("Term Frequency (min: 30)") +
  theme(legend.position = "none") +
  labs(x = NULL, y = NULL)
plotly::ggplotly(
  ggplot2::ggplot(Results, aes(x = term, y = frequency, fill = term)) +
    geom_bar(width = 0.75, stat = "identity", colour = "black", size = 1) +
    ggtitle("Word Frequency (min: 30)") +
    labs(x = NULL, y = NULL) +
    theme(legend.position = "none",
          plot.subtitle = element_text(vjust = 1),
          plot.caption = element_text(vjust = 1),
          axis.text.x = element_text(angle = 90),
          panel.background = element_rect(fill = "honeydew1"),
          plot.background = element_rect(fill = "antiquewhite"))
) %>%
  config(displaylogo = FALSE) %>%
  config(showLink = FALSE)
# Data Science and Machine Learning are obviously the overwhelming results relative to other skill pairs. Data Science is a catch-all term that we should set aside going forward.
# Keeping these two terms in the visualization makes it difficult to analyze the remaining results, so let's add a maximum frequency constraint to the graphics above to draw out some nuance.
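# A more explicit alternative (a sketch, not run here) would be to drop the two
# dominant pairs by name rather than by a frequency ceiling:
# united_count %>% dplyr::filter(!term %in% c("data science", "machine learning"))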
min_n <- 30
max_n <- 100
Results2 <- dplyr::filter(united_count, n > min_n, n < max_n)
colnames(Results2) <- c("term", "frequency")
ggplot2::ggplot(Results2, aes(x = term, y = frequency, fill = term)) +
  geom_bar(width = 0.75, stat = "identity", colour = "black", size = 1) +
  coord_polar(theta = "x") +
  ggtitle("Term Frequency (min: 30, max: 100)") +
  theme(legend.position = "none") +
  labs(x = NULL, y = NULL)
plotly::ggplotly(
  ggplot2::ggplot(Results2, aes(x = term, y = frequency, fill = term)) +
    geom_bar(width = 0.75, stat = "identity", colour = "black", size = 1) +
    ggtitle("Word Frequency (min: 30, max: 100)") +
    labs(x = NULL, y = NULL) +
    theme(legend.position = "none",
          plot.subtitle = element_text(vjust = 1),
          plot.caption = element_text(vjust = 1),
          axis.text.x = element_text(angle = 90),
          panel.background = element_rect(fill = "honeydew1"),
          plot.background = element_rect(fill = "antiquewhite"))
) %>%
  config(displaylogo = FALSE) %>%
  config(showLink = FALSE)
# This provides quite a bit more differentiation among the remaining terms in our visualizations, but machine learning remains the most sought-after skill throughout the job descriptions.
# Let's visualize the terms once more in a word cloud, this time filtering out the dominant "data science" pair.
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.0.4
# use descriptive bounds rather than masking base functions like c()
min_n <- 10
max_n <- 600
Results3 <- dplyr::filter(united_count, n > min_n, n < max_n)
wordcloud2(Results3, color = "random-light", backgroundColor = "grey", size = 1.75)
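# To share the interactive cloud outside this notebook, one option (assuming the
# htmlwidgets package is installed) is to save it as a standalone HTML file:
# htmlwidgets::saveWidget(wordcloud2(Results3, color = "random-light",
#                                    backgroundColor = "grey", size = 1.75),
#                         "skills_wordcloud.html", selfcontained = TRUE)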