Below, we use some NLP techniques to distill the job descriptions down to a few key skills. After reading in and cleaning the column, we convert the text to tokens, run the tokens against our stop-words list, and identify the most prevalent words in the job descriptions. Single words don’t provide much analytical insight on their own, so, after expanding our stop-words list, we determine the most common word pairings, which prove to have much more analytical value. Data Science and Machine Learning are the overwhelming leaders among in-demand skills. Since Data Science describes a group of skills rather than a single skill, we can safely conclude that Machine Learning is the most sought-after skill in London Data Scientist job postings.
Nevertheless, it’s useful to learn about other in-demand skills, so we filter Data Science and Machine Learning out of our dataset to better visualize other popular skills. “Computer Science,” “Data Analytics,” and the ever-important “Communication Skills” top the list of sought-after qualifications in London-area Data Scientist job descriptions. We’ve included several graphics below to illustrate our findings.
# First, let's read in the data extracted from the larger scraped dataset and clean up the job descriptions.
jobs_desc_file <- "https://raw.githubusercontent.com/evanmclaughlin/ECM607/master/Project1_Description-Data.csv"
# read.csv parses the quoted description field directly; splitting the raw lines
# on quote characters would mangle descriptions that contain embedded quotes
jobs_desc <- read.csv(jobs_desc_file, stringsAsFactors = FALSE)
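# Optional sanity check (commented out to keep the knitted output compact): the
# file should parse into one row per posting, with the text in the "desc" column
# str(jobs_desc)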
## Warning in rbind("row,desc", c("1,", " Vitality, Data Scientist,London,
## £Competitive + Benefits + Bonus We’re the UK insurer and investment provider
## that rewards people for positive lifestyle choices. With 1.25m+ UK members and
## more than 25m globally, we’re out to make the world a healthier, happier place.
## That applies as much to our people as it does to our members. So, as well as a
## highly competitive pay package, you’ll enjoy: complimentary breakfasts; regular
## onsite physical and mental wellness workshops; on-site health checks; annual flu
## jabs, and access to our full range of partners and rewards. It’s what we call
## offering shared value, because a healthy, happy team is good for us, good for
## our members, and good for you. As ourData Scientist, you’ll get the benefits
## our members enjoy, including: - Our award-winning private Vitality Health
## insurance + wellness incentive programme - Access to The Vitality Programme –
## Apple Watch, Waitrose and Partners, Garmin, Amazon Prime, Champneys Spa days,
## Rakuten TV and half-price gym memberships to name a few! - Personal health
## fund + Life Assurance - Stakeholder Pension Plan with employer contribution
## - 25 days annual leave + Bank holidays + option to buy and sell 5 more -
## Flexible benefits package - Internal incentives, competitions, and awards – a
## chance to win football and sports tickets or even be in with a chance to have
## a holiday of a lifetime - A genuine opportunity to grow and establish a long-
## term career As our Data Scientist, you will work on innovative AI applications
## across Vitality’s business, including health and wellness management, marketing,
## sales, retention, customer service and engagement, using data mining and machine
## learning techniques. You will enjoy working with rich datasets, cutting edge
## technology, advanced machine learning techniques, see your models used in real
## business applications and help shape new projects in an innovative company
## that helps people live healthier lives. Responsibilities as ourData Scientist
## will include: - Developing and implementing advanced predictive models and
## optimisation algorithms - Producing analytical work that is both customer
## and business-focused - Discovering trends, patterns and stories told by the
## data and presenting them to stakeholders - Leveraging new open data sources
## and extracting further value from existing company data - Producing creative
## data visualisations and intuitive graphics to present complex analytics -
## Leveraging state of the art data mining and machine learning algorithms to
## drive business value - Communicating analysis, findings and recommendations
## to various stakeholders and senior executives What we are looking for in our
## ideal Data Scientist: - An undergraduate degree in a numerical subject - Strong
## knowledge of Microsoft Office tools - Experience accessing and analysing data
## using language/tools/databases such as Python, R, SQL, etc - Experience using
## Gradient Boosting Machines, Random Forest, Neural Network or similar algorithms
## - Experience working as a data science or quantitative/statistical analyst
## - Practical experience of building and implementing machine learning models
## to solve business problems Closing Date: Friday 26th March 2021 Working for
## Vitality, you'll experience an exciting mix of creativity and innovation, within
## a framework of challenging objectives and a passion for delivering the best.
## Our people are chosen for their skills, knowledge, enthusiasm, and attitude but
## above all, their belief that anything can be achieved. If you feel you have the
## skills and experience to become our Data Scientist, thenplease click ‘apply’
## today. ": number of columns of result is not a multiple of vector length (arg
## 56)
# keep just the description text column
jobs_df <- jobs_desc$desc
# It's easier to manipulate this data the way we want by converting it to a tibble
jobs_tbl <- tibble(txt = jobs_df)
#jobs_tbl
# Next, let's tokenize the description text and run a word count to get an idea of the most prevalent words.
# We'll also run the result against a stop-words list to exclude words that add no value to our analysis, such as "the", "and", "that", etc.
token <- jobs_tbl %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words)
## Joining, by = "word"
token_count <- token %>%
  count(word) %>%
  arrange(desc(n))
head(token_count)
## # A tibble: 6 x 2
## word n
## <chr> <int>
## 1 data 3665
## 2 experience 1297
## 3 team 881
## 4 scientist 819
## 5 science 766
## 6 role 723
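# For additional context, these counts can be expressed as a share of all
# non-stop tokens (an optional sketch using the columns produced by count()):
# token_count %>% mutate(share = n / sum(n)) %>% head()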
# Looking at the output above, it will be more useful to examine the most common word pairs, since many of these words are more descriptive in combination with others.
token_pairs <- jobs_tbl %>%
  unnest_tokens(pairs, txt, token = "ngrams", n = 2)
token_pairs %>%
  count(pairs) %>%
  arrange(desc(n))
## # A tibble: 50,845 x 2
## pairs n
## <chr> <int>
## 1 data scientist 747
## 2 you will 728
## 3 data science 619
## 4 will be 576
## 5 machine learning 565
## 6 of the 521
## 7 in the 391
## 8 in a 366
## 9 as a 319
## 10 for a 310
## # ... with 50,835 more rows
# Now, let's run the pairs against the stop_words list by separating each pair and eliminating cases where either word appears in the list.
pairs_separated <- token_pairs %>%
  separate(pairs, c("word1", "word2"), sep = " ")
pairs_df <- pairs_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
pairs_count <- pairs_df %>%
  count(word1, word2, sort = TRUE)
head(pairs_count)
## # A tibble: 6 x 3
## word1 word2 n
## <chr> <chr> <int>
## 1 data scientist 747
## 2 data science 619
## 3 machine learning 565
## 4 data scientists 259
## 5 senior data 214
## 6 data engineer 125
# Before uniting these columns, let's quickly go through the prominent words and eliminate more terms that don't add much value by augmenting the stop_words list and running another filter pass.
# Some such words are job titles, recruiter names, job locations, salary information, contract lengths, etc.
# We can add to this list if we happen to see any additional words that aren't helpful to our analysis.
new_stop <- data.frame(word = c("apply", "london", "remote", "remotely", "interview", "salary",
                                "contract", "candidate", "scientist", "scientists", "team",
                                "analyst", "engineer", "engineers", "manager", "managers",
                                "senior", "employment", "experienced", "consultant", "junior",
                                "month", "months", "level", "masters", "rosie", "experience",
                                "orientation", "opportunity", "principal", "benefits", "nick",
                                "days", "day", "role", "francesca", "goldman", "luke", "anna",
                                "date", "charlotte", "driven"),
                       lexicon = "custom")
my_stopwords <- rbind(new_stop, stop_words)
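# Note: rbind() works here because new_stop has the same columns ("word",
# "lexicon") as tidytext's stop_words, so our custom terms simply stack on top of
# the standard lexicons.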
pairs_df <- pairs_separated %>%
  filter(!word1 %in% my_stopwords$word) %>%
  filter(!word2 %in% my_stopwords$word)
# Let's now reunite the word columns into a single pairs column for analysis.
pairs_united <- pairs_df %>%
  unite(term, word1, word2, sep = " ")
df_terms <- pairs_united$term
terms_tbl <- tibble(txt = df_terms)
united_count <- pairs_united %>%
  count(term, sort = TRUE)
head(united_count)
## # A tibble: 6 x 2
## term n
## <chr> <int>
## 1 data science 619
## 2 machine learning 565
## 3 computer science 92
## 4 data analytics 80
## 5 communication skills 63
## 6 data engineering 63
# To facilitate visualization, we can narrow down to the most relevant job skills employers are looking for by setting a floor on the number of instances and condensing our data frame.
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.4
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
min_n <- 30
Results <- dplyr::filter(united_count, n > min_n)
colnames(Results) <- c("term", "frequency")
ggplot2::ggplot(Results, aes(x = term, y = frequency, fill = term)) +
  geom_bar(width = 0.75, stat = "identity", colour = "black", size = 1) +
  coord_polar(theta = "x") +
  ggtitle("Term Frequency (min: 30)") +
  theme(legend.position = "none") +
  labs(x = NULL, y = NULL)
plotly::ggplotly(
  ggplot2::ggplot(Results, aes(x = term, y = frequency, fill = term)) +
    geom_bar(width = 0.75, stat = "identity", colour = "black", size = 1) +
    ggtitle("Word Frequency (min: 30)") +
    labs(x = NULL, y = NULL) +
    theme(legend.position = "none",
          plot.subtitle = element_text(vjust = 1),
          plot.caption = element_text(vjust = 1),
          axis.text.x = element_text(angle = 90),
          panel.background = element_rect(fill = "honeydew1"),
          plot.background = element_rect(fill = "antiquewhite"))
) %>%
  config(displaylogo = FALSE) %>%
  config(showLink = FALSE)
# Data Science and Machine Learning are obviously the overwhelming results relative to other skill pairs. Data Science is a catch-all term that we should set aside going forward.
# Keeping these two terms in the visualization makes it difficult to analyze the remaining results, so let's add a maximum frequency constraint to the graphics above to draw out some nuance.
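# A more explicit alternative (a sketch, not run here) would be to drop the two
# dominant pairs by name rather than by a frequency ceiling:
# united_count %>% dplyr::filter(!term %in% c("data science", "machine learning"))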
min_n <- 30
max_n <- 100
Results2 <- dplyr::filter(united_count, n > min_n, n < max_n)
colnames(Results2) <- c("term", "frequency")
ggplot2::ggplot(Results2, aes(x = term, y = frequency, fill = term)) +
  geom_bar(width = 0.75, stat = "identity", colour = "black", size = 1) +
  coord_polar(theta = "x") +
  ggtitle("Term Frequency (min: 30, max: 100)") +
  theme(legend.position = "none") +
  labs(x = NULL, y = NULL)
plotly::ggplotly(
  ggplot2::ggplot(Results2, aes(x = term, y = frequency, fill = term)) +
    geom_bar(width = 0.75, stat = "identity", colour = "black", size = 1) +
    ggtitle("Word Frequency (min: 30, max: 100)") +
    labs(x = NULL, y = NULL) +
    theme(legend.position = "none",
          plot.subtitle = element_text(vjust = 1),
          plot.caption = element_text(vjust = 1),
          axis.text.x = element_text(angle = 90),
          panel.background = element_rect(fill = "honeydew1"),
          plot.background = element_rect(fill = "antiquewhite"))
) %>%
  config(displaylogo = FALSE) %>%
  config(showLink = FALSE)
# This provides quite a bit more differentiation among the remaining terms in our visualizations, but machine learning remains the most sought-after skill throughout the job descriptions.
# Let's visualize the terms once more in a word cloud, this time filtering out the dominant "data science" pair.
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.0.4
# use descriptive bounds rather than masking base functions like c()
min_n <- 10
max_n <- 600
Results3 <- dplyr::filter(united_count, n > min_n, n < max_n)
wordcloud2(Results3, color = "random-light", backgroundColor = "grey", size = 1.75)
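# To share the interactive cloud outside this notebook, one option (assuming the
# htmlwidgets package is installed) is to save it as a standalone HTML file:
# htmlwidgets::saveWidget(wordcloud2(Results3, color = "random-light",
#                                    backgroundColor = "grey", size = 1.75),
#                         "skills_wordcloud.html", selfcontained = TRUE)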