Introduction

The purpose of this study is to answer the question, “Which are the most valued data science skills?” I have experience applying for jobs, but I was relatively new to the data science vernacular; I am a student with a background in chemical engineering. This project documents how I went about answering this question.

Approach

I spent some time deciding which sets of data to use. The internet is filled with a plethora of data, but little of it was directly applicable, and LinkedIn was not readily accessible. Drawing on my own experience applying for jobs, and after some brainstorming over Zoom, I settled on a few options: searching government sites for available data, Kaggle, and the possibility of web scraping Monster or Indeed. True, I had my own presuppositions about the skills a data scientist needs, but I wanted the data to speak for itself and, ideally, to challenge those presuppositions.

Jackson Pollock the Data

My thinking was that if I obtained data science job listings, I could mine the text for words and phrases. In other words, if I splashed the data onto the canvas and computed some frequencies, perhaps a pattern would emerge.


After googling and searching, I settled on a data set that is reviewed in the next section. However, little did I realize how much cleaning, and how many hours, would be needed to make the data accessible and usable.

Load Libraries

Continuing on, the following are the R libraries used in this project.

library(RMySQL)
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)
library(stringr)
library(ggplot2)
library(dplyr)
library(tidytext)
library(tidyverse)
library(knitr)
library(summarytools)
library(ggraph)
library(igraph)
library(quanteda)
library(tidyr)
library(topicmodels)

RMySQL is used for working with a MySQL database, and tm for text mining. SnowballC, wordcloud, and RColorBrewer were used for creating word clouds in R. stringr, dplyr, tidytext, tidyr, and tidyverse were used for text manipulation. ggplot2, ggraph, and igraph were used for graphs, and knitr was used for making some tables. topicmodels and quanteda were used for topic modeling and lexical analysis of the data. summarytools provides a handy frequency function to tally the occurrences in a column.

Importing

The data comes from https://data.world/jobspikr/10000-data-scientist-job-postings-from-the-usa, where I obtained the csv. The data was scraped using JobsPikr. The csv consists of 10,000 records with the following text fields: crawl_timestamp, url, job_title, category, company_name, city, state, country, inferred_city, inferred_state, inferred_country, post_date, job_description, job_type, salary_offered, job_board, geo, cursor, contact_email, contact_phone_number, uniq_id, and html_job_desc.

This is the original csv with its field names.

CSV Fieldnames

Initial import of the csv file.

# Using read_csv as it allows for on-the-fly decompression of the gzip-compressed csv file
jobs_df <- read_csv("https://github.com/logicalschema/DATA607/raw/master/project3/data/data_scientist_united_states_job_postings_jobspikr.csv.gz")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   post_date = col_date(format = ""),
##   salary_offered = col_logical(),
##   cursor = col_double(),
##   contact_email = col_logical(),
##   html_job_description = col_logical()
## )
## See spec(...) for full column specifications.
# Look at the first group of imported rows
head(jobs_df)
jobs_desc <- tibble(id = jobs_df$uniq_id, 
                        desc = jobs_df$job_description)
jobs_desc %>% 
  select(desc) %>% 
  sample_n(5)
jobs_cat <- tibble(id = jobs_df$uniq_id, 
                        category = jobs_df$category)
jobs_cat <- jobs_cat %>% filter(jobs_cat$category != "") %>% print(n=10)
## # A tibble: 2,136 x 2
##    id                               category          
##    <chr>                            <chr>             
##  1 3b6c6acfcba6135a31c83bd7ea493b18 Accounting/Finance
##  2 1c8541cd2c2c924f9391c7d3f526f64e Accounting/Finance
##  3 445652a560a5441060857853cf267470 biotech           
##  4 9571ec617ba209fd9a4f842973a4e9c8 Accounting/Finance
##  5 0ec629c03f3e82651711f2626c23cadb Accounting/Finance
##  6 80d64b46bc7c89602f63daf06b9f1b4c Accounting/Finance
##  7 9cd3ed78e5cac9e516ea41173de2c25f Computer/Internet 
##  8 53224da901548e7137bbb163d456ba6a Computer/Internet 
##  9 010b5d671896d26eba50948c7c337f94 Computer/Internet 
## 10 4a8b875d1acc4f560716561d699aa022 Computer/Internet 
## # ... with 2,126 more rows

job_description is the most important field in the dataset for the data mining exercise; it contains the complete write-up posted for the job listing. The category field provides some context for the job listing and will be used to capture important words per category.

Initial problems and solutions:

  1. The csv file is 48.5 MB and the character set includes UTF-8 characters. The file was zipped up to reduce download time, and the data is tidied once imported into R.
  2. HTML code was left in the job_description field, evidence of the HTML scrape of the data source. Invalid characters will be removed.
  3. The initial load of the csv into R takes time depending on the hardware of the system. Using the zipped file and cutting the file into smaller pieces reduced the need for additional RAM.
  4. How would the data be converted to a normalized database? The csv was converted to a MySQL script file, with the header column names encapsulated in backticks. MySQL Workbench was used to create a new table in the MySQL database, and the data was then normalized through SQL SELECT statements.
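As an aside, a more direct route than hand-editing a SQL script is to push the imported data frame into MySQL from R with DBI. The sketch below is illustrative only: the host, credentials, and staging table name are placeholders, not the actual Azure instance used for this project.

# Hedged sketch: load the imported csv into MySQL as a staging table.
# Connection details below are placeholders.
library(DBI)
library(RMySQL)

con <- dbConnect(
  MySQL(),
  host     = "example-host.mysql.database.azure.com",
  user     = "example_user",
  password = "example_password",
  dbname   = "jobs"
)

# Write the data frame as a staging table; normalization can then be done
# with SQL SELECT statements as described above.
dbWriteTable(con, "jobs_raw", as.data.frame(jobs_df), row.names = FALSE)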

In addition, when the data was imported into the database, I ran Text Analytics on the `job_description` column to find key phrases. This information was used to create a new column called `keyphrases` in the normalized database.
Additionally, my laptop had only 8 GB of RAM as opposed to 16 GB, which made memory management a recurring concern.

How did the data look in Azure? When the data was imported into Microsoft’s cloud service, I ran some initial tests on the data to look for patterns. Using Microsoft’s Text Analytics tools, the following is a word cloud of the key phrases that were discovered by Microsoft’s AI:

Word Cloud Courtesy of Azure

The key phrases, extrapolated from the job description of each listing, were stored in the database. Because use of Microsoft’s service was limited, there was a need for persistent storage of the results; I had used up the $200 credit for one of my subscriptions within a week of testing. In working with the Azure Text Analytics API, there is an option to send data via JSON to have it analyzed.
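For reference, below is a minimal sketch of how a job description could be sent to the Text Analytics key-phrase endpoint from R using httr. The endpoint URL, API version, and subscription key are placeholders and assumptions; consult the Azure documentation for the exact request format of your resource.

# Hedged sketch: POST one job description to the Azure Text Analytics
# key phrase endpoint. The endpoint and key below are placeholders.
library(httr)
library(jsonlite)

endpoint <- "https://<your-resource>.cognitiveservices.azure.com/text/analytics/v3.0/keyPhrases"
api_key  <- "<subscription-key>"

payload <- list(documents = list(
  list(id = "1", language = "en", text = jobs_desc$desc[1])
))

response <- POST(
  endpoint,
  add_headers("Ocp-Apim-Subscription-Key" = api_key),
  content_type_json(),
  body = toJSON(payload, auto_unbox = TRUE)
)

# Parsed list containing the key phrases returned for the document
key_phrases <- content(response, as = "parsed")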

Excursus

This is a brief description of some of the steps taken to alleviate the problems of importing the information for use in this project.

Infrastructure

Azure, MySQL, and Cognitive Text Analytics

I wanted to experiment with cloud computing, so I used my SPS email to create a trial account for Azure. I created a MySQL database instance to host the databases used. Each trial account is given a $200 credit.
Default timeout settings for the database server had to be lengthened to enable longer processing.
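If server permissions allow, session-level timeouts can also be raised from R once connected. This is a hedged example only (the values are arbitrary and con is the connection object from the earlier sketch); server-wide defaults on Azure are changed through the portal.

# Hypothetical: lengthen session timeouts for long-running queries.
dbGetQuery(con, "SET SESSION wait_timeout = 1200")
dbGetQuery(con, "SET SESSION net_read_timeout = 600")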
This is a link to the mysqldump of the database.
Below is a diagram of the database.

E-R Diagram

Tools

Microsoft’s Cognitive Text Analytics, Power BI, Slack, GitHub, and MySQL Workbench

In addition to the database instance, I created an instance of Azure’s Cognitive Text Analytics to see what information Microsoft’s AI could mine from the data. The other tools were used to facilitate handling the data. Power BI is a Microsoft product for visualizing data; it was employed alongside Microsoft’s Cognitive Text Analytics to extrapolate key phrases from the job descriptions and to create a simple word cloud to compare with the later analysis. Slack and GitHub were used for collaboration, and code was edited via GitHub. MySQL Workbench was used to connect to the MySQL database.

Data Management

As an option to curtail the amount of time needed to process the 10,000-row csv, I worked with a small subset of the file, tested the code, and then ran it against the larger data set. I encountered a few memory errors along the way. Files being worked on from GitHub were compressed to enable quicker transport across the network.
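A hedged sketch of that subset-first workflow, reusing sample_n() from dplyr (the sample size is arbitrary, and the compressed output assumes a recent readr that compresses by file extension):

# Test the text mining code on a small random subset first, then rerun on
# the full data set once the code is stable.
jobs_sample <- jobs_df %>% sample_n(500)

# Writing to a .gz path lets readr compress the file for quicker transport.
write_csv(jobs_sample, "data/jobs_sample.csv.gz")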
This is an example of a memory error on my laptop when trying to text mine the job description data:

Memory Error
To avert this problem, I modified the code to use the Corpus object instead of VCorpus for collections of text documents. A VCorpus is described as a volatile corpus that is “fully kept in memory”.3 In addition, wherever possible, snippets of the data objects were viewed and knitted in R Markdown. So instead of viewing 10,000 rows, I did my best to display 10-20 rows of a data set.
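For illustration, the change amounted to swapping the corpus constructor. A sketch, assuming the job descriptions are passed in as a character vector:

# VCorpus keeps every document fully in memory:
# corpus_v <- VCorpus(VectorSource(jobs_df$job_description))
# Corpus() with a VectorSource builds a lighter-weight corpus and avoided
# the memory error on the 8 GB laptop:
my_tm_corpus <- Corpus(VectorSource(jobs_df$job_description))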

Tidying

Although the data has been imported into a database, the job descriptions still need to be cleaned up. The goal is to tidy the data enough to determine frequencies and see what picture the data paints. I will now query the database for the job descriptions and the key phrases that Azure extrapolated, storing the job descriptions in the variable descriptions and the key phrases in keyphrases.
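A minimal sketch of those queries, assuming the connection con from the earlier sketch; the table and column names are placeholders for the actual normalized schema:

# Query the normalized database for the job descriptions and the Azure
# key phrases. Table and column names below are placeholders.
descriptions <- dbGetQuery(con, "SELECT uniq_id, job_description FROM job_listings")
keyphrases   <- dbGetQuery(con, "SELECT uniq_id, keyphrase FROM keyphrases")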

Briefly looking at the frequencies found in the key phrases, some skills started to appear, such as Python (count: 6471), Computer Science (4441), and SQL (2672). However, some problems came up: the boilerplate in many job descriptions about diversity and equal opportunity policies was inflating the counts, as evident in the frequency of phrases such as Religion (2746), Color (2675), and National Origin (2649).
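The tallies above come from simple frequency counts on the key phrase column; a sketch using the frequency helper mentioned in the libraries section (the column name is a placeholder):

# Tally how often each key phrase occurs across all listings.
freq(keyphrases$keyphrase)

# Or, equivalently, with dplyr:
keyphrases %>% count(keyphrase, sort = TRUE)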


Text Mining

After importing and tidying the data, I will perform tests to mine it.


Process Input Data

This section follows the NASA case study example presented at https://www.tidytextmining.com/nasa.html for term frequency and topic modeling.

From the job_description field, tokenize all the words and remove “stop_words”, which are common words in the English language, to allow for focus on the meaningful words of the job listing.

# Use tidytext’s unnest_tokens() for the description field so we can do the text analysis.
# unnest_tokens() will tokenize all the words in the description field and create a tidy dataframe of the word by identifier
jobs_desc <- jobs_desc %>% 
  unnest_tokens(word, desc) %>% 
  anti_join(stop_words)
## Joining, by = "word"
jobs_desc

Provide count in table form of the most common words in the job descriptions.

# Most common words in the description field
jobs_desc %>%
  count(word, sort = TRUE) %>% print(n=10)
## # A tibble: 18,517 x 2
##    word           n
##    <chr>      <int>
##  1 data       31593
##  2 experience 14064
##  3 business    7720
##  4 learning    6339
##  5 science     6032
##  6 analytics   5561
##  7 analysis    4913
##  8 skills      4910
##  9 machine     4833
## 10 team        4821
## # ... with 18,507 more rows

Applying lowercase to all the words to ensure different cases of the same word aren’t considered different.

# lowercase all the words just to make sure there's no redundancy
jobs_desc <- jobs_desc %>% 
  mutate(word = tolower(word))

Term Frequency

The term frequency times inverse document frequency (TF-IDF) is used to identify words that are especially important to a document within a collection of documents. The results are the most important words in the description fields as measured by TF-IDF, meaning the words are common but not too common.
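As a reminder of the quantity being computed, for a term t in document d within a collection of N documents, of which n_t contain t (bind_tf_idf uses the natural logarithm):

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \ln\!\left(\frac{N}{n_t}\right)$$

where tf(t, d) is the proportion of words in d that are t. A word scores highly when it is frequent in one listing but rare across listings.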

  1. Calculate the TF-IDF
# Calculating tf-idf for the description fields
desc_tf_idf <- jobs_desc %>% 
  count(id, word, sort = TRUE) %>%
  ungroup() %>%
  bind_tf_idf(word, id, n)
desc_tf_idf %>% filter(n >= 10) %>%
  arrange(-tf_idf)
  2. Combine the data frame of the TF-IDF of the job descriptions with the job categories.

The join is performed on the unique ID as key. Joining with the categories will identify the most important words from the job descriptions per job category.

# Join with the category
desc_tf_idf <- full_join(desc_tf_idf, jobs_cat, by = "id")
desc_tf_idf %>% 
  filter(!near(tf, 1)) %>%
  filter(category %in% jobs_cat$category) %>%
  arrange(desc(tf_idf)) %>%
  group_by(category) %>%
  distinct(word, category, .keep_all = TRUE) %>%
  top_n(8, tf_idf) %>% 
  ungroup() %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  ggplot(aes(word, tf_idf, fill = category)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~category, ncol = 3, scales = "free") +
  coord_flip() +
  labs(title = "Highest tf-idf words in job listing description fields",
       caption = "From jobpickr dataset",
       x = NULL, y = "tf-idf")

The resulting plot did not prove useful for identifying skills across all the job listings for data scientists. The plot does indicate which words are more common within a specific job category. The results suggest that job listings within a category are often posted by the same company or the same recruiter, and thus the same boilerplate description is used across many listings.

Topic Modeling

In order to perform topic modeling, a document-term matrix is required.

  1. Calculate the word count by document ID. Each job description is considered a unique document by the job listing’s unique ID.
# 8.4 Topic Modeling
# Casting to a document-term matrix
word_counts <- jobs_desc %>%
  count(id, word, sort = TRUE) %>%
  ungroup()
word_counts %>% print(n=10)
## # A tibble: 483,690 x 3
##    id                               word      n
##    <chr>                            <chr> <int>
##  1 f563735821e0d15153c487edc700db30 data     88
##  2 f9c3f51cbdf699189e3175c66310934b data     72
##  3 25992aeea0ec81a362eb8d5c8a945bac data     66
##  4 04f4b7504d7cb71baf8a22c5a6dafb4e data     64
##  5 25f126b32f26022e8e1d3fbfe85b1802 data     64
##  6 d869251bbe6e76b971246ce06642a44d data     62
##  7 626eb8b274e2369aabdcc153e4e1f32c data     60
##  8 3d8f0398cc54891ea8882148c5ccbc94 data     58
##  9 049784c976f14a08fd6ad657112ee142 data     56
## 10 97e787bb231a3947d4535d7caccc8a2b data     54
## # ... with 483,680 more rows
  2. Construct the document-term matrix.

The resulting document-term matrix indicates a high level of sparsity. Each non-zero entry corresponds to a particular word appearing in a particular document.

# Construct DTM
desc_dtm <- word_counts %>%
  cast_dtm(id, word, n)
desc_dtm
## <<DocumentTermMatrix (documents: 2497, terms: 18517)>>
## Non-/sparse entries: 483690/45753259
## Sparsity           : 99%
## Maximal term length: 72
## Weighting          : term frequency (tf)
  3. Fit the LDA model

According to Wikipedia, “In natural language processing, the latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.”
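In the tidy output used below, the two quantities of interest can be read, informally, as conditional probabilities:

$$\beta_{k,w} = P(\text{word } w \mid \text{topic } k), \qquad \gamma_{d,k} = P(\text{topic } k \mid \text{document } d)$$

Beta is examined first to find the top terms per topic; gamma is examined later to see which documents (job listings) are attributed to which topics.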

# Running this model is time intensive
# Define there to be 16 topics.
desc_lda <- LDA(desc_dtm, k = 16, control = list(seed = 1234))
desc_lda
## A LDA_VEM topic model with 16 topics.
  4. Tidy the resulting LDA topics.
# Interpreting the data model
tidy_lda <- tidy(desc_lda)
tidy_lda
  5. Identify the top 10 terms for each topic.
# Top 10 Terms
top_terms <- tidy_lda %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms
  6. Plot the top 10 terms for each topic.

Even though the topics are anonymous, only identified by number, the groupings of words show some similarities and differences, but do not necessarily provide much value at this point.

The topic modeling process has identified groupings of terms that we can understand as human readers of these description fields.

top_terms %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%    
  arrange(desc(beta)) %>%  
  ungroup() %>%
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Top 10 terms in each LDA topic",
       x = NULL, y = expression(beta)) +
  facet_wrap(~ topic, ncol = 4, scales = "free")

  7. Calculate gamma

Gamma will define the probability that each document belongs in each topic.

# LDA gamma
lda_gamma <- tidy(desc_lda, matrix = "gamma")
lda_gamma
  8. Identify the categories associated with each topic
lda_gamma <- full_join(lda_gamma, jobs_cat, by = c("document" = "id"))
lda_gamma
top_cats <- lda_gamma %>% 
  filter(gamma > 0.5) %>% 
  count(topic, category, sort = TRUE)
top_cats <- top_cats %>% filter(!is.na(category))

Topic 9 identifies ‘business and financial operations’ as the top category and is the only topic to include the term ‘aws’. Topic 4, most identified with the category ‘Arts/Entertainment/Publishing’, contains the terms ‘experience’ and ‘content’, which align with the category broadly.

# One more graph from 8.4.4
top_cats %>%
  group_by(topic) %>%
  top_n(5, n) %>%
  ungroup %>%
  mutate(category = reorder_within(category, n, topic)) %>%
  ggplot(aes(category, n, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  labs(title = "Top categories for each LDA topic",
       x = NULL, y = "Number of documents") +
  coord_flip() +
  scale_x_reordered() +
  facet_wrap(~ topic, ncol = 2, scales = "free")


Lexical Dispersion

This section follows the examples at https://www.r-bloggers.com/advancing-text-mining-with-r-and-quanteda/ and https://quanteda.io/articles/pkgdown/examples/plotting.html.

  1. Create a corpus based on the unique ID and the job description.
# Generate a corpus
uniq_jobs_df <- jobs_df %>% distinct(uniq_id, .keep_all = TRUE)
my_corpus <- corpus(uniq_jobs_df, docid_field = "uniq_id", text_field = "job_description")
mycorpus_stats <- summary(my_corpus)
  2. Preprocess the text.

Remove numbers, punctuation, symbols, and URLs; split hyphens; and clean up tokens left over from OCR.

# Preprocess the text
# Create tokens
token <-
  tokens(
    my_corpus,
    remove_numbers  = TRUE,
    remove_punct    = TRUE,
    remove_symbols  = TRUE,
    remove_url      = TRUE,
    split_hyphens   = TRUE
  )
# Clean tokens created by OCR
token_ungd <- tokens_select(
  token,
  c("[\\d-]", "[[:punct:]]", "^.{1,2}$"),
  selection = "remove",
  valuetype = "regex",
  verbose = TRUE
)
## removed 2,124 features
  3. Create a document-feature matrix

Using the quanteda library, create the document-feature matrix and trim out words that appear in fewer than 7.5% or more than 90% of the documents.

# Document-feature matrix
my_dfm <- dfm(token_ungd,
              tolower = TRUE,
              stem = TRUE,
              remove = stopwords("english")
              )
my_dfm_trim <-
  dfm_trim(
    my_dfm,
    min_docfreq = 0.075,
    # min 7.5%
    max_docfreq = 0.90,
    # max 90%
    docfreq_type = "prop"
  )
head(dfm_sort(my_dfm_trim, decreasing = TRUE, margin = "both"),
     n = 10,
     nf = 10)
## Document-feature matrix of: 10 documents, 10 features (4.0% sparse) and 20 docvars.
##                                   features
## docs                               analyt busi develop model learn team scienc
##   9824c201fbaf4d211ef817cdac428156     14    8       2    10     2   12     10
##   f563735821e0d15153c487edc700db30     22   27      18     6     8    8     22
##   0f4fc982e72ca26aaad1f66637019652     10    6      10     6     6   12     12
##   9f2a8aee7e7a102720cfb3dbc39c6c24      2    0      28     4    18   18     10
##   6737d173f0bdf3d4d8fd3543518e40c7      2    0      30     4    18   18     10
##   3fcf1d237cf41524a247f9b940393b95     40   22      18    18     4   18      0
##                                   features
## docs                               statist use scientist
##   9824c201fbaf4d211ef817cdac428156      14   8        10
##   f563735821e0d15153c487edc700db30      12  24         6
##   0f4fc982e72ca26aaad1f66637019652      10  10         4
##   9f2a8aee7e7a102720cfb3dbc39c6c24      10   4        12
##   6737d173f0bdf3d4d8fd3543518e40c7      10   4         8
##   3fcf1d237cf41524a247f9b940393b95      12   6         6
## [ reached max_ndoc ... 4 more documents ]
  4. Plot lexical dispersion

The plot shows the occurrences of the terms ‘python’ and ‘r’ across all documents for the state of Oregon (OR). The state was chosen to give a natural subset of all the documents initially included.

The lexical dispersion appears to indicate that the terms ‘python’ and ‘r’ often occur together in the documents (job descriptions), suggesting the listings mention the two programming languages in or near the same sentence.

my_corpus_sub <- corpus_subset(my_corpus, state == "OR")
theme_set(theme_bw())
g <- textplot_xray(
     kwic(my_corpus_sub, pattern = "python"),
     kwic(my_corpus_sub, pattern = "r")
)
g + aes(color = keyword) + 
    scale_color_manual(values = c("blue", "red")) +
    theme(legend.position = "none")


Bigrams

This section follows the tutorial at https://www.tidytextmining.com/. The bigrams identify the word pairs that occur most frequently.

  1. Identify bigrams of n=2.

This exercise finds bigrams of two words. The function also allows for n-grams of more than two words, as sketched below.
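For example, the same call with n = 3 would produce three-word phrases (shown here as a sketch, not run):

# Hypothetical variation: trigrams instead of bigrams.
jobs_trigrams <- jobs_df %>%
  unnest_tokens(trigram, job_description, token = "ngrams", n = 3)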

jobs_bigrams <- jobs_df %>%
  unnest_tokens(bigram, job_description, token = "ngrams", n = 2)
jobs_bigrams %>%
  count(bigram, sort = TRUE)
bigrams_separated <- jobs_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
# new bigram counts:
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)
# This result is valuable
bigram_counts %>% print(n = 20)
## # A tibble: 110,235 x 3
##    word1         word2           n
##    <chr>         <chr>       <int>
##  1 machine       learning     4721
##  2 data          scientist    4271
##  3 data          science      3998
##  4 computer      science      1415
##  5 data          sets         1272
##  6 data          analysis     1208
##  7 data          mining       1150
##  8 equal         opportunity  1129
##  9 national      origin        926
## 10 data          scientists    875
## 11 veteran       status        858
## 12 communication skills        851
## 13 sexual        orientation   842
## 14 data          analytics     838
## 15 opportunity   employer      829
## 16 data          sources       817
## 17 gender        identity      752
## 18 race          color         737
## 19 color         religion      730
## 20 related       field         713
## # ... with 110,215 more rows
  2. Filter the bigrams

Include only bigrams that occurred more than 1,250 times in order to filter out visual noise.

# filter for only relatively common combinations
bigram_graph <- bigram_counts %>%
  filter(n > 1250) %>%
  graph_from_data_frame()
bigram_graph
## IGRAPH ebe5791 DN-- 7 5 -- 
## + attr: name (v/c), n (e/n)
## + edges from ebe5791 (vertex names):
## [1] machine ->learning  data    ->scientist data    ->science  
## [4] computer->science   data    ->sets
  3. Visualize the bigrams in a network plot
set.seed(2020)
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

  4. Visualize the bigrams with a directional plot
set.seed(2021)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()


Frequency Plot

Simple bar plot of the most common words across all the job listings’ descriptions.

jobs_words <- jobs_df %>%
  unnest_tokens(word, job_description) %>%
  anti_join(stop_words) %>%
  count(uniq_id, word, sort = TRUE)
## Joining, by = "word"
total_words <- jobs_words %>% 
  group_by(uniq_id) %>% 
  summarize(total = sum(n))
jobs_words <- left_join(jobs_words, total_words)
## Joining, by = "uniq_id"
jobs_words <- jobs_words %>%
    anti_join(stop_words)
## Joining, by = "word"
jobs_words %>%
  count(word, sort = TRUE)
jobs_words %>%
  count(word, sort = TRUE) %>%
  filter(n > 2500) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()


Analysis

After mining the job descriptions of our data set, the team met again over Zoom to discuss the results of our Pollock method for looking at the data.

Observations:

  1. The Equal Employment Opportunity (EEO) information for jobs was impacting the frequencies.

    An example of an EEO is below:

    CUNY encourages people with disabilities, minorities, veterans and women to apply. At CUNY, Italian Americans are also included among our protected groups. Applicants and employees will not be discriminated against on the basis of any legally protected category, including sexual orientation or gender identity. EEO/AA/Vet/Disability Employer.4

    Words such as religion, national origin, and diversity would appear in the frequencies, and these had to be excluded (a sketch of one way to do this follows this list). We would not have been able to make this decision if we had not regrouped.

  2. Loss of context

    A word or phrase can be taken out of context. For example, the word experience sounds good by itself, but when it is counted it could have been within a sentence that asked for 3-5 years of experience or no experience needed. In addition, we did not examine the role or title of the job description. A senior data scientist would require a particular gamut of skills different from an entry level data scientist. One job title might require a PhD as opposed to a Bachelors degree.

  3. Hard Skills

    After going through the analysis, these were the top 5 hard skills determined by looking at the data together. Note: hard skills are “teachable abilities or skill sets that are easy to quantify.”5

    • Programming Skills: Python, R, SQL
    • Platforms: Tableau, Apache Spark, Hadoop, AWS
    • Computer Science
    • Statistics, Mathematics, Algorithms
    • Experience and Education
  4. Soft Skills

    The following were the top 5 soft skills we determined by looking at the data together. Note: soft skills are “subjective skills that are much harder to quantify.”6

    • Communication Skills: presentation skills
    • Interpersonal Skills: i.e. EQ Emotional Intelligence
    • Team Player
    • Analytical
    • Motivated, driven
  5. Debugging?

    Given the number of days I spent debugging and troubleshooting, I noticed the absence of debugging as a listed skill. Why was troubleshooting seemingly absent from the job descriptions?
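Following up on the first observation, one way to exclude the EEO boilerplate is to treat those terms as additional stop words before counting. This is a hedged sketch; the word list is illustrative, not exhaustive:

# Illustrative custom stop word list for EEO boilerplate terms.
eeo_words <- tibble(word = c("religion", "color", "race", "gender", "veteran",
                             "disability", "orientation", "origin", "equal",
                             "opportunity", "employer"))

# Remove the EEO terms from the tokenized descriptions before recounting.
jobs_desc %>%
  anti_join(eeo_words, by = "word") %>%
  count(word, sort = TRUE)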


Conclusions

The above skills are highly subjective. The data and the analysis did not provide any earth-shattering “aha” moments. I did make a note that the data obtained is for job postings and represents only a slice of the skills needed to be a data scientist. My analysis is not a holistic one.

Every company will look for the most ideal candidates, and the job posting is only one step in any recruitment process. For every job, candidates have to interview and sometimes navigate a gauntlet of interactions with managers and headhunters.

For future steps, it would be useful to include headhunters and other recruiters as sources of data. In addition, a breakdown of the skills by job title might produce additional insight. For the Azure text analytics, one could remove the EEO policies from the job descriptions and use rjson to encapsulate the job descriptions as JSON objects to be analyzed.