Introduction
Data Collection
Data Cleaning
Data Storage
Analysis
- Analysis: By Count
- Analysis: By Salary
  - Exploratory Analysis
  - Skills by Salary
Conclusion

Load libraries:

library(dplyr)
library(ggplot2)
library(RSQLite)
library(stringr)
library(tidyr)

Introduction

The purpose of this document is to answer the question, “Which are the most valued data science skills?” We collected a sample $n = 136$ of data scientist job postings from Indeed.com via web scraping.

After data collection and cleaning, we employed two strategies to value each skill: using salary information from the job listings, and using raw counts. Postings that included salary information were a minority of postings.

Data Collection

We scraped about 500 data scientist listings from Indeed.com, spread between five large American cities: New York City, San Francisco, Chicago, Washington, D.C., and Seattle.

Indeed.com was chosen because of its simple and clear query string method. The URL of a page of listings consisted of a base URL, a variable for city, and a variable for page number. E.g., the second page of listings for New York City data scientists was https://www.indeed.com/jobs?q=data+scientist&l=New+York%2C+NY&start=10. It was thus easy to create a list of ten pages for each city, and then scrape the URLs of the individual job postings on each results page.

We now had a list of individual job postings. Although they were not as structured as we would like, we were able to put some structure to it. Our final data frame had columns: url, job title, company name, Indeed users’ score of the comapny, job description, how long the posting had been up, metadata (mostly salary information), the date of scraping, city, and page number of results.

The actual scrape algorithm is pretty simple:

For each city in the list of cities:
1. Get some page of results, seq(0, 100, 10) = (0, 10, 20, ..., 100).
2. For each page in pages:
  1. Assemble a URL of job results, e.g., https://www.indeed.com/jobs?q=data+scientist&l=Seattle%2C+WA&start=100
  2. Read the HTML of the resulting page: read_html(results_url)
  3. Extract the approximately 15 links for job postings contained in the HTML: extract_listing_urls(results_page_html)
  4. Use sapply to extract_listing_data() from each of these links
  5. Transforming these results into a data.frame, save it to a TSV file

The functions extract_listing_urls() and extract_listing_data() do the bulk of the work. They are present in scrape.R. To highlight one interesting technique, extract_listing_data() is actually two function. extract_listing_data_()—note the underscore at the end—is the main scrape function, which is wrapped in extract_listing_data(), which uses R’s tryCatch() function to handle HTML errors elegantly (404s, etc.).

One challenge in the cleaning process was how to store the long string of text in the description column. We employed a nuber of techniques to make this easier, however, we probably should’ve store data in JSON format. First, we used tab delimited format, second we replaced all whitespace characters in the description with simple spaces, and third, we used stringR’s str_squish() function.

Data Cleaning

The data was cleaned using the procedure below:

The data was extracted from a GitHub directory and imported into R as a single data frame.
The data had some NAs, which were removed to ensure an accurate count of soft and hard skills.
Duplicate entries were removed so the final counts of skills were not skewed.
Columns were created for each of the soft and hard skills identified from resume genius.com. When a soft or hard skill appeared within a job posting a “TRUE” value was generated.
The data was transformed from wide to long. A condition column was created so each soft and hard skill would be a row in the said column. The skills column had Boolean values that need to be converted to a numeric value. True was modified to 1 and False to 0.

Data Storage

A SQL script was developed to transfer the indeed data frame to tables in MYSQL Workbench.

Analysis

Load the clean data from the SQLite table:

driver <- SQLite()
con <- dbConnect(driver, dbname='../data/clean/indeed.db')
df <- dbGetQuery( con,'select * from indeed_clean;' )
dbDisconnect(con)
colnames(df) <- as.character(df[1,])
df <- df[-1,]
df$`NA` <- NULL

Analysis: By Count

#Convert data from wide to long
datalong <- gather(df, condition, skills,softskill_communication:hardskill_shell, factor_key = TRUE)

# Convert True to 1 and False to 0
datalong$skills[datalong$skills == "TRUE"] <- 1
datalong$skills[datalong$skills == "FALSE"] <- 0

#Convert NA to 0

datalong$skills[is.na(datalong$skills)] <- 0

#Count of soft and hard skills for job posting advertised on Indeed.com
count_df <- datalong %>% group_by(condition) %>%
  summarise(count = sum(as.numeric(skills))) %>%
  arrange(desc(count))

## Warning: package 'bindrcpp' was built under R version 3.4.4

head(count_df,10)

## # A tibble: 10 x 2
##    condition                  count
##    <fct>                      <dbl>
##  1 hardskill_python             107
##  2 hardskill_sql                 80
##  3 hardskill_Analytics           63
##  4 softskill_communication       62
##  5 softskill_problem_solving     41
##  6 softskill_time_management     38
##  7 hardskill_data_engineering    38
##  8 hardskill_spark               36
##  9 hardskill_r                   34
## 10 hardskill_bd                  34

tail(count_df,10)

## # A tibble: 10 x 2
##    condition                                            count
##    <fct>                                                <dbl>
##  1 hardskill_pig                                            4
##  2 softskill_teamwork                                       3
##  3 softskill_adaptability                                   2
##  4 softskill_work_ethic                                     2
##  5 hardskill_User_Interface_Design                          1
##  6 hardskill_HTML                                           0
##  7 hardskill_digital_communication                          0
##  8 hardskill_database_management                            0
##  9 hardskill_Web_Architecture_and_Development_Framework     0
## 10 hardskill_risk_assessment                                0

#PLot Data
ggplot(count_df, aes(x=condition, y=count)) +
  geom_bar(stat='identity') +
  coord_flip()

#Count of soft and hard skills for job posting in Indeed
df.1 <- count_df

soft_skills <-   df.1[1:10,] %>%
  mutate(total = sum(count))%>%
  mutate(job.percentage = count/total)
soft_skills

## # A tibble: 10 x 4
##    condition                  count total job.percentage
##    <fct>                      <dbl> <dbl>          <dbl>
##  1 hardskill_python             107   533         0.201 
##  2 hardskill_sql                 80   533         0.150 
##  3 hardskill_Analytics           63   533         0.118 
##  4 softskill_communication       62   533         0.116 
##  5 softskill_problem_solving     41   533         0.0769
##  6 softskill_time_management     38   533         0.0713
##  7 hardskill_data_engineering    38   533         0.0713
##  8 hardskill_spark               36   533         0.0675
##  9 hardskill_r                   34   533         0.0638
## 10 hardskill_bd                  34   533         0.0638

hard_skills <- df.1[11:34,]  %>%
  mutate(total = sum(count))%>%
  mutate(job.percentage = count/total)
hard_skills

## # A tibble: 24 x 4
##    condition                     count total job.percentage
##    <fct>                         <dbl> <dbl>          <dbl>
##  1 hardskill_hadoop                 31   197         0.157 
##  2 hardskill_java                   22   197         0.112 
##  3 softskill_leadership             18   197         0.0914
##  4 hardskill_data_mining            18   197         0.0914
##  5 softskill_attention_to_detail    17   197         0.0863
##  6 hardskill_aws                    16   197         0.0812
##  7 hardskill_hive                   16   197         0.0812
##  8 softskill_interpersonal          10   197         0.0508
##  9 hardskill_pivot_table             7   197         0.0355
## 10 hardskill_iOS_App_Development     7   197         0.0355
## # ... with 14 more rows

According to resume genius.com, soft skills are the character traits or interpersonal skills that affect your ability to work and in interact with others. Ten (10) soft skills were selected from the following website - https://resumegenius.com . A search was a conducted through Data Scientist job postings that were advertised on Indeed.com. The soft skills: communication, problem solving, time management, leadership, and attention to detail are found more frequently on job postings advertised on Indeed.com. Based on the number of occurrences of the foregoing soft skills, it can be assumed that these soft skills are most sought after by employers of data scientist positions.

According to resume genius.com, hard skills are the specific knowledge and abilities that are learned through education and training. Similarly to the soft skills, twenty three (23) hard skills were selected from Indeed.com . The hard skills python, spark, R, hadoop, SQL, Big Data, Java are found more frequently on job postings advertised on Indeed.com. These hard skills it can also be assumed that these hard skills are most sought after by employers of data scientist positions.

Note that the python, spark, R, hadoop, SQL are programming languages. Python was present more frequently on job postings for Data Scientist positions. This analysis would encourage Data scientists to highlight Python experience on their resumes or learn Python if they haven’t done so already.

Analysis: By Salary

Another way we might evaluate the relative worth of data science skills is by comparing the salaries of these job offers. Unfortunately, most of these listings do not provide salaries, so we’ll have to use a sample much smaller than the original.

First, we’ll need to standardize the metadata column that seems to exist only for conveying salaries. To begin, we’ll subset the dataset to only contain data with salary information.

Most of the observations are blank, represented by character(0) or an empty string ''. Those empty observations will be filtered out, as well as metadata that does not refer to salary (operationalized by contains the dollar sign $) as well as those offering only an hourly wage:

del <- df$metadata[1]
salary <- df %>% filter(metadata != del, 
                        metadata != '',
                        str_detect(metadata, '\\$') == TRUE,
                        str_detect(metadata, 'hour') == FALSE)

This leaves a total of 22 observations with salary information. It is presented in a variety of formats that will need to be cleaned:

head(salary$metadata)

## [1] "$60,000 - $95,000 a year"                                       
## [2] "c(\\$76,000 - $112,000 a year (Indeed Est.) \\, \\Internship\\)"
## [3] "$92,000 - $136,000 a year (Indeed Est.) "                       
## [4] "$120,000 - $177,000 a year (Indeed Est.) "                      
## [5] "c(\\$90,000 - $130,000 a year\\, \\Contract\\)"                 
## [6] "c(\\$80,000 - $110,000 a year\\, \\Contract\\)"

We’ll use a function to extract the important data and discard the rest. Note that this function will convert ranges into their averages, e.g., ‘$100,000 - $110,000’ will be converted to ‘$105,000’.

First, determine if the input string is a valid input. The function is designed to handle common textual representations of salaries, which most often include a dollar sign. If the string does not include a dollar sign, the function assumes the input is invalid and returns NA. Otherwise, the function determines if the string contains a salary range or a single salary, via the presence of -. From there, it strips out unnecessary characters, and returns a numeric data type for salary.

extract_salary <- function(s) {
    has_dollar_sign <- str_detect(s, '\\$')
    if (has_dollar_sign == FALSE) {
        return(NA)
    } else {
        
        contains_range <- str_detect(s, ' - ')
        
        if (contains_range == TRUE) {
            text_range <- unlist(str_extract_all(s, '\\$[0-9,]+'))
            numeric_range <- as.numeric(str_remove_all(text_range, ',|\\$'))
            return(mean(c(numeric_range[1], numeric_range[2])))
        } else {
            as_text <- str_extract_all(s, '\\$[0-9,]+')
            as_numeric <- as.numeric(str_remove_all(as_text, ',|\\$'))
            return(as_numeric)
        }
    }
}

Run a few tests:

extract_salary('This is not a salary')

## [1] NA

extract_salary('$150,000 (depending on experience)')

## [1] 150000

extract_salary('$100,000 - $135,000')

## [1] 117500

Apply to our data frame with sapply:

salary$salary <- as.numeric(lapply(salary$metadata, extract_salary))
head(salary$salary)

## [1]  77500  94000 114000 148500 110000  95000

Exploratory Analysis

With our salary data now available, we can begin examining the distribution of salary and its variation across various dimensions.

ggplot(salary, aes(salary)) + 
    geom_histogram(binwidth=10000)

The salary distribution has a clear central tendancy, though the distribution is pretty wide. We believe the data suggests that a larger sample size would raise several observations further to the right of the above distribution.

The 50th percentile data scientist salary is around $115,000. The minimum paying job is a Data Scientist position for Biz2Credit Inc. in New York City, while the max paying job is for a Senior Data Scientist position for Octane Lending, also in New York City.

quantile(salary$salary, c(0, .1, .25, .5, .75, .9, .95, .99))

##     0%    10%    25%    50%    75%    90%    95%    99% 
##  57500  79150  96250 115000 134375 145200 148350 157585

Skills by Salary

With clean salary and skills data, we are now equipped to examine the relative value each skill in an exact dollar amount. First, we’ll look at soft skills, then hard skills, then we’ll compare them in aggregate.

Rearrange skills data to be amenable to analysis:

skills_long <- salary %>%
    select(url, salary, city, softskill_communication:hardskill_shell) %>%
    gather(skill_name, skill, softskill_communication:hardskill_shell, factor_key=TRUE) %>%
    filter(skill == TRUE) %>%
    select(url, salary, city, skill_name)

Soft Skills

Further rearrange for soft-skills:

soft_skills <- skills_long %>% 
    filter(str_detect(skill_name, 'softskill') == TRUE) %>%
    mutate(skill_name = str_remove(skill_name, 'softskill_')) %>%
    arrange(desc(salary))

Graphically represent this data with a box plot:

ggplot(soft_skills, aes(skill_name, salary)) +
    geom_boxplot() +
    theme(axis.text.x=element_text(angle=45, hjust=1))

The graph gives a sense of the distribution of salaries corresponding to each skill. Below gives the mean value. We see time management appears to be most highly prized, with a mean salary of $120,500.

soft_skills %>% 
    group_by(skill_name) %>% 
    summarize(avg_salary=mean(salary)) %>%
    arrange(desc(avg_salary))

## # A tibble: 6 x 2
##   skill_name          avg_salary
##   <chr>                    <dbl>
## 1 time_management        120500 
## 2 communication          118278.
## 3 problem_solving        110700 
## 4 attention_to_detail    109500 
## 5 interpersonal          106000 
## 6 leadership              90000

Hard Skills

Duplicating this same analysis for hard skills:

hard_skills <- skills_long %>% 
    filter(str_detect(skill_name, 'hardskill') == TRUE) %>%
    mutate(skill_name = str_remove(skill_name, 'hardskill_')) %>%
    arrange(desc(salary))

ggplot(hard_skills, aes(skill_name, salary)) +
    geom_boxplot() +
    theme(axis.text.x=element_text(angle=45, hjust=1))

The table gives the mean salary of each hard skill:

hard_skills %>% 
    group_by(skill_name) %>% 
    summarize(avg_salary=mean(salary)) %>%
    arrange(desc(avg_salary))

## # A tibble: 18 x 2
##    skill_name          avg_salary
##    <chr>                    <dbl>
##  1 tensorflow             132250 
##  2 sql                    122281.
##  3 data_mining            122000 
##  4 hive                   121300 
##  5 python                 120417.
##  6 c                      120000 
##  7 hadoop                 119875 
##  8 Analytics              119292.
##  9 bd                     116917.
## 10 spark                  115062.
## 11 aws                    112000 
## 12 iOS_App_Development    110833.
## 13 data_engineering       110429.
## 14 pig                    110000 
## 15 java                   105375 
## 16 r                      101700 
## 17 shell                   93750 
## 18 pivot_table             57500

Although the larger number of hard skills make the graph harder to read, there are perhaps three identifiable groups of higher-, medium-, and lesser-valued skills.

Hard Skills v. Soft Skills

The following section is a small experiment to compare the value of soft and hard skills to employers. We will collect all soft skills and hard skills together, and examine their respective mean salaries.

hard_soft <- skills_long %>%
    mutate(skill_type = ifelse(str_detect(skills_long$skill_name, 'softskill_') == TRUE, 'soft', 'hard')) %>%
    arrange(desc(salary))

ggplot(hard_soft, aes(skill_type, salary)) +
    geom_boxplot() +
    theme(axis.text.x=element_text(angle=45, hjust=1))

Although the mean salary for hard skills is slightly higher than soft skills, the distributions are effectively the same (at least for this low $n = 22$), both in terms of central tendency and variance.

hard_soft %>% 
    group_by(skill_type) %>% 
    summarize(avg_salary=mean(salary)) %>%
    arrange(desc(avg_salary))

## # A tibble: 2 x 2
##   skill_type avg_salary
##   <chr>           <dbl>
## 1 hard          116423.
## 2 soft          112328.

According to this data, we have to conclude that soft and hard skills are valued about equally by employers.

Conclusion

Among the top five most highly valued skills include Tensforflow, SQL, data mining, Hive, and Python. Python knowledge was especially emphasized in the count analysis. Much of this made sense to us. Tensorflow and deep learning are probably the greatest trend in data science today. Python and SQL are mainstays in data science, and Hive is probably the easiest way to access big data (in a SQL-like language) available today. We were unsurprised that ‘pivot table’ is not valued so highly—data scientists are not Excel jockeys. More surprising is that the average salary for R is less than $100,000. Though perhaps not—as Python catches up and (arguably) surpasses R as the langauge of choice for datas science, R would be valued less. Perhaps a flood of new R-trained data science and statistics graduates are depressing its price in the market!

The soft skills were more surprising. Leadership—a ubiquitous and often obnoxious buzzword of the corporate world—does not appear to be so highly valued for data scientists, with a mean salary of $90,000, and a ‘lower distribution’ than all the other soft skills. Data science projects often lack a discrete beginning and end, so time management was the most prized soft skill. Communication was also highly valued, which is a sensible desire for technical skilled workers who oftne have to provide non-technical management with recommendations and analyses.

Hard skills and soft skills in general appear to be equally prized by employers.

Caution is warranted, however. These results are based on a small sample size. Further work could validate them by collecting much more job listings with more geographical diversity. Although we included geography in this data set, we did not explore this variation.

DATA 607—Data Science Job Skills

Joel G. & Ben H.

October 21, 2018