Introduction

As current and future data scientists, we are keenly interested in the job market for data science. And we can, in fact, use data science to analyze said job market. For this project we procure and store three different datasets of job listings using a variety of methods. We tidy the data and then analyze and visualize it to get a better understanding. In particular, we examine matters such as salaries, locations, and coveted skills for data scientists.

City Jobs Dataset

We start by importing the data for NYC jobs from Github and cleaning up the column names. Then we remove duplicate listings.

raw_nyc_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_607/master/project_3/nyc-jobs.csv')
nyc_jobs_df <- raw_nyc_df
names(nyc_jobs_df) <- names(nyc_jobs_df) %>% tolower() %>% gsub("\\.", "_", .)
names(nyc_jobs_df)[names(nyc_jobs_df)=="x__of_positions"] <- "n_of_positions"
nyc_jobs_df <- nyc_jobs_df %>% select(-posting_type) %>% unique()
nyc_jobs_df %>% tail() %>% datatable()


Filter for data-specific jobs

We filter for data science jobs by using a regular expression to search the business_title column for the case-insensitive terms “data” or “analytics.” Then we create another table for the non-data jobs.

data_jobs_df <- nyc_jobs_df %>% filter(grepl("data|analytics", business_title, ignore.case = T))
other_jobs_df <- nyc_jobs_df %>% filter(!grepl("data|analytics", business_title, ignore.case = T))
data_jobs_df %>% head() %>% datatable()


Analyze quantity and salary

We can see from the above table that a lot of key values are missing, including Job Description and Preferred Skills. This significantly limits our analysis capabilities to only a few areas.
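
As a quick check, we can count how many of the data-job listings have these fields blank (a minimal sketch; it assumes the cleaned column names job_description and preferred_skills and that missing values are stored as empty strings):

### Rough count of data-job listings with blank key fields (a sketch; assumes blanks rather than NAs)
sum(trimws(as.character(data_jobs_df$job_description)) == "", na.rm = TRUE)
sum(trimws(as.character(data_jobs_df$preferred_skills)) == "", na.rm = TRUE)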


Of all the jobs working for New York City, how many are data jobs?

### Total number of jobs in the dataset:
nrow(nyc_jobs_df)
## [1] 2205
### Number of data jobs:
nrow(data_jobs_df)
## [1] 71
### Data jobs, as a percentage of total:
nrow(data_jobs_df)/nrow(nyc_jobs_df)
## [1] 0.03219955

In this dataset, only about 3.2 percent of jobs are data jobs.


How well do data jobs pay relative to non-data jobs?

This dataset provides two salaries per job posting: a minimum (salary_range_from) and a maximum (salary_range_to). As a first look at whether data jobs differ from non-data jobs, we create box plots to compare the distributions.

### Data jobs
summary(data_jobs_df$salary_range_to)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      15   65339   84301   78740   99000  161497
### Non-data jobs
summary(other_jobs_df$salary_range_to)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     10.36  56703.25  75595.50  78639.47 101673.00 230000.00
par(mfrow=c(1,2))
boxplot(data_jobs_df$salary_range_to, xlab="Data Jobs", ylab="Salary in Dollars", ylim=c(0, 200000))
boxplot(other_jobs_df$salary_range_to, xlab="Non-Data Jobs", ylab="Salary in Dollars", ylim=c(0, 200000))

par(mfrow=c(1,1))

The distribution is wider for non-data jobs, but the median salary is higher for data jobs. It is important to remember that these are all government jobs, which overall may pay less than private sector jobs.

Analyzing the data further, we can see that the minimum salary in both groups is less than $20. These must be hourly rates, which would unfairly skew the distributions, so we remove any listing with a salary below $20,000.

salary.thresh <- 20000
data_jobs_df_new <- subset(data_jobs_df, salary_range_to >= salary.thresh & salary_range_from >= salary.thresh)
other_jobs_df_new <- subset(other_jobs_df, salary_range_to >= salary.thresh & salary_range_from >= salary.thresh)

Now that only annual salaries are included, we can conduct a statistical analysis on the data. We run a two-sample \(t\)-test to see whether there is a statistically significant difference between the means; in particular, we want to see whether data jobs pay better than non-data jobs. The null and alternative hypotheses are:

\(H_{0}: \mu_{A}=\mu_{B}\)

\(H_{a}: \mu_{A}>\mu_{B}\)

where \(A\) in this case represents the data group, and \(B\) represents the non-data group.

The \(t\)-statistic is calculated using Welch's method, since the variances do not appear to be equal. The equation is as follows:

\(t = \frac{\bar{x}_{A}-\bar{x}_{B}}{\sqrt{\frac{s_{A}^{2}}{n_{A}}+\frac{s_{B}^{2}}{n_{B}}}}\)

The degrees of freedom, \(df\), are calculated using the following equation:

\(df = \frac{\left(\frac{s_{A}^{2}}{n_{A}}+\frac{s_{B}^{2}}{n_{B}}\right)^{2}}{\frac{s_{A}^{4}}{n_{A}^{2}(n_{A}-1)}+\frac{s_{B}^{4}}{n_{B}^{2}(n_{B}-1)}}\)
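
As a sanity check, both quantities can be computed directly from the filtered samples (a sketch; the t.test() call below performs the same calculation):

### Manual Welch t-statistic and degrees of freedom for the upper salary range (a sketch)
x <- data_jobs_df_new$salary_range_to
y <- other_jobs_df_new$salary_range_to
se2_x <- var(x)/length(x)
se2_y <- var(y)/length(y)
t_welch <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
df_welch <- (se2_x + se2_y)^2 / (se2_x^2/(length(x)-1) + se2_y^2/(length(y)-1))
c(t = t_welch, df = df_welch)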

Below is the \(t\)-test for the higher range of salaries. It shows that we fail to reject the null hypothesis at the 95% confidence level, since the \(p\)-value is greater than the significance level.

t.test(data_jobs_df_new$salary_range_to, other_jobs_df_new$salary_range_to, alternative="greater", var.equal=FALSE)

    Welch Two Sample t-test

data:  data_jobs_df_new$salary_range_to and other_jobs_df_new$salary_range_to
t = -0.077608, df = 71.146, p-value = 0.5308
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -5588.974       Inf
sample estimates:
mean of x mean of y 
 88735.88  88984.57 

However, if we conduct the same \(t\)-test on the lower salary range, the \(p\)-value is less than 0.05, as shown below. There is therefore sufficient evidence to reject the null hypothesis in favor of the alternative: the lower end of the salary range for data jobs is higher than for non-data jobs. The lower range may also be the more meaningful figure, since many job postings advertise a wide range but ultimately offer closer to the lower number.

t.test(data_jobs_df$salary_range_from, other_jobs_df$salary_range_from, alternative="greater", var.equal=FALSE)

    Welch Two Sample t-test

data:  data_jobs_df$salary_range_from and other_jobs_df$salary_range_from
t = 1.7365, df = 75.181, p-value = 0.04328
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 227.0869      Inf
sample estimates:
mean of x mean of y 
 59624.94  54081.72 

Below are the histograms for the lower- and upper-range salaries of both the data and non-data jobs. They show that the lower salary range for data jobs has a greater percentage of jobs in the $55-65K range than for non-data jobs. For the higher end of the salary range, the difference between the distributions is harder to discern. This is consistent with our hypothesis tests.

numbreaks <- 10
up.lim <- 200000
par(mfrow=c(2,2))
hist(data_jobs_df_new$salary_range_from,breaks=numbreaks,xlim=c(0,up.lim),main="Lower Limit Data Jobs",xlab="Salary ($)")
hist(other_jobs_df_new$salary_range_from,breaks=numbreaks,xlim=c(0,up.lim),main="Lower Limit Other Jobs",xlab="Salary ($)")
hist(data_jobs_df_new$salary_range_to,breaks=numbreaks,xlim=c(0,up.lim),main="Upper Limit Data Jobs",xlab="Salary ($)")
hist(other_jobs_df_new$salary_range_to,breaks=numbreaks,xlim=c(0,up.lim),main="Upper Limit Other Jobs",xlab="Salary ($)")


Technology Jobs Dataset

Next we import the data for technology jobs within New York City that were posted to dice.com. We separate one of the columns and rename a few others.

raw_dice_df <- read.csv('https://raw.githubusercontent.com/mehtablocker/cuny_607/master/project_3/dice_com_nyc_jobs.csv', stringsAsFactors = F)
dice_jobs_df <- as_tibble(raw_dice_df) %>% 
  separate(employmenttype_jobstatus, into=c("employment_type", "job_status"), sep = ", ", fill="right", extra = "drop")
dice_jobs_df <- dice_jobs_df %>% 
  rename(advertiser_url = advertiserurl, 
         job_description = jobdescription,
         job_id = jobid, 
         job_location = joblocation_address, 
         job_title = jobtitle, 
         post_date = postdate)
dice_jobs_df %>% head() %>% datatable()


Filter for data science jobs

Since this dataset is made up entirely of technology jobs, finding data science jobs specifically requires a bit more nuance. For example, if we filter for the words “data” or “analytics” as before, we catch a lot of software developer jobs that do not really fall within data science.

ds_dice_df <- dice_jobs_df %>% filter(grepl("data|analytics", job_title, ignore.case = T))
ds_dice_df %>% select(job_title, company, employment_type, skills) %>% head() %>% datatable()

We can refine our search by excluding words like “engineer” and “architect” to get a more relevant result.

ds_dice_df <- ds_dice_df %>% 
  filter(!grepl("architect|architecture|engineer|developer|development|administrator|administration", job_title, ignore.case = T))
ds_dice_df %>% select(job_title, company, employment_type, skills) %>% head() %>% datatable()
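
For reference, the number of listings that remain after this refinement (59, as cited in the keyword counts below) can be checked with:

### Count of filtered data science listings
nrow(ds_dice_df)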


Search for keywords

We can text mine the job_description and skills columns to find specific keywords.
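
Since each keyword lookup follows the same pattern, a small helper along these lines could wrap it (a sketch; the chunks below apply the same logic inline):

### Helper to keep rows whose job_description or skills match a regex (a sketch; not used by the chunks below)
mentions_keyword <- function(df, pattern) {
  df %>% filter(grepl(pattern, job_description, ignore.case = TRUE) |
                  grepl(pattern, skills, ignore.case = TRUE))
}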


How many job postings mention the R programming language?

r_dice_df <- ds_dice_df %>% 
  filter(grepl(" R | R,", job_description, ignore.case=T) | grepl(" R | R,", skills, ignore.case=T))
nrow(r_dice_df)
## [1] 6
r_dice_df %>% head() %>% datatable()

Of our 59 filtered job listings, six explicitly mention R.


How many job postings mention Python?

python_dice_df <- ds_dice_df %>% 
  filter(grepl(" python | python,", job_description, ignore.case=T) | grepl(" python | python,", skills, ignore.case=T))
nrow(python_dice_df)
## [1] 8
python_dice_df %>% head() %>% datatable()

Of our 59 filtered job listings, eight explicitly mention Python.


Indeed Jobs Scraping

Web Scraping Indeed Job Search Pages

To obtain the latest available information about data science job postings, 100 pages of the latest Indeed job postings are scraped with the rvest package, targeting various HTML nodes. The resulting dataset contains the following fields: job title, company, location, job summary, and link. To identify the correct HTML nodes for these fields, the Chrome extension SelectorGadget was used in conjunction with the browser's element inspector. Regular expressions are also used for each field to remove blank spaces, new lines, and unnecessary information.

listings <- data.frame(title=character(),
                 company=character(), 
                 location=character(), 
                 summary=character(), 
                 link=character(), 
                 description = character(),
                 stringsAsFactors=FALSE) 
for (i in seq(0, 990, 10)){
  url_ds <- paste0('https://www.indeed.com/jobs?q=data+scientist&l=all&start=',i)
  var <- read_html(url_ds)
  
  #job title
  title <-  var %>% 
    html_nodes('#resultsCol .jobtitle') %>%
    html_text() %>%
    str_extract("(\\w+.+)+") 
  
  #company
  company <- var %>% 
    html_nodes('#resultsCol .company') %>%
    html_text() %>%
    str_extract("(\\w+).+") 
  
  #location
  location <- var %>%
    html_nodes('#resultsCol .location') %>%
    html_text() %>%
    str_extract("(\\w+.)+,.[A-Z]{2}")   
  #summary
  summary <- var %>%
    html_nodes('#resultsCol .summary') %>%
    html_text() %>%
    str_extract(".+")
  
  #link
  link <- var %>%
    html_nodes('#resultsCol .jobtitle .turnstileLink, #resultsCol a.jobtitle') %>%
    html_attr('href') 
  link <- paste0("https://www.indeed.com",link)
    
  listings <- rbind(listings, as.data.frame(cbind(title,
                                                  company,
                                                  location,
                                                  summary,
                                                  link)))
}
## Warning in cbind(title, company, location, summary, link): number of rows
## of result is not a multiple of vector length (arg 2)

We have successfully scraped job title, company, location, job summary, and job link from 100 pages of Indeed job postings.

datatable(listings)

Creating a Unique ID for Each Job Listing

We create a unique ID for each listing by hashing its title, location, and company. This unique ID will be used to remove duplicates (since many sponsored posts show up multiple times in Indeed’s postings).

#create a unique id for each job posting attribute combination
listings$uniqueid <- mapply(function(x, y, z) digest(paste0(x,y,z)), listings$title, listings$location, listings$company)
#remove duplicate unique ids
listings %<>%
  distinct(uniqueid, .keep_all = TRUE)
#remove duplicate links
listings %<>%
  distinct(link, .keep_all = TRUE)
datatable(listings)

Scraping Individual Listings for Full Descriptions

The summaries provided on the main job listing pages are somewhat limited. Since links to each individual job posting were also extracted in the previous step, and duplicates have now been removed, we can iterate through each link and use rvest again to scrape the full job description.

#obtain full description for all job postings
for (i in seq_along(listings$link)){
  #read each posting's page and extract the description; return NA if the page cannot be read
  description <- tryCatch(
     html_text(html_node(read_html(as.character(listings$link[i])),'.jobsearch-JobComponent-description')),
     error=function(e){NA}
  )
  if (is.null(description)){
    description <- NA
  }
  listings$description[i] <- description
}
datatable(listings)
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html

Store Job Listings in a PostgreSQL Database on AWS

Now that full job descriptions are available, regular expressions are used for some minor cleanup. Finally, we store the scraped information in a PostgreSQL database hosted on AWS. Rather than dropping the table every time this code chunk is run, we append the new listings to the existing table. This allows additional job listings to be captured without deleting older ones. To prevent duplicates in the database, we use a temporary SQL table to identify which listings are not already present.
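
The cleanup itself is minor. A sketch of the kind of regular-expression tidying applied to the descriptions (assuming the goal is simply to collapse newlines and repeated whitespace, using stringr, which is already used in the scraping step) might look like:

#collapse newlines/tabs and repeated spaces in the descriptions, and trim the ends (a sketch of the cleanup)
listings$description <- listings$description %>%
  str_replace_all("[\r\n\t]+", " ") %>%
  str_replace_all("\\s{2,}", " ") %>%
  str_trim()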

config <- config::get()

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, user=config$user, password=config$password, dbname=config$dbname, host= config$host)

#Query to create new SQL table using uniqueid as the primary key
sqltable <- "CREATE TABLE listings (
 title TEXT,
 company TEXT,
 location TEXT,
 summary TEXT,
 link TEXT,
 description TEXT,
 uniqueid TEXT PRIMARY KEY
)"

#Creates new SQL table "listings" using the query above if it does not exist yet 
if (!dbExistsTable(con, "listings")){
  dbExecute(con, sqltable)
}

#Instead of dropping the table and creating a new one, we append records. Setting uniqueid as the primary key prevents us from appending duplicate job postings to the SQL database, so we must append only the subset of job listings that are not duplicates. To do this, we create a temporary table to identify and exclude duplicates, then import the new listings into SQL.
dbWriteTable(con, "dup_listings", listings, row.names = FALSE)
dupcheck <- "
SELECT
  uniqueid,
  MAX(title) as title,
  MAX(company) as company,
  MAX(location) as location,
  MAX(summary) as summary,
  MAX(link) as link,
  MAX(description) as description
FROM dup_listings
WHERE uniqueid NOT IN (SELECT uniqueid FROM listings)
GROUP BY uniqueid;
"
dupcheck_results <- dbGetQuery(con, dupcheck)

if (nrow(dupcheck_results) > 0) {
  dbWriteTable(con, "listings", dupcheck_results, append = TRUE, row.names = FALSE)
}
dbRemoveTable(con, "dup_listings")
dbDisconnect(con)

Import Job Listings from the PostgreSQL Database

This step imports all the unique data science job listings that we have saved in a SQL table.

config <- config::get()

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, user=config$user, password=config$password, dbname=config$dbname, host= config$host)
#Query to get all job listings from SQL database.
sql <- " 
    SELECT uniqueid, title, company, location, summary, link, description
    FROM listings
"
#Show query results
results <- dbGetQuery(con, sql)
dbDisconnect(con)
## [1] TRUE
datatable(results)

Using Keywords to Identify Valuable Skills

Word Cloud

To see which keywords show up most frequently in job descriptions, we create a word cloud. From it, we see that some important skills for a data scientist are business, machine learning, teamwork, statistics, analytics, research, and modeling. Languages highlighted in the word cloud include Python, SQL, and Java. The word cloud only provides a high-level summary of skills; for a more in-depth analysis, we will search for specific keywords. Note that R is filtered out of this word cloud. The word cloud function is from sthda.com.

res <- rquery.wordcloud(paste(results$description), type="text", 
        lang="english", excludeWords = c("data", "experience","will","work"),
        textStemming = FALSE,  colorPalette="Paired",
        max.words=500)

Keyword Search for Languages

Below we search for the occurrence rates of specific languages. The majority of job postings mention R, Python, and/or SQL. Java, C++, and Scala are listed less often, but also appear to be useful skills for many job listings.

#R
r_listings_df <- results %>% 
  filter(grepl(" R | R,", description, ignore.case=T) | grepl(" R | R,", summary, ignore.case=T))

#Python
python_listings_df <- results %>% 
  filter(grepl(" python| python,", description, ignore.case=T) | grepl(" python| python,", summary, ignore.case=T))

#Java
java_listings_df <- results %>% 
  filter(grepl(" java| java,", description, ignore.case=T) | grepl(" java| java,", summary, ignore.case=T))

#SQL
SQL_listings_df <- results %>% 
  filter(grepl("SQL| SQL,", description, ignore.case=T) | grepl("SQL| SQL,", summary, ignore.case=T))

#Scala
scala_listings_df <- results %>% 
  filter(grepl(" Scala| Scala,", description, ignore.case=T) | grepl(" Scala| Scala,", summary, ignore.case=T))

#C++
Cplusplus_listings_df <- results %>% 
  filter(grepl("C\\+\\+| C\\+\\+,", description, ignore.case=T) | grepl("C\\+\\+| C\\+\\+,", summary, ignore.case=T))


language_pct <- tibble(
  language = c("R", "Python", "Java", "SQL", "Scala","C++"), 
  mention_pct = c(round(nrow(r_listings_df)/nrow(results),digits=2), 
                  round(nrow(python_listings_df)/nrow(results),digits=2), 
                  round(nrow(java_listings_df)/nrow(results),digits=2), 
                  round(nrow(SQL_listings_df)/nrow(results),digits=2), 
                  round(nrow(scala_listings_df)/nrow(results),digits=2),
                  round(nrow(Cplusplus_listings_df)/nrow(results),digits=2)))
language_pct %>% ggplot(aes(x=language, y=mention_pct)) + 
  geom_bar(stat="identity", position=position_dodge(), fill="steelblue") + 
  labs(x = "Language", y = "Percent Mentioned")

Keyword Search for Other Tools

We also look at the percentage of job postings that mention tools such as Tableau, Spark, Hadoop, and SAS. These results show the prevalence of big data, given how often Spark and Hadoop are mentioned.

#Tableau
tableau_listings_df <- results %>% 
  filter(grepl("Tableau| Tableau,", description, ignore.case=T) | grepl("Tableau| Tableau,", summary, ignore.case=T))

#Spark
spark_listings_df <- results %>% 
  filter(grepl("spark| spark,", description, ignore.case=T) | grepl("spark| spark,", summary, ignore.case=T))

#Hadoop
hadoop_listings_df <- results %>% 
  filter(grepl("hadoop| hadoop,", description, ignore.case=T) | grepl("hadoop| hadoop,", summary, ignore.case=T))

#SAS (word-boundary match so we do not catch words that merely contain "sas")
SAS_listings_df <- results %>% 
  filter(grepl("\\bsas\\b", description, ignore.case=T) | grepl("\\bsas\\b", summary, ignore.case=T))

#AWS
AWS_listings_df <- results %>% 
  filter(grepl(" AWS| Amazon Web Services", description, ignore.case=T) | grepl(" AWS| Amazon Web Services", summary, ignore.case=T))

#Excel
Excel_listings_df <- results %>% 
  filter(grepl(" excel | excel,", description, ignore.case=T) | grepl(" excel | excel,", summary, ignore.case=T))

tools_pct <- tibble(
  tool = c("Tableau", "Spark", "Hadoop", "SAS","Amazon Web Services","Excel"), 
  
  mention_pct = c(round(nrow(tableau_listings_df)/nrow(results),digits=2), 
                  round(nrow(spark_listings_df)/nrow(results),digits=2), 
                  round(nrow(hadoop_listings_df)/nrow(results),digits=2), 
                  round(nrow(SAS_listings_df)/nrow(results),digits=2),
                  round(nrow(AWS_listings_df)/nrow(results),digits=2),
                  round(nrow(Excel_listings_df)/nrow(results),digits=2)))
tools_pct %>% ggplot(aes(x=tool, y=mention_pct)) + 
  geom_bar(stat="identity", position=position_dodge(), fill="maroon") + 
  labs(x = "Tool", y = "Percent Mentioned")

Keyword Search for Technical Topics

As we saw in the word cloud above, Machine Learning is a common term mentioned in Indeed Data Science job listings. Which other technical topics are mentioned in job listings? Below we see that modeling, statistics, and visualization are key topics for data scientists. Big data and cloud technologies also have their place in the field of data science.

#Big Data
bigdata_listings_df <- results %>% 
  filter(grepl("big data| big data,", description, ignore.case=T) | grepl("big data| big data,", summary, ignore.case=T))

#Cloud
cloud_listings_df <- results %>% 
  filter(grepl("cloud| cloud,", description, ignore.case=T) | grepl("cloud| cloud,", summary, ignore.case=T))

#Machine Learning
ml_listings_df <- results %>% 
  filter(grepl("machine learning", description, ignore.case=T) | grepl("machine learning", summary, ignore.case=T))

#Regression Analysis
regr_listings_df <- results %>% 
  filter(grepl("regression| regression", description, ignore.case=T) | grepl("regression| regression", summary, ignore.case=T))

#Artificial Intelligence
ai_listings_df <- results %>% 
  filter(grepl(" AI |Artificial Intelligence", description, ignore.case=T) | grepl(" AI |Artificial Intelligence", summary, ignore.case=T))

#Statistics
stat_listings_df <- results %>% 
  filter(grepl("statistic", description, ignore.case=T) | grepl("statistic", summary, ignore.case=T))

#Modeling
model_listings_df <- results %>% 
  filter(grepl("model", description, ignore.case=T) | grepl("model", summary, ignore.case=T))

#Visualization
viz_listings_df <- results %>% 
  filter(grepl("visual", description, ignore.case=T) | grepl("visual", summary, ignore.case=T))

tools_pct <- tibble(
  tool = c("Big Data", "Cloud", "Machine Learning", "Regression","AI","Statistics","Modeling","Visualization"), 
  mention_pct = c(round(nrow(bigdata_listings_df)/nrow(results),digits=2), 
                  round(nrow(cloud_listings_df)/nrow(results),digits=2), 
                  round(nrow(ml_listings_df)/nrow(results),digits=2), 
                  round(nrow(regr_listings_df)/nrow(results),digits=2),
                  round(nrow(ai_listings_df)/nrow(results),digits=2),
                  round(nrow(stat_listings_df)/nrow(results),digits=2),
                  round(nrow(model_listings_df)/nrow(results),digits=2),
                  round(nrow(viz_listings_df)/nrow(results),digits=2)))
tools_pct %>% ggplot(aes(x=tool, y=mention_pct)) + 
  geom_bar(stat="identity", position=position_dodge(), fill="purple") + 
  labs(x = "Technical Topic", y = "Percent Mentioned")

Keyword Search for Soft Skills

Which soft skills are important for data scientists? We search for the occurrence of several terms, such as communication, leadership, and problem solving. The keyword search indicates that communication, leadership, and presentation skills are required for many postings. Creativity and problem solving are also mentioned in a significant portion of postings. On the other hand, the term “teamwork” itself seems to be mentioned far less often!

#Communication
communication_listings_df <- results %>% 
  filter(grepl("communication | communicate", description, ignore.case=T) | grepl("communication | communicate", summary, ignore.case=T))

#Leadership
leadership_listings_df <- results %>% 
  filter(grepl("leadership | leader", description, ignore.case=T) | grepl("leadership | leader", summary, ignore.case=T))

#Problem Solving
problem_listings_df <- results %>% 
  filter(grepl("problem solv", description, ignore.case=T) | grepl("problem solv", summary, ignore.case=T))

#Presentation
presentation_listings_df <- results %>% 
  filter(grepl("present | presentation | public speak", description, ignore.case=T) | grepl("present | presentation | public speak", summary, ignore.case=T))

#Teamwork
teamwork_listings_df <- results %>% 
  filter(grepl("teamwork", description, ignore.case=T) | grepl("teamwork", summary, ignore.case=T))

#Adaptability
adapt_listings_df <- results %>% 
  filter(grepl("adapt", description, ignore.case=T) | grepl("adapt", summary, ignore.case=T))

#Creativity
creative_listings_df <- results %>% 
  filter(grepl("creative | creativity", description, ignore.case=T) | grepl("creative | creativity", summary, ignore.case=T))

tools_pct <- tibble(
  tool = c("Communication", "Leadership", "Problem Solving", "Presentation","Teamwork","Adaptability","Creativity"), 
  mention_pct = c(round(nrow(communication_listings_df)/nrow(results),digits=2), 
                  round(nrow(leadership_listings_df)/nrow(results),digits=2), 
                  round(nrow(problem_listings_df)/nrow(results),digits=2), 
                  round(nrow(presentation_listings_df)/nrow(results),digits=2),
                  round(nrow(teamwork_listings_df)/nrow(results),digits=2),
                  round(nrow(adapt_listings_df)/nrow(results),digits=2),
                  round(nrow(creative_listings_df)/nrow(results),digits=2)))
tools_pct %>% ggplot(aes(x=tool, y=mention_pct)) + 
  geom_bar(stat="identity", position=position_dodge(), fill="green") + 
  labs(x = "Soft Skill", y = "Percent Mentioned")

Number of Jobs by Location

We thought it might be interesting to see the frequency of jobs by location. We wanted to see the frequency by both city and state, so we created two new data frames to look at this:

listings.city <- listings[,1:3]
listings.state <- listings.city %>% separate(location, c("City","State"), sep = ",")

Now, we can create frequency tables for both:

locations.city <- as.data.frame(table(listings.city$location))
locations.state <- as.data.frame(table(listings.state$State))
head(locations.city[order(locations.city$Freq,decreasing = TRUE),],10)
                     Var1 Freq
9            New York, NY  212
12      San Francisco, CA   82
30             Boston, MA   30
15         Washington, DC   25
23 Santa Clara Valley, CA   16
40        Los Angeles, CA   14
17          Palo Alto, CA   13
35          Sunnyvale, CA   13
18            Seattle, WA   12
16         Menlo Park, CA   11
head(locations.state[order(locations.state$Freq,decreasing = TRUE),],10)
   Var1 Freq
25   NY  241
4    CA  215
16   MA   54
35   VA   34
33   TX   33
37   WA   30
7    DC   25
29   PA   22
21   NC   17
9    FL   16

We see that the states with the highest number of data science jobs are New York and California.

Looking at data science jobs by city, it is evident that New York/Manhattan and San Francisco are hubs for data scientists.
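
For a quick visual, the top cities can also be plotted from the frequency table above (a sketch using ggplot2, which is already loaded for the keyword charts):

#bar chart of the ten cities with the most data science listings (a sketch)
locations.city %>%
  arrange(desc(Freq)) %>%
  head(10) %>%
  ggplot(aes(x = reorder(Var1, Freq), y = Freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(x = "City", y = "Number of Listings")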

Conclusion

Our examination of the data science job market involved gathering and cleaning three datasets of job listings. After storing, importing, tidying, and analyzing the data, we learned that:

  1. Starting (lower-range) salaries tend to be higher for data jobs than for non-data jobs.
  2. The New York, California, and DC areas are popular bases for data science opportunities.
  3. Coveted data science skills include Python, R, SQL, Spark and Hadoop, among others.
  4. High-ranking data science topics include Machine Learning, Statistics, and Modeling.
  5. Important soft skills for data scientists are communication and leadership abilities.

Future analysis might examine how the popularity of these skills and tools trends over time. Overall, the current landscape appears bright for data scientists.