DATA 607 - Final Project

Background

In this course, we’ve analyzed the most attractive skills among Data Scientists for prospective employers. I also delved into LinkedIn’s recommendation engine and tried to reverse engineer how it determined which jobs to show me based on skills I claimed to have and the jobs I had applied to previously.

These projects prompted me to seek out opportunities for Data Scientists among FinTech, Tech, and Gaming firms that are, purportedly, increasingly targeting expansion in New York. Ideally, I’d target the highest-growth, most dynamic firms, but what criteria define them? To find these firms, I’ll take the Gaming, Technology, and Financial Technology firms in the US that achieved a compound annual growth rate (CAGR) above 5% and compare them against the companies posting the most NY-based Data Scientist roles on Indeed. By doing so, I hope to answer the following questions:

Are the highest-growth firms hiring Data Scientists in the New York area?
Are these jobs found on Indeed?

To do this, I’ll take the following steps:

Acquire CAGR company information and determine highest-growth companies.
Scrape a list of companies with highest number of NY-based Data Scientist roles on Indeed.
Organize the data, derive insights, and generate a few helpful visualizations for analysis and conclusions.

Data

I was fortunate to only need two data sources for my analysis:

Maxine Kelly, Financial Times, April 13 2021. FT ranking: The Americas’ fastest-growing companies 2021, https://www.ft.com/americas-fastest-growing-companies-2021

Indeed.com, May 15 search for full-time Data Scientist roles in the New York City area shared in the last thirty days. I also pull in San Francisco data for comparison purposes.

library(rvest)

## Warning: package 'rvest' was built under R version 4.0.5

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.0.5

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.1     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.0.4

## Warning: package 'tibble' was built under R version 4.0.5

## Warning: package 'tidyr' was built under R version 4.0.5

## Warning: package 'readr' was built under R version 4.0.5

## Warning: package 'dplyr' was built under R version 4.0.5

## Warning: package 'forcats' was built under R version 4.0.5

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter()         masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag()            masks stats::lag()

library(rebus)

## Warning: package 'rebus' was built under R version 4.0.5

## 
## Attaching package: 'rebus'

## The following object is masked from 'package:stringr':
## 
##     regex

## The following object is masked from 'package:ggplot2':
## 
##     alpha

library(xml2)

# read in csv data from the FT
ft_data <- read_csv("https://raw.githubusercontent.com/evanmclaughlin/ECM607/master/FT_data.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Rank = col_double(),
##   Name = col_character(),
##   Country = col_character(),
##   category = col_character(),
##   absolute_growth_rate_pct_inc = col_number(),
##   CAGR_pct_inc = col_double(),
##   `2018_rev` = col_number(),
##   `2015_rev` = col_number(),
##   employees_2018 = col_double(),
##   founded = col_double()
## )

# filter for US data, high CAGR, and data from Tech, Fintech, and Gaming

US_based <- filter(ft_data, Country == "US")
high_cagr <- filter(US_based, CAGR_pct_inc > 5.0)

# I'm establishing a flexible category data frame in case I want to expand or contract the search later

sectors <- data.frame(word = c("Games industry", "Technology", "Fintech"))

corp_df <- high_cagr %>%
  filter(category %in% sectors$word)

corp_df

## # A tibble: 111 x 10
##     Rank Name      Country category   absolute_growth_r~ CAGR_pct_inc `2018_rev`
##    <dbl> <chr>     <chr>   <chr>                   <dbl>        <dbl>      <dbl>
##  1     1 Niantic*  US      Games ind~            180307.        1117.     790.  
##  2     2 UiPath    US      Technology             37464.         622.     155   
##  3     6 Perpay    US      Fintech                18480.         471.      22.9 
##  4     8 Applied ~ US      Fintech                11990.         394.      37.4 
##  5    10 Urgent.ly US      Technology             11612.         389.      30.2 
##  6    11 Vlocity   US      Technology             10245.         369.      96.5 
##  7    15 Printify* US      Technology              7639.         326.      10.1 
##  8    17 Fund Tha~ US      Fintech                 6018.         294        6.41
##  9    19 Beeswax.~ US      Technology              5262          277.      12.3 
## 10    24 Asset Pa~ US      Technology              3692.         236        5.09
## # ... with 101 more rows, and 3 more variables: 2015_rev <dbl>,
## #   employees_2018 <dbl>, founded <dbl>

This is a perfectly manageable dataframe now and helps us zero in on what we’re really looking for in our analysis.
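For reference, the CAGR column can be reproduced from the absolute growth column. The sketch below assumes a three-year window (2015 to 2018, inferred from the revenue column names); cagr_pct is a hypothetical helper name, not part of the FT data or any package.

```r
# Hypothetical helper: convert total growth over a period into a compound annual rate.
# Assumes growth is expressed in percent and the window is 3 years (2015 -> 2018).
cagr_pct <- function(abs_growth_pct, years = 3) {
  growth_multiple <- 1 + abs_growth_pct / 100
  (growth_multiple^(1 / years) - 1) * 100
}

# Absolute growth figures for the first two rows printed above (Niantic, UiPath)
cagr_pct(c(180307, 37464))
```

Both values land within rounding of the CAGR_pct_inc figures shown for those rows (roughly 1117 and 622), which suggests the 5% cutoff applies to annualized rather than total revenue growth.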

Data Scrape

Next, we’re going to pull Data Scientist openings from Indeed with read_html and extract the company name from each posting. Indeed caps the number of results at 50 per page, so I repeat this step for several result pages (three at first, five in the end). Merging these datasets should give me a good population from which to draw some insights.

#link to New York Data Scientist job query

url1 <- "https://www.indeed.com/jobs?as_and=Data+Scientist&as_phr=&as_any=&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=25&l=New+York%2C+NY&fromage=any&limit=50&sort=date&psf=advsrch&from=advancedsearch"

# read html
page1 <- xml2::read_html(url1)

#get the company name from the search
company1 <- page1 %>% 
    rvest::html_nodes("span")  %>% 
    rvest::html_nodes(xpath = '//*[@class="company"]')  %>% 
    rvest::html_text() %>%
    stringi::stri_trim_both()
tib1 <- tibble(company1, .name_repair = ~ c("company"))

url2 <- "https://www.indeed.com/jobs?q=Data+Scientist&l=New+York%2C+NY&sort=date&limit=50&radius=25&start=50"

# read html
page2 <- xml2::read_html(url2)

#get the company name from the search
company2 <- page2 %>% 
    rvest::html_nodes("span")  %>% 
    rvest::html_nodes(xpath = '//*[@class="company"]')  %>% 
    rvest::html_text() %>%
    stringi::stri_trim_both()
tib2 <- tibble(company2, .name_repair = ~ c("company"))

url3 <- "https://www.indeed.com/jobs?q=Data+Scientist&l=New+York%2C+NY&sort=date&limit=50&radius=25&start=100"

# read html
page3 <- xml2::read_html(url3)

#get the company name from the search
company3 <- page3 %>% 
    rvest::html_nodes("span")  %>% 
    rvest::html_nodes(xpath = '//*[@class="company"]')  %>% 
    rvest::html_text() %>%
    stringi::stri_trim_both()
tib3 <- tibble(company3, .name_repair = ~ c("company"))


# adding additional Indeed searches because results were not very helpful

url4 <- "https://www.indeed.com/jobs?q=Data+Scientist&l=New+York%2C+NY&sort=date&limit=50&radius=25&start=150"

# read html
page4 <- xml2::read_html(url4)

#get the company name from the search
company4 <- page4 %>% 
    rvest::html_nodes("span")  %>% 
    rvest::html_nodes(xpath = '//*[@class="company"]')  %>% 
    rvest::html_text() %>%
    stringi::stri_trim_both()
tib4 <- tibble(company4, .name_repair = ~ c("company"))

url5 <- "https://www.indeed.com/jobs?q=Data+Scientist&l=New+York%2C+NY&sort=date&limit=50&radius=25&start=200"

# read html
page5 <- xml2::read_html(url5)

#get the company name from the search
company5 <- page5 %>% 
    rvest::html_nodes("span")  %>% 
    rvest::html_nodes(xpath = '//*[@class="company"]')  %>% 
    rvest::html_text() %>%
    stringi::stri_trim_both()
tib5 <- tibble(company5, .name_repair = ~ c("company"))


fullcomp <- rbind(tib1, tib2, tib3, tib4, tib5)
fullcomp

## # A tibble: 150 x 1
##    company                
##    <chr>                  
##  1 Barclays               
##  2 Barclays               
##  3 CVS Health             
##  4 DISH                   
##  5 Patch Biosciences      
##  6 Amazon.com Services LLC
##  7 Slack                  
##  8 H&M Group              
##  9 New York University    
## 10 Uber                   
## # ... with 140 more rows
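
The five near-identical scrape steps above could be collapsed into two small helpers. This is only a sketch: indeed_urls and extract_companies are hypothetical names, the URL template is copied from the paginated queries above (the first query above used the advanced-search form instead), and the class="company" selector reflects Indeed’s markup at the time of the scrape, which may have changed since. The example parses an inline, made-up HTML snippet rather than hitting the site.

```r
library(xml2)

# Hypothetical helper: build the paginated search URLs (start = 0, 50, 100, ...).
# `location` is passed already URL-encoded, e.g. "New+York%2C+NY".
indeed_urls <- function(location, radius = 25, pages = 5, per_page = 50) {
  starts <- (seq_len(pages) - 1) * per_page
  sprintf(
    "https://www.indeed.com/jobs?q=Data+Scientist&l=%s&sort=date&limit=%d&radius=%d&start=%d",
    location, per_page, radius, starts
  )
}

# Hypothetical helper: pull trimmed company names out of one parsed results page.
extract_companies <- function(page) {
  nodes <- xml2::xml_find_all(page, '//*[@class="company"]')
  trimws(xml2::xml_text(nodes))
}

# Offline demonstration on a made-up snippet that mimics the assumed markup
sample_page <- xml2::read_html(
  '<div><span class="company"> Barclays </span><span class="company">Slack</span><span class="location">New York, NY</span></div>'
)
extract_companies(sample_page)

# Live use (not run here):
# pages <- lapply(indeed_urls("New+York%2C+NY"), xml2::read_html)
# fullcomp <- tibble::tibble(company = unlist(lapply(pages, extract_companies)))
```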

This gives us a good list of companies that have (sometimes multiple) Data Scientist openings in New York City. Now we can run this list against the dataset with high CAGR companies and see what matches.

# keep postings whose company name exactly matches a high-CAGR FT company name
ds_cagr <- fullcomp %>%
  filter(company %in% corp_df$Name)

ds_cagr

## # A tibble: 4 x 1
##   company  
##   <chr>    
## 1 Yext     
## 2 BairesDev
## 3 BairesDev
## 4 BairesDev
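
One caveat with the match above: %in% compares names exactly, and the two sources format names differently. The FT table marks some names with a trailing asterisk (Niantic*, Printify*), while Indeed listings often carry legal suffixes (Amazon.com Services LLC appears in the scrape). A minimal sketch of normalizing both sides before matching follows; normalize_name is a hypothetical helper, the suffix list is an assumption, and the example vectors are made up for illustration rather than taken from the scrape.

```r
# Hypothetical helper: reduce a company name to a comparable form.
normalize_name <- function(x) {
  x <- tolower(x)
  x <- gsub("*", "", x, fixed = TRUE)                  # drop the FT footnote marker
  x <- gsub(",?\\s+(inc|llc|corp|ltd)\\.?$", "", x)    # drop a trailing legal suffix (assumed list)
  trimws(x)
}

ft_names     <- c("Niantic*", "Yext", "BairesDev")     # illustrative
indeed_names <- c("Niantic, Inc.", "Yext", "Barclays") # illustrative, not scraped

# Exact matching misses the first name because of the asterisk and suffix
indeed_names %in% ft_names

# Matching on normalized names keeps the original Indeed spelling in the result
indeed_names[normalize_name(indeed_names) %in% normalize_name(ft_names)]
```

Whether this would surface more matches in the real data is untested here.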

A first attempt at this, using only three pages of scraped Indeed data, yielded only one company, Yext. I went back and added an additional 100 Indeed posts, which hardly improved things: only two high-growth companies, Yext and BairesDev, are hiring Data Scientists in the New York area. Could it be that New York just isn’t the technology incubator that we wish it was? Is San Francisco still king? It’s worth another look, using roughly the same methodology, to see if there are any noticeable differences.

#link to San Francisco Data Scientist job query

sfurl1 <- "https://www.indeed.com/jobs?as_and=Data+Scientist&as_phr=&as_any=&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=50&l=san+francisco%2C+ca&fromage=any&limit=50&sort=date&psf=advsrch&from=advancedsearch"

# read html
sfpage1 <- xml2::read_html(sfurl1)

#get the company name from the search
sfcompany1 <- sfpage1 %>% 
    rvest::html_nodes("span")  %>% 
    rvest::html_nodes(xpath = '//*[@class="company"]')  %>% 
    rvest::html_text() %>%
    stringi::stri_trim_both()
sftib1 <- tibble(sfcompany1, .name_repair = ~ c("company"))

sfurl2 <- "https://www.indeed.com/jobs?q=Data+Scientist&l=san+francisco%2C+ca&radius=50&sort=date&limit=50&start=50"

# read html
sfpage2 <- xml2::read_html(sfurl2)

#get the company name from the search
sfcompany2 <- sfpage2 %>% 
    rvest::html_nodes("span")  %>% 
    rvest::html_nodes(xpath = '//*[@class="company"]')  %>% 
    rvest::html_text() %>%
    stringi::stri_trim_both()
sftib2 <- tibble(sfcompany2, .name_repair = ~ c("company"))

sfurl3 <- "https://www.indeed.com/jobs?q=Data+Scientist&l=san+francisco%2C+ca&radius=50&sort=date&limit=50&start=100"

# read html
sfpage3 <- xml2::read_html(sfurl3)

#get the company name from the search
sfcompany3 <- sfpage3 %>% 
    rvest::html_nodes("span")  %>% 
    rvest::html_nodes(xpath = '//*[@class="company"]')  %>% 
    rvest::html_text() %>%
    stringi::stri_trim_both()
sftib3 <- tibble(sfcompany3, .name_repair = ~ c("company"))


# adding additional Indeed searches because results were not very helpful

sfurl4 <- "https://www.indeed.com/jobs?q=Data+Scientist&l=san+francisco%2C+ca&radius=50&sort=date&limit=50&start=150"

# read html
sfpage4 <- xml2::read_html(sfurl4)

#get the company name from the search
sfcompany4 <- sfpage4 %>% 
    rvest::html_nodes("span")  %>% 
    rvest::html_nodes(xpath = '//*[@class="company"]')  %>% 
    rvest::html_text() %>%
    stringi::stri_trim_both()
sftib4 <- tibble(sfcompany4, .name_repair = ~ c("company"))

sfurl5 <- "https://www.indeed.com/jobs?q=Data+Scientist&l=san+francisco%2C+ca&radius=50&sort=date&limit=50&start=200"

# read html
sfpage5 <- xml2::read_html(sfurl5)

#get the company name from the search
sfcompany5 <- sfpage5 %>% 
    rvest::html_nodes("span")  %>% 
    rvest::html_nodes(xpath = '//*[@class="company"]')  %>% 
    rvest::html_text() %>%
    stringi::stri_trim_both()
sftib5 <- tibble(sfcompany5, .name_repair = ~ c("company"))


sffullcomp <- rbind(sftib1, sftib2, sftib3, sftib4, sftib5)
sffullcomp

## # A tibble: 150 x 1
##    company                  
##    <chr>                    
##  1 Walmart                  
##  2 Ericsson                 
##  3 Google                   
##  4 JPMorgan Chase Bank, N.A.
##  5 Intel                    
##  6 Nuro                     
##  7 PayPal                   
##  8 Tenstorrent              
##  9 VIZIO, Inc.              
## 10 Fiserv, Inc.             
## # ... with 140 more rows

Now let’s run this against the high-CAGR dataset to see just who the CAGR king is.

# keep postings whose company name exactly matches a high-CAGR FT company name
sfds_cagr <- sffullcomp %>%
  filter(company %in% corp_df$Name)

sfds_cagr

## # A tibble: 0 x 1
## # ... with 1 variable: company <chr>

This yielded even FEWER results: none at all. So what’s going on? It could be that the really high-growth companies are startups, and most of these startups are small, meaning they likely don’t post jobs on large platforms like Indeed. If so, Indeed might not be the best avenue for a Data Scientist looking to join a high-growth company.

mean(corp_df$employees_2018)

## [1] 493.7207

median(corp_df$employees_2018)

## [1] 81

The mean headcount across these organizations is just under 500, but the median is only 81. A few large firms pull the mean up, so the median is the better summary of a typical firm here.
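
A toy illustration of why the two summaries diverge. The headcounts below are made up to resemble the pattern above and are not drawn from the FT data.

```r
# Made-up headcounts: four small firms and one large one
headcount <- c(20, 45, 81, 120, 2200)

mean(headcount)    # pulled far above most of the values by the single large firm
median(headcount)  # the middle value, unaffected by how large the biggest firm is
```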

Another possibility is that Data Scientists aren’t really in demand within early-stage startups that don’t actually have much, er, data. It may be that these tech firms are more concerned with hiring developers and financial engineers to actually develop core product functionality. But how young are these companies?

mean(corp_df$founded)

## [1] 2008.847

median(corp_df$founded)

## [1] 2010

The numbers above don’t indicate a particular skew toward very young companies among those with an elevated CAGR, but there does seem to be some relationship between a company’s youth and its CAGR. This makes some sense intuitively, as older, ostensibly larger companies don’t generate enormous growth as easily due to their size.

# restrict the regression to companies founded after 1990
b <- 1990
high_cagr_min <- dplyr::filter(high_cagr, founded > b)

cagr_found_lm <- lm(CAGR_pct_inc ~ founded, 
                data = high_cagr_min)

library(visreg)

## Warning: package 'visreg' was built under R version 4.0.5

visreg(cagr_found_lm, "founded", gg = TRUE)
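
The plot shows the fitted line, but the size of the relationship is easier to judge from the fitted slope. A minimal sketch of reading that slope from an lm fit, using made-up numbers (toy is a hypothetical data frame, not the FT data):

```r
toy <- data.frame(
  founded      = c(1995, 2000, 2005, 2010, 2015),
  CAGR_pct_inc = c(12, 18, 33, 38, 52)
)
toy_lm <- lm(CAGR_pct_inc ~ founded, data = toy)

# Slope: estimated change in CAGR (percentage points) per one-year-later founding date
coef(toy_lm)[["founded"]]

# Full coefficient table, including standard errors and p-values
summary(toy_lm)$coefficients
```

Applied to the model above, coef(cagr_found_lm)[["founded"]] would give the corresponding estimate for the FT data.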

The good news for aspiring Data Scientists, though, is that there are many, many openings in general, including multiple openings within individual firms.

First, let’s look at which firms are most actively seeking Data Scientists in New York.

library(RColorBrewer)

## Warning: package 'RColorBrewer' was built under R version 4.0.3

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:rebus':
## 
##     alpha

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

nycomp_count <- fullcomp %>%
  count(company) %>%
  arrange(desc(n))

a <- 1
nycomp_count_min <- dplyr::filter(nycomp_count, n > a)

# the scale call belongs outside labs(); after coord_flip() the counts are still the y aesthetic
ggplot(nycomp_count_min, aes(x = reorder(company, n), y = n, fill = company)) +
  geom_col(width = .5) + coord_flip() +
  scale_y_continuous(breaks = c(2, 3, 4, 5, 6, 7)) +
  theme(legend.position = "none") +
  labs(title = "NY: Number of Data Scientist Openings, Min. 2", x = "", y = "")

Let’s run the same analysis on SF.

sfcomp_count <- sffullcomp %>%
  count(company) %>%
  arrange(desc(n))

sfcomp_count_min <- dplyr::filter(sfcomp_count, n > a)

# the scale call belongs outside labs(); after coord_flip() the counts are still the y aesthetic
ggplot(sfcomp_count_min, aes(x = reorder(company, n), y = n, fill = company)) +
  geom_col(width = .5) + coord_flip() +
  scale_y_continuous(breaks = c(2, 3, 4, 5, 6, 7)) +
  theme(legend.position = "none") +
  labs(title = "SF: Number of Data Scientist Openings, Min. 2", x = "", y = "")

There’s actually a bit of overlap here. It could be interesting to see if some of these are the same roles with flexible work locations, post-COVID. In any case, it appears that there are more Data Scientist jobs posted to Indeed in New York than in SF in general.
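
The overlap can be listed directly instead of being read off the two charts. A minimal sketch with illustrative vectors; in the analysis the inputs would be nycomp_count_min$company and sfcomp_count_min$company, and the names below are placeholders rather than the actual overlap.

```r
# Illustrative vectors only, not the real multi-posting firms
ny_firms <- c("Barclays", "Amazon.com Services LLC", "Uber")
sf_firms <- c("Google", "Uber", "Amazon.com Services LLC")

# Firms present in both lists, returned in the order they appear in ny_firms
intersect(ny_firms, sf_firms)
```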

Conclusion

Let’s go back to the questions we intended to answer:

Are the highest-growth firms hiring Data Scientists in the New York area?
Are these jobs found on Indeed?

While there are certainly lots of jobs for Data Scientists in the New York area, there don’t appear to be many high-CAGR firms looking to hire Data Scientists, at least not on Indeed. It may be that Indeed isn’t the best job search engine for aspiring Data Scientists who want to join such firms. It could also be that some of the high-growth, younger companies don’t have their HR functions built out sufficiently to post their jobs anywhere but their own sites. In any case, there are still plenty of opportunities for Data Scientists in today’s job market.


Evan McLaughlin

5/17/2021
