### Full-time Job Listings in LA

The first two data sets we decided to work with are full-time job listings scraped from job sites. We decided to use LA and NY.

The job description column is what we are interested in. After reading the data into a data frame, we clean it up by removing HTML tags, new-line characters, and leftover HTML entities.

jobDescriptions <- read.csv('https://raw.githubusercontent.com/jglendrange/DATA607/main/fulltimeLA.csv')
jobDescriptions$description <- gsub("<[^>]+>", "", jobDescriptions$description)
jobDescriptions$description <- gsub("\n", " ", jobDescriptions$description)
jobDescriptions$description <- gsub("&amp;", "", jobDescriptions$description)
glimpse(jobDescriptions)
## Rows: 254
## Columns: 5
## $ position    <chr> "Institutional Review Board Analyst I - Office of Researc…
## $ company     <chr> "USC", "The Boston Consulting Group", "USC", "SHIELDS for…
## $ description <chr> "Please Note: This position is located on our Health Scie…
## $ numreview   <chr> "545 reviews", "198 reviews", "545 reviews", "48 reviews"…
## $ location    <chr> "Los Angeles, CA", "Los Angeles, CA", "Los Angeles, CA", …

Then we use the unnest_tokens() function from tidytext to split every job description into individual words and count how often each word appears.

words <- jobDescriptions %>%
  unnest_tokens(word, description) %>%
  count(word, sort=TRUE) 
  

head(words)
##   word    n
## 1  and 7874
## 2   to 3772
## 3  the 3353
## 4   of 2927
## 5   in 2405
## 6    a 2170

After removing the noise from the dataset, I am left with the highest-frequency words that appear in these descriptions. I attempted to categorize each one into one of four buckets. A skill is either hard or soft; if it doesn’t fall into either, for example project management, I bucket it into “experience”. The last category is “tool”, used when the description calls out the exact name of a tool or programming language.
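
The noise-removal step itself isn’t shown in the code above (the categorized list is loaded from a prepared CSV below), but a minimal sketch of how the filler words could be stripped with tidytext’s built-in stop_words lexicon might look like this; the extra_noise terms are only hypothetical examples:

extra_noise <- tibble(word = c("will", "work", "including"))  # hypothetical filler terms

words_clean <- words %>%
  anti_join(stop_words, by = "word") %>%   # drop common English stop words
  anti_join(extra_noise, by = "word")      # drop the extra filler terms as well

head(words_clean)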

After plotting the categories against each other, there are almost as many soft skills as there are hard skills.

words_c <- read.csv('https://raw.githubusercontent.com/jglendrange/DATA607/main/mostPopularSkillsRevised.csv')

words_c %>%
  count(class) %>%
  ggplot(aes(x = reorder(class, -n) , y = n)) +
  geom_bar(stat="identity") +
  labs(
    x = "", y = "",
    title = "Which skill type is the most common?"
  ) +
  coord_flip() 

Here I attempt to gauge how much companies in LA value each skill set by computing a value score: total frequency divided by the number of phrases in the category. We see that even though there are almost as many soft-skill phrases, the hard-skill phrases come up far more often. From this I assume that companies value experience and hard skills in data science positions over soft skills.

words_c %>%
  group_by(class) %>%
  summarise(value = sum(count) / n()) %>%
  ggplot(aes(x = reorder(class, -value) , y = value)) +
  geom_bar(stat="identity") +
  labs(
    x = "", y = "",
    title = "Which skill type do companies value the most?"
  ) +
  coord_flip() 

words_c %>%
  filter(class == 'Tool') %>%
  ggplot(aes(x = reorder(word, count) , y = count)) +
  geom_bar(stat="identity") +
  labs(
    x = "", y = "",
    title = "What tools are the most important?"
  ) +
  coord_flip() 

Teamwork is by far the most frequent mention. That makes sense: being able to work as a team encapsulates most of the soft skills below it, such as communication, collaboration, organization, and leadership.

words_c %>%
  filter(class == 'Soft Skill') %>%
  ggplot(aes(x = reorder(word, count) , y = count)) +
  geom_bar(stat="identity") +
  labs(
    x = "", y = "",
    title = "What Soft Skills are the most important?"
  ) +
  coord_flip() 

The hard skills that are the most common are high level: data, analysis, research, and science. This makes me think most companies want to hire someone who has some experience in a scientific field and understands the scientific method to some degree.

words_c %>%
  filter(class == 'Hard Skill') %>%
  ggplot(aes(x = reorder(word, count) , y = count)) +
  geom_bar(stat="identity") +
  labs(
    x = "", y = "",
    title = "What Soft Skills are the most important?"
  ) +
  coord_flip() 

### Full-time Job Listings in NY

Now we look at full-time positions in NY.

ndf <- read.csv('https://raw.githubusercontent.com/jglendrange/DATA607/main/fulltimeNY.csv')

ds <- c('Staff Data Scientist', 'Machine Learning Data Scientist', 'Junior Data Scientist', 'Data Scientist,Analytics', 'Data Scientist (NYC)', 'Clinical Data Scientist')
sds <- c('Sr. Data Scientist', 'Sr Data Scientist', 'Senior Data Scientist', 'Senior Associate, Data Scientist')
lds <- c('Lead Data Scientist')
pds <- c('Principal Data Scientist')

analyst <- ndf %>% filter(grepl('analyst', position, ignore.case = TRUE)) %>% pull(position)
machine_learning <- ndf %>% filter(grepl('machine learning', position, ignore.case = TRUE)) %>% pull(position)
manager <- ndf %>% filter(grepl('manager', position, ignore.case = TRUE)) %>% pull(position)

ndf <- ndf %>% mutate(emp = case_when(position %in% analyst ~ 'Analyst', position %in% machine_learning ~ 'Machine_learning', position %in% manager ~ 'Manager'))
ndf$emp <- ndf$emp %>% replace_na('Data Scientist')  # everything else is a plain Data Scientist
glimpse(ndf)
## Rows: 986
## Columns: 6
## $ position    <chr> "Data Scientist—Research (Ref # EXEC/RD_DAT_NYC_6073)", "…
## $ company     <chr> "New York State Office of the Attorney General (OAG)", "C…
## $ description <chr> "<span id=\"job_summary\" class=\"summary\"><p><b>Executi…
## $ numreview   <chr> "", "", "217 reviews", "", "", "", "7 reviews", "705 revi…
## $ location    <chr> "New York, NY", "New York, NY", "New York, NY 10032", "Je…
## $ emp         <chr> "Data Scientist", "Data Scientist", "Analyst", "Data Scie…
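
As a side note, the three pattern matches plus the NA fill above could be collapsed into a single case_when() on the position column; a rough, equivalent sketch (not how it was done here) would be:

ndf %>%
  mutate(emp = case_when(
    grepl('analyst', position, ignore.case = TRUE) ~ 'Analyst',
    grepl('machine learning', position, ignore.case = TRUE) ~ 'Machine_learning',
    grepl('manager', position, ignore.case = TRUE) ~ 'Manager',
    TRUE ~ 'Data Scientist'   # everything else defaults to Data Scientist
  ))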

Position and Company

ndf %>% count(emp) %>% top_n(n=20) %>% ggplot(aes(x=emp, y=n)) + geom_col() + coord_flip()
## Selecting by n

Right now there is far more demand for Data Scientists than for Managers and Analysts; however, there is a lot of overlap. One can assume a manager was once a data scientist who got promoted.

ndf %>% filter(emp == 'Data Scientist') %>% count(company) %>% top_n(n=10) %>% ggplot(aes(x=company, y=n)) + geom_col() + coord_flip()
## Selecting by n

NYU Langone Health is hiring more than most tech and financial companies combined.

Word cloud of the descriptions for Data Scientist positions

ddf <- ndf %>% filter(emp == 'Data Scientist')

docs <- Corpus(VectorSource(ddf$description))

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")

# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove custom stop words (the terms below are placeholders; nothing extra needs removing here)
docs <- tm_map(docs, removeWords, c("blabla1", "blabla2")) 
# Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)

set.seed(1234)
wordcloud(words = d$word, freq = d$freq, min.freq = 200,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

As one would expect, data scientists need to have a very good understanding of data in every aspect, from analytics and statistics to management and everything in between. The word cloud also emphasizes the need for teamwork, machine learning, and plenty of experience.

### General Skills Data Set

This data set isn’t as raw as the previous ones. We wanted to compare our attempt at scrubbing job descriptions against some other, more normalized data.

Just looking at the top skills we see a few that match up: analysis, machine learning, statistics, etc.

generalSkills <- read.csv('https://raw.githubusercontent.com/jglendrange/DATA607/main/ds_general_skills_revised.csv')

generalSkills <- generalSkills[1:15,]
print(generalSkills)
##                 Keyword LinkedIn Indeed SimplyHired Monster
## 1      machine learning    5,701  3,439       2,561   2,340
## 2              analysis    5,168  3,500       2,668   3,306
## 3            statistics    4,893  2,992       2,308   2,399
## 4      computer science    4,517  2,739       2,093   1,900
## 5         communication    3,404  2,344       1,791   2,053
## 6           mathematics    2,605  1,961       1,497   1,815
## 7         visualization    1,879  1,413       1,153   1,207
## 8          AI composite    1,568  1,125         811     687
## 9         deep learning    1,310    979         675     606
## 10        NLP composite    1,212    910         660     582
## 11 software development      732    627         481     784
## 12      neural networks      671    485         421     305
## 13     data engineering      514    339         276     200
## 14   project management      476    397         330     348
## 15 software engineering      413    295         250     512
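
As a rough check of how much these keywords overlap with the LA skill list from earlier, the shared terms could be listed with a quick sketch like the one below (assuming words_c is still in memory):

# terms appearing in both the scraped LA skill list and the general skills keywords
intersect(tolower(words_c$word), tolower(generalSkills$Keyword))
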
skills_long <- generalSkills %>%
  pivot_longer(!Keyword, names_to='Job Site')

skills_long$value <- as.numeric(gsub(",", "", skills_long$value))
                                                      
skills_long %>%
  group_by(Keyword) %>%
  summarise(total = sum(value)) %>%
  ggplot(aes(x = reorder(Keyword, total) , y = total)) +
  geom_bar(stat="identity")  + coord_flip()

This view compares the distribution of skills across the different job sites. It’s a pretty even distribution, likely because the same employers are posting on all four sites.

skills_long$pct <- skills_long$value/ave(skills_long$value, skills_long$`Job Site`, FUN=sum)

ggplot(skills_long, aes(x=`Job Site`, y=pct, fill=Keyword))  +
  geom_bar(stat="identity") 
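
For reference, the same per-site share could be computed with a group_by()/mutate() pipeline instead of ave(), which reads a bit more like the rest of the tidyverse code here:

skills_long <- skills_long %>%
  group_by(`Job Site`) %>%
  mutate(pct = value / sum(value)) %>%   # each keyword's share of its site's total
  ungroup()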

### Job Listings

jdf <- read.csv('https://raw.githubusercontent.com/jglendrange/DATA607/main/ds_job_listing_software.csv')
jdf <- jdf[1:37,]
jdf <- jdf %>% select(c(Keyword, LinkedIn, Indeed, SimplyHired, Monster)) %>% rename('skill'='Keyword') 
jdf[,2:5] <- lapply(jdf[,2:5],function(x){as.numeric(gsub(",", "", x))})
jdf <- jdf %>% mutate(total = rowSums(across(where(is.numeric))))
glimpse(jdf)
## Rows: 37
## Columns: 6
## $ skill       <chr> "Python", "R", "SQL", "Spark", "Hadoop", "Java", "SAS", "…
## $ LinkedIn    <dbl> 6347, 4553, 3879, 2169, 2142, 1944, 1713, 1216, 1182, 104…
## $ Indeed      <dbl> 3818, 3106, 2628, 1551, 1578, 1377, 1134, 1012, 830, 739,…
## $ SimplyHired <dbl> 2888, 2393, 2056, 1167, 1164, 1059, 910, 780, 637, 589, 5…
## $ Monster     <dbl> 2544, 2365, 1841, 1062, 1200, 1002, 978, 744, 619, 520, 4…
## $ total       <dbl> 15597, 12417, 10404, 5949, 6084, 5382, 4735, 3752, 3268, …

Next we create a category feature, stored in the stuff column, that groups each skill into Framework, Languages, Database, or Other.

## Map skills and then plot
frameworks <- c('TensorFlow', 'Scikit-learn', 'Pandas', 'Numpy', 'Keras', 'PyTorch', 'Caffe')
languages <- c('Python', 'R', 'Java', 'C', 'Scala', 'C++', 'Matlab', 'C# ', 'Javascript', 'SPSS', 'Perl')
database <- c('SQL', 'Hadoop', 'Spark', 'SAS', 'Tableau', 'Hive', 'AWS', 'NoSQL', 'Azure', 'Pig', 'Hbase', 'Cassandra', 'MySQL', 'MongoDB', 'D3')
other <- c('SPSS', 'Excel', 'Linux', 'Docker', 'Git')

jdf <- jdf %>% mutate(stuff = case_when(skill %in% frameworks ~ 'Framework', skill %in% languages ~ 'Languages', skill %in% database ~ 'Database', skill %in% other ~ 'Other'))
glimpse(jdf)
## Rows: 37
## Columns: 7
## $ skill       <chr> "Python", "R", "SQL", "Spark", "Hadoop", "Java", "SAS", "…
## $ LinkedIn    <dbl> 6347, 4553, 3879, 2169, 2142, 1944, 1713, 1216, 1182, 104…
## $ Indeed      <dbl> 3818, 3106, 2628, 1551, 1578, 1377, 1134, 1012, 830, 739,…
## $ SimplyHired <dbl> 2888, 2393, 2056, 1167, 1164, 1059, 910, 780, 637, 589, 5…
## $ Monster     <dbl> 2544, 2365, 1841, 1062, 1200, 1002, 978, 744, 619, 520, 4…
## $ total       <dbl> 15597, 12417, 10404, 5949, 6084, 5382, 4735, 3752, 3268, …
## $ stuff       <chr> "Languages", "Languages", "Database", "Database", "Databa…

How valuable is each skill category on the different hiring websites?

jdf %>% select(!c('skill', 'total')) %>% pivot_longer(!c(stuff), names_to='Hiring', values_to='count') %>% ggplot(aes(x=stuff, y=count, fill=Hiring)) + geom_col()

Having experience in a specific language and knowledge of databases is more important than knowing a specific framework. This makes sense, since frameworks get updated frequently as things change.

What languages are the most in demand?

jdf %>% select(!c('total' ,'stuff')) %>%  pivot_longer(!c(skill), names_to='Hiring', values_to='value') %>% filter(skill %in% languages) %>% ggplot(aes(x=skill, y=value, fill=Hiring, group=Hiring)) + geom_bar(stat='identity', position='dodge')

It’s very clear that knowledge of Python and R is very important if one wants a successful career in data science.

jdf %>% select(!c('total' ,'stuff')) %>%  pivot_longer(!c(skill), names_to='Hiring', values_to='value') %>% filter(skill %in% languages) %>% ggplot(aes(x=skill, y=value)) + geom_col() + facet_wrap(~Hiring) + theme(axis.text.x = element_text(angle = 30))