My Approach

I will use Indeed.com to search for Data Science jobs in the Boston market. The steps are:

  • Use the rvest library for the screen scraping tasks
  • Write functions to obtain links to job descriptions and to extract job descriptions
  • Use tidyverse to summarize data
  • Present results in table and word cloud formats

Extract Data From Indeed.com
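
The two helper functions named in the approach might look roughly like the sketch below. It is only an illustration: the function names `get_job_links` and `get_job_desc`, the CSS selectors, and the inline sample HTML are all assumptions made for this example, and Indeed's real page markup will differ.

```r
library(rvest)

# Hypothetical stand-in for a search results page (not Indeed's real markup).
search_page <- minimal_html('
  <a class="job-link" href="/job/1">Data Scientist</a>
  <a class="job-link" href="/job/2">ML Engineer</a>
')

# Hypothetical stand-in for a single job posting page.
job_page <- minimal_html('<div class="job-desc">Python and SQL required.</div>')

# Hypothetical helper: collect the href of every job link on a results page.
get_job_links <- function(page, selector = "a.job-link") {
  page %>%
    html_elements(selector) %>%
    html_attr("href")
}

# Hypothetical helper: extract the description text from one job page.
get_job_desc <- function(page, selector = "div.job-desc") {
  page %>%
    html_elements(selector) %>%
    html_text(trim = TRUE) %>%
    paste(collapse = " ")
}

links <- get_job_links(search_page)
desc  <- get_job_desc(job_page)
```

For a live page, `read_html()` with the page URL would take the place of `minimal_html()`, and the selectors would need to be adjusted to the markup actually served.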

Summarize Data

I use dplyr to summarize the job skills data. For each job description, I check whether a particular skill is mentioned and assign a value of 1 if it is and 0 if not. Next, I group by skill and calculate totals and percentages. Finally, the tibble is sorted in descending order of the totals.

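library(tidyverse)

df <- as_tibble(indeedDesc) %>%
  rename(JobDesc = value) %>%
  rowid_to_column("id") %>%
  # n = total number of job descriptions, used later for the percentage
  mutate(n = n()) %>%
  mutate(job = str_c("job",id,sep='')) %>%
  # flatten line breaks and strip the literal "<U+00B7>" bullet residue
  mutate(JobDesc = str_replace_all(JobDesc,'\n',' ')) %>%
  mutate(JobDesc = str_replace_all(JobDesc,fixed('<U+00B7>'),'')) %>%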
  mutate(Python = if_else(str_detect(JobDesc,fixed('python',ignore_case = TRUE)),1,0)) %>%
  mutate(Excel = if_else(str_detect(JobDesc,fixed('Excel',ignore_case = TRUE)),1,0)) %>%
  mutate(Mongodb = if_else(str_detect(JobDesc,fixed('Mongo',ignore_case = TRUE)),1,0)) %>%
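  # note: 'R,' with ignore_case = TRUE also matches any "r," (e.g. "computer,"), so this count may be inflated
  mutate(R = if_else(str_detect(JobDesc,fixed('R,',ignore_case = TRUE)),1,0)) %>%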
  mutate(CompSci = if_else(str_detect(JobDesc,fixed('Computer Science',ignore_case = TRUE)),1,0)) %>%
  mutate(Communication = if_else(str_detect(JobDesc,fixed('Communication Skills',ignore_case = TRUE)),1,0)) %>%
  mutate(SQL = if_else(str_detect(JobDesc,fixed('SQL',ignore_case = TRUE)),1,0)) %>% 
  mutate(AI = if_else(str_detect(JobDesc,fixed('Artificial',ignore_case = TRUE)),1,0)) %>%
  mutate(Predictive = if_else(str_detect(JobDesc,fixed('predictive',ignore_case = TRUE)),1,0)) %>%
  mutate(ML = if_else(str_detect(JobDesc,fixed('machine learning',ignore_case = TRUE)),1,0)) %>%
  mutate(Statistics = if_else(str_detect(JobDesc,fixed('Statistics',ignore_case = TRUE)),1,0)) %>%
  mutate(BigData = if_else(str_detect(JobDesc,fixed('Big Data',ignore_case = TRUE)),1,0)) %>%
  mutate(Neural = if_else(str_detect(JobDesc,fixed('Neural',ignore_case = TRUE)),1,0)) %>%
  mutate(Visualization = if_else(str_detect(JobDesc,fixed('visualization',ignore_case = TRUE)),1,0)) %>%
  mutate(Regression = if_else(str_detect(JobDesc,fixed('Regression',ignore_case = TRUE)),1,0)) %>%
  mutate(TextMining = if_else(str_detect(JobDesc,fixed('text mining',ignore_case = TRUE)),1,0)) %>%
  mutate(Matlab = if_else(str_detect(JobDesc,fixed('Matlab',ignore_case = TRUE)),1,0)) %>%
  mutate(SAS = if_else(str_detect(JobDesc,fixed('SAS',ignore_case = TRUE)),1,0)) %>%
  mutate(Cloud = if_else(str_detect(JobDesc,fixed('Cloud',ignore_case = TRUE)),1,0)) %>%
  gather('Python', 'R', 'SQL', 'AI', 'Predictive', 'ML', 'Statistics', 'BigData', 'Neural', 'Regression', 'TextMining', 'Matlab','SAS','Cloud', 'Visualization', 'Excel', 'Mongodb', 'CompSci','Communication',key=skill, value=value) %>% 
  group_by(skill) %>%
  # total the 0/1 flags per skill, then round once so rounding error does not accumulate
  summarize(value = sum(value), n = first(n)) %>%
  mutate(Percent = round((value / n) * 100)) %>%
  select(skill, value, Percent) %>%
  arrange(desc(value))

cloud_df <- df
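
The same flag-and-count idea can also be driven by a lookup of search terms instead of one `mutate()` per skill. The sketch below is an alternative illustration, not the code used for the results in this post: the descriptions in `toy_desc` are made up, and the names `skills` and `toy_summary` are hypothetical.

```r
library(tidyverse)

# Made-up job descriptions for illustration only (not scraped data).
toy_desc <- c(
  "Experience with Python and SQL required.",
  "Machine learning and python a plus.",
  "Strong communication skills; SQL preferred.",
  "Statistics background."
)

# Hypothetical lookup: output label -> search term.
skills <- c(Python = "python", SQL = "sql", ML = "machine learning")

# For each skill, flag the descriptions that mention it, then count them
# and express the count as a percentage of all descriptions.
toy_summary <- imap(skills, function(term, name) {
  hits <- str_detect(toy_desc, fixed(term, ignore_case = TRUE))
  tibble(skill = name, value = sum(hits), Percent = round(100 * mean(hits)))
}) %>%
  bind_rows() %>%
  arrange(desc(value))

print(toy_summary)
```

Adding a skill then means adding one entry to `skills`, which keeps the search terms in a single place where typos are easier to spot.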

Job Skills Sorted Table

The table below shows the relative demand for key data scientist skills. The value column is the number of job descriptions that mention the skill. Percent is value divided by the number of job descriptions in the analysis, expressed as a percentage.

skill          value  Percent
ML                50       50
CompSci           40       40
Communication     30       30
Predictive        30       30
SQL               30       30
Excel             20       20
Python            20       20
R                 20       20
Statistics        20       20
BigData           10       10
Cloud             10       10
SAS               10       10
Visualization     10       10
AI                 0        0
Matlab             0        0
Mongodb            0        0
Neural             0        0
Regression         0        0
TextMining         0        0