The purpose of Project 3 is to use data to answer the question, “Which are the most valued data science skills?”.
After initial brainstorming, we decided to use the following dataset for our analysis: https://www.kaggle.com/elroyggj/indeed-dataset-data-scientistanalystengineer/kernels.
The dataset takes data-related job postings and breaks them down by industry, company, skills, job title, location, and job type (data scientist, data analyst, and data engineer). We determined this is an appropriate dataset since it is employers who generally decide which skill sets are valued in an occupation.
There are important but subtle differences between the three major job titles within the data science job family.
Data Analysts query and process data, provide reports, and summarize and visualize data. They generally have a strong grasp of how to use existing tools and methods to solve problems, and they help their companies understand specific problems. Typical analyst tasks include cleaning and organizing raw data, using statistics to get a big-picture perspective on the data, finding trends, and creating visualizations that help company staff interpret the data.
Data Engineers are the professionals who prepare and build the infrastructure for “big data”. They transform data into a usable format for analysis by data scientists. Data Engineers’ skill set leans toward software development: they build APIs for data consumption, integrate various datasets into existing data pipelines, and monitor and test those pipelines to ensure optimal performance.
Data Scientists apply statistics, machine learning, and analytic approaches to solve problems. They dive deep into big data, unstructured data, and structured data to find patterns and future trends. Data Scientists are expected to have programming skills and the ability to design new algorithms, and they uncover hidden trends by applying supervised and unsupervised learning methods in their machine learning models. Some of a data scientist’s tasks include using statistical models to determine the validity of analyses, using machine learning to create better predictive algorithms, testing and improving those models, and creating data visualizations to summarize advanced analyses.
To begin our project, we loaded the Indeed dataset into normalized tables in AWS. The scripts used to create those tables can be found in the project GitHub repository.
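For context, a minimal sketch of the setup chunk is shown below. The CSV file name is an assumption (our pipeline actually pulled the same data from the normalized AWS tables), but it loads the packages the rest of the analysis relies on and produces the glimpse output that follows.
# Minimal setup sketch (file name is assumed; our pipeline read from AWS instead)
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
raw_data <- read.csv("indeed_job_dataset.csv", stringsAsFactors = FALSE)
glimpse(raw_data)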
## Observations: 5,715
## Variables: 43
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ Job_Title <chr> "Data Scientist", "Data Scientist", …
## $ Link <chr> "https://www.indeed.com/rc/clk?jk=6a…
## $ Queried_Salary <chr> "<80000", "<80000", "<80000", "<8000…
## $ Job_Type <chr> "data_scientist", "data_scientist", …
## $ Skill <chr> "['SAP', 'SQL']", "['Machine Learnin…
## $ No_of_Skills <int> 2, 5, 9, 1, 7, 6, 10, 3, 4, 6, 8, 8,…
## $ Company <chr> "Express Scripts", "Money Mart Finan…
## $ No_of_Reviews <dbl> 3301, NA, 62, 158, 495, 173, 30, NA,…
## $ No_of_Stars <dbl> 3.3, NA, 3.5, 4.3, 4.1, 4.3, 3.8, NA…
## $ Date_Since_Posted <int> 1, 15, 1, 30, 30, 30, 5, 10, 1, 22, …
## $ Description <chr> "[<p><b>POSITION SUMMARY</b></p>, <p…
## $ Location <chr> "MO", "TX", "OR", "DC", "TX", "MD", …
## $ Company_Revenue <chr> "More than $10B (USD)", "", "", "", …
## $ Company_Employees <chr> "10,000+", "", "", "", "Less than 10…
## $ Company_Industry <chr> "Health Care", "", "", "Government",…
## $ python <int> 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, …
## $ sql <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
## $ machine.learning <int> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, …
## $ r <int> 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, …
## $ hadoop <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ tableau <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ sas <int> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, …
## $ spark <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ java <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Others <int> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, …
## $ CA <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ NY <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ VA <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ TX <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ MA <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ IL <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ WA <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ MD <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ DC <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ NC <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Other_states <int> 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, …
## $ Consulting.and.Business.Services <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Internet.and.Software <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Banks.and.Financial.Services <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ Health.Care <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Insurance <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Other_industries <int> 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, …
We’re going to drop the columns that are not necessary for our analysis: all state data and information related to the company that posted the job.
df_1 <- raw_data
# columns not needed for the skills analysis
drops <- c("X","Link","Company","No_of_Reviews","No_of_Stars","Date_Since_Posted","Company_Revenue","Company_Employees","Description","Location")
df_1 <- select(df_1, -all_of(drops))
# drop the state indicator columns (columns 17-27 after the drops above)
df_2 <- select(df_1, -c(17:27))
The raw data has six industry indicator columns, one per industry; a column is set to 1 if the posting falls within that industry and 0 otherwise. We decided to combine these six columns into a single ‘Industry’ column with six categories: Consulting&Business, Internet&Software, Banks&FinancialServices, HealthCare, Insurance, and Other.
Additionally, we wanted to flag job titles containing any variation of ‘junior’ or ‘senior’. Both of these additions will make our analysis easier.
# single column for industry (first matching flag wins; NA if no industry flag is set)
df_2['Industry'] <- ifelse(df_2$Consulting.and.Business.Services == 1, 'Consulting&Business',
                    ifelse(df_2$Internet.and.Software == 1, 'Internet&Software',
                    ifelse(df_2$Banks.and.Financial.Services == 1, 'Banks&FinancialServices',
                    ifelse(df_2$Health.Care == 1, 'HealthCare',
                    ifelse(df_2$Insurance == 1, 'Insurance',
                    ifelse(df_2$Other_industries == 1, 'Other', NA))))))
# flag junior or senior job titles; word boundaries keep "Sr"/"Jr" from matching inside other words
df_2['Level'] <- ifelse(grepl("\\b(Jr|Junior)\\b", df_2$Job_Title, ignore.case = TRUE), 'Junior',
                 ifelse(grepl("\\b(Sr|Senior)\\b", df_2$Job_Title, ignore.case = TRUE), 'Senior', 'Other'))
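As an aside, the nested ifelse above has a flatter equivalent in dplyr::case_when, which may be easier to extend if more industries are tracked later. A behavior-equivalent sketch for the industry column:
# Equivalent industry assignment via case_when (NA when no flag is set)
df_2 <- df_2 %>%
  mutate(Industry = case_when(
    Consulting.and.Business.Services == 1 ~ 'Consulting&Business',
    Internet.and.Software == 1            ~ 'Internet&Software',
    Banks.and.Financial.Services == 1     ~ 'Banks&FinancialServices',
    Health.Care == 1                      ~ 'HealthCare',
    Insurance == 1                        ~ 'Insurance',
    Other_industries == 1                 ~ 'Other'
  ))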
# indicator columns for the ten leading technologies tracked in the dataset
leadingtech <- c("python","sql","machine.learning","r","hadoop","tableau","sas","spark","java","Others")
# count how many postings list each skill
skill_ct <- raw_data %>%
  select(all_of(leadingtech)) %>%
  gather(key, value) %>%
  filter(value == 1) %>%
  group_by(key) %>%
  summarise(n = n()) %>%
  arrange(-n)
skill_ct
## # A tibble: 10 x 2
## key n
## <chr> <int>
## 1 Others 5152
## 2 python 3325
## 3 sql 3104
## 4 machine.learning 2297
## 5 r 2234
## 6 hadoop 1714
## 7 spark 1531
## 8 java 1480
## 9 tableau 1236
## 10 sas 941
The data shows that data engineers are typically expected to know Java, Hadoop, Spark, SQL, and Python. Many of the “other” skills required of data engineers are cloud-based technologies (Azure, AWS, Google Cloud) and Linux. This is to be expected: as previously mentioned, the data engineer’s skill set gravitates toward software development, a field where Java and SQL are prevalent for backend systems. Linux, being open source and flexible, is the operating system of choice unless you’re in a Microsoft-based environment.
Data analysts are expected to know Tableau to a greater degree than data scientists or engineers. They are also required to know SQL in order to perform basic queries against historical data. Tableau is an interactive data visualization tool, also known as a business intelligence tool. SAS, another popular skill for analysts, is statistical software used for analysis. These skills match the job description for analysts well: their skills are geared toward visualization, mining past data, and creating business reports.
Data scientist, as the data shows, is a versatile role. Data scientists are expected to know machine learning and R by significant margins over engineers and analysts. To a lesser degree, data scientists are also expected to know Hadoop, Spark, and Java, which are the key skills for engineers. We suspect a successful data scientist may not need to be an expert in data engineering but should at least be familiar with the concepts of that profession. Of course, Python is universally required for all positions.
skills <- df_2 %>%
  group_by(Job_Type) %>%
  summarise(sum(python), sum(sql), sum(machine.learning), sum(r), sum(hadoop),
            sum(tableau), sum(sas), sum(spark), sum(java), sum(Others))
colnames(skills) <- c('Job_Type','Python','SQL','Machine_Learning','R','Hadoop','Tableau','SAS','Spark','Java','Others')
skills <- pivot_longer(skills, cols = c(2:11), names_to = "Skill", values_to = "Count")
# total postings per job type, used to turn skill counts into percentages
job_type_count <- df_2 %>% group_by(Job_Type) %>% tally()
a <- skills %>% inner_join(job_type_count, by = "Job_Type")
skills['Total_Jobs'] <- a$n
skills['Perct_Total'] <- a$Count / a$n
ggplot(skills, aes(x=reorder(Skill,-Perct_Total),y=Perct_Total,fill=Job_Type)) +
geom_bar(stat="identity", position = 'dodge') +
xlab('Skill') +
ylab('% of Total Postings that List this Skill') +
ggtitle('Top Skills on Job Listings - Indeed') +
theme(plot.title = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 90))
There isn’t much difference in the top skills required of junior and senior data scientists: both are expected to know machine learning, Python, and R. Further down the list, differences emerge. Senior data scientists are expected to know more data engineering (Hadoop, Spark, SQL), while for junior-level scientists, visualization tools like Tableau are in higher demand.
ds <- filter(df_2, Job_Type == 'data_scientist', Level %in% c("Senior","Junior"))
ds_skills <- ds %>%
  group_by(Level) %>%
  summarise(sum(python), sum(sql), sum(machine.learning), sum(r), sum(hadoop),
            sum(tableau), sum(sas), sum(spark), sum(java), sum(Others))
colnames(ds_skills) <- c('Level','Python','SQL','Machine_Learning','R','Hadoop','Tableau','SAS','Spark','Java','Others')
ds_skills_long <- pivot_longer(ds_skills, cols = c(2:11), names_to = "Skill", values_to = "Count")
# tally Junior/Senior counts among data scientist postings only,
# so Perct_Total is relative to data scientist jobs
ds_job_level <- ds %>% group_by(Level) %>% tally()
b <- ds_skills_long %>% inner_join(ds_job_level, by = "Level")
ds_skills_long['Total_Jobs'] <- b$n
ds_skills_long['Perct_Total'] <- b$Count / b$n
senior <- filter(ds_skills_long,ds_skills_long$Level == "Senior")
s <- ggplot(senior, aes(x=reorder(Skill,Perct_Total),y=Perct_Total,color=Level,fill=Level)) +
geom_bar(stat="identity", position = 'dodge',width=0.7) +
coord_flip() +
xlab('Skill') +
ylab('') +
ggtitle('Senior Level') +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_manual(values=c("#E69F00")) +
scale_fill_manual(values=c("#E69F00"))
junior <- filter(ds_skills_long,ds_skills_long$Level == "Junior")
j <- ggplot(junior, aes(x=reorder(Skill,Perct_Total),y=Perct_Total,color=Level,fill=Level)) +
geom_bar(stat="identity", position = 'dodge',width=0.7) +
coord_flip() +
xlab('Skill') +
ylab('% Postings with Skill') +
ggtitle('Junior Level') +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_manual(values=c("#009E73")) +
scale_fill_manual(values=c("#009E73"))
grid.arrange(s,j,nrow=2,top = "Top Skills for Junior vs. Senior Data Scientists")
The top industries posting these jobs are consulting/business services, internet/software, banking/financial services, health care, and insurance.
posts_by_industry <- filter(df_2, Job_Type == 'data_scientist', !is.na(Industry))
industry_count <- posts_by_industry %>% group_by(Industry) %>% tally()
colnames(industry_count) <- c('Industry','Count')
# fill/color are constants, so they belong outside aes(); mapping them inside
# aes() creates a spurious legend that then has to be suppressed
ggplot(industry_count, aes(x = reorder(Industry, Count), y = Count)) +
  geom_bar(stat = "identity", fill = "#E69F00", color = "black") +
  coord_flip() +
  xlab('Industry') +
  ylab('Count of Job Postings') +
  ggtitle('Top Industries Among Data Job Listings - Indeed') +
  theme(plot.title = element_text(hjust = 0.5))
The two ends of the salary spectrum (under $80K and over $160K) list machine learning, Python, and R relatively rarely. This actually makes sense if the professionals making above $160K are in management: they are not actively coding as much as mid-to-senior-tier professionals. Overall, the data shows machine learning, Python, and R are the highest-paying skills.
ds_salary <- select(filter(df_2, Job_Type == 'data_scientist'), Queried_Salary, python, sql, machine.learning, r)
ds_salary_job_type <- pivot_longer(ds_salary, cols = c(2:5), names_to = 'Skill')
skill <- ds_salary_job_type %>% group_by(Queried_Salary, Skill) %>% summarise(sum(value))
colnames(skill) <- c('Salary_Range','Skill','Count')
# order the salary buckets from lowest to highest
skill$Salary_Range <- factor(skill$Salary_Range, levels = c("<80000","80000-99999","100000-119999","120000-139999","140000-159999",">160000"))
skill <- skill[order(skill$Salary_Range),]
ggplot(skill, aes(x=Salary_Range,y=Count,fill=Skill)) +
geom_bar(stat="identity", position = 'dodge')+
xlab('Salary Range') +
ylab('Count of Postings') +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("Highest Paying Skills") +
theme(plot.title = element_text(hjust = 0.5))
The required skills are similar across the industries with the greatest demand for data scientists. This is a boon to data scientists, as moving across industries should be easier than in other professions. This opens up the job market in a positive way for all current and future data scientists.
# remove rows where industry is NA or Other
industry <- filter(df_2, Job_Type == 'data_scientist', Industry != 'Other', !is.na(Industry))
industry_skills <- industry %>%
  group_by(Industry) %>%
  summarise(sum(python), sum(sql), sum(machine.learning), sum(r), sum(hadoop),
            sum(tableau), sum(sas), sum(spark), sum(java), sum(Others))
colnames(industry_skills) <- c('Industry','Python','SQL','Machine_Learning','R','Hadoop','Tableau','SAS','Spark','Java','Others')
industry_skills <- pivot_longer(industry_skills, cols = c(2:11), names_to = "Skill", values_to = "Count")
# Internet & Software vs. Consulting & Business
internet <- filter(industry_skills, Industry %in% c('Internet&Software','Consulting&Business'))
a <- ggplot(internet, aes(x=reorder(Skill,-Count),y=Count,fill=Industry)) +
geom_bar(stat="identity", position = 'dodge')+
xlab('') +
ylab('Count of Job Posts') +
theme(axis.text.x = element_text(angle = 90)) +
expand_limits(y = c(0, 400)) +
theme(legend.position = c(0.6, 0.8))
#Healthcare, Insurance & Banks
healthcare_insurance_banks <- filter(industry_skills,industry_skills$Industry %in% c('HealthCare','Insurance','Banks&FinancialServices'))
b <- ggplot(healthcare_insurance_banks, aes(x=reorder(Skill,-Count),y=Count,color=Industry,fill=Industry)) +
geom_bar(stat="identity", position = 'dodge')+
xlab('Skill') +
ylab('Count of Job Posts') +
theme(axis.text.x = element_text(angle = 90)) +
expand_limits(y = c(0,160)) +
theme(legend.position = c(0.6725, 0.8)) +
scale_fill_manual(values = c("#009E73","#E69F00")) +
scale_color_manual(values = c("#009E73","#E69F00"))
grid.arrange(a,b,nrow=1,top = "Top Skills Across Industry for Data Scientist Job Postings",widths=c(1,1))
Finally, let’s compare all three job types (data_analyst, data_engineer, and data_scientist) and the skills needed for each. This visualization shows that the data_scientist role requires roughly the same skills as data_analyst and data_engineer, but clearly uses machine learning, Python, and R more than the other two, which supports our earlier analysis. It would be interesting to know what the ‘Others’ skill category includes (see the sketch after the plot below).
# gather() pivots the leadingtech indicator columns into key-value pairs, making "wide" data longer
# group_by() groups the data by job type and skill
# summarise() totals the skill counts for all 3 job types: data_analyst, data_engineer and data_scientist
# draw a polygon of each skill's count per Job_Type
# highlight the count points
# facet by Job_Type to split the three roles horizontally
# draw with polar coordinates
# set the theme
polar_split <- raw_data %>%
  gather(key, value, all_of(leadingtech)) %>%
  group_by(Job_Type, key) %>%
  summarise(count = sum(value)) %>%
  ggplot() +
  geom_polygon(aes(x = key, y = count, col = Job_Type, group = Job_Type), size = 0.5, fill = NA) +
  geom_point(aes(x = key, y = count, col = Job_Type, group = Job_Type), size = 1) +
  facet_grid(cols = vars(Job_Type)) +
  coord_polar() +
  theme_minimal() +
  theme(legend.position = "top",
        legend.title = element_blank(),
        legend.spacing.x = grid::unit(0.25, 'cm'),
        legend.text = element_text(margin = margin(t = 10), vjust = 2, size = 15),
        axis.title = element_blank(),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_blank())
polar_split
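As a rough way to peek inside that ‘Others’ bucket, the raw Skill column (a stringified list such as "['SAP', 'SQL']") can be unpacked and tallied. A minimal sketch, assuming every row follows the same bracket-and-quote format:
# Unpack the stringified skill lists and count each individual skill
all_skills <- raw_data$Skill %>%
  gsub("\\[|\\]|'", "", .) %>%   # strip brackets and quotes
  strsplit(",\\s*") %>%          # split each posting's list on commas
  unlist() %>%
  trimws()
head(sort(table(all_skills), decreasing = TRUE), 20)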
Based on the data analyzed, we found that, as future data scientists, we should focus on learning and mastering R, machine learning, and Python. Data scientists should also be exposed to, and learn to a lesser degree, SQL (and other database technologies). Just as in other professions, the highest earners are (we assume) in management. There is a lot of similarity among data analysts, data engineers, and data scientists, but there are important distinctions: data scientists are expected to have the most versatility, the data analyst role is heavy on visualization, and the data engineer’s skill set is a mix of software development and infrastructure architecture.
Since we used a curated dataset, our analysis was limited by the quality of the raw data. One limitation was that salary was recorded as categorical ranges (<$80,000, $80,000-$99,999, etc.). These ranges are quite large; had salary been a continuous variable, the analysis could have been more detailed.
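One possible workaround is to map each bucket to a numeric midpoint so salary can be treated as quasi-continuous. A sketch; the anchor values chosen for the open-ended <80000 and >160000 buckets are arbitrary assumptions:
# Rough numeric proxy for the categorical salary ranges
# (70000 and 175000 are assumed anchors for the open-ended buckets)
salary_midpoint <- c("<80000" = 70000, "80000-99999" = 90000,
                     "100000-119999" = 110000, "120000-139999" = 130000,
                     "140000-159999" = 150000, ">160000" = 175000)
df_2$Salary_Mid <- unname(salary_midpoint[df_2$Queried_Salary])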
Another avenue for future exploration is whether data scientist positions require a master’s degree. In the world of IT, and even software development, the “self-taught” route is completely acceptable; would that hold true for data science? Also, is there data on recent graduates of Master’s in Data Science programs, and would it match up with this project’s conclusions?
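Since raw_data keeps the full posting text in its Description column, a rough first pass at the degree question is already possible. A sketch; the regex is a crude heuristic that will miss or over-count some phrasings, not a definitive classifier:
# Flag postings whose description appears to mention a master's degree,
# then compare the share by job type (heuristic only)
mentions_masters <- grepl("master'?s|\\bM\\.?S\\.?\\b", raw_data$Description,
                          ignore.case = TRUE)
raw_data %>%
  mutate(Masters_Mentioned = mentions_masters) %>%
  group_by(Job_Type) %>%
  summarise(share_mentioning_masters = mean(Masters_Mentioned))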
Regarding the project itself, teamwork and clear communication were imperative. We organically divided the workload based on our strengths and provided constructive feedback on each other’s work. It was also a nice opportunity to collaborate with classmates from different professional backgrounds.