Overview

The purpose of Project 3 is to use data to answer the question, “Which are the most valued data science skills?”

After initial brainstorming, we chose the following dataset for our analysis: https://www.kaggle.com/elroyggj/indeed-dataset-data-scientistanalystengineer/kernels.

The dataset takes data-related job postings and breaks them down by industry, company, skills, job title, location, and categorical job type (data scientist, data analyst, or data engineer). We consider it an appropriate dataset because employers generally determine which skill sets are valued in an occupation.


There are important but subtle differences between the three major job titles within the data science job family.

Data Analysts query and process data, produce reports, and summarize and visualize data. They generally have a strong grasp of how to use existing tools and methods to solve problems, and they help their company understand specific problems. Typical tasks include cleaning and organizing raw data, using statistics to gain a big-picture perspective on the data, finding trends, and creating visualizations that help company staff interpret the data.

Data Engineers prepare and create the infrastructure for “big data”. They transform data into a usable format for analysis by data scientists. Their skill set leans toward software development: they build APIs for data consumption, integrate various datasets into existing data pipelines, and monitor and test those pipelines to ensure optimal performance.

Data Scientists apply statistics, machine learning, and analytic approaches to solve problems. They dive deep into big data, unstructured data, and regular data to find patterns and future trends. Data Scientists are expected to have programming skills and the ability to design new algorithms. They uncover hidden trends by applying supervised and unsupervised learning methods in their machine learning models. Some of a data scientist’s tasks include using statistical models to determine the validity of analyses, using machine learning to create better predictive algorithms, testing and improving machine learning models, and creating data visualizations to summarize advanced analysis.

Data Import

To begin our project, we loaded the Indeed dataset into normalized tables in AWS. The scripts used to create those tables can be found in the project GitHub repository.
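
As a rough illustration of what "normalized" means here, the sketch below splits a flat table into one table of postings and one long table of skills, keyed by posting id. The table names `jobs` and `job_skills`, the `job_id` key, and the toy rows are assumptions made for this example; the schema actually used is defined by the scripts in the repository.

```r
library(dplyr)
library(tidyr)

# Toy flat table (made-up rows) with the bracketed Skill string seen in the raw data.
flat <- tibble(
  X         = c(0L, 1L),
  Job_Title = c("Data Scientist", "Data Scientist"),
  Skill     = c("['SAP', 'SQL']", "['Python', 'SQL', 'R']")
)

# Hypothetical table 1: one row per posting.
jobs <- flat %>% select(job_id = X, Job_Title)

# Hypothetical table 2: one row per (posting, skill) pair.
job_skills <- flat %>%
  transmute(job_id = X, skill = gsub("\\[|\\]|'", "", Skill)) %>%  # strip brackets and quotes
  separate_rows(skill, sep = ",\\s*")                              # one skill per row
```

Storing skills in their own table avoids repeating posting details for every skill and makes per-skill counts a simple grouped query.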

Cleaning the Data

Let’s take a look at our raw data:

## Observations: 5,715
## Variables: 43
## $ X                                <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11…
## $ Job_Title                        <chr> "Data Scientist", "Data Scientist", …
## $ Link                             <chr> "https://www.indeed.com/rc/clk?jk=6a…
## $ Queried_Salary                   <chr> "<80000", "<80000", "<80000", "<8000…
## $ Job_Type                         <chr> "data_scientist", "data_scientist", …
## $ Skill                            <chr> "['SAP', 'SQL']", "['Machine Learnin…
## $ No_of_Skills                     <int> 2, 5, 9, 1, 7, 6, 10, 3, 4, 6, 8, 8,…
## $ Company                          <chr> "Express Scripts", "Money Mart Finan…
## $ No_of_Reviews                    <dbl> 3301, NA, 62, 158, 495, 173, 30, NA,…
## $ No_of_Stars                      <dbl> 3.3, NA, 3.5, 4.3, 4.1, 4.3, 3.8, NA…
## $ Date_Since_Posted                <int> 1, 15, 1, 30, 30, 30, 5, 10, 1, 22, …
## $ Description                      <chr> "[<p><b>POSITION SUMMARY</b></p>, <p…
## $ Location                         <chr> "MO", "TX", "OR", "DC", "TX", "MD", …
## $ Company_Revenue                  <chr> "More than $10B (USD)", "", "", "", …
## $ Company_Employees                <chr> "10,000+", "", "", "", "Less than 10…
## $ Company_Industry                 <chr> "Health Care", "", "", "Government",…
## $ python                           <int> 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, …
## $ sql                              <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
## $ machine.learning                 <int> 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, …
## $ r                                <int> 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, …
## $ hadoop                           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ tableau                          <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ sas                              <int> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, …
## $ spark                            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ java                             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Others                           <int> 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, …
## $ CA                               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ NY                               <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ VA                               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ TX                               <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ MA                               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ IL                               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ WA                               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ MD                               <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ DC                               <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ NC                               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Other_states                     <int> 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, …
## $ Consulting.and.Business.Services <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Internet.and.Software            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Banks.and.Financial.Services     <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ Health.Care                      <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Insurance                        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Other_industries                 <int> 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, …

We’re going to get rid of columns that are not necessary for our analysis. This includes all state data and information related to the company posting the job.
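
A minimal sketch of this step with dplyr is shown below. The toy data frame `raw`, the result name `trimmed`, and the exact drop lists (including treating `Location` as state data) are assumptions for illustration; the column names themselves come from the raw data listing above.

```r
library(dplyr)

# Toy stand-in for the raw data: made-up values, only a handful of the 43 columns.
raw <- tibble(
  Job_Title     = c("Data Scientist", "Senior Data Engineer"),
  Job_Type      = c("data_scientist", "data_engineer"),
  Company       = c("Company A", "Company B"),
  No_of_Reviews = c(120, NA),
  Location      = c("MO", "TX"),
  python        = c(1L, 1L),
  TX            = c(0L, 1L),
  Other_states  = c(1L, 0L)
)

# Assumed drop lists: the state indicators plus Location, and the company details.
state_cols   <- c("CA", "NY", "VA", "TX", "MA", "IL", "WA", "MD", "DC", "NC",
                  "Other_states", "Location")
company_cols <- c("Company", "No_of_Reviews", "No_of_Stars", "Company_Revenue",
                  "Company_Employees")

# any_of() ignores names missing from the data, so the call works on toy and full data alike.
trimmed <- raw %>% select(-any_of(c(state_cols, company_cols)))
names(trimmed)
```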

The raw data originally had 6 columns, one for each industry. If the job posting was within that industry, the column was set to 1; if not, it was set to 0. We decided to combine these 6 columns into a single column ‘Industry’ with 6 different categories: Consulting&Business, Internet&Software, Banks&FinancialServices, HealthCare, Insurance and Other.
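
One way to collapse the indicator columns is `dplyr::case_when()`, sketched below on made-up rows. The data frame name `jobs` is hypothetical, and the sketch assumes each posting has at most one industry flag set; if several were set, the first matching condition would win.

```r
library(dplyr)

# Toy rows with the six 0/1 industry indicators (values are made up).
jobs <- tibble(
  Consulting.and.Business.Services = c(1L, 0L, 0L),
  Internet.and.Software            = c(0L, 0L, 0L),
  Banks.and.Financial.Services     = c(0L, 1L, 0L),
  Health.Care                      = c(0L, 0L, 0L),
  Insurance                        = c(0L, 0L, 0L),
  Other_industries                 = c(0L, 0L, 1L)
)

# Map whichever indicator equals 1 to a single category label.
jobs <- jobs %>%
  mutate(Industry = case_when(
    Consulting.and.Business.Services == 1 ~ "Consulting&Business",
    Internet.and.Software == 1            ~ "Internet&Software",
    Banks.and.Financial.Services == 1     ~ "Banks&FinancialServices",
    Health.Care == 1                      ~ "HealthCare",
    Insurance == 1                        ~ "Insurance",
    TRUE                                  ~ "Other"
  ))
```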

Additionally, we wanted to flag whether a job title contained any variation of the words ‘junior’ or ‘senior’. Both of these additions will make our analysis easier.
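
The title flag could be derived with a case-insensitive pattern match, as in the sketch below. The patterns (which also match the abbreviations "Sr" and "Jr") and the fallback label "Other" are assumptions; the text does not list which variations were matched. The `Level` column and its "Senior" and "Junior" values match the names used in the analysis code later on.

```r
library(dplyr)

# Made-up job titles for illustration.
titles <- tibble(
  Job_Title = c("Senior Data Scientist", "Jr. Data Analyst",
                "Data Engineer", "Sr Data Engineer")
)

# Hypothetical patterns: whole-word matches for the full words and common abbreviations.
senior_pattern <- "\\b(senior|sr)\\b"
junior_pattern <- "\\b(junior|jr)\\b"

titles <- titles %>%
  mutate(Level = case_when(
    grepl(senior_pattern, Job_Title, ignore.case = TRUE, perl = TRUE) ~ "Senior",
    grepl(junior_pattern, Job_Title, ignore.case = TRUE, perl = TRUE) ~ "Junior",
    TRUE ~ "Other"  # assumed label for titles with neither word
  ))
```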

Data Analysis

Let’s look at the most frequently seen skills:

## # A tibble: 10 x 2
##    key                  n
##    <chr>            <int>
##  1 Others            5152
##  2 python            3325
##  3 sql               3104
##  4 machine.learning  2297
##  5 r                 2234
##  6 hadoop            1714
##  7 spark             1531
##  8 java              1480
##  9 tableau           1236
## 10 sas                941
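
A table of this shape (columns `key` and `n`) can be produced by reshaping the 0/1 skill columns to long form and counting the rows where the skill is present. The sketch below uses a made-up three-skill data frame named `skills`; it is one plausible way to get such a table, not necessarily the code that generated the output above.

```r
library(dplyr)
library(tidyr)

# Toy 0/1 skill indicators, one row per posting (values are made up).
skills <- tibble(
  python = c(1L, 1L, 0L, 1L),
  sql    = c(1L, 0L, 0L, 1L),
  r      = c(0L, 1L, 0L, 0L)
)

skill_counts <- skills %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  filter(value == 1) %>%   # keep only postings that mention the skill
  count(key, sort = TRUE)  # postings per skill, most frequent first
```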

Do Skills Differ Among Junior vs. Senior Data Scientists?

The top skills required of junior and senior data scientists do not differ: both are expected to know machine learning, Python, and R. Further down the list, differences appear. Senior data scientists are expected to know more data engineering tools (Hadoop, Spark, SQL), while for junior data scientists, visualization tools like Tableau are in higher demand.

library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)

# Keep only data scientist postings tagged as Senior or Junior
ds <- filter(df_2, Job_Type == 'data_scientist', Level %in% c("Senior", "Junior"))

# Count the postings that mention each skill, per level
ds_skills <- ds %>% group_by(Level) %>% summarise(sum(python),sum(sql),sum(machine.learning),sum(r),sum(hadoop),sum(tableau),sum(sas),sum(spark),sum(java),sum(Others))
colnames(ds_skills) <- c('Level','Python','SQL','Machine_Learning','R','Hadoop','Tableau','SAS','Spark','Java','Others')
ds_skills_long <- pivot_longer(ds_skills, cols = c(2:11), names_to = "Skill", values_to = "Count")

# Total data scientist postings per level, used as the denominator
ds_job_level <- ds %>% group_by(Level) %>% tally()
b <- ds_skills_long %>% inner_join(ds_job_level, by = "Level")
ds_skills_long['Total_Jobs']  <- b$n
ds_skills_long['Perct_Total'] <- b$Count / b$n

# Senior-level chart (x-axis label left blank because the charts are stacked)
senior <- filter(ds_skills_long, Level == "Senior")
s <- ggplot(senior, aes(x=reorder(Skill,Perct_Total),y=Perct_Total,color=Level,fill=Level)) +
  geom_bar(stat="identity", position = 'dodge',width=0.7) +
  coord_flip() +
  xlab('Skill') +
  ylab('') +
  scale_y_continuous(labels = scales::percent) +
  ggtitle('Senior Level') +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_manual(values=c("#E69F00")) +
  scale_fill_manual(values=c("#E69F00"))

# Junior-level chart
junior <- filter(ds_skills_long, Level == "Junior")
j <- ggplot(junior, aes(x=reorder(Skill,Perct_Total),y=Perct_Total,color=Level,fill=Level)) +
  geom_bar(stat="identity", position = 'dodge',width=0.7) +
  coord_flip() +
  xlab('Skill') +
  ylab('% Postings with Skill') +
  scale_y_continuous(labels = scales::percent) +
  ggtitle('Junior Level') +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_color_manual(values=c("#009E73")) +
  scale_fill_manual(values=c("#009E73"))

# Stack the two charts under a shared title
grid.arrange(s,j,nrow=2,top = "Top Skills for Junior vs. Senior Data Scientists")

Additional Visualizations

Finally, let’s compare all three job types (data_analyst, data_engineer, and data_scientist) and the skills needed for each. This visualization shows that data_scientist postings call for roughly the same skills as data_analyst and data_engineer postings, but they clearly ask for machine learning, Python, and R more often than the other two. This is consistent with our earlier analysis. It would be interesting to know what the ‘Others’ skill category includes.
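
A comparison like this could be built by computing, for each job type, the share of postings that mention each skill and drawing the shares side by side. The sketch below runs on made-up rows; `postings`, `skill_share`, and the plot layout are assumptions and may differ from the chart the project produced.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy postings: two per job type, with three of the 0/1 skill columns (values are made up).
postings <- tibble(
  Job_Type = c("data_analyst", "data_analyst", "data_engineer",
               "data_engineer", "data_scientist", "data_scientist"),
  python           = c(0L, 1L, 1L, 1L, 1L, 1L),
  sql              = c(1L, 1L, 1L, 0L, 0L, 1L),
  machine.learning = c(0L, 0L, 0L, 1L, 1L, 1L)
)

# The mean of a 0/1 indicator is the share of postings that mention the skill.
skill_share <- postings %>%
  pivot_longer(-Job_Type, names_to = "Skill", values_to = "has_skill") %>%
  group_by(Job_Type, Skill) %>%
  summarise(Share = mean(has_skill), .groups = "drop")

# Grouped horizontal bars: one bar per job type within each skill.
p <- ggplot(skill_share, aes(x = Skill, y = Share, fill = Job_Type)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(x = "Skill", y = "Share of postings", fill = "Job type")
```

Using shares rather than raw counts keeps the job types comparable even when they have different numbers of postings.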

Conclusion:

Thoughts on the Data:

Based on the data analyzed, we found that, as future data scientists, we should focus on learning and mastering R, machine learning, and Python. Data scientists should also learn SQL (and other database skills), though to a lesser degree. Just as in other professions, the highest earners are (we assume) in management. There is a lot of similarity among data analysts, data engineers, and data scientists, but there are important distinctions: data scientists are expected to have the most versatility, the data analyst role is heavy on visualization, and the data engineering skill set is a mix of software development and infrastructure architecture.

Final thoughts on the Project:

Since we used a curated dataset, our analysis was limited by the quality of the raw data. One limitation was that salary information was listed as categorical ranges (<$80,000, $80,000-$99,999, etc.). These ranges are fairly wide; having salary as a continuous variable would have allowed a more detailed analysis.

Another avenue for future exploration is whether data scientist positions require a master’s degree. In the world of IT and even software development, the “self-taught” route is completely acceptable; would that hold true for data science? Also, is there data for recent graduates of the Master’s in Data Science program? Would the data from recent graduates match up with the project’s conclusions?

With regard to this project, teamwork and clear communication were imperative. We organically divided up the workload based on our strengths and provided constructive feedback on each other’s work. It was also a nice opportunity to collaborate with classmates from different professional backgrounds.