Introduction

This analysis explores the question: “Which are the most valued data science skills?” As the saying often attributed to W. Edwards Deming goes, “In God we trust, all others must bring data.” We answer the question using two datasets:

  1. Data Science Jobs dataset from GitHub
  2. Job Skills dataset from GitHub

The analysis includes data acquisition, tidying, transformation, exploratory data analysis, and storage in a relational database.

Data Acquisition

We begin by loading necessary libraries and downloading the datasets from GitHub.

# Load required libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(DBI)
library(RSQLite)
# Download datasets from GitHub
# Data set 1: Data Science Jobs
jobs_url <- "https://raw.githubusercontent.com/zahid607/Project-3/main/Data%20Science_Jobs%20.csv"
jobs <- read.csv(jobs_url, stringsAsFactors = FALSE)

# Data set 2: Job Skills
skills_url <- "https://raw.githubusercontent.com/tcgraham-data/data-607-project-3/main/job_skills.csv"  
job_skills <- read.csv(skills_url, stringsAsFactors = FALSE)
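
Before inspecting the data, it is worth confirming that both files downloaded and parsed as expected. A minimal sanity check (a suggestion, not part of the original pipeline):

# Sanity check (suggested): both data frames should be non-empty
stopifnot(nrow(jobs) > 0, nrow(job_skills) > 0)
dim(jobs)
dim(job_skills)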

# Display the structure of both datasets
str(jobs)
## 'data.frame':    7505 obs. of  7 variables:
##  $ Job.Title      : chr  "Data Scientist" "Business Analyst / Data Scientist" "Data Scientist - AI/ML" "Data Scientist" ...
##  $ Company.Name   : chr  "Uplers" "CBRE" "Trent Limited" "Fortune 500 IT Services Company" ...
##  $ Location       : chr  "Hybrid - Bengaluru" "Hybrid - Bengaluru" "Hybrid - Bengaluru" "Hybrid - Bengaluru" ...
##  $ Experience     : chr  "0-4 Yrs" "0-4 Yrs" "0-4 Yrs" "0-4 Yrs" ...
##  $ Salary         : chr  "19-27.5 Lacs PA" "19-27.5 Lacs PA" "19-27.5 Lacs PA" "19-27.5 Lacs PA" ...
##  $ Job.Description: chr  "Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar..." "Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar..." "Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar..." "Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar..." ...
##  $ Skills         : chr  "Python, TensorFlow, Computer Vision, Flink, Recommender systems, Vision, Apache Flink, System" "Data Visualization, Tableau, Analytics, Business analysis, Data analysis, Data" "Artificial Intelligence, Machine Learning, Data Science, Azure Bot Framework, Azure Machine Learning, Predictiv"| __truncated__ "" ...
str(job_skills)
## 'data.frame':    12217 obs. of  2 variables:
##  $ job_link  : chr  "https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-at-jobs-for-humanity-3804053819" "https://www.linkedin.com/jobs/view/principal-software-engineer-ml-accelerators-at-aurora-3703455068" "https://www.linkedin.com/jobs/view/senior-etl-data-warehouse-specialist-at-adame-services-llc-3765023888" "https://www.linkedin.com/jobs/view/senior-data-warehouse-developer-architect-at-morph-enterprise-3794602483" ...
##  $ job_skills: chr  "Machine Learning, Programming, Python, Scala, Java, Data Engineering, Distributed Computing, Statistical Modeli"| __truncated__ "C++, Python, PyTorch, TensorFlow, MXNet, CUDA, OpenCL, OpenVX, Halide, SIMD programming models, MLspecific acce"| __truncated__ "ETL, Data Integration, Data Transformation, Data Warehousing, Business Intelligence, Data Modeling, Data Archit"| __truncated__ "Data Lakes, Data Bricks, Azure Data Factory Pipelines, Spark, Python, Business Intelligence, Data Warehouse, SQ"| __truncated__ ...
# Display the first few rows of each dataset
head(jobs)
##                           Job.Title                    Company.Name
## 1                    Data Scientist                          Uplers
## 2 Business Analyst / Data Scientist                            CBRE
## 3            Data Scientist - AI/ML                   Trent Limited
## 4                    Data Scientist Fortune 500 IT Services Company
## 5                    Data Scientist       Foreign IT Consulting MNC
## 6                                                                  
##             Location Experience          Salary
## 1 Hybrid - Bengaluru    0-4 Yrs 19-27.5 Lacs PA
## 2 Hybrid - Bengaluru    0-4 Yrs 19-27.5 Lacs PA
## 3 Hybrid - Bengaluru    0-4 Yrs 19-27.5 Lacs PA
## 4 Hybrid - Bengaluru    0-4 Yrs 19-27.5 Lacs PA
## 5 Hybrid - Bengaluru    0-4 Yrs 19-27.5 Lacs PA
## 6 Hybrid - Bengaluru    0-4 Yrs 19-27.5 Lacs PA
##                                                                              Job.Description
## 1 Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar...
## 2 Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar...
## 3 Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar...
## 4 Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar...
## 5 Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar...
## 6 Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar...
##                                                                                                                                                       Skills
## 1                                                              Python, TensorFlow, Computer Vision, Flink, Recommender systems, Vision, Apache Flink, System
## 2                                                                             Data Visualization, Tableau, Analytics, Business analysis, Data analysis, Data
## 3 Artificial Intelligence, Machine Learning, Data Science, Azure Bot Framework, Azure Machine Learning, Predictive Modeling, Logistic Regression, Regression
## 4                                                                                                                                                           
## 5                                                                                                                                                           
## 6
head(job_skills)
##                                                                                                      job_link
## 1         https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-at-jobs-for-humanity-3804053819
## 2         https://www.linkedin.com/jobs/view/principal-software-engineer-ml-accelerators-at-aurora-3703455068
## 3    https://www.linkedin.com/jobs/view/senior-etl-data-warehouse-specialist-at-adame-services-llc-3765023888
## 4 https://www.linkedin.com/jobs/view/senior-data-warehouse-developer-architect-at-morph-enterprise-3794602483
## 5                                    https://www.linkedin.com/jobs/view/lead-data-engineer-at-dice-3805948138
## 6                 https://www.linkedin.com/jobs/view/senior-data-engineer-at-university-of-chicago-3798206502
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            job_skills
## 1                                                                                                                                                                                                                                                                                                                                  Machine Learning, Programming, Python, Scala, Java, Data Engineering, Distributed Computing, Statistical Modeling, Optimization, Data Pipelines, Cloud Computing, DevOps, Software Development, Data Gathering, Data Preparation, Data Visualization, Machine Learning Frameworks, scikitlearn, PyTorch, Dask, Spark, TensorFlow, Distributed File Systems, Multi node Database Paradigms, Open Source ML Software, Responsible AI, Explainable AI
## 2                                                                                                                                                                                                                                                                                                                                                    C++, Python, PyTorch, TensorFlow, MXNet, CUDA, OpenCL, OpenVX, Halide, SIMD programming models, MLspecific accelerators, Linux/unix environments, Deep learning frameworks, Computer vision deep learning models, ML software and hardware technology, Inference on edge platforms, Cloud ML training pipelines, HPC experience, Performance troubleshooting, Profiling, Roofline model, Analytical skills, Communication skills
## 3                                                                                                                                                                                                                                                                                                                      ETL, Data Integration, Data Transformation, Data Warehousing, Business Intelligence, Data Modeling, Data Architecture, Data Quality, Data Validation, Data Cleansing, Performance Optimization, Performance Tuning, Troubleshooting, Documentation, Reporting, Data Analysis, Collaboration, Communication, SQL, Informatica, Talend, Apache NiFi, AWS Redshift, Azure SQL Data Warehouse, Financial/Banking, CloudBased Data Platforms, Regulatory Compliance
## 4 Data Lakes, Data Bricks, Azure Data Factory Pipelines, Spark, Python, Business Intelligence, Data Warehouse, SQL Server, Azure, ETL/ELT, SQL Server Integration Services, TSQL, Data Formatting, Data Capture, Data Search, Data Retrieval, Data Extraction, Data Classification, Information Filtering, Data Mining Architectures, Modeling Standards, Reporting, Data Analysis Methodologies, Data Engineering, Database File Systems Optimization, API's, Analytics as a Service, Relational Databases, Dimensional Databases, Entity Relationships, Data Warehousing, Facts, Dimensions, Star Schema Concepts, Star Schema Terminology, Project Management, Organizational Skills, Collaboration, Communication, Technical Presentaion Skills, 12+ Years of Relevant Experience
## 5                                                                                                                                                                                                                                                                                                                                                                     Java, Scala, Python, RDBMS, NoSQL, Redshift, Snowflake, Unit testing, Agile engineering, Big data technologies, Cloud computing (AWS Microsoft Azure Google Cloud), Distributed data/computing tools (MapReduce Hadoop Hive EMR Kafka Spark Gurobi MySQL), Realtime data and streaming applications, NoSQL implementation (Mongo Cassandra), Data warehousing (Redshift Snowflake), UNIX/Linux, Shell scripting
## 6                                                                                                                   Data Warehouse (DW), Extract/Transform/Load (ETL), Oracle, VPD, MuleSoft, Cloudera Apache Hadoop Ecosystem, Hadoop, Java, Python, R, AI/ML, Predictive Analytics, Business Objects, Tableau, OBIA/OBIEE, Oracle Cloud Financials, Power Designer, Erwin, UNIX, ODBC, JDBC, Perl DBI, Shell Scripting, PL/SQL, SQL Developer, SQL Plus, SQL Loader, TOAD, Data Profiling, SourceTarget Mapping, Transformations, Business Rules, OWB, ODI, Data Stage, Informatica, SQL, Unix Server, Windows Workstation, Windows Server, Microsoft Office, Excel, Analytic Skills, ProblemSolving, Communication Skills, Teamwork, Accountability, Attention to Detail, Accuracy
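
Note that head(jobs) shows several postings with an empty Skills field (rows 4-6 above). A quick count of such rows, as a hedged sanity check, helps gauge how much of the first dataset carries no skill information:

# Count postings in the jobs dataset with an empty Skills field
sum(trimws(jobs$Skills) == "")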

Data Tidying and Transformation

In this section, we clean and transform the data to prepare it for analysis. We create normalized tables for a relational database structure.

# Transform the job_skills dataset
# Split comma-separated skills into individual rows
job_skills_expanded <- job_skills %>%
  separate_rows(job_skills, sep = ",") %>%
  mutate(job_skills = trimws(job_skills))

# Create a master table for skills
skills_master <- job_skills_expanded %>%
  distinct(job_skills) %>%
  mutate(skill_id = row_number())

# Create a linking table between jobs and skills
job_skill_linking <- job_skills_expanded %>%
  left_join(skills_master, by = "job_skills") %>%
  select(job_link, skill_id)

# Display the first few rows of the transformed tables
head(job_skills_expanded)
## # A tibble: 6 × 2
##   job_link                                                            job_skills
##   <chr>                                                               <chr>     
## 1 https://www.linkedin.com/jobs/view/senior-machine-learning-enginee… Machine L…
## 2 https://www.linkedin.com/jobs/view/senior-machine-learning-enginee… Programmi…
## 3 https://www.linkedin.com/jobs/view/senior-machine-learning-enginee… Python    
## 4 https://www.linkedin.com/jobs/view/senior-machine-learning-enginee… Scala     
## 5 https://www.linkedin.com/jobs/view/senior-machine-learning-enginee… Java      
## 6 https://www.linkedin.com/jobs/view/senior-machine-learning-enginee… Data Engi…
head(skills_master)
## # A tibble: 6 × 2
##   job_skills       skill_id
##   <chr>               <int>
## 1 Machine Learning        1
## 2 Programming             2
## 3 Python                  3
## 4 Scala                   4
## 5 Java                    5
## 6 Data Engineering        6
head(job_skill_linking)
## # A tibble: 6 × 2
##   job_link                                                              skill_id
##   <chr>                                                                    <int>
## 1 https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-…        1
## 2 https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-…        2
## 3 https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-…        3
## 4 https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-…        4
## 5 https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-…        5
## 6 https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-…        6
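
One caveat with this approach: skills are matched as raw strings, so case variants such as “Data Analysis” and “Data analysis” (both of which surface in the frequency table below) are treated as distinct skills. A minimal sketch of a normalization step that would merge such variants, under the assumption that casing carries no meaning here (the _normalized names are illustrative):

# Optional normalization sketch: fold case before building the master table,
# so that e.g. "Data Analysis" and "Data analysis" map to one skill_id
job_skills_normalized <- job_skills_expanded %>%
  mutate(job_skills = tolower(job_skills))
skills_master_normalized <- job_skills_normalized %>%
  distinct(job_skills) %>%
  mutate(skill_id = row_number())

The remainder of the analysis keeps the unnormalized tables so that the reported counts match the raw data.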

Exploratory Data Analysis

Now we analyze the data to identify the most valued data science skills based on their frequency in job postings.

# Count skill frequency
skill_frequency <- job_skill_linking %>%
  group_by(skill_id) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  left_join(skills_master, by = "skill_id")

# Display top 20 skills by frequency
print("Top 20 most frequent skills:")
## [1] "Top 20 most frequent skills:"
head(skill_frequency, 20)
## # A tibble: 20 × 3
##    skill_id count job_skills        
##       <int> <int> <chr>             
##  1        3  4801 Python            
##  2       66  4606 SQL               
##  3       65  2498 Communication     
##  4       63  2181 Data Analysis     
##  5        1  1966 Machine Learning  
##  6      298  1740 AWS               
##  7      133  1685 Tableau           
##  8       16  1562 Data Visualization
##  9      129  1542 R                 
## 10        5  1414 Java              
## 11       21  1392 Spark             
## 12      416  1285 Data Science      
## 13        6  1262 Data Engineering  
## 14      163  1218 Teamwork          
## 15      104  1213 Project Management
## 16      334  1115 Data Analytics    
## 17      605  1093 Problem Solving   
## 18      128  1074 Hadoop            
## 19       64  1072 Collaboration     
## 20      203  1072 Data analysis
# Calculate basic statistics
skill_stats <- skill_frequency %>%
  summarise(
    mean_count = mean(count),
    median_count = median(count),
    sd_count = sd(count),
    min_count = min(count),
    max_count = max(count)
  )

print("Skill frequency statistics:")
## [1] "Skill frequency statistics:"
skill_stats
## # A tibble: 1 × 5
##   mean_count median_count sd_count min_count max_count
##        <dbl>        <dbl>    <dbl>     <int>     <int>
## 1       4.20            1     41.4         1      4801
# Identify potential outliers (skills that are mentioned significantly more often)
outlier_threshold <- skill_stats$mean_count + 2 * skill_stats$sd_count
outlier_skills <- skill_frequency %>%
  filter(count > outlier_threshold)

print("Skills that appear significantly more often (potential outliers):")
## [1] "Skills that appear significantly more often (potential outliers):"
outlier_skills
## # A tibble: 427 × 3
##    skill_id count job_skills        
##       <int> <int> <chr>             
##  1        3  4801 Python            
##  2       66  4606 SQL               
##  3       65  2498 Communication     
##  4       63  2181 Data Analysis     
##  5        1  1966 Machine Learning  
##  6      298  1740 AWS               
##  7      133  1685 Tableau           
##  8       16  1562 Data Visualization
##  9      129  1542 R                 
## 10        5  1414 Java              
## # ℹ 417 more rows
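With 427 skills flagged, the mean + 2 SD rule is arguably too permissive here: the distribution is heavily right-skewed (mean 4.2 vs. median 1, SD 41.4), which undermines a rule built on the mean. A quantile-based cutoff is one hedged alternative (variable names here are illustrative):

# Alternative cutoff (a sketch): flag only skills above the 99th percentile
p99_threshold <- quantile(skill_frequency$count, 0.99)
high_frequency_skills <- skill_frequency %>%
  filter(count > p99_threshold)
nrow(high_frequency_skills)
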
# Create a bar chart of the top 10 skills
top_10_skills <- skill_frequency %>%
  head(10)
ggplot(top_10_skills, aes(x = reorder(job_skills, count), y = count)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Most Frequent Data Science Skills",
       x = "Skill",
       y = "Frequency") +
  theme_minimal()

# Log-transformed histogram of the skill frequency distribution
ggplot(skill_frequency, aes(x = log10(count + 1))) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
  labs(title = "Distribution of Skill Frequencies (Log Scale)",
       x = "Log10(Frequency + 1)",
       y = "Count") +
  theme_minimal()
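
The log-scale histogram points to a long-tailed distribution: most skills appear once or twice, while a few dominate. One way to quantify that concentration (a quick supplementary calculation, not in the original analysis) is the share of all skill mentions captured by the top 10 skills:

# Share of all skill mentions accounted for by the top 10 skills
top10_share <- sum(head(skill_frequency$count, 10)) / sum(skill_frequency$count)
round(top10_share, 3)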

Relational Database Creation

We store our normalized tables in a SQLite database so they can be queried with SQL and reused across sessions.

# Create or connect to a SQLite database file
db <- dbConnect(SQLite(), dbname = "data_science_jobs.db")

# Write normalized tables to the database
# Write the jobs table from the first dataset
dbWriteTable(db, "jobs", jobs, overwrite = TRUE)

# Write the skills master table
dbWriteTable(db, "skills_master", skills_master, overwrite = TRUE)

# Write the job-skill linking table
dbWriteTable(db, "job_skill_linking", job_skill_linking, overwrite = TRUE)

# List tables to confirm
dbListTables(db)
## [1] "job_skill_linking" "jobs"              "skills_master"
# Example query: Get the top 5 skills from the database
query <- "
SELECT sm.job_skills, COUNT(*) as frequency
FROM job_skill_linking jsl
JOIN skills_master sm ON jsl.skill_id = sm.skill_id
GROUP BY sm.job_skills
ORDER BY frequency DESC
LIMIT 5
"

result <- dbGetQuery(db, query)
print("Top 5 skills from database query:")
## [1] "Top 5 skills from database query:"
result
##         job_skills frequency
## 1           Python      4801
## 2              SQL      4606
## 3    Communication      2498
## 4    Data Analysis      2181
## 5 Machine Learning      1966
# Disconnect after writing
dbDisconnect(db)
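
For larger linking tables, joins on skill_id benefit from an index. A hedged addition (SQLite supports CREATE INDEX IF NOT EXISTS; the index name below is illustrative):

# Optional: index the linking table to speed up joins on skill_id
db <- dbConnect(SQLite(), dbname = "data_science_jobs.db")
dbExecute(db, "CREATE INDEX IF NOT EXISTS idx_link_skill
               ON job_skill_linking (skill_id)")
dbDisconnect(db)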

Findings and Conclusion

Based on our analysis, the most valued data science skills (as measured by frequency in job postings) are:

  1. Python
  2. SQL
  3. Communication
  4. Data Analysis
  5. Machine Learning

These findings align with industry expectations, where technical skills like Python and SQL are fundamental, but soft skills like communication are also highly valued.

The analysis also reveals a clear hierarchy: a handful of skills (notably Python and SQL) appear far more often than the rest, with Python listed in roughly 4,800 of the ~12,200 postings in the second dataset. This suggests these are core skills required across a large share of data science roles.

Methodology Notes

  1. Data Collection: Data was collected from two GitHub repositories containing job postings and associated skills.
  2. Data Transformation: We normalized the data into three tables: jobs, skills_master, and job_skill_linking.
  3. Analysis Approach: We used frequency counts as a proxy for skill value, assuming that skills mentioned more frequently in job postings are more valued.
  4. Limitations: This analysis does not account for the context in which skills are mentioned or their relative importance within a job posting, and it treats case variants (e.g., “Data Analysis” vs. “Data analysis”) as distinct skills.

Future Work

Future analyses could:

  - Incorporate salary data to correlate skills with compensation
  - Analyze skill co-occurrence to identify skill clusters (a sketch of such a query follows)
  - Track skill trends over time to identify emerging skills
  - Segment the analysis by job title, seniority level, or industry
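
As a small illustration of the co-occurrence idea, the query below (a sketch against the database created earlier; skill names must match the skills_master spellings exactly) counts postings that list both Python and SQL:

# Sketch: count postings whose skill lists include both Python and SQL
db <- dbConnect(SQLite(), dbname = "data_science_jobs.db")
cooccurrence_query <- "
SELECT COUNT(DISTINCT a.job_link) AS jobs_with_both
FROM job_skill_linking a
JOIN job_skill_linking b ON a.job_link = b.job_link
JOIN skills_master sa ON a.skill_id = sa.skill_id
JOIN skills_master sb ON b.skill_id = sb.skill_id
WHERE sa.job_skills = 'Python' AND sb.job_skills = 'SQL'
"
dbGetQuery(db, cooccurrence_query)
dbDisconnect(db)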