This analysis explores the question: “Which are the most valued data science skills?” As W. Edwards Deming said, “In God we trust, all others must bring data.” We will use data from two datasets to answer this question:
The analysis includes data acquisition, tidying, transformation, exploratory data analysis, and storage in a relational database.
We begin by loading necessary libraries and downloading the datasets from GitHub.
# Load required libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(DBI)
library(RSQLite)
# Download datasets from GitHub
# Data set 1: Data Science Jobs
jobs_url <- "https://raw.githubusercontent.com/zahid607/Project-3/main/Data%20Science_Jobs%20.csv"
jobs <- read.csv(jobs_url, stringsAsFactors = FALSE)
# Data set 2: Job Skills
skills_url <- "https://raw.githubusercontent.com/tcgraham-data/data-607-project-3/main/job_skills.csv"
job_skills <- read.csv(skills_url, stringsAsFactors = FALSE)
# Display the structure of both datasets
str(jobs)
## 'data.frame': 7505 obs. of 7 variables:
## $ Job.Title : chr "Data Scientist" "Business Analyst / Data Scientist" "Data Scientist - AI/ML" "Data Scientist" ...
## $ Company.Name : chr "Uplers" "CBRE" "Trent Limited" "Fortune 500 IT Services Company" ...
## $ Location : chr "Hybrid - Bengaluru" "Hybrid - Bengaluru" "Hybrid - Bengaluru" "Hybrid - Bengaluru" ...
## $ Experience : chr "0-4 Yrs" "0-4 Yrs" "0-4 Yrs" "0-4 Yrs" ...
## $ Salary : chr "19-27.5 Lacs PA" "19-27.5 Lacs PA" "19-27.5 Lacs PA" "19-27.5 Lacs PA" ...
## $ Job.Description: chr "Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar..." "Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar..." "Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar..." "Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar..." ...
## $ Skills : chr "Python, TensorFlow, Computer Vision, Flink, Recommender systems, Vision, Apache Flink, System" "Data Visualization, Tableau, Analytics, Business analysis, Data analysis, Data" "Artificial Intelligence, Machine Learning, Data Science, Azure Bot Framework, Azure Machine Learning, Predictiv"| __truncated__ "" ...
str(job_skills)
## 'data.frame': 12217 obs. of 2 variables:
## $ job_link : chr "https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-at-jobs-for-humanity-3804053819" "https://www.linkedin.com/jobs/view/principal-software-engineer-ml-accelerators-at-aurora-3703455068" "https://www.linkedin.com/jobs/view/senior-etl-data-warehouse-specialist-at-adame-services-llc-3765023888" "https://www.linkedin.com/jobs/view/senior-data-warehouse-developer-architect-at-morph-enterprise-3794602483" ...
## $ job_skills: chr "Machine Learning, Programming, Python, Scala, Java, Data Engineering, Distributed Computing, Statistical Modeli"| __truncated__ "C++, Python, PyTorch, TensorFlow, MXNet, CUDA, OpenCL, OpenVX, Halide, SIMD programming models, MLspecific acce"| __truncated__ "ETL, Data Integration, Data Transformation, Data Warehousing, Business Intelligence, Data Modeling, Data Archit"| __truncated__ "Data Lakes, Data Bricks, Azure Data Factory Pipelines, Spark, Python, Business Intelligence, Data Warehouse, SQ"| __truncated__ ...
# Display the first few rows of each dataset
head(jobs)
## Job.Title Company.Name
## 1 Data Scientist Uplers
## 2 Business Analyst / Data Scientist CBRE
## 3 Data Scientist - AI/ML Trent Limited
## 4 Data Scientist Fortune 500 IT Services Company
## 5 Data Scientist Foreign IT Consulting MNC
## 6
## Location Experience Salary
## 1 Hybrid - Bengaluru 0-4 Yrs 19-27.5 Lacs PA
## 2 Hybrid - Bengaluru 0-4 Yrs 19-27.5 Lacs PA
## 3 Hybrid - Bengaluru 0-4 Yrs 19-27.5 Lacs PA
## 4 Hybrid - Bengaluru 0-4 Yrs 19-27.5 Lacs PA
## 5 Hybrid - Bengaluru 0-4 Yrs 19-27.5 Lacs PA
## 6 Hybrid - Bengaluru 0-4 Yrs 19-27.5 Lacs PA
## Job.Description
## 1 Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar...
## 2 Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar...
## 3 Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar...
## 4 Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar...
## 5 Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar...
## 6 Shift : 10:00AM to 7:00PM ISTBachelors / Masters degree in computer science (or similar...
## Skills
## 1 Python, TensorFlow, Computer Vision, Flink, Recommender systems, Vision, Apache Flink, System
## 2 Data Visualization, Tableau, Analytics, Business analysis, Data analysis, Data
## 3 Artificial Intelligence, Machine Learning, Data Science, Azure Bot Framework, Azure Machine Learning, Predictive Modeling, Logistic Regression, Regression
## 4
## 5
## 6
head(job_skills)
## job_link
## 1 https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-at-jobs-for-humanity-3804053819
## 2 https://www.linkedin.com/jobs/view/principal-software-engineer-ml-accelerators-at-aurora-3703455068
## 3 https://www.linkedin.com/jobs/view/senior-etl-data-warehouse-specialist-at-adame-services-llc-3765023888
## 4 https://www.linkedin.com/jobs/view/senior-data-warehouse-developer-architect-at-morph-enterprise-3794602483
## 5 https://www.linkedin.com/jobs/view/lead-data-engineer-at-dice-3805948138
## 6 https://www.linkedin.com/jobs/view/senior-data-engineer-at-university-of-chicago-3798206502
## job_skills
## 1 Machine Learning, Programming, Python, Scala, Java, Data Engineering, Distributed Computing, Statistical Modeling, Optimization, Data Pipelines, Cloud Computing, DevOps, Software Development, Data Gathering, Data Preparation, Data Visualization, Machine Learning Frameworks, scikitlearn, PyTorch, Dask, Spark, TensorFlow, Distributed File Systems, Multi node Database Paradigms, Open Source ML Software, Responsible AI, Explainable AI
## 2 C++, Python, PyTorch, TensorFlow, MXNet, CUDA, OpenCL, OpenVX, Halide, SIMD programming models, MLspecific accelerators, Linux/unix environments, Deep learning frameworks, Computer vision deep learning models, ML software and hardware technology, Inference on edge platforms, Cloud ML training pipelines, HPC experience, Performance troubleshooting, Profiling, Roofline model, Analytical skills, Communication skills
## 3 ETL, Data Integration, Data Transformation, Data Warehousing, Business Intelligence, Data Modeling, Data Architecture, Data Quality, Data Validation, Data Cleansing, Performance Optimization, Performance Tuning, Troubleshooting, Documentation, Reporting, Data Analysis, Collaboration, Communication, SQL, Informatica, Talend, Apache NiFi, AWS Redshift, Azure SQL Data Warehouse, Financial/Banking, CloudBased Data Platforms, Regulatory Compliance
## 4 Data Lakes, Data Bricks, Azure Data Factory Pipelines, Spark, Python, Business Intelligence, Data Warehouse, SQL Server, Azure, ETL/ELT, SQL Server Integration Services, TSQL, Data Formatting, Data Capture, Data Search, Data Retrieval, Data Extraction, Data Classification, Information Filtering, Data Mining Architectures, Modeling Standards, Reporting, Data Analysis Methodologies, Data Engineering, Database File Systems Optimization, API's, Analytics as a Service, Relational Databases, Dimensional Databases, Entity Relationships, Data Warehousing, Facts, Dimensions, Star Schema Concepts, Star Schema Terminology, Project Management, Organizational Skills, Collaboration, Communication, Technical Presentaion Skills, 12+ Years of Relevant Experience
## 5 Java, Scala, Python, RDBMS, NoSQL, Redshift, Snowflake, Unit testing, Agile engineering, Big data technologies, Cloud computing (AWS Microsoft Azure Google Cloud), Distributed data/computing tools (MapReduce Hadoop Hive EMR Kafka Spark Gurobi MySQL), Realtime data and streaming applications, NoSQL implementation (Mongo Cassandra), Data warehousing (Redshift Snowflake), UNIX/Linux, Shell scripting
## 6 Data Warehouse (DW), Extract/Transform/Load (ETL), Oracle, VPD, MuleSoft, Cloudera Apache Hadoop Ecosystem, Hadoop, Java, Python, R, AI/ML, Predictive Analytics, Business Objects, Tableau, OBIA/OBIEE, Oracle Cloud Financials, Power Designer, Erwin, UNIX, ODBC, JDBC, Perl DBI, Shell Scripting, PL/SQL, SQL Developer, SQL Plus, SQL Loader, TOAD, Data Profiling, SourceTarget Mapping, Transformations, Business Rules, OWB, ODI, Data Stage, Informatica, SQL, Unix Server, Windows Workstation, Windows Server, Microsoft Office, Excel, Analytic Skills, ProblemSolving, Communication Skills, Teamwork, Accountability, Attention to Detail, Accuracy
In this section, we clean and transform the data to prepare it for analysis. We create normalized tables for a relational database structure.
# Transform the job_skills dataset
# Split comma-separated skills into individual rows
job_skills_expanded <- job_skills %>%
separate_rows(job_skills, sep = ",") %>%
mutate(job_skills = trimws(job_skills))
# Create a master table for skills
skills_master <- job_skills_expanded %>%
distinct(job_skills) %>%
mutate(skill_id = row_number())
# Create a linking table between jobs and skills
job_skill_linking <- job_skills_expanded %>%
left_join(skills_master, by = "job_skills") %>%
select(job_link, skill_id)
# Display the first few rows of the transformed tables
head(job_skills_expanded)
## # A tibble: 6 × 2
## job_link job_skills
## <chr> <chr>
## 1 https://www.linkedin.com/jobs/view/senior-machine-learning-enginee… Machine L…
## 2 https://www.linkedin.com/jobs/view/senior-machine-learning-enginee… Programmi…
## 3 https://www.linkedin.com/jobs/view/senior-machine-learning-enginee… Python
## 4 https://www.linkedin.com/jobs/view/senior-machine-learning-enginee… Scala
## 5 https://www.linkedin.com/jobs/view/senior-machine-learning-enginee… Java
## 6 https://www.linkedin.com/jobs/view/senior-machine-learning-enginee… Data Engi…
head(skills_master)
## # A tibble: 6 × 2
## job_skills skill_id
## <chr> <int>
## 1 Machine Learning 1
## 2 Programming 2
## 3 Python 3
## 4 Scala 4
## 5 Java 5
## 6 Data Engineering 6
head(job_skill_linking)
## # A tibble: 6 × 2
## job_link skill_id
## <chr> <int>
## 1 https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-… 1
## 2 https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-… 2
## 3 https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-… 3
## 4 https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-… 4
## 5 https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-… 5
## 6 https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-… 6
Now we analyze the data to identify the most valued data science skills based on their frequency in job postings.
# Count skill frequency
skill_frequency <- job_skill_linking %>%
group_by(skill_id) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
left_join(skills_master, by = "skill_id")
# Display top 20 skills by frequency
print("Top 20 most frequent skills:")
## [1] "Top 20 most frequent skills:"
head(skill_frequency, 20)
## # A tibble: 20 × 3
## skill_id count job_skills
## <int> <int> <chr>
## 1 3 4801 Python
## 2 66 4606 SQL
## 3 65 2498 Communication
## 4 63 2181 Data Analysis
## 5 1 1966 Machine Learning
## 6 298 1740 AWS
## 7 133 1685 Tableau
## 8 16 1562 Data Visualization
## 9 129 1542 R
## 10 5 1414 Java
## 11 21 1392 Spark
## 12 416 1285 Data Science
## 13 6 1262 Data Engineering
## 14 163 1218 Teamwork
## 15 104 1213 Project Management
## 16 334 1115 Data Analytics
## 17 605 1093 Problem Solving
## 18 128 1074 Hadoop
## 19 64 1072 Collaboration
## 20 203 1072 Data analysis
# Calculate basic statistics
skill_stats <- skill_frequency %>%
summarise(
mean_count = mean(count),
median_count = median(count),
sd_count = sd(count),
min_count = min(count),
max_count = max(count)
)
print("Skill frequency statistics:")
## [1] "Skill frequency statistics:"
skill_stats
## # A tibble: 1 × 5
## mean_count median_count sd_count min_count max_count
## <dbl> <dbl> <dbl> <int> <int>
## 1 4.20 1 41.4 1 4801
# Identify potential outliers (skills that are mentioned significantly more often)
outlier_threshold <- skill_stats$mean_count + 2 * skill_stats$sd_count
outlier_skills <- skill_frequency %>%
filter(count > outlier_threshold)
print("Skills that appear significantly more often (potential outliers):")
## [1] "Skills that appear significantly more often (potential outliers):"
outlier_skills
## # A tibble: 427 × 3
## skill_id count job_skills
## <int> <int> <chr>
## 1 3 4801 Python
## 2 66 4606 SQL
## 3 65 2498 Communication
## 4 63 2181 Data Analysis
## 5 1 1966 Machine Learning
## 6 298 1740 AWS
## 7 133 1685 Tableau
## 8 16 1562 Data Visualization
## 9 129 1542 R
## 10 5 1414 Java
## # ℹ 417 more rows
# Create a bar chart for the top skills with annotations
top_10_skills <- skill_frequency %>%
head(10)
ggplot(top_10_skills, aes(x = reorder(job_skills, count), y = count)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Top 10 Most Frequent Data Science Skills",
x = "Skill",
y = "Frequency") +
theme_minimal()
# Log-Transformed Histogram for skill frequency distribution
ggplot(skill_frequency, aes(x = log10(count + 1))) +
geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
labs(title = "Distribution of Skill Frequencies (Log Scale)",
x = "Log10(Frequency + 1)",
y = "Count") +
theme_minimal()
We store our normalized tables in a SQLite database for better data management.
# Create or connect to a SQLite database file
db <- dbConnect(SQLite(), dbname = "data_science_jobs.db")
# Write normalized tables to the database
# Write the jobs table from the first dataset
dbWriteTable(db, "jobs", jobs, overwrite = TRUE)
# Write the skills master table
dbWriteTable(db, "skills_master", skills_master, overwrite = TRUE)
# Write the job-skill linking table
dbWriteTable(db, "job_skill_linking", job_skill_linking, overwrite = TRUE)
# List tables to confirm
dbListTables(db)
## [1] "job_skill_linking" "jobs" "skills_master"
# Example query: Get the top 5 skills from the database
query <- "
SELECT sm.job_skills, COUNT(*) as frequency
FROM job_skill_linking jsl
JOIN skills_master sm ON jsl.skill_id = sm.skill_id
GROUP BY sm.job_skills
ORDER BY frequency DESC
LIMIT 5
"
result <- dbGetQuery(db, query)
print("Top 5 skills from database query:")
## [1] "Top 5 skills from database query:"
result
## job_skills frequency
## 1 Python 4801
## 2 SQL 4606
## 3 Communication 2498
## 4 Data Analysis 2181
## 5 Machine Learning 1966
# Disconnect after writing
dbDisconnect(db)
Based on our analysis, the most valued data science skills (as measured by frequency in job postings) are:
These findings align with industry expectations, where technical skills like Python and SQL are fundamental, but soft skills like communication are also highly valued.
The analysis reveals a clear hierarchy of skills, with a few skills (like Python and SQL) appearing significantly more often than others. This suggests that these are core skills that almost every data science job requires.
Future analyses could: - Incorporate salary data to correlate skills with compensation - Analyze skill co-occurrence to identify skill clusters - Track skill trends over time to identify emerging skills - Segment analysis by job title, seniority level, or industry