Data 607 - Project 3: Data Science Skills

Introduction

Which are the most valued data science skills?

Our team collaborated together to find data that supports the skills that are needed for data science. There were a few datasets we explored and we felt the Data Science Job Postings & skills had a lot of good data for us to analyze and tidy up. Our team was able to collaborate effectively through Slack, Google doc, and Zoom. For project documentation and code sharing, we used Github.

The dataset we used is from Kaggle - Data Science Job Postings & Skills (2024) which contains jobs and the job skills that are needed for those roles.

Our approach for this project was to analyze the data and determine the job skills needed for data science and break it down by different countries and seniority. First step we had to take was combine the three dataset into one and then breakdown the job skills into individual observations.

We came up with a few questions we wanted to answer based on the data obtained.

What skills are the most common for data scientist?
What skills are needed for data scientist at different countries?
What skills are needed for data scientist depending on their level of seniority?

Load libraries

# load in the library
library(tidyverse)
library(lubridate)
library(wordcloud)
library(RMySQL)

Importing Data through MySQL Database

The first approach was to load multiple CSV files into MySQL, for a total of four tables: country, job_level, job_postings, and job_skills.

#importing country table 
mysqlconnection = dbConnect(RMySQL::MySQL(),
                            dbname= 'natalie.kalukeerthie02',
                            host= 'cunydata607sql.mysql.database.azure.com',
                            port=3306,
                            user='natalie.kalukeerthie02',
                            password='natalie.kalukeerthie02')
                            
result = dbSendQuery(mysqlconnection, "select * from country")

country <- fetch(result)

print(country)

#importing job_level table
mysqlconnection = dbConnect(RMySQL::MySQL(),
                            dbname= 'natalie.kalukeerthie02',
                            host= 'cunydata607sql.mysql.database.azure.com',
                            port=3306,
                            user='natalie.kalukeerthie02',
                            password='natalie.kalukeerthie02')
                            
result = dbSendQuery(mysqlconnection, "select * from job_level")

job_level <- fetch(result)

print(job_level)

The uploading and importing both the job_level and country tables was successful, mainly because the tables were small. When it came to both job_postings and job_skills, this proved to be a bit more difficult.

Due to the size of each dataset (around 12.1K rows), the files could not be imported through MySQL Workbench’s Import Data Wizard Tool and could only be uploaded into a local database.

This would be done using the LOAD DATA INFILE query. In order to use the query, the data set must be imported into an empty table that matches in number of columns and column data types.The TEXT function was used for columns that need an unlimited character length. VARCHAR(n) was used for columns needing only a specific character length.

#Create SQL table for job_postings
create table job_postings (
    job_link TEXT,
  last_processed_time VARCHAR(50),
  last_status VARCHAR(20),
  got_summary VARCHAR(1),
    got_ner VARCHAR(1),
  is_being_worked VARCHAR(1),
  job_title VARCHAR(200),
  company VARCHAR(100),
    job_location VARCHAR(100),
    first_seen date,
    search_city VARCHAR(100),
    id_country VARCHAR(1),
    search_position VARCHAR(100),
    id_level VARCHAR(2),
  job_type VARCHAR(10)
);

#Create table for job_skills
create table job_skills (
    job_link TEXT,
  last_processed_time TEXT,
);

Next, the data set is uploaded using LOAD DATA INFILE:

LOAD DATA INFILE 'C:\Program Files\MySQL\MySQL Server 8.0\job_postings_revised.csv' INTO TABLE job_postings
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS

Our group faced a few errors (show below) when attempting to load the datasets in, mainly these errors revolved around MySQL Workbench’s security protocols set in place over the last few years, which require granting certain permissions:

Error Code: 1290. The MySQL server is running with the –secure-file-priv option so it cannot execute this statement

This was resolved adding LOCAL into the LOAD DATA INFILE query:

LOAD DATA LOCAL INFILE 'C:\Program Files\MySQL\MySQL Server 8.0\job_postings_revised.csv' INTO TABLE job_postings
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\r\n'
IGNORE 1 ROWS

Error Code: 3948. Loading local data is disabled; this must be enabled on both the client and server sides

This error requires enabling permissions via multiple sources and will need further exploration.

Due to the issues issues face, we decided to take the approach of using a combined CSV file:

Combine all three dataset in one csv file before loading

# read the one file with all different dataset combined
all_data <- read_csv("https://raw.githubusercontent.com/AnnaMoy/Data-607-Project-3/main/job_postings_combine.csv")

Tidy up the data

# separate out the last processed time into date and time
# break out processed time from date and time
all_data <- all_data %>%
  mutate(last_processed_time = lubridate::ymd_hms(last_processed_time),
         last_process_date = lubridate::date(last_processed_time),
         last_processed_time = hms::as_hms(last_processed_time))

#remove unused columns
all_data <- dplyr::select(all_data, -c('last_status','got_summary','got_ner','is_being_worked','first_seen'))

# separate job skills into one column
all_data <- separate_rows(all_data, job_skills, sep = ",")

# added groups to be able to analyze the different roles and see if there is a difference in job skills
all_data$group_title= ifelse(grepl("*data engineer*", all_data$job_title, ignore.case= TRUE), "data engineer",
                        ifelse(grepl("*data scientist*", all_data$job_title, ignore.case= TRUE), "data scientist",
                        ifelse(grepl("*data architect*", all_data$job_title, ignore.case= TRUE), "data architect",
                        ifelse(grepl("*machine learning*", all_data$job_title, ignore.case= TRUE), "machine learning",      
                        ifelse(grepl("*mlops engineer*", all_data$job_title, ignore.case= TRUE), "machine learning",                                 ifelse(grepl("*database administrator*", all_data$job_title, ignore.case= TRUE), "database administrator",
                        ifelse(grepl("*database engineer*", all_data$job_title, ignore.case= TRUE), "database engineer",
                        ifelse(grepl("*data science*", all_data$job_title, ignore.case= TRUE), "data science",
                        ifelse(grepl("*data analyst*", all_data$job_title, ignore.case = TRUE), "data analyst", "other")))))))))

Analysis

What skills are the most common for data scientist?

Based on the visual, Python and SQL appear the be the most common skills for Data Scientists.

What skills do data scientist need in different countries?

For Australia, the United States, and the United Kingdom, Python is the most requested skill for Data Scientists, while in Canada, Machine Learning is the most requested skill.

To look at a few relevant roles (data engineer, data analyst, and machine learning roles) more closely, we grouped job titles into a bucket and determined the most common skills they need to have:

job_grouping <- all_data %>%
  group_by(group_title) %>%
  count(job_skills) %>%
  filter(n > 500, group_title == "data engineer") %>%
  arrange(group_title, desc(n)) %>%
  head(5)

ggplot(job_grouping, aes(x = reorder(job_skills,n),n,job_skills, fill = job_skills)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Data Engineer Job Skills Needed") +
   xlab("Job Skills") +
   ylab("# of counts")

job_grouping_data <- all_data %>%
  group_by(group_title) %>%
  count(job_skills) %>%
  filter(n > 500, group_title == "data analyst") %>%
  arrange(group_title, desc(n)) %>%
  head(5)

ggplot(job_grouping_data, aes(x = reorder(job_skills,n),n,job_skills, fill = job_skills)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Data Analyst Job Skills Needed") +
   xlab("Job Skills") +
   ylab("# of counts")

job_grouping_ml <- all_data %>%
  group_by(group_title) %>%
  count(job_skills) %>%
  filter(n > 100, group_title == "machine learning") %>%
  arrange(group_title, desc(n)) %>%
  head(5)

ggplot(job_grouping_ml, aes(x = reorder(job_skills,n),n,job_skills, fill = job_skills)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Machine Learning Job Skills Needed") +
   xlab("Job Skills") +
   ylab("# of counts")

For Data Engineers and Machine Learning positions, Python is the most requested skill. For Data Analysts, SQL is the most common.

What are the skills for data scientist depending on their seniority level?

seniority_assoc <- all_data %>%
  filter(group_title =="data scientist") %>%
  filter(job_level == "Associate") %>%
  select(job_level, job_skills) %>%
  count(job_skills)%>%
  arrange(desc(n)) %>%
  head(10)

seniority_mid <- all_data %>%
  filter(group_title =="data scientist") %>%
  filter(job_level == "Mid senior") %>%
  select(job_level, job_skills) %>%
  count(job_skills)%>%
  arrange(desc(n)) %>%
  head(10)

seniority_assoc

## # A tibble: 10 × 2
##    job_skills                n
##    <chr>                 <int>
##  1 " Python"                61
##  2 " SQL"                   45
##  3 " R"                     37
##  4 " Machine Learning"      36
##  5 "Data Science"           28
##  6 " Statistics"            26
##  7 " Data Visualization"    24
##  8 " Tableau"               21
##  9 " Communication"         20
## 10 " Hadoop"                18

seniority_mid

## # A tibble: 10 × 2
##    job_skills                n
##    <chr>                 <int>
##  1 " Python"               566
##  2 " SQL"                  437
##  3 " Machine Learning"     359
##  4 " R"                    354
##  5 "Data Science"          299
##  6 " Statistics"           267
##  7 " Data Visualization"   202
##  8 " Communication"        184
##  9 " Spark"                162
## 10 " Tableau"              161

Based on seniority level, the most common skill for associate data scientists is SQL, while for mid-senior level it is Python.

Top 30 job skills

We created a word cloud to give a simpler view on which skills are most asked for by employers. The top skills are Python and SQL, followed by communication.

Conclusion

Based on our analysis and findings, we believe that the most valued data science skills are Python, SQL, and Machine Learning.