Project 3: Data Preparation

Author

Team 4: Alahdab, Fox, Toybis

Published

March 17, 2024

Introduction

Knowing which data skills are most important for success in the field of data science is key for the best education and preparation for a a fruitful career. Despite the abundance of learning resources and educational offerings for data science, or because of such abundance, it can be challenging to know where to start and which skills are most critical. To try to answer this question and fill this knowledge gap, we undertook an evaluation of a contemporary data on which skills were most in demand for data science jobs.

Process and Data

To facilitate collaboration, we created two shared resources (a MySQL database in Azure and a Github repository) and used email/Zoom for communication.

We obtained the dataset “Data Science Job Postings & Skills (2024)” from Kaggle(1). The data was originally sourced by Kaggle user “Asaniczka” by scraping publicly available LinkedIn job postings related to the term “data science”. It is presented as a raw data dump in three files, each with 12,218 rows, one per job posting:

Job_summary: Two columns: URL and “job_summary,” which appears to be the original formatted job posting data
Job_postings: 15 columns, one row per posting. Appears to have been derived from #1,
Job_skills: Two columns, URL and comma-delimited list of skills found in the job listing, also appears to have been derived from #1 above

The team evaluated all tables and used #2 and #3 in this analysis, referred to below as raw_job_postings.csv and raw_job_skills.csv.

Significant cleaning and transformation was required to import this data into MySQL in a normalized and useful format.

We began with loading the libraries:

library (tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library (dplyr)

Data Preparation

The desired outcome of this process was a normalized MySQL database as follows:

Fig. 1 ER diagram of MySQL database

1. Job Postings Table

Upon examination, the raw_job_postings file pictured in Fig. 1 below was in a wide tidy format, with one row per observation (job posting). The following issues were noted:

The primary key was the URL of the job posting, which was unwieldy and also proved to have a small number of duplicates (<20).
Several columns appeared to be for the author’s use in processing and were not meaningful for analysis
Rogue commas and invalid characters caused repeated errors in the import process
The table was denormalized, with values repeating across postings

These and other minor issues needed to be resolved before moving the data into MySQL (see Fig. 2 and Fig. 3):

Fig. 2: Original File

Fig. 3: Format of Table with Keys in MySQL

First, we imported the raw file and created a working dataframe of only necessary columns with revised names. (Note that one transformation was done during the evaluation process outside of R: the creation of a unique random number “id_simple” as a new primary key for job postings in lieu of URLs. All other keys were generated in R below.)

df_job_posting_raw <- read_csv("https://raw.githubusercontent.com/unsecuredAMRAP/607pr3/main/1_R_transform_raw_files_for_SQL/raw_job_postings.csv")

Rows: 12217 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (11): job_link, last_status, job_title, company, job_location, first_se...
dbl   (1): id_simple
lgl   (3): got_summary, got_ner, is_being_worked
dttm  (1): last_processed_time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df_job_posting_work <- df_job_posting_raw %>% 
  select(posting_id = id_simple,
         URL = job_link,
         first_seen = first_seen,
         last_processed = last_processed_time,
         title_desc = job_title,
         company_desc = company,
         location_raw = job_location,
         city_desc = search_city,
         country_desc = search_country,
         search_pos_desc = search_position,
         job_level_desc = job_level,
         onsite_flag_desc = job_type
  )

str(df_job_posting_work)

tibble [12,217 × 12] (S3: tbl_df/tbl/data.frame)
 $ posting_id      : num [1:12217] 6334263 6950875 7920777 5655137 2358397 ...
 $ URL             : chr [1:12217] "https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-at-jobs-for-humanity-3804053819" "https://www.linkedin.com/jobs/view/principal-software-engineer-ml-accelerators-at-aurora-3703455068" "https://www.linkedin.com/jobs/view/senior-etl-data-warehouse-specialist-at-adame-services-llc-3765023888" "https://www.linkedin.com/jobs/view/senior-data-warehouse-developer-architect-at-morph-enterprise-3794602483" ...
 $ first_seen      : chr [1:12217] "1/14/2024" "1/14/2024" "1/14/2024" "1/12/2024" ...
 $ last_processed  : POSIXct[1:12217], format: "2024-01-21 08:08:48" "2024-01-20 04:02:12" ...
 $ title_desc      : chr [1:12217] "Senior Machine Learning Engineer" "Principal Software Engineer, ML Accelerators" "Senior ETL Data Warehouse Specialist" "Senior Data Warehouse Developer / Architect" ...
 $ company_desc    : chr [1:12217] "Jobs for Humanity" "Aurora" "Adame Services LLC" "Morph Enterprise" ...
 $ location_raw    : chr [1:12217] "New Haven, CT" "San Francisco, CA" "New York, NY" "Harrisburg, PA" ...
 $ city_desc       : chr [1:12217] "East Haven" "El Cerrito" "Middletown" "Lebanon" ...
 $ country_desc    : chr [1:12217] "United States" "United States" "United States" "United States" ...
 $ search_pos_desc : chr [1:12217] "Agricultural-Research Engineer" "Set-Key Driver" "Technical Support Specialist" "Architect" ...
 $ job_level_desc  : chr [1:12217] "Mid senior" "Mid senior" "Associate" "Mid senior" ...
 $ onsite_flag_desc: chr [1:12217] "Onsite" "Onsite" "Onsite" "Onsite" ...

This clean, tidy dataset was then normalized for MySQL by creating keys and lookup tables for values that repeat across postings.

Normalizing data supports data integrity and ease of maintenance: for example, renaming job levels or adding new data elements about companies would only need a single edit to a lookup table instead of across all tables where those data elements appear.

#-------- Create all lookup tables and unique keys for each, add key to job_postings table

# Title
df_title <- df_job_posting_work %>% 
  distinct(title_desc) 

df_title <- df_title %>% 
  mutate(title_id = 1:nrow(df_title))

df_title <- df_title[,c(2,1)]

df_job_posting_work <- mutate(left_join(df_job_posting_work,df_title,by = "title_desc"))

# Company
df_company <- df_job_posting_work %>% 
  distinct(company_desc) 

df_company <- df_company %>% 
  mutate(company_id = 1:nrow(df_company))

df_company <- df_company[,c(2,1)]

df_job_posting_work <- mutate(left_join(df_job_posting_work,
                                        df_company,
                                        by = "company_desc"))

# City
df_city <- df_job_posting_work %>% 
  distinct(city_desc) 

df_city <- df_city %>% 
  mutate(city_id = 1:nrow(df_city))

df_city <- df_city[,c(2,1)]

df_job_posting_work <- mutate(left_join(df_job_posting_work,
                                        df_city,
                                        by = "city_desc"))

# country
df_country <- df_job_posting_work %>% 
  distinct(country_desc) 

df_country <- df_country %>% 
  mutate(country_id = 1:nrow(df_country))

df_country <- df_country[,c(2,1)]

df_job_posting_work <- mutate(left_join(df_job_posting_work,
                                        df_country,
                                        by = "country_desc"))

# search_position (appears to group job titles)
df_search_pos <- df_job_posting_work %>% 
  distinct(search_pos_desc) 

df_search_pos <- df_search_pos %>% 
  mutate(search_pos_id = 1:nrow(df_search_pos))

df_search_pos <- df_search_pos[,c(2,1)]

df_job_posting_work <- mutate(left_join(df_job_posting_work,
                                        df_search_pos,
                                        by = "search_pos_desc"))

# job_level
df_job_level <- df_job_posting_work %>% 
  distinct(job_level_desc) 

df_job_level <- df_job_level %>% 
  mutate(job_level_id = 1:nrow(df_job_level))

df_job_level <- df_job_level[,c(2,1)]

df_job_posting_work <- mutate(left_join(df_job_posting_work,
                                        df_job_level,
                                        by = "job_level_desc"))

# onsite_flag
df_onsite_flag <- df_job_posting_work %>% 
  distinct(onsite_flag_desc) 

df_onsite_flag <- df_onsite_flag %>% 
  mutate(onsite_flag_id = 1:nrow(df_onsite_flag))

df_onsite_flag <- df_onsite_flag[,c(2,1)]

df_job_posting_work <- mutate(left_join(df_job_posting_work,
                                        df_onsite_flag,
                                        by = "onsite_flag_desc"))


head(df_title)

# A tibble: 6 × 2
  title_id title_desc                                  
     <int> <chr>                                       
1        1 Senior Machine Learning Engineer            
2        2 Principal Software Engineer, ML Accelerators
3        3 Senior ETL Data Warehouse Specialist        
4        4 Senior Data Warehouse Developer / Architect 
5        5 Lead Data Engineer                          
6        6 Senior Data Engineer

str(df_job_posting_work)

tibble [12,217 × 19] (S3: tbl_df/tbl/data.frame)
 $ posting_id      : num [1:12217] 6334263 6950875 7920777 5655137 2358397 ...
 $ URL             : chr [1:12217] "https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-at-jobs-for-humanity-3804053819" "https://www.linkedin.com/jobs/view/principal-software-engineer-ml-accelerators-at-aurora-3703455068" "https://www.linkedin.com/jobs/view/senior-etl-data-warehouse-specialist-at-adame-services-llc-3765023888" "https://www.linkedin.com/jobs/view/senior-data-warehouse-developer-architect-at-morph-enterprise-3794602483" ...
 $ first_seen      : chr [1:12217] "1/14/2024" "1/14/2024" "1/14/2024" "1/12/2024" ...
 $ last_processed  : POSIXct[1:12217], format: "2024-01-21 08:08:48" "2024-01-20 04:02:12" ...
 $ title_desc      : chr [1:12217] "Senior Machine Learning Engineer" "Principal Software Engineer, ML Accelerators" "Senior ETL Data Warehouse Specialist" "Senior Data Warehouse Developer / Architect" ...
 $ company_desc    : chr [1:12217] "Jobs for Humanity" "Aurora" "Adame Services LLC" "Morph Enterprise" ...
 $ location_raw    : chr [1:12217] "New Haven, CT" "San Francisco, CA" "New York, NY" "Harrisburg, PA" ...
 $ city_desc       : chr [1:12217] "East Haven" "El Cerrito" "Middletown" "Lebanon" ...
 $ country_desc    : chr [1:12217] "United States" "United States" "United States" "United States" ...
 $ search_pos_desc : chr [1:12217] "Agricultural-Research Engineer" "Set-Key Driver" "Technical Support Specialist" "Architect" ...
 $ job_level_desc  : chr [1:12217] "Mid senior" "Mid senior" "Associate" "Mid senior" ...
 $ onsite_flag_desc: chr [1:12217] "Onsite" "Onsite" "Onsite" "Onsite" ...
 $ title_id        : int [1:12217] 1 2 3 4 5 6 7 8 9 10 ...
 $ company_id      : int [1:12217] 1 2 3 4 5 6 1 1 7 8 ...
 $ city_id         : int [1:12217] 1 2 3 4 5 6 7 8 9 10 ...
 $ country_id      : int [1:12217] 1 1 1 1 1 1 1 1 1 1 ...
 $ search_pos_id   : int [1:12217] 1 2 3 4 5 6 7 4 8 9 ...
 $ job_level_id    : int [1:12217] 1 1 2 1 1 1 1 1 1 1 ...
 $ onsite_flag_id  : int [1:12217] 1 1 1 1 1 1 1 1 1 1 ...

With the lookup tables created and all the keys now in the working copy of the table, we removed all the repeating values and were left with the final job posting table for MySQL:

# Create job_posting normalized table with keys

df_job_posting <- df_job_posting_work %>% 
  select(posting_id,
         URL,
         first_seen,
         last_processed,
         title_id,
         company_id,
         location_raw,
         city_id,
         country_id,
         search_pos_id,
         job_level_id,
         onsite_flag_id
         )

str(df_job_posting)

tibble [12,217 × 12] (S3: tbl_df/tbl/data.frame)
 $ posting_id    : num [1:12217] 6334263 6950875 7920777 5655137 2358397 ...
 $ URL           : chr [1:12217] "https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-at-jobs-for-humanity-3804053819" "https://www.linkedin.com/jobs/view/principal-software-engineer-ml-accelerators-at-aurora-3703455068" "https://www.linkedin.com/jobs/view/senior-etl-data-warehouse-specialist-at-adame-services-llc-3765023888" "https://www.linkedin.com/jobs/view/senior-data-warehouse-developer-architect-at-morph-enterprise-3794602483" ...
 $ first_seen    : chr [1:12217] "1/14/2024" "1/14/2024" "1/14/2024" "1/12/2024" ...
 $ last_processed: POSIXct[1:12217], format: "2024-01-21 08:08:48" "2024-01-20 04:02:12" ...
 $ title_id      : int [1:12217] 1 2 3 4 5 6 7 8 9 10 ...
 $ company_id    : int [1:12217] 1 2 3 4 5 6 1 1 7 8 ...
 $ location_raw  : chr [1:12217] "New Haven, CT" "San Francisco, CA" "New York, NY" "Harrisburg, PA" ...
 $ city_id       : int [1:12217] 1 2 3 4 5 6 7 8 9 10 ...
 $ country_id    : int [1:12217] 1 1 1 1 1 1 1 1 1 1 ...
 $ search_pos_id : int [1:12217] 1 2 3 4 5 6 7 4 8 9 ...
 $ job_level_id  : int [1:12217] 1 1 2 1 1 1 1 1 1 1 ...
 $ onsite_flag_id: int [1:12217] 1 1 1 1 1 1 1 1 1 1 ...

2. Skills by Job Posting

The raw_job_postings_skills file pictured in Fig. 4 below was in a tidy format, with one row per observation (in this case, job posting/skills list pairs).

However, although tidy, it required significant cleaning, transformation, and tidying in order to be used for analysis:

The primary key was the URL of the job posting
Skills for each posting were listed in concatenated strings of free text
These strings contained many invalid characters that would cause import errors in MySQL

Fig. 4: Original Skills by Posting File

Our first step was to parse the skills string into columns:

#-------- Import table and parse comma delimited list of skills into columns: 

df_job_skills_raw <- read_csv("https://raw.githubusercontent.com/unsecuredAMRAP/607pr3/main/1_R_transform_raw_files_for_SQL/raw_job_skills2.csv")

Rows: 12217 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): job_link, job_skills
dbl (1): id_simple

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df_job_skills_raw <- df_job_skills_raw  %>%  
  rename(posting_id = id_simple, URL = job_link)

#NOTE we capped this at 200 skills per posting; 32 positions had more than 200

df_skills_work <- df_job_skills_raw %>% 
  separate_wider_delim(job_skills,
                    delim=",",
                    names=c("Skill_1","Skill_2","Skill_3","Skill_4","Skill_5", "Skill_6","Skill_7","Skill_8","Skill_9","Skill_10", "Skill_11","Skill_12","Skill_13","Skill_14","Skill_15", "Skill_16","Skill_17","Skill_18","Skill_19","Skill_20", "Skill_21","Skill_22","Skill_23","Skill_24","Skill_25", "Skill_26","Skill_27","Skill_28","Skill_29","Skill_30", "Skill_31","Skill_32","Skill_33","Skill_34","Skill_35", "Skill_36","Skill_37","Skill_38","Skill_39","Skill_40", "Skill_41","Skill_42","Skill_43","Skill_44","Skill_45", "Skill_46","Skill_47","Skill_48","Skill_49","Skill_50", "Skill_51","Skill_52","Skill_53","Skill_54","Skill_55", "Skill_56","Skill_57","Skill_58","Skill_59","Skill_60", "Skill_61","Skill_62","Skill_63","Skill_64","Skill_65", "Skill_66","Skill_67","Skill_68","Skill_69","Skill_70", "Skill_71","Skill_72","Skill_73","Skill_74","Skill_75", "Skill_76","Skill_77","Skill_78","Skill_79","Skill_80", "Skill_81","Skill_82","Skill_83","Skill_84","Skill_85", "Skill_86","Skill_87","Skill_88","Skill_89","Skill_90", "Skill_91","Skill_92","Skill_93","Skill_94","Skill_95", "Skill_96","Skill_97","Skill_98","Skill_99","Skill_100", "Skill_101","Skill_102","Skill_103","Skill_104","Skill_105", "Skill_106","Skill_107","Skill_108","Skill_109","Skill_110", "Skill_111","Skill_112","Skill_113","Skill_114","Skill_115", "Skill_116","Skill_117","Skill_118","Skill_119","Skill_120", "Skill_121","Skill_122","Skill_123","Skill_124","Skill_125", "Skill_126","Skill_127","Skill_128","Skill_129","Skill_130", "Skill_131","Skill_132","Skill_133","Skill_134","Skill_135", "Skill_136","Skill_137","Skill_138","Skill_139","Skill_140", "Skill_141","Skill_142","Skill_143","Skill_144","Skill_145", "Skill_146","Skill_147","Skill_148","Skill_149","Skill_150", "Skill_151","Skill_152","Skill_153","Skill_154","Skill_155", "Skill_156","Skill_157","Skill_158","Skill_159","Skill_160", "Skill_161","Skill_162","Skill_163","Skill_164","Skill_165", "Skill_166","Skill_167","Skill_168","Skill_169","Skill_170", "Skill_171","Skill_172","Skill_173","Skill_174","Skill_175", "Skill_176","Skill_177","Skill_178","Skill_179","Skill_180", "Skill_181","Skill_182","Skill_183","Skill_184","Skill_185", "Skill_186","Skill_187","Skill_188","Skill_189","Skill_190", "Skill_191","Skill_192","Skill_193","Skill_194","Skill_195", "Skill_196","Skill_197","Skill_198","Skill_199","Skill_200" ), 
      too_many = "drop", 
      too_few = "align_start")

df_skills_work$posting_id <- as.character(df_skills_work$posting_id)

str(df_skills_work)

tibble [12,217 × 202] (S3: tbl_df/tbl/data.frame)
 $ posting_id: chr [1:12217] "6334263" "6950875" "7920777" "5655137" ...
 $ URL       : chr [1:12217] "https://www.linkedin.com/jobs/view/senior-machine-learning-engineer-at-jobs-for-humanity-3804053819" "https://www.linkedin.com/jobs/view/principal-software-engineer-ml-accelerators-at-aurora-3703455068" "https://www.linkedin.com/jobs/view/senior-etl-data-warehouse-specialist-at-adame-services-llc-3765023888" "https://www.linkedin.com/jobs/view/senior-data-warehouse-developer-architect-at-morph-enterprise-3794602483" ...
 $ Skill_1   : chr [1:12217] "Machine Learning" "C++" "ETL" "Data Lakes" ...
 $ Skill_2   : chr [1:12217] " Programming" " Python" " Data Integration" " Data Bricks" ...
 $ Skill_3   : chr [1:12217] " Python" " PyTorch" " Data Transformation" " Azure Data Factory Pipelines" ...
 $ Skill_4   : chr [1:12217] " Scala" " TensorFlow" " Data Warehousing" " Spark" ...
 $ Skill_5   : chr [1:12217] " Java" " MXNet" " Business Intelligence" " Python" ...
 $ Skill_6   : chr [1:12217] " Data Engineering" " CUDA" " Data Modeling" " Business Intelligence" ...
 $ Skill_7   : chr [1:12217] " Distributed Computing" " OpenCL" " Data Architecture" " Data Warehouse" ...
 $ Skill_8   : chr [1:12217] " Statistical Modeling" " OpenVX" " Data Quality" " SQL Server" ...
 $ Skill_9   : chr [1:12217] " Optimization" " Halide" " Data Validation" " Azure" ...
 $ Skill_10  : chr [1:12217] " Data Pipelines" " SIMD programming models" " Data Cleansing" " ETL/ELT" ...
 $ Skill_11  : chr [1:12217] " Cloud Computing" " MLspecific accelerators" " Performance Optimization" " SQL Server Integration Services" ...
 $ Skill_12  : chr [1:12217] " DevOps" " Linux/unix environments" " Performance Tuning" " TSQL" ...
 $ Skill_13  : chr [1:12217] " Software Development" " Deep learning frameworks" " Troubleshooting" " Data Formatting" ...
 $ Skill_14  : chr [1:12217] " Data Gathering" " Computer vision deep learning models" " Documentation" " Data Capture" ...
 $ Skill_15  : chr [1:12217] " Data Preparation" " ML software and hardware technology" " Reporting" " Data Search" ...
 $ Skill_16  : chr [1:12217] " Data Visualization" " Inference on edge platforms" " Data Analysis" " Data Retrieval" ...
 $ Skill_17  : chr [1:12217] " Machine Learning Frameworks" " Cloud ML training pipelines" " Collaboration" " Data Extraction" ...
 $ Skill_18  : chr [1:12217] " scikitlearn" " HPC experience" " Communication" " Data Classification" ...
 $ Skill_19  : chr [1:12217] " PyTorch" " Performance troubleshooting" " SQL" " Information Filtering" ...
 $ Skill_20  : chr [1:12217] " Dask" " Profiling" " Informatica" " Data Mining Architectures" ...
 $ Skill_21  : chr [1:12217] " Spark" " Roofline model" " Talend" " Modeling Standards" ...
 $ Skill_22  : chr [1:12217] " TensorFlow" " Analytical skills" " Apache NiFi" " Reporting" ...
 $ Skill_23  : chr [1:12217] " Distributed File Systems" " Communication skills" " AWS Redshift" " Data Analysis Methodologies" ...
 $ Skill_24  : chr [1:12217] " Multi node Database Paradigms" NA " Azure SQL Data Warehouse" " Data Engineering" ...
 $ Skill_25  : chr [1:12217] " Open Source ML Software" NA " Financial/Banking" " Database File Systems Optimization" ...
 $ Skill_26  : chr [1:12217] " Responsible AI" NA " CloudBased Data Platforms" " API's" ...
 $ Skill_27  : chr [1:12217] " Explainable AI" NA " Regulatory Compliance" " Analytics as a Service" ...
 $ Skill_28  : chr [1:12217] NA NA NA " Relational Databases" ...
 $ Skill_29  : chr [1:12217] NA NA NA " Dimensional Databases" ...
 $ Skill_30  : chr [1:12217] NA NA NA " Entity Relationships" ...
 $ Skill_31  : chr [1:12217] NA NA NA " Data Warehousing" ...
 $ Skill_32  : chr [1:12217] NA NA NA " Facts" ...
 $ Skill_33  : chr [1:12217] NA NA NA " Dimensions" ...
 $ Skill_34  : chr [1:12217] NA NA NA " Star Schema Concepts" ...
 $ Skill_35  : chr [1:12217] NA NA NA " Star Schema Terminology" ...
 $ Skill_36  : chr [1:12217] NA NA NA " Project Management" ...
 $ Skill_37  : chr [1:12217] NA NA NA " Organizational Skills" ...
 $ Skill_38  : chr [1:12217] NA NA NA " Collaboration" ...
 $ Skill_39  : chr [1:12217] NA NA NA " Communication" ...
 $ Skill_40  : chr [1:12217] NA NA NA " Technical Presentaion Skills" ...
 $ Skill_41  : chr [1:12217] NA NA NA " 12+ Years of Relevant Experience" ...
 $ Skill_42  : chr [1:12217] NA NA NA NA ...
 $ Skill_43  : chr [1:12217] NA NA NA NA ...
 $ Skill_44  : chr [1:12217] NA NA NA NA ...
 $ Skill_45  : chr [1:12217] NA NA NA NA ...
 $ Skill_46  : chr [1:12217] NA NA NA NA ...
 $ Skill_47  : chr [1:12217] NA NA NA NA ...
 $ Skill_48  : chr [1:12217] NA NA NA NA ...
 $ Skill_49  : chr [1:12217] NA NA NA NA ...
 $ Skill_50  : chr [1:12217] NA NA NA NA ...
 $ Skill_51  : chr [1:12217] NA NA NA NA ...
 $ Skill_52  : chr [1:12217] NA NA NA NA ...
 $ Skill_53  : chr [1:12217] NA NA NA NA ...
 $ Skill_54  : chr [1:12217] NA NA NA NA ...
 $ Skill_55  : chr [1:12217] NA NA NA NA ...
 $ Skill_56  : chr [1:12217] NA NA NA NA ...
 $ Skill_57  : chr [1:12217] NA NA NA NA ...
 $ Skill_58  : chr [1:12217] NA NA NA NA ...
 $ Skill_59  : chr [1:12217] NA NA NA NA ...
 $ Skill_60  : chr [1:12217] NA NA NA NA ...
 $ Skill_61  : chr [1:12217] NA NA NA NA ...
 $ Skill_62  : chr [1:12217] NA NA NA NA ...
 $ Skill_63  : chr [1:12217] NA NA NA NA ...
 $ Skill_64  : chr [1:12217] NA NA NA NA ...
 $ Skill_65  : chr [1:12217] NA NA NA NA ...
 $ Skill_66  : chr [1:12217] NA NA NA NA ...
 $ Skill_67  : chr [1:12217] NA NA NA NA ...
 $ Skill_68  : chr [1:12217] NA NA NA NA ...
 $ Skill_69  : chr [1:12217] NA NA NA NA ...
 $ Skill_70  : chr [1:12217] NA NA NA NA ...
 $ Skill_71  : chr [1:12217] NA NA NA NA ...
 $ Skill_72  : chr [1:12217] NA NA NA NA ...
 $ Skill_73  : chr [1:12217] NA NA NA NA ...
 $ Skill_74  : chr [1:12217] NA NA NA NA ...
 $ Skill_75  : chr [1:12217] NA NA NA NA ...
 $ Skill_76  : chr [1:12217] NA NA NA NA ...
 $ Skill_77  : chr [1:12217] NA NA NA NA ...
 $ Skill_78  : chr [1:12217] NA NA NA NA ...
 $ Skill_79  : chr [1:12217] NA NA NA NA ...
 $ Skill_80  : chr [1:12217] NA NA NA NA ...
 $ Skill_81  : chr [1:12217] NA NA NA NA ...
 $ Skill_82  : chr [1:12217] NA NA NA NA ...
 $ Skill_83  : chr [1:12217] NA NA NA NA ...
 $ Skill_84  : chr [1:12217] NA NA NA NA ...
 $ Skill_85  : chr [1:12217] NA NA NA NA ...
 $ Skill_86  : chr [1:12217] NA NA NA NA ...
 $ Skill_87  : chr [1:12217] NA NA NA NA ...
 $ Skill_88  : chr [1:12217] NA NA NA NA ...
 $ Skill_89  : chr [1:12217] NA NA NA NA ...
 $ Skill_90  : chr [1:12217] NA NA NA NA ...
 $ Skill_91  : chr [1:12217] NA NA NA NA ...
 $ Skill_92  : chr [1:12217] NA NA NA NA ...
 $ Skill_93  : chr [1:12217] NA NA NA NA ...
 $ Skill_94  : chr [1:12217] NA NA NA NA ...
 $ Skill_95  : chr [1:12217] NA NA NA NA ...
 $ Skill_96  : chr [1:12217] NA NA NA NA ...
 $ Skill_97  : chr [1:12217] NA NA NA NA ...
  [list output truncated]

We needed to transform this very wide dataframe to a long tidy format for analysis by melting these repeating data elements into rows. We also cleaned those values by removing white spaces in this step.

#-------- Melt skills and remove white spaces

df_skills_melt <- df_skills_work  %>%  
  pivot_longer(-c(posting_id, URL), names_to = "skill_order", values_to = "skill_desc", values_drop_na = TRUE,)

df_skills_melt$skill_desc <- str_trim(df_skills_melt$skill_desc)

df_skills_melt <- df_skills_melt[,c(1,4)]

head(df_skills_melt, 5)

# A tibble: 5 × 2
  posting_id skill_desc      
  <chr>      <chr>           
1 6334263    Machine Learning
2 6334263    Programming     
3 6334263    Python          
4 6334263    Scala           
5 6334263    Java

Finally, we created the skills table and exported it to a local Github repo (note that below code exports to working directory):

# Create skills lookup table (note: this was not used in MySQL due to unresolved errors)
df_skills_master <- df_skills_melt %>% 
  distinct(skill_desc)

df_skills_master <- df_skills_master %>% 
  mutate(skill_id = 1:nrow(df_skills_master))

df_skills_master <- df_skills_master[,c(2,1)]

# Create job posting skills with desc table 

df_job_posting_skills_w_desc <- mutate(left_join(df_skills_melt,
                                        df_skills_master,
                                        by = "skill_desc"))

df_job_posting_skills_w_desc <- df_job_posting_skills_w_desc[,c(1,3,2)]

head(df_job_posting_skills_w_desc)

# A tibble: 6 × 3
  posting_id skill_id skill_desc      
  <chr>         <int> <chr>           
1 6334263           1 Machine Learning
2 6334263           2 Programming     
3 6334263           3 Python          
4 6334263           4 Scala           
5 6334263           5 Java            
6 6334263           6 Data Engineering

#----------- Write all to .csv for SQL 

write.csv(df_city,"tbl_city.csv", row.names=FALSE)
write.csv(df_company,"tbl_company.csv", row.names=FALSE)
write.csv(df_country,"tbl_country.csv", row.names=FALSE)
write.csv(df_job_level,"tbl_job_level.csv", row.names=FALSE)
write.csv(df_job_posting,"tbl_job_posting.csv", row.names=FALSE)
write.csv(df_onsite_flag,"tbl_onsite_flag.csv", row.names=FALSE)
write.csv(df_search_pos,"tbl_search_pos.csv", row.names=FALSE)
write.csv(df_title,"tbl_title.csv", row.names=FALSE)
write.csv(df_job_posting_skills_w_desc,"tbl_job_posting_skills_w_desc.csv", row.names=FALSE)

Import to MySQL:

A SQL script was created to generate the tables and load the data (see from a local copy of our repo as MySQL could not import them directly from Github. In loading the data, we found that two main tables contained an excessive number of rogue commas and invalid characters, which R had handled in the dataframes but MySQL could not import to tables.

To resolve:

All files were converted to tab-delimited before loading, to handle delimiter issues in text strings like job description or skill description
We completed multiple iterations of removing invalid characters from the main tables

During testing, we encountered one error that required us to denormalize the job posting skills table as a workaround due to time constraints: joins to the planned lookup table “skills_master” containing skill descriptions failed. In troubleshooting, we found a large number of silent errors had occurred during the load process (visible when using the “show warnings” command) that were partially resolved by seeking out more invalid characters. A workaround “job_posting_skills_w_desc table” was created, and while not normalized, the functionality of the data was maintained for analysis.

Record counts and links were validated and the database released for analysis.


Non-executable SQL script included below for reference only

-- This script creates the nine tables and loads in data from .txt. files

-- --------------------------------------------------------
-- Create main tbl_job_posting: one row per job posting
-- --------------------------------------------------------

drop table if exists tbl_job_posting;
create table tbl_job_posting (
    posting_id varchar(8) primary key,
    URL longtext,
    first_seen varchar(255),
    last_processed varchar(255),
    title_id varchar(8),
    company_id varchar(8),
    location_id varchar(8),
    city_id varchar(8),
    country_id varchar(8),
    search_pos_id varchar(8),
    job_level_id varchar(8),
    onsite_flag_id varchar(8)
    );
load data local infile 'C:/Users/amand/Git_Projects/DATA607/project_3/tbl_job_posting.txt'
into table tbl_job_posting
ignore 1 rows;
select * from tbl_job_posting;
select count(*) from tbl_job_posting;


-- -----------------------------------------------------------
-- Create all lookup tables
-- -----------------------------------------------------------
drop table if exists tbl_title;
create table tbl_title (
    title_id varchar(8) primary key,
    title_desc longtext
    );
load data local infile 'C:/Users/amand/Git_Projects/DATA607/project_3/tbl_title.txt'
into table tbl_title
ignore 1 rows;
select * from tbl_title;
select count(*) from tbl_title;


drop table if exists tbl_company;
create table tbl_company (
    company_id varchar(8) primary key,
    company_desc longtext
    );
load data local infile 'C:/Users/amand/Git_Projects/DATA607/project_3/tbl_company.txt'
into table tbl_company
ignore 1 rows;
select * from tbl_company;
select count(*) from tbl_company;


drop table if exists tbl_city;
create table tbl_city (
    city_id varchar(8) primary key,
    city_desc longtext
    );
load data local infile 'C:/Users/amand/Git_Projects/DATA607/project_3/tbl_city.txt'
into table tbl_city
ignore 1 rows;
select * from tbl_city;
select count(*) from tbl_city;


drop table if exists tbl_country;
create table tbl_country (
    country_id varchar(8) primary key,
    country_desc longtext
    );
load data local infile 'C:/Users/amand/Git_Projects/DATA607/project_3/tbl_country.txt'
into table tbl_country
ignore 1 rows;
select * from tbl_country;
select count(*) from tbl_country;


drop table if exists tbl_search_position;
create table tbl_search_position (
    search_pos_id varchar(8) primary key,
    search_pos_desc longtext
    );
load data local infile 'C:/Users/amand/Git_Projects/DATA607/project_3/tbl_search_pos.txt'
into table tbl_search_position
ignore 1 rows;
select * from tbl_search_position;
select count(*) from tbl_search_position;


drop table if exists tbl_job_level;
create table tbl_job_level(
    job_level_id varchar(8) primary key,
    job_level_desc longtext
    );
load data local infile 'C:/Users/amand/Git_Projects/DATA607/project_3/tbl_job_level.txt'
into table tbl_job_level
ignore 1 rows;
select * from tbl_job_level;
select count(*) from tbl_job_level;

drop table if exists tbl_onsite_flag;
create table tbl_onsite_flag as
SELECT distinct onsite_flag_id
FROM project_3_team.tbl_job_posting;

ALTER TABLE tbl_onsite_flag ADD COLUMN onsite_desc varchar(20);
UPDATE 
  tbl_onsite_flag 
SET onsite_desc = CASE onsite_flag_id 
                  WHEN 1 THEN "On_Site"
                  WHEN 2 THEN "Hybrid"
                  ELSE "Remote"
                END;


drop table if exists tbl_job_posting_skills_w_desc;
create table tbl_job_posting_skills_w_desc (
    posting_id varchar(20),
    skills_id varchar(20),
    skills_desc varchar(255)
    );
load data local infile 'C:/Users/amand/Git_Projects/DATA607/project_3/tbl_job_posting_skills_w_desc.txt'
into table tbl_job_posting_skills_w_desc
FIELDS TERMINATED BY '\t'
ignore 1 rows;
show warnings;   
select * from tbl_job_posting_skills_w_desc;
select count(*) from tbl_job_posting_skills_w_desc;

1 Asaniczka. (2024). Data Science Job Postings & Skills (2024) [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DS/4407481