Introduction:

The purpose of this project was to collect the most relevant job skills needed for data science positions. Job descriptions from several job posting websites: Zip Recruiter, Linkedin, Builtin, and Monster were observed and the soft skills and technical skills were collected and put into a table. The table was split up into smaller tables: “Job names” which contained the website link, primary key, position name, company, and platform the job was hosted. The “Job names” table was put into an untidy format, then subsequently made tidy. “Soft skills” was put into a table where the most popular soft skills were put into columns and each post was labeled as having that soft skill or not. The “soft skills” table was then made to be a list of every soft skill for each job post. The “technical skills” table was left in a tidy format. These tables were merged together matching with the primary key column. Analysis was done to find the most popular soft and technical skills. All tables were uploaded into the database.

library(RODBC)
library(RSQLite)
library(RMySQL)

## Loading required package: DBI

## 
## Attaching package: 'RMySQL'

## The following object is masked from 'package:RSQLite':
## 
##     isIdCurrent

library(DBI)
library(odbc)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(dbplyr)

## 
## Attaching package: 'dbplyr'

## The following objects are masked from 'package:dplyr':
## 
##     ident, sql

library(RPostgres)
library(sqldf)

## Loading required package: gsubfn

## Loading required package: proto

## sqldf will default to using MySQL

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.4.4     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dbplyr::ident() masks dplyr::ident()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dbplyr::sql()   masks dplyr::sql()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Libraries:

Database connection as well as data manipulating libraries were loaded.

vectorpass <- c('t','h','e','q','u','i','c','k','b','r','o','w','n','f','o','x','j','u','m','p','o','v','e','r','l','a','z','y','d','o','g','0','1','2','3','4','5','6','7','8','9',".")
#list of alphanumeric characters to make the password into a simple code.

pass = paste(vectorpass[8],vectorpass[3],vectorpass[6],vectorpass[1],vectorpass[2],vectorpass[42],vectorpass[29],vectorpass[3],vectorpass[13],vectorpass[6],vectorpass[22],vectorpass[11],vectorpass[36],vectorpass[41],sep="")
#constructing the password from the list

mysqlconnection <- dbConnect(RMySQL::MySQL(),

                             host="cunydata607sql.mysql.database.azure.com",

                             dbname = "keith.denivo49",

                             port=3306,

                             user="keith.denivo49",

                             password=pass)
#connecting to database.
src_dbi(mysqlconnection)

## src:  mysql 8.0.35 [@cunydata607sql.mysql.database.azure.com:/keith.denivo49]
## tbls: job_names_tidy, job_names_untidy, merged_df, soft_skills,
##   soft_skills_tidy, technical_skills

#lists all the tables in the database with tbls.

Connection:

The previously created database was successfully connected to. The relevant csv files of the tables of job names untidy, soft skills, and technical skills were uploaded to the database using MySQL workbench. The tables could then be read into a data frame.

#Reading Tables into R

soft_skills_df <- dbReadTable(mysqlconnection, "soft_skills")
job_names_df <- dbReadTable(mysqlconnection, "job_names_untidy")
technical_skills_df <- dbReadTable(mysqlconnection, "technical_skills")
#read tables from database into a dataframe.
View(soft_skills_df)
View(job_names_df)
View(technical_skills_df)

Data Frames:

Each Table was put into a dataframe.

Job Names Tidying

# Extracting the rows and columns from the job names data frame and putting them into 4 separate data frames 
job_names_one<- job_names_df[1:21, 1:4]
job_names_one<- job_names_one[-1, ]
colnames(job_names_one)<- c("Primary Key", "Job Posting Link", "Job Title", "Company")
job_names_two<- job_names_df[1:21, 5:8]
job_names_two<- job_names_two[-1, ]
colnames(job_names_two)<- c("Primary Key", "Job Posting Link", "Job Title", "Company")
job_names_three<- job_names_df[1:21, 9:12]
job_names_three<- job_names_three[-1, ]
colnames(job_names_three)<- c("Primary Key", "Job Posting Link", "Job Title", "Company")
job_names_four<- job_names_df[1:21, 13:16]
job_names_four<- job_names_four[-1, ]
colnames(job_names_four)<- c("Primary Key", "Job Posting Link", "Job Title", "Company")

# Combining the four separate data frames
jc<- rbind(job_names_one, job_names_two, job_names_three, job_names_four)

jn_df_final <- jc %>%
mutate("Platform" = str_extract(jc$`Job Posting Link`, "monster|linkedin|builtin|ziprecruiter"))
#extract the name of the website and put into a column.

Job Names Tidy:

Job names untidy was in a long format with each row having multiple job posting. The data was made tidy by making each row to have the information of a single job positing.

Soft Skills Tidying

#cleaning up some of the column names
colnames(soft_skills_df)[1] <- "Primary Key" 
colnames(soft_skills_df)[2] <- "communication" 
colnames(soft_skills_df)[3] <- "collaboration" 
colnames(soft_skills_df)[13] <- "detail oriented" 
colnames(soft_skills_df)[16] <- "cross functional" 
colnames(soft_skills_df)[18] <- "time management" 
colnames(soft_skills_df)[20] <- "fast paced" 
colnames(soft_skills_df)[28] <- "driven" 
colnames(soft_skills_df)[30] <- "critical thinking"

Soft skills Cleaning:

Initially some the column names were renamed.

# Function to create new column excluding the first column
create_new_column <- function(row) {
  # Exclude the first column
  row <- row[-1]
  # If all elements in the remaining row are 0, return empty string
  if (all(row == 0)) {
    return("")
  } else {
    # Extract column names where the value is 1
    columns_with_ones <- colnames(soft_skills_df)[-1][as.logical(row)]
    # Concatenate column names with a comma separator
    return(paste(columns_with_ones, collapse = ","))
  }
}

# Apply the function to each row of the dataframe
new_column <- apply(soft_skills_df, 1, create_new_column)

# Add the new column to the dataframe
soft_skills_df$NewColumn <- new_column

# Print
View(soft_skills_df)

Soft skills Cleaning:

If the soft skill was observed in a particular job post then it was assigned a 1 if it was not it was assigned a 0 in the original table. If the word had a zero nothing was extracted, if it had a one it was extracted and put into a column with each word separated with a comma.

#filtering the columns to create a new dataframe
softskills_final <- select(soft_skills_df, "Primary Key", "NewColumn" )
colnames(softskills_final)[2] <- "Soft Skills"

Soft skills data frame:

the new column with the soft skills was put together with the primary key for each job posting.

Soft Skills analysis

Identify the most prevalent soft skill among the sampled job postings:

# Split the Soft Skills column into individual soft skills
softskills_split <- unlist(strsplit(softskills_final$ "Soft Skills", ","))

# Remove leading and trailing whitespaces
softskills_split <- trimws(softskills_split)

# Count the occurrences of each soft skill
soft_skills_counts <- table(softskills_split)

# Find the soft skill with the highest count
most_valued_soft_skill <- names(which.max(soft_skills_counts))

# Print the most prevalent soft skill
print(most_valued_soft_skill)

## [1] "communication"

Communication is the most prevalent soft skill among the job postings we sampled.

Identfy the top 10 soft skills:

# Get the top 10 soft skills
top_10_soft_skills <- head(sort(soft_skills_counts, decreasing = TRUE), 10)

# Create the bar plot 
barplot(top_10_soft_skills, 
        col = rainbow(length(top_10_soft_skills)),
        main = "Top 10 Soft Skills",
        xlab = "Soft Skills",
        ylab = "Frequency",
        las = 2, 
        cex.names = 0.8,
        xaxt = "n")  # Remove x-axis labels

# Add a legend 
legend("topright", 
       legend = names(top_10_soft_skills), 
       fill = rainbow(length(top_10_soft_skills)),
       cex = 0.7,  # Set the font size of the legend
       ncol = 2)   # Set the number of columns in the legend

Technical Skills Analysis

Identify the most prevalent technical skill among the sampled job postings:

# Split the Technical Skills column into individual tech skills
technical_skills_split <- unlist(strsplit(technical_skills_df$ "Technical.Skills", ","))

# Remove leading and trailing whitespaces
technical_skills_split <- trimws(technical_skills_split)

# Convert all technical skills to lowercase to account for both "python" and "Python"
technical_skills_lower <- tolower(technical_skills_split)

# Count the occurrences of each technical skill
technical_skills_counts <- table(technical_skills_lower)

# Find the technical skill with the highest count
top_tech_skill <- names(which.max(technical_skills_counts))

# Print the most prevalent technical skill
print(top_tech_skill)

## [1] "python"

Python is the most prevalent technical skill among the job postings we sampled.
Identify the top 10 technical skills:

top_ten_tech_skills <- head(sort(technical_skills_counts, decreasing = TRUE), 10)

# Create the bar plot
barplot(top_ten_tech_skills,
        col = heat.colors(length(top_ten_tech_skills)),
        main = "Top 10 Most Prevalent Technical Skills",
        xlab = "Technical Skills",
        ylab = "Frequency",
        las = 2, 
        cex.names = 0.8,
         xaxt = "n")

# Create a legend 
legend("topright", 
       legend = names(top_ten_tech_skills), 
       fill = heat.colors(length(top_ten_tech_skills)),
       title = "Technical Skills",
       cex = 0.6)

Combine all data frames for a merged data frame anyone can use

#Rename to match the other "Primary Key" column names
names(technical_skills_df)[names(technical_skills_df) == "Primary.Key"] <- "Primary Key"

# Rename the "Technical.Skills" column for consistency
names(technical_skills_df)[names(technical_skills_df) == "Technical.Skills"] <- "Technical Skills"

# Check the modified column names
print(names(technical_skills_df))

## [1] "Primary Key"      "Technical Skills"

# Merge the data frames based on the "Primary Key" columns
merged_df <- jn_df_final %>%
              merge(softskills_final, by = "Primary Key") %>%
              merge(technical_skills_df, by = "Primary Key")

Data frames were merged by primary key column.

#The primary key is not organized in numerical order due to the objects being characters. Convert them to numeric data types for effective use in the future.
merged_df$`Primary Key` <- as.numeric(merged_df$`Primary Key`)

#Check col names
colnames(merged_df)

## [1] "Primary Key"      "Job Posting Link" "Job Title"        "Company"         
## [5] "Platform"         "Soft Skills"      "Technical Skills"

Upload the data frames to the database:

conn <-  mysqlconnection
con <- mysqlconnection
#merged_df |> 
#dbCreateTable(con = conn, 
#             name = "merged_df", 
#             value = merged_df, 
#             field.types = c(`Primary Key`="INTEGER(max)",
#             `Job Posting Link`="varchar(max)",
#             `Job Title`="varchar(max)",
#             Company="varchar(max)",
#             Platform="varchar(max)",
 #            `Soft Skills`="varchar(max)",
 #            `Technical Skills`="varchar(max)")
 #            )
#table was created

dbWriteTable(conn = con,
             name = "merged_df",
             value = merged_df,            
             overwrite = TRUE)

## [1] TRUE

dbWriteTable(conn = con,
             name = "job_names_tidy",
             value = jn_df_final,            
             overwrite = TRUE)

## [1] TRUE

dbWriteTable(conn = con,
             name = "soft_skills_tidy",
             value = softskills_final,           
             overwrite = TRUE)

## [1] TRUE

src_dbi(mysqlconnection)

## src:  mysql 8.0.35 [@cunydata607sql.mysql.database.azure.com:/keith.denivo49]
## tbls: job_names_tidy, job_names_untidy, merged_df, soft_skills,
##   soft_skills_tidy, technical_skills

Conclusion:

The database included several tables. The tables were connected together based on the designated primary key column. Each data frame was put into the database. The tables were all merged together into one table called merged_df. Python, SQL, R, Spark, tableau, machine learning were the most requested technical skills. The most requested soft skills were: communication, collaboration, working on a team, and problem solving.

Project 3 Data Science Job Skills

Keith DeNivo, Seyi Jimoh, Adriana Medina, Ursula Podosenin

2024-03-13