Introduction and Approach

This project is a group project with the following members:

Cameron Smith
Brett D
Darius Z

Formal citation for data source used:

JobsPikr. (2019; December). 10000 Data Scientist Job Postings from the USA, Version 1. Retrieved 15 October 2020 from https://www.kaggle.com/jobspikr/data-scientist-job-postings-from-the-usa.

Load the data

A Google Cloud Postgres database was created for this project to maximize reproducibility. The below code loads the data from the normalized database and converts it into data frames that can be used for further analysis.

This required (1) setting up a Postgres database; (2) creating a storage bucket; (3) uploading the raw data to the storage bucket; (4) migrating that raw data into the new database, which consists of two tables in the first normal form 1NF; and (5) opening up the firewall to allow public access. Per the rules for 1NF, each table cell should contain a single value and each record needs to be unique. Both of these conditions are satisfied and thus the data can be considered normalized.

Get the data from the database

# Configure database connection
db <- "postgres"
host_db <- "35.243.187.235"
db_port <- "5432"
# Using public account created with read only access
db_user <- "project3public"
db_password <- "cunymsds2020"
con <- dbConnect(RPostgres::Postgres(), dbname = db, host=host_db, port=db_port, user=db_user, password=db_password)
# Verify the connection to the database is working
dbListTables(con)

## [1] "jobs"   "skills"

# Get job posting data
jobdata_raw <- 
  dbGetQuery(con, "SELECT job_title, job_description, category, company, city, state, job_board, post_date FROM jobs")
# Get skills data
skills_df <- dbGetQuery(con, "SELECT skill_name FROM skills")
# Disconnect from the database
dbDisconnect(con)

Tidy and Transform

With the data now loaded into a data frame, it needs to be prepared for analysis.

Main rules of ‘tidyness’:

Each variable must have its own column
Each observation must have its own row
Each value must have its own cell

Following these rules we end up with a very wide data set in this instance.

# Start with a bit of exploratory data analysis
glimpse(jobdata_raw)

## Rows: 10,002
## Columns: 8
## $ job_title       <chr> "Enterprise Data Scientist I", "Data Scientist", "D...
## $ job_description <chr> "Read what people are saying about working here. \n...
## $ category        <chr> "Accounting/Finance", NA, NA, "Accounting/Finance",...
## $ company         <chr> "Farmers Insurance Group", "Luxoft USA Inc", "Cinci...
## $ city            <chr> "Woodland hills", "Middletown", "New york", "New yo...
## $ state           <chr> "California", "New jersey", "New york", "New york",...
## $ job_board       <chr> "indeed", "dice", "dice", "indeed", "monster", "ind...
## $ post_date       <date> 2019-02-06, 2019-02-05, 2019-02-05, 2019-02-06, 20...

# Summary of job postings by site
jobdata_raw %>% 
  select(job_board) %>% 
  group_by(job_board) %>% 
  summarise(num_postings = n()) %>%
  ggplot(aes(x = job_board, y = num_postings)) +
    geom_bar(stat = 'identity')

## `summarise()` ungrouping output (override with `.groups` argument)

Prepare the data frame to capture the data needed.

# Copy the raw data into a new data frame to preserve the original
jobdata_tidy <- jobdata_raw
# Create wide data frame with new columns and blank based on each skill name
for (i in 1:nrow(skills_df)){
  tempvar = skills_df$skill_name[i]
  jobdata_tidy[[tempvar]] <- ""
}

Now that the data frame is in the intended format, with the key columns created, we will search through each job listing to identify whether each skill is listed, then add it to a count in the columns created.

The first for loop iterates through every row of the job postings data frame (jobdata_tidy), and then the nested for loop checks for each skill in the skills data frame (skills_df) to see if it is in the text of the description of the job posting. If it is there, the column is marked with a “yes” otherwise a “no”. Numbers were used rather than “yes” and “no” to facilitate analysis.

Note: It takes between 1 and 2 minutes to run

# Nested for loop to check each description field for each skill, then mark the relevant skill column accordingly
for (j in 1:nrow(jobdata_tidy)){
    for (k in 1:nrow(skills_df)){
      tempvar2 <- skills_df[k, 1]
      tempregexstring <- tolower(paste0("\\s", tempvar2, "\\s|\\s",tempvar2,"[++][++]]\\s"))
      ifelse((str_detect(tolower(jobdata_tidy$job_description[j]), tempregexstring, negate = FALSE)) == TRUE, 
       jobdata_tidy[j, tempvar2] <- "yes", 
       jobdata_tidy[j, tempvar2] <- "no")
    }
}

With the data calculated and populated, a new data frame will be created in a longer format for easier analysis.

# Create long data frame with key data for easier analysis
skillsdata_long <- jobdata_tidy %>%
  select(9:54) %>%
  gather ("skill_name", "exists", 1:46) %>%
  filter(exists == "yes")

Analysis

[INSERT TEXT HERE TO DESCRIBE ANALYSIS]

skillssummary <- skillsdata_long %>% 
  group_by(skill_name) %>%
  summarise(num_postings = n())
# Visualize in descending order the popularity of all skills identified
skillssummary %>%  
  ggplot(aes(x = reorder(skill_name, num_postings), y = num_postings)) +
  geom_bar(stat = 'identity') +
  coord_flip()

Conclusion

In conclusion, based on a consolidated list of 10,000 data science job postings from a variety of websites, the top 10 data science skills are:

top_n(arrange(skillssummary, desc(num_postings)), 10)

## Selecting by num_postings

## # A tibble: 10 x 2
##    skill_name num_postings
##    <chr>             <int>
##  1 Python             3832
##  2 SQL                3370
##  3 R                  2197
##  4 Spark              1139
##  5 Hadoop             1063
##  6 AWS                1034
##  7 Tableau             843
##  8 SAS                 800
##  9 NoSQL               586
## 10 Azure               505

Data 607 - Project 3

Cameron Smith

2020-10-17