Create a short document with the names of group members. Briefly describe the collaboration tools you will use as a group, including for communication, code sharing, and project documentation. Identify your data sources, where the data can be found, and how to load it. Finally, create at least a logical model for your normalized database and produce an Entity-Relationship (ER) diagram documenting your database design.
The team is using RStudio Cloud (https://rstudio.cloud) for collaboration and code development, which allows us to view and share code within a shared project. We are using R Markdown within RStudio Cloud for project documentation, published through RPubs (https://rpubs.com). To create the ERD, the team used Quick Database Diagrams (https://www.quickdatabasediagrams.com). Source CSV and RMD files are saved in a GitHub repository so files can be accessed centrally (https://github.com/johnnydrodriguez/data607_project3). In addition, the team communicates over Zoom and Slack.
Project Tools Insights
- We looked for collaboration tools that allowed for real-time editing.
- We experimented with Google Colab (with R) and RStudio Cloud, but could not get them to work for our real-time editing needs.
- We used GitHub as a central repository for shared code and collaborated in real time over Zoom sessions and Slack. This was effective in moving the project forward.
The data used to answer the question is taken from Glassdoor via Kaggle: the data was scraped from Glassdoor and posted to Kaggle.
Source: https://www.kaggle.com/datasets/nikhilbhathi/data-scientist-salary-us-glassdoor?resource=download
Source CSV (Uncleaned for Project): https://raw.githubusercontent.com/johnnydrodriguez/data607_project3/main/glassdoor_2021.csv
Data Sources Insights
- Datasets that could answer this specific question were challenging to find.
- Job sites proved challenging, as their APIs were deprecated and public datasets were not available.
- Federal government sites are data-rich, but job titles, skills, and other variables that might have directly helped answer the question were typically aggregated generically under ‘Data Science’ or ‘Computer Information’ categories.
- The Glassdoor data used in the project was scraped and made available through Kaggle as a set of data science job listings from 2021. It contained separate skills data we could extract from the individual job listings.
The team initially divided the raw data into three tables: Company, Jobs, and Tech.
https://github.com/johnnydrodriguez/data607_project3/blob/main/QuickDBD-export.png
Data Science Valuable Skills ERD
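For reference, the three-table design translates into roughly the following DDL. This is an illustrative sketch only: the column lists are abridged, the types and foreign-key layout are our assumptions, and in the project the tables are actually created later with dbWriteTable (the connection object con is defined in the database section below).

# Abridged, assumed DDL for the logical model; Jobs, Company, and Tech share the
# listing index as their key, so Company and Tech reference Jobs one-to-one.
dbSendQuery(con, "CREATE TABLE jobs (JobID INT PRIMARY KEY,
                                     Job_Title VARCHAR(255),
                                     Avg_Salary_K DOUBLE)")
dbSendQuery(con, "CREATE TABLE company (CompanyID INT PRIMARY KEY,
                                        Company_Name VARCHAR(255),
                                        Industry VARCHAR(255),
                                        FOREIGN KEY (CompanyID) REFERENCES jobs(JobID))")
dbSendQuery(con, "CREATE TABLE tech (TechID INT PRIMARY KEY,
                                     Python TINYINT,
                                     spark TINYINT,
                                     FOREIGN KEY (TechID) REFERENCES jobs(JobID))")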
# install.packages("tidyverse", repos = "http://cran.us.r-project.org")
# install.packages("RMySQL", repos = "http://cran.us.r-project.org")
# install.packages("ggplot2", repos = "http://cran.us.r-project.org")
# install.packages("aws.s3", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(RMySQL)
library(ggplot2)
library(aws.s3)
We used AWS S3 to store and read the raw data, and AWS RDS for the project database. This initial step loads the raw CSV from an AWS S3 bucket.
#Connect to S3------------------------------------------------------------------
#Use IAM login credentials/Set to "read-only"
Sys.setenv("AWS_ACCESS_KEY_ID" = AWS_ACCESS_KEY_ID,
"AWS_SECRET_ACCESS_KEY" = AWS_SECRET_ACCESS_KEY)
#Check bucket
get_bucket("data-skills-p3")
## Bucket: data-skills-p3
##
## $Contents
## Key: glassdoor_2021.csv
## LastModified: 2022-10-21T00:15:57.000Z
## ETag: "0b29dbb1049bdfcd1b77637c91325dce"
## Size (B): 3155265
## Owner: jlixander
## Storage class: STANDARD
#Load csv file from s3 into a df
jobs_df <- aws.s3::s3read_using(read.csv, object = "s3://data-skills-p3/glassdoor_2021.csv")
glimpse(jobs_df, n = 1)
## Rows: 742
## Columns: 42
## $ index <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ Job.Title <chr> "Data Scientist", "Healthcare Data Scientist", "Dat…
## $ Salary.Estimate <chr> "$53K-$91K (Glassdoor est.)", "$63K-$112K (Glassdoo…
## $ Job.Description <chr> "Data Scientist\nLocation: Albuquerque, NM\nEducati…
## $ Rating <dbl> 3.8, 3.4, 4.8, 3.8, 2.9, 3.4, 4.1, 3.8, 3.3, 4.6, 3…
## $ Company.Name <chr> "Tecolote Research\n3.8", "University of Maryland M…
## $ Location <chr> "Albuquerque, NM", "Linthicum, MD", "Clearwater, FL…
## $ Headquarters <chr> "Goleta, CA", "Baltimore, MD", "Clearwater, FL", "R…
## $ Size <chr> "501 - 1000 ", "10000+ ", "501 - 1000 ", "1001 - 50…
## $ Founded <int> 1973, 1984, 2010, 1965, 1998, 2000, 2008, 2005, 201…
## $ Type.of.ownership <chr> "Company - Private", "Other Organization", "Company…
## $ Industry <chr> "Aerospace & Defense", "Health Care Services & Hosp…
## $ Sector <chr> "Aerospace & Defense", "Health Care", "Business Ser…
## $ Revenue <chr> "$50 to $100 million (USD)", "$2 to $5 billion (USD…
## $ Competitors <chr> "-1", "-1", "-1", "Oak Ridge National Laboratory, N…
## $ Hourly <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Employer.provided <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Lower.Salary <int> 53, 63, 80, 56, 86, 71, 54, 86, 38, 120, 126, 64, 1…
## $ Upper.Salary <int> 91, 112, 90, 97, 143, 119, 93, 142, 84, 160, 201, 1…
## $ Avg.Salary.K. <dbl> 72.0, 87.5, 85.0, 76.5, 114.5, 95.0, 73.5, 114.0, 6…
## $ company_txt <chr> "Tecolote Research", "University of Maryland Medica…
## $ Job.Location <chr> "NM", "MD", "FL", "WA", "NY", "TX", "MD", "CA", "NY…
## $ Age <int> 48, 37, 11, 56, 23, 21, 13, 16, 7, 12, 10, 53, 59, …
## $ Python <int> 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, …
## $ spark <int> 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, …
## $ aws <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ excel <int> 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ sql <int> 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, …
## $ sas <int> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ keras <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ pytorch <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ scikit <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ tensor <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ hadoop <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
## $ tableau <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ bi <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ flink <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mongo <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ google_an <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ job_title_sim <chr> "data scientist", "data scientist", "data scientist…
## $ seniority_by_title <chr> "na", "na", "na", "na", "na", "na", "na", "na", "na…
## $ Degree <chr> "M", "M", "M", "na", "na", "na", "na", "M", "P", "n…
Once extracted, significant cleanup was needed to convert the raw data into tables we could use for the plots and analysis. We ultimately added one table to the original three-table design.
# Remove duplicates
jobs_unique <- jobs_df %>% distinct(Job.Description, Company.Name, .keep_all = TRUE)

# Normalize curly apostrophes in the job descriptions
jobs_unique$Job.Description <- gsub("’", "'", jobs_unique$Job.Description)

# Identify education keywords and collect any matches per job description
education <- c("Masters", "Bachelors", "Master's", "Bachelor's", "Phd", "PhD", "Ph.D")
jobs_unique$education <- mapply(function(x) paste(education[str_detect(x, education)],
                                                  collapse = ","), jobs_unique$Job.Description)
#Extracts additional tech (R and API) from the job description
jobs_unique <- jobs_unique %>%
  mutate(R = str_match(Job.Description, '.*(\\sR\\s).*')[,2],
         API = str_match(Job.Description, '.*(API).*')[,2]) %>%
  mutate_at(vars(R), ~replace(., !is.na(.), 1)) %>%
  mutate_at(vars(R), ~replace(., is.na(.), 0)) %>%
  mutate_at(vars(API), ~replace(., !is.na(.), 1)) %>%
  mutate_at(vars(API), ~replace(., is.na(.), 0))
# Extract skills from the job description and convert matches to 1/0 indicators
Skills <- jobs_unique %>%
  select(index, Job.Description) %>%
  mutate(Data_Analysis = str_match(Job.Description, regex('.*(data analy).*', ignore_case = TRUE))[,2],
         Modeling = str_match(Job.Description, regex('.*(model).*', ignore_case = TRUE))[,2],
         Visualization = str_match(Job.Description, regex('.*(visualization).*', ignore_case = TRUE))[,2],
         ML = str_match(Job.Description, regex('.*(machine learning).*', ignore_case = TRUE))[,2],
         ETL = str_match(Job.Description, '.*(ETL).*')[,2],
         Pred_Analysis = str_match(Job.Description, regex('.*(predictive analy).*', ignore_case = TRUE))[,2],
         Statistics = str_match(Job.Description, regex('.*(statis).*', ignore_case = TRUE))[,2],
         Economics = str_match(Job.Description, regex('.*(econom).*', ignore_case = TRUE))[,2],
         Math = str_match(Job.Description, regex('.*(math).*', ignore_case = TRUE))[,2],
         Problem_Solving = str_match(Job.Description, regex('.*(problem-solv).*', ignore_case = TRUE))[,2],
         Communication = str_match(Job.Description, regex('.*(communicat).*', ignore_case = TRUE))[,2],
         Decision_Making = str_match(Job.Description, regex('.*(decision-mak).*', ignore_case = TRUE))[,2],
         Creativity = str_match(Job.Description, regex('.*(creativity).*', ignore_case = TRUE))[,2],
         Critical_Thinking = str_match(Job.Description, regex('.*(critical think).*', ignore_case = TRUE))[,2]) %>%
  mutate_at(vars(-c(index)), ~replace(., !is.na(.), 1)) %>%
  mutate_at(vars(-c(index)), ~replace(., is.na(.), 0)) %>%
  select(-c(Job.Description)) %>%
  rename('SkillsID' = 'index')
#Tidy/transform the Glassdoor data and create the "Company" data frame
Company <- select(jobs_unique, c('index','Rating', 'Company.Name','Location','Headquarters','Size','Founded','Type.of.ownership','Industry','Sector','Revenue','Competitors','company_txt','Age'))

# Rename columns to match the database design
colnames(Company)[colnames(Company) == "index"] <- "CompanyID"
colnames(Company)[colnames(Company) == "Rating"] <- "Column_Rating"
colnames(Company)[colnames(Company) == "Company.Name"] <- "Company_Name"
colnames(Company)[colnames(Company) == "Headquarters"] <- "HQ_Address"
colnames(Company)[colnames(Company) == "Type.of.ownership"] <- "type_of_ownership"
colnames(Company)[colnames(Company) == "Age"] <- "Company_Age"
#Tidy/transform the Glassdoor data and create the "Jobs" data frame
Jobs <- select(jobs_unique, c('index','Job.Title', 'Salary.Estimate','Job.Description','Hourly','Lower.Salary','Upper.Salary','Avg.Salary.K.','Job.Location','job_title_sim','seniority_by_title','Degree', 'education'))

# Rename columns to match the database design
colnames(Jobs)[colnames(Jobs) == "index"] <- "JobID"
colnames(Jobs)[colnames(Jobs) == "Job.Title"] <- "Job_Title"
colnames(Jobs)[colnames(Jobs) == "Salary.Estimate"] <- "Salary_Estimate"
colnames(Jobs)[colnames(Jobs) == "Job.Description"] <- "Job_Description"
colnames(Jobs)[colnames(Jobs) == "Lower.Salary"] <- "Lower_Salary_K"
colnames(Jobs)[colnames(Jobs) == "Upper.Salary"] <- "Upper_Salary_K"
colnames(Jobs)[colnames(Jobs) == "Avg.Salary.K."] <- "Avg_Salary_K"
colnames(Jobs)[colnames(Jobs) == "Job.Location"] <- "Job_Location"
# Create the "Tech" data frame from the Glassdoor data
Tech <- select(jobs_unique, c("index","Python","spark","aws","excel","sql","sas","keras","pytorch","scikit","tensor","hadoop","tableau","bi","flink","mongo","google_an","R", "API"))

# Change the "index" column to "TechID"
colnames(Tech)[colnames(Tech) == "index"] <- "TechID"
Once the normalized tables were ready, we connected to the AWS RDS MySQL instance and loaded the tables into the SQL database. For the purposes of our project, the rest of the transformations, plots, and analysis are based on the tables read back from the AWS MySQL instance.
#Connection details
endpoint <- "database-1.c4xyb2t3srpc.us-east-1.rds.amazonaws.com"
port     <- 3306
username <- "admin"
dbname   <- "database1"
region   <- "us-east-1a"
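The password object used below is created outside the published document so credentials are not exposed. One way to supply it (our assumption, not necessarily how the team ran it) is through an environment variable:

# Assumed approach: read the database password from an environment variable
# (e.g., DB_PASSWORD set in .Renviron) so it never appears in the notebook.
password <- Sys.getenv("DB_PASSWORD")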
# Connect to the database using an IAM authentication token.
con <- DBI::dbConnect(
  RMySQL::MySQL(),
  host = "database-1.c4xyb2t3srpc.us-east-1.rds.amazonaws.com",
  port = 3306,
  db = "database1",
  user = "admin",
  password = password
)
## TEST DATABASE CONNECTION
class(con)
## [1] "MySQLConnection"
## attr(,"package")
## [1] "RMySQL"
#Drop Tables If they exist before creating (to avoid SQL error)
dbSendQuery(con, "DROP TABLE IF EXISTS company,jobs,tech, skills")
## <MySQLResult:65024,0,0>
#Upload tables to AWS RDS - MySQL Instance
dbWriteTable(conn = con, name = 'jobs', value = as.data.frame(Jobs))
## [1] TRUE
dbWriteTable(conn = con, name = 'tech', value = as.data.frame(Tech))
## [1] TRUE
dbWriteTable(conn = con, name = 'company', value = as.data.frame(Company))
## [1] TRUE
dbWriteTable(conn = con, name = 'skills', value = as.data.frame(Skills))
## [1] TRUE
#List database tables
dbListTables(con)
## [1] "company" "jobs" "skills" "tech"
#Read tables from SQL back to R for analysis
jobs_tbl <- dbGetQuery(con, "SELECT * FROM jobs")
tech_tbl <- dbGetQuery(con, "SELECT * FROM tech")
company_tbl <- dbGetQuery(con, "SELECT * FROM company")
skills_tbl <- dbGetQuery(con, "SELECT * FROM skills")
glimpse(jobs_tbl)
## Rows: 463
## Columns: 14
## $ row_names <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", …
## $ JobID <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ Job_Title <chr> "Data Scientist", "Healthcare Data Scientist", "Dat…
## $ Salary_Estimate <chr> "$53K-$91K (Glassdoor est.)", "$63K-$112K (Glassdoo…
## $ Job_Description <chr> "Data Scientist\nLocation: Albuquerque, NM\nEducati…
## $ Hourly <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Lower_Salary_K <dbl> 53, 63, 80, 56, 86, 71, 54, 86, 38, 120, 126, 64, 1…
## $ Upper_Salary_K <dbl> 91, 112, 90, 97, 143, 119, 93, 142, 84, 160, 201, 1…
## $ Avg_Salary_K <dbl> 72.0, 87.5, 85.0, 76.5, 114.5, 95.0, 73.5, 114.0, 6…
## $ Job_Location <chr> "NM", "MD", "FL", "WA", "NY", "TX", "MD", "CA", "NY…
## $ job_title_sim <chr> "data scientist", "data scientist", "data scientist…
## $ seniority_by_title <chr> "na", "na", "na", "na", "na", "na", "na", "na", "na…
## $ Degree <chr> "M", "M", "M", "na", "na", "na", "na", "M", "P", "n…
## $ education <chr> "Master's,Bachelor's,Ph.D", "Master's", "Master's,P…
glimpse(tech_tbl)
## Rows: 463
## Columns: 20
## $ row_names <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12…
## $ TechID <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ Python <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, …
## $ spark <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, …
## $ aws <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, …
## $ excel <dbl> 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, …
## $ sql <dbl> 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, …
## $ sas <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ keras <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ pytorch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ scikit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ tensor <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, …
## $ hadoop <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ tableau <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ bi <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ flink <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mongo <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ google_an <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ R <chr> "0", "0", "1", "0", "1", "0", "0", "0", "0", "0", "1", "0", …
## $ API <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
glimpse(company_tbl)
## Rows: 463
## Columns: 15
## $ row_names <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "…
## $ CompanyID <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ Column_Rating <dbl> 3.8, 3.4, 4.8, 3.8, 2.9, 3.4, 4.1, 3.8, 3.3, 4.6, 3.…
## $ Company_Name <chr> "Tecolote Research\n3.8", "University of Maryland Me…
## $ Location <chr> "Albuquerque, NM", "Linthicum, MD", "Clearwater, FL"…
## $ HQ_Address <chr> "Goleta, CA", "Baltimore, MD", "Clearwater, FL", "Ri…
## $ Size <chr> "501 - 1000 ", "10000+ ", "501 - 1000 ", "1001 - 500…
## $ Founded <dbl> 1973, 1984, 2010, 1965, 1998, 2000, 2008, 2005, 2014…
## $ type_of_ownership <chr> "Company - Private", "Other Organization", "Company …
## $ Industry <chr> "Aerospace & Defense", "Health Care Services & Hospi…
## $ Sector <chr> "Aerospace & Defense", "Health Care", "Business Serv…
## $ Revenue <chr> "$50 to $100 million (USD)", "$2 to $5 billion (USD)…
## $ Competitors <chr> "-1", "-1", "-1", "Oak Ridge National Laboratory, Na…
## $ company_txt <chr> "Tecolote Research", "University of Maryland Medical…
## $ Company_Age <dbl> 48, 37, 11, 56, 23, 21, 13, 16, 7, 12, 10, 53, 59, 9…
glimpse(skills_tbl)
## Rows: 463
## Columns: 16
## $ row_names <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "…
## $ SkillsID <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ Data_Analysis <chr> "1", "0", "1", "1", "1", "1", "0", "0", "0", "1", "1…
## $ Modeling <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "0…
## $ Visualization <chr> "1", "1", "1", "0", "0", "1", "1", "0", "0", "0", "0…
## $ ML <chr> "1", "1", "1", "1", "1", "1", "0", "1", "0", "1", "1…
## $ ETL <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0…
## $ Pred_Analysis <chr> "0", "1", "0", "0", "0", "0", "0", "0", "0", "1", "0…
## $ Statistics <chr> "0", "1", "1", "1", "1", "1", "1", "1", "0", "1", "1…
## $ Economics <chr> "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "1…
## $ Math <chr> "1", "1", "1", "1", "1", "0", "1", "1", "0", "0", "0…
## $ Problem_Solving <chr> "0", "0", "0", "0", "0", "1", "1", "0", "1", "0", "0…
## $ Communication <chr> "1", "1", "1", "0", "1", "1", "0", "1", "0", "1", "0…
## $ Decision_Making <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0…
## $ Creativity <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0…
## $ Critical_Thinking <chr> "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0…
This plot ranks the technical skills by the median salary associated with each skill.
- The dataset contained an average salary per job, which we mapped to each technical skill included in the dataset.
- For example, if a job listed mongo in its description, that job’s average salary was mapped to the mongo skill.
- Each skill thus ended up with a range of associated salaries.
- The plot ranks the technical skills by the median of the salaries associated with each skill; a compact sketch of this computation follows this list.
- The skills that rose to the top are all related to machine and deep learning technologies.
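As an illustrative alternative (not the project code), the same per-skill medians can be computed by pivoting the 0/1 skill flags to long form. Note that this sketch coerces the character R and API flags to numeric, so it covers all 18 technologies, whereas the project pipeline below keeps the 16 numeric columns; the same pattern would also work for the soft-skills table.

# Sketch: map each listing's average salary to every flagged skill, then take medians
tech_medians <- left_join(jobs_tbl, tech_tbl, by = c("JobID" = "TechID")) %>%
  select(Avg_Salary_K, Python:API) %>%
  mutate_at(vars(Python:API), as.numeric) %>%   # R and API arrive as character
  pivot_longer(-Avg_Salary_K, names_to = "TechTools", values_to = "flag") %>%
  filter(flag == 1) %>%
  group_by(TechTools) %>%
  summarise(MedianSalary = median(Avg_Salary_K, na.rm = TRUE))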
# Creates the join between the jobs and tech tables
df_join1 <- left_join(jobs_tbl, tech_tbl, by = c("JobID" = "TechID"))

# Selects the columns needed for the analysis and plot
df_join1 <- select(df_join1, c('Avg_Salary_K', 'Python', 'spark', 'aws', 'excel', 'sql', 'sas', 'keras', 'pytorch', 'scikit', 'tensor', 'hadoop', 'tableau', 'bi', 'flink', 'mongo', 'google_an', 'R', 'API'))

# Inserts the average salary wherever a technical skill is present (i.e., where the skill = 1),
# then calculates the median salary per skill.
df_join1 <- df_join1 %>%
  mutate(Python = ifelse(Python == "1", Avg_Salary_K, Python),
         spark = ifelse(spark == "1", Avg_Salary_K, spark),
         aws = ifelse(aws == "1", Avg_Salary_K, aws),
         excel = ifelse(excel == "1", Avg_Salary_K, excel),
         sql = ifelse(sql == "1", Avg_Salary_K, sql),
         sas = ifelse(sas == "1", Avg_Salary_K, sas),
         keras = ifelse(keras == "1", Avg_Salary_K, keras),
         pytorch = ifelse(pytorch == "1", Avg_Salary_K, pytorch),
         scikit = ifelse(scikit == "1", Avg_Salary_K, scikit),
         tensor = ifelse(tensor == "1", Avg_Salary_K, tensor),
         hadoop = ifelse(hadoop == "1", Avg_Salary_K, hadoop),
         tableau = ifelse(tableau == "1", Avg_Salary_K, tableau),
         bi = ifelse(bi == "1", Avg_Salary_K, bi),
         flink = ifelse(flink == "1", Avg_Salary_K, flink),
         mongo = ifelse(mongo == "1", Avg_Salary_K, mongo),
         google_an = ifelse(google_an == "1", Avg_Salary_K, google_an),
         R = ifelse(R == "1", Avg_Salary_K, R),
         API = ifelse(API == "1", Avg_Salary_K, API)) %>%
  select(Python, spark, aws, excel, sql, sas, keras, pytorch, scikit, tensor, hadoop, tableau, bi, flink, mongo, google_an, R, API) %>%
  na_if(0) %>%
  summarize_if(is.numeric, median, na.rm = TRUE)

# Creates a long table for the plot
tech_join <- df_join1 %>%
  pivot_longer(
    cols = 1:16, # the 16 numeric skill columns; R and API remain character and are dropped by summarize_if above
    names_to = 'TechTools',
    values_to = 'MedianSalary') %>%
  select(TechTools, MedianSalary)
# Plots Technical Skill by associated median salary
ggplot(tech_join, aes(x=reorder(TechTools, +MedianSalary), y=MedianSalary)) +
geom_bar(stat="identity", fill = "#E59866") +
coord_flip() +
xlab("Tech Skill") +
ylab("Median Salary of Tech Skill (K)" ) +
ggtitle("Ranked Technical Skills by Associated Median Salary") + theme_minimal()
This plot ranks the soft and knowledge skills by the median salary associated with each skill.
- The soft-skill indicators were mined from the individual job descriptions.
- As above, each soft skill was associated with the job’s average salary.
- The plot ranks the soft skills by the median of the salaries associated with each skill.
- The skills that rose to the top all appear to be related to the machine learning domain.
# Creates the join between the jobs and skills tables
df_join2 <- left_join(jobs_tbl, skills_tbl, by = c("JobID" = "SkillsID"))

# Selects the columns needed for the analysis and plot
df_join2 <- select(df_join2, c('Avg_Salary_K', 'Data_Analysis', 'Modeling', 'Visualization', 'ML', 'ETL', 'Pred_Analysis', 'Statistics', 'Economics', 'Math', 'Problem_Solving', 'Communication', 'Decision_Making', 'Creativity', 'Critical_Thinking'))

# Inserts the average salary wherever a soft skill is present (i.e., where the skill = 1),
# then calculates the median salary per skill.
df_join2 <- df_join2 %>%
  mutate(Data_Analysis = ifelse(Data_Analysis == "1", Avg_Salary_K, Data_Analysis),
         Modeling = ifelse(Modeling == "1", Avg_Salary_K, Modeling),
         Visualization = ifelse(Visualization == "1", Avg_Salary_K, Visualization),
         ML = ifelse(ML == "1", Avg_Salary_K, ML),
         ETL = ifelse(ETL == "1", Avg_Salary_K, ETL),
         Pred_Analysis = ifelse(Pred_Analysis == "1", Avg_Salary_K, Pred_Analysis),
         Statistics = ifelse(Statistics == "1", Avg_Salary_K, Statistics),
         Economics = ifelse(Economics == "1", Avg_Salary_K, Economics),
         Math = ifelse(Math == "1", Avg_Salary_K, Math),
         Problem_Solving = ifelse(Problem_Solving == "1", Avg_Salary_K, Problem_Solving),
         Communication = ifelse(Communication == "1", Avg_Salary_K, Communication),
         Decision_Making = ifelse(Decision_Making == "1", Avg_Salary_K, Decision_Making),
         Creativity = ifelse(Creativity == "1", Avg_Salary_K, Creativity),
         Critical_Thinking = ifelse(Critical_Thinking == "1", Avg_Salary_K, Critical_Thinking)) %>%
  select(Data_Analysis, Modeling, Visualization, ML, ETL, Pred_Analysis, Statistics, Economics, Math, Problem_Solving, Communication, Decision_Making, Creativity, Critical_Thinking) %>%
  na_if(0) %>%
  mutate_if(is.character, as.numeric) %>%
  summarize_if(is.numeric, median, na.rm = TRUE)

# Creates a long table for the plot
skills_join <- df_join2 %>%
  pivot_longer(
    cols = 1:14,
    names_to = 'Skills',
    values_to = 'MedianSalary') %>%
  select(Skills, MedianSalary)
# Plots Soft Skill by associated median salary
ggplot(skills_join, aes(x=reorder(Skills, +MedianSalary), y=MedianSalary)) +
geom_bar(stat="identity", fill = "#E59866") +
coord_flip() +
xlab("Soft & Knowlege Skills") +
ylab("Median Salary of Soft Skills (K)" ) +
ggtitle("Ranked Soft Skills by Associated Median Salary") + theme_minimal()
This plot ranks the “soft” skills by the proportion of times each skill was referenced in the Glassdoor job listings.
- The calculation is based on whether the specific “soft” skill was referenced in a job listing.
- If it was, the skill received 1 point; if it was not mentioned, it received 0 points.
- Each skill’s proportion was then calculated over all skill references across all job listings (see the sketch after this list).
- Ranked by references per job listing, “Communication” (and its flavors) tops the list.
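As a quick illustrative sketch (not the project code), the per-listing share of each soft skill can be computed directly from the 0/1 indicator columns:

# Sketch: count mentions per skill, then divide by the number of listings
skill_counts <- skills_tbl %>%
  select(-c(row_names, SkillsID)) %>%
  mutate_if(is.character, as.numeric) %>%
  colSums()
round(skill_counts / nrow(skills_tbl), 3)  # proportion of the 463 listings per skill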
# Tidy up the skills table for analysis
skills_tbl <- select(skills_tbl, -c(row_names, SkillsID)) # delete the ID columns
skills_tbl <- skills_tbl %>% mutate_if(is.character, as.numeric) # convert characters to numbers for calculations

# Sums each skill column
allskills <- skills_tbl %>%
  summarise_all(sum)

# Pivots the resulting sums to perform further calculations
allskills <- allskills %>%
  pivot_longer(
    cols = c(1:14),
    names_to = "Job_Listing_Skill",
    values_to = "Total") %>%
  select(Job_Listing_Skill, Total)

# Creates ratio columns
allskills <- allskills %>%
  mutate(Job_Listing_ratio = Total/463, # ratio of skill mentions to the 463 job listings
         Overall_Skills_ratio = Total/sum(Total)) # ratio of a specific skill over all skill mentions
# Plot of ratio over all soft skills
ggplot(allskills, aes(x=reorder(Job_Listing_Skill, +Overall_Skills_ratio), y=Overall_Skills_ratio)) +
geom_bar(stat="identity", fill = "#E59866") +
coord_flip() +
xlab("Skills in Job Listings") +
ylab("Ratio of Soft Skills in Job Listings" ) +
ggtitle("Ranked Soft Skills In Job Listings") + theme_minimal()
This plot ranks the technical skills by the proportion of times each skill was referenced in the job listings.
- Using the same calculation as above, each tech skill’s proportion was calculated over all skill references across all job listings.
- Ranked by references per job listing, “python” tops the list.
# Tidy up the tech table for analysis
tech_tbl <- select(tech_tbl, -c(row_names, TechID)) # delete the ID columns
tech_tbl <- tech_tbl %>% mutate_if(is.character, as.numeric) # convert characters to numbers for calculations

# Sums each tech skill column
alltechskills <- tech_tbl %>%
  summarise_all(sum)

# Pivots the resulting sums to perform further calculations
alltechskills <- alltechskills %>%
  pivot_longer(
    cols = c(1:18),
    names_to = "Job_Listing_Tech",
    values_to = "Total") %>%
  select(Job_Listing_Tech, Total)

# Creates ratio columns
alltechskills <- alltechskills %>%
  mutate(Job_ListingTech_ratio = Total/463, # ratio of skill mentions to the 463 job listings
         Overall_Tech_ratio = Total/sum(Total)) # ratio of a specific skill over all skill mentions
# Plot of ratio over all technical skills
ggplot(alltechskills, aes(x=reorder(Job_Listing_Tech, +Overall_Tech_ratio), y=Overall_Tech_ratio)) +
geom_bar(stat="identity", fill = "#E59866") +
coord_flip() +
xlab("Tech in Job Listings") +
ylab("Ratio of Tech Skills in Job Listings" ) +
ggtitle("Ranked Technical Skills In Job Listings") + theme_minimal()
These plots rank data science salaries by education and job title.

#Clean up the job titles column by formatting titles to title case and replacing misspellings
jobs_tbl$job_title_sim <- str_to_title(jobs_tbl$job_title_sim)
jobs_tbl$job_title_sim <- str_replace(jobs_tbl$job_title_sim, "Data Scientist Project Manager", "Data Scientist PM")
jobs_tbl$job_title_sim <- str_replace(jobs_tbl$job_title_sim, "Data Analitics", "Data Analytics")

# Calculates the average salary by Job Title
AvgJobSalary <- jobs_tbl %>%
  group_by(job_title_sim) %>%
  summarise(mean(Avg_Salary_K)) %>%
  rename('AvgSalary' = 'mean(Avg_Salary_K)')
# Plots average salary by job title
ggplot(AvgJobSalary[which(AvgJobSalary$job_title_sim != "Na"),], aes(x = reorder(job_title_sim, +AvgSalary), y = AvgSalary, fill = job_title_sim)) +
geom_bar(stat="identity")+
theme_bw()+
scale_fill_brewer(palette="Set1")+
labs(title = "Average Salary by Job Title",
x = "Job Title",
y = "Average Salary (K)", fill = "Job Titles")+
theme(axis.text.x = element_text(angle=0, vjust=0.6 ))+ coord_flip()
# Clean up the education column by removing periods
jobs_tbl$education <- gsub("\\.", "", jobs_tbl$education)

#Clean up keywords in education
jobs_tbl$education <- str_replace(jobs_tbl$education, "Masters", "Master's")
jobs_tbl$education <- str_replace(jobs_tbl$education, "Bachelors", "Bachelor's")
jobs_tbl$education <- str_replace(jobs_tbl$education, "PhD,PhD", "PhD")

# Calculates the average salary by Education
AvgJobSalaryEdu <- jobs_tbl %>%
  group_by(education) %>%
  summarise(mean(Avg_Salary_K)) %>%
  rename('AvgSalary' = 'mean(Avg_Salary_K)') %>%
  filter(education != "")
#Plot chart - Job description Salary by Education
ggplot(AvgJobSalaryEdu, aes(x = reorder(education, +AvgSalary), y = AvgSalary, fill = education)) +
geom_bar(stat="identity")+
theme_bw()+
scale_fill_brewer(palette="Set1")+
labs(title = "Average Salary by Education",
x = "Education",
y = "Average Salary (K)", fill = "Education")+
theme(axis.text.x = element_text(angle=0, vjust=0.6 ))+ coord_flip()
#Plot chart - Job titles by profession and education level
ggplot(data = jobs_tbl[which(jobs_tbl$education != "" & jobs_tbl$job_title_sim != "Na"),], aes(x = `job_title_sim`, fill = education)) +
geom_bar()+
scale_fill_brewer(palette="Set3")+
theme_bw()+
labs(title = " Job titles by profession and education level",
caption = "Does not include jobs that did not specify education",
x = "Job Title",
fill = "Education")+
theme(axis.text.x = element_text(angle=45, vjust=1, hjust = 1 ))
The “Average Salary by Job Title” chart shows that “Data Scientist” titles are the third highest paid in the field, with an average salary of roughly $110K+; “Director” titles were the top paid, at $125K+.
For those looking to become a data scientist, the “Job titles by profession and education level” chart shows that the requested education varies: most listings request at least a Bachelor’s, and many request a Master’s or Ph.D., whereas most analyst listings request only a Bachelor’s.
The three technical skills with the highest associated median salaries ($112K+) are keras, scikit, and tensor.
The three soft skills with the highest associated median salaries ($105K) are Machine Learning, Predictive Analytics, and Statistics.
The three most referenced technical skills across all job listings are python, sql, and excel.
The three most referenced soft skills across all job listings are communication, modeling, and statistics.