Create a short document with the names of group members. Briefly describe the collaboration tools you will use as a group, including for communication, code sharing, and project documentation. Identify your data sources, where the data can be found, and how to load it. Finally, create at least a logical model for your normalized database and produce an Entity-Relationship (ER) diagram documenting your database design.
The team is using RStudio Cloud (https://rstudio.cloud) for collaboration and code development, which allows us to view and share code within a shared project. We are using R Markdown within RStudio Cloud for project documentation, published through RPubs (https://rpubs.com). To create the ERD, the team used Quick Database Diagrams (https://www.quickdatabasediagrams.com). Source CSV and RMD files are saved in a GitHub repository so files can be accessed centrally (https://github.com/johnnydrodriguez/data607_project3). In addition, the team communicates over Zoom and Slack.
Project Tools Insights
- We looked for collaboration tools that allowed for real-time editing.
- We experimented with Google Colab (with R) and RStudio Cloud, but could not get them to work for our real-time editing needs.
- We used GitHub as a central repository for shared code and collaborated in real time over Zoom sessions and Slack. This was effective in moving the project forward.
The data used to answer the question is taken from Glassdoor via Kaggle: the data was scraped from Glassdoor and posted to Kaggle.
Source: https://www.kaggle.com/datasets/nikhilbhathi/data-scientist-salary-us-glassdoor?resource=download
Source CSV (Uncleaned for Project): https://raw.githubusercontent.com/johnnydrodriguez/data607_project3/main/glassdoor_2021.csv
Data Sources Insights
- Datasets that could answer this specific question were challenging to find.
- Job sites proved challenging, as their APIs were deprecated and public datasets were not available.
- Federal government sites are data-rich, but job titles, skills, and other variables that might have directly helped answer the question were typically aggregated generically under ‘Data Science’ or ‘Computer Information’ categories.
- The Glassdoor data used in the project was scraped and made available through Kaggle as a set of data science job listings from 2021. It contained separate skills data we could extract from the individual job listings.
The team initially divided the raw data into three tables: Company, Jobs, and Tech.
https://github.com/johnnydrodriguez/data607_project3/blob/main/QuickDBD-export.png
Data Science Valuable Skills ERD
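For reference, the three-table design translates into roughly the following DDL. This is an illustrative sketch only: the column lists are abridged, the types and foreign-key layout are our assumptions, and in the project the tables are actually created later with dbWriteTable (the connection object con is defined in the database section below).

# Abridged, assumed DDL for the logical model; Jobs, Company, and Tech share the
# listing index as their key, so Company and Tech reference Jobs one-to-one.
dbSendQuery(con, "CREATE TABLE jobs (JobID INT PRIMARY KEY,
                                     Job_Title VARCHAR(255),
                                     Avg_Salary_K DOUBLE)")
dbSendQuery(con, "CREATE TABLE company (CompanyID INT PRIMARY KEY,
                                        Company_Name VARCHAR(255),
                                        Industry VARCHAR(255),
                                        FOREIGN KEY (CompanyID) REFERENCES jobs(JobID))")
dbSendQuery(con, "CREATE TABLE tech (TechID INT PRIMARY KEY,
                                     Python TINYINT,
                                     spark TINYINT,
                                     FOREIGN KEY (TechID) REFERENCES jobs(JobID))")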
# install.packages("tidyverse", repos = "http://cran.us.r-project.org")
# install.packages("RMySQL", repos = "http://cran.us.r-project.org")
# install.packages("ggplot2", repos = "http://cran.us.r-project.org")
# install.packages("aws.s3", repos = "http://cran.us.r-project.org")
library(tidyverse)
library(RMySQL)
library(ggplot2)
library(aws.s3)
We used AWS S3 to store and read the raw data, and AWS RDS for the project database. This initial step loads the raw CSV from an AWS S3 bucket.
#Connect to S3------------------------------------------------------------------
#Use IAM login credentials/Set to "read-only"
Sys.setenv("AWS_ACCESS_KEY_ID" = AWS_ACCESS_KEY_ID,
"AWS_SECRET_ACCESS_KEY" = AWS_SECRET_ACCESS_KEY)
#Check bucket
get_bucket("data-skills-p3")
## Bucket: data-skills-p3
##
## $Contents
## Key: glassdoor_2021.csv
## LastModified: 2022-10-21T00:15:57.000Z
## ETag: "0b29dbb1049bdfcd1b77637c91325dce"
## Size (B): 3155265
## Owner: jlixander
## Storage class: STANDARD
#Load csv file from s3 into a df
jobs_df <- aws.s3::s3read_using(read.csv, object = "s3://data-skills-p3/glassdoor_2021.csv")
glimpse(jobs_df, n = 1)
## Rows: 742
## Columns: 42
## $ index <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ Job.Title <chr> "Data Scientist", "Healthcare Data Scientist", "Dat…
## $ Salary.Estimate <chr> "$53K-$91K (Glassdoor est.)", "$63K-$112K (Glassdoo…
## $ Job.Description <chr> "Data Scientist\nLocation: Albuquerque, NM\nEducati…
## $ Rating <dbl> 3.8, 3.4, 4.8, 3.8, 2.9, 3.4, 4.1, 3.8, 3.3, 4.6, 3…
## $ Company.Name <chr> "Tecolote Research\n3.8", "University of Maryland M…
## $ Location <chr> "Albuquerque, NM", "Linthicum, MD", "Clearwater, FL…
## $ Headquarters <chr> "Goleta, CA", "Baltimore, MD", "Clearwater, FL", "R…
## $ Size <chr> "501 - 1000 ", "10000+ ", "501 - 1000 ", "1001 - 50…
## $ Founded <int> 1973, 1984, 2010, 1965, 1998, 2000, 2008, 2005, 201…
## $ Type.of.ownership <chr> "Company - Private", "Other Organization", "Company…
## $ Industry <chr> "Aerospace & Defense", "Health Care Services & Hosp…
## $ Sector <chr> "Aerospace & Defense", "Health Care", "Business Ser…
## $ Revenue <chr> "$50 to $100 million (USD)", "$2 to $5 billion (USD…
## $ Competitors <chr> "-1", "-1", "-1", "Oak Ridge National Laboratory, N…
## $ Hourly <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Employer.provided <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Lower.Salary <int> 53, 63, 80, 56, 86, 71, 54, 86, 38, 120, 126, 64, 1…
## $ Upper.Salary <int> 91, 112, 90, 97, 143, 119, 93, 142, 84, 160, 201, 1…
## $ Avg.Salary.K. <dbl> 72.0, 87.5, 85.0, 76.5, 114.5, 95.0, 73.5, 114.0, 6…
## $ company_txt <chr> "Tecolote Research", "University of Maryland Medica…
## $ Job.Location <chr> "NM", "MD", "FL", "WA", "NY", "TX", "MD", "CA", "NY…
## $ Age <int> 48, 37, 11, 56, 23, 21, 13, 16, 7, 12, 10, 53, 59, …
## $ Python <int> 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, …
## $ spark <int> 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, …
## $ aws <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ excel <int> 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ sql <int> 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, …
## $ sas <int> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ keras <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ pytorch <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ scikit <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ tensor <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ hadoop <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
## $ tableau <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ bi <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ flink <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mongo <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ google_an <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ job_title_sim <chr> "data scientist", "data scientist", "data scientist…
## $ seniority_by_title <chr> "na", "na", "na", "na", "na", "na", "na", "na", "na…
## $ Degree <chr> "M", "M", "M", "na", "na", "na", "na", "M", "P", "n…
Once extracted, significant cleanup was needed to convert the raw data into tables we could use for the plots and analysis. We ultimately added one table to the original three-table design.
# Remove duplicates
jobs_unique <- jobs_df %>% distinct(Job.Description, Company.Name, .keep_all = TRUE)

# Normalize curly apostrophes in the job descriptions
jobs_unique$Job.Description <- gsub("’", "'", jobs_unique$Job.Description)

# Identify education keywords and collect any matches per job description
education <- c("Masters", "Bachelors", "Master's", "Bachelor's", "Phd", "PhD", "Ph.D")
jobs_unique$education <- mapply(function(x) paste(education[str_detect(x, education)],
                                                  collapse = ","), jobs_unique$Job.Description)
#Extracts additional tech (R and API) from the job description
jobs_unique <- jobs_unique %>%
  mutate(R = str_match(Job.Description, '.*(\\sR\\s).*')[,2],
         API = str_match(Job.Description, '.*(API).*')[,2]) %>%
  mutate_at(vars(R), ~replace(., !is.na(.), 1)) %>%
  mutate_at(vars(R), ~replace(., is.na(.), 0)) %>%
  mutate_at(vars(API), ~replace(., !is.na(.), 1)) %>%
  mutate_at(vars(API), ~replace(., is.na(.), 0))
# Extract skills from the job description and convert matches to 1/0 indicators
Skills <- jobs_unique %>%
  select(index, Job.Description) %>%
  mutate(Data_Analysis = str_match(Job.Description, regex('.*(data analy).*', ignore_case = TRUE))[,2],
         Modeling = str_match(Job.Description, regex('.*(model).*', ignore_case = TRUE))[,2],
         Visualization = str_match(Job.Description, regex('.*(visualization).*', ignore_case = TRUE))[,2],
         ML = str_match(Job.Description, regex('.*(machine learning).*', ignore_case = TRUE))[,2],
         ETL = str_match(Job.Description, '.*(ETL).*')[,2],
         Pred_Analysis = str_match(Job.Description, regex('.*(predictive analy).*', ignore_case = TRUE))[,2],
         Statistics = str_match(Job.Description, regex('.*(statis).*', ignore_case = TRUE))[,2],
         Economics = str_match(Job.Description, regex('.*(econom).*', ignore_case = TRUE))[,2],
         Math = str_match(Job.Description, regex('.*(math).*', ignore_case = TRUE))[,2],
         Problem_Solving = str_match(Job.Description, regex('.*(problem-solv).*', ignore_case = TRUE))[,2],
         Communication = str_match(Job.Description, regex('.*(communicat).*', ignore_case = TRUE))[,2],
         Decision_Making = str_match(Job.Description, regex('.*(decision-mak).*', ignore_case = TRUE))[,2],
         Creativity = str_match(Job.Description, regex('.*(creativity).*', ignore_case = TRUE))[,2],
         Critical_Thinking = str_match(Job.Description, regex('.*(critical think).*', ignore_case = TRUE))[,2]) %>%
  mutate_at(vars(-c(index)), ~replace(., !is.na(.), 1)) %>%
  mutate_at(vars(-c(index)), ~replace(., is.na(.), 0)) %>%
  select(-c(Job.Description)) %>%
  rename('SkillsID' = 'index')
#Tidy/transform the Glassdoor data and create the "Company" data frame
Company <- select(jobs_unique, c('index','Rating', 'Company.Name','Location','Headquarters','Size','Founded','Type.of.ownership','Industry','Sector','Revenue','Competitors','company_txt','Age'))

# Rename columns to match the database design
colnames(Company)[colnames(Company) == "index"] <- "CompanyID"
colnames(Company)[colnames(Company) == "Rating"] <- "Column_Rating"
colnames(Company)[colnames(Company) == "Company.Name"] <- "Company_Name"
colnames(Company)[colnames(Company) == "Headquarters"] <- "HQ_Address"
colnames(Company)[colnames(Company) == "Type.of.ownership"] <- "type_of_ownership"
colnames(Company)[colnames(Company) == "Age"] <- "Company_Age"
#Tidy/transform the Glassdoor data and create the "Jobs" data frame
Jobs <- select(jobs_unique, c('index','Job.Title', 'Salary.Estimate','Job.Description','Hourly','Lower.Salary','Upper.Salary','Avg.Salary.K.','Job.Location','job_title_sim','seniority_by_title','Degree', 'education'))

# Rename columns to match the database design
colnames(Jobs)[colnames(Jobs) == "index"] <- "JobID"
colnames(Jobs)[colnames(Jobs) == "Job.Title"] <- "Job_Title"
colnames(Jobs)[colnames(Jobs) == "Salary.Estimate"] <- "Salary_Estimate"
colnames(Jobs)[colnames(Jobs) == "Job.Description"] <- "Job_Description"
colnames(Jobs)[colnames(Jobs) == "Lower.Salary"] <- "Lower_Salary_K"
colnames(Jobs)[colnames(Jobs) == "Upper.Salary"] <- "Upper_Salary_K"
colnames(Jobs)[colnames(Jobs) == "Avg.Salary.K."] <- "Avg_Salary_K"
colnames(Jobs)[colnames(Jobs) == "Job.Location"] <- "Job_Location"
# Create the "Tech" data frame from the Glassdoor data
Tech <- select(jobs_unique, c("index","Python","spark","aws","excel","sql","sas","keras","pytorch","scikit","tensor","hadoop","tableau","bi","flink","mongo","google_an","R", "API"))

# Change the "index" column to "TechID"
colnames(Tech)[colnames(Tech) == "index"] <- "TechID"
Once the normalized tables were ready, we connected to the AWS RDS MySQL instance and loaded the tables into the SQL database. For the purposes of our project, the rest of the transformations, plots, and analysis are based on the tables read back from the AWS MySQL instance.
#Connection details
endpoint <- "database-1.c4xyb2t3srpc.us-east-1.rds.amazonaws.com"
port     <- 3306
username <- "admin"
dbname   <- "database1"
region   <- "us-east-1a"
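The password object used below is created outside the published document so credentials are not exposed. One way to supply it (our assumption, not necessarily how the team ran it) is through an environment variable:

# Assumed approach: read the database password from an environment variable
# (e.g., DB_PASSWORD set in .Renviron) so it never appears in the notebook.
password <- Sys.getenv("DB_PASSWORD")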
# Connect to the database using an IAM authentication token.
con <- DBI::dbConnect(
  RMySQL::MySQL(),
  host = "database-1.c4xyb2t3srpc.us-east-1.rds.amazonaws.com",
  port = 3306,
  db = "database1",
  user = "admin",
  password = password
)
## TEST DATABASE CONNECTION
class(con)
## [1] "MySQLConnection"
## attr(,"package")
## [1] "RMySQL"
#Drop Tables If they exist before creating (to avoid SQL error)
dbSendQuery(con, "DROP TABLE IF EXISTS company,jobs,tech, skills")
## <MySQLResult:65024,0,0>
#Upload tables to AWS RDS - MySQL Instance
dbWriteTable(conn = con, name = 'jobs', value = as.data.frame(Jobs))
## [1] TRUE
dbWriteTable(conn = con, name = 'tech', value = as.data.frame(Tech))
## [1] TRUE
dbWriteTable(conn = con, name = 'company', value = as.data.frame(Company))
## [1] TRUE
dbWriteTable(conn = con, name = 'skills', value = as.data.frame(Skills))
## [1] TRUE
#List database tables
dbListTables(con)
## [1] "company" "jobs" "skills" "tech"
#Read tables from SQL back to R for analysis
jobs_tbl <- dbGetQuery(con, "SELECT * FROM jobs")
tech_tbl <- dbGetQuery(con, "SELECT * FROM tech")
company_tbl <- dbGetQuery(con, "SELECT * FROM company")
skills_tbl <- dbGetQuery(con, "SELECT * FROM skills")
glimpse(jobs_tbl)
## Rows: 463
## Columns: 14
## $ row_names <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", …
## $ JobID <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ Job_Title <chr> "Data Scientist", "Healthcare Data Scientist", "Dat…
## $ Salary_Estimate <chr> "$53K-$91K (Glassdoor est.)", "$63K-$112K (Glassdoo…
## $ Job_Description <chr> "Data Scientist\nLocation: Albuquerque, NM\nEducati…
## $ Hourly <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Lower_Salary_K <dbl> 53, 63, 80, 56, 86, 71, 54, 86, 38, 120, 126, 64, 1…
## $ Upper_Salary_K <dbl> 91, 112, 90, 97, 143, 119, 93, 142, 84, 160, 201, 1…
## $ Avg_Salary_K <dbl> 72.0, 87.5, 85.0, 76.5, 114.5, 95.0, 73.5, 114.0, 6…
## $ Job_Location <chr> "NM", "MD", "FL", "WA", "NY", "TX", "MD", "CA", "NY…
## $ job_title_sim <chr> "data scientist", "data scientist", "data scientist…
## $ seniority_by_title <chr> "na", "na", "na", "na", "na", "na", "na", "na", "na…
## $ Degree <chr> "M", "M", "M", "na", "na", "na", "na", "M", "P", "n…
## $ education <chr> "Master's,Bachelor's,Ph.D", "Master's", "Master's,P…
glimpse(tech_tbl)
## Rows: 463
## Columns: 20
## $ row_names <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12…
## $ TechID <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ Python <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, …
## $ spark <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, …
## $ aws <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, …
## $ excel <dbl> 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, …
## $ sql <dbl> 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, …
## $ sas <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ keras <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ pytorch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ scikit <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ tensor <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, …
## $ hadoop <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, …
## $ tableau <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ bi <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ flink <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mongo <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ google_an <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ R <chr> "0", "0", "1", "0", "1", "0", "0", "0", "0", "0", "1", "0", …
## $ API <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
glimpse(company_tbl)
## Rows: 463
## Columns: 15
## $ row_names <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "…
## $ CompanyID <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ Column_Rating <dbl> 3.8, 3.4, 4.8, 3.8, 2.9, 3.4, 4.1, 3.8, 3.3, 4.6, 3.…
## $ Company_Name <chr> "Tecolote Research\n3.8", "University of Maryland Me…
## $ Location <chr> "Albuquerque, NM", "Linthicum, MD", "Clearwater, FL"…
## $ HQ_Address <chr> "Goleta, CA", "Baltimore, MD", "Clearwater, FL", "Ri…
## $ Size <chr> "501 - 1000 ", "10000+ ", "501 - 1000 ", "1001 - 500…
## $ Founded <dbl> 1973, 1984, 2010, 1965, 1998, 2000, 2008, 2005, 2014…
## $ type_of_ownership <chr> "Company - Private", "Other Organization", "Company …
## $ Industry <chr> "Aerospace & Defense", "Health Care Services & Hospi…
## $ Sector <chr> "Aerospace & Defense", "Health Care", "Business Serv…
## $ Revenue <chr> "$50 to $100 million (USD)", "$2 to $5 billion (USD)…
## $ Competitors <chr> "-1", "-1", "-1", "Oak Ridge National Laboratory, Na…
## $ company_txt <chr> "Tecolote Research", "University of Maryland Medical…
## $ Company_Age <dbl> 48, 37, 11, 56, 23, 21, 13, 16, 7, 12, 10, 53, 59, 9…
glimpse(skills_tbl)
## Rows: 463
## Columns: 16
## $ row_names <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "…
## $ SkillsID <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ Data_Analysis <chr> "1", "0", "1", "1", "1", "1", "0", "0", "0", "1", "1…
## $ Modeling <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "0…
## $ Visualization <chr> "1", "1", "1", "0", "0", "1", "1", "0", "0", "0", "0…
## $ ML <chr> "1", "1", "1", "1", "1", "1", "0", "1", "0", "1", "1…
## $ ETL <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0…
## $ Pred_Analysis <chr> "0", "1", "0", "0", "0", "0", "0", "0", "0", "1", "0…
## $ Statistics <chr> "0", "1", "1", "1", "1", "1", "1", "1", "0", "1", "1…
## $ Economics <chr> "0", "0", "0", "0", "1", "0", "0", "0", "0", "0", "1…
## $ Math <chr> "1", "1", "1", "1", "1", "0", "1", "1", "0", "0", "0…
## $ Problem_Solving <chr> "0", "0", "0", "0", "0", "1", "1", "0", "1", "0", "0…
## $ Communication <chr> "1", "1", "1", "0", "1", "1", "0", "1", "0", "1", "0…
## $ Decision_Making <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0…
## $ Creativity <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0…
## $ Critical_Thinking <chr> "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "0…
This plot ranks the technical skills by the median salary associated with each skill.
- The dataset contained an average salary per job, which we mapped to each technical skill included in the dataset.
- For example, if a job listed mongo in its description, that job’s average salary was mapped to the mongo skill.
- Each skill thus ended up with a range of associated salaries.
- The plot ranks the technical skills by the median of the salaries associated with each skill; a compact sketch of this computation follows this list.
- The skills that rose to the top are all related to machine and deep learning technologies.
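As an illustrative alternative (not the project code), the same per-skill medians can be computed by pivoting the 0/1 skill flags to long form. Note that this sketch coerces the character R and API flags to numeric, so it covers all 18 technologies, whereas the project pipeline below keeps the 16 numeric columns; the same pattern would also work for the soft-skills table.

# Sketch: map each listing's average salary to every flagged skill, then take medians
tech_medians <- left_join(jobs_tbl, tech_tbl, by = c("JobID" = "TechID")) %>%
  select(Avg_Salary_K, Python:API) %>%
  mutate_at(vars(Python:API), as.numeric) %>%   # R and API arrive as character
  pivot_longer(-Avg_Salary_K, names_to = "TechTools", values_to = "flag") %>%
  filter(flag == 1) %>%
  group_by(TechTools) %>%
  summarise(MedianSalary = median(Avg_Salary_K, na.rm = TRUE))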
# Creates the join between the jobs and tech tables
df_join1 <- left_join(jobs_tbl, tech_tbl, by = c("JobID" = "TechID"))

# Selects the columns needed for the analysis and plot
df_join1 <- select(df_join1, c('Avg_Salary_K', 'Python', 'spark', 'aws', 'excel', 'sql', 'sas', 'keras', 'pytorch', 'scikit', 'tensor', 'hadoop', 'tableau', 'bi', 'flink', 'mongo', 'google_an', 'R', 'API'))

# Inserts the average salary wherever a technical skill is present (i.e., where the skill = 1),
# then calculates the median salary per skill.
df_join1 <- df_join1 %>%
  mutate(Python = ifelse(Python == "1", Avg_Salary_K, Python),
         spark = ifelse(spark == "1", Avg_Salary_K, spark),
         aws = ifelse(aws == "1", Avg_Salary_K, aws),
         excel = ifelse(excel == "1", Avg_Salary_K, excel),
         sql = ifelse(sql == "1", Avg_Salary_K, sql),
         sas = ifelse(sas == "1", Avg_Salary_K, sas),
         keras = ifelse(keras == "1", Avg_Salary_K, keras),
         pytorch = ifelse(pytorch == "1", Avg_Salary_K, pytorch),
         scikit = ifelse(scikit == "1", Avg_Salary_K, scikit),
         tensor = ifelse(tensor == "1", Avg_Salary_K, tensor),
         hadoop = ifelse(hadoop == "1", Avg_Salary_K, hadoop),
         tableau = ifelse(tableau == "1", Avg_Salary_K, tableau),
         bi = ifelse(bi == "1", Avg_Salary_K, bi),
         flink = ifelse(flink == "1", Avg_Salary_K, flink),
         mongo = ifelse(mongo == "1", Avg_Salary_K, mongo),
         google_an = ifelse(google_an == "1", Avg_Salary_K, google_an),
         R = ifelse(R == "1", Avg_Salary_K, R),
         API = ifelse(API == "1", Avg_Salary_K, API)) %>%
  select(Python, spark, aws, excel, sql, sas, keras, pytorch, scikit, tensor, hadoop, tableau, bi, flink, mongo, google_an, R, API) %>%
  na_if(0) %>%
  summarize_if(is.numeric, median, na.rm = TRUE)

# Creates a long table for the plot
tech_join <- df_join1 %>%
  pivot_longer(
    cols = 1:16, # the 16 numeric skill columns; R and API remain character and are dropped by summarize_if above
    names_to = 'TechTools',
    values_to = 'MedianSalary') %>%
  select(TechTools, MedianSalary)
# Plots Technical Skill by associated median salary
ggplot(tech_join, aes(x=reorder(TechTools, +MedianSalary), y=MedianSalary)) +
geom_bar(stat="identity", fill = "#E59866") +
coord_flip() +
xlab("Tech Skill") +
ylab("Median Salary of Tech Skill (K)" ) +
ggtitle("Ranked Technical Skills by Associated Median Salary") + theme_minimal()
This plot ranks the soft and knowledge skills by the median salary associated with each skill.
- The soft-skill indicators were mined from the individual job descriptions.
- As above, each soft skill was associated with the job’s average salary.
- The plot ranks the soft skills by the median of the salaries associated with each skill.
- The skills that rose to the top all appear to be related to the machine learning domain.
# Creates the join between the jobs and skills tables
df_join2 <- left_join(jobs_tbl, skills_tbl, by = c("JobID" = "SkillsID"))

# Selects the columns needed for the analysis and plot
df_join2 <- select(df_join2, c('Avg_Salary_K', 'Data_Analysis', 'Modeling', 'Visualization', 'ML', 'ETL', 'Pred_Analysis', 'Statistics', 'Economics', 'Math', 'Problem_Solving', 'Communication', 'Decision_Making', 'Creativity', 'Critical_Thinking'))

# Inserts the average salary wherever a soft skill is present (i.e., where the skill = 1),
# then calculates the median salary per skill.
df_join2 <- df_join2 %>%
  mutate(Data_Analysis = ifelse(Data_Analysis == "1", Avg_Salary_K, Data_Analysis),
         Modeling = ifelse(Modeling == "1", Avg_Salary_K, Modeling),
         Visualization = ifelse(Visualization == "1", Avg_Salary_K, Visualization),
         ML = ifelse(ML == "1", Avg_Salary_K, ML),
         ETL = ifelse(ETL == "1", Avg_Salary_K, ETL),
         Pred_Analysis = ifelse(Pred_Analysis == "1", Avg_Salary_K, Pred_Analysis),
         Statistics = ifelse(Statistics == "1", Avg_Salary_K, Statistics),
         Economics = ifelse(Economics == "1", Avg_Salary_K, Economics),
         Math = ifelse(Math == "1", Avg_Salary_K, Math),
         Problem_Solving = ifelse(Problem_Solving == "1", Avg_Salary_K, Problem_Solving),
         Communication = ifelse(Communication == "1", Avg_Salary_K, Communication),
         Decision_Making = ifelse(Decision_Making == "1", Avg_Salary_K, Decision_Making),
         Creativity = ifelse(Creativity == "1", Avg_Salary_K, Creativity),
         Critical_Thinking = ifelse(Critical_Thinking == "1", Avg_Salary_K, Critical_Thinking)) %>%
  select(Data_Analysis, Modeling, Visualization, ML, ETL, Pred_Analysis, Statistics, Economics, Math, Problem_Solving, Communication, Decision_Making, Creativity, Critical_Thinking) %>%
  na_if(0) %>%
  mutate_if(is.character, as.numeric) %>%
  summarize_if(is.numeric, median, na.rm = TRUE)

# Creates a long table for the plot
skills_join <- df_join2 %>%
  pivot_longer(
    cols = 1:14,
    names_to = 'Skills',
    values_to = 'MedianSalary') %>%
  select(Skills, MedianSalary)
# Plots Soft Skill by associated median salary
ggplot(skills_join, aes(x=reorder(Skills, +MedianSalary), y=MedianSalary)) +
geom_bar(stat="identity", fill = "#E59866") +
coord_flip() +
xlab("Soft & Knowlege Skills") +
ylab("Median Salary of Soft Skills (K)" ) +
ggtitle("Ranked Soft Skills by Associated Median Salary") + theme_minimal()
This plot ranks the “soft” skills by the proportion of times each skill was referenced in the Glassdoor job listings.
- The calculation is based on whether the specific “soft” skill was referenced in a job listing.
- If it was, the skill received 1 point; if it was not mentioned, it received 0 points.
- Each skill’s proportion was then calculated over all skill references across all job listings (see the sketch after this list).
- Ranked by references per job listing, “Communication” (and its flavors) tops the list.
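As a quick illustrative sketch (not the project code), the per-listing share of each soft skill can be computed directly from the 0/1 indicator columns:

# Sketch: count mentions per skill, then divide by the number of listings
skill_counts <- skills_tbl %>%
  select(-c(row_names, SkillsID)) %>%
  mutate_if(is.character, as.numeric) %>%
  colSums()
round(skill_counts / nrow(skills_tbl), 3)  # proportion of the 463 listings per skill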
# Tidy up the skills table for analysis
skills_tbl <- select(skills_tbl, -c(row_names, SkillsID)) # delete the ID columns
skills_tbl <- skills_tbl %>% mutate_if(is.character, as.numeric) # convert characters to numbers for calculations

# Sums each skill column
allskills <- skills_tbl %>%
  summarise_all(sum)

# Pivots the resulting sums to perform further calculations
allskills <- allskills %>%
  pivot_longer(
    cols = c(1:14),
    names_to = "Job_Listing_Skill",
    values_to = "Total") %>%
  select(Job_Listing_Skill, Total)

# Creates ratio columns
allskills <- allskills %>%
  mutate(Job_Listing_ratio = Total/463, # ratio of skill mentions to the 463 job listings
         Overall_Skills_ratio = Total/sum(Total)) # ratio of a specific skill over all skill mentions
# Plot of ratio over all soft skills
ggplot(allskills, aes(x=reorder(Job_Listing_Skill, +Overall_Skills_ratio), y=Overall_Skills_ratio)) +
geom_bar(stat="identity", fill = "#E59866") +
coord_flip() +
xlab("Skills in Job Listings") +
ylab("Ratio of Soft Skills in Job Listings" ) +
ggtitle("Ranked Soft Skills In Job Listings") + theme_minimal()
This plot ranks the technical skills by the proportion of times each skill was referenced in the job listings.
- Using the same calculation as above, each tech skill’s proportion was calculated over all skill references across all job listings.
- Ranked by references per job listing, “python” tops the list.
# Tidy up the tech table for analysis
tech_tbl <- select(tech_tbl, -c(row_names, TechID)) # delete the ID columns
tech_tbl <- tech_tbl %>% mutate_if(is.character, as.numeric) # convert characters to numbers for calculations

# Sums each tech skill column
alltechskills <- tech_tbl %>%
  summarise_all(sum)

# Pivots the resulting sums to perform further calculations
alltechskills <- alltechskills %>%
  pivot_longer(
    cols = c(1:18),
    names_to = "Job_Listing_Tech",
    values_to = "Total") %>%
  select(Job_Listing_Tech, Total)

# Creates ratio columns
alltechskills <- alltechskills %>%
  mutate(Job_ListingTech_ratio = Total/463, # ratio of skill mentions to the 463 job listings
         Overall_Tech_ratio = Total/sum(Total)) # ratio of a specific skill over all skill mentions
# Plot of ratio over all technical skills
ggplot(alltechskills, aes(x=reorder(Job_Listing_Tech, +Overall_Tech_ratio), y=Overall_Tech_ratio)) +
geom_bar(stat="identity", fill = "#E59866") +
coord_flip() +
xlab("Tech in Job Listings") +
ylab("Ratio of Tech Skills in Job Listings" ) +
ggtitle("Ranked Technical Skills In Job Listings") + theme_minimal()
These plots rank data science salaries by education and job title.

#Clean up the job titles column by formatting titles to title case and replacing misspellings
jobs_tbl$job_title_sim <- str_to_title(jobs_tbl$job_title_sim)
jobs_tbl$job_title_sim <- str_replace(jobs_tbl$job_title_sim, "Data Scientist Project Manager", "Data Scientist PM")
jobs_tbl$job_title_sim <- str_replace(jobs_tbl$job_title_sim, "Data Analitics", "Data Analytics")

# Calculates the average salary by Job Title
AvgJobSalary <- jobs_tbl %>%
  group_by(job_title_sim) %>%
  summarise(mean(Avg_Salary_K)) %>%
  rename('AvgSalary' = 'mean(Avg_Salary_K)')
# Plots average salary by job title
ggplot(AvgJobSalary[which(AvgJobSalary$job_title_sim != "Na"),], aes(x = reorder(job_title_sim, +AvgSalary), y = AvgSalary, fill = job_title_sim)) +
geom_bar(stat="identity")+
theme_bw()+
scale_fill_brewer(palette="Set1")+
labs(title = "Average Salary by Job Title",
x = "Job Title",
y = "Average Salary (K)", fill = "Job Titles")+
theme(axis.text.x = element_text(angle=0, vjust=0.6 ))+ coord_flip()
# Clean up the education column by removing periods
jobs_tbl$education <- gsub("\\.", "", jobs_tbl$education)

#Clean up keywords in education
jobs_tbl$education <- str_replace(jobs_tbl$education, "Masters", "Master's")
jobs_tbl$education <- str_replace(jobs_tbl$education, "Bachelors", "Bachelor's")
jobs_tbl$education <- str_replace(jobs_tbl$education, "PhD,PhD", "PhD")

# Calculates the average salary by Education
AvgJobSalaryEdu <- jobs_tbl %>%
  group_by(education) %>%
  summarise(mean(Avg_Salary_K)) %>%
  rename('AvgSalary' = 'mean(Avg_Salary_K)') %>%
  filter(education != "")
#Plot chart - Job description Salary by Education
ggplot(AvgJobSalaryEdu, aes(x = reorder(education, +AvgSalary), y = AvgSalary, fill = education)) +
geom_bar(stat="identity")+
theme_bw()+
scale_fill_brewer(palette="Set1")+
labs(title = "Average Salary by Education",
x = "Education",
y = "Average Salary (K)", fill = "Education")+
theme(axis.text.x = element_text(angle=0, vjust=0.6 ))+ coord_flip()
#Plot chart - Job titles by profession and education level
ggplot(data = jobs_tbl[which(jobs_tbl$education != "" & jobs_tbl$job_title_sim != "Na"),], aes(x = `job_title_sim`, fill = education)) +
geom_bar()+
scale_fill_brewer(palette="Set3")+
theme_bw()+
labs(title = " Job titles by profession and education level",
caption = "Does not include jobs that did not specify education",
x = "Job Title",
fill = "Education")+
theme(axis.text.x = element_text(angle=45, vjust=1, hjust = 1 ))
The “Average Salary by Job Title” chart shows that “Data Scientist” titles are the third highest paid in the field, with an average salary of roughly $110K+; “Director” titles were the top paid, at $125K+.
For those looking to become a data scientist, the “Job titles by profession and education level” chart shows that the requested education varies: most listings request at least a Bachelor’s, and many request a Master’s or Ph.D., whereas most analyst listings request only a Bachelor’s.
The three technical skills with the highest associated median salaries ($112K+) are keras, scikit, and tensor.
The three soft skills with the highest associated median salaries ($105K) are Machine Learning, Predictive Analytics, and Statistics.
The three most referenced technical skills across all job listings are python, sql, and excel.
The three most referenced soft skills across all job listings are communication, modeling, and statistics.