What are the most valued data skills?

Overview

In this project, we were tasked with determining the most valuable data science skills.

Introducing the Team

  • Raphael Nash - Project Manager, Fearless Leader
  • Oluwakemi Omotunde - Domain SME
  • Liam Byrne - Indeed scraping
  • Ravi Kothari - Indeed scraping
  • Dmitriy Vecheruk - Word Counts, Word Cloud
  • Walt Wells - Word Counts, Data Mining
  • Ahmed Sajjad - Database Creation
  • Todd Weigel - Stored Procedures
  • Luis Calleja - Cluster Analysis
  • Jonathan Hernandez - Cluster Analysis
  • Luisa Velasco - Quality Assurance and RMarkdown
  • Alex Low - Quality Assurance and RMarkdown
  • Brandon O’Hara - Visualizations
  • Leland Randles - Documentation and Presentation Moderator

Instructor: Andy Catlin

Tools

For this exercise, we used the following tools:

  • Slack - collaboration
  • GitHub - file repository, Source Code Management (SCM)
  • R - web scraping, word counts, data mining
  • R Markdown - code documentation
  • MySQL - data storage
  • Tableau - visualization

Part A: Collecting Phrases for Our Analysis

Our exercise was built on two foundational elements:

  1. We collected text from San Francisco job postings on Indeed.com where the phrase “data scientist” appeared in the job title, company name, or keywords: http://www.indeed.com/jobs?q=%22data+scientist%22&l=san+francisco. When scraping the web pages, we focused specifically on bullet-point lists, as these page elements are most likely to contain the required qualifications and skills (a sketch of this scraping step follows the code below).
source('Project_3.R')
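Project_3.R contains the actual scraping code; a minimal sketch of the approach, assuming the rvest package and the search URL above, might look like the following (the CSS selector and single-page scope are simplifications):

library(rvest)
library(stringr)

# Illustrative only: pull bullet-point list items from one Indeed results page
url <- "http://www.indeed.com/jobs?q=%22data+scientist%22&l=san+francisco"
page <- read_html(url)

# <li> elements are the page elements most likely to hold qualifications and skills
bullets <- page %>%
  html_nodes("li") %>%
  html_text() %>%
  str_trim()

head(bullets)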
  2. We built a list of predefined skill keywords, grouped into skill categories.
skillz <- read.csv('skills_lists.csv', header = TRUE, sep = ",")
head(skillz)
##        x
## 1   java
## 2      r
## 3  scala
## 4 python
## 5      c
## 6    c++
  3. We then populated a MySQL database with the predefined terms and categories, and added stored procedures to read from and update the database with the per-term counts required for our analysis (a sketch of the load step follows).
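A minimal sketch of the database load, assuming the RMySQL package and a hypothetical database and table name (the actual connection details and schema live in the project scripts):

library(RMySQL)

# Illustrative only: connection details and the 'skills' table name are assumptions
con <- dbConnect(MySQL(), dbname = "skills_db", host = "localhost",
                 user = "user", password = "password")

# Write the predefined skill keywords (read above into 'skillz') to MySQL
dbWriteTable(con, "skills", skillz, row.names = FALSE, overwrite = TRUE)

dbDisconnect(con)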

Using these foundations, we performed two separate analyses: one analyzed an unsupervised set of phrases collected directly from the Indeed pages; the other analyzed the same pages but focused only on the phrases we had previously identified and stored in the database.

Part B: Creating Analysis-Ready Data Sets

We started by preparing the unsupervised set of phrases for analysis. We parsed and cleaned the text, counted all the key phrases, and persisted the counts to the database and a CSV file. Below we display the top 10 unsupervised skills along with an associated word cloud.

#if (!exists('wdcnt.df.sub')) source('wordcount.R', print.eval  = TRUE)

wordcount_unsup = read.csv("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607-2-Group-Project/master/wordcount.csv")
knitr::kable(head(wordcount_unsup, n=10), align = "l")
Keyword Freq
data 2150
experience 1435
francisco 651
business 542
work 526
product 506
san 485
learning 440
san francisco 428
skills 419
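The counts above come from wordcount.R (see the commented source() call); a minimal sketch of one way to produce such term frequencies, assuming the tm package and a character vector bullets of scraped phrases, might look like this:

library(tm)

# Illustrative only: build a corpus from the scraped phrases and count term frequencies
corpus <- VCorpus(VectorSource(bullets))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(data.frame(Keyword = names(freq), Freq = freq), 10)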

Next we worked on the supervised set of phrases. We again parsed and cleaned the text, but this time restricted the analysis to the predefined set of phrases. Again, we counted the phrases and persisted the results to the database and a CSV file. Below we display the top 10 skills from the supervised set.

#if (!exists('skill.freq.df')) source('wordcountSuper.R', print.eval  = TRUE)

wordcount_sup = read.csv("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607-2-Group-Project/master/wordcountSuper.csv")
knitr::kable(head(wordcount_sup, n=10), align = "l")
Keyword Freq
mac 448
sql 332
r 329
python 324
hadoop 200
c 194
java 179
spark 143
hive 109
scala 106
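wordcountSuper.R produces these supervised counts; a minimal sketch of the idea, assuming the scraped phrases in bullets and the keyword list loaded above into skillz (single-token keywords only), might look like this:

library(stringr)

# Illustrative only: split the scraped text into lowercase tokens,
# keeping + and # so terms like "c++" survive, then count exact matches
tokens   <- unlist(str_split(tolower(paste(bullets, collapse = " ")), "[^a-z0-9+#]+"))
keywords <- tolower(as.character(skillz$x))

skill_freq <- data.frame(
  Keyword = keywords,
  Freq    = sapply(keywords, function(k) sum(tokens == k))
)
head(skill_freq[order(-skill_freq$Freq), ], 10)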

Our MySQL database was updated with the results of both analyses via several stored procedures that write to the respective tables shown in the entity-relationship diagram below.
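A minimal sketch of one such stored-procedure call, assuming the RMySQL package and a hypothetical procedure name and signature (the real procedure names live in the database scripts):

library(RMySQL)

con <- dbConnect(MySQL(), dbname = "skills_db", host = "localhost",
                 user = "user", password = "password")

# Illustrative only: 'insert_keyword_count' and its arguments are assumptions
dbGetQuery(con, "CALL insert_keyword_count('sql', 332)")

dbDisconnect(con)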

Part C: Analysis and Visualization

Next we performed a cluster analysis of the phrases identified in the unsupervised data set, using the stringdist function to measure similarity and grouping like phrases into 200 clusters. Below you will see two items:

  • A diagram visualizing the groupings, showing how many phrases were assigned to each group.
  • A summary of the top groups and the like phrases which were assigned to those groups.
source('cluster4upload.r', print.eval = FALSE)

## [1] "Average number of models per cluster: 5"
cluster_out = read.csv("cluster_out.csv")

knitr::kable(head(cluster_out))
cluster members aggd.freq
1 dallas ,data ,data scientist ,data scientistkpermanentsan ,data scientistkpermanentsan francisco ,data scientistkpermanentsan francisco caworkbridge ,data scientists ,data sets ,data sharpening ,data sharpening fusion ,data sharpening fusion technologies ,data visualization ,data visualize ,data visualize go ,data visualize go engaging ,database ,databases ,datasets 4946
2 experience ,experience building ,experience data ,experience hadoop ,experience one ,experience using ,experience working ,experience working large ,experienced ,experiences ,experimental ,experimental design ,experiments ,expert ,expertise 3445
3 francisco ,francisco ca ,francisco cajobspring ,francisco cajobspring san ,francisco cajobspring san francisco ,francisco california ,francisco caworkbridge ,francisco caworkbridge san ,francisco caworkbridge san francisco ,franciscosilicon ,franciscosilicon valleytorontowashington ,franciscosilicon valleytorontowashington dc ,san francisco ,san francisco ca 3118
12 analyses ,analysis ,analysis data ,analyst ,analytic ,analytical ,analytics ,analyze ,analyzing ,data analysis ,data analytics 2021
24 account ,communicate ,communication ,communication skills ,community ,conditions ,contact ,contact us ,contract ,contractor ,contractor login ,county ,economics ,recommendations 1827
15 machine ,machine learning ,machine learning algorithms ,machine learning data ,machine learning models ,machine learning techniques ,manage ,manager ,managers ,managing ,matching 1715
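cluster4upload.r holds the actual implementation; a minimal sketch of the clustering idea, assuming the stringdist package and a character vector phrases holding more than 200 distinct unsupervised key phrases, might look like this:

library(stringdist)

# Illustrative only: pairwise string distances between phrases (Jaro-Winkler metric)
d <- stringdistmatrix(phrases, phrases, method = "jw")
rownames(d) <- phrases

# Hierarchical clustering, cut into 200 groups of like phrases
hc     <- hclust(as.dist(d))
groups <- cutree(hc, k = 200)

head(sort(table(groups), decreasing = TRUE))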

Finally, we created an interactive summary of our analysis using Tableau.