What are the most valued data skills?

Overview

In this project, we were tasked with determining the most valuable data science skills.

Introducing the Team

  • Raphael Nash - Project Manager, Fearless Leader
  • Oluwakemi Omotunde - Domain SME
  • Liam Byrne - Indeed scraping
  • Ravi Kothari - Indeed scraping
  • Dmitriy Vecheruk - Word Counts, Word Cloud
  • Walt Wells - Word Counts, Data Mining
  • Ahmed Sajjad - Database Creation
  • Todd Weigel - Stored Procedures
  • Luis Calleja - Cluster Analysis
  • Jonathan Hernandez - Cluster Analysis
  • Luisa Velasco - Quality Assurance and RMarkdown
  • Alex Low - Quality Assurance and RMarkdown
  • Brandon O’Hara - Visualizations
  • Leland Randles - Documentation and Presentation Moderator

Instructor: Andy Catlin

Tools

For this exercise, we used the following tools:

  • Slack - collaboration
  • GitHub - file repository, Source Code Management (SCM)
  • R - web scraping, word counts, data mining
  • R Markdown - code documentation
  • MySQL - data storage
  • Tableau - visualization

Part A: Collecting Phrases for Our Analysis

Our exercise was built on two foundational elements:

  1. We collected text from San Francisco job postings on Indeed.com where the phrase “data scientist” appeared in the job title, company name, or keywords: http://www.indeed.com/jobs?q=%22data+scientist%22&l=san+francisco. When scraping the web pages, we focused specifically on bullet-point lists, as these page elements are most likely to contain the required qualifications and skills (a sketch of this scraping step follows the code below).
source('Project_3.R')
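Project_3.R contains the actual scraping code; a minimal sketch of the approach, assuming the rvest package and the search URL above, might look like the following (the CSS selector and single-page scope are simplifications):

library(rvest)
library(stringr)

# Illustrative only: pull bullet-point list items from one Indeed results page
url <- "http://www.indeed.com/jobs?q=%22data+scientist%22&l=san+francisco"
page <- read_html(url)

# <li> elements are the page elements most likely to hold qualifications and skills
bullets <- page %>%
  html_nodes("li") %>%
  html_text() %>%
  str_trim()

head(bullets)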
  2. We built a list of predefined skill keywords, grouped into skill categories.
skillz <- read.csv('skills_lists.csv', header = TRUE, sep = ",")
head(skillz)
##        x
## 1   java
## 2      r
## 3  scala
## 4 python
## 5      c
## 6    c++
  3. We then populated a MySQL database with the predefined terms and categories, and added stored procedures to read from and update the database with the per-term counts required for our analysis (a sketch of the load step follows).
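A minimal sketch of the database load, assuming the RMySQL package and a hypothetical database and table name (the actual connection details and schema live in the project scripts):

library(RMySQL)

# Illustrative only: connection details and the 'skills' table name are assumptions
con <- dbConnect(MySQL(), dbname = "skills_db", host = "localhost",
                 user = "user", password = "password")

# Write the predefined skill keywords (read above into 'skillz') to MySQL
dbWriteTable(con, "skills", skillz, row.names = FALSE, overwrite = TRUE)

dbDisconnect(con)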

Using these foundations, we performed two separate analyses: one analyzed an unsupervised set of phrases collected directly from the Indeed pages; the other analyzed the same pages but focused only on the phrases we had previously identified and stored in the database.

Part B: Creating Analysis-Ready Data Sets

We started by preparing the unsupervised set of phrases for analysis. We parsed and cleaned the text, counted all the key phrases, and persisted the counts to the database and a CSV file. Below we display the top 10 unsupervised skills along with an associated word cloud.

#if (!exists('wdcnt.df.sub')) source('wordcount.R', print.eval  = TRUE)

wordcount_unsup = read.csv("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607-2-Group-Project/master/wordcount.csv")
knitr::kable(head(wordcount_unsup, n=10), align = "l")
Keyword Freq
data 2150
experience 1435
francisco 651
business 542
work 526
product 506
san 485
learning 440
san francisco 428
skills 419
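The counts above come from wordcount.R (see the commented source() call); a minimal sketch of one way to produce such term frequencies, assuming the tm package and a character vector bullets of scraped phrases, might look like this:

library(tm)

# Illustrative only: build a corpus from the scraped phrases and count term frequencies
corpus <- VCorpus(VectorSource(bullets))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(data.frame(Keyword = names(freq), Freq = freq), 10)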

Next we worked on the supervised set of phrases. We again parsed and cleaned the text, but this time restricted the analysis to the predefined set of phrases. Again, we counted the phrases and persisted the results to the database and a CSV file. Below we display the top 10 skills from the supervised set.

#if (!exists('skill.freq.df')) source('wordcountSuper.R', print.eval  = TRUE)

wordcount_sup = read.csv("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607-2-Group-Project/master/wordcountSuper.csv")
knitr::kable(head(wordcount_sup, n=10), align = "l")
Keyword Freq
mac 448
sql 332
r 329
python 324
hadoop 200
c 194
java 179
spark 143
hive 109
scala 106
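wordcountSuper.R produces these supervised counts; a minimal sketch of the idea, assuming the scraped phrases in bullets and the keyword list loaded above into skillz (single-token keywords only), might look like this:

library(stringr)

# Illustrative only: split the scraped text into lowercase tokens,
# keeping + and # so terms like "c++" survive, then count exact matches
tokens   <- unlist(str_split(tolower(paste(bullets, collapse = " ")), "[^a-z0-9+#]+"))
keywords <- tolower(as.character(skillz$x))

skill_freq <- data.frame(
  Keyword = keywords,
  Freq    = sapply(keywords, function(k) sum(tokens == k))
)
head(skill_freq[order(-skill_freq$Freq), ], 10)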

Our MySQL database was updated with the results of both analyses via several stored procedures that write to the respective tables shown in the entity-relationship diagram below.
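A minimal sketch of one such stored-procedure call, assuming the RMySQL package and a hypothetical procedure name and signature (the real procedure names live in the database scripts):

library(RMySQL)

con <- dbConnect(MySQL(), dbname = "skills_db", host = "localhost",
                 user = "user", password = "password")

# Illustrative only: 'insert_keyword_count' and its arguments are assumptions
dbGetQuery(con, "CALL insert_keyword_count('sql', 332)")

dbDisconnect(con)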

Part C: Analysis and Visualization

Next we performed a cluster analysis of the phrases identified in the unsupervised data set, using the stringdist function to measure similarity and grouping like phrases into 200 clusters. Below you will see two items:

  • A diagram visualizing the groupings, showing how many phrases were assigned to each group.
  • A summary of the top groups and the like phrases which were assigned to those groups.
source('cluster4upload.r', print.eval = FALSE)

## [1] "Average number of models per cluster: 5"
cluster_out = read.csv("cluster_out.csv")

knitr::kable(head(cluster_out))
cluster members aggd.freq
1 dallas ,data ,data scientist ,data scientistkpermanentsan ,data scientistkpermanentsan francisco ,data scientistkpermanentsan francisco caworkbridge ,data scientists ,data sets ,data sharpening ,data sharpening fusion ,data sharpening fusion technologies ,data visualization ,data visualize ,data visualize go ,data visualize go engaging ,database ,databases ,datasets 4946
2 experience ,experience building ,experience data ,experience hadoop ,experience one ,experience using ,experience working ,experience working large ,experienced ,experiences ,experimental ,experimental design ,experiments ,expert ,expertise 3445
3 francisco ,francisco ca ,francisco cajobspring ,francisco cajobspring san ,francisco cajobspring san francisco ,francisco california ,francisco caworkbridge ,francisco caworkbridge san ,francisco caworkbridge san francisco ,franciscosilicon ,franciscosilicon valleytorontowashington ,franciscosilicon valleytorontowashington dc ,san francisco ,san francisco ca 3118
12 analyses ,analysis ,analysis data ,analyst ,analytic ,analytical ,analytics ,analyze ,analyzing ,data analysis ,data analytics 2021
24 account ,communicate ,communication ,communication skills ,community ,conditions ,contact ,contact us ,contract ,contractor ,contractor login ,county ,economics ,recommendations 1827
15 machine ,machine learning ,machine learning algorithms ,machine learning data ,machine learning models ,machine learning techniques ,manage ,manager ,managers ,managing ,matching 1715
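cluster4upload.r holds the actual implementation; a minimal sketch of the clustering idea, assuming the stringdist package and a character vector phrases holding more than 200 distinct unsupervised key phrases, might look like this:

library(stringdist)

# Illustrative only: pairwise string distances between phrases (Jaro-Winkler metric)
d <- stringdistmatrix(phrases, phrases, method = "jw")
rownames(d) <- phrases

# Hierarchical clustering, cut into 200 groups of like phrases
hc     <- hclust(as.dist(d))
groups <- cutree(hc, k = 200)

head(sort(table(groups), decreasing = TRUE))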

Finally, we created an interactive summary of our analysis using Tableau.