In this project, we were tasked with determining the most valuable data science skills.
Instructor: Andy Catlin
For this exercise, we used the following tools:
Our exercise was built on two foundational elements:
source('Project_3.R')
skillz <- read.csv('skills_lists.csv', header = TRUE, sep = ",")
head(skillz)
## x
## 1 java
## 2 r
## 3 scala
## 4 python
## 5 c
## 6 c++
Using these foundations, we ran two separate analyses. The first analyzed an unsupervised set of phrases collected directly from the Indeed webpage. The second analyzed the same Indeed pages but focused on the phrases we had previously identified and stored in the database.
We began by preparing the unsupervised set of phrases for analysis: we parsed the text, cleaned it, and counted all the key phrases. We then persisted the counts to a database and a CSV file. Below we display the top 10 phrases from the unsupervised set as well as an associated word cloud.
#if (!exists('wdcnt.df.sub')) source('wordcount.R', print.eval = TRUE)
wordcount_unsup = read.csv("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607-2-Group-Project/master/wordcount.csv")
knitr::kable(head(wordcount_unsup, n=10), align = "l")
| Keyword | Freq |
|---|---|
| data | 2150 |
| experience | 1435 |
| francisco | 651 |
| business | 542 |
| work | 526 |
| product | 506 |
| san | 485 |
| learning | 440 |
| san francisco | 428 |
| skills | 419 |
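The parse, clean, and count steps described above can be sketched in base R roughly as follows. This is a minimal illustration, not the project's wordcount.R: the two toy postings, the tiny stop-word list, and the helper names `clean_tokens` and `ngrams` are all assumptions made for the example.

```r
# Toy postings standing in for the scraped Indeed text (not the project's data).
postings <- c(
  "Data Scientist - San Francisco, CA. Experience with machine learning.",
  "Experience with data analysis; San Francisco office."
)

# Tiny illustrative stop-word list (assumption; a real list would be much longer).
stop_words <- c("with", "the", "and", "a", "of", "in")

# Lower-case the text, replace everything except letters, "+" and "#" with
# spaces (so tokens such as "c++" survive), split on spaces, drop stop words.
clean_tokens <- function(text) {
  text <- gsub("[^a-z+#]+", " ", tolower(text))
  tokens <- strsplit(trimws(text), " +")[[1]]
  tokens[nzchar(tokens) & !tokens %in% stop_words]
}

# Build all n-word phrases from a token vector.
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  vapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Collect single words and two-word phrases from every posting.
phrases <- unlist(lapply(postings, function(p) {
  tok <- clean_tokens(p)
  c(ngrams(tok, 1), ngrams(tok, 2))
}))

# Count the phrases and sort by descending frequency.
counts <- sort(table(phrases), decreasing = TRUE)
wordcount <- data.frame(Keyword = names(counts),
                        Freq = as.integer(counts),
                        stringsAsFactors = FALSE)
head(wordcount, n = 10)
```

Because stop words are removed before the phrases are built, this sketch can join words that were not adjacent in the original text (for example "experience machine"). Whether that is desirable depends on the analysis.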
Next, we worked on the supervised set of phrases. We again parsed and cleaned the text, but this time restricted the analysis to the set of phrases we had identified in advance. As before, we counted the phrases and persisted the counts to a database and a CSV file. Below we display the top 10 skills from the supervised set.
#if (!exists('skill.freq.df')) source('wordcountSuper.R', print.eval = TRUE)
wordcount_sup = read.csv("https://raw.githubusercontent.com/RaphaelNash/CUNY-DATA-607-2-Group-Project/master/wordcountSuper.csv")
knitr::kable(head(wordcount_sup, n=10), align = "l")
| Keyword | Freq |
|---|---|
| mac | 448 |
| sql | 332 |
| r | 329 |
| python | 324 |
| hadoop | 200 |
| c | 194 |
| java | 179 |
| spark | 143 |
| hive | 109 |
| scala | 106 |
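One way to restrict the count to a known skill list is to tokenize each posting and keep only tokens that appear in the list. The sketch below assumes this approach; it is not the project's wordcountSuper.R. The skill vector reuses the first rows of skills_lists.csv shown earlier, while the postings and the `tokenize` helper are hypothetical.

```r
# First rows of the skill list shown earlier in this report.
skills <- c("java", "r", "scala", "python", "c", "c++")

# Toy postings (assumption; not the project's data).
postings <- c(
  "We use Python, R and SQL. C++ is a plus.",
  "Java or Scala required; Python preferred."
)

# Lower-case, keep letters plus "+" and "#", and split into whole tokens.
tokenize <- function(text) {
  text <- gsub("[^a-z+#]+", " ", tolower(text))
  strsplit(trimws(text), " +")[[1]]
}

tokens <- unlist(lapply(postings, tokenize))

# Keep only tokens that exactly match a listed skill, then count them.
# Using a factor with fixed levels keeps skills that never occur (Freq 0).
hits <- tokens[tokens %in% skills]
skill_freq <- as.data.frame(table(factor(hits, levels = skills)),
                            stringsAsFactors = FALSE)
names(skill_freq) <- c("Keyword", "Freq")
skill_freq <- skill_freq[order(-skill_freq$Freq), ]
skill_freq
```

Matching whole tokens rather than substrings matters for short skill names: a substring search for "r" or "c" would also match inside unrelated words. This sketch only handles single-token skills; multi-word skills would need phrase matching.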
We updated our MySQL database with the results of both analyses using several stored procedures, which write to the respective tables shown in the entity-relationship diagram below.
Next, we performed a cluster analysis of the phrases identified from the unsupervised data, using the stringdist function to measure how similar the phrases are. We requested that similar phrases be organized into 200 groups. Below are the script's summary output and the first rows of the resulting cluster table:
source('cluster4upload.r', print.eval = FALSE)
## [1] "Average number of models per cluster: 5"
cluster_out = read.csv("cluster_out.csv")
knitr::kable(head(cluster_out))
| cluster | members | aggd.freq |
|---|---|---|
| 1 | dallas ,data ,data scientist ,data scientistkpermanentsan ,data scientistkpermanentsan francisco ,data scientistkpermanentsan francisco caworkbridge ,data scientists ,data sets ,data sharpening ,data sharpening fusion ,data sharpening fusion technologies ,data visualization ,data visualize ,data visualize go ,data visualize go engaging ,database ,databases ,datasets | 4946 |
| 2 | experience ,experience building ,experience data ,experience hadoop ,experience one ,experience using ,experience working ,experience working large ,experienced ,experiences ,experimental ,experimental design ,experiments ,expert ,expertise | 3445 |
| 3 | francisco ,francisco ca ,francisco cajobspring ,francisco cajobspring san ,francisco cajobspring san francisco ,francisco california ,francisco caworkbridge ,francisco caworkbridge san ,francisco caworkbridge san francisco ,franciscosilicon ,franciscosilicon valleytorontowashington ,franciscosilicon valleytorontowashington dc ,san francisco ,san francisco ca | 3118 |
| 12 | analyses ,analysis ,analysis data ,analyst ,analytic ,analytical ,analytics ,analyze ,analyzing ,data analysis ,data analytics | 2021 |
| 24 | account ,communicate ,communication ,communication skills ,community ,conditions ,contact ,contact us ,contract ,contractor ,contractor login ,county ,economics ,recommendations | 1827 |
| 15 | machine ,machine learning ,machine learning algorithms ,machine learning data ,machine learning models ,machine learning techniques ,manage ,manager ,managers ,managing ,matching | 1715 |
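A clustering of this kind can be sketched with the stringdist package and base R hierarchical clustering. This is only an illustration under stated assumptions, not the contents of cluster4upload.r: the Jaro distance (`method = "jw"`), average linkage, the toy phrases and frequencies, and `k = 2` (instead of the 200 groups used above) are all choices made for the example.

```r
library(stringdist)

# Toy phrase counts (assumption; not the project's data).
phrases <- data.frame(
  keyword = c("data", "database", "datasets",
              "experience", "experienced", "experiences"),
  freq = c(10, 4, 3, 8, 2, 1),
  stringsAsFactors = FALSE
)

# Pairwise string distances between all phrases; with a single argument,
# stringdistmatrix() returns a "dist" object that hclust() accepts directly.
d <- stringdistmatrix(phrases$keyword, method = "jw")

# Hierarchical clustering, cut into a fixed number of groups.
hc <- hclust(d, method = "average")
phrases$cluster <- cutree(hc, k = 2)

# Summarize each cluster: its members and their aggregated frequency.
cluster_summary <- data.frame(
  cluster = sort(unique(phrases$cluster)),
  members = as.vector(tapply(phrases$keyword, phrases$cluster,
                             paste, collapse = ", ")),
  aggd.freq = as.vector(tapply(phrases$freq, phrases$cluster, sum)),
  stringsAsFactors = FALSE
)
cluster_summary <- cluster_summary[order(-cluster_summary$aggd.freq), ]
cluster_summary
```

Because the grouping depends only on how the strings are spelled, phrases that look alike but mean different things can land in the same cluster, so the choice of distance method and number of groups is worth checking against the output.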
Finally, we created an interactive summary of our analysis using Tableau.