March 24, 2019

Introduction

We were asked to use data to answer the question, "Which are the most valued data science skills?"

As a team, we collaborated through GitHub and Slack. Within GitHub, we forked, edited and committed, then used the blame view to review the line-by-line revision history.

Data

Data Source

We obtained the data from Kaggle.com.

Jeff Hale collected the data in October 2018, using Python, from online job listing sites in the US: LinkedIn, Indeed, SimplyHired and Monster. For each keyword, he recorded how many job listings on each platform mentioned it.

Libraries

library(tidyverse)
library(knitr)
library(kableExtra)
library(tm)
library(wordcloud)
library(memoise)
library(SnowballC)
library(RColorBrewer)
library(RCurl)
library(XML)
library(treemap)

Data Load

We read the data from a CSV file that we had uploaded to GitHub.

url <- "https://raw.githubusercontent.com/miachen410/DATA607Project3/master/DataSkills.csv"
data_skills <- read.csv(url, stringsAsFactors = FALSE)
kable(data_skills) %>% kable_styling(bootstrap_options = "striped", font_size = 7)
Keyword LinkedIn Indeed SimplyHired Monster
machine learning 5,701 3,439 2,561 2,340
analysis 5,168 3,500 2,668 3,306
statistics 4,893 2,992 2,308 2,399
computer science 4,517 2,739 2,093 1,900
communication 3,404 2,344 1,791 2,053
mathematics 2,605 1,961 1,497 1,815
visualization 1,879 1,413 1,153 1,207
AI composite 1,568 1,125 811 687
deep learning 1,310 979 675 606
NLP composite 1,212 910 660 582
software development 732 627 481 784
neural networks 671 485 421 305
data engineering 514 339 276 200
project management 476 397 330 348
software engineering 413 295 250 512
Total 35,063 23,545 17,975 19,044
add AI and artificial intelligence and subtract the overlap search term with both terms in it
AI 916 690 508 680
artificial intelligence 964 754 498 679
AI + artificial intelligence 312 319 195 672
add NLP and natural language processing and subtract the overlap search term with both terms in it
NLP 643 466 362 576
natural language processing 791 621 429 575
NLP + natural language processing 222 177 131 569
"data scientist" "[keyword]"
Oct 10, 2018

Tidy and Wrangle

Data Structure

First we looked at the structure of the dataset.

str(data_skills)
## 'data.frame':    30 obs. of  5 variables:
##  $ Keyword    : chr  "machine learning" "analysis" "statistics" "computer science" ...
##  $ LinkedIn   : chr  "5,701" "5,168" "4,893" "4,517" ...
##  $ Indeed     : chr  "3,439" "3,500" "2,992" "2,739" ...
##  $ SimplyHired: chr  "2,561" "2,668" "2,308" "2,093" ...
##  $ Monster    : chr  "2,340" "3,306" "2,399" "1,900" ...

Data Types

We removed the commas in the numbers and changed the data types from character to numeric for the following columns: LinkedIn, Indeed, SimplyHired and Monster.

data_skills$LinkedIn <- str_replace_all(data_skills$LinkedIn, ",", "") %>% as.numeric()
data_skills$Indeed <- str_replace_all(data_skills$Indeed, ",", "") %>% as.numeric()
data_skills$SimplyHired <- str_replace_all(data_skills$SimplyHired, ",", "") %>% as.numeric()
data_skills$Monster <- str_replace_all(data_skills$Monster, ",", "") %>% as.numeric()
str(data_skills)
## 'data.frame':    30 obs. of  5 variables:
##  $ Keyword    : chr  "machine learning" "analysis" "statistics" "computer science" ...
##  $ LinkedIn   : num  5701 5168 4893 4517 3404 ...
##  $ Indeed     : num  3439 3500 2992 2739 2344 ...
##  $ SimplyHired: num  2561 2668 2308 2093 1791 ...
##  $ Monster    : num  2340 3306 2399 1900 2053 ...
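
The four conversions above repeat the same two steps, so they could also be written once and applied to every count column. The following is a minimal sketch in base R, not the code used in this report: the helper name `to_count` is hypothetical, and the toy rows assume that the note rows in the CSV have empty count columns.

```r
# Toy rows copied from the raw table above; the last row mimics a note row
# whose count columns are assumed to be empty in the CSV
raw <- data.frame(
  Keyword  = c("machine learning", "analysis", "Oct 10, 2018"),
  LinkedIn = c("5,701", "5,168", ""),
  Indeed   = c("3,439", "3,500", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not in the original report): drop thousands separators,
# then convert to numeric; empty strings become NA
to_count <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Apply the helper to each count column in one step
count_cols <- c("LinkedIn", "Indeed")
converted <- raw
converted[count_cols] <- lapply(raw[count_cols], to_count)

str(converted)
```

Rows whose counts come out as NA are the ones that a filter on `!is.na(LinkedIn)` would remove.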

Data Subset

We dropped the rows we did not need: those in which LinkedIn was NA (blank and annotation rows in the source file) and the "Total" row, which is not a data science skill.

data_skills_subset <- subset(data_skills, !is.na(LinkedIn)) %>% subset(Keyword != "Total")
kable(data_skills_subset) %>% kable_styling(bootstrap_options = "striped", font_size = 7)
Keyword LinkedIn Indeed SimplyHired Monster
1 machine learning 5701 3439 2561 2340
2 analysis 5168 3500 2668 3306
3 statistics 4893 2992 2308 2399
4 computer science 4517 2739 2093 1900
5 communication 3404 2344 1791 2053
6 mathematics 2605 1961 1497 1815
7 visualization 1879 1413 1153 1207
8 AI composite 1568 1125 811 687
9 deep learning 1310 979 675 606
10 NLP composite 1212 910 660 582
11 software development 732 627 481 784
12 neural networks 671 485 421 305
13 data engineering 514 339 276 200
14 project management 476 397 330 348
15 software engineering 413 295 250 512
20 AI 916 690 508 680
21 artificial intelligence 964 754 498 679
22 AI + artificial intelligence 312 319 195 672
25 NLP 643 466 362 576
26 natural language processing 791 621 429 575
27 NLP + natural language processing 222 177 131 569

Data Mutate

We mutated the data frame to generate a new column, Total_Mention, calculated by adding the counts from the four job boards for each skill.

data_skills_2 <- data_skills_subset %>% mutate(Total_Mention = LinkedIn + Indeed + SimplyHired + Monster) 
kable(data_skills_2) %>% kable_styling(bootstrap_options = "striped", font_size = 7)
Keyword LinkedIn Indeed SimplyHired Monster Total_Mention
machine learning 5701 3439 2561 2340 14041
analysis 5168 3500 2668 3306 14642
statistics 4893 2992 2308 2399 12592
computer science 4517 2739 2093 1900 11249
communication 3404 2344 1791 2053 9592
mathematics 2605 1961 1497 1815 7878
visualization 1879 1413 1153 1207 5652
AI composite 1568 1125 811 687 4191
deep learning 1310 979 675 606 3570
NLP composite 1212 910 660 582 3364
software development 732 627 481 784 2624
neural networks 671 485 421 305 1882
data engineering 514 339 276 200 1329
project management 476 397 330 348 1551
software engineering 413 295 250 512 1470
AI 916 690 508 680 2794
artificial intelligence 964 754 498 679 2895
AI + artificial intelligence 312 319 195 672 1498
NLP 643 466 362 576 2047
natural language processing 791 621 429 575 2416
NLP + natural language processing 222 177 131 569 1099

Data Mutate

We summed the "AI" and "artificial intelligence" rows, then subtracted the overlap row (listings that contain both terms) so that those listings are counted only once. We stored the result in the "AI + artificial intelligence" row.

data_skills_2[18,2:6] <- data_skills_2[16,2:6] + data_skills_2[17,2:6] - data_skills_2[18,2:6]

We did the same for "NLP" and "natural language processing": we summed the two rows, subtracted the overlap row, and stored the result in the "NLP + natural language processing" row.

data_skills_2[21,2:6] <- data_skills_2[19,2:6] + data_skills_2[20,2:6] - data_skills_2[21,2:6]
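
The composite calculation is the inclusion-exclusion rule for two overlapping sets: |A or B| = |A| + |B| - |A and B|. As a worked check, the sketch below applies it to the AI counts from the table above; the variable names are illustrative only.

```r
# Counts per platform, in the order LinkedIn, Indeed, SimplyHired, Monster
ai_short <- c(916, 690, 508, 680)   # listings matching "AI"
ai_long  <- c(964, 754, 498, 679)   # listings matching "artificial intelligence"
ai_both  <- c(312, 319, 195, 672)   # listings matching both terms

# Union: add the two counts, then remove the listings counted twice
ai_union <- ai_short + ai_long - ai_both
ai_union
# → 1568 1125 811 687
```

These values equal the "AI composite" row that is already present in the source table, so it may be worth checking whether both the composite row and the recomputed row should remain in the tidy data.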

Data Mutate

We then removed the unnecessary rows "AI", "Artificial Intelligence", "NLP" and "Natural Language Processing".

data_skills_tidy <- data_skills_2[- c(16, 17, 19, 20), ]

We mutated the data frame to generate another new column, Percentage, calculated by dividing each skill's Total_Mention by the sum of Total_Mention across all skills.

data_skills_tidy <- data_skills_tidy %>% mutate(Percentage = Total_Mention/sum(Total_Mention))
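
As a small illustration of this step, the sketch below computes the same ratio for three totals taken from the Total_Mention column above. It assumes only the dplyr package; the data frame `totals` is a toy subset, not the full tidy table.

```r
library(dplyr)

# Three totals copied from the Total_Mention column shown earlier
totals <- data.frame(
  Keyword       = c("machine learning", "analysis", "statistics"),
  Total_Mention = c(14041, 14642, 12592),
  stringsAsFactors = FALSE
)

# Each share is the skill's total divided by the sum over all rows
shares <- totals %>% mutate(Percentage = Total_Mention / sum(Total_Mention))

shares
sum(shares$Percentage)  # the shares add up to 1
```

Note that Percentage is stored as a proportion between 0 and 1; multiplying by 100 would give a value in percent.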

Analysis and Visualization

Bar Plot

Using the ggplot2 package, we created a bar plot that shows the total frequency of each data science skill mentioned on the job boards, ranked from highest to lowest.

Treemap

Using the treemap package, we created a treemap to show each data science skill's percentage of the total.
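
The plotting code is not shown in this report, so the following is only a sketch of how a ranked bar plot and a treemap could be drawn with ggplot2 and treemap. It uses the five largest totals from the Total_Mention column; the object names, axis labels and title are assumptions.

```r
library(ggplot2)
library(treemap)

# Five largest totals from the Total_Mention column shown earlier
skills <- data.frame(
  Keyword       = c("analysis", "machine learning", "statistics",
                    "computer science", "communication"),
  Total_Mention = c(14642, 14041, 12592, 11249, 9592),
  stringsAsFactors = FALSE
)

# Bar plot: reorder() sorts the bars by total, and coord_flip() turns them
# horizontal so that the skill names stay readable
bar_plot <- ggplot(skills, aes(x = reorder(Keyword, Total_Mention),
                               y = Total_Mention)) +
  geom_col() +
  coord_flip() +
  labs(x = "Skill", y = "Total mentions")
print(bar_plot)

# Treemap: the area of each rectangle is proportional to the skill's total
treemap(skills,
        index = "Keyword",
        vSize = "Total_Mention",
        title = "Share of mentions by skill")
```

Sizing the treemap by Total_Mention or by Percentage gives the same picture, because the two columns differ only by a constant factor.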

Word Cloud

Using the wordcloud package, we drew a word cloud to show visually which keywords are most popular in our data set. We used the Data Science Tutorial channel on YouTube.com as a reference.
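
The word cloud code is not included here. One possible approach, sketched below, passes the keywords and their totals directly to `wordcloud()`; whether the original followed this route or built a corpus with tm is not stated, so treat the details as assumptions.

```r
library(wordcloud)
library(RColorBrewer)

# Keywords and totals copied from the Total_Mention column shown earlier
words <- c("analysis", "machine learning", "statistics",
           "computer science", "communication")
freq  <- c(14642, 14041, 12592, 11249, 9592)

# Word placement is random, so fix the seed for a reproducible layout
set.seed(607)

# random.order = FALSE places the most frequent keywords near the centre
wordcloud(words = words,
          freq = freq,
          min.freq = 1,
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```

If a long keyword does not fit on the plotting device, `wordcloud()` warns and leaves it out; enlarging the device or lowering the `scale` argument usually helps.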

Conclusions

Across all platforms combined, the five most frequently mentioned keywords/skills were Analysis (14,642 total mentions), Machine Learning (14,041), Statistics (12,592), Computer Science (11,249) and Communication (9,592). Their order differed from platform to platform.

In LinkedIn, the rank was as follows: Machine Learning, Analysis, Statistics, Computer Science and Communication.

In Indeed and SimplyHired, the rank was as follows: Analysis, Machine Learning, Statistics, Computer Science and Communication.

In Monster, the rank was as follows: Analysis, Statistics, Machine Learning, Communication and Computer Science.
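
The per-platform rankings can be reproduced from the counts in the table above. The sketch below uses base R only; the data frame holds just the five leading skills, and the object names are illustrative.

```r
# Counts for the five leading skills, copied from the table shown earlier
skills <- data.frame(
  Keyword     = c("machine learning", "analysis", "statistics",
                  "computer science", "communication"),
  LinkedIn    = c(5701, 5168, 4893, 4517, 3404),
  Indeed      = c(3439, 3500, 2992, 2739, 2344),
  SimplyHired = c(2561, 2668, 2308, 2093, 1791),
  Monster     = c(2340, 3306, 2399, 1900, 2053),
  stringsAsFactors = FALSE
)

platforms <- c("LinkedIn", "Indeed", "SimplyHired", "Monster")

# For each platform, list the keywords from most to least mentioned
rankings <- lapply(skills[platforms], function(counts) {
  skills$Keyword[order(counts, decreasing = TRUE)]
})

rankings$Monster
# → "analysis" "statistics" "machine learning" "communication" "computer science"
```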