Introduction

This project, Data 607 Project 3, attempts to identify the specific skills required to work effectively as a Data Scientist. The task is to explore a variety of publicly available data and determine which skills rank the highest. To narrow the focus of this study, the analysis is limited to specific software tools and computing skills. See Data and Variables of Measurement for additional information.

Team Members

Mario Pena

Ajay Arora

Initial Communication / Establishing Work Activity

We established communication over #Slack and spoke over the phone to introduce ourselves properly. Furthermore, we divided the work effort in half. The work effort consisted of Data Preparation, Data Cleaning, Data Analysis, and Conclusion.

Ongoing Communication / Artifact Location

We both decided to communicate via email and/or #Slack. In addition, we established a Project 3 group folder on GitHub.

https://github.com/AjayArora35/Data-607-Group-Project-3

https://data607fall2019.slack.com/

Data

The data was acquired from Kaggle (https://www.kaggle.com/discdiver/datasets). The software tool skills and computing skills data were gathered from the job posting sites listed below. The focus of the study is on the specific software tool skills and computing skills described next.

LinkedIn, Indeed, Monster, SimplyHired, AngelList

Preparing Environment

#Loading Libraries
library(DBI)

## Warning: package 'DBI' was built under R version 3.5.3

library("knitr")
library("tidyverse")

## Warning: package 'tidyverse' was built under R version 3.5.3

## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1       v purrr   0.3.0  
## v tibble  2.0.1       v dplyr   0.8.0.1
## v tidyr   0.8.2       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0

## Warning: package 'ggplot2' was built under R version 3.5.3

## Warning: package 'readr' was built under R version 3.5.3

## Warning: package 'forcats' was built under R version 3.5.3

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library("tidyr")
library("dplyr")
library("stringr")
library("plotly")

## Warning: package 'plotly' was built under R version 3.5.3

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library("htmlTable")

## Warning: package 'htmlTable' was built under R version 3.5.3

library("ggplot2")
library("stats")
library("scales")

## Warning: package 'scales' was built under R version 3.5.3

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library("viridis")

## Warning: package 'viridis' was built under R version 3.5.3

## Loading required package: viridisLite

## 
## Attaching package: 'viridis'

## The following object is masked from 'package:scales':
## 
##     viridis_pal

library("wordcloud")

## Warning: package 'wordcloud' was built under R version 3.5.3

## Loading required package: RColorBrewer
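
The startup messages and version warnings above can be silenced when loading packages. The following is a minimal sketch, not the setup used for this report; it uses the base packages `stats` and `utils` as stand-ins so that it runs on any R installation. In an R Markdown document, the chunk options `message = FALSE` and `warning = FALSE` have a similar effect on the rendered output.

```r
# Stand-in package names so the sketch runs anywhere;
# substitute the packages listed above (tidyverse, plotly, ...).
pkgs <- c("stats", "utils")

for (pkg in pkgs) {
  # character.only = TRUE lets library() take the package name from a variable.
  suppressPackageStartupMessages(
    suppressWarnings(library(pkg, character.only = TRUE))
  )
}
```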

Variables of Measurement

The following software tools skills are analyzed to determine their importance and/or ranking.

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 3.5.3

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

softskills = matrix( 
   c("Python", "SQL", "R", "Java", "Hadoop", "Spark", "Excel", "Tableau", "AWS", "SAS", "Scala", "C++", "Hive", "Javascript", "NoSQL", "Azure", "TensorFlow", "C", "PowerPoint", "Matlab", "Docker", "Git", "C#", "MySQL", "Ruby", "Microsoft", "Office", "SPSS", "MongoDB", "Pig", "Pandas", "Hbase", "Cassandra", "Numpy", "Perl", "Power", "BI", "Node", "PostgreSQL", "D3", "Keras", "PHP", "Redis", "Alteryx", "Jupyter", "Stata", "Caffe", "PyTorch"),  
   nrow=8,               
   ncol=6,              
   byrow = TRUE)         

ssdf <-  as.data.frame(t(softskills))
ssdf %>% kable() %>%  kable_styling()

V1	V2	V3	V4	V5	V6	V7	V8
Python	Excel	Hive	PowerPoint	Ruby	Pandas	BI	Redis
SQL	Tableau	Javascript	Matlab	Microsoft	Hbase	Node	Alteryx
R	AWS	NoSQL	Docker	Office	Cassandra	PostgreSQL	Jupyter
Java	SAS	Azure	Git	SPSS	Numpy	D3	Stata
Hadoop	Scala	TensorFlow	C#	MongoDB	Perl	Keras	Caffe
Spark	C++	C	MySQL	Pig	Power	PHP	PyTorch
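
The layout of the table above follows from how the matrix is built: `matrix(..., byrow = TRUE)` fills one row at a time, and `t()` then turns each row into a column. A small sketch with a hypothetical six-item vector (not the project data) shows the same effect.

```r
# Hypothetical six-item vector, used only to illustrate the layout.
items <- c("a", "b", "c", "d", "e", "f")

# Fill a 2 x 3 matrix row by row: row 1 is a b c, row 2 is d e f.
m <- matrix(items, nrow = 2, ncol = 3, byrow = TRUE)

# Transposing gives a 3 x 2 matrix whose first column is a b c.
tm <- t(m)

# Columns are named V1, V2 by default, as in the table above.
df <- as.data.frame(tm, stringsAsFactors = FALSE)
print(df)
```

Applied to the 8 x 6 skills matrix, the transpose yields 6 rows and 8 columns, so the skills read down each column in their original order.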

The following computing skills are analyzed to determine their importance and/or ranking.

compskills <- matrix(c("Machine Learning", "Analysis", "Statistics", "Computer Science", "Communication", "Mathematics", "Visualization", "AI Composite", "Deep Learning", "NLP Composite", "Software Development", "Neural Networks", "Data Engineering", "Project Management", "Software Engineering"),
  nrow=5,               
  ncol=3,              
  byrow = TRUE)

ssdf <-  as.data.frame(t(compskills))
ssdf %>% kable() %>%  kable_styling()

V1	V2	V3	V4	V5
Machine Learning	Computer Science	Visualization	NLP Composite	Data Engineering
Analysis	Communication	AI Composite	Software Development	Project Management
Statistics	Mathematics	Deep Learning	Neural Networks	Software Engineering

Data Preparation

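Read the software skills, computing skills, and education levels tables from the project's MySQL database. The commented-out lines are an alternative that reads the software and computing skills CSV files from GitHub.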

#data <- read.csv(file="https://raw.githubusercontent.com/AjayArora35/Data-607-Group-Project-3/master/Data%20Science%20Software%20Skills.csv", header=TRUE, stringsAsFactors = FALSE)

#data2C <- read.csv(file="https://raw.githubusercontent.com/AjayArora35/Data-607-Group-Project-3/master/Data%20Science%20Computing%20Skills.csv", header=TRUE, stringsAsFactors = FALSE)

cn <- dbConnect(drv      = RMySQL::MySQL(), 
                username = "admin", 
                password = "Data__607", 
                host     = "database-1.cxdov2mcmzlo.us-east-2.rds.amazonaws.com", 
                port     = 3306, 
                dbname   = "data607project3")


data <- dbGetQuery(cn, "SELECT * FROM softwareskills")

data2C <- dbGetQuery(cn, "SELECT * FROM computingskills")

data3E<- dbGetQuery(cn, "SELECT * FROM educationlevels")
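
Connection details such as the password can be kept out of the script by reading them from environment variables. The sketch below is one possible approach and not what the project used; the variable names (`DATA607_DB_USER` and so on) and the fallback values are hypothetical. The connection call is left commented out so the example runs without a database.

```r
# Hypothetical environment variable names; set them in ~/.Renviron or the shell.
get_db_config <- function() {
  list(
    username = Sys.getenv("DATA607_DB_USER", unset = "admin"),
    password = Sys.getenv("DATA607_DB_PASSWORD", unset = ""),
    host     = Sys.getenv("DATA607_DB_HOST", unset = "localhost"),
    port     = as.integer(Sys.getenv("DATA607_DB_PORT", unset = "3306")),
    dbname   = Sys.getenv("DATA607_DB_NAME", unset = "data607project3")
  )
}

cfg <- get_db_config()

# Pass the settings to dbConnect() in the same way as the call above:
# cn <- do.call(DBI::dbConnect, c(list(drv = RMySQL::MySQL()), cfg))
# ... run queries ...
# DBI::dbDisconnect(cn)
```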

Sample data from Software Skills and Computing Skills

head(data) %>% kable() %>%  kable_styling()

id	Keyword	LinkedIn	Indeed	Monster	SimplyHired	AngelList
1	Python	10534	7067	5353	5564	962
2	SQL	7359	5504	8970	4475	559
3	R	6454	4617	3421	3709	332
4	Java	4458	3366	4359	2549	527
5	Hadoop	4388	2982	2858	2240	194
6	Spark	3686	2978	2453	2262	259

head(data2C) %>% kable() %>%  kable_styling()

id	Keyword	LinkedIn	Indeed	SimplyHired	Monster
1	machine learning	5701	3439	2561	2340
2	analysis	5168	3500	2668	3306
3	statistics	4893	2992	2308	2399
4	computer science	4517	2739	2093	1900
5	communication	3404	2344	1791	2053
6	mathematics	2605	1961	1497	1815

Data Cleaning

Remove thousands separators (commas), convert the counts to numeric, and replace missing values with 0

data$LinkedIn <- as.numeric(gsub(",","",data$LinkedIn))
data$Indeed <- as.numeric(gsub(",","",data$Indeed))
data$Monster <- as.numeric(gsub(",","",data$Monster))
data$SimplyHired <- as.numeric(gsub(",","",data$SimplyHired))
data$AngelList <- as.numeric(gsub(",","",data$AngelList))
data[is.na(data)] <- 0

data2C$LinkedIn <- as.numeric(gsub(",","", data2C$LinkedIn))
data2C$Indeed <- as.numeric(gsub(",","", data2C$Indeed))
data2C$SimplyHired <- as.numeric(gsub(",","", data2C$SimplyHired))
data2C$Monster <- as.numeric(gsub(",","", data2C$Monster))
data2C[is.na(data2C)] <- 0


head(data) %>% kable() %>%  kable_styling()

id	Keyword	LinkedIn	Indeed	Monster	SimplyHired	AngelList
1	Python	10534	7067	5353	5564	962
2	SQL	7359	5504	8970	4475	559
3	R	6454	4617	3421	3709	332
4	Java	4458	3366	4359	2549	527
5	Hadoop	4388	2982	2858	2240	194
6	Spark	3686	2978	2453	2262	259

head(data2C) %>% kable() %>%  kable_styling()

id	Keyword	LinkedIn	Indeed	SimplyHired	Monster
1	machine learning	5701	3439	2561	2340
2	analysis	5168	3500	2668	3306
3	statistics	4893	2992	2308	2399
4	computer science	4517	2739	2093	1900
5	communication	3404	2344	1791	2053
6	mathematics	2605	1961	1497	1815
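
The cleaning steps repeat the same conversion for every site column, so they can be wrapped in a small helper. This is a sketch only: `clean_counts` is a hypothetical name, and the raw rows below are made up to show the comma and missing-value cases. Unlike `data[is.na(data)] <- 0`, it only touches the columns it is given.

```r
# Convert count columns stored as text (e.g. "10,534") to numbers.
# Blank or missing entries become NA and are then replaced with 0.
clean_counts <- function(df, cols) {
  for (col in cols) {
    values <- suppressWarnings(as.numeric(gsub(",", "", df[[col]])))
    values[is.na(values)] <- 0
    df[[col]] <- values
  }
  df
}

# Hypothetical raw rows for illustration only.
raw <- data.frame(
  Keyword  = c("Python", "SQL", "C"),
  LinkedIn = c("10,534", "7,359", ""),
  Indeed   = c("7,067", "5,504", NA),
  stringsAsFactors = FALSE
)

cleaned <- clean_counts(raw, c("LinkedIn", "Indeed"))
print(cleaned)
```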

Analysis – Software Tools Skills

Software Tools Respondents Distribution

stdist <- data %>% 
  mutate(LinkedInTotal = sum(data$LinkedIn),
         IndeedTotal = sum(data$Indeed),
         MonsterTotal = sum(data$Monster),
         SimplyHiredTotal = sum(data$SimplyHired),
         AngelListTotal = sum(data$AngelList)
         )

stvector <- c(stdist[1,'LinkedInTotal'], stdist[1,'IndeedTotal'], stdist[1,'MonsterTotal'], stdist[1,'SimplyHiredTotal'], stdist[1,'AngelListTotal'])

barplot(stvector,
main = "Software Tools Respondents Distribution",
xlab = "Posting Sites",
ylab = "Respondents",
names.arg = c("LinkedIn", "Indeed", "Monster", "SimplyHired", "AngelList"),
col = viridis(5),
horiz = FALSE)
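
The per-site totals can also be computed in one step with `colSums()`. The sketch below uses only the six sample rows shown earlier, so its totals are smaller than those of the full table; `sample_data` and `site_totals` are names introduced here for illustration.

```r
# First six rows of the software skills table shown earlier (not the full data).
sample_data <- data.frame(
  Keyword     = c("Python", "SQL", "R", "Java", "Hadoop", "Spark"),
  LinkedIn    = c(10534, 7359, 6454, 4458, 4388, 3686),
  Indeed      = c(7067, 5504, 4617, 3366, 2982, 2978),
  Monster     = c(5353, 8970, 3421, 4359, 2858, 2453),
  SimplyHired = c(5564, 4475, 3709, 2549, 2240, 2262),
  AngelList   = c(962, 559, 332, 527, 194, 259),
  stringsAsFactors = FALSE
)

sites <- c("LinkedIn", "Indeed", "Monster", "SimplyHired", "AngelList")

# colSums() returns a named numeric vector with one total per site.
site_totals <- colSums(sample_data[, sites])
print(site_totals)

# A named vector can be passed straight to barplot(); the names label the bars.
# barplot(site_totals, xlab = "Posting Sites", ylab = "Respondents")
```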

LinkedIn Software Tool Favorability Ranking

data2 <- data %>% 
  mutate(total = sum(data$LinkedIn),
         Favorability = percent(data$LinkedIn/total*100, accuracy = .01, scale=1))

data2 <- data2[order(-data2$LinkedIn),]

data2$Ranked <- 1:length(data2$LinkedIn)

         
select (data2, Keyword, LinkedIn, Favorability, Ranked)

##             Keyword LinkedIn Favorability Ranked
## 1            Python    10534       14.08%      1
## 2               SQL     7359        9.84%      2
## 3                 R     6454        8.63%      3
## 4              Java     4458        5.96%      4
## 5            Hadoop     4388        5.87%      5
## 6             Spark     3686        4.93%      6
## 7             Excel     2727        3.64%      7
## 8           Tableau     2636        3.52%      8
## 9               AWS     2621        3.50%      9
## 10              SAS     2589        3.46%     10
## 11            Scala     1980        2.65%     11
## 12              C++     1977        2.64%     12
## 13             Hive     1859        2.48%     13
## 14       Javascript     1564        2.09%     14
## 15            NoSQL     1417        1.89%     15
## 16            Azure     1332        1.78%     16
## 17       TensorFlow     1130        1.51%     17
## 19       PowerPoint     1061        1.42%     18
## 20           Matlab      986        1.32%     19
## 21           Docker      951        1.27%     20
## 22              Git      931        1.24%     21
## 23               C#      839        1.12%     22
## 24            MySQL      780        1.04%     23
## 25             Ruby      736        0.98%     24
## 26 Microsoft Office      711        0.95%     25
## 27             SPSS      660        0.88%     26
## 28          MongoDB      653        0.87%     27
## 29              Pig      639        0.85%     28
## 30           Pandas      630        0.84%     29
## 31            Hbase      629        0.84%     30
## 32        Cassandra      610        0.82%     31
## 33            Numpy      585        0.78%     32
## 34             Perl      521        0.70%     33
## 35         Power BI      494        0.66%     34
## 36             Node      476        0.64%     35
## 37       PostgreSQL      438        0.59%     36
## 38               D3      436        0.58%     37
## 39            Keras      403        0.54%     38
## 40              PHP      382        0.51%     39
## 41            Redis      274        0.37%     40
## 42          Alteryx      270        0.36%     41
## 43          Jupyter      270        0.36%     42
## 44            Stata      256        0.34%     43
## 45            Caffe      243        0.32%     44
## 46          PyTorch      241        0.32%     45
## 18                C        0        0.00%     46
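
The ranking steps used for each site (share of the site total, sort in descending order, number the rows) can be written once as a function and reused. This is a base R sketch under stated assumptions: `rank_by_site` is a hypothetical helper, the percentages are formatted with `sprintf()` rather than `scales::percent()`, and the four sample rows mean the shares differ from those in the full tables.

```r
# Rank keywords for one site by their share of that site's total count.
rank_by_site <- function(df, site) {
  total <- sum(df[[site]])
  out <- data.frame(
    Keyword = df$Keyword,
    Count   = df[[site]],
    stringsAsFactors = FALSE
  )
  # Share of the site total, formatted as a percentage with two decimals.
  out$Favorability <- sprintf("%.2f%%", 100 * out$Count / total)
  # Sort from highest to lowest count and number the rows.
  out <- out[order(-out$Count), ]
  out$Ranked <- seq_len(nrow(out))
  rownames(out) <- NULL
  out
}

# Four sample rows taken from the table shown earlier (not the full data).
sample_data <- data.frame(
  Keyword  = c("Python", "SQL", "R", "Java"),
  LinkedIn = c(10534, 7359, 6454, 4458),
  Monster  = c(5353, 8970, 3421, 4359),
  stringsAsFactors = FALSE
)

print(rank_by_site(sample_data, "LinkedIn"))
print(rank_by_site(sample_data, "Monster"))
```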

Indeed Software Tool Favorability Ranking

data3 <- data %>% 
  mutate(total = sum(data$Indeed),
         Favorability = percent(data$Indeed/total*100, accuracy = .01, scale=1))

data3 <- data3[order(-data3$Indeed),]

data3$Ranked <- 1:length(data3$Indeed)


select (data3, Keyword, Indeed, Favorability, Ranked)

##             Keyword Indeed Favorability Ranked
## 1            Python   7067       12.43%      1
## 2               SQL   5504        9.68%      2
## 3                 R   4617        8.12%      3
## 4              Java   3366        5.92%      4
## 5            Hadoop   2982        5.24%      5
## 6             Spark   2978        5.24%      6
## 9               AWS   2346        4.13%      7
## 7             Excel   2287        4.02%      8
## 8           Tableau   2183        3.84%      9
## 10              SAS   1744        3.07%     10
## 12              C++   1567        2.76%     11
## 13             Hive   1534        2.70%     12
## 11            Scala   1506        2.65%     13
## 16            Azure   1205        2.12%     14
## 14       Javascript   1192        2.10%     15
## 15            NoSQL   1103        1.94%     16
## 17       TensorFlow    926        1.63%     17
## 19       PowerPoint    832        1.46%     18
## 20           Matlab    828        1.46%     19
## 22              Git    813        1.43%     20
## 21           Docker    792        1.39%     21
## 24            MySQL    659        1.16%     22
## 23               C#    645        1.13%     23
## 28          MongoDB    584        1.03%     24
## 26 Microsoft Office    550        0.97%     25
## 25             Ruby    535        0.94%     26
## 30           Pandas    525        0.92%     27
## 29              Pig    508        0.89%     28
## 31            Hbase    500        0.88%     29
## 32        Cassandra    490        0.86%     30
## 34             Perl    473        0.83%     31
## 27             SPSS    470        0.83%     32
## 33            Numpy    433        0.76%     33
## 35         Power BI    400        0.70%     34
## 37       PostgreSQL    376        0.66%     35
## 36             Node    360        0.63%     36
## 40              PHP    320        0.56%     37
## 39            Keras    302        0.53%     38
## 42          Alteryx    262        0.46%     39
## 43          Jupyter    198        0.35%     40
## 41            Redis    194        0.34%     41
## 45            Caffe    185        0.33%     42
## 46          PyTorch    178        0.31%     43
## 38               D3    176        0.31%     44
## 44            Stata    172        0.30%     45
## 18                C      0        0.00%     46

Monster Software Tool Favorability Ranking

data4 <- data %>% 
  mutate(total = sum(data$Monster),
         Favorability = percent(data$Monster/total*100, accuracy = .01, scale=1))

data4 <- data4[order(-data4$Monster),]

data4$Ranked <- 1:length(data4$Monster)

select (data4, Keyword, Monster, Favorability, Ranked)

##             Keyword Monster Favorability Ranked
## 2               SQL    8970       17.65%      1
## 1            Python    5353       10.53%      2
## 4              Java    4359        8.58%      3
## 3                 R    3421        6.73%      4
## 5            Hadoop    2858        5.62%      5
## 6             Spark    2453        4.83%      6
## 8           Tableau    1826        3.59%      7
## 9               AWS    1682        3.31%      8
## 7             Excel    1674        3.29%      9
## 15            NoSQL    1562        3.07%     10
## 14       Javascript    1436        2.82%     11
## 11            Scala    1206        2.37%     12
## 10              SAS    1153        2.27%     13
## 12              C++    1094        2.15%     14
## 22              Git     814        1.60%     15
## 16            Azure     752        1.48%     16
## 19       PowerPoint     706        1.39%     17
## 26 Microsoft Office     661        1.30%     18
## 17       TensorFlow     643        1.26%     19
## 20           Matlab     625        1.23%     20
## 23               C#     610        1.20%     21
## 21           Docker     583        1.15%     22
## 24            MySQL     501        0.99%     23
## 31            Hbase     495        0.97%     24
## 29              Pig     477        0.94%     25
## 37       PostgreSQL     470        0.92%     26
## 32        Cassandra     461        0.91%     27
## 27             SPSS     416        0.82%     28
## 28          MongoDB     401        0.79%     29
## 34             Perl     386        0.76%     30
## 30           Pandas     333        0.66%     31
## 33            Numpy     311        0.61%     32
## 25             Ruby     302        0.59%     33
## 35         Power BI     275        0.54%     34
## 39            Keras     211        0.42%     35
## 42          Alteryx     201        0.40%     36
## 38               D3     173        0.34%     37
## 46          PyTorch     161        0.32%     38
## 45            Caffe     155        0.30%     39
## 43          Jupyter     139        0.27%     40
## 40              PHP     133        0.26%     41
## 41            Redis     131        0.26%     42
## 44            Stata     124        0.24%     43
## 36             Node      95        0.19%     44
## 13             Hive      40        0.08%     45
## 18                C       0        0.00%     46

SimplyHired Software Tool Favorability Ranking

data5 <- data %>% 
  mutate(total = sum(data$SimplyHired),
         Favorability = percent(data$SimplyHired/total*100, accuracy = .01, scale=1))

data5 <- data5[order(-data5$SimplyHired),]

data5$Ranked <- 1:length(data5$SimplyHired)


select (data5, Keyword, SimplyHired, Favorability, Ranked)

##             Keyword SimplyHired Favorability Ranked
## 1            Python        5564       12.40%      1
## 2               SQL        4475        9.98%      2
## 3                 R        3709        8.27%      3
## 4              Java        2549        5.68%      4
## 6             Spark        2262        5.04%      5
## 5            Hadoop        2240        4.99%      6
## 7             Excel        1862        4.15%      7
## 9               AWS        1793        4.00%      8
## 8           Tableau        1732        3.86%      9
## 10              SAS        1417        3.16%     10
## 11            Scala        1175        2.62%     11
## 13             Hive        1161        2.59%     12
## 12              C++        1160        2.59%     13
## 14       Javascript         992        2.21%     14
## 15            NoSQL         942        2.10%     15
## 16            Azure         800        1.78%     16
## 19       PowerPoint         712        1.59%     17
## 20           Matlab         691        1.54%     18
## 17       TensorFlow         686        1.53%     19
## 22              Git         669        1.49%     20
## 21           Docker         603        1.34%     21
## 24            MySQL         524        1.17%     22
## 23               C#         497        1.11%     23
## 30           Pandas         450        1.00%     24
## 27             SPSS         428        0.95%     25
## 26 Microsoft Office         427        0.95%     26
## 28          MongoDB         423        0.94%     27
## 29              Pig         422        0.94%     28
## 31            Hbase         419        0.93%     29
## 25             Ruby         411        0.92%     30
## 32        Cassandra         408        0.91%     31
## 33            Numpy         365        0.81%     32
## 35         Power BI         347        0.77%     33
## 34             Perl         332        0.74%     34
## 37       PostgreSQL         324        0.72%     35
## 36             Node         317        0.71%     36
## 39            Keras         245        0.55%     37
## 40              PHP         212        0.47%     38
## 43          Jupyter         173        0.39%     39
## 42          Alteryx         170        0.38%     40
## 46          PyTorch         165        0.37%     41
## 41            Redis         158        0.35%     42
## 44            Stata         155        0.35%     43
## 45            Caffe         149        0.33%     44
## 38               D3         141        0.31%     45
## 18                C           0        0.00%     46

AngelList Software Tool Favorability Ranking

data6 <- data %>% 
  mutate(total = sum(data$AngelList),
         Favorability = percent(data$AngelList/total*100, accuracy = .01, scale=1))


data6 <- data6[order(-data6$AngelList),]

data6$Ranked <- 1:length(data6$AngelList)


select (data6, Keyword, AngelList, Favorability, Ranked)

##             Keyword AngelList Favorability Ranked
## 1            Python       962       10.85%      1
## 14       Javascript       635        7.16%      2
## 7             Excel       627        7.07%      3
## 9               AWS       571        6.44%      4
## 2               SQL       559        6.30%      5
## 4              Java       527        5.94%      6
## 12              C++       459        5.18%      7
## 23               C#       459        5.18%      8
## 3                 R       332        3.74%      9
## 25             Ruby       262        2.95%     10
## 6             Spark       259        2.92%     11
## 22              Git       232        2.62%     12
## 21           Docker       228        2.57%     13
## 24            MySQL       226        2.55%     14
## 15            NoSQL       221        2.49%     15
## 5            Hadoop       194        2.19%     16
## 37       PostgreSQL       194        2.19%     17
## 11            Scala       182        2.05%     18
## 28          MongoDB       159        1.79%     19
## 17       TensorFlow       149        1.68%     20
## 41            Redis       136        1.53%     21
## 40              PHP       113        1.27%     22
## 36             Node       110        1.24%     23
## 13             Hive        91        1.03%     24
## 16            Azure        90        1.01%     25
## 32        Cassandra        90        1.01%     26
## 33            Numpy        88        0.99%     27
## 30           Pandas        86        0.97%     28
## 34             Perl        86        0.97%     29
## 8           Tableau        76        0.86%     30
## 38               D3        69        0.78%     31
## 20           Matlab        68        0.77%     32
## 39            Keras        50        0.56%     33
## 10              SAS        37        0.42%     34
## 26 Microsoft Office        37        0.42%     35
## 45            Caffe        35        0.39%     36
## 31            Hbase        33        0.37%     37
## 46          PyTorch        32        0.36%     38
## 43          Jupyter        22        0.25%     39
## 27             SPSS        18        0.20%     40
## 29              Pig        17        0.19%     41
## 19       PowerPoint        16        0.18%     42
## 35         Power BI        16        0.18%     43
## 44            Stata        14        0.16%     44
## 18                C         0        0.00%     45
## 42          Alteryx         0        0.00%     46
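
In the AngelList table, C++ and C# both have 459 listings but receive ranks 7 and 8, because the rank is simply the row number after sorting. If tied keywords should share a rank, `rank()` with `ties.method = "min"` is one option. The sketch below uses three counts from the table above and is an alternative, not the method used in this report.

```r
# Three AngelList counts taken from the table above.
counts <- c("C++" = 459, "C#" = 459, "R" = 332)

# Row-number ranking, as in the report: ties get consecutive ranks.
sequential <- seq_along(counts)

# Negating the counts ranks the largest value first;
# ties.method = "min" gives tied values the same (lowest) rank.
tied <- rank(-counts, ties.method = "min")

print(sequential)
print(tied)
```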

What is the most relevant software tool identified by LinkedIn?

grid1 <- ggplot(data = data, aes(x = reorder(Keyword, LinkedIn), y = LinkedIn, fill = reorder(Keyword, LinkedIn))) +
  theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
  geom_bar(stat = "identity") +
  # Map fill to the ranked keyword and let the scale supply the viridis colors
  scale_fill_viridis_d() +
  # White label boxes keep the counts legible on dark bars
  geom_label(aes(label = LinkedIn), fill = "white", position = position_dodge(width = 0.5), size = 2.4, label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE) +
  labs(title = "LinkedIn", x = "Software Tools", y = "Tool Relevance") +
  coord_flip()
grid1

What is the most relevant software tool identified by Indeed?

grid2 <- ggplot(data = data, aes(x = reorder(Keyword, Indeed), y = Indeed, fill = reorder(Keyword, Indeed))) +
  theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
  geom_bar(stat = "identity") +
  # Map fill to the ranked keyword and let the scale supply the magma colors
  scale_fill_viridis_d(option = "magma") +
  # White label boxes keep the counts legible on dark bars
  geom_label(aes(label = Indeed), fill = "white", position = position_dodge(width = 0.5), size = 2.4, label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE) +
  labs(title = "Indeed", x = "Software Tools", y = "Tool Relevance") +
  coord_flip()
grid2

What is the most relevant software tool identified by Monster?

grid3 <- ggplot(data = data, aes(x = reorder(Keyword, Monster), y = Monster, fill = Keyword)) +
  theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
  geom_bar(stat = "identity") +
  geom_label(aes(label = Monster), position = position_dodge(width = 0.5), size = 2.4, label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE) +
  labs(title = "Monster", x = "Software Tools", y = "Tool Relevance") +
  coord_flip()
grid3

What is the most relevant software tool identified by SimplyHired?

grid4 <- ggplot(data = data, aes(x = reorder(Keyword, SimplyHired), y = SimplyHired, fill = reorder(Keyword, SimplyHired))) +
  theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
  geom_bar(stat = "identity") +
  # Map fill to the ranked keyword and let the scale supply the plasma colors
  scale_fill_viridis_d(option = "plasma") +
  # White label boxes keep the counts legible on dark bars
  geom_label(aes(label = SimplyHired), fill = "white", position = position_dodge(width = 0.5), size = 2.4, label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE) +
  labs(title = "SimplyHired", x = "Software Tools", y = "Tool Relevance") +
  coord_flip()
grid4

What is the most relevant software tool identified by AngelList?

grid5 <- ggplot(data = data, aes(x = reorder(Keyword, AngelList), y = AngelList, fill = reorder(Keyword, AngelList))) +
  theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
  geom_bar(stat = "identity") +
  # Map fill to the ranked keyword and let the scale supply the viridis colors
  scale_fill_viridis_d() +
  # White label boxes keep the counts legible on dark bars
  geom_label(aes(label = AngelList), fill = "white", position = position_dodge(width = 0.5), size = 2.4, label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE) +
  labs(title = "AngelList", x = "Software Tools", y = "Tool Relevance") +
  coord_flip()
grid5

require(gridExtra)

## Loading required package: gridExtra

## Warning: package 'gridExtra' was built under R version 3.5.3

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

grid.arrange(grid1, grid2, ncol=2)

grid.arrange(grid3, grid4, ncol=2)

grid.arrange(grid5, ncol=2)

Analysis – Computing Skills

Please note that the NLP composite and AI composite rows were removed from the rankings and graphs. We believe they are category labels that group other skills in the dataset rather than skills counted in their own right, which is why they contain 0 total observations.
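
One way to confirm which rows carry no observations before dropping them is to sum each row across the site columns. The sketch below is illustrative: the machine learning and deep learning counts come from the tables in this report, while the zero counts for the two composite rows are assumed from the note above.

```r
# Two skills from the report plus the two composite rows (zeros assumed).
skills <- data.frame(
  Keyword     = c("machine learning", "AI composite", "deep learning", "NLP composite"),
  LinkedIn    = c(5701, 0, 1310, 0),
  Indeed      = c(3439, 0, 979, 0),
  SimplyHired = c(2561, 0, 675, 0),
  Monster     = c(2340, 0, 606, 0),
  stringsAsFactors = FALSE
)

sites <- c("LinkedIn", "Indeed", "SimplyHired", "Monster")

# Total observations per keyword across all sites.
row_totals <- rowSums(skills[, sites])

# Keywords with no observations on any site, and the rows that remain.
empty_rows <- skills$Keyword[row_totals == 0]
kept <- skills[row_totals > 0, ]

print(empty_rows)
print(kept)
```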

Computing Skills Respondents Distribution

# Column totals for the computing skills table
csdist <- data2C %>% 
  mutate(LinkedInTotal = sum(data2C$LinkedIn),
         IndeedTotal = sum(data2C$Indeed),
         MonsterTotal = sum(data2C$Monster),
         SimplyHiredTotal = sum(data2C$SimplyHired)
         )

csvector <- c(csdist[1,'LinkedInTotal'], csdist[1,'IndeedTotal'], csdist[1,'MonsterTotal'], csdist[1,'SimplyHiredTotal'])

barplot(csvector,
main = "Computing Skills Respondents Distribution",
xlab = "Posting Sites",
ylab = "Respondents",
names.arg = c("LinkedIn", "Indeed", "Monster", "SimplyHired"),
col = plasma(4),
horiz = FALSE)

LinkedIn Computing Skill Favorability Ranking

data22 <- data2C %>% 
  mutate(total = sum(data2C$LinkedIn),
  Favorability = percent(data2C$LinkedIn/total*100, accuracy = .01, scale=1)) %>% filter (Keyword != "NLP composite" & Keyword != "AI composite")


data22 <- data22[order(-data22$LinkedIn),]

data22$Ranked <- 1:length(data22$LinkedIn)


select (data22, Keyword, LinkedIn, Favorability, Ranked)

##                 Keyword LinkedIn Favorability Ranked
## 1      machine learning     5701       17.66%      1
## 2              analysis     5168       16.01%      2
## 3            statistics     4893       15.16%      3
## 4      computer science     4517       13.99%      4
## 5         communication     3404       10.54%      5
## 6           mathematics     2605        8.07%      6
## 7         visualization     1879        5.82%      7
## 8         deep learning     1310        4.06%      8
## 9  software development      732        2.27%      9
## 10      neural networks      671        2.08%     10
## 11     data engineering      514        1.59%     11
## 12   project management      476        1.47%     12
## 13 software engineering      413        1.28%     13

Indeed Computing Skill Favorability Ranking

data23 <- data2C %>% 
   mutate(total = sum(data2C$Indeed),
   Favorability = percent(data2C$Indeed/total*100, accuracy = .01, scale=1)) %>% filter (Keyword != "NLP composite" & Keyword != "AI composite")

data23 <- data23[order(-data23$Indeed),]

data23$Ranked <- 1:length(data23$Indeed)


select (data23, Keyword, Indeed, Favorability, Ranked)

##                 Keyword Indeed Favorability Ranked
## 2              analysis   3500       16.53%      1
## 1      machine learning   3439       16.24%      2
## 3            statistics   2992       14.13%      3
## 4      computer science   2739       12.94%      4
## 5         communication   2344       11.07%      5
## 6           mathematics   1961        9.26%      6
## 7         visualization   1413        6.67%      7
## 8         deep learning    979        4.62%      8
## 9  software development    627        2.96%      9
## 10      neural networks    485        2.29%     10
## 12   project management    397        1.88%     11
## 13 software engineering    295        1.39%     12
## 11     data engineering      0        0.00%     13

Monster Computing Skill Favorability Ranking

data24 <- data2C %>% 
  mutate(total = sum(data2C$Monster),
  Favorability = percent(data2C$Monster/total*100, accuracy = .01, scale=1)) %>% filter (Keyword != "NLP composite" & Keyword != "AI composite")

data24 <- data24[order(-data24$Monster),]

data24$Ranked <- 1:length(data24$Monster)

select (data24, Keyword, Monster, Favorability, Ranked)

##                 Keyword Monster Favorability Ranked
## 2              analysis    3306       18.60%      1
## 3            statistics    2399       13.50%      2
## 1      machine learning    2340       13.16%      3
## 5         communication    2053       11.55%      4
## 4      computer science    1900       10.69%      5
## 6           mathematics    1815       10.21%      6
## 7         visualization    1207        6.79%      7
## 9  software development     784        4.41%      8
## 8         deep learning     606        3.41%      9
## 13 software engineering     512        2.88%     10
## 12   project management     348        1.96%     11
## 10      neural networks     305        1.72%     12
## 11     data engineering     200        1.13%     13

SimplyHired Computing Skill Favorability Ranking

data25 <- data2C %>% 
  mutate(total = sum(data2C$SimplyHired),
  Favorability = percent(data2C$SimplyHired/total*100, accuracy = .01, scale=1)) %>% filter (Keyword != "NLP composite" & Keyword != "AI composite")

data25 <- data25[order(-data25$SimplyHired),]

data25$Ranked <- 1:length(data25$SimplyHired)

select (data25, Keyword, SimplyHired, Favorability, Ranked)

##                 Keyword SimplyHired Favorability Ranked
## 2              analysis        2668       16.44%      1
## 1      machine learning        2561       15.78%      2
## 3            statistics        2308       14.22%      3
## 4      computer science        2093       12.90%      4
## 5         communication        1791       11.04%      5
## 6           mathematics        1497        9.22%      6
## 7         visualization        1153        7.11%      7
## 8         deep learning         675        4.16%      8
## 9  software development         481        2.96%      9
## 10      neural networks         421        2.59%     10
## 12   project management         330        2.03%     11
## 13 software engineering         250        1.54%     12
## 11     data engineering           0        0.00%     13

What is the most relevant Computing Skill identified by LinkedIn?

grid21 <- ggplot(data = data22, aes(x = reorder(Keyword, LinkedIn), y = LinkedIn, fill = reorder(Keyword, LinkedIn))) +
  theme(legend.position = "none", axis.text.y = element_text(size=8), axis.text.x = element_text(size=8)) +
  geom_bar(stat = "identity") +
  # Map fill to the ranked keyword and let the scale supply the viridis colors
  scale_fill_viridis_d() +
  # White label boxes keep the counts legible on dark bars
  geom_label(aes(label = LinkedIn), fill = "white", position = position_dodge(width = 0.5), size = 3, label.padding = unit(0.3, "lines"), label.size = 0.15, inherit.aes = TRUE) +
  labs(title = "LinkedIn", x = "Computing Skills", y = "Skill Relevance") +
  coord_flip()
grid21

What is the most relevant Computing Skill identified by Indeed?

grid22 <- ggplot(data = data23, aes(x = reorder(Keyword, Indeed), y = Indeed, fill = magma(13))) +
  theme(legend.position = "none", axis.text.y = element_text(size = 8), axis.text.x = element_text(size = 8)) +
  geom_bar(stat = "identity") +
  geom_label(aes(label = Indeed), position = position_dodge(width = 0.5), size = 3, label.padding = unit(0.3, "lines"), label.size = 0.15, inherit.aes = TRUE) +
  labs(title = "Indeed", x = "Computing Skills", y = "Skill Relevance") +
  coord_flip()
grid22

What is the most relevant Computing Skill identified by Monster?

grid23 <- ggplot(data = data24, aes(x = reorder(Keyword, Monster), y = Monster, fill = Keyword)) +
  theme(legend.position = "none", axis.text.y = element_text(size = 8), axis.text.x = element_text(size = 8)) +
  geom_bar(stat = "identity") +
  geom_label(aes(label = Monster), position = position_dodge(width = 0.5), size = 3, label.padding = unit(0.3, "lines"), label.size = 0.15, inherit.aes = TRUE) +
  labs(title = "Monster", x = "Computing Skills", y = "Skill Relevance") +
  coord_flip()
grid23

What is the most relevant Computing Skill identified by SimplyHired?

grid24 <- ggplot(data = data25, aes(x = reorder(Keyword, SimplyHired), y = SimplyHired, fill = plasma(13))) +
  theme(legend.position = "none", axis.text.y = element_text(size = 8), axis.text.x = element_text(size = 8)) +
  geom_bar(stat = "identity") +
  geom_label(aes(label = SimplyHired), position = position_dodge(width = 0.5), size = 3, label.padding = unit(0.3, "lines"), label.size = 0.15, inherit.aes = TRUE) +
  labs(title = "SimplyHired", x = "Computing Skills", y = "Skill Relevance") +
  coord_flip()
grid24

library(gridExtra)
grid.arrange(grid21, grid22, grid23, grid24, ncol = 2)
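
The four site plots differ only in the data frame and the column being plotted. As a sketch, assuming ggplot2 is installed, the shared layers could live in one function. `plot_site` is a hypothetical helper, not part of the original analysis, and it maps `fill` to `Keyword` as the Monster plot does rather than reproducing each plot's palette call.

```r
library(ggplot2)

# Hypothetical helper (not from the original analysis): horizontal bar chart
# of one site's counts per keyword, labelled with the counts.
plot_site <- function(df, site) {
  ggplot(df, aes(x = reorder(Keyword, .data[[site]]),
                 y = .data[[site]],
                 fill = Keyword)) +
    geom_col() +
    geom_label(aes(label = .data[[site]]), size = 3) +
    labs(title = site, x = "Computing Skills", y = "Skill Relevance") +
    coord_flip() +
    theme(legend.position = "none",
          axis.text.x = element_text(size = 8),
          axis.text.y = element_text(size = 8))
}

# Three rows copied from the SimplyHired ranking output, used only as a demo
demo <- data.frame(
  Keyword = c("analysis", "machine learning", "statistics"),
  SimplyHired = c(2668, 2561, 2308),
  stringsAsFactors = FALSE
)

p <- plot_site(demo, "SimplyHired")
```

With such a helper, the grid could be assembled from calls like `plot_site(data22, "LinkedIn")` passed to `grid.arrange()`.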

Aggregations

What is the most relevant software tool identified across all job search sites?

Total_soft <- data %>% mutate(Total_Count = LinkedIn + Indeed + Monster + SimplyHired + AngelList)

plot_soft <- ggplot(data = Total_soft, aes(x = reorder(Keyword, Total_Count), y = Total_Count, fill = viridis(46))) +
  theme(legend.position = "none", axis.text.y = element_text(size = 6), axis.text.x = element_text(size = 7)) +
  geom_bar(stat = "identity") +
  geom_label(aes(label = Total_Count), position = position_dodge(width = 0.5), size = 2.4, label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE) +
  labs(title = "Most Relevant Software Tools Across All Sites", x = "Software Tools", y = "Tool Relevance") +
  coord_flip()

plot_soft

set.seed(2)
wordcloud(words = Total_soft$Keyword, freq = Total_soft$Total_Count,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(7, "Dark2"))

What is the most relevant Computing Skill identified across all job search sites?

Total_comp <- data2C %>% mutate(Total_Count = LinkedIn + Indeed + Monster + SimplyHired) %>% filter (Keyword != "NLP composite" & Keyword != "AI composite")

plot_comp <- ggplot(data = Total_comp, aes(x = reorder(Keyword, Total_Count), y = Total_Count, fill = magma(13))) +
  theme(legend.position = "none", axis.text.y = element_text(size = 8), axis.text.x = element_text(size = 8)) +
  geom_bar(stat = "identity") +
  geom_label(aes(label = Total_Count), position = position_dodge(width = 0.5), size = 3, label.padding = unit(0.3, "lines"), label.size = 0.15, inherit.aes = TRUE) +
  labs(title = "Most Relevant Computing Skills Across All Sites", x = "Computing Skills", y = "Skill Relevance") +
  coord_flip()

plot_comp

#Substitute "computer science" with "cp" and "machine learning" with "ml" so they fit in the wordcloud below

Total_comp$Keyword <- sub("^computer science$", "cp", Total_comp$Keyword)
Total_comp$Keyword <- sub("^machine learning$", "ml", Total_comp$Keyword)

set.seed(1)                          
wordcloud(words = Total_comp$Keyword, freq = Total_comp$Total_Count,
          max.words=200, random.order=FALSE, rot.per=0.30, 
          colors=brewer.pal(7, "Dark2"))

Additional Findings

We also note a small dataset containing the education levels that respondents identified as essential for a data scientist.

We believe this dataset may contain an anomaly. The Monster job posting site shows 12,086 observations for Kaggle, far more than any other count in the dataset. We are also not certain that Kaggle should be considered an education level, but we kept it because it was mentioned across all job posting sites.

data3E$LinkedIn <- as.numeric(gsub(",","",data3E$LinkedIn))
data3E$Indeed <- as.numeric(gsub(",","",data3E$Indeed))
data3E$Monster <- as.numeric(gsub(",","",data3E$Monster))
data3E$SimplyHired <- as.numeric(gsub(",","",data3E$SimplyHired))
data3E$AngelList <- as.numeric(gsub(",","",data3E$AngelList))
data3E[is.na(data3E)] <- 0
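
The five conversions above follow one pattern: strip the thousands separators, convert to numeric, and treat anything unparseable as zero. A minimal sketch of that pattern as a reusable function follows. `parse_count` is a hypothetical name, and apart from the 12,086 figure mentioned earlier, the toy values are invented for illustration and are not from the dataset.

```r
# Hypothetical helper (not from the original analysis): convert count strings
# such as "12,086" to numbers, mapping anything unparseable to 0.
parse_count <- function(x) {
  n <- suppressWarnings(as.numeric(gsub(",", "", x)))
  n[is.na(n)] <- 0
  n
}

# Toy data: only "12,086" comes from the report; the rest is made up
toy <- data.frame(
  Keyword = c("Kaggle", "example level"),
  LinkedIn = c("1,234", ""),
  Monster = c("12,086", "57"),
  stringsAsFactors = FALSE
)

# Apply the same conversion to every site column in one step
site_cols <- c("LinkedIn", "Monster")
toy[site_cols] <- lapply(toy[site_cols], parse_count)
print(toy)
```

Applying the helper over a vector of column names keeps the cleaning step to one line per data frame, at the cost of silently turning any malformed entry into 0, which is the same trade-off the original code makes.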

Total_edu <- data3E %>% mutate(Total_Count = LinkedIn + Indeed + Monster + SimplyHired + AngelList)

ggplot(Total_edu, aes(x = reorder(Keyword, Total_Count), y = Total_Count)) +
  geom_bar(stat = "identity", width = 0.5, fill = "tomato2") +
  labs(x = "Education Level", y = "Education Relevance", title = "Most Relevant Education Level in Data Science")

Conclusions

Although this study is not a scientific inquiry with control mechanisms in place, the anecdotal evidence it provides about current trends in data science software and computing skills should not be overlooked. Before discussing the findings, it is worth bearing in mind that respondents are unequally distributed across the job posting sites. For example, LinkedIn has a much larger audience than AngelList. It is also unclear whether mechanisms were in place to prevent a respondent from submitting more than once per question.
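
To illustrate why unequal site sizes matter, the sketch below compares summed raw counts with the average of within-site shares. This is not something the study did; it is only one possible way to weight sites equally. The site names, keywords, and counts are all invented for illustration.

```r
# Toy counts, invented for illustration only; they are not from the dataset
toy <- data.frame(
  Keyword = c("skill_a", "skill_b", "skill_c"),
  BigSite = c(500, 300, 200),
  SmallSite = c(5, 15, 80),
  stringsAsFactors = FALSE
)
sites <- c("BigSite", "SmallSite")

# Summing raw counts lets the larger site dominate the result
toy$Total_Count <- rowSums(toy[sites])

# Convert each site's counts to within-site shares, then average across sites
shares <- prop.table(as.matrix(toy[sites]), margin = 2)
toy$Mean_Share <- rowMeans(shares)

print(toy[order(-toy$Mean_Share), ])
```

In this toy case the raw totals rank skill_a first, while the averaged shares rank skill_c first, because the smaller site's preferences are no longer swamped by the larger site's volume.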

Respondents on LinkedIn, Indeed, SimplyHired, and AngelList chose Python as the most relevant data science software tool. Monster was the one exception: its participants chose SQL as the top software tool. For computing skills, Indeed, Monster, and SimplyHired favored “analysis” as the most relevant skill, while LinkedIn favored “machine learning”.

Data 607 Project 3

Ajay Arora and Mario Pena

October 4, 2019