This task, part of Data 607 Project 3, attempts to identify the specific skills required to work effectively as a Data Scientist. The goal is to explore a variety of publicly available data and determine which skills rank highest. To narrow the focus of this study, the analysis is limited to specific software tools and computing skills. See Data and Variables of Measurement for additional information.
Mario Pena
Ajay Arora
We established communication over Slack and spoke over the phone to introduce ourselves properly. We then divided the work effort in half; the work consisted of Data Preparation, Data Cleaning, Data Analysis, and Conclusion.
We both decided to communicate via email and/or Slack. In addition, we established a Project 3 group folder on GitHub.
The data was acquired from Kaggle (https://www.kaggle.com/discdiver/datasets). The counts for specific software tool and computing skills were gathered from the job posting sites indicated in the following image. The focus of the study is on the specific software tool and computing skills described next.
LinkedIn, Indeed, Monster, SimplyHired, AngelList
# Load required libraries
library(DBI)
## Warning: package 'DBI' was built under R version 3.5.3
library("knitr")
library("tidyverse")
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.0
## v tibble 2.0.1 v dplyr 0.8.0.1
## v tidyr 0.8.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.3
## Warning: package 'forcats' was built under R version 3.5.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library("tidyr")
library("dplyr")
library("stringr")
library("plotly")
## Warning: package 'plotly' was built under R version 3.5.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library("htmlTable")
## Warning: package 'htmlTable' was built under R version 3.5.3
library("stringr")
library("ggplot2")
library("stats")
library("scales")
## Warning: package 'scales' was built under R version 3.5.3
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library("viridis")
## Warning: package 'viridis' was built under R version 3.5.3
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
##
## viridis_pal
library("wordcloud")
## Warning: package 'wordcloud' was built under R version 3.5.3
## Loading required package: RColorBrewer
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.5.3
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
softskills = matrix(
c("Python", "SQL", "R", "Java", "Hadoop", "Spark", "Excel", "Tableau", "AWS", "SAS", "Scala", "C++", "Hive", "Javascript", "NoSQL", "Azure", "TensorFlow", "C", "PowerPoint", "Matlab", "Docker", "Git", "C#", "MySQL", "Ruby", "Microsoft", "Office", "SPSS", "MongoDB", "Pig", "Pandas", "Hbase", "Cassandra", "Numpy", "Perl", "Power", "BI", "Node", "PostgreSQL", "D3", "Keras", "PHP", "Redis", "Alteryx", "Jupyter", "Stata", "Caffe", "PyTorch"),
nrow=8,
ncol=6,
byrow = TRUE)
ssdf <- as.data.frame(t(softskills))
ssdf %>% kable() %>% kable_styling()
| V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 |
|---|---|---|---|---|---|---|---|
| Python | Excel | Hive | PowerPoint | Ruby | Pandas | BI | Redis |
| SQL | Tableau | Javascript | Matlab | Microsoft | Hbase | Node | Alteryx |
| R | AWS | NoSQL | Docker | Office | Cassandra | PostgreSQL | Jupyter |
| Java | SAS | Azure | Git | SPSS | Numpy | D3 | Stata |
| Hadoop | Scala | TensorFlow | C# | MongoDB | Perl | Keras | Caffe |
| Spark | C++ | C | MySQL | Pig | Power | PHP | PyTorch |
compskills <- matrix(c("Machine Learning", "Analysis", "Statistics", "Computer Science", "Communication", "Mathematics", "Visualization", "AI Composite", "Deep Learning", "NLP Composite", "Software Development", "Neural Networks", "Data Engineering", "Project Management", "Software Engineering"),
nrow=5,
ncol=3,
byrow = TRUE)
ssdf <- as.data.frame(t(compskills))
ssdf %>% kable() %>% kable_styling()
| V1 | V2 | V3 | V4 | V5 |
|---|---|---|---|---|
| Machine Learning | Computer Science | Visualization | NLP Composite | Data Engineering |
| Analysis | Communication | AI Composite | Software Development | Project Management |
| Statistics | Mathematics | Deep Learning | Neural Networks | Software Engineering |
#data <- read.csv(file="https://raw.githubusercontent.com/AjayArora35/Data-607-Group-Project-3/master/Data%20Science%20Software%20Skills.csv", header=TRUE, stringsAsFactors = FALSE)
#data2C <- read.csv(file="https://raw.githubusercontent.com/AjayArora35/Data-607-Group-Project-3/master/Data%20Science%20Computing%20Skills.csv", header=TRUE, stringsAsFactors = FALSE)
# Connect to the project's MySQL database (AWS RDS) that stores the scraped skill counts
cn <- dbConnect(drv = RMySQL::MySQL(),
username = "admin",
password = "Data__607",
host = "database-1.cxdov2mcmzlo.us-east-2.rds.amazonaws.com",
port = 3306,
dbname = "data607project3")
data <- dbGetQuery(cn, "SELECT * FROM softwareskills")
data2C <- dbGetQuery(cn, "SELECT * FROM computingskills")
data3E<- dbGetQuery(cn, "SELECT * FROM educationlevels")
head(data) %>% kable() %>% kable_styling()
| id | Keyword | LinkedIn | Indeed | Monster | SimplyHired | AngelList |
|---|---|---|---|---|---|---|
| 1 | Python | 10534 | 7067 | 5353 | 5564 | 962 |
| 2 | SQL | 7359 | 5504 | 8970 | 4475 | 559 |
| 3 | R | 6454 | 4617 | 3421 | 3709 | 332 |
| 4 | Java | 4458 | 3366 | 4359 | 2549 | 527 |
| 5 | Hadoop | 4388 | 2982 | 2858 | 2240 | 194 |
| 6 | Spark | 3686 | 2978 | 2453 | 2262 | 259 |
head(data2C) %>% kable() %>% kable_styling()
| id | Keyword | LinkedIn | Indeed | SimplyHired | Monster |
|---|---|---|---|---|---|
| 1 | machine learning | 5701 | 3439 | 2561 | 2340 |
| 2 | analysis | 5168 | 3500 | 2668 | 3306 |
| 3 | statistics | 4893 | 2992 | 2308 | 2399 |
| 4 | computer science | 4517 | 2739 | 2093 | 1900 |
| 5 | communication | 3404 | 2344 | 1791 | 2053 |
| 6 | mathematics | 2605 | 1961 | 1497 | 1815 |
# Strip thousands separators from the software-tools counts and coerce them to numeric
data$LinkedIn <- as.numeric(gsub(",","",data$LinkedIn))
data$Indeed <- as.numeric(gsub(",","",data$Indeed))
data$Monster <- as.numeric(gsub(",","",data$Monster))
data$SimplyHired <- as.numeric(gsub(",","",data$SimplyHired))
data$AngelList <- as.numeric(gsub(",","",data$AngelList))
data[is.na(data)] <- 0   # blank entries become NA after coercion; treat them as zero counts
# Repeat for the computing-skills counts
data2C$LinkedIn <- as.numeric(gsub(",","", data2C$LinkedIn))
data2C$Indeed <- as.numeric(gsub(",","", data2C$Indeed))
data2C$SimplyHired <- as.numeric(gsub(",","", data2C$SimplyHired))
data2C$Monster <- as.numeric(gsub(",","", data2C$Monster))
data2C[is.na(data2C)] <- 0
head(data) %>% kable() %>% kable_styling()
| id | Keyword | LinkedIn | Indeed | Monster | SimplyHired | AngelList |
|---|---|---|---|---|---|---|
| 1 | Python | 10534 | 7067 | 5353 | 5564 | 962 |
| 2 | SQL | 7359 | 5504 | 8970 | 4475 | 559 |
| 3 | R | 6454 | 4617 | 3421 | 3709 | 332 |
| 4 | Java | 4458 | 3366 | 4359 | 2549 | 527 |
| 5 | Hadoop | 4388 | 2982 | 2858 | 2240 | 194 |
| 6 | Spark | 3686 | 2978 | 2453 | 2262 | 259 |
head(data2C) %>% kable() %>% kable_styling()
| id | Keyword | LinkedIn | Indeed | SimplyHired | Monster |
|---|---|---|---|---|---|
| 1 | machine learning | 5701 | 3439 | 2561 | 2340 |
| 2 | analysis | 5168 | 3500 | 2668 | 3306 |
| 3 | statistics | 4893 | 2992 | 2308 | 2399 |
| 4 | computer science | 4517 | 2739 | 2093 | 1900 |
| 5 | communication | 3404 | 2344 | 1791 | 2053 |
| 6 | mathematics | 2605 | 1961 | 1497 | 1815 |
# Total responses per posting site for the software tools (the same total is repeated on every row)
stdist <- data %>%
mutate(LinkedInTotal = sum(data$LinkedIn),
IndeedTotal = sum(data$Indeed),
MonsterTotal = sum(data$Monster),
SimplyHiredTotal = sum(data$SimplyHired),
AngelListTotal = sum(data$AngelList)
)
stvector <- c(stdist[1,'LinkedInTotal'], stdist[1,'IndeedTotal'], stdist[1,'MonsterTotal'], stdist[1,'SimplyHiredTotal'], stdist[1,'AngelListTotal'])
barplot(stvector,
main = "Software Tools Respondents Distribution",
xlab = "Posting Sites",
ylab = "Respondents",
names.arg = c("LinkedIn", "Indeed", "Monster", "SimplyHired", "AngelList"),
col = viridis(5),
horiz = FALSE)
data2 <- data %>%
mutate(total = sum(data$LinkedIn),
Favorability = percent(data$LinkedIn/total*100, accuracy = .01, scale=1))
data2 <- data2[order(-data2$LinkedIn),]
data2$Ranked <- 1:length(data2$LinkedIn)
select (data2, Keyword, LinkedIn, Favorability, Ranked)
## Keyword LinkedIn Favorability Ranked
## 1 Python 10534 14.08% 1
## 2 SQL 7359 9.84% 2
## 3 R 6454 8.63% 3
## 4 Java 4458 5.96% 4
## 5 Hadoop 4388 5.87% 5
## 6 Spark 3686 4.93% 6
## 7 Excel 2727 3.64% 7
## 8 Tableau 2636 3.52% 8
## 9 AWS 2621 3.50% 9
## 10 SAS 2589 3.46% 10
## 11 Scala 1980 2.65% 11
## 12 C++ 1977 2.64% 12
## 13 Hive 1859 2.48% 13
## 14 Javascript 1564 2.09% 14
## 15 NoSQL 1417 1.89% 15
## 16 Azure 1332 1.78% 16
## 17 TensorFlow 1130 1.51% 17
## 19 PowerPoint 1061 1.42% 18
## 20 Matlab 986 1.32% 19
## 21 Docker 951 1.27% 20
## 22 Git 931 1.24% 21
## 23 C# 839 1.12% 22
## 24 MySQL 780 1.04% 23
## 25 Ruby 736 0.98% 24
## 26 Microsoft Office 711 0.95% 25
## 27 SPSS 660 0.88% 26
## 28 MongoDB 653 0.87% 27
## 29 Pig 639 0.85% 28
## 30 Pandas 630 0.84% 29
## 31 Hbase 629 0.84% 30
## 32 Cassandra 610 0.82% 31
## 33 Numpy 585 0.78% 32
## 34 Perl 521 0.70% 33
## 35 Power BI 494 0.66% 34
## 36 Node 476 0.64% 35
## 37 PostgreSQL 438 0.59% 36
## 38 D3 436 0.58% 37
## 39 Keras 403 0.54% 38
## 40 PHP 382 0.51% 39
## 41 Redis 274 0.37% 40
## 42 Alteryx 270 0.36% 41
## 43 Jupyter 270 0.36% 42
## 44 Stata 256 0.34% 43
## 45 Caffe 243 0.32% 44
## 46 PyTorch 241 0.32% 45
## 18 C 0 0.00% 46
data3 <- data %>%
mutate(total = sum(data$Indeed),
Favorability = percent(data$Indeed/total*100, accuracy = .01, scale=1))
data3 <- data3[order(-data3$Indeed),]
data3$Ranked <- 1:length(data3$Indeed)
select (data3, Keyword, Indeed, Favorability, Ranked)
## Keyword Indeed Favorability Ranked
## 1 Python 7067 12.43% 1
## 2 SQL 5504 9.68% 2
## 3 R 4617 8.12% 3
## 4 Java 3366 5.92% 4
## 5 Hadoop 2982 5.24% 5
## 6 Spark 2978 5.24% 6
## 9 AWS 2346 4.13% 7
## 7 Excel 2287 4.02% 8
## 8 Tableau 2183 3.84% 9
## 10 SAS 1744 3.07% 10
## 12 C++ 1567 2.76% 11
## 13 Hive 1534 2.70% 12
## 11 Scala 1506 2.65% 13
## 16 Azure 1205 2.12% 14
## 14 Javascript 1192 2.10% 15
## 15 NoSQL 1103 1.94% 16
## 17 TensorFlow 926 1.63% 17
## 19 PowerPoint 832 1.46% 18
## 20 Matlab 828 1.46% 19
## 22 Git 813 1.43% 20
## 21 Docker 792 1.39% 21
## 24 MySQL 659 1.16% 22
## 23 C# 645 1.13% 23
## 28 MongoDB 584 1.03% 24
## 26 Microsoft Office 550 0.97% 25
## 25 Ruby 535 0.94% 26
## 30 Pandas 525 0.92% 27
## 29 Pig 508 0.89% 28
## 31 Hbase 500 0.88% 29
## 32 Cassandra 490 0.86% 30
## 34 Perl 473 0.83% 31
## 27 SPSS 470 0.83% 32
## 33 Numpy 433 0.76% 33
## 35 Power BI 400 0.70% 34
## 37 PostgreSQL 376 0.66% 35
## 36 Node 360 0.63% 36
## 40 PHP 320 0.56% 37
## 39 Keras 302 0.53% 38
## 42 Alteryx 262 0.46% 39
## 43 Jupyter 198 0.35% 40
## 41 Redis 194 0.34% 41
## 45 Caffe 185 0.33% 42
## 46 PyTorch 178 0.31% 43
## 38 D3 176 0.31% 44
## 44 Stata 172 0.30% 45
## 18 C 0 0.00% 46
data4 <- data %>%
mutate(total = sum(data$Monster),
Favorability = percent(data$Monster/total*100, accuracy = .01, scale=1))
data4 <- data4[order(-data4$Monster),]
data4$Ranked <- 1:length(data4$Monster)
select (data4, Keyword, Monster, Favorability, Ranked)
## Keyword Monster Favorability Ranked
## 2 SQL 8970 17.65% 1
## 1 Python 5353 10.53% 2
## 4 Java 4359 8.58% 3
## 3 R 3421 6.73% 4
## 5 Hadoop 2858 5.62% 5
## 6 Spark 2453 4.83% 6
## 8 Tableau 1826 3.59% 7
## 9 AWS 1682 3.31% 8
## 7 Excel 1674 3.29% 9
## 15 NoSQL 1562 3.07% 10
## 14 Javascript 1436 2.82% 11
## 11 Scala 1206 2.37% 12
## 10 SAS 1153 2.27% 13
## 12 C++ 1094 2.15% 14
## 22 Git 814 1.60% 15
## 16 Azure 752 1.48% 16
## 19 PowerPoint 706 1.39% 17
## 26 Microsoft Office 661 1.30% 18
## 17 TensorFlow 643 1.26% 19
## 20 Matlab 625 1.23% 20
## 23 C# 610 1.20% 21
## 21 Docker 583 1.15% 22
## 24 MySQL 501 0.99% 23
## 31 Hbase 495 0.97% 24
## 29 Pig 477 0.94% 25
## 37 PostgreSQL 470 0.92% 26
## 32 Cassandra 461 0.91% 27
## 27 SPSS 416 0.82% 28
## 28 MongoDB 401 0.79% 29
## 34 Perl 386 0.76% 30
## 30 Pandas 333 0.66% 31
## 33 Numpy 311 0.61% 32
## 25 Ruby 302 0.59% 33
## 35 Power BI 275 0.54% 34
## 39 Keras 211 0.42% 35
## 42 Alteryx 201 0.40% 36
## 38 D3 173 0.34% 37
## 46 PyTorch 161 0.32% 38
## 45 Caffe 155 0.30% 39
## 43 Jupyter 139 0.27% 40
## 40 PHP 133 0.26% 41
## 41 Redis 131 0.26% 42
## 44 Stata 124 0.24% 43
## 36 Node 95 0.19% 44
## 13 Hive 40 0.08% 45
## 18 C 0 0.00% 46
data5 <- data %>%
mutate(total = sum(data$SimplyHired),
Favorability = percent(data$SimplyHired/total*100, accuracy = .01, scale=1))
data5 <- data5[order(-data5$SimplyHired),]
data5$Ranked <- 1:length(data5$SimplyHired)
select (data5, Keyword, SimplyHired, Favorability, Ranked)
## Keyword SimplyHired Favorability Ranked
## 1 Python 5564 12.40% 1
## 2 SQL 4475 9.98% 2
## 3 R 3709 8.27% 3
## 4 Java 2549 5.68% 4
## 6 Spark 2262 5.04% 5
## 5 Hadoop 2240 4.99% 6
## 7 Excel 1862 4.15% 7
## 9 AWS 1793 4.00% 8
## 8 Tableau 1732 3.86% 9
## 10 SAS 1417 3.16% 10
## 11 Scala 1175 2.62% 11
## 13 Hive 1161 2.59% 12
## 12 C++ 1160 2.59% 13
## 14 Javascript 992 2.21% 14
## 15 NoSQL 942 2.10% 15
## 16 Azure 800 1.78% 16
## 19 PowerPoint 712 1.59% 17
## 20 Matlab 691 1.54% 18
## 17 TensorFlow 686 1.53% 19
## 22 Git 669 1.49% 20
## 21 Docker 603 1.34% 21
## 24 MySQL 524 1.17% 22
## 23 C# 497 1.11% 23
## 30 Pandas 450 1.00% 24
## 27 SPSS 428 0.95% 25
## 26 Microsoft Office 427 0.95% 26
## 28 MongoDB 423 0.94% 27
## 29 Pig 422 0.94% 28
## 31 Hbase 419 0.93% 29
## 25 Ruby 411 0.92% 30
## 32 Cassandra 408 0.91% 31
## 33 Numpy 365 0.81% 32
## 35 Power BI 347 0.77% 33
## 34 Perl 332 0.74% 34
## 37 PostgreSQL 324 0.72% 35
## 36 Node 317 0.71% 36
## 39 Keras 245 0.55% 37
## 40 PHP 212 0.47% 38
## 43 Jupyter 173 0.39% 39
## 42 Alteryx 170 0.38% 40
## 46 PyTorch 165 0.37% 41
## 41 Redis 158 0.35% 42
## 44 Stata 155 0.35% 43
## 45 Caffe 149 0.33% 44
## 38 D3 141 0.31% 45
## 18 C 0 0.00% 46
data6 <- data %>%
mutate(total = sum(data$AngelList),
Favorability = percent(data$AngelList/total*100, accuracy = .01, scale=1))
data6 <- data6[order(-data6$AngelList),]
data6$Ranked <- 1:length(data6$AngelList)
select (data6, Keyword, AngelList, Favorability, Ranked)
## Keyword AngelList Favorability Ranked
## 1 Python 962 10.85% 1
## 14 Javascript 635 7.16% 2
## 7 Excel 627 7.07% 3
## 9 AWS 571 6.44% 4
## 2 SQL 559 6.30% 5
## 4 Java 527 5.94% 6
## 12 C++ 459 5.18% 7
## 23 C# 459 5.18% 8
## 3 R 332 3.74% 9
## 25 Ruby 262 2.95% 10
## 6 Spark 259 2.92% 11
## 22 Git 232 2.62% 12
## 21 Docker 228 2.57% 13
## 24 MySQL 226 2.55% 14
## 15 NoSQL 221 2.49% 15
## 5 Hadoop 194 2.19% 16
## 37 PostgreSQL 194 2.19% 17
## 11 Scala 182 2.05% 18
## 28 MongoDB 159 1.79% 19
## 17 TensorFlow 149 1.68% 20
## 41 Redis 136 1.53% 21
## 40 PHP 113 1.27% 22
## 36 Node 110 1.24% 23
## 13 Hive 91 1.03% 24
## 16 Azure 90 1.01% 25
## 32 Cassandra 90 1.01% 26
## 33 Numpy 88 0.99% 27
## 30 Pandas 86 0.97% 28
## 34 Perl 86 0.97% 29
## 8 Tableau 76 0.86% 30
## 38 D3 69 0.78% 31
## 20 Matlab 68 0.77% 32
## 39 Keras 50 0.56% 33
## 10 SAS 37 0.42% 34
## 26 Microsoft Office 37 0.42% 35
## 45 Caffe 35 0.39% 36
## 31 Hbase 33 0.37% 37
## 46 PyTorch 32 0.36% 38
## 43 Jupyter 22 0.25% 39
## 27 SPSS 18 0.20% 40
## 29 Pig 17 0.19% 41
## 19 PowerPoint 16 0.18% 42
## 35 Power BI 16 0.18% 43
## 44 Stata 14 0.16% 44
## 18 C 0 0.00% 45
## 42 Alteryx 0 0.00% 46
grid1 <- ggplot(data = data, aes(x=reorder(data$Keyword, data$LinkedIn), y=data$LinkedIn, fill = viridis(46))) +
theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
geom_bar(stat = "identity") +
geom_label(aes(label=data$LinkedIn), position = position_dodge(width = 0.5), size = 2.4, label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE)+
labs(title = "LinkedIn", x = "Software Tools", y = "Tool Relevance")+
coord_flip()
grid1
grid2 <- ggplot(data = data, aes(x=reorder(data$Keyword, data$Indeed), y=data$Indeed, fill = magma(46))) +
theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
geom_bar(stat = "identity") +
geom_label(aes(label=data$Indeed), position = position_dodge(width = 0.5), size = 2.4, label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE)+
labs(title = "Indeed", x = "Software Tools", y = "Tool Relevance")+
coord_flip()
grid2
grid3 <- ggplot(data = data, aes(x=reorder(data$Keyword, data$Monster), y=data$Monster, fill = data$Keyword)) +
theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
geom_bar(stat = "identity") +
geom_label(aes(label=data$Monster), position = position_dodge(width = 0.5), size = 2.4, label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE)+
labs(title = "Monster", x = "Software Tools", y = "Tool Relevance")+
coord_flip()
grid3
grid4 <- ggplot(data = data, aes(x=reorder(data$Keyword, data$SimplyHired), y=data$SimplyHired, fill = plasma(46))) +
theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
geom_bar(stat = "identity") +
geom_label(aes(label=data$SimplyHired), position = position_dodge(width = 0.5), size = 2.4, label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE)+
labs(title = "SimplyHired", x = "Software Tools", y = "Tool Relevance")+
coord_flip()
grid4
grid5 <- ggplot(data = data, aes(x=reorder(data$Keyword, data$AngelList), y=data$AngelList, fill = viridis(46))) +
theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
geom_bar(stat = "identity") +
geom_label(aes(label=data$AngelList), position = position_dodge(width = 0.5), size = 2.4, label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE)+
labs(title = "AngelList", x = "Software Tools", y = "Tool Relevance")+
coord_flip()
grid5
require(gridExtra)
## Loading required package: gridExtra
## Warning: package 'gridExtra' was built under R version 3.5.3
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(grid1, grid2, ncol=2)
grid.arrange(grid3, grid4, ncol=2)
grid.arrange(grid5, ncol=2)
Please note that the "NLP composite" and "AI composite" keywords were removed from the graphs; we believe they are categorical labels used to describe other skills in the dataset, and they contain zero total observations.
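As a quick check on that claim (a minimal sketch; it assumes the composite rows are labeled exactly as in the filters used below, "NLP composite" and "AI composite"), their per-site counts can be pulled out directly:
# Confirm that the composite keywords carry no counts before excluding them
data2C %>%
  filter(Keyword %in% c("NLP composite", "AI composite")) %>%
  mutate(Total = LinkedIn + Indeed + SimplyHired + Monster) %>%
  select(Keyword, LinkedIn, Indeed, SimplyHired, Monster, Total)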
# Total responses per posting site for the computing skills (note: totals come from data2C, not the software-tools table)
csdist <- data2C %>%
mutate(LinkedInTotal = sum(data2C$LinkedIn),
IndeedTotal = sum(data2C$Indeed),
MonsterTotal = sum(data2C$Monster),
SimplyHiredTotal = sum(data2C$SimplyHired)
)
csvector <- c(csdist[1,'LinkedInTotal'], csdist[1,'IndeedTotal'], csdist[1,'MonsterTotal'], csdist[1,'SimplyHiredTotal'])
barplot(csvector,
main = "Computing Skills Respondents Distribution",
xlab = "Posting Sites",
ylab = "Respondents",
names.arg = c("LinkedIn", "Indeed", "Monster", "SimplyHired"),
col = plasma(4),
horiz = FALSE)
data22 <- data2C %>%
mutate(total = sum(data2C$LinkedIn),
Favorability = percent(data2C$LinkedIn/total*100, accuracy = .01, scale=1)) %>% filter (Keyword != "NLP composite" & Keyword != "AI composite")
data22 <- data22[order(-data22$LinkedIn),]
data22$Ranked <- 1:length(data22$LinkedIn)
select (data22, Keyword, LinkedIn, Favorability, Ranked)
## Keyword LinkedIn Favorability Ranked
## 1 machine learning 5701 17.66% 1
## 2 analysis 5168 16.01% 2
## 3 statistics 4893 15.16% 3
## 4 computer science 4517 13.99% 4
## 5 communication 3404 10.54% 5
## 6 mathematics 2605 8.07% 6
## 7 visualization 1879 5.82% 7
## 8 deep learning 1310 4.06% 8
## 9 software development 732 2.27% 9
## 10 neural networks 671 2.08% 10
## 11 data engineering 514 1.59% 11
## 12 project management 476 1.47% 12
## 13 software engineering 413 1.28% 13
data23 <- data2C %>%
mutate(total = sum(data2C$Indeed),
Favorability = percent(data2C$Indeed/total*100, accuracy = .01, scale=1)) %>% filter (Keyword != "NLP composite" & Keyword != "AI composite")
data23 <- data23[order(-data23$Indeed),]
data23$Ranked <- 1:length(data23$Indeed)
select (data23, Keyword, Indeed, Favorability, Ranked)
## Keyword Indeed Favorability Ranked
## 2 analysis 3500 16.53% 1
## 1 machine learning 3439 16.24% 2
## 3 statistics 2992 14.13% 3
## 4 computer science 2739 12.94% 4
## 5 communication 2344 11.07% 5
## 6 mathematics 1961 9.26% 6
## 7 visualization 1413 6.67% 7
## 8 deep learning 979 4.62% 8
## 9 software development 627 2.96% 9
## 10 neural networks 485 2.29% 10
## 12 project management 397 1.88% 11
## 13 software engineering 295 1.39% 12
## 11 data engineering 0 0.00% 13
data24 <- data2C %>%
mutate(total = sum(data2C$Monster),
Favorability = percent(data2C$Monster/total*100, accuracy = .01, scale=1)) %>% filter (Keyword != "NLP composite" & Keyword != "AI composite")
data24 <- data24[order(-data24$Monster),]
data24$Ranked <- 1:length(data24$Monster)
select (data24, Keyword, Monster, Favorability, Ranked)
## Keyword Monster Favorability Ranked
## 2 analysis 3306 18.60% 1
## 3 statistics 2399 13.50% 2
## 1 machine learning 2340 13.16% 3
## 5 communication 2053 11.55% 4
## 4 computer science 1900 10.69% 5
## 6 mathematics 1815 10.21% 6
## 7 visualization 1207 6.79% 7
## 9 software development 784 4.41% 8
## 8 deep learning 606 3.41% 9
## 13 software engineering 512 2.88% 10
## 12 project management 348 1.96% 11
## 10 neural networks 305 1.72% 12
## 11 data engineering 200 1.13% 13
data25 <- data2C %>%
mutate(total = sum(data2C$SimplyHired),
Favorability = percent(data2C$SimplyHired/total*100, accuracy = .01, scale=1)) %>% filter (Keyword != "NLP composite" & Keyword != "AI composite")
data25 <- data25[order(-data25$SimplyHired),]
data25$Ranked <- 1:length(data25$SimplyHired)
select (data25, Keyword, SimplyHired, Favorability, Ranked)
## Keyword SimplyHired Favorability Ranked
## 2 analysis 2668 16.44% 1
## 1 machine learning 2561 15.78% 2
## 3 statistics 2308 14.22% 3
## 4 computer science 2093 12.90% 4
## 5 communication 1791 11.04% 5
## 6 mathematics 1497 9.22% 6
## 7 visualization 1153 7.11% 7
## 8 deep learning 675 4.16% 8
## 9 software development 481 2.96% 9
## 10 neural networks 421 2.59% 10
## 12 project management 330 2.03% 11
## 13 software engineering 250 1.54% 12
## 11 data engineering 0 0.00% 13
grid21 <- ggplot(data = data22, aes(x=reorder(Keyword, LinkedIn), y=LinkedIn, fill = viridis(13))) +
theme(legend.position = "none", axis.text.y = element_text(size=8), axis.text.x = element_text(size=8)) +
geom_bar(stat = "identity") +
geom_label(aes(label=data22$LinkedIn), position = position_dodge(width = 0.5), size = 3, label.padding = unit(0.3, "lines"), label.size = 0.15, inherit.aes = TRUE)+
labs(title = "LinkedIn", x = "Computing Skills", y = "Skill Relevance")+
coord_flip()
grid21
grid22 <- ggplot(data = data23, aes(x=reorder(Keyword, Indeed), y=Indeed, fill = magma(13))) +
theme(legend.position = "none", axis.text.y = element_text(size=8), axis.text.x = element_text(size=8)) +
geom_bar(stat = "identity") +
geom_label(aes(label=data23$Indeed), position = position_dodge(width = 0.5), size = 3, label.padding = unit(0.3, "lines"), label.size = 0.15, inherit.aes = TRUE)+
labs(title = "Indeed", x = "Computing Skills", y = "Skill Relevance")+
coord_flip()
grid22
grid23 <- ggplot(data = data24, aes(x=reorder(Keyword, Monster), y=Monster, fill = Keyword)) +
theme(legend.position = "none", axis.text.y = element_text(size=8), axis.text.x = element_text(size=8)) +
geom_bar(stat = "identity") +
geom_label(aes(label=data24$Monster), position = position_dodge(width = 0.5), size = 3, label.padding = unit(0.3, "lines"), label.size = 0.15, inherit.aes = TRUE)+
labs(title = "Monster", x = "Computing Skills", y = "Skill Relevance")+
coord_flip()
grid23
grid24 <- ggplot(data = data25, aes(x=reorder(Keyword, SimplyHired), y=SimplyHired, fill = plasma(13))) +
theme(legend.position = "none", axis.text.y = element_text(size=8), axis.text.x = element_text(size=8)) +
geom_bar(stat = "identity") +
geom_label(aes(label=data25$SimplyHired), position = position_dodge(width = 0.5), size = 3, label.padding = unit(0.3, "lines"), label.size = 0.15, inherit.aes = TRUE)+
labs(title = "SimplyHired", x = "Computing Skills", y = "Skill Relevance")+
coord_flip()
grid24
require(gridExtra)
grid.arrange(grid21, grid22, grid23, grid24, ncol=2)
Total_soft <- data %>% mutate(Total_Count = LinkedIn + Indeed + Monster + SimplyHired + AngelList)
plot_soft <- ggplot(data = Total_soft, aes(x=reorder(Total_soft$Keyword, Total_soft$Total_Count), y=Total_soft$Total_Count, fill = viridis(46))) +
theme(legend.position = "none", axis.text.y = element_text(size=6), axis.text.x = element_text(size=7)) +
geom_bar(stat = "identity") +
geom_label(aes(label=Total_soft$Total_Count), position = position_dodge(width = 0.5), size = 2.4, label.padding = unit(0.04, "lines"), label.size = 0.15, inherit.aes = TRUE)+
labs(title = "Most Relevant Software Tools Across All Sites ", x = "Software Tools", y = "Tool Relevance")+
coord_flip()
plot_soft
set.seed(2)
wordcloud(words = Total_soft$Keyword, freq = Total_soft$Total_Count,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(7, "Dark2"))
Total_comp <- data2C %>% mutate(Total_Count = LinkedIn + Indeed + Monster + SimplyHired) %>% filter (Keyword != "NLP composite" & Keyword != "AI composite")
plot_comp <- ggplot(data = Total_comp, aes(x=reorder(Total_comp$Keyword, Total_comp$Total_Count), y=Total_comp$Total_Count, fill = magma(13))) +
theme(legend.position = "none", axis.text.y = element_text(size=8), axis.text.x = element_text(size=8)) +
geom_bar(stat = "identity") +
geom_label(aes(label=Total_comp$Total_Count), position = position_dodge(width = 0.5), size = 3, label.padding = unit(0.3, "lines"), label.size = 0.15, inherit.aes = TRUE)+
labs(title = "Most Relevant Computing Skills Across All Sites ", x = "Computing Skills", y = "Skill Relevance")+
coord_flip()
plot_comp
# Abbreviate "computer science" as "cp" and "machine learning" as "ml" so they fit in the wordcloud below
Total_comp$Keyword <- sub("^computer science$", "cp", Total_comp$Keyword)
Total_comp$Keyword <- sub("^machine learning$", "ml", Total_comp$Keyword)
set.seed(1)
wordcloud(words = Total_comp$Keyword, freq = Total_comp$Total_Count,
max.words=200, random.order=FALSE, rot.per=0.30,
colors=brewer.pal(7, "Dark2"))
We would also like to note a small dataset containing the education levels that respondents identified as essential for a data scientist.
Please also note that we believe there may be an anomaly in this dataset: the Monster posting site reports 12,086 observations for Kaggle, a far higher count than any other entry in the dataset. We are also not certain that Kaggle should be considered an education level, but we have kept it because it was mentioned across all of the job posting sites.
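To make the anomaly easy to spot, the row in question can be pulled out directly (a minimal sketch; it assumes the education table uses the same Keyword column as the other tables, with an entry containing "Kaggle"):
# Inspect the suspicious Kaggle row in the education-level table (counts are still raw strings at this point)
data3E %>% filter(str_detect(Keyword, regex("kaggle", ignore_case = TRUE)))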
data3E$LinkedIn <- as.numeric(gsub(",","",data3E$LinkedIn))
data3E$Indeed <- as.numeric(gsub(",","",data3E$Indeed))
data3E$Monster <- as.numeric(gsub(",","",data3E$Monster))
data3E$SimplyHired <- as.numeric(gsub(",","",data3E$SimplyHired))
data3E$AngelList <- as.numeric(gsub(",","",data3E$AngelList))
data3E[is.na(data3E)] <- 0
Total_edu <- data3E %>% mutate(Total_Count = LinkedIn + Indeed + Monster + SimplyHired + AngelList)
ggplot(Total_edu, aes(x=reorder(Keyword, Total_Count), Total_Count)) + geom_bar(stat="identity", width = 0.5, fill="tomato2") + labs(x = "Education Level", y="Education Relevance", title="Most Relevant Education Level in Data Science")
Although this study is not based on a scientific inquiry with control mechanisms in place, the anecdotal evidence about the current trends in data science software and computing skills cannot be overlooked. Before discussing any of the findings, it is worth bearing in mind the unequal distribution of respondents among the different job posting sites; LinkedIn, for example, enjoys a much larger audience than AngelList. It is also unclear whether mechanisms were in place to prevent a respondent from submitting more than once per question. These caveats are worth keeping in mind as we discuss the final numbers.
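One way to keep those unequal audiences in view is to compare each site's share of the total responses rather than raw counts (a minimal sketch using the software-tools table and the column names loaded above):
# Each posting site's share (in percent) of all software-tool responses
site_totals <- colSums(data[, c("LinkedIn", "Indeed", "Monster", "SimplyHired", "AngelList")])
round(site_totals / sum(site_totals) * 100, 1)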
Respondents on LinkedIn, Indeed, SimplyHired, and AngelList chose Python as the most relevant data science software tool. The one exception was Monster, where participants chose SQL as the top software tool. For computing skills, respondents on Indeed, Monster, and SimplyHired favored "analysis" as the most relevant skill, while LinkedIn respondents chose "machine learning."
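The per-site winners summarized above can also be read off programmatically (a minimal sketch; it simply takes the keyword with the largest count in each site column):
# Top-ranked software tool on each posting site
sapply(c("LinkedIn", "Indeed", "Monster", "SimplyHired", "AngelList"),
       function(site) data$Keyword[which.max(data[[site]])])
# Top-ranked computing skill on each posting site
sapply(c("LinkedIn", "Indeed", "SimplyHired", "Monster"),
       function(site) data2C$Keyword[which.max(data2C[[site]])])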