This is a project for your entire class section to work on together, since being able to work effectively on a virtual team is a key “soft skill” for data scientists. Please note especially the requirement about making a presentation during our first meetup after the project is due.
W. Edwards Deming said, “In God we trust, all others must bring data.” Please use data to answer the question, “Which are the most valued data science skills?” Consider your work as an exploration; there is not necessarily a “right answer.”
While researching this project, we found a study, https://365datascience.com/research-into-1001-data-scientist-profiles/, whose findings fit the criteria this project calls for. However, as a team, we still needed a pool of data from which to calculate parameters and answer our research question(s).
# Packages used by the code in this report
library(knitr)       # include_graphics()
library(data.table)  # data.table()
library(readr)       # read_csv()
library(dplyr)       # mutate(), filter(), arrange()
library(plotly)      # plot_ly()
image <- "journal.png"
include_graphics(image)
With some further research, we found a similar project that set out to answer a similar question:
https://www.kaggle.com/discdiver/the-most-in-demand-skills-for-data-scientists
image <- "kaggle.png"
include_graphics(image)
For this project, we used the two main data sets from the Kaggle page above to answer our research question(s):
kaggle_ds_job_listing_software.csv and kaggle_ds_general_skills_revised.csv
image <- "kaggle_data.png"
include_graphics(image)
general_skills <- read.csv(file="https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/kaggle_ds_general_skills_revised.csv")
head(general_skills)
## ï..Keyword LinkedIn Indeed SimplyHired Monster
## 1 machine learning 5,701 3,439 2,561 2,340
## 2 analysis 5,168 3,500 2,668 3,306
## 3 statistics 4,893 2,992 2,308 2,399
## 4 computer science 4,517 2,739 2,093 1,900
## 5 communication 3,404 2,344 1,791 2,053
## 6 mathematics 2,605 1,961 1,497 1,815
software_skills <-read.csv(file = "https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/kaggle_ds_job_listing_software.csv")
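The `ï..` prefix on `Keyword` in the output above is typical of a UTF-8 byte-order mark (BOM) at the start of a CSV file being read as part of the first column name. The following is a minimal sketch of one way to avoid it, assuming the files do carry a BOM: pass `fileEncoding = "UTF-8-BOM"` to `read.csv()`. The temporary file and its single row are illustrative only and stand in for the project data.

```r
# Build a tiny CSV that starts with a UTF-8 byte-order mark (illustrative data)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)  # the three BOM bytes
writeBin(charToRaw("Keyword,LinkedIn\nanalysis,\"5,168\"\n"), con)
close(con)

# Declaring the encoding lets R strip the BOM, so the header stays "Keyword"
clean <- read.csv(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
names(clean)

# Counts such as "5,168" arrive as text; remove the comma before converting
clean$LinkedIn <- as.numeric(gsub(",", "", clean$LinkedIn))
clean
```

With the header read cleanly, later steps could refer to `Keyword` directly instead of the garbled name.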
As a team, we thought it best to join these two data sets, first cleaning the respective data frames into a common format so they could be merged.
dt1 <- data.table(general_skills, key = "ï..Keyword")
dt1$Skill_Set <- "Soft"
dt2 <- data.table(software_skills, key = "ï..Keyword")
dt2$Skill_Set <- "Tech"
# Drop the columns that dt1 lacks so the two tables can be stacked with rbind()
dt2 <- subset(dt2, select = -c(LinkedIn.., Indeed.., SimplyHired.., Monster.., Avg.., GlassDoor.Self.Reported...2017, Difference))
result<-rbind(dt1, dt2)
head(result)
## ï..Keyword LinkedIn Indeed SimplyHired Monster Skill_Set
## 1: AI composite 1,568 1,125 811 687 Soft
## 2: analysis 5,168 3,500 2,668 3,306 Soft
## 3: communication 3,404 2,344 1,791 2,053 Soft
## 4: computer science 4,517 2,739 2,093 1,900 Soft
## 5: data engineering 514 339 276 200 Soft
## 6: deep learning 1,310 979 675 606 Soft
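The stacking step above can be sketched in base R as follows. This is an illustration under stated assumptions, not the team's code: the `general` rows reuse two counts from the output shown earlier, while the `software` rows, the `Extra` column, and the names `tool_a` and `tool_b` are hypothetical placeholders. The idea is to label each table, keep only the columns both tables share, and then bind the rows.

```r
# Two toy tables: `software` has one column that `general` lacks
general <- data.frame(
  Keyword = c("analysis", "statistics"),
  LinkedIn = c(5168, 4893),
  stringsAsFactors = FALSE
)
software <- data.frame(
  Keyword = c("tool_a", "tool_b"),   # hypothetical skill names
  LinkedIn = c(100, 200),            # placeholder counts
  Extra = c(0.1, 0.2),               # stands in for the columns dropped above
  stringsAsFactors = FALSE
)

# Label each table before stacking
general$Skill_Set <- "Soft"
software$Skill_Set <- "Tech"

# rbind() needs matching columns, so keep only the shared ones
shared <- intersect(names(general), names(software))
combined <- rbind(general[, shared], software[, shared])
combined
```

Selecting the shared columns by name avoids listing every column to drop by hand.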
names(result)[names(result) == "ï..Keyword"] <- "Keyword"
write.csv(result, file = "joined_df2.csv", row.names = FALSE)
# Re-read the joined file from GitHub; read_csv() parses the comma-formatted counts as numbers
ds_skills <- read_csv(file = "https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/joined_df2.csv")
## Parsed with column specification:
## cols(
## Keyword = col_character(),
## LinkedIn = col_number(),
## Indeed = col_number(),
## SimplyHired = col_number(),
## Monster = col_number(),
## Skill_Set = col_character()
## )
head(ds_skills)
## # A tibble: 6 x 6
## Keyword LinkedIn Indeed SimplyHired Monster Skill_Set
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 AI composite 1568 1125 811 687 Soft
## 2 analysis 5168 3500 2668 3306 Soft
## 3 communication 3404 2344 1791 2053 Soft
## 4 computer science 4517 2739 2093 1900 Soft
## 5 data engineering 514 339 276 200 Soft
## 6 deep learning 1310 979 675 606 Soft
ds_skills$Total <- rowSums(sapply(ds_skills[,c(2:5)], as.numeric))
ds_skills <- mutate(ds_skills[, c(1,6,2:5,7)])
ds_skills <- data.frame(ds_skills)
colnames(ds_skills)
## [1] "Keyword" "Skill_Set" "LinkedIn" "Indeed" "SimplyHired"
## [6] "Monster" "Total"
dim(ds_skills)
## [1] 52 7
head(ds_skills)
## Keyword Skill_Set LinkedIn Indeed SimplyHired Monster Total
## 1 AI composite Soft 1568 1125 811 687 4191
## 2 analysis Soft 5168 3500 2668 3306 14642
## 3 communication Soft 3404 2344 1791 2053 9592
## 4 computer science Soft 4517 2739 2093 1900 11249
## 5 data engineering Soft 514 339 276 200 1329
## 6 deep learning Soft 1310 979 675 606 3570
The combined data set records how many job listings on each well-known recruitment site mentioned a given keyword in a general search. Raw counts are hard to compare because the sites differ in size, so as a team we converted each count to a percentage using the per-site constants below. The Total column becomes the combined percentage across all four sites, which the charts later label "Average".
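# Divisors for the normalization; assumed to be the total number of
# data scientist listings found on each site
LinkedIn <- 8610
Indeed <- 5138
SimplyHired <- 3829
Monster <- 3746
Total <- LinkedIn + Indeed + SimplyHired + Monster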
ds_skills$LinkedIn <- ((ds_skills$LinkedIn)/(LinkedIn))*100
ds_skills$Indeed <- ((ds_skills$Indeed)/(Indeed))*100
ds_skills$SimplyHired <- ((ds_skills$SimplyHired)/(SimplyHired))*100
ds_skills$Monster <- ((ds_skills$Monster)/((Monster)))*100
ds_skills$Total <- ((ds_skills$Total/((Total))))*100
dim(ds_skills)
## [1] 52 7
head(ds_skills)
## Keyword Skill_Set LinkedIn Indeed SimplyHired Monster
## 1 AI composite Soft 18.211382 21.895679 21.180465 18.339562
## 2 analysis Soft 60.023229 68.119891 69.678767 88.254138
## 3 communication Soft 39.535424 45.620864 46.774615 54.805125
## 4 computer science Soft 52.462253 53.308680 54.661792 50.720769
## 5 data engineering Soft 5.969803 6.597898 7.208148 5.339028
## 6 deep learning Soft 15.214866 19.054107 17.628624 16.177256
## Total
## 1 19.654833
## 2 68.667636
## 3 44.984289
## 4 52.755241
## 5 6.232706
## 6 16.742485
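To make the normalization concrete, the sketch below recomputes the `analysis` row by hand from the raw counts and divisors shown earlier. The vector names `site_totals` and `analysis_counts` are introduced here for illustration only. It also shows that the `Total` value is a pooled share (summed counts over summed divisors), which generally differs from the simple mean of the four per-site percentages.

```r
# Divisors and raw counts copied from the code and output above
site_totals <- c(LinkedIn = 8610, Indeed = 5138, SimplyHired = 3829, Monster = 3746)
analysis_counts <- c(LinkedIn = 5168, Indeed = 3500, SimplyHired = 2668, Monster = 3306)

# Per-site percentage: keyword count divided by that site's divisor
per_site_pct <- analysis_counts / site_totals * 100
round(per_site_pct, 2)   # LinkedIn comes out at 60.02, as in the table

# Pooled percentage: all counts divided by all divisors
pooled_pct <- sum(analysis_counts) / sum(site_totals) * 100
round(pooled_pct, 2)     # 68.67, matching the Total column

# The unweighted mean of the four percentages gives a different number
simple_mean <- mean(per_site_pct)
```

The pooled figure weights larger sites more heavily, so a site such as LinkedIn pulls `Total` toward its own percentage.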
ds_skills_soft <- filter(ds_skills, ds_skills$Skill_Set == "Soft")
dim(ds_skills_soft)
## [1] 15 7
head(ds_skills_soft)
## Keyword Skill_Set LinkedIn Indeed SimplyHired Monster
## 1 AI composite Soft 18.211382 21.895679 21.180465 18.339562
## 2 analysis Soft 60.023229 68.119891 69.678767 88.254138
## 3 communication Soft 39.535424 45.620864 46.774615 54.805125
## 4 computer science Soft 52.462253 53.308680 54.661792 50.720769
## 5 data engineering Soft 5.969803 6.597898 7.208148 5.339028
## 6 deep learning Soft 15.214866 19.054107 17.628624 16.177256
## Total
## 1 19.654833
## 2 68.667636
## 3 44.984289
## 4 52.755241
## 5 6.232706
## 6 16.742485
# Arranged in descending order
ds_skills_soft <- arrange(ds_skills_soft, desc(Total))
head(ds_skills_soft)
## Keyword Skill_Set LinkedIn Indeed SimplyHired Monster
## 1 analysis Soft 60.02323 68.11989 69.67877 88.25414
## 2 machine learning Soft 66.21370 66.93266 66.88430 62.46663
## 3 statistics Soft 56.82927 58.23278 60.27683 64.04164
## 4 computer science Soft 52.46225 53.30868 54.66179 50.72077
## 5 communication Soft 39.53542 45.62086 46.77461 54.80513
## 6 mathematics Soft 30.25552 38.16660 39.09637 48.45168
## Total
## 1 68.66764
## 2 65.84908
## 3 59.05360
## 4 52.75524
## 5 44.98429
## 6 36.94602
#Top 10 soft skills
ds_skills_soft <- ds_skills_soft[1:10,c(1,3:7)]
head(ds_skills_soft)
## Keyword LinkedIn Indeed SimplyHired Monster Total
## 1 analysis 60.02323 68.11989 69.67877 88.25414 68.66764
## 2 machine learning 66.21370 66.93266 66.88430 62.46663 65.84908
## 3 statistics 56.82927 58.23278 60.27683 64.04164 59.05360
## 4 computer science 52.46225 53.30868 54.66179 50.72077 52.75524
## 5 communication 39.53542 45.62086 46.77461 54.80513 44.98429
## 6 mathematics 30.25552 38.16660 39.09637 48.45168 36.94602
#ds_skills_soft$Keyword
p <- plot_ly(ds_skills_soft, x = ~Keyword, y = ~Total, type = 'bar', name = ~Keyword)
p
top_10_soft_visual <- data.frame("Keyword"=ds_skills_soft, ds_skills_soft)
head(top_10_soft_visual, 2)
## Keyword.Keyword Keyword.LinkedIn Keyword.Indeed Keyword.SimplyHired
## 1 analysis 60.02323 68.11989 69.67877
## 2 machine learning 66.21370 66.93266 66.88430
## Keyword.Monster Keyword.Total Keyword LinkedIn Indeed
## 1 88.25414 68.66764 analysis 60.02323 68.11989
## 2 62.46663 65.84908 machine learning 66.21370 66.93266
## SimplyHired Monster Total
## 1 69.67877 88.25414 68.66764
## 2 66.88430 62.46663 65.84908
data_soft <- top_10_soft_visual[,c('Keyword', 'LinkedIn', 'Indeed', 'SimplyHired', 'Monster', 'Total')]
ds_skills_soft <- filter(ds_skills, ds_skills$Skill_Set == "Soft")
dim(ds_skills_soft)
## [1] 15 7
# View the first six non-technical skills (head() shows six rows by default)
head(ds_skills_soft)
## Keyword Skill_Set LinkedIn Indeed SimplyHired Monster
## 1 AI composite Soft 18.211382 21.895679 21.180465 18.339562
## 2 analysis Soft 60.023229 68.119891 69.678767 88.254138
## 3 communication Soft 39.535424 45.620864 46.774615 54.805125
## 4 computer science Soft 52.462253 53.308680 54.661792 50.720769
## 5 data engineering Soft 5.969803 6.597898 7.208148 5.339028
## 6 deep learning Soft 15.214866 19.054107 17.628624 16.177256
## Total
## 1 19.654833
## 2 68.667636
## 3 44.984289
## 4 52.755241
## 5 6.232706
## 6 16.742485
#Arrange data from largest to smallest by the Totals column
ds_skills_soft <- arrange(ds_skills_soft, desc(Total))
#Filter out only the Top 10 non-technical skills
top_10_soft <- ds_skills_soft[1:10,c(1,3:7)]
top_10_soft
## Keyword LinkedIn Indeed SimplyHired Monster Total
## 1 analysis 60.02323 68.11989 69.67877 88.25414 68.66764
## 2 machine learning 66.21370 66.93266 66.88430 62.46663 65.84908
## 3 statistics 56.82927 58.23278 60.27683 64.04164 59.05360
## 4 computer science 52.46225 53.30868 54.66179 50.72077 52.75524
## 5 communication 39.53542 45.62086 46.77461 54.80513 44.98429
## 6 mathematics 30.25552 38.16660 39.09637 48.45168 36.94602
## 7 visualization 21.82346 27.50097 30.11230 32.22104 26.50659
## 8 AI composite 18.21138 21.89568 21.18046 18.33956 19.65483
## 9 deep learning 15.21487 19.05411 17.62862 16.17726 16.74248
## 10 NLP composite 14.07666 17.71117 17.23688 15.53657 15.77639
ds_skills_tech <- filter(ds_skills, ds_skills$Skill_Set == "Tech")
dim(ds_skills_tech)
## [1] 37 7
# View the first six technical skills (head() shows six rows by default)
head(ds_skills_tech)
## Keyword Skill_Set LinkedIn Indeed SimplyHired Monster Total
## 1 AWS Tech 10.998839 15.395095 15.852703 12.466631 13.187638
## 2 Azure Tech 6.713124 8.096536 7.443197 7.261078 7.273836
## 3 C Tech 9.233449 9.575710 10.028728 13.961559 10.289359
## 4 C# Tech 3.763066 4.768392 4.753199 5.846236 4.549078
## 5 C++ Tech 11.893148 14.889062 15.147558 11.719167 13.168879
## 6 Caffe Tech 2.392567 2.899961 2.951162 2.562734 2.645031
#Arrange data from largest to smallest by the Totals column
ds_skills_tech <- arrange(ds_skills_tech, desc(Total))
#Filter out only the Top 10 technical skills
top_10_tech <- ds_skills_tech[1:10,c(1,3:7)]
top_10_tech
## Keyword LinkedIn Indeed SimplyHired Monster Total
## 1 Python 73.71661 74.30907 75.42439 67.91244 73.14637
## 2 R 52.88037 60.45154 62.49674 63.13401 58.23289
## 3 SQL 45.05226 51.14831 53.69548 49.14576 48.79238
## 4 Hadoop 24.87805 30.71234 30.39958 32.03417 28.53257
## 5 Spark 25.19164 30.18684 30.47793 28.35024 27.89945
## 6 Java 22.57840 26.80031 27.65735 26.74853 25.24035
## 7 SAS 19.89547 22.07084 23.76600 26.10785 22.20607
## 8 Tableau 14.12311 19.69638 20.37085 19.86119 17.59602
## 9 Hive 13.72822 16.15415 16.63620 16.52429 15.32617
## 10 Scala 12.07898 14.38303 15.38261 13.88147 13.54406
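The same sort-then-slice pattern is applied to both skill groups above, so it could be wrapped in a small helper. The function name `top_n_by_total` is hypothetical and not part of the original analysis; the sketch uses base R only, and the toy rows reuse four `Total` values from the table above in shuffled order.

```r
# Hypothetical helper: return the n rows with the largest Total
top_n_by_total <- function(df, n = 10) {
  ranked <- df[order(df$Total, decreasing = TRUE), , drop = FALSE]
  head(ranked, n)
}

# Toy input, deliberately out of order
toy_tech <- data.frame(
  Keyword = c("SQL", "Hadoop", "Python", "R"),
  Total = c(48.79238, 28.53257, 73.14637, 58.23289),
  stringsAsFactors = FALSE
)

top_two <- top_n_by_total(toy_tech, n = 2)
top_two$Keyword
```

Calling the helper once per group would keep the ranking rule identical for technical and non-technical skills.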
x <- plot_ly(top_10_tech, x = ~Keyword, y = ~Total, type = 'bar', name = ~Keyword)
x
y <- plot_ly(top_10_soft, x = ~Keyword, y = ~Total, type = 'bar', name = ~Keyword)
y
top_10_tech_visual <- data.frame("Keyword"=top_10_tech, top_10_tech)
head(top_10_tech_visual, 2)
## Keyword.Keyword Keyword.LinkedIn Keyword.Indeed Keyword.SimplyHired
## 1 Python 73.71661 74.30907 75.42439
## 2 R 52.88037 60.45154 62.49674
## Keyword.Monster Keyword.Total Keyword LinkedIn Indeed SimplyHired
## 1 67.91244 73.14637 Python 73.71661 74.30907 75.42439
## 2 63.13401 58.23289 R 52.88037 60.45154 62.49674
## Monster Total
## 1 67.91244 73.14637
## 2 63.13401 58.23289
data <- top_10_tech_visual[,c('Keyword', 'LinkedIn', 'Indeed', 'SimplyHired', 'Monster', 'Total')]
head(data, 2)
## Keyword LinkedIn Indeed SimplyHired Monster Total
## 1 Python 73.71661 74.30907 75.42439 67.91244 73.14637
## 2 R 52.88037 60.45154 62.49674 63.13401 58.23289
s <- plot_ly(data, x = ~Keyword, y = ~LinkedIn, type = 'bar', name = "LinkedIn") %>%
add_trace(y = ~Indeed, name = "Indeed") %>%
add_trace(y = ~SimplyHired, name = "SimplyHired") %>%
add_trace(y = ~Monster, name = "Monster") %>%
add_trace(y = ~Total, name = "Average") %>%
layout(title = 'Technical Skills by Job Sites', yaxis = list(title = '% of Data Science Jobs'), xaxis = list(title = 'Tech Skills'), barmode = 'group')
s
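Depending on how `Keyword` is stored, a chart like the one above may draw its bars in alphabetical order rather than rank order. A minimal sketch of one common workaround, offered as an assumption about the desired layout rather than the team's method: convert `Keyword` to a factor whose levels follow the ranked row order before plotting. The data frame `ranked_skills` is illustrative and reuses four values from the top-10 table; the plotting call is guarded so the example still runs where plotly is not installed.

```r
# Rows already sorted by Total, largest first (values from the table above)
ranked_skills <- data.frame(
  Keyword = c("Python", "R", "SQL", "Hadoop"),
  Total = c(73.14637, 58.23289, 48.79238, 28.53257),
  stringsAsFactors = FALSE
)

# Fix the level order to the current (ranked) row order
ranked_skills$Keyword <- factor(ranked_skills$Keyword, levels = ranked_skills$Keyword)
levels(ranked_skills$Keyword)

# Plot only if plotly is available
if (requireNamespace("plotly", quietly = TRUE)) {
  ranked_plot <- plotly::plot_ly(ranked_skills, x = ~Keyword, y = ~Total, type = "bar")
}
```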
top_10_soft_visual <- data.frame("Keyword"=top_10_soft, top_10_soft)
head(top_10_soft_visual, 2)
## Keyword.Keyword Keyword.LinkedIn Keyword.Indeed Keyword.SimplyHired
## 1 analysis 60.02323 68.11989 69.67877
## 2 machine learning 66.21370 66.93266 66.88430
## Keyword.Monster Keyword.Total Keyword LinkedIn Indeed
## 1 88.25414 68.66764 analysis 60.02323 68.11989
## 2 62.46663 65.84908 machine learning 66.21370 66.93266
## SimplyHired Monster Total
## 1 69.67877 88.25414 68.66764
## 2 66.88430 62.46663 65.84908
data2 <- top_10_soft_visual[,c('Keyword', 'LinkedIn', 'Indeed', 'SimplyHired', 'Monster', 'Total')]
head(data2, 2)
## Keyword LinkedIn Indeed SimplyHired Monster Total
## 1 analysis 60.02323 68.11989 69.67877 88.25414 68.66764
## 2 machine learning 66.21370 66.93266 66.88430 62.46663 65.84908
t <- plot_ly(data2, x = ~Keyword, y = ~LinkedIn, type = 'bar', name = "LinkedIn") %>%
add_trace(y = ~Indeed, name = "Indeed") %>%
add_trace(y = ~SimplyHired, name = "SimplyHired") %>%
add_trace(y = ~Monster, name = "Monster") %>%
add_trace(y = ~Total, name = "Average") %>%
layout(title = 'Non-Technical Skills by Job Sites', yaxis = list(title = '% of Data Science Jobs'), xaxis = list(title = 'Non-Tech Skills'), barmode = 'group')
t
head(ds_skills_tech, 2)
## Keyword Skill_Set LinkedIn Indeed SimplyHired Monster Total
## 1 Python Tech 73.71661 74.30907 75.42439 67.91244 73.14637
## 2 R Tech 52.88037 60.45154 62.49674 63.13401 58.23289
head(ds_skills_soft, 2)
## Keyword Skill_Set LinkedIn Indeed SimplyHired Monster
## 1 analysis Soft 60.02323 68.11989 69.67877 88.25414
## 2 machine learning Soft 66.21370 66.93266 66.88430 62.46663
## Total
## 1 68.66764
## 2 65.84908
Employers look for many skills and skill sets when hiring a data scientist. Given that data science is a multidisciplinary field, it is only fitting that companies look for both technology-based skills and soft skills in a data scientist.
The top technology skill that employers are looking for when hiring a data scientist, according to this dataset, is Python, which appears in about 73% of listings.
The top soft (non-technical) skill that employers are looking for when hiring a data scientist, according to this dataset, is analysis, at about 69% of listings.
The top overall skill that employers are looking for when hiring a data scientist, according to this dataset, is Python.
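One way to read the leading skill in each group straight off the ranked tables is sketched below. The data frame `leaders` is an illustrative extract holding the first two rows of each top-10 table shown earlier, and the variable names are introduced here for the example only.

```r
# First two rows of each ranked table above
leaders <- data.frame(
  Keyword = c("Python", "R", "analysis", "machine learning"),
  Skill_Set = c("Tech", "Tech", "Soft", "Soft"),
  Total = c(73.14637, 58.23289, 68.66764, 65.84908),
  stringsAsFactors = FALSE
)

# Highest Total within each skill group
best_per_group <- sapply(
  split(leaders, leaders$Skill_Set),
  function(g) g$Keyword[which.max(g$Total)]
)
best_per_group

# Highest Total across both groups
best_overall <- leaders$Keyword[which.max(leaders$Total)]
best_overall
```

Deriving the conclusions from the tables in code keeps the written summary in step with the data if the inputs change.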