Project #3 - Data Scientist Skills Compared

Objective

This is a project for your entire class section to work on together, since being able to work effectively on a virtual team is a key “soft skill” for data scientists. Please note especially the requirement about making a presentation during our first meetup after the project is due.

W. Edwards Deming said, “In God we trust, all others must bring data.” Please use data to answer the question, “Which are the most valued data science skills?” Consider your work as an exploration; there is not necessarily a “right answer.”

Background

While researching for this project, we found a study, https://365datascience.com/research-into-1001-data-scientist-profiles/, whose findings fit the criteria this project needs. However, as a team, we still needed a data pool of our own from which to calculate the parameters that answer our research questions.

library(knitr)  # provides include_graphics()
image <- "journal.png"
include_graphics(image)

With some further research, we found a project that set out to answer a similar question.

https://www.kaggle.com/discdiver/the-most-in-demand-skills-for-data-scientists

image <- "kaggle.png"
include_graphics(image)

For this project, we used the two main datasets from the Kaggle page above to answer our research questions:

kaggle_ds_job_listing_software.csv and kaggle_ds_general_skills_revised.csv

image <- "kaggle_data.png"
include_graphics(image)

general_skills <- read.csv(file="https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/kaggle_ds_general_skills_revised.csv")
head(general_skills)
##         ï..Keyword LinkedIn Indeed SimplyHired Monster
## 1 machine learning    5,701  3,439       2,561   2,340
## 2         analysis    5,168  3,500       2,668   3,306
## 3       statistics    4,893  2,992       2,308   2,399
## 4 computer science    4,517  2,739       2,093   1,900
## 5    communication    3,404  2,344       1,791   2,053
## 6      mathematics    2,605  1,961       1,497   1,815
software_skills <- read.csv(file = "https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/kaggle_ds_job_listing_software.csv")

Research Question

What are the most important technology-based skill and soft skill, respectively, when hiring a data scientist?
Overall, what is the most sought-after skill for a data scientist?

Data import and cleaning

As a team, we decided it would be best to join these two datasets, first cleaning the respective data frames into matching formats so they could be merged.

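library(data.table)
# The "ï.." prefix on the first column name is most likely left over from a byte-order mark in the CSV files
dt1 <- data.table(general_skills, key = "ï..Keyword")
dt1$Skill_Set <- "Soft"  # label the general-skills rows
dt2 <- data.table(software_skills, key = "ï..Keyword")
dt2$Skill_Set <- "Tech"  # label the software rows
# Drop the columns that exist only in the software table so both tables share the same columns
dt2 <- subset(dt2, select = -c(LinkedIn.., Indeed.., SimplyHired.., Monster.., Avg.., GlassDoor.Self.Reported...2017, Difference))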

Combine Data-frames

result<-rbind(dt1, dt2)
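
The stacking step can be illustrated on a small scale. This is only a sketch in base R, not the team's code: the `software` rows use placeholder names and counts, and the `Extra` column is a hypothetical stand-in for the columns dropped above. It shows that rbind() needs both tables to have the same columns, and that the Skill_Set label keeps track of where each row came from.

```r
# Two toy tables; `general` uses counts shown in this report,
# `software` uses placeholder values for illustration only.
general <- data.frame(
  Keyword  = c("analysis", "statistics"),
  LinkedIn = c(5168, 4893),
  stringsAsFactors = FALSE
)
software <- data.frame(
  Keyword  = c("tool_a", "tool_b"),  # placeholder names
  LinkedIn = c(100, 50),             # placeholder counts
  Extra    = c(1, 2),                # hypothetical column missing from `general`
  stringsAsFactors = FALSE
)

# Tag each table before stacking
general$Skill_Set  <- "Soft"
software$Skill_Set <- "Tech"

# Keep only the columns both tables share, then stack the rows
shared   <- intersect(names(general), names(software))
combined <- rbind(general[, shared], software[, shared])
print(combined)
```

Selecting the shared columns with intersect() is one alternative to naming every column to drop; either way the two tables must line up before rbind() is called.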

Explore newly created data-frame

head(result)
##          ï..Keyword LinkedIn Indeed SimplyHired Monster Skill_Set
## 1:     AI composite    1,568  1,125         811     687      Soft
## 2:         analysis    5,168  3,500       2,668   3,306      Soft
## 3:    communication    3,404  2,344       1,791   2,053      Soft
## 4: computer science    4,517  2,739       2,093   1,900      Soft
## 5: data engineering      514    339         276     200      Soft
## 6:    deep learning    1,310    979         675     606      Soft

Rename variable

names(result)[names(result) == "ï..Keyword"] <- "Keyword"

Export to .csv

write.csv(result, file = "joined_df2.csv",row.names=FALSE)

Importing newly created csv

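ds_skills <- read.csv("joined_df2.csv")  # local copy, replaced by the hosted copy on the next lines
library(readr)  # read_csv() parses counts such as "5,168" as numbers
ds_skills <- read_csv(file = "https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/joined_df2.csv")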
## Parsed with column specification:
## cols(
##   Keyword = col_character(),
##   LinkedIn = col_number(),
##   Indeed = col_number(),
##   SimplyHired = col_number(),
##   Monster = col_number(),
##   Skill_Set = col_character()
## )
head(ds_skills)
## # A tibble: 6 x 6
##   Keyword          LinkedIn Indeed SimplyHired Monster Skill_Set
##   <chr>               <dbl>  <dbl>       <dbl>   <dbl> <chr>    
## 1 AI composite         1568   1125         811     687 Soft     
## 2 analysis             5168   3500        2668    3306 Soft     
## 3 communication        3404   2344        1791    2053 Soft     
## 4 computer science     4517   2739        2093    1900 Soft     
## 5 data engineering      514    339         276     200 Soft     
## 6 deep learning        1310    979         675     606 Soft
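
The raw files store counts with thousands separators (for example "5,701"), and the column specification above shows read_csv() parsing them with col_number(). As a rough base R equivalent, assuming the separators are always commas, the same conversion could be sketched like this:

```r
# Counts as they appear in the raw CSV output shown earlier in this report
raw_counts <- c("5,701", "3,439", "811")

# Remove the thousands separators, then convert the text to numbers
counts <- as.numeric(gsub(",", "", raw_counts, fixed = TRUE))
print(counts)
```

Without this step the columns stay as text, and arithmetic such as row sums or percentages would fail.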

Data-frame cleaning and tidying

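library(dplyr)
# Sum the four job-site counts into a Total column
ds_skills$Total <- rowSums(sapply(ds_skills[,c(2:5)], as.numeric))
# Reorder the columns so that Skill_Set follows Keyword
ds_skills <- select(ds_skills, Keyword, Skill_Set, LinkedIn:Monster, Total)
ds_skills <- data.frame(ds_skills)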
colnames(ds_skills)
## [1] "Keyword"     "Skill_Set"   "LinkedIn"    "Indeed"      "SimplyHired"
## [6] "Monster"     "Total"
dim(ds_skills)
## [1] 52  7
head(ds_skills)
##            Keyword Skill_Set LinkedIn Indeed SimplyHired Monster Total
## 1     AI composite      Soft     1568   1125         811     687  4191
## 2         analysis      Soft     5168   3500        2668    3306 14642
## 3    communication      Soft     3404   2344        1791    2053  9592
## 4 computer science      Soft     4517   2739        2093    1900 11249
## 5 data engineering      Soft      514    339         276     200  1329
## 6    deep learning      Soft     1310    979         675     606  3570

Creation of New Variable

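Both source datasets record, for each keyword, the number of job listings that mention it in a general search on four well-known job recruitment websites. Raw counts are hard to compare because the sites differ in size, so as a team we converted each count to a percentage of the total data science listings on that site. The Total column is divided by the combined number of listings across all four sites, which makes it a listing-weighted average of the four site percentages.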

# Total number of data science job listings on each site (the denominators)
LinkedIn <- 8610
Indeed <- 5138
SimplyHired <- 3829
Monster <- 3746
Total <- LinkedIn + Indeed + SimplyHired + Monster
# Convert raw keyword counts to a percentage of each site's listings
ds_skills$LinkedIn <- ds_skills$LinkedIn / LinkedIn * 100
ds_skills$Indeed <- ds_skills$Indeed / Indeed * 100
ds_skills$SimplyHired <- ds_skills$SimplyHired / SimplyHired * 100
ds_skills$Monster <- ds_skills$Monster / Monster * 100
ds_skills$Total <- ds_skills$Total / Total * 100
dim(ds_skills)
## [1] 52  7
head(ds_skills)
##            Keyword Skill_Set  LinkedIn    Indeed SimplyHired   Monster
## 1     AI composite      Soft 18.211382 21.895679   21.180465 18.339562
## 2         analysis      Soft 60.023229 68.119891   69.678767 88.254138
## 3    communication      Soft 39.535424 45.620864   46.774615 54.805125
## 4 computer science      Soft 52.462253 53.308680   54.661792 50.720769
## 5 data engineering      Soft  5.969803  6.597898    7.208148  5.339028
## 6    deep learning      Soft 15.214866 19.054107   17.628624 16.177256
##       Total
## 1 19.654833
## 2 68.667636
## 3 44.984289
## 4 52.755241
## 5  6.232706
## 6 16.742485
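
As a sanity check, the conversion can be reproduced by hand for a single row. The sketch below uses the "AI composite" counts and the site denominators shown in this report; the variable names are ours and chosen only for illustration.

```r
# Listings per site (denominators) and the raw "AI composite" counts
site_totals <- c(LinkedIn = 8610, Indeed = 5138, SimplyHired = 3829, Monster = 3746)
ai_counts   <- c(LinkedIn = 1568, Indeed = 1125, SimplyHired = 811,  Monster = 687)

# Percentage of listings on each site that mention the keyword
pct_by_site <- ai_counts / site_totals * 100

# Overall percentage: combined count over combined listings
pct_overall <- sum(ai_counts) / sum(site_totals) * 100

# Plain mean of the four site percentages, for comparison
pct_simple_mean <- mean(pct_by_site)

print(round(pct_by_site, 2))
print(round(pct_overall, 2))
print(round(pct_simple_mean, 2))
```

The overall figure matches the Total value in the table (about 19.65), while the plain mean of the four site percentages comes out slightly higher (about 19.91). The difference arises because Total weights each site by its number of listings, so LinkedIn counts the most.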

Top 10 Soft Skills for Data Scientists

Data transformations for visualization

ds_skills_soft <- filter(ds_skills, ds_skills$Skill_Set == "Soft")
dim(ds_skills_soft)
## [1] 15  7
head(ds_skills_soft)
##            Keyword Skill_Set  LinkedIn    Indeed SimplyHired   Monster
## 1     AI composite      Soft 18.211382 21.895679   21.180465 18.339562
## 2         analysis      Soft 60.023229 68.119891   69.678767 88.254138
## 3    communication      Soft 39.535424 45.620864   46.774615 54.805125
## 4 computer science      Soft 52.462253 53.308680   54.661792 50.720769
## 5 data engineering      Soft  5.969803  6.597898    7.208148  5.339028
## 6    deep learning      Soft 15.214866 19.054107   17.628624 16.177256
##       Total
## 1 19.654833
## 2 68.667636
## 3 44.984289
## 4 52.755241
## 5  6.232706
## 6 16.742485
# Arranged in descending order
ds_skills_soft <- arrange(ds_skills_soft, desc(Total))
head(ds_skills_soft)
##            Keyword Skill_Set LinkedIn   Indeed SimplyHired  Monster
## 1         analysis      Soft 60.02323 68.11989    69.67877 88.25414
## 2 machine learning      Soft 66.21370 66.93266    66.88430 62.46663
## 3       statistics      Soft 56.82927 58.23278    60.27683 64.04164
## 4 computer science      Soft 52.46225 53.30868    54.66179 50.72077
## 5    communication      Soft 39.53542 45.62086    46.77461 54.80513
## 6      mathematics      Soft 30.25552 38.16660    39.09637 48.45168
##      Total
## 1 68.66764
## 2 65.84908
## 3 59.05360
## 4 52.75524
## 5 44.98429
## 6 36.94602
#Top 10 soft skills
ds_skills_soft <- ds_skills_soft[1:10,c(1,3:7)]
head(ds_skills_soft)
##            Keyword LinkedIn   Indeed SimplyHired  Monster    Total
## 1         analysis 60.02323 68.11989    69.67877 88.25414 68.66764
## 2 machine learning 66.21370 66.93266    66.88430 62.46663 65.84908
## 3       statistics 56.82927 58.23278    60.27683 64.04164 59.05360
## 4 computer science 52.46225 53.30868    54.66179 50.72077 52.75524
## 5    communication 39.53542 45.62086    46.77461 54.80513 44.98429
## 6      mathematics 30.25552 38.16660    39.09637 48.45168 36.94602
#ds_skills_soft$Keyword

Top ten Soft skills as a percentage of total Data Science jobs

library(plotly)
# One bar per skill: keywords on the x-axis, overall percentage on the y-axis
p <- plot_ly(ds_skills_soft, x = ~Keyword, y = ~Total, type = 'bar', name = ~Keyword)
p

Bar Graphs

Top 10 soft skills as a percentage of total data science jobs, by job site

top_10_soft_visual <- data.frame("Keyword"=ds_skills_soft, ds_skills_soft)
head(top_10_soft_visual, 2)
##    Keyword.Keyword Keyword.LinkedIn Keyword.Indeed Keyword.SimplyHired
## 1         analysis         60.02323       68.11989            69.67877
## 2 machine learning         66.21370       66.93266            66.88430
##   Keyword.Monster Keyword.Total          Keyword LinkedIn   Indeed
## 1        88.25414      68.66764         analysis 60.02323 68.11989
## 2        62.46663      65.84908 machine learning 66.21370 66.93266
##   SimplyHired  Monster    Total
## 1    69.67877 88.25414 68.66764
## 2    66.88430 62.46663 65.84908
data_soft <- top_10_soft_visual[,c('Keyword', 'LinkedIn', 'Indeed', 'SimplyHired', 'Monster', 'Total')]

Top 10 Non-Technical Skills for Data Scientists

Data transformations for visualization

ds_skills_soft <- filter(ds_skills, ds_skills$Skill_Set == "Soft")
dim(ds_skills_soft)
## [1] 15  7
# View the first rows of the non-technical skills
head(ds_skills_soft)
##            Keyword Skill_Set  LinkedIn    Indeed SimplyHired   Monster
## 1     AI composite      Soft 18.211382 21.895679   21.180465 18.339562
## 2         analysis      Soft 60.023229 68.119891   69.678767 88.254138
## 3    communication      Soft 39.535424 45.620864   46.774615 54.805125
## 4 computer science      Soft 52.462253 53.308680   54.661792 50.720769
## 5 data engineering      Soft  5.969803  6.597898    7.208148  5.339028
## 6    deep learning      Soft 15.214866 19.054107   17.628624 16.177256
##       Total
## 1 19.654833
## 2 68.667636
## 3 44.984289
## 4 52.755241
## 5  6.232706
## 6 16.742485
#Arrange data from largest to smallest by the Totals column
ds_skills_soft <- arrange(ds_skills_soft, desc(Total))
#Filter out only the Top 10 non-technical skills
top_10_soft <- ds_skills_soft[1:10,c(1,3:7)]
top_10_soft
##             Keyword LinkedIn   Indeed SimplyHired  Monster    Total
## 1          analysis 60.02323 68.11989    69.67877 88.25414 68.66764
## 2  machine learning 66.21370 66.93266    66.88430 62.46663 65.84908
## 3        statistics 56.82927 58.23278    60.27683 64.04164 59.05360
## 4  computer science 52.46225 53.30868    54.66179 50.72077 52.75524
## 5     communication 39.53542 45.62086    46.77461 54.80513 44.98429
## 6       mathematics 30.25552 38.16660    39.09637 48.45168 36.94602
## 7     visualization 21.82346 27.50097    30.11230 32.22104 26.50659
## 8      AI composite 18.21138 21.89568    21.18046 18.33956 19.65483
## 9     deep learning 15.21487 19.05411    17.62862 16.17726 16.74248
## 10    NLP composite 14.07666 17.71117    17.23688 15.53657 15.77639
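
The ranking step above relies on dplyr's arrange() and desc(). For readers without dplyr, the same "sort by Total, keep the top N" idea can be sketched in base R. The four rows below are copied from the tables in this report, and `top_2` is just an illustrative name.

```r
# A few rows from the soft-skills table, in their original (unsorted) order
skills <- data.frame(
  Keyword = c("AI composite", "analysis", "communication", "computer science"),
  Total   = c(19.654833, 68.667636, 44.984289, 52.755241),
  stringsAsFactors = FALSE
)

# Sort from largest to smallest Total, then keep the first two rows
ranked <- skills[order(skills$Total, decreasing = TRUE), ]
top_2  <- head(ranked, 2)
print(top_2$Keyword)
```

Sorting before slicing matters: taking the first rows of the unsorted table would return skills in alphabetical order rather than the most requested ones.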

Top 10 Technical Skills for Data Scientists

Data transformations for visualization

ds_skills_tech <- filter(ds_skills, ds_skills$Skill_Set == "Tech")
dim(ds_skills_tech)
## [1] 37  7
# View the first rows of the technical skills
head(ds_skills_tech)
##   Keyword Skill_Set  LinkedIn    Indeed SimplyHired   Monster     Total
## 1     AWS      Tech 10.998839 15.395095   15.852703 12.466631 13.187638
## 2   Azure      Tech  6.713124  8.096536    7.443197  7.261078  7.273836
## 3       C      Tech  9.233449  9.575710   10.028728 13.961559 10.289359
## 4      C#      Tech  3.763066  4.768392    4.753199  5.846236  4.549078
## 5     C++      Tech 11.893148 14.889062   15.147558 11.719167 13.168879
## 6   Caffe      Tech  2.392567  2.899961    2.951162  2.562734  2.645031
#Arrange data from largest to smallest by the Totals column
ds_skills_tech <- arrange(ds_skills_tech, desc(Total))
#Filter out only the Top 10 technical skills
top_10_tech <- ds_skills_tech[1:10,c(1,3:7)]
top_10_tech
##    Keyword LinkedIn   Indeed SimplyHired  Monster    Total
## 1   Python 73.71661 74.30907    75.42439 67.91244 73.14637
## 2        R 52.88037 60.45154    62.49674 63.13401 58.23289
## 3      SQL 45.05226 51.14831    53.69548 49.14576 48.79238
## 4   Hadoop 24.87805 30.71234    30.39958 32.03417 28.53257
## 5    Spark 25.19164 30.18684    30.47793 28.35024 27.89945
## 6     Java 22.57840 26.80031    27.65735 26.74853 25.24035
## 7      SAS 19.89547 22.07084    23.76600 26.10785 22.20607
## 8  Tableau 14.12311 19.69638    20.37085 19.86119 17.59602
## 9     Hive 13.72822 16.15415    16.63620 16.52429 15.32617
## 10   Scala 12.07898 14.38303    15.38261 13.88147 13.54406

Top ten Technical skills as a percentage of total Data Science jobs

# One bar per skill: keywords on the x-axis, overall percentage on the y-axis
x <- plot_ly(top_10_tech, x = ~Keyword, y = ~Total, type = 'bar', name = ~Keyword)
x

Top ten Non-Technical skills as a percentage of total Data Science jobs

# One bar per skill: keywords on the x-axis, overall percentage on the y-axis
y <- plot_ly(top_10_soft, x = ~Keyword, y = ~Total, type = 'bar', name = ~Keyword)
y

Bar Graphs

Top 10 technical skills as a percentage of total data science jobs, by job site

top_10_tech_visual <- data.frame("Keyword"=top_10_tech, top_10_tech)
head(top_10_tech_visual, 2)
##   Keyword.Keyword Keyword.LinkedIn Keyword.Indeed Keyword.SimplyHired
## 1          Python         73.71661       74.30907            75.42439
## 2               R         52.88037       60.45154            62.49674
##   Keyword.Monster Keyword.Total Keyword LinkedIn   Indeed SimplyHired
## 1        67.91244      73.14637  Python 73.71661 74.30907    75.42439
## 2        63.13401      58.23289       R 52.88037 60.45154    62.49674
##    Monster    Total
## 1 67.91244 73.14637
## 2 63.13401 58.23289
data <- top_10_tech_visual[,c('Keyword', 'LinkedIn', 'Indeed', 'SimplyHired', 'Monster', 'Total')]
head(data, 2)
##   Keyword LinkedIn   Indeed SimplyHired  Monster    Total
## 1  Python 73.71661 74.30907    75.42439 67.91244 73.14637
## 2       R 52.88037 60.45154    62.49674 63.13401 58.23289
s <- plot_ly(data, x = ~Keyword, y = ~LinkedIn, type = 'bar', name = "LinkedIn") %>%
  add_trace(y = ~Indeed, name = "Indeed") %>%
  add_trace(y = ~SimplyHired, name = "SimplyHired") %>%
  add_trace(y = ~Monster, name = "Monster") %>%
  add_trace(y = ~Total, name = "Average") %>%
  layout(title = 'Technical Skills by Job Sites', yaxis = list(title = '% of Data Science Jobs'), xaxis = list(title = 'Tech Skills'), barmode = 'group')
s

Bar Graphs

Top 10 non-technical skills as a percentage of total data science jobs, by job site

top_10_soft_visual <- data.frame("Keyword"=top_10_soft, top_10_soft)
head(top_10_soft_visual, 2)
##    Keyword.Keyword Keyword.LinkedIn Keyword.Indeed Keyword.SimplyHired
## 1         analysis         60.02323       68.11989            69.67877
## 2 machine learning         66.21370       66.93266            66.88430
##   Keyword.Monster Keyword.Total          Keyword LinkedIn   Indeed
## 1        88.25414      68.66764         analysis 60.02323 68.11989
## 2        62.46663      65.84908 machine learning 66.21370 66.93266
##   SimplyHired  Monster    Total
## 1    69.67877 88.25414 68.66764
## 2    66.88430 62.46663 65.84908
data2 <- top_10_soft_visual[,c('Keyword', 'LinkedIn', 'Indeed', 'SimplyHired', 'Monster', 'Total')]
head(data2, 2)
##            Keyword LinkedIn   Indeed SimplyHired  Monster    Total
## 1         analysis 60.02323 68.11989    69.67877 88.25414 68.66764
## 2 machine learning 66.21370 66.93266    66.88430 62.46663 65.84908
t <- plot_ly(data2, x = ~Keyword, y = ~LinkedIn, type = 'bar', name = "LinkedIn") %>%
  add_trace(y = ~Indeed, name = "Indeed") %>%
  add_trace(y = ~SimplyHired, name = "SimplyHired") %>%
  add_trace(y = ~Monster, name = "Monster") %>%
  add_trace(y = ~Total, name = "Average") %>%
  layout(title = 'Non-Technical Skills by Job Sites', yaxis = list(title = '% of Data Science Jobs'), xaxis = list(title = 'Non-Tech Skills'), barmode = 'group')
t

Top two “Technology” Skills for Data Scientist

head(ds_skills_tech, 2)
##   Keyword Skill_Set LinkedIn   Indeed SimplyHired  Monster    Total
## 1  Python      Tech 73.71661 74.30907    75.42439 67.91244 73.14637
## 2       R      Tech 52.88037 60.45154    62.49674 63.13401 58.23289

Top two “Non-Technology” Skills for Data Scientist

head(ds_skills_soft, 2)
##            Keyword Skill_Set LinkedIn   Indeed SimplyHired  Monster
## 1         analysis      Soft 60.02323 68.11989    69.67877 88.25414
## 2 machine learning      Soft 66.21370 66.93266    66.88430 62.46663
##      Total
## 1 68.66764
## 2 65.84908
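
To compare the two groups directly, the top rows from both tables can be placed side by side and ranked together. The sketch below copies the four Total values printed above; `top_rows`, `overall_top`, and `top_by_group` are illustrative names, not objects from the analysis.

```r
# Top two rows of each group, with Total values copied from the output above
top_rows <- data.frame(
  Keyword   = c("Python", "R", "analysis", "machine learning"),
  Skill_Set = c("Tech", "Tech", "Soft", "Soft"),
  Total     = c(73.14637, 58.23289, 68.66764, 65.84908),
  stringsAsFactors = FALSE
)

# Highest Total across both groups
overall_top <- top_rows[which.max(top_rows$Total), ]

# Highest Total within each group (groups come back in alphabetical order)
top_by_group <- do.call(
  rbind,
  lapply(split(top_rows, top_rows$Skill_Set), function(g) g[which.max(g$Total), ])
)

print(overall_top$Keyword)
print(top_by_group[, c("Skill_Set", "Keyword", "Total")])
```

On these values, Python leads overall and within the technical group, and analysis leads the non-technical group.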

Conclusion

Employers look for many skills and skill sets when hiring a data scientist. Because data science is a multidisciplinary field, it is fitting that companies look for both technology-based skills and soft skills in a data scientist.

The top technology skill that employers are looking for when hiring a data scientist, according to this dataset, is Python, which appears in about 73% of listings.

The top soft skill that employers are looking for when hiring a data scientist, according to this dataset, is analysis, which appears in about 69% of listings.

The top overall skill that employers are looking for when hiring a data scientist, according to this dataset, is Python.