Data Scientist Skills: Compared

10/22/2019

Background

During early research, we found this study:
https://365datascience.com/research-into-1001-data-scientist-profiles/

imgage <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/journal.png"
include_graphics(imgage)

As a team, we saw that the study's findings fit the criteria we needed in order to complete this project.
However, we still needed a data pool from which to start calculating parameters to answer our research question(s).
With some further research, we found a similar project that set out to answer a similar question.

-https://www.kaggle.com/discdiver/the-most-in-demand-skills-for-data-scientists

imgage <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/kaggle.png"
include_graphics(imgage)

We decided to use the two main data sets from this Kaggle project as our data pool for further analysis.

imgage <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/kaggle_data.png"
include_graphics(imgage)

 general_skills <- read_csv(file="https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/kaggle_ds_general_skills_revised.csv")

## Parsed with column specification:
## cols(
##   Keyword = col_character(),
##   LinkedIn = col_number(),
##   Indeed = col_number(),
##   SimplyHired = col_number(),
##   Monster = col_number()
## )

head(general_skills)

## # A tibble: 6 x 5
##   Keyword          LinkedIn Indeed SimplyHired Monster
##   <chr>               <dbl>  <dbl>       <dbl>   <dbl>
## 1 machine learning     5701   3439        2561    2340
## 2 analysis             5168   3500        2668    3306
## 3 statistics           4893   2992        2308    2399
## 4 computer science     4517   2739        2093    1900
## 5 communication        3404   2344        1791    2053
## 6 mathematics          2605   1961        1497    1815

software_skills <-read_csv(file = "https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/kaggle_ds_job_listing_software.csv")

## Parsed with column specification:
## cols(
##   Keyword = col_character(),
##   LinkedIn = col_number(),
##   Indeed = col_number(),
##   SimplyHired = col_number(),
##   Monster = col_number(),
##   `LinkedIn %` = col_character(),
##   `Indeed %` = col_character(),
##   `SimplyHired %` = col_character(),
##   `Monster %` = col_character(),
##   `Avg %` = col_character(),
##   `GlassDoor Self Reported % 2017` = col_character(),
##   Difference = col_character()
## )

head(software_skills)

## # A tibble: 6 x 12
##   Keyword LinkedIn Indeed SimplyHired Monster `LinkedIn %` `Indeed %`
##   <chr>      <dbl>  <dbl>       <dbl>   <dbl> <chr>        <chr>     
## 1 Python      6347   3818        2888    2544 74%          74%       
## 2 R           4553   3106        2393    2365 53%          60%       
## 3 SQL         3879   2628        2056    1841 45%          51%       
## 4 Spark       2169   1551        1167    1062 25%          30%       
## 5 Hadoop      2142   1578        1164    1200 25%          31%       
## 6 Java        1944   1377        1059    1002 23%          27%       
## # ... with 5 more variables: `SimplyHired %` <chr>, `Monster %` <chr>,
## #   `Avg %` <chr>, `GlassDoor Self Reported % 2017` <chr>,
## #   Difference <chr>

Research Question(s)

What are the most important technology-based skill and soft skill, respectively, when hiring a Data Scientist?
Overall, what is the most sought-after skill for a Data Scientist?

Data import and cleaning

As a team, we thought it best to join these two data sets, first cleaning the respective data frames into a common format so they could be merged.

# Tag each data set with its skill category before merging
dt1 <- as_tibble(general_skills)
dt1$Skill_Set <- "Soft"
dt2 <- as_tibble(software_skills)
dt2$Skill_Set <- "Tech"
# Drop the percentage and comparison columns (6 to 12) so dt2 matches dt1
dt2 <- select(dt2, -(6:12))

Combine Data-frames

# Both tibbles now share the same six columns, so they can be stacked row-wise
result <- rbind(dt1, dt2)
head(result)

## # A tibble: 6 x 6
##   Keyword          LinkedIn Indeed SimplyHired Monster Skill_Set
##   <chr>               <dbl>  <dbl>       <dbl>   <dbl> <chr>    
## 1 machine learning     5701   3439        2561    2340 Soft     
## 2 analysis             5168   3500        2668    3306 Soft     
## 3 statistics           4893   2992        2308    2399 Soft     
## 4 computer science     4517   2739        2093    1900 Soft     
## 5 communication        3404   2344        1791    2053 Soft     
## 6 mathematics          2605   1961        1497    1815 Soft
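
Stacking with `rbind()` only works cleanly when both data frames carry the same column names. The following is a minimal sketch of that check on toy data; the frames `soft` and `tech` are hypothetical stand-ins that reuse a few counts from the outputs above, not the project's actual objects.

```r
# Hypothetical two-row stand-ins for dt1 and dt2 (LinkedIn counts from the outputs above)
soft <- data.frame(Keyword = c("machine learning", "analysis"),
                   LinkedIn = c(5701, 5168),
                   Skill_Set = "Soft",
                   stringsAsFactors = FALSE)
tech <- data.frame(Keyword = c("Python", "R"),
                   LinkedIn = c(6347, 4553),
                   Skill_Set = "Tech",
                   stringsAsFactors = FALSE)

# Fail early if the column layouts differ, instead of getting an rbind() error later
stopifnot(identical(names(soft), names(tech)))

combined <- rbind(soft, tech)
print(table(combined$Skill_Set))
```

Checking the names before the merge makes a mismatch, such as a leftover percentage column, show up at the point where it is easiest to fix.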

Data-frame cleaning and tidying

ds_skills <- as.data.frame(result)
# Total keyword count per skill across the four job sites
ds_skills$Total <- rowSums(ds_skills[, 2:5])
# Reorder columns so that Skill_Set follows Keyword
ds_skills <- ds_skills[, c(1, 6, 2:5, 7)]
colnames(ds_skills)

## [1] "Keyword"     "Skill_Set"   "LinkedIn"    "Indeed"      "SimplyHired"
## [6] "Monster"     "Total"

head(ds_skills)

##            Keyword Skill_Set LinkedIn Indeed SimplyHired Monster Total
## 1 machine learning      Soft     5701   3439        2561    2340 14041
## 2         analysis      Soft     5168   3500        2668    3306 14642
## 3       statistics      Soft     4893   2992        2308    2399 12592
## 4 computer science      Soft     4517   2739        2093    1900 11249
## 5    communication      Soft     3404   2344        1791    2053  9592
## 6      mathematics      Soft     2605   1961        1497    1815  7878

Creation of New Variable

The values in the two combined data sets are the number of times each keyword appeared in a general search on well-known job recruitment websites. As a team, we thought it would be more effective to convert these counts into proportions of each site's total, along with an overall proportion across all four sites, to get a better understanding of the data set.

# Per-site totals used as denominators
LinkedIn <- 8610
Indeed <- 5138
SimplyHired <- 3829
Monster <- 3746
Total <- LinkedIn + Indeed + SimplyHired + Monster
# Note: dividing by (total * 100) stores the proportion divided by 100,
# so multiply a stored value by 10,000 to read it as a percentage.
ds_skills$LinkedIn <- ds_skills$LinkedIn / (LinkedIn * 100)
ds_skills$Indeed <- ds_skills$Indeed / (Indeed * 100)
ds_skills$SimplyHired <- ds_skills$SimplyHired / (SimplyHired * 100)
ds_skills$Monster <- ds_skills$Monster / (Monster * 100)
ds_skills$Total <- ds_skills$Total / (Total * 100)
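
As a sanity check on the scaling, the arithmetic can be worked through for a single keyword. This is a sketch only: `site_totals` and `python_counts` are names introduced here for illustration, with values copied from the code and the `head(software_skills)` output above.

```r
# Denominators from the normalization step above
site_totals <- c(LinkedIn = 8610, Indeed = 5138, SimplyHired = 3829, Monster = 3746)
# Raw counts for the Python keyword, copied from head(software_skills)
python_counts <- c(LinkedIn = 6347, Indeed = 3818, SimplyHired = 2888, Monster = 2544)

# Count divided by site total, expressed as a whole-number percentage
python_pct <- round(python_counts / site_totals * 100)
print(python_pct)

# The value the report stores divides by (total * 100) instead
python_stored <- python_counts / (site_totals * 100)
print(python_stored * 10000)  # rescaled back to a percentage
```

The rounded LinkedIn and Indeed figures agree with the 74% shown in the `LinkedIn %` and `Indeed %` columns of the source data, which suggests the constants are per-site listing totals and that the stored values are percentages scaled down by a factor of 10,000.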

Set up for Bar Graphs

ds_skills_soft <- filter(ds_skills, Skill_Set == "Soft")
dim(ds_skills_soft)

## [1] 15  7

head(ds_skills_soft, 20)

##                 Keyword Skill_Set     LinkedIn       Indeed  SimplyHired
## 1      machine learning      Soft 0.0066213705 0.0066932659 0.0066884304
## 2              analysis      Soft 0.0060023229 0.0068119891 0.0069678767
## 3            statistics      Soft 0.0056829268 0.0058232775 0.0060276835
## 4      computer science      Soft 0.0052462253 0.0053308680 0.0054661792
## 5         communication      Soft 0.0039535424 0.0045620864 0.0046774615
## 6           mathematics      Soft 0.0030255517 0.0038166602 0.0039096370
## 7         visualization      Soft 0.0021823461 0.0027500973 0.0030112301
## 8          AI composite      Soft 0.0018211382 0.0021895679 0.0021180465
## 9         deep learning      Soft 0.0015214866 0.0019054107 0.0017628624
## 10        NLP composite      Soft 0.0014076655 0.0017711172 0.0017236876
## 11 software development      Soft 0.0008501742 0.0012203192 0.0012562027
## 12      neural networks      Soft 0.0007793264 0.0009439471 0.0010995038
## 13     data engineering      Soft 0.0005969803 0.0006597898 0.0007208148
## 14   project management      Soft 0.0005528455 0.0007726742 0.0008618438
## 15 software engineering      Soft 0.0004796748 0.0005741534 0.0006529120
##         Monster        Total
## 1  0.0062466631 0.0065849083
## 2  0.0088254138 0.0068667636
## 3  0.0064041644 0.0059053604
## 4  0.0050720769 0.0052755241
## 5  0.0054805125 0.0044984289
## 6  0.0048451682 0.0036946021
## 7  0.0032221036 0.0026506589
## 8  0.0018339562 0.0019654833
## 9  0.0016177256 0.0016742485
## 10 0.0015536572 0.0015776392
## 11 0.0020928991 0.0012305961
## 12 0.0008142018 0.0008826150
## 13 0.0005339028 0.0006232706
## 14 0.0009289909 0.0007273836
## 15 0.0013667912 0.0006893964

ds_skills_soft <- arrange(ds_skills_soft, desc(Total))

top_10_soft <- ds_skills_soft[1:10,c(1,3:7)]
top_10_soft

##             Keyword    LinkedIn      Indeed SimplyHired     Monster
## 1          analysis 0.006002323 0.006811989 0.006967877 0.008825414
## 2  machine learning 0.006621370 0.006693266 0.006688430 0.006246663
## 3        statistics 0.005682927 0.005823278 0.006027683 0.006404164
## 4  computer science 0.005246225 0.005330868 0.005466179 0.005072077
## 5     communication 0.003953542 0.004562086 0.004677461 0.005480513
## 6       mathematics 0.003025552 0.003816660 0.003909637 0.004845168
## 7     visualization 0.002182346 0.002750097 0.003011230 0.003222104
## 8      AI composite 0.001821138 0.002189568 0.002118046 0.001833956
## 9     deep learning 0.001521487 0.001905411 0.001762862 0.001617726
## 10    NLP composite 0.001407666 0.001771117 0.001723688 0.001553657
##          Total
## 1  0.006866764
## 2  0.006584908
## 3  0.005905360
## 4  0.005275524
## 5  0.004498429
## 6  0.003694602
## 7  0.002650659
## 8  0.001965483
## 9  0.001674248
## 10 0.001577639

Top Ten Non-Technical skills as a percentage of total Data Science jobs

Top 5 Technical Skills for Data Scientists

Data transformations for visualization

ds_skills_tech <- filter(ds_skills, Skill_Set == "Tech")

# Arrange data from largest to smallest by the Total column
ds_skills_tech <- arrange(ds_skills_tech, desc(Total))

# Keep only the top 10 technical skills, dropping the Skill_Set column
top_10_tech <- ds_skills_tech[1:10, c(1, 3:7)]

Top Ten Technical skills as a percentage of total Data Science jobs

Top 10 Tech Skills as an average percentage of total data science jobs by job site

# top_10_tech already holds the plotting columns, so no extra wrapping is needed
top_10_tech_visual <- top_10_tech

data <- top_10_tech_visual[, c("Keyword", "LinkedIn", "Indeed", "SimplyHired", "Monster", "Total")]

Top 10 Non-Tech Skills as an average percentage of total data science jobs by job site

# top_10_soft already holds the plotting columns, so no extra wrapping is needed
top_10_soft_visual <- top_10_soft

data2 <- top_10_soft_visual[, c("Keyword", "LinkedIn", "Indeed", "SimplyHired", "Monster", "Total")]
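
The plotting code for these per-site comparisons is not shown, so the following is only a sketch of one way to draw a grouped bar chart from a table shaped like the ones prepared above, using base R's `barplot()`. The two-row frame `top_tech` is a hypothetical sample built from values printed elsewhere in this report, and the rescaling by 10,000 assumes the stored values are percentages divided by 10,000.

```r
# Hypothetical two-row sample in the same shape as the plotting tables
top_tech <- data.frame(
  Keyword     = c("Python", "R"),
  LinkedIn    = c(0.007371661, 0.005288037),
  Indeed      = c(0.007430907, 0.006045154),
  SimplyHired = c(0.007542439, 0.006249674),
  Monster     = c(0.006791244, 0.006313401),
  stringsAsFactors = FALSE
)
sites <- c("LinkedIn", "Indeed", "SimplyHired", "Monster")

# Rows become job sites and columns become keywords; rescale to percentages
pct_matrix <- t(as.matrix(top_tech[, sites])) * 10000
colnames(pct_matrix) <- top_tech$Keyword

# One group of bars per keyword, one bar per job site
barplot(pct_matrix,
        beside = TRUE,
        legend.text = rownames(pct_matrix),
        ylab = "Percentage of listings",
        main = "Technical skills by job site (sample)")
```

Transposing first matters because `barplot()` with `beside = TRUE` groups bars by matrix column, so keywords must sit in the columns for each skill to get its own cluster of site bars.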

Top two “Technology” Skills for Data Scientists

head(ds_skills_tech, 2)

##   Keyword Skill_Set    LinkedIn      Indeed SimplyHired     Monster
## 1  Python      Tech 0.007371661 0.007430907 0.007542439 0.006791244
## 2       R      Tech 0.005288037 0.006045154 0.006249674 0.006313401
##         Total
## 1 0.007314637
## 2 0.005823289

Top two “Non-Technology” Skills for Data Scientists

head(ds_skills_soft, 2)

##            Keyword Skill_Set    LinkedIn      Indeed SimplyHired
## 1         analysis      Soft 0.006002323 0.006811989 0.006967877
## 2 machine learning      Soft 0.006621370 0.006693266 0.006688430
##       Monster       Total
## 1 0.008825414 0.006866764
## 2 0.006246663 0.006584908
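
The leaders in each category, and the overall leader, can also be picked out programmatically rather than read off the tables. A minimal sketch, assuming a small hypothetical frame `leaders` built from the four rows printed above:

```r
# Top two rows from each category, with Total values copied from the outputs above
leaders <- data.frame(
  Keyword   = c("Python", "R", "analysis", "machine learning"),
  Skill_Set = c("Tech", "Tech", "Soft", "Soft"),
  Total     = c(0.007314637, 0.005823289, 0.006866764, 0.006584908),
  stringsAsFactors = FALSE
)

# Highest-Total keyword within each skill set
top_by_set <- vapply(split(leaders, leaders$Skill_Set),
                     function(d) d$Keyword[which.max(d$Total)],
                     character(1))
print(top_by_set)

# Highest-Total keyword across both skill sets
top_overall <- leaders$Keyword[which.max(leaders$Total)]
print(top_overall)
```

Comparing `Total` across both categories is what supports naming a single overall skill, since the two skill sets share the same denominators.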

Conclusion

There are many skills and skill sets that employers look for when hiring a data scientist. Given that Data Science is a multidisciplinary field, it is only fitting that companies look for both technology-based skills and soft skills in a data scientist.

The top Technology skill that employers are looking for when hiring a data scientist, according to this dataset, is Python.

The top Soft skill that employers are looking for when hiring a data scientist, according to this dataset, is Analysis.

The top Overall skill that employers are looking for when hiring a data scientist, according to this dataset, is Python.