Data Scientist Skills: Compared

10/22/2019

Background

During early research, we found this study:
https://365datascience.com/research-into-1001-data-scientist-profiles/

library(knitr)  # provides include_graphics()
image_path <- "journal.png"
include_graphics(image_path)

As a team, we saw that its findings fit the criteria we needed to calculate in order to complete this project.
However, we still needed some sort of data pool from which to calculate parameters and answer our research question(s).
With some further research, we found a similar project that set out to answer a similar question.

-https://www.kaggle.com/discdiver/the-most-in-demand-skills-for-data-scientists

image_path <- "kaggle.png"
include_graphics(image_path)

We decided to use the two main data sets from this Kaggle project as our data pool for further analysis.

image_path <- "kaggle_data.png"
include_graphics(image_path)

library(readr)  # provides read_csv()
general_skills <- read_csv(file = "https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/kaggle_ds_general_skills_revised.csv")

## Parsed with column specification:
## cols(
##   Keyword = col_character(),
##   LinkedIn = col_number(),
##   Indeed = col_number(),
##   SimplyHired = col_number(),
##   Monster = col_number()
## )

head(general_skills)

## # A tibble: 6 x 5
##   Keyword          LinkedIn Indeed SimplyHired Monster
##   <chr>               <dbl>  <dbl>       <dbl>   <dbl>
## 1 machine learning     5701   3439        2561    2340
## 2 analysis             5168   3500        2668    3306
## 3 statistics           4893   2992        2308    2399
## 4 computer science     4517   2739        2093    1900
## 5 communication        3404   2344        1791    2053
## 6 mathematics          2605   1961        1497    1815

software_skills <- read_csv(file = "https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/kaggle_ds_job_listing_software.csv")

## Parsed with column specification:
## cols(
##   Keyword = col_character(),
##   LinkedIn = col_number(),
##   Indeed = col_number(),
##   SimplyHired = col_number(),
##   Monster = col_number(),
##   `LinkedIn %` = col_character(),
##   `Indeed %` = col_character(),
##   `SimplyHired %` = col_character(),
##   `Monster %` = col_character(),
##   `Avg %` = col_character(),
##   `GlassDoor Self Reported % 2017` = col_character(),
##   Difference = col_character()
## )

head(software_skills)

## # A tibble: 6 x 12
##   Keyword LinkedIn Indeed SimplyHired Monster `LinkedIn %` `Indeed %`
##   <chr>      <dbl>  <dbl>       <dbl>   <dbl> <chr>        <chr>     
## 1 Python      6347   3818        2888    2544 74%          74%       
## 2 R           4553   3106        2393    2365 53%          60%       
## 3 SQL         3879   2628        2056    1841 45%          51%       
## 4 Spark       2169   1551        1167    1062 25%          30%       
## 5 Hadoop      2142   1578        1164    1200 25%          31%       
## 6 Java        1944   1377        1059    1002 23%          27%       
## # ... with 5 more variables: `SimplyHired %` <chr>, `Monster %` <chr>,
## #   `Avg %` <chr>, `GlassDoor Self Reported % 2017` <chr>,
## #   Difference <chr>

Research Question(s)

What are the most important Technology-based skill and Soft skill, respectively, when hiring a Data Scientist?
Overall, what is the most sought-after skill for a Data Scientist?

Data import and cleaning

As a team, we thought it best to join these two data sets, first cleaning the respective data frames into a similar format so that they could be merged.

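library(tibble)  # as_tibble()
library(dplyr)   # select(), filter(), arrange()
# read_csv() already names the first column "Keyword", so no key argument is needed
dt1 <- as_tibble(general_skills)
dt1$Skill_Set <- "Soft"
dt2 <- as_tibble(software_skills)
dt2$Skill_Set <- "Tech"
# Drop the percentage and comparison columns (positions 6 to 12) so dt2 matches dt1
dt2 <- select(dt2, -c(6:12))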

Combine Data-frames

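# Stack the two tables; both have the same six columns at this point
result <- rbind(dt1, dt2)
# Guard against a byte-order-mark prefix on the first column name (no effect if it is already "Keyword")
names(result)[names(result) == "ï..Keyword"] <- "Keyword"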
head(result)

## # A tibble: 6 x 6
##   Keyword          LinkedIn Indeed SimplyHired Monster Skill_Set
##   <chr>               <dbl>  <dbl>       <dbl>   <dbl> <chr>    
## 1 machine learning     5701   3439        2561    2340 Soft     
## 2 analysis             5168   3500        2668    3306 Soft     
## 3 statistics           4893   2992        2308    2399 Soft     
## 4 computer science     4517   2739        2093    1900 Soft     
## 5 communication        3404   2344        1791    2053 Soft     
## 6 mathematics          2605   1961        1497    1815 Soft
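
The labelling and stacking steps can also be done in a single call. The following is a minimal sketch rather than the team's code: it assumes dplyr is installed, the names `soft_demo`, `tech_demo` and `combined_demo` are hypothetical, and the two one-row data frames simply reuse counts shown in the output above.

```r
library(dplyr)

# One row from each source table, copied from the printed output
soft_demo <- data.frame(
  Keyword = "machine learning",
  LinkedIn = 5701, Indeed = 3439, SimplyHired = 2561, Monster = 2340
)
tech_demo <- data.frame(
  Keyword = "Python",
  LinkedIn = 6347, Indeed = 3818, SimplyHired = 2888, Monster = 2544
)

# bind_rows() stacks the inputs; with .id, the argument names ("Soft", "Tech")
# are written into a new Skill_Set column, so no separate assignment is needed
combined_demo <- bind_rows(Soft = soft_demo, Tech = tech_demo, .id = "Skill_Set")
print(combined_demo)
```

Unlike rbind(), bind_rows() matches columns by name and fills missing ones with NA, so it is worth checking the column names of the result after combining.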

Data-frame cleaning and tidying

ds_skills <- as.data.frame(result)
# Total listings per keyword across the four job sites
ds_skills$Total <- rowSums(sapply(ds_skills[, 2:5], as.numeric))
# Reorder columns: Keyword, Skill_Set, the four sites, then Total
ds_skills <- ds_skills[, c(1, 6, 2:5, 7)]
colnames(ds_skills)

## [1] "Keyword"     "Skill_Set"   "LinkedIn"    "Indeed"      "SimplyHired"
## [6] "Monster"     "Total"

head(ds_skills)

##            Keyword Skill_Set LinkedIn Indeed SimplyHired Monster Total
## 1 machine learning      Soft     5701   3439        2561    2340 14041
## 2         analysis      Soft     5168   3500        2668    3306 14642
## 3       statistics      Soft     4893   2992        2308    2399 12592
## 4 computer science      Soft     4517   2739        2093    1900 11249
## 5    communication      Soft     3404   2344        1791    2053  9592
## 6      mathematics      Soft     2605   1961        1497    1815  7878

Creation of New Variable

The values in these two combined data sets are the number of listings that matched each keyword in a general search on well-known job recruitment websites. As a team, we thought it more effective to express these counts as a share of each site's total number of data science jobs, along with an overall share across the four sites, to get a better understanding of the data set.

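# Per-site totals used as denominators (total data science job listings on each site)
LinkedIn <- 8610
Indeed <- 5138
SimplyHired <- 3829
Monster <- 3746
Total <- LinkedIn + Indeed + SimplyHired + Monster
# Convert each site's keyword count to a percentage of that site's listings
ds_skills$LinkedIn <- ds_skills$LinkedIn / LinkedIn * 100
ds_skills$Indeed <- ds_skills$Indeed / Indeed * 100
ds_skills$SimplyHired <- ds_skills$SimplyHired / SimplyHired * 100
ds_skills$Monster <- ds_skills$Monster / Monster * 100
# Total is kept as a proportion between 0 and 1, not a percentage like the site columns
ds_skills$Total <- ds_skills$Total / Total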
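
The same normalization can be written once for all four sites instead of one line per site. The following is a minimal sketch under stated assumptions, not the team's code: `site_totals` and `to_percent` are hypothetical names, and the two-row data frame reuses counts from the output printed earlier.

```r
# Per-site denominators, taken from the values assigned above
site_totals <- c(LinkedIn = 8610, Indeed = 5138, SimplyHired = 3829, Monster = 3746)

# Hypothetical helper: convert each site column from a count to a percentage
to_percent <- function(df, totals) {
  for (site in names(totals)) {
    df[[site]] <- df[[site]] / totals[[site]] * 100
  }
  df
}

# Two rows copied from the counts printed earlier
demo <- data.frame(
  Keyword = c("machine learning", "analysis"),
  LinkedIn = c(5701, 5168),
  Indeed = c(3439, 3500),
  SimplyHired = c(2561, 2668),
  Monster = c(2340, 3306)
)
demo_pct <- to_percent(demo, site_totals)
print(demo_pct)
```

Keeping the denominators in one named vector makes it harder to divide a column by the wrong site's total.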

Set up for Bar Graphs

ds_skills_soft <- filter(ds_skills, Skill_Set == "Soft")
dim(ds_skills_soft)

## [1] 15  7

head(ds_skills_soft)

##            Keyword Skill_Set LinkedIn   Indeed SimplyHired  Monster
## 1 machine learning      Soft 66.21370 66.93266    66.88430 62.46663
## 2         analysis      Soft 60.02323 68.11989    69.67877 88.25414
## 3       statistics      Soft 56.82927 58.23278    60.27683 64.04164
## 4 computer science      Soft 52.46225 53.30868    54.66179 50.72077
## 5    communication      Soft 39.53542 45.62086    46.77461 54.80513
## 6      mathematics      Soft 30.25552 38.16660    39.09637 48.45168
##       Total
## 1 0.6584908
## 2 0.6866764
## 3 0.5905360
## 4 0.5275524
## 5 0.4498429
## 6 0.3694602

ds_skills_soft <- arrange(ds_skills_soft, desc(Total))
top_10_soft <- ds_skills_soft[1:10,c(1,3:7)]
top_10_soft

##             Keyword LinkedIn   Indeed SimplyHired  Monster     Total
## 1          analysis 60.02323 68.11989    69.67877 88.25414 0.6866764
## 2  machine learning 66.21370 66.93266    66.88430 62.46663 0.6584908
## 3        statistics 56.82927 58.23278    60.27683 64.04164 0.5905360
## 4  computer science 52.46225 53.30868    54.66179 50.72077 0.5275524
## 5     communication 39.53542 45.62086    46.77461 54.80513 0.4498429
## 6       mathematics 30.25552 38.16660    39.09637 48.45168 0.3694602
## 7     visualization 21.82346 27.50097    30.11230 32.22104 0.2650659
## 8      AI composite 18.21138 21.89568    21.18046 18.33956 0.1965483
## 9     deep learning 15.21487 19.05411    17.62862 16.17726 0.1674248
## 10    NLP composite 14.07666 17.71117    17.23688 15.53657 0.1577639

Top Ten Non-Technical skills as a percentage of total Data Science jobs

Top 5 Technical Skills for Data Scientists

Data transformations for visualization

ds_skills_tech <- filter(ds_skills, Skill_Set == "Tech")
# Arrange data from largest to smallest by the Total column
ds_skills_tech <- arrange(ds_skills_tech, desc(Total))
# Keep only the top 10 technical skills, dropping the Skill_Set column
top_10_tech <- ds_skills_tech[1:10, c(1, 3:7)]

Top Ten Technical skills as a percentage of total Data Science jobs

Top 10 Tech Skills as an average percentage of total data science jobs by job site

# Copy the top 10 table and fix the column order used for plotting
top_10_tech_visual <- data.frame(top_10_tech)
data <- top_10_tech_visual[, c("Keyword", "LinkedIn", "Indeed", "SimplyHired", "Monster", "Total")]

Top 10 Non-Tech Skills as an average percentage of total data science jobs by job site

# Copy the top 10 table and fix the column order used for plotting
top_10_soft_visual <- data.frame(top_10_soft)
data2 <- top_10_soft_visual[, c("Keyword", "LinkedIn", "Indeed", "SimplyHired", "Monster", "Total")]
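
A grouped bar chart by job site could be drawn from a data frame shaped like data2 roughly as follows. This is a sketch, not the team's plotting code: it assumes ggplot2 and tidyr are installed, the names `soft_demo`, `soft_long` and `p` are hypothetical, and the two rows reuse percentages from the top 10 non-technical table printed above.

```r
library(ggplot2)
library(tidyr)

# Two rows copied from the top 10 non-technical table
soft_demo <- data.frame(
  Keyword = c("analysis", "machine learning"),
  LinkedIn = c(60.02323, 66.21370),
  Indeed = c(68.11989, 66.93266),
  SimplyHired = c(69.67877, 66.88430),
  Monster = c(88.25414, 62.46663)
)

# Reshape to one row per keyword and job site so the bars can be grouped by site
soft_long <- pivot_longer(
  soft_demo,
  cols = c(LinkedIn, Indeed, SimplyHired, Monster),
  names_to = "Site",
  values_to = "Percent"
)

# One cluster of bars per skill, one bar per job site
p <- ggplot(soft_long, aes(x = Keyword, y = Percent, fill = Site)) +
  geom_col(position = "dodge") +
  labs(x = "Skill", y = "% of data science jobs", fill = "Job site")
print(p)
```

The long format is what lets a single fill aesthetic separate the four sites; the wide table would need one layer per column.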

Top two "Technology" Skills for Data Scientists

head(ds_skills_tech, 2)

##   Keyword Skill_Set LinkedIn   Indeed SimplyHired  Monster     Total
## 1  Python      Tech 73.71661 74.30907    75.42439 67.91244 0.7314637
## 2       R      Tech 52.88037 60.45154    62.49674 63.13401 0.5823289

Top two "Non-Technology" Skills for Data Scientists

head(ds_skills_soft, 2)

##            Keyword Skill_Set LinkedIn   Indeed SimplyHired  Monster
## 1         analysis      Soft 60.02323 68.11989    69.67877 88.25414
## 2 machine learning      Soft 66.21370 66.93266    66.88430 62.46663
##       Total
## 1 0.6866764
## 2 0.6584908

Conclusion

There are many skills and skill sets that employers look for when hiring a data scientist. Given that Data Science is a multidisciplinary field, it is only fitting that companies look for both Technology-based skills and Soft skills in a data scientist.

The top Technology skill that employers are looking for when hiring a data scientist, according to this dataset, is Python.

The top Soft skill that employers are looking for when hiring a data scientist, according to this dataset, is Analysis.

The top Overall skill that employers are looking for when hiring a data scientist, according to this dataset, is Python.
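
These conclusions can be checked directly from the Total column. The following is a minimal sketch rather than the team's code: `top_skills`, `best_by_set` and `best_overall` are hypothetical names, and the four rows reuse the Total values from the two top-two tables above.

```r
# Totals copied from the top-two tables (proportion of listings across all four sites)
top_skills <- data.frame(
  Keyword = c("Python", "R", "analysis", "machine learning"),
  Skill_Set = c("Tech", "Tech", "Soft", "Soft"),
  Total = c(0.7314637, 0.5823289, 0.6866764, 0.6584908)
)

# Row with the highest Total within each skill set
best_by_set <- do.call(
  rbind,
  lapply(split(top_skills, top_skills$Skill_Set), function(d) d[which.max(d$Total), ])
)

# Row with the highest Total overall
best_overall <- top_skills[which.max(top_skills$Total), ]

print(best_by_set)
print(best_overall)
```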