Project # 3 - Data Scientist Skills Compared

Objective

This is a project for your entire class section to work on together, since being able to work effectively on a virtual team is a key “soft skill” for data scientists. Please note especially the requirement about making a presentation during our first meetup after the project is due.

W. Edwards Deming said, “In God we trust, all others must bring data.” Please use data to answer the question, “Which are the most valued data science skills?” Consider your work as an exploration; there is not necessarily a “right answer.”

Background

While researching for this project, we found a study, https://365datascience.com/research-into-1001-data-scientist-profiles/, whose findings fit the criteria this project needs to measure. However, as a team, we still needed a pool of raw data from which to calculate the parameters that would answer our research question(s).

library(knitr)  # provides include_graphics()
imgage <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/journal.png"
include_graphics(imgage)

With some further research, we found a similar project that set out to answer a similar question.

https://www.kaggle.com/discdiver/the-most-in-demand-skills-for-data-scientists

imgage <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/kaggle.png"
include_graphics(imgage)

For this project, we used the two main datasets from the above Kaggle web page to answer our research question(s):

kaggle_ds_job_listing_software.csv & kaggle_ds_general_skills_revised.csv

imgage <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/kaggle_data.png"
include_graphics(imgage)

library(tidyverse)  # provides read_csv(), as_tibble(), select(), filter(), arrange()
general_skills <- read_csv(file = "https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/kaggle_ds_general_skills_revised.csv")

## Parsed with column specification:
## cols(
##   Keyword = col_character(),
##   LinkedIn = col_number(),
##   Indeed = col_number(),
##   SimplyHired = col_number(),
##   Monster = col_number()
## )

head(general_skills)

## # A tibble: 6 x 5
##   Keyword          LinkedIn Indeed SimplyHired Monster
##   <chr>               <dbl>  <dbl>       <dbl>   <dbl>
## 1 machine learning     5701   3439        2561    2340
## 2 analysis             5168   3500        2668    3306
## 3 statistics           4893   2992        2308    2399
## 4 computer science     4517   2739        2093    1900
## 5 communication        3404   2344        1791    2053
## 6 mathematics          2605   1961        1497    1815

software_skills <- read_csv(file = "https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/kaggle_ds_job_listing_software.csv")

## Parsed with column specification:
## cols(
##   Keyword = col_character(),
##   LinkedIn = col_number(),
##   Indeed = col_number(),
##   SimplyHired = col_number(),
##   Monster = col_number(),
##   `LinkedIn %` = col_character(),
##   `Indeed %` = col_character(),
##   `SimplyHired %` = col_character(),
##   `Monster %` = col_character(),
##   `Avg %` = col_character(),
##   `GlassDoor Self Reported % 2017` = col_character(),
##   Difference = col_character()
## )

head(software_skills)

## # A tibble: 6 x 12
##   Keyword LinkedIn Indeed SimplyHired Monster `LinkedIn %` `Indeed %`
##   <chr>      <dbl>  <dbl>       <dbl>   <dbl> <chr>        <chr>     
## 1 Python      6347   3818        2888    2544 74%          74%       
## 2 R           4553   3106        2393    2365 53%          60%       
## 3 SQL         3879   2628        2056    1841 45%          51%       
## 4 Spark       2169   1551        1167    1062 25%          30%       
## 5 Hadoop      2142   1578        1164    1200 25%          31%       
## 6 Java        1944   1377        1059    1002 23%          27%       
## # ... with 5 more variables: `SimplyHired %` <chr>, `Monster %` <chr>,
## #   `Avg %` <chr>, `GlassDoor Self Reported % 2017` <chr>,
## #   Difference <chr>

Research Question

What are the most important technology-based skill and soft skill, respectively, when hiring a Data Scientist?

Overall, what is the most sought-after skill for a Data Scientist?

Data import and cleaning

As a team, we thought it best to join these two datasets, first cleaning the respective data frames into a common format so they could be merged.

# Tag each data frame with its skill category before stacking them
dt1 <- as_tibble(general_skills)
dt1$Skill_Set <- "Soft"
dt2 <- as_tibble(software_skills)
dt2$Skill_Set <- "Tech"
# Drop the percentage columns (6 to 12) so both frames share the same columns
dt2 <- select(dt2, -c(6:12))

Combine Data-frames

result<-rbind(dt1, dt2)
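The same join can also be sketched with column names instead of column positions, which keeps working if the source files gain or reorder columns. This is a minimal illustration that assumes the dplyr package is installed; the two small tibbles are stand-ins built from the first rows printed above, and `shared_cols` and `combined` are hypothetical names that do not appear in the original analysis.

```r
library(dplyr)

# Stand-in data: first two rows of each Kaggle table, copied from the output above
general <- tibble(
  Keyword = c("machine learning", "analysis"),
  LinkedIn = c(5701, 5168), Indeed = c(3439, 3500),
  SimplyHired = c(2561, 2668), Monster = c(2340, 3306)
)
software <- tibble(
  Keyword = c("Python", "R"),
  LinkedIn = c(6347, 4553), Indeed = c(3818, 3106),
  SimplyHired = c(2888, 2393), Monster = c(2544, 2365),
  `LinkedIn %` = c("74%", "53%")  # extra column that must be dropped before stacking
)

# Columns common to both tables (hypothetical helper vector)
shared_cols <- c("Keyword", "LinkedIn", "Indeed", "SimplyHired", "Monster")

# bind_rows() stacks the tables; .id records which named input each row came from
combined <- bind_rows(
  Soft = select(general, all_of(shared_cols)),
  Tech = select(software, all_of(shared_cols)),
  .id = "Skill_Set"
)
print(combined)
```

Selecting by name makes the intent explicit, and the `.id` argument replaces the two separate `Skill_Set` assignments.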

Explore newly created data-frame

head(result)

## # A tibble: 6 x 6
##   Keyword          LinkedIn Indeed SimplyHired Monster Skill_Set
##   <chr>               <dbl>  <dbl>       <dbl>   <dbl> <chr>    
## 1 machine learning     5701   3439        2561    2340 Soft     
## 2 analysis             5168   3500        2668    3306 Soft     
## 3 statistics           4893   2992        2308    2399 Soft     
## 4 computer science     4517   2739        2093    1900 Soft     
## 5 communication        3404   2344        1791    2053 Soft     
## 6 mathematics          2605   1961        1497    1815 Soft

Rename variable

names(result)[names(result) == "ï..Keyword"] <- "Keyword"

Export to .csv

write.csv(result, file = "joined_df2.csv",row.names=FALSE)

Importing newly created csv

#ds_skills <- read_csv("joined_df2.csv")
ds_skills <- read_csv(file = "https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/joined_df2.csv")

## Parsed with column specification:
## cols(
##   Keyword = col_character(),
##   LinkedIn = col_number(),
##   Indeed = col_number(),
##   SimplyHired = col_number(),
##   Monster = col_number(),
##   Skill_Set = col_character()
## )

head(ds_skills)

## # A tibble: 6 x 6
##   Keyword          LinkedIn Indeed SimplyHired Monster Skill_Set
##   <chr>               <dbl>  <dbl>       <dbl>   <dbl> <chr>    
## 1 AI composite         1568   1125         811     687 Soft     
## 2 analysis             5168   3500        2668    3306 Soft     
## 3 communication        3404   2344        1791    2053 Soft     
## 4 computer science     4517   2739        2093    1900 Soft     
## 5 data engineering      514    339         276     200 Soft     
## 6 deep learning        1310    979         675     606 Soft

Data-frame cleaning and tidying

ds_skills$Total <- rowSums(sapply(ds_skills[,c(2:5)], as.numeric))
ds_skills <- mutate(ds_skills[, c(1,6,2:5,7)])
ds_skills <- data.frame(ds_skills)
colnames(ds_skills)

## [1] "Keyword"     "Skill_Set"   "LinkedIn"    "Indeed"      "SimplyHired"
## [6] "Monster"     "Total"

Creation of New Variable

The two combined datasets record how many job listings mention each keyword in a general search on four well-known job recruitment websites. As a team, we thought it more informative to convert these raw counts into percentages, dividing each count by the total number of data science listings for that site (the hard-coded totals below), and to compute an overall percentage across all four sites in the Total column.

LinkedIn <- 8610
Indeed <- 5138
SimplyHired <- 3829
Monster <- 3746
Total <- LinkedIn + Indeed + SimplyHired + Monster
ds_skills$LinkedIn <- ((ds_skills$LinkedIn)/(LinkedIn))*100
ds_skills$Indeed <- ((ds_skills$Indeed)/(Indeed))*100
ds_skills$SimplyHired <- ((ds_skills$SimplyHired)/(SimplyHired))*100
ds_skills$Monster <- ((ds_skills$Monster)/((Monster)))*100
ds_skills$Total <- ((ds_skills$Total/((Total))))*100
dim(ds_skills)

## [1] 52  7
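The percentage conversion above can be written more compactly by keeping the per-site denominators in one named vector. The following is a sketch in base R under that assumption; `site_totals`, `counts`, and `pct` are illustrative names, and the two rows of counts are copied from the software table printed earlier.

```r
# Per-site denominators, as hard-coded in the analysis above
site_totals <- c(LinkedIn = 8610, Indeed = 5138, SimplyHired = 3829, Monster = 3746)

# Stand-in data: raw keyword counts for two skills
counts <- data.frame(
  Keyword = c("Python", "R"),
  LinkedIn = c(6347, 4553), Indeed = c(3818, 3106),
  SimplyHired = c(2888, 2393), Monster = c(2544, 2365),
  stringsAsFactors = FALSE
)

sites <- names(site_totals)
pct <- counts

# Divide every site column by that site's total, then scale to a percentage
pct[sites] <- sweep(counts[sites], 2, site_totals, "/") * 100

# Pooled percentage: summed counts over summed totals across all four sites
pct$Total <- rowSums(counts[sites]) / sum(site_totals) * 100

print(pct)
```

Note that `Total` computed this way is a pooled percentage weighted by site size, not a simple mean of the four site percentages.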

head(ds_skills, 52)

##                 Keyword Skill_Set  LinkedIn    Indeed SimplyHired
## 1          AI composite      Soft 18.211382 21.895679   21.180465
## 2              analysis      Soft 60.023229 68.119891   69.678767
## 3         communication      Soft 39.535424 45.620864   46.774615
## 4      computer science      Soft 52.462253 53.308680   54.661792
## 5      data engineering      Soft  5.969803  6.597898    7.208148
## 6         deep learning      Soft 15.214866 19.054107   17.628624
## 7      machine learning      Soft 66.213705 66.932659   66.884304
## 8           mathematics      Soft 30.255517 38.166602   39.096370
## 9       neural networks      Soft  7.793264  9.439471   10.995038
## 10        NLP composite      Soft 14.076655 17.711172   17.236876
## 11   project management      Soft  5.528455  7.726742    8.618438
## 12 software development      Soft  8.501742 12.203192   12.562027
## 13 software engineering      Soft  4.796748  5.741534    6.529120
## 14           statistics      Soft 56.829268 58.232775   60.276835
## 15        visualization      Soft 21.823461 27.500973   30.112301
## 16                  AWS      Tech 10.998839 15.395095   15.852703
## 17                Azure      Tech  6.713124  8.096536    7.443197
## 18                    C      Tech  9.233449  9.575710   10.028728
## 19                   C#      Tech  3.763066  4.768392    4.753199
## 20                  C++      Tech 11.893148 14.889062   15.147558
## 21                Caffe      Tech  2.392567  2.899961    2.951162
## 22            Cassandra      Tech  2.740999  3.386532    3.813006
## 23                   D3      Tech  4.099884  2.899961    2.951162
## 24               Docker      Tech  3.368177  4.671078    3.865239
## 25                Excel      Tech  8.141696 11.074348   11.439018
## 26                  Git      Tech  3.275261  5.079798    4.857665
## 27               Hadoop      Tech 24.878049 30.712339   30.399582
## 28                Hbase      Tech  3.507549  4.262359    4.361452
## 29                 Hive      Tech 13.728223 16.154146   16.636197
## 30                 Java      Tech 22.578397 26.800311   27.657352
## 31           Javascript      Tech  3.809524  4.768392    5.588927
## 32                Keras      Tech  3.821138  4.924095    5.353878
## 33                Linux      Tech  6.980256 10.062281    9.506399
## 34               Matlab      Tech  9.361208 13.176333   14.207365
## 35              MongoDB      Tech  2.915215  3.814714    4.309219
## 36                MySQL      Tech  3.228804  4.534838    4.883782
## 37                NoSQL      Tech  6.945412  8.485792   10.107078
## 38                Numpy      Tech  4.494774  5.001946    6.059023
## 39               Pandas      Tech  4.889663  6.422733    7.364847
## 40                 Perl      Tech  3.588850  5.021409    5.275529
## 41                  Pig      Tech  4.262485  5.760996    6.032907
## 42               Python      Tech 73.716609 74.309070   75.424393
## 43              PyTorch      Tech  2.485482  2.783184    3.421259
## 44                    R      Tech 52.880372 60.451538   62.496735
## 45                  SAS      Tech 19.895470 22.070845   23.765996
## 46                Scala      Tech 12.078978 14.383028   15.382606
## 47         Scikit-learn      Tech  5.505226  7.824056    7.678245
## 48                Spark      Tech 25.191638 30.186843   30.477932
## 49                 SPSS      Tech  5.249710  6.422733    7.129799
## 50                  SQL      Tech 45.052265 51.148307   53.695482
## 51              Tableau      Tech 14.123113 19.696380   20.370854
## 52           TensorFlow      Tech  9.802555 12.864928   13.084356
##      Monster     Total
## 1  18.339562 19.654833
## 2  88.254138 68.667636
## 3  54.805125 44.984289
## 4  50.720769 52.755241
## 5   5.339028  6.232706
## 6  16.177256 16.742485
## 7  62.466631 65.849083
## 8  48.451682 36.946021
## 9   8.142018  8.826150
## 10 15.536572 15.776392
## 11  9.289909  7.273836
## 12 20.928991 12.305961
## 13 13.667912  6.893964
## 14 64.041644 59.053604
## 15 32.221036 26.506589
## 16 12.466631 13.187638
## 17  7.261078  7.273836
## 18 13.961559 10.289359
## 19  5.846236  4.549078
## 20 11.719167 13.168879
## 21  2.562734  2.645031
## 22  3.630539  3.245322
## 23  2.536038  3.329738
## 24  5.178857  4.089481
## 25 10.597971  9.871969
## 26  3.870796  4.098860
## 27 32.034170 28.532570
## 28  3.683930  3.873751
## 29 16.524293 15.326174
## 30 26.748532 25.240351
## 31  5.979712  4.741359
## 32  3.497064  4.305210
## 33  8.088628  8.371242
## 34 11.185264 11.471181
## 35  3.096636  3.414154
## 36  3.230112  3.840923
## 37  9.663641  8.361863
## 38  4.057662  4.821085
## 39  4.671650  5.665244
## 40  5.285638  4.535009
## 41  6.833956  5.393237
## 42 67.912440 73.146368
## 43  2.616124  2.748206
## 44 63.134010 58.232894
## 45 26.107848 22.206069
## 46 13.881474 13.544060
## 47  5.659370  6.481264
## 48 28.350240 27.899451
## 49  5.392419  5.895043
## 50 49.145755 48.792384
## 51 19.861185 17.596023
## 52 10.277629 11.213244

Top 10 Soft Skills for Data Scientists

Data transformations for visualization

ds_skills_soft <- filter(ds_skills, ds_skills$Skill_Set == "Soft")
dim(ds_skills_soft)

## [1] 15  7

head(ds_skills_soft, 20)

##                 Keyword Skill_Set  LinkedIn    Indeed SimplyHired
## 1          AI composite      Soft 18.211382 21.895679   21.180465
## 2              analysis      Soft 60.023229 68.119891   69.678767
## 3         communication      Soft 39.535424 45.620864   46.774615
## 4      computer science      Soft 52.462253 53.308680   54.661792
## 5      data engineering      Soft  5.969803  6.597898    7.208148
## 6         deep learning      Soft 15.214866 19.054107   17.628624
## 7      machine learning      Soft 66.213705 66.932659   66.884304
## 8           mathematics      Soft 30.255517 38.166602   39.096370
## 9       neural networks      Soft  7.793264  9.439471   10.995038
## 10        NLP composite      Soft 14.076655 17.711172   17.236876
## 11   project management      Soft  5.528455  7.726742    8.618438
## 12 software development      Soft  8.501742 12.203192   12.562027
## 13 software engineering      Soft  4.796748  5.741534    6.529120
## 14           statistics      Soft 56.829268 58.232775   60.276835
## 15        visualization      Soft 21.823461 27.500973   30.112301
##      Monster     Total
## 1  18.339562 19.654833
## 2  88.254138 68.667636
## 3  54.805125 44.984289
## 4  50.720769 52.755241
## 5   5.339028  6.232706
## 6  16.177256 16.742485
## 7  62.466631 65.849083
## 8  48.451682 36.946021
## 9   8.142018  8.826150
## 10 15.536572 15.776392
## 11  9.289909  7.273836
## 12 20.928991 12.305961
## 13 13.667912  6.893964
## 14 64.041644 59.053604
## 15 32.221036 26.506589

ds_skills_soft <- arrange(ds_skills_soft, desc(Total))
head(ds_skills_soft, 2)

##            Keyword Skill_Set LinkedIn   Indeed SimplyHired  Monster
## 1         analysis      Soft 60.02323 68.11989    69.67877 88.25414
## 2 machine learning      Soft 66.21370 66.93266    66.88430 62.46663
##      Total
## 1 68.66764
## 2 65.84908

#Top 10 soft skills
ds_skills_soft <- ds_skills_soft[1:10,c(1,3:7)]

Bar Graphs

Top ten Soft skills as a percentage of total Data Science jobs

library(plotly)  # provides plot_ly()
p <- plot_ly(ds_skills_soft, x = ~Keyword, y = ~Total, type = 'bar', name = ~Keyword)


Top 10 Soft Skills as a percentage of total data science jobs by job site

top_10_soft_visual <- data.frame("Keyword"=ds_skills_soft, ds_skills_soft)
head(top_10_soft_visual, 2)

##    Keyword.Keyword Keyword.LinkedIn Keyword.Indeed Keyword.SimplyHired
## 1         analysis         60.02323       68.11989            69.67877
## 2 machine learning         66.21370       66.93266            66.88430
##   Keyword.Monster Keyword.Total          Keyword LinkedIn   Indeed
## 1        88.25414      68.66764         analysis 60.02323 68.11989
## 2        62.46663      65.84908 machine learning 66.21370 66.93266
##   SimplyHired  Monster    Total
## 1    69.67877 88.25414 68.66764
## 2    66.88430 62.46663 65.84908

data_soft <- top_10_soft_visual[,c('Keyword', 'LinkedIn', 'Indeed', 'SimplyHired', 'Monster', 'Total')]

Top 10 Non-Technical Skills for Data Scientists

Data transformations for visualization

ds_skills_soft <- filter(ds_skills, ds_skills$Skill_Set == "Soft")
dim(ds_skills_soft)

## [1] 15  7

# View first 20 non-technical skills
head(ds_skills_soft, 20)

##                 Keyword Skill_Set  LinkedIn    Indeed SimplyHired
## 1          AI composite      Soft 18.211382 21.895679   21.180465
## 2              analysis      Soft 60.023229 68.119891   69.678767
## 3         communication      Soft 39.535424 45.620864   46.774615
## 4      computer science      Soft 52.462253 53.308680   54.661792
## 5      data engineering      Soft  5.969803  6.597898    7.208148
## 6         deep learning      Soft 15.214866 19.054107   17.628624
## 7      machine learning      Soft 66.213705 66.932659   66.884304
## 8           mathematics      Soft 30.255517 38.166602   39.096370
## 9       neural networks      Soft  7.793264  9.439471   10.995038
## 10        NLP composite      Soft 14.076655 17.711172   17.236876
## 11   project management      Soft  5.528455  7.726742    8.618438
## 12 software development      Soft  8.501742 12.203192   12.562027
## 13 software engineering      Soft  4.796748  5.741534    6.529120
## 14           statistics      Soft 56.829268 58.232775   60.276835
## 15        visualization      Soft 21.823461 27.500973   30.112301
##      Monster     Total
## 1  18.339562 19.654833
## 2  88.254138 68.667636
## 3  54.805125 44.984289
## 4  50.720769 52.755241
## 5   5.339028  6.232706
## 6  16.177256 16.742485
## 7  62.466631 65.849083
## 8  48.451682 36.946021
## 9   8.142018  8.826150
## 10 15.536572 15.776392
## 11  9.289909  7.273836
## 12 20.928991 12.305961
## 13 13.667912  6.893964
## 14 64.041644 59.053604
## 15 32.221036 26.506589

#Arrange data from largest to smallest by the Totals column
ds_skills_soft <- arrange(ds_skills_soft, desc(Total))
#Filter out only the Top 10 non-technical skills
top_10_soft <- ds_skills_soft[1:10,c(1,3:7)]
top_10_soft

##             Keyword LinkedIn   Indeed SimplyHired  Monster    Total
## 1          analysis 60.02323 68.11989    69.67877 88.25414 68.66764
## 2  machine learning 66.21370 66.93266    66.88430 62.46663 65.84908
## 3        statistics 56.82927 58.23278    60.27683 64.04164 59.05360
## 4  computer science 52.46225 53.30868    54.66179 50.72077 52.75524
## 5     communication 39.53542 45.62086    46.77461 54.80513 44.98429
## 6       mathematics 30.25552 38.16660    39.09637 48.45168 36.94602
## 7     visualization 21.82346 27.50097    30.11230 32.22104 26.50659
## 8      AI composite 18.21138 21.89568    21.18046 18.33956 19.65483
## 9     deep learning 15.21487 19.05411    17.62862 16.17726 16.74248
## 10    NLP composite 14.07666 17.71117    17.23688 15.53657 15.77639
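The filter, sort, and subset steps above are repeated for each skill category, so they could be wrapped in one small helper. This is a sketch only, assuming dplyr 1.0 or later for `slice_max()`; `top_skills()` is a hypothetical function, and `demo` holds a few Total values copied from the tables in this report.

```r
library(dplyr)

# Hypothetical helper: the n highest-Total skills within one Skill_Set category
top_skills <- function(skills, skill_set, n = 10) {
  skills %>%
    filter(Skill_Set == skill_set) %>%
    slice_max(Total, n = n, with_ties = FALSE) %>%  # sorted largest first
    select(-Skill_Set)
}

# Stand-in data: a few rows with Total percentages taken from this report
demo <- tibble(
  Keyword = c("statistics", "analysis", "machine learning", "Python", "R"),
  Skill_Set = c("Soft", "Soft", "Soft", "Tech", "Tech"),
  Total = c(59.05360, 68.66764, 65.84908, 73.14637, 58.23289)
)

top_soft <- top_skills(demo, "Soft", n = 2)
top_tech <- top_skills(demo, "Tech", n = 1)
print(top_soft)
```

Calling the helper once per category avoids overwriting intermediate data frames such as `ds_skills_soft`.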

Top 10 Technical Skills for Data Scientists

Data transformations for visualization

ds_skills_tech <- filter(ds_skills, ds_skills$Skill_Set == "Tech")
dim(ds_skills_tech)

## [1] 37  7

# View all 37 technical skills
head(ds_skills_tech, 37)

##         Keyword Skill_Set  LinkedIn    Indeed SimplyHired   Monster
## 1           AWS      Tech 10.998839 15.395095   15.852703 12.466631
## 2         Azure      Tech  6.713124  8.096536    7.443197  7.261078
## 3             C      Tech  9.233449  9.575710   10.028728 13.961559
## 4            C#      Tech  3.763066  4.768392    4.753199  5.846236
## 5           C++      Tech 11.893148 14.889062   15.147558 11.719167
## 6         Caffe      Tech  2.392567  2.899961    2.951162  2.562734
## 7     Cassandra      Tech  2.740999  3.386532    3.813006  3.630539
## 8            D3      Tech  4.099884  2.899961    2.951162  2.536038
## 9        Docker      Tech  3.368177  4.671078    3.865239  5.178857
## 10        Excel      Tech  8.141696 11.074348   11.439018 10.597971
## 11          Git      Tech  3.275261  5.079798    4.857665  3.870796
## 12       Hadoop      Tech 24.878049 30.712339   30.399582 32.034170
## 13        Hbase      Tech  3.507549  4.262359    4.361452  3.683930
## 14         Hive      Tech 13.728223 16.154146   16.636197 16.524293
## 15         Java      Tech 22.578397 26.800311   27.657352 26.748532
## 16   Javascript      Tech  3.809524  4.768392    5.588927  5.979712
## 17        Keras      Tech  3.821138  4.924095    5.353878  3.497064
## 18        Linux      Tech  6.980256 10.062281    9.506399  8.088628
## 19       Matlab      Tech  9.361208 13.176333   14.207365 11.185264
## 20      MongoDB      Tech  2.915215  3.814714    4.309219  3.096636
## 21        MySQL      Tech  3.228804  4.534838    4.883782  3.230112
## 22        NoSQL      Tech  6.945412  8.485792   10.107078  9.663641
## 23        Numpy      Tech  4.494774  5.001946    6.059023  4.057662
## 24       Pandas      Tech  4.889663  6.422733    7.364847  4.671650
## 25         Perl      Tech  3.588850  5.021409    5.275529  5.285638
## 26          Pig      Tech  4.262485  5.760996    6.032907  6.833956
## 27       Python      Tech 73.716609 74.309070   75.424393 67.912440
## 28      PyTorch      Tech  2.485482  2.783184    3.421259  2.616124
## 29            R      Tech 52.880372 60.451538   62.496735 63.134010
## 30          SAS      Tech 19.895470 22.070845   23.765996 26.107848
## 31        Scala      Tech 12.078978 14.383028   15.382606 13.881474
## 32 Scikit-learn      Tech  5.505226  7.824056    7.678245  5.659370
## 33        Spark      Tech 25.191638 30.186843   30.477932 28.350240
## 34         SPSS      Tech  5.249710  6.422733    7.129799  5.392419
## 35          SQL      Tech 45.052265 51.148307   53.695482 49.145755
## 36      Tableau      Tech 14.123113 19.696380   20.370854 19.861185
## 37   TensorFlow      Tech  9.802555 12.864928   13.084356 10.277629
##        Total
## 1  13.187638
## 2   7.273836
## 3  10.289359
## 4   4.549078
## 5  13.168879
## 6   2.645031
## 7   3.245322
## 8   3.329738
## 9   4.089481
## 10  9.871969
## 11  4.098860
## 12 28.532570
## 13  3.873751
## 14 15.326174
## 15 25.240351
## 16  4.741359
## 17  4.305210
## 18  8.371242
## 19 11.471181
## 20  3.414154
## 21  3.840923
## 22  8.361863
## 23  4.821085
## 24  5.665244
## 25  4.535009
## 26  5.393237
## 27 73.146368
## 28  2.748206
## 29 58.232894
## 30 22.206069
## 31 13.544060
## 32  6.481264
## 33 27.899451
## 34  5.895043
## 35 48.792384
## 36 17.596023
## 37 11.213244

#Arrange data from largest to smallest by the Totals column
ds_skills_tech <- arrange(ds_skills_tech, desc(Total))
#Filter out only the Top 10 technical skills
top_10_tech <- ds_skills_tech[1:10,c(1,3:7)]
top_10_tech

##    Keyword LinkedIn   Indeed SimplyHired  Monster    Total
## 1   Python 73.71661 74.30907    75.42439 67.91244 73.14637
## 2        R 52.88037 60.45154    62.49674 63.13401 58.23289
## 3      SQL 45.05226 51.14831    53.69548 49.14576 48.79238
## 4   Hadoop 24.87805 30.71234    30.39958 32.03417 28.53257
## 5    Spark 25.19164 30.18684    30.47793 28.35024 27.89945
## 6     Java 22.57840 26.80031    27.65735 26.74853 25.24035
## 7      SAS 19.89547 22.07084    23.76600 26.10785 22.20607
## 8  Tableau 14.12311 19.69638    20.37085 19.86119 17.59602
## 9     Hive 13.72822 16.15415    16.63620 16.52429 15.32617
## 10   Scala 12.07898 14.38303    15.38261 13.88147 13.54406

Top ten Technical skills as a percentage of total Data Science jobs

x <- plot_ly(top_10_tech, x = ~Keyword, y = ~Total, type = 'bar', name = ~Keyword)
x

Top ten Non-Technical skills as a percentage of total Data Science jobs

y <- plot_ly(top_10_soft, x = ~Keyword, y = ~Total, type = 'bar', name = ~Keyword)
y

Top 10 Technical Skills as a percentage of total data science jobs by job site

top_10_tech_visual <- data.frame("Keyword"=top_10_tech, top_10_tech)
head(top_10_tech_visual, 2)

##   Keyword.Keyword Keyword.LinkedIn Keyword.Indeed Keyword.SimplyHired
## 1          Python         73.71661       74.30907            75.42439
## 2               R         52.88037       60.45154            62.49674
##   Keyword.Monster Keyword.Total Keyword LinkedIn   Indeed SimplyHired
## 1        67.91244      73.14637  Python 73.71661 74.30907    75.42439
## 2        63.13401      58.23289       R 52.88037 60.45154    62.49674
##    Monster    Total
## 1 67.91244 73.14637
## 2 63.13401 58.23289

data <- top_10_tech_visual[,c('Keyword', 'LinkedIn', 'Indeed', 'SimplyHired', 'Monster', 'Total')]
head(data, 2)

##   Keyword LinkedIn   Indeed SimplyHired  Monster    Total
## 1  Python 73.71661 74.30907    75.42439 67.91244 73.14637
## 2       R 52.88037 60.45154    62.49674 63.13401 58.23289

s <- plot_ly(data, x = ~Keyword, y = ~LinkedIn, type = 'bar', name = "LinkedIn") %>%
add_trace(y = ~Indeed, name = "Indeed") %>% 
add_trace(y = ~SimplyHired, name = "SimplyHired") %>% 
add_trace(y = ~Monster, name = "Monster") %>% 
add_trace(y = ~Total, name = "Average") %>% 
layout(title = 'Technical Skills by Job Sites', yaxis = list(title = '% of Data Science Jobs'), xaxis = list(title = 'Tech Skills'), barmode = 'group')
s
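The grouped chart above adds one trace per job site by hand. An alternative, sketched here under the assumption that the tidyr package is available, is to reshape the table to long format so that the site becomes a single column; `wide` and `long` are illustrative names, and the values are the first two rows of the table above.

```r
library(tidyr)

# Stand-in data: site percentages for two skills, in the wide layout used above
wide <- data.frame(
  Keyword = c("Python", "R"),
  LinkedIn = c(73.71661, 52.88037),
  Indeed = c(74.30907, 60.45154),
  SimplyHired = c(75.42439, 62.49674),
  Monster = c(67.91244, 63.13401),
  stringsAsFactors = FALSE
)

# One row per (Keyword, Site) pair instead of one column per site
long <- pivot_longer(wide, cols = -Keyword, names_to = "Site", values_to = "Percent")
print(long)
```

With the data in this shape, a single plotting call that maps `Site` to color should be enough to draw the grouped bars, instead of one `add_trace()` per site.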

Top 10 Non-Technical Skills as a percentage of total data science jobs by job site

top_10_soft_visual <- data.frame("Keyword"=top_10_soft, top_10_soft)
head(top_10_soft_visual, 2)

##    Keyword.Keyword Keyword.LinkedIn Keyword.Indeed Keyword.SimplyHired
## 1         analysis         60.02323       68.11989            69.67877
## 2 machine learning         66.21370       66.93266            66.88430
##   Keyword.Monster Keyword.Total          Keyword LinkedIn   Indeed
## 1        88.25414      68.66764         analysis 60.02323 68.11989
## 2        62.46663      65.84908 machine learning 66.21370 66.93266
##   SimplyHired  Monster    Total
## 1    69.67877 88.25414 68.66764
## 2    66.88430 62.46663 65.84908

data2 <- top_10_soft_visual[,c('Keyword', 'LinkedIn', 'Indeed', 'SimplyHired', 'Monster', 'Total')]
head(data2, 2)

##            Keyword LinkedIn   Indeed SimplyHired  Monster    Total
## 1         analysis 60.02323 68.11989    69.67877 88.25414 68.66764
## 2 machine learning 66.21370 66.93266    66.88430 62.46663 65.84908

t <- plot_ly(data2, x = ~Keyword, y = ~LinkedIn, type = 'bar', name = "LinkedIn") %>%
add_trace(y = ~Indeed, name = "Indeed") %>% 
add_trace(y = ~SimplyHired, name = "SimplyHired") %>% 
add_trace(y = ~Monster, name = "Monster") %>% 
add_trace(y = ~Total, name = "Average") %>% 
layout(title = 'Non-Technical Skills by Job Sites', yaxis = list(title = '% of Data Science Jobs'), xaxis = list(title = 'Non-Tech Skills'), barmode = 'group')
t

Top two “Technology” Skills for Data Scientists

head(ds_skills_tech, 2)

##   Keyword Skill_Set LinkedIn   Indeed SimplyHired  Monster    Total
## 1  Python      Tech 73.71661 74.30907    75.42439 67.91244 73.14637
## 2       R      Tech 52.88037 60.45154    62.49674 63.13401 58.23289

Top two “Non-Technology” Skills for Data Scientists

head(ds_skills_soft, 2)

##            Keyword Skill_Set LinkedIn   Indeed SimplyHired  Monster
## 1         analysis      Soft 60.02323 68.11989    69.67877 88.25414
## 2 machine learning      Soft 66.21370 66.93266    66.88430 62.46663
##      Total
## 1 68.66764
## 2 65.84908

Conclusion

There are many skills and skill sets that employers look for when hiring a data scientist. Given that Data Science is a multidisciplinary field, it is only fitting that companies look for both Technology-based skills and Soft skills in a data scientist.

The top Technology skill that employers are looking for when hiring a data scientist, according to this dataset, is Python, which appears in about 73% of listings across the four sites.

The top Soft skill that employers are looking for when hiring a data scientist, according to this dataset, is Analysis, which appears in about 69% of listings across the four sites.

The top Overall skill that employers are looking for when hiring a data scientist, according to this dataset, is Python.
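These three conclusions can be read directly off the Total column. The sketch below uses base R only; `skills`, `top_by_set`, and `top_overall` are hypothetical names, and the four rows are Total values copied from the tables in this report.

```r
# Stand-in data: the two leading skills in each category, with pooled percentages
skills <- data.frame(
  Keyword = c("analysis", "machine learning", "Python", "R"),
  Skill_Set = c("Soft", "Soft", "Tech", "Tech"),
  Total = c(68.66764, 65.84908, 73.14637, 58.23289),
  stringsAsFactors = FALSE
)

# Highest-Total keyword within each skill category
top_by_set <- sapply(
  split(skills, skills$Skill_Set),
  function(d) d$Keyword[which.max(d$Total)]
)

# Highest-Total keyword across both categories
top_overall <- skills$Keyword[which.max(skills$Total)]

print(top_by_set)
print(top_overall)
```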

Data 607 Project 3

Joseph Simone & Kigamba Samuel

10/20/2019
