During early research, we found this study:
https://365datascience.com/research-into-1001-data-scientist-profiles/
imgage <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/journal.png"
knitr::include_graphics(imgage)
10/22/2019
As a team, we saw that the study reported findings similar to the criteria we needed to calculate in order to complete this project.
However, we still needed a pool of data from which to start calculating the parameters that would answer our research question(s).
With some further research, we found a similar project that set out to answer a similar question:
https://www.kaggle.com/discdiver/the-most-in-demand-skills-for-data-scientists
imgage <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/kaggle.png"
knitr::include_graphics(imgage)
imgage <- "C:/Users/jpsim/Documents/DATA Acquisition and Management/kaggle_data.png"
knitr::include_graphics(imgage)
library(readr)  # provides read_csv()
general_skills <- read_csv(file = "https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/kaggle_ds_general_skills_revised.csv")
## Parsed with column specification:
## cols(
##   Keyword = col_character(),
##   LinkedIn = col_number(),
##   Indeed = col_number(),
##   SimplyHired = col_number(),
##   Monster = col_number()
## )
head(general_skills)
## # A tibble: 6 x 5
##   Keyword          LinkedIn Indeed SimplyHired Monster
##   <chr>               <dbl>  <dbl>       <dbl>   <dbl>
## 1 machine learning     5701   3439        2561    2340
## 2 analysis             5168   3500        2668    3306
## 3 statistics           4893   2992        2308    2399
## 4 computer science     4517   2739        2093    1900
## 5 communication        3404   2344        1791    2053
## 6 mathematics          2605   1961        1497    1815
software_skills <- read_csv(file = "https://raw.githubusercontent.com/josephsimone/DATA_607_Project_3/master/kaggle_ds_job_listing_software.csv")
## Parsed with column specification:
## cols(
##   Keyword = col_character(),
##   LinkedIn = col_number(),
##   Indeed = col_number(),
##   SimplyHired = col_number(),
##   Monster = col_number(),
##   `LinkedIn %` = col_character(),
##   `Indeed %` = col_character(),
##   `SimplyHired %` = col_character(),
##   `Monster %` = col_character(),
##   `Avg %` = col_character(),
##   `GlassDoor Self Reported % 2017` = col_character(),
##   Difference = col_character()
## )
head(software_skills)
## # A tibble: 6 x 12
##   Keyword LinkedIn Indeed SimplyHired Monster `LinkedIn %` `Indeed %`
##   <chr>      <dbl>  <dbl>       <dbl>   <dbl> <chr>        <chr>
## 1 Python      6347   3818        2888    2544 74%          74%
## 2 R           4553   3106        2393    2365 53%          60%
## 3 SQL         3879   2628        2056    1841 45%          51%
## 4 Spark       2169   1551        1167    1062 25%          30%
## 5 Hadoop      2142   1578        1164    1200 25%          31%
## 6 Java        1944   1377        1059    1002 23%          27%
## # ... with 5 more variables: `SimplyHired %` <chr>, `Monster %` <chr>,
## #   `Avg %` <chr>, `GlassDoor Self Reported % 2017` <chr>,
## #   Difference <chr>
What are the most important technology-based skill and soft skill, respectively, when hiring a data scientist?
Overall, what is the most sought-after skill for a data scientist?
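library(dplyr)  # as_tibble(), select(), filter(), arrange()
# Label each table by skill type before combining
dt1 <- as_tibble(general_skills)
dt1$Skill_Set <- "Soft"
dt2 <- as_tibble(software_skills)
dt2$Skill_Set <- "Tech"
# Drop columns 6 to 12 (percentages and comparisons) so both tables share the same columns
dt2 <- select(dt2, -c(6:12))
result <- rbind(dt1, dt2)
# Guard against a mangled first header; a no-op when the column is already "Keyword"
names(result)[names(result) == "ï..Keyword"] <- "Keyword"
head(result)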
## # A tibble: 6 x 6
##   Keyword          LinkedIn Indeed SimplyHired Monster Skill_Set
##   <chr>               <dbl>  <dbl>       <dbl>   <dbl> <chr>
## 1 machine learning     5701   3439        2561    2340 Soft
## 2 analysis             5168   3500        2668    3306 Soft
## 3 statistics           4893   2992        2308    2399 Soft
## 4 computer science     4517   2739        2093    1900 Soft
## 5 communication        3404   2344        1791    2053 Soft
## 6 mathematics          2605   1961        1497    1815 Soft
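ds_skills <- as.data.frame(result)
# Total mentions of each keyword across the four job sites
ds_skills$Total <- rowSums(sapply(ds_skills[, c(2:5)], as.numeric))
# Reorder columns so that Skill_Set follows Keyword
ds_skills <- ds_skills[, c(1, 6, 2:5, 7)]
ds_skills <- data.frame(ds_skills)
colnames(ds_skills)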
## [1] "Keyword"     "Skill_Set"   "LinkedIn"    "Indeed"      "SimplyHired"
## [6] "Monster"     "Total"
head(ds_skills)
##            Keyword Skill_Set LinkedIn Indeed SimplyHired Monster Total
## 1 machine learning      Soft     5701   3439        2561    2340 14041
## 2         analysis      Soft     5168   3500        2668    3306 14642
## 3       statistics      Soft     4893   2992        2308    2399 12592
## 4 computer science      Soft     4517   2739        2093    1900 11249
## 5    communication      Soft     3404   2344        1791    2053  9592
## 6      mathematics      Soft     2605   1961        1497    1815  7878
These two combined datasets record how many times each keyword appeared in a general search on well-known job recruitment websites. As a team, we decided it would be more informative to scale each count by a per-site total, and each row total by the overall total, than to compare raw counts, since the sites differ considerably in size.
# Per-site denominators, and their sum for the Total column
LinkedIn <- 8610
Indeed <- 5138
SimplyHired <- 3829
Monster <- 3746
Total <- LinkedIn + Indeed + SimplyHired + Monster
# Note: dividing by (denominator * 100) gives the proportion divided by 100,
# so a value of 0.0066 corresponds to a proportion of 0.66
ds_skills$LinkedIn <- ds_skills$LinkedIn / (LinkedIn * 100)
ds_skills$Indeed <- ds_skills$Indeed / (Indeed * 100)
ds_skills$SimplyHired <- ds_skills$SimplyHired / (SimplyHired * 100)
ds_skills$Monster <- ds_skills$Monster / (Monster * 100)
ds_skills$Total <- ds_skills$Total / (Total * 100)
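As a quick check of this scaling, the sketch below works one value by hand, using the LinkedIn count for "machine learning" (5701) and the LinkedIn denominator (8610) that appear in this document. The names `count`, `site_total`, `scaled`, and `percent` are illustrative and not part of the original analysis. It suggests that the stored values are proportions divided by 100, so 0.0066 would correspond to roughly 66% of the denominator rather than 0.66%.

```r
# Worked example of the scaling for a single cell (illustrative names)
count <- 5701        # LinkedIn mentions of "machine learning"
site_total <- 8610   # LinkedIn denominator used in the analysis

# Value as computed in the analysis: count / (total * 100)
scaled <- count / (site_total * 100)

# The same count expressed as a conventional percentage
percent <- count / site_total * 100

round(scaled, 7)   # 0.0066214
round(percent, 1)  # 66.2
```

If percentages are the goal, multiplying by 100 instead of dividing by it would give values that can be read directly; the ranking of skills is the same either way.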
Set up for Bar Graphs
ds_skills_soft <- filter(ds_skills, Skill_Set == "Soft")
dim(ds_skills_soft)
## [1] 15 7
head(ds_skills_soft, 20)
##                 Keyword Skill_Set     LinkedIn       Indeed  SimplyHired
## 1      machine learning      Soft 0.0066213705 0.0066932659 0.0066884304
## 2              analysis      Soft 0.0060023229 0.0068119891 0.0069678767
## 3            statistics      Soft 0.0056829268 0.0058232775 0.0060276835
## 4      computer science      Soft 0.0052462253 0.0053308680 0.0054661792
## 5         communication      Soft 0.0039535424 0.0045620864 0.0046774615
## 6           mathematics      Soft 0.0030255517 0.0038166602 0.0039096370
## 7         visualization      Soft 0.0021823461 0.0027500973 0.0030112301
## 8          AI composite      Soft 0.0018211382 0.0021895679 0.0021180465
## 9         deep learning      Soft 0.0015214866 0.0019054107 0.0017628624
## 10        NLP composite      Soft 0.0014076655 0.0017711172 0.0017236876
## 11 software development      Soft 0.0008501742 0.0012203192 0.0012562027
## 12      neural networks      Soft 0.0007793264 0.0009439471 0.0010995038
## 13     data engineering      Soft 0.0005969803 0.0006597898 0.0007208148
## 14   project management      Soft 0.0005528455 0.0007726742 0.0008618438
## 15 software engineering      Soft 0.0004796748 0.0005741534 0.0006529120
##         Monster        Total
## 1  0.0062466631 0.0065849083
## 2  0.0088254138 0.0068667636
## 3  0.0064041644 0.0059053604
## 4  0.0050720769 0.0052755241
## 5  0.0054805125 0.0044984289
## 6  0.0048451682 0.0036946021
## 7  0.0032221036 0.0026506589
## 8  0.0018339562 0.0019654833
## 9  0.0016177256 0.0016742485
## 10 0.0015536572 0.0015776392
## 11 0.0020928991 0.0012305961
## 12 0.0008142018 0.0008826150
## 13 0.0005339028 0.0006232706
## 14 0.0009289909 0.0007273836
## 15 0.0013667912 0.0006893964
# Sort by Total, largest first, then keep the top 10 rows without Skill_Set
ds_skills_soft <- arrange(ds_skills_soft, desc(Total))
top_10_soft <- ds_skills_soft[1:10, c(1, 3:7)]
top_10_soft
##             Keyword    LinkedIn      Indeed SimplyHired     Monster
## 1          analysis 0.006002323 0.006811989 0.006967877 0.008825414
## 2  machine learning 0.006621370 0.006693266 0.006688430 0.006246663
## 3        statistics 0.005682927 0.005823278 0.006027683 0.006404164
## 4  computer science 0.005246225 0.005330868 0.005466179 0.005072077
## 5     communication 0.003953542 0.004562086 0.004677461 0.005480513
## 6       mathematics 0.003025552 0.003816660 0.003909637 0.004845168
## 7     visualization 0.002182346 0.002750097 0.003011230 0.003222104
## 8      AI composite 0.001821138 0.002189568 0.002118046 0.001833956
## 9     deep learning 0.001521487 0.001905411 0.001762862 0.001617726
## 10    NLP composite 0.001407666 0.001771117 0.001723688 0.001553657
##          Total
## 1  0.006866764
## 2  0.006584908
## 3  0.005905360
## 4  0.005275524
## 5  0.004498429
## 6  0.003694602
## 7  0.002650659
## 8  0.001965483
## 9  0.001674248
## 10 0.001577639
Top Ten Non-Technical skills as a percentage of total Data Science jobs
p
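The code that builds this plot is not shown in the report. The sketch below is one way such a bar chart could be produced with ggplot2; it is not the authors' code. The names `top_soft` and `soft_plot`, the axis labels, and the choice of a horizontal layout are all assumptions, and the data are the first five Total values from the top-10 table above.

```r
library(ggplot2)

# First five soft skills and their Total values, copied from the table above
top_soft <- data.frame(
  Keyword = c("analysis", "machine learning", "statistics",
              "computer science", "communication"),
  Total = c(0.006866764, 0.006584908, 0.005905360, 0.005275524, 0.004498429)
)

# Horizontal bar chart with bars ordered by Total (hypothetical styling)
soft_plot <- ggplot(top_soft, aes(x = reorder(Keyword, Total), y = Total)) +
  geom_col() +
  coord_flip() +
  labs(x = "Skill", y = "Scaled share of listings",
       title = "Top non-technical skills")

soft_plot
```

Ordering the bars with `reorder()` keeps the largest value at the top after `coord_flip()`, which tends to make ranked comparisons easier to read.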
Top 5 Technical Skills for Data Scientists
Data transformations for visualization
ds_skills_tech <- filter(ds_skills, Skill_Set == "Tech")
# Arrange data from largest to smallest by the Total column
ds_skills_tech <- arrange(ds_skills_tech, desc(Total))
# Keep only the top 10 technical skills (dropping Skill_Set)
top_10_tech <- ds_skills_tech[1:10, c(1, 3:7)]
Top Ten Technical skills as a percentage of total Data Science jobs
x
Top 10 Tech Skills as an average percentage of total data science jobs by job site
top_10_tech_visual <- data.frame("Keyword"=top_10_tech, top_10_tech)
data <- top_10_tech_visual[,c('Keyword', 'LinkedIn', 'Indeed', 'SimplyHired', 'Monster', 'Total')]
s
Top 10 Non-Tech Skills as an average percentage of total data science jobs by job site
top_10_soft_visual <- data.frame("Keyword"=top_10_soft, top_10_soft)
data2 <- top_10_soft_visual[,c('Keyword', 'LinkedIn', 'Indeed', 'SimplyHired', 'Monster', 'Total')]
t
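The plotting code for the per-site charts is also not shown. A common approach, sketched here under assumptions, is to reshape the site columns into long form and draw grouped bars. The names `site_data`, `long`, `site_plot`, `Site`, and `Share` are hypothetical, and the two rows are the Python and R values printed elsewhere in this document.

```r
library(tidyr)
library(ggplot2)

# Two technical skills with their per-site values (from this document's output)
site_data <- data.frame(
  Keyword = c("Python", "R"),
  LinkedIn = c(0.007371661, 0.005288037),
  Indeed = c(0.007430907, 0.006045154),
  SimplyHired = c(0.007542439, 0.006249674),
  Monster = c(0.006791244, 0.006313401)
)

# Reshape from one column per site to one row per (Keyword, Site) pair
long <- pivot_longer(
  site_data,
  cols = c(LinkedIn, Indeed, SimplyHired, Monster),
  names_to = "Site",
  values_to = "Share"
)

# Grouped bars: one cluster per skill, one bar per job site
site_plot <- ggplot(long, aes(x = Keyword, y = Share, fill = Site)) +
  geom_col(position = "dodge") +
  labs(x = "Skill", y = "Scaled share of listings", fill = "Job site")

site_plot
```

The long format is what lets `fill = Site` split each skill into one bar per job site.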
head(ds_skills_tech, 2)
##   Keyword Skill_Set    LinkedIn      Indeed SimplyHired     Monster
## 1  Python      Tech 0.007371661 0.007430907 0.007542439 0.006791244
## 2       R      Tech 0.005288037 0.006045154 0.006249674 0.006313401
##         Total
## 1 0.007314637
## 2 0.005823289
head(ds_skills_soft, 2)
##            Keyword Skill_Set    LinkedIn      Indeed SimplyHired
## 1         analysis      Soft 0.006002323 0.006811989 0.006967877
## 2 machine learning      Soft 0.006621370 0.006693266 0.006688430
##       Monster       Total
## 1 0.008825414 0.006866764
## 2 0.006246663 0.006584908
Employers look for many skills and skill sets when hiring a data scientist. Given that data science is a multidisciplinary field, it is only fitting that companies look for both technology-based skills and soft skills in a data scientist.
The top technology skill that employers are looking for when hiring a data scientist, according to this dataset, is Python.
The top soft skill that employers are looking for when hiring a data scientist, according to this dataset, is analysis.
The top overall skill that employers are looking for when hiring a data scientist, according to this dataset, is Python.
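These conclusions can be checked directly from the Total column. The base R sketch below uses only the top two rows of each sorted table as printed in this document; the names `top_rows`, `top_by_set`, and `top_overall` are illustrative and not part of the original analysis.

```r
# Top two rows of each skill set, copied from the head() output in this document
top_rows <- data.frame(
  Keyword = c("Python", "R", "analysis", "machine learning"),
  Skill_Set = c("Tech", "Tech", "Soft", "Soft"),
  Total = c(0.007314637, 0.005823289, 0.006866764, 0.006584908),
  stringsAsFactors = FALSE
)

# Row with the highest Total within each skill set
top_by_set <- do.call(
  rbind,
  lapply(split(top_rows, top_rows$Skill_Set),
         function(d) d[which.max(d$Total), ])
)

# Row with the highest Total overall
top_overall <- top_rows[which.max(top_rows$Total), ]

top_by_set$Keyword   # "analysis" "Python"
top_overall$Keyword  # "Python"
```

Because both tables were already sorted by Total, their first rows are enough to identify the leaders; Python's Total (0.007314637) exceeds that of analysis (0.006866764), which is why it is also the top skill overall.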