library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
This is a project for CUNY SPS 607 - Data Acquisition and Management. It was completed by Renida Kasa.
W. Edwards Deming said, “In God we trust, all others must bring data.” Please use data to answer the question, “Which are the most valued data science skills?” Consider your work as an exploration; there is not necessarily a “right answer.”
My motive for this project was to gain insight into the skills most useful to a data scientist, which should in turn help me become a better one.
I used a dataset published by Jeff Hale from the following Kaggle link to answer the research question: https://www.kaggle.com/code/discdiver/the-most-in-demand-skills-for-data-scientists
In this dataset, Hale collected job descriptions from four job boards (LinkedIn, Indeed, SimplyHired, and Monster) and extracted specific data science career terms. These terms were either general data science skills or specific computer language skills. The data science skills terms were: machine learning, analysis, statistics, computer science, communication, mathematics, visualization, AI composite, deep learning, NLP composite, software development, neural networks, data engineering, project management, and software engineering.
You can also read more about the work he published in the following article: https://towardsdatascience.com/the-most-in-demand-skills-for-data-scientists-4a4a8db896db
I will be using this data to demonstrate which are the most valued data science skills.
ds_skills <- read.csv('https://raw.githubusercontent.com/rkasa01/DATA607_PROJECT3/main/Data%20Science%20Career%20Terms%20-%20ds%20skills.csv')
print(ds_skills)
## Keyword
## 1 machine learning
## 2 analysis
## 3 statistics
## 4 computer science
## 5 communication
## 6 mathematics
## 7 visualization
## 8 AI composite
## 9 deep learning
## 10 NLP composite
## 11 software development
## 12 neural networks
## 13 data engineering
## 14 project management
## 15 software engineering
## 16
## 17 Total
## 18
## 19 add AI and artificial intelligence and subtract the overlap search term with both terms in it
## 20 AI
## 21 artificial intelligence
## 22 AI + artificial intelligence
## 23
## 24 add NLP and natural language processing and subtract the overlap search term with both terms in it
## 25 NLP
## 26 natural language processing
## 27 NLP + natural language processing
## 28
## 29 "data scientist" "[keyword]"
## 30 Oct 10, 2018
## LinkedIn Indeed SimplyHired Monster
## 1 5,701 3,439 2,561 2,340
## 2 5,168 3,500 2,668 3,306
## 3 4,893 2,992 2,308 2,399
## 4 4,517 2,739 2,093 1,900
## 5 3,404 2,344 1,791 2,053
## 6 2,605 1,961 1,497 1,815
## 7 1,879 1,413 1,153 1,207
## 8 1,568 1,125 811 687
## 9 1,310 979 675 606
## 10 1,212 910 660 582
## 11 732 627 481 784
## 12 671 485 421 305
## 13 514 200
## 14 476 397 330 348
## 15 413 295 250 512
## 16
## 17 35,063 23,206 17,699 19,044
## 18
## 19
## 20 916 690 508 680
## 21 964 754 498 679
## 22 312 319 195 672
## 23
## 24
## 25 643 466 362 576
## 26 791 621 429 575
## 27 222 177 131 569
## 28
## 29
## 30
summary(ds_skills)
## Keyword LinkedIn Indeed SimplyHired
## Length:30 Length:30 Length:30 Length:30
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## Monster
## Length:30
## Class :character
## Mode :character
variable_names <- colnames(ds_skills)
print(variable_names)
## [1] "Keyword" "LinkedIn" "Indeed" "SimplyHired" "Monster"
It looked like the commas in the values were causing the numbers to be read as characters, which could hinder me from interpreting the data correctly. I would have to remove all the commas and then convert the values for each job board to numeric.
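To see why the commas matter, here is a minimal sketch using a single value from the LinkedIn column. The variable names `raw_value`, `direct`, and `cleaned` are only for illustration and are not part of the analysis.

```r
# One count exactly as it appears in the CSV, with a thousands separator.
raw_value <- "5,701"

# Direct conversion fails: R cannot parse the comma, so the result is NA.
# suppressWarnings() hides the coercion warning R would otherwise print.
direct <- suppressWarnings(as.numeric(raw_value))

# Stripping the comma first gives the intended number.
cleaned <- as.numeric(gsub(",", "", raw_value))

print(direct)
print(cleaned)
```

The same `gsub()` plus `as.numeric()` pair is what gets applied to each job board column.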
cols_to_remove_commas <- c("LinkedIn", "Indeed", "SimplyHired", "Monster")
for (col in cols_to_remove_commas) { #removed commas from the columns
ds_skills[[col]] <- gsub(",", "", ds_skills[[col]])
}
ds_skills$LinkedIn <- as.numeric(ds_skills$LinkedIn) #converted each column to numeric
ds_skills$Indeed <- as.numeric(ds_skills$Indeed)
ds_skills$SimplyHired <- as.numeric(ds_skills$SimplyHired)
ds_skills$Monster <- as.numeric(ds_skills$Monster)
summary(ds_skills)
## Keyword LinkedIn Indeed SimplyHired
## Length:30 Min. : 222 Min. : 177 Min. : 131
## Class :character 1st Qu.: 650 1st Qu.: 485 1st Qu.: 421
## Mode :character Median : 1088 Median : 910 Median : 660
## Mean : 3362 Mean : 2354 Mean : 1787
## 3rd Qu.: 3204 3rd Qu.: 2344 3rd Qu.: 1791
## Max. :35063 Max. :23206 Max. :17699
## NA's :8 NA's :9 NA's :9
## Monster
## Min. : 200.0
## 1st Qu.: 575.2
## Median : 679.5
## Mean : 1901.8
## 3rd Qu.: 1878.8
## Max. :19044.0
## NA's :8
Here, I have a more accurate view of the summary statistics for each job board. For example, for LinkedIn, the smallest count is 222 and the largest is 35,063. These extremes come from helper rows rather than individual skills: 222 is the “NLP + natural language processing” overlap row and 35,063 is the “Total” row. The summary also shows missing values in each of the four job board columns.
sapply(ds_skills, function(x) sum(is.na(x)))
## Keyword LinkedIn Indeed SimplyHired Monster
## 0 8 9 9 8
There are 34 missing values in total (8 + 9 + 9 + 8), and the rows that contain them will have to be removed.
ds_skills <- ds_skills %>% #keeping only rows that have a count for every job board
filter(!is.na(LinkedIn), !is.na(Indeed), !is.na(SimplyHired), !is.na(Monster))
sapply(ds_skills, function(x) sum(is.na(x)))
## Keyword LinkedIn Indeed SimplyHired Monster
## 0 0 0 0 0
Here, I filtered out the rows with missing values, since they will not be helpful for the purposes of this assignment. This also drops “data engineering”, which has no count for Indeed or SimplyHired. I can now use this updated set of data to create some plots.
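As a hedged alternative to filtering on each column, `tidyr::drop_na()` can remove every row that is missing any of the listed columns in one call. The sketch below uses three rows copied from the table above; `skills_sample`, `board_cols`, and `skills_complete` are hypothetical names, and the `NA` cells stand in for the blank Indeed and SimplyHired entries of “data engineering”.

```r
library(tidyr)

# Three rows from the skills table, already converted to numeric.
skills_sample <- data.frame(
  Keyword     = c("neural networks", "data engineering", "project management"),
  LinkedIn    = c(671, 514, 476),
  Indeed      = c(485, NA, 397),
  SimplyHired = c(421, NA, 330),
  Monster     = c(305, 200, 348),
  stringsAsFactors = FALSE
)

board_cols <- c("LinkedIn", "Indeed", "SimplyHired", "Monster")

# Drop any row with a missing value in one of the job board columns.
skills_complete <- drop_na(skills_sample, all_of(board_cols))

print(skills_complete)
```

Only “neural networks” and “project management” remain, which shows how a skill with partial data disappears from the later plots.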
ds_skills <- ds_skills %>% #removing the total so it does not show in the graph
filter(Keyword != "Total")
linkedin_bar <- ggplot(ds_skills, aes(x = LinkedIn, y = Keyword)) +
geom_bar(stat = "identity", fill = "blue") +
labs(title = "LinkedIn Data Science Skills Count (2018)", x = "Count", y = "Data Science Skills") +
theme_minimal()
print(linkedin_bar)
Here, I created a plot of the data science skill counts from LinkedIn in 2018. The most common term was “machine learning”. Machine learning is a branch of artificial intelligence in which statistical algorithms learn patterns from data. For example, in remote or robotic surgery, machine learning can help refine a surgeon’s techniques over time. The shortest bar belongs to “NLP + natural language processing” (222), which is the overlap row used to build the NLP composite rather than a skill of its own; among the named skills, “software engineering” (413) is the least common.
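Reading the most and least common terms off a bar chart is easier when the bars are sorted by count. The following is a minimal sketch of that idea using `reorder()` and four LinkedIn rows copied from the table; `skills_sample` and `sorted_bar` are hypothetical names, and `geom_col()` is used as a shorthand for `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# Four LinkedIn counts copied from the skills table.
skills_sample <- data.frame(
  Keyword  = c("machine learning", "analysis", "statistics", "software engineering"),
  LinkedIn = c(5701, 5168, 4893, 413),
  stringsAsFactors = FALSE
)

# reorder() sorts the keywords by count, so the longest bar sits at the top.
sorted_bar <- ggplot(skills_sample, aes(x = LinkedIn, y = reorder(Keyword, LinkedIn))) +
  geom_col(fill = "blue") +
  labs(title = "LinkedIn Data Science Skills Count (2018)",
       x = "Count", y = "Data Science Skills") +
  theme_minimal()

print(sorted_bar)
```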
indeed_bar <- ggplot(ds_skills, aes(x = Indeed, y = Keyword)) +
geom_bar(stat = "identity", fill = "green") +
labs(title = "Indeed Data Science Skills Count (2018)", x = "Count", y = "Data Science Skills") +
theme_minimal()
print(indeed_bar)
Here, I created a plot of the data science skill counts from Indeed in 2018. The most common term was “analysis”. Analysis is a widely applicable term, meaning observation or examination. For example, as data scientists, we constantly analyze sets of data, just as I am doing for this assignment right now! The shortest bar on Indeed is again the “NLP + natural language processing” overlap row (177); among the named skills, “software engineering” (295) is the least common.
simplyhired_bar <- ggplot(ds_skills, aes(x = SimplyHired, y = Keyword)) +
geom_bar(stat = "identity", fill = "orange") +
labs(title = "SimplyHired Data Science Skills Count (2018)", x = "Count", y = "Data Science Skills") +
theme_minimal()
print(simplyhired_bar)
Here, I created a plot of the data science skill counts from SimplyHired in 2018. The most common term here was also “analysis”. The shortest bar on SimplyHired is once again the “NLP + natural language processing” overlap row (131); among the named skills, “software engineering” (250) is the least common.
monster_bar <- ggplot(ds_skills, aes(x = Monster, y = Keyword)) +
geom_bar(stat = "identity", fill = "red") +
labs(title = "Monster Data Science Skills Count (2018)", x = "Count", y = "Data Science Skills") +
theme_minimal()
print(monster_bar)
Here, I created a plot of the data science skill counts from Monster in 2018. The most common term was also “analysis”. The least common term reported on Monster was “neural networks”.
combined_data <- ds_skills %>%
pivot_longer(cols = c(LinkedIn, Indeed, SimplyHired, Monster),
names_to = "Source",
values_to = "Count")
combined_bar_plot <- ggplot(combined_data, aes(x = Keyword, y = Count, fill = Source)) +
geom_bar(stat = "identity", position = "identity", width = 0.7) +
labs(title = "Data Science Skills Count (2018)", x = "Data Science Skills", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
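scale_fill_manual(values = c(LinkedIn = "blue", Indeed = "green", SimplyHired = "orange", Monster = "red")) #named so each board keeps the color used in its own plot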
print(combined_bar_plot)
Here, we have a graph of the combined data science skills count. This is nice to look at because it puts into perspective which job boards are using specific terms relative to the others. The three peaks for this plot were machine learning, statistics, and analysis.
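With `position = "identity"`, the four boards' bars for a keyword are drawn on top of one another, so smaller counts can end up hidden behind larger ones. A hedged alternative is to place the bars side by side with `position = "dodge"`. The sketch below uses three rows copied from the table; `skills_sample`, `long_sample`, and `dodged_bar` are hypothetical names.

```r
library(tidyr)
library(ggplot2)

# Three skills with their counts on each job board, copied from the table.
skills_sample <- data.frame(
  Keyword     = c("machine learning", "analysis", "statistics"),
  LinkedIn    = c(5701, 5168, 4893),
  Indeed      = c(3439, 3500, 2992),
  SimplyHired = c(2561, 2668, 2308),
  Monster     = c(2340, 3306, 2399),
  stringsAsFactors = FALSE
)

# One row per keyword and job board.
long_sample <- pivot_longer(skills_sample,
                            cols = c(LinkedIn, Indeed, SimplyHired, Monster),
                            names_to = "Source", values_to = "Count")

# position = "dodge" draws one bar per board next to each other.
dodged_bar <- ggplot(long_sample, aes(x = Keyword, y = Count, fill = Source)) +
  geom_col(position = "dodge", width = 0.7) +
  labs(title = "Data Science Skills Count (2018)",
       x = "Data Science Skills", y = "Count") +
  theme_minimal() +
  scale_fill_manual(values = c(LinkedIn = "blue", Indeed = "green",
                               SimplyHired = "orange", Monster = "red"))

print(dodged_bar)
```

Naming the color vector ties each color to a board explicitly; an unnamed vector is assigned in the alphabetical order of the `Source` values.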
#Top 10 Skills Across All Job Boards
# Calculate the top 10 skills for each job board
top_skills_per_source <- combined_data %>%
group_by(Source, Keyword) %>%
summarise(Total_Count = sum(Count)) %>%
arrange(Source, desc(Total_Count)) %>%
top_n(10)
## `summarise()` has grouped output by 'Source'. You can override using the
## `.groups` argument.
## Selecting by Total_Count
# Combine the top 10 skills from each source into a single data frame
combined_top_skills <- top_skills_per_source %>%
group_by(Keyword) %>%
summarise(Total_Count = sum(Total_Count))
# Create a bar plot of the combined top skills, ordered by total count
combined_top_skills_plot <- ggplot(combined_top_skills, aes(x = reorder(Keyword, Total_Count), y = Total_Count, fill = Keyword)) +
geom_bar(stat = "identity", width = 0.7) +
labs(title = "Top 10 Data Science Skills Across All Sources", x = "Data Science Skills", y = "Total Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(combined_top_skills_plot)
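In dplyr 1.0.0 and later, `top_n()` is superseded by `slice_max()`, which names the ranking column explicitly. The following is a small sketch of picking the top rows per job board that way, assuming a long-format table like `combined_data`; the sample values are copied from the LinkedIn and Monster columns, and `long_sample` and `top_per_source` are hypothetical names.

```r
library(dplyr)

# Long-format sample: one row per job board and keyword.
long_sample <- data.frame(
  Source  = rep(c("LinkedIn", "Monster"), each = 3),
  Keyword = rep(c("machine learning", "analysis", "statistics"), times = 2),
  Count   = c(5701, 5168, 4893, 2340, 3306, 2399),
  stringsAsFactors = FALSE
)

# Keep the two highest counts within each job board.
top_per_source <- long_sample %>%
  group_by(Source) %>%
  slice_max(Count, n = 2) %>%
  ungroup()

print(top_per_source)
```

For LinkedIn this keeps “machine learning” and “analysis”, while for Monster it keeps “analysis” and “statistics”, so the set of top terms can differ from board to board.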
ds_software <- read.csv('https://raw.githubusercontent.com/rkasa01/DATA607_PROJECT3/main/Data%20Science%20Career%20Terms%20-%20ds%20software.csv')
print(ds_software)
## Keyword LinkedIn Indeed SimplyHired Monster LinkedIn..
## 1 Python 6,347 3,818 2,888 2,544 74%
## 2 R 4,553 3,106 2,393 2,365 53%
## 3 SQL 3,879 2,628 2,056 1,841 45%
## 4 Spark 2,169 1,551 1,167 1,062 25%
## 5 Hadoop 2,142 1,578 1,164 1,200 25%
## 6 Java 1,944 1,377 1,059 1,002 23%
## 7 SAS 1,713 1,134 910 978 20%
## 8 Tableau 1,216 1,012 780 744 14%
## 9 Hive 1,182 830 637 619 14%
## 10 Scala 1,040 739 589 520 12%
## 11 C++ 1,024 765 580 439 12%
## 12 AWS 947 791 607 467 11%
## 13 TensorFlow 844 661 501 385 10%
## 14 Matlab 806 677 544 419 9%
## 15 C 795 492 384 523 9%
## 16 Excel 701 569 438 397 8%
## 17 Linux 601 517 364 303 7%
## 18 NoSQL 598 436 387 362 7%
## 19 Azure 578 416 285 272 7%
## 20 Scikit-learn 474 402 294 212 6%
## 21 SPSS 452 330 273 202 5%
## 22 Pandas 421 330 282 175 5%
## 23 Numpy 387 257 232 152 4%
## 24 Pig 367 296 231 256 4%
## 25 D3 353 149 113 95 4%
## 26 Keras 329 253 205 131 4%
## 27 Javascript 328 245 214 224 4%
## 28 C# 324 245 182 219 4%
## 29 Perl 309 258 202 198 4%
## 30 Hbase 302 219 167 138 4%
## 31 Docker 290 240 148 194 3%
## 32 Git 282 261 186 145 3%
## 33 MySQL 278 233 187 121 3%
## 34 MongoDB 251 196 165 116 3%
## 35 Cassandra 236 174 146 136 3%
## 36 PyTorch 214 143 131 98 2%
## 37 Caffe 206 149 113 96 2%
## 38
## 39 Total 38,882 27,477 21,204 19,350
## 40 "data scientist" alone 8,610 5,138 3,829 3,746
## 41 "data scientist" "[keyword]"
## 42 Oct. 10, 2018
## Indeed.. SimplyHired.. Monster.. Avg.. GlassDoor.Self.Reported...2017
## 1 74% 75% 68% 73% 72%
## 2 60% 62% 63% 60% 64%
## 3 51% 54% 49% 50% 51%
## 4 30% 30% 28% 29% 27%
## 5 31% 30% 32% 30% 39%
## 6 27% 28% 27% 26% 33%
## 7 22% 24% 26% 23% 30%
## 8 20% 20% 20% 19% 14%
## 9 16% 17% 17% 16% 17%
## 10 14% 15% 14% 14%
## 11 15% 15% 12% 13%
## 12 15% 16% 12% 14%
## 13 13% 13% 10% 12%
## 14 13% 14% 11% 12% 20%
## 15 10% 10% 14% 11%
## 16 11% 11% 11% 10%
## 17 10% 10% 8% 9%
## 18 8% 10% 10% 9%
## 19 8% 7% 7% 7%
## 20 8% 8% 6% 7%
## 21 6% 7% 5% 6%
## 22 6% 7% 5% 6%
## 23 5% 6% 4% 5%
## 24 6% 6% 7% 6%
## 25 3% 3% 3% 3%
## 26 5% 5% 3% 4%
## 27 5% 6% 6% 5%
## 28 5% 5% 6% 5%
## 29 5% 5% 5% 5%
## 30 4% 4% 4% 4%
## 31 5% 4% 5% 4%
## 32 5% 5% 4% 4%
## 33 5% 5% 3% 4%
## 34 4% 4% 3% 4%
## 35 3% 4% 4% 3%
## 36 3% 3% 3% 3%
## 37 3% 3% 3% 3%
## 38
## 39
## 40
## 41
## 42
## Difference
## 1 1%
## 2 -4%
## 3 -1%
## 4 2%
## 5 -9%
## 6 -7%
## 7 -7%
## 8 5%
## 9 -1%
## 10
## 11
## 12
## 13
## 14 -8%
## 15
## 16
## 17
## 18
## 19
## 20
## 21
## 22
## 23
## 24
## 25
## 26
## 27
## 28
## 29
## 30
## 31
## 32
## 33
## 34
## 35
## 36
## 37
## 38
## 39
## 40
## 41
## 42
cols_to_remove_commas <- c("LinkedIn", "Indeed", "SimplyHired", "Monster")
for (col in cols_to_remove_commas) { #removed commas from the columns
ds_software[[col]] <- gsub(",", "", ds_software[[col]])
}
ds_software$LinkedIn <- as.numeric(ds_software$LinkedIn) #converted each column to numeric
ds_software$Indeed <- as.numeric(ds_software$Indeed)
ds_software$SimplyHired <- as.numeric(ds_software$SimplyHired)
ds_software$Monster <- as.numeric(ds_software$Monster)
summary(ds_software)
## Keyword LinkedIn Indeed SimplyHired
## Length:42 Min. : 206 Min. : 143 Min. : 113.0
## Class :character 1st Qu.: 326 1st Qu.: 249 1st Qu.: 194.5
## Mode :character Median : 598 Median : 436 Median : 364.0
## Mean : 2215 Mean : 1541 Mean : 1185.6
## 3rd Qu.: 1199 3rd Qu.: 921 3rd Qu.: 708.5
## Max. :38882 Max. :27477 Max. :21204.0
## NA's :3 NA's :3 NA's :3
## Monster LinkedIn.. Indeed.. SimplyHired..
## Min. : 95.0 Length:42 Length:42 Length:42
## 1st Qu.: 163.5 Class :character Class :character Class :character
## Median : 303.0 Mode :character Mode :character Mode :character
## Mean : 1088.4
## 3rd Qu.: 681.5
## Max. :19350.0
## NA's :3
## Monster.. Avg.. GlassDoor.Self.Reported...2017
## Length:42 Length:42 Length:42
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Difference
## Length:42
## Class :character
## Mode :character
##
##
##
##
I repeated the steps above, this time for the computer language and software terms, to gain insight into the most useful computer languages within the field. I removed the commas and converted the four count columns to numeric just as before; the percentage columns are still stored as character. With the improved summary statistics, I have a more accurate view of each job board. For example, for LinkedIn, the smallest count is 206 (Caffe) and the largest is 38,882, which is the “Total” row rather than a single language; the most frequent individual term is Python at 6,347. I will have to remove the rows with missing data just as before.
ds_software <- ds_software %>% #keeping only rows that have a count for every job board
filter(!is.na(LinkedIn), !is.na(Indeed), !is.na(SimplyHired), !is.na(Monster))
sapply(ds_software, function(x) sum(is.na(x)))
## Keyword LinkedIn
## 0 0
## Indeed SimplyHired
## 0 0
## Monster LinkedIn..
## 0 0
## Indeed.. SimplyHired..
## 0 0
## Monster.. Avg..
## 0 0
## GlassDoor.Self.Reported...2017 Difference
## 0 0
Here, I filtered out the missing data just as before.
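Since the same cleaning steps are applied to both data sets, they could be wrapped in a small helper function. The sketch below is one way to do that; `clean_counts()` is a hypothetical helper and not part of the original analysis, and the sample rows are copied from the software table.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: strips commas, converts the listed columns to numeric,
# and drops rows where any of those columns is missing.
clean_counts <- function(df, cols) {
  df %>%
    mutate(across(all_of(cols), ~ as.numeric(gsub(",", "", .x)))) %>%
    drop_na(all_of(cols))
}

# Two language rows, one blank separator row, and the Total row.
software_sample <- data.frame(
  Keyword     = c("Python", "R", "", "Total"),
  LinkedIn    = c("6,347", "4,553", "", "38,882"),
  Indeed      = c("3,818", "3,106", "", "27,477"),
  SimplyHired = c("2,888", "2,393", "", "21,204"),
  Monster     = c("2,544", "2,365", "", "19,350"),
  stringsAsFactors = FALSE
)

board_cols <- c("LinkedIn", "Indeed", "SimplyHired", "Monster")

# Clean the counts, then remove the Total row so it does not appear in plots.
software_clean <- clean_counts(software_sample, board_cols) %>%
  filter(Keyword != "Total")

print(software_clean)
```

The same call would work for the skills table, because both files share the four job board column names.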
ds_software <- ds_software %>% #removing the total so it does not show in the graph
filter(Keyword != "Total")
combined_data <- ds_software %>%
pivot_longer(cols = c(LinkedIn, Indeed, SimplyHired, Monster),
names_to = "Source",
values_to = "Count")
combined_bar_plot <- ggplot(combined_data, aes(x = Keyword, y = Count, fill = Source)) +
geom_bar(stat = "identity", position = "identity", width = 0.7) +
labs(title = "Data Science Software Count (2018)", x = "Software / Language", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
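scale_fill_manual(values = c(LinkedIn = "blue", Indeed = "green", SimplyHired = "orange", Monster = "red")) #named so each board keeps the color used in its own plot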
print(combined_bar_plot)
From this combined plot of computer language counts, we can see which job boards contain job descriptions demanding specific computer languages relative to the others. The three peaks among the languages were Python, R, and SQL. I would not count ‘“data scientist” alone’ as a peak: judging by the counts, it is the number of listings returned for “data scientist” with no additional keyword, so it works as a baseline rather than as a skill. For example, Python’s 6,347 LinkedIn listings out of 8,610 is about 74%, which matches the LinkedIn percentage column.
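The counts printed earlier suggest that the ‘“data scientist” alone’ row serves as the denominator for the percentage columns. The following sketch checks that reading for LinkedIn, using values copied from the software table; `linkedin_counts`, `linkedin_baseline`, and `linkedin_share` are illustrative names.

```r
# LinkedIn counts for the three most frequent terms, copied from the table.
linkedin_counts <- c(Python = 6347, R = 4553, SQL = 3879)

# LinkedIn count for the search "data scientist" with no extra keyword.
linkedin_baseline <- 8610

# Share of "data scientist" listings that also mention each term, in percent.
linkedin_share <- round(100 * linkedin_counts / linkedin_baseline)

print(linkedin_share)
```

The rounded shares come out to 74, 53, and 45, which agree with the 74%, 53%, and 45% shown in the `LinkedIn..` column.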
Because all of these phrases come from “data scientist” job listings, I would argue that most, if not all, of them are important skills for a data scientist to possess; otherwise they would not appear in so many job descriptions. It would be interesting to compare this data with a more recent set, or with listings from other job boards. One limitation is that the search terms are broad: a count only shows that a term appears in a listing, not how the skill is used or which computer languages go with which general skill. The ‘“data scientist” alone’ row, for instance, counts listings without saying which languages they mention.
To conclude, terms associated with various forms of analysis are the top keywords in data scientist job descriptions. For this reason, I would say that analysis, together with closely related skills such as machine learning, statistics, computer science, and mathematics, is among the top data science skills. Communication is a key factor in data science as well, since the findings of these analyses have to be conveyed to others. The NLP, or natural language processing, rows sit near the bottom of most of the job board charts, although the smallest of them is an overlap row, and the NLP composite itself ranks tenth of the fifteen skills on LinkedIn. NLP offers something that many other data science skills lack on their own: the ability to work with human language. It is challenging to capture human sentiment in technology, which could be why it ranked below most of the other keywords. It has many limitations, such as understanding sarcasm, irony, or a writer’s intention.
In terms of computer languages, the most commonly used terms across all job boards were Python, R, and SQL. The data science master’s program at CUNY SPS is doing a great job of making sure we are immersed in these computer languages, and others as well.