library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)

Project 3

This is a project for CUNY SPS 607 - Data Acquisition and Management. It was completed by Renida Kasa.

Research Question

W. Edwards Deming said, “In God we trust, all others must bring data.” Please use data to answer the question, “Which are the most valued data science skills?” Consider your work as an exploration; there is not necessarily a “right answer.”

Motivation

My motivation for this project was to gain insight into the most useful skills for a data scientist, and in turn to become a better data scientist myself.

Approach

I used a dataset published by Jeff Hale from the following Kaggle link to answer the research question: https://www.kaggle.com/code/discdiver/the-most-in-demand-skills-for-data-scientists

In this dataset, Hale collected job descriptions from four job boards (LinkedIn, Indeed, SimplyHired, and Monster) and extracted specific data science career terms. These terms were either general data science skills or specific computer language skills. The data science skills terms were: machine learning, analysis, statistics, computer science, communication, mathematics, visualization, AI composite, deep learning, NLP composite, software development, neural networks, data engineering, project management, and software engineering.

You can also read more about the work he published in the following article: https://towardsdatascience.com/the-most-in-demand-skills-for-data-scientists-4a4a8db896db

I will be using this data to demonstrate which are the most valued data science skills.

Loading the Dataset

ds_skills <- read.csv('https://raw.githubusercontent.com/rkasa01/DATA607_PROJECT3/main/Data%20Science%20Career%20Terms%20-%20ds%20skills.csv')
print(ds_skills)
##                                                                                               Keyword
## 1                                                                                    machine learning
## 2                                                                                            analysis
## 3                                                                                          statistics
## 4                                                                                    computer science
## 5                                                                                       communication
## 6                                                                                         mathematics
## 7                                                                                       visualization
## 8                                                                                        AI composite
## 9                                                                                       deep learning
## 10                                                                                      NLP composite
## 11                                                                               software development
## 12                                                                                    neural networks
## 13                                                                                   data engineering
## 14                                                                                 project management
## 15                                                                               software engineering
## 16                                                                                                   
## 17                                                                                              Total
## 18                                                                                                   
## 19      add AI and artificial intelligence and subtract the overlap search term with both terms in it
## 20                                                                                                 AI
## 21                                                                            artificial intelligence
## 22                                                                       AI + artificial intelligence
## 23                                                                                                   
## 24 add NLP and natural language processing and subtract the overlap search term with both terms in it
## 25                                                                                                NLP
## 26                                                                        natural language processing
## 27                                                                  NLP + natural language processing
## 28                                                                                                   
## 29                                                                       "data scientist" "[keyword]"
## 30                                                                                       Oct 10, 2018
##    LinkedIn Indeed SimplyHired Monster
## 1     5,701  3,439       2,561   2,340
## 2     5,168  3,500       2,668   3,306
## 3     4,893  2,992       2,308   2,399
## 4     4,517  2,739       2,093   1,900
## 5     3,404  2,344       1,791   2,053
## 6     2,605  1,961       1,497   1,815
## 7     1,879  1,413       1,153   1,207
## 8     1,568  1,125         811     687
## 9     1,310    979         675     606
## 10    1,212    910         660     582
## 11      732    627         481     784
## 12      671    485         421     305
## 13      514                        200
## 14      476    397         330     348
## 15      413    295         250     512
## 16                                    
## 17   35,063 23,206      17,699  19,044
## 18                                    
## 19                                    
## 20      916    690         508     680
## 21      964    754         498     679
## 22      312    319         195     672
## 23                                    
## 24                                    
## 25      643    466         362     576
## 26      791    621         429     575
## 27      222    177         131     569
## 28                                    
## 29                                    
## 30

Summary Statistics

summary(ds_skills)
##    Keyword            LinkedIn            Indeed          SimplyHired       
##  Length:30          Length:30          Length:30          Length:30         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##    Monster         
##  Length:30         
##  Class :character  
##  Mode  :character
variable_names <- colnames(ds_skills)
print(variable_names)
## [1] "Keyword"     "LinkedIn"    "Indeed"      "SimplyHired" "Monster"

The commas in the values caused the counts to be read as character strings, which would prevent me from interpreting the data correctly. I would have to remove all the commas and then convert the values for each job board to numeric.
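As a minimal illustration of the problem, the sketch below uses made-up strings in the same format as the job board columns (the vector `raw_counts` is hypothetical and not part of the dataset). `as.numeric()` cannot parse a value containing a thousands separator, so the comma has to be stripped first.

```r
# Hypothetical values formatted like the job board columns
raw_counts <- c("5,701", "732", "")

# Direct conversion cannot parse "5,701" and returns NA for it (with a coercion warning)
direct <- suppressWarnings(as.numeric(raw_counts))

# Stripping the commas first lets the value parse; the empty string still becomes NA
cleaned <- as.numeric(gsub(",", "", raw_counts, fixed = TRUE))

print(direct)
print(cleaned)
```

The empty string stays `NA` in both cases, which is why blank cells in the CSV show up as missing values after conversion.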

Converting from Character to Numeric

cols_to_remove_commas <- c("LinkedIn", "Indeed", "SimplyHired", "Monster") 
for (col in cols_to_remove_commas) { #removed commas from the columns
  ds_skills[[col]] <- gsub(",", "", ds_skills[[col]])
}

ds_skills$LinkedIn <- as.numeric(ds_skills$LinkedIn) #converted each column to numeric
ds_skills$Indeed <- as.numeric(ds_skills$Indeed)
ds_skills$SimplyHired <- as.numeric(ds_skills$SimplyHired)
ds_skills$Monster <- as.numeric(ds_skills$Monster)

summary(ds_skills)
##    Keyword             LinkedIn         Indeed       SimplyHired   
##  Length:30          Min.   :  222   Min.   :  177   Min.   :  131  
##  Class :character   1st Qu.:  650   1st Qu.:  485   1st Qu.:  421  
##  Mode  :character   Median : 1088   Median :  910   Median :  660  
##                     Mean   : 3362   Mean   : 2354   Mean   : 1787  
##                     3rd Qu.: 3204   3rd Qu.: 2344   3rd Qu.: 1791  
##                     Max.   :35063   Max.   :23206   Max.   :17699  
##                     NA's   :8       NA's   :9       NA's   :9      
##     Monster       
##  Min.   :  200.0  
##  1st Qu.:  575.2  
##  Median :  679.5  
##  Mean   : 1901.8  
##  3rd Qu.: 1878.8  
##  Max.   :19044.0  
##  NA's   :8

Here, I have a more accurate view of the summary statistics for each job board. For example, for LinkedIn, the minimum count was 222 and the maximum was 35,063. Note that the maximum comes from the “Total” row rather than from a single keyword, and the minimum comes from the “NLP + natural language processing” overlap row. These summary statistics also show some missing data in each of these columns.

Missing Data

sapply(ds_skills, function(x) sum(is.na(x)))
##     Keyword    LinkedIn      Indeed SimplyHired     Monster 
##           0           8           9           9           8

There are 34 missing values in total across the four job board columns, so the rows containing them will have to be removed.

ds_skills <- ds_skills %>%
  filter(!is.na(LinkedIn))
ds_skills <- ds_skills %>%
  filter(!is.na(Indeed))
ds_skills <- ds_skills %>%
  filter(!is.na(SimplyHired))
ds_skills <- ds_skills %>%
  filter(!is.na(Monster))

sapply(ds_skills, function(x) sum(is.na(x)))
##     Keyword    LinkedIn      Indeed SimplyHired     Monster 
##           0           0           0           0           0

Here, I filtered out the missing data since it will not be helpful for the purposes of this assignment. I can now use this updated set of data to create some plots.
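The four filters can also be written as a single step. The sketch below uses a small hypothetical data frame (`toy`, not the real dataset) and `tidyr::drop_na()`, which drops any row with a missing value in the listed columns. One consequence worth keeping in mind is that a row missing on only some boards, as “data engineering” is in the table above, is dropped entirely.

```r
library(tidyr)

# Hypothetical miniature of the cleaned table; values are illustrative only
toy <- data.frame(
  Keyword = c("skill a", "skill b", "note row"),
  LinkedIn = c(100, 50, NA),
  Indeed = c(80, NA, NA),
  SimplyHired = c(60, NA, NA),
  Monster = c(70, 20, NA)
)

# Keep only rows that have a count on every job board
toy_complete <- drop_na(toy, LinkedIn, Indeed, SimplyHired, Monster)

print(toy_complete)
```

Only "skill a" survives here: "skill b" is removed even though it has counts for two of the four boards.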

Plots

Most Common Data Science Skills - LinkedIn (2018)

ds_skills <- ds_skills %>% #removing the total so it does not show in the graph
  filter(Keyword != "Total")

linkedin_bar <- ggplot(ds_skills, aes(x = LinkedIn, y = Keyword)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "LinkedIn Data Science Skills Count (2018)", x = "Count", y = "Data Science Skills") +
  theme_minimal()

print(linkedin_bar)

Here, I created a plot of the most common data science skills on LinkedIn in 2018. The most common term was “machine learning”. Machine learning is a branch of artificial intelligence in which statistical algorithms learn from data. For example, in remote or robotic surgery, machine learning can help improve a surgeon’s techniques over time. It looks like the least common term was natural language processing, or NLP: the shortest bar belongs to the “NLP + natural language processing” overlap row (222).
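Because `Keyword` is a character column, ggplot2 draws the bars in alphabetical order by default, which makes the most and least common terms harder to spot. A minimal sketch of sorting the bars by count with `reorder()` follows; the data frame `toy` and its values are hypothetical and only stand in for `ds_skills`.

```r
library(ggplot2)

# Hypothetical counts, for illustration only
toy <- data.frame(
  Keyword = c("skill a", "skill b", "skill c"),
  LinkedIn = c(300, 900, 600)
)

# reorder() sorts the factor levels by count (ascending),
# so the largest count ends up at the top of the y-axis
toy$Keyword <- reorder(toy$Keyword, toy$LinkedIn)

ordered_bar <- ggplot(toy, aes(x = LinkedIn, y = Keyword)) +
  geom_col(fill = "blue") +
  labs(x = "Count", y = "Data Science Skills") +
  theme_minimal()

print(ordered_bar)
```

The same `reorder()` call could be placed directly inside `aes()` if the original column should stay unchanged.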

Most Common Data Science Skills - Indeed (2018)

indeed_bar <- ggplot(ds_skills, aes(x = Indeed, y = Keyword)) +
  geom_bar(stat = "identity", fill = "green") +
  labs(title = "Indeed Data Science Skills Count (2018)", x = "Count", y = "Data Science Skills") +
  theme_minimal()

print(indeed_bar)

Here, I created a plot of the most common data science skills on Indeed in 2018. The most common term was “analysis”. Analysis is a widely applicable term, meaning observation or examination. For example, as data scientists, we constantly analyze sets of data, just as I am doing for this assignment right now! It looks like the least common term reported on Indeed was also natural language processing (the “NLP + natural language processing” row, at 177).

Most Common Data Science Skills - SimplyHired (2018)

simplyhired_bar <- ggplot(ds_skills, aes(x = SimplyHired, y = Keyword)) +
  geom_bar(stat = "identity", fill = "orange") +
  labs(title = "SimplyHired Data Science Skills Count (2018)", x = "Count", y = "Data Science Skills") +
  theme_minimal()

print(simplyhired_bar)

Here, I created a plot of the most common data science skills on SimplyHired in 2018. The most common term here was also “analysis”. It looks like the least common term reported on SimplyHired was, once again, natural language processing (the “NLP + natural language processing” row, at 131).

Most Common Data Science Skills - Monster (2018)

monster_bar <- ggplot(ds_skills, aes(x = Monster, y = Keyword)) +
  geom_bar(stat = "identity", fill = "red") +
  labs(title = "Monster Data Science Skills Count (2018)", x = "Count", y = "Data Science Skills") +
  theme_minimal()

print(monster_bar)

Here, I created a plot of the most common data science skills on Monster in 2018. The most common term was also “analysis”. It looks like the least common term reported on Monster was “neural networks”.

Graph of Job Boards Combined

combined_data <- ds_skills %>%
  pivot_longer(cols = c(LinkedIn, Indeed, SimplyHired, Monster), 
               names_to = "Source", 
               values_to = "Count")

combined_bar_plot <- ggplot(combined_data, aes(x = Keyword, y = Count, fill = Source)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) + #side-by-side bars so no job board is hidden behind another
  labs(title = "Data Science Skills Count (2018)", x = "Data Science Skills", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c(LinkedIn = "blue", Indeed = "green", SimplyHired = "orange", Monster = "red")) #named so each board keeps its color from the plots above

print(combined_bar_plot)

Here, we have a graph of the combined data science skills count. This is nice to look at because it puts into perspective how often each job board uses specific terms relative to the others. The three peaks in this plot were machine learning, statistics, and analysis.

Top 10 Skills Across All Job Boards

Findings

# Calculate the top 10 skills for each job board
top_skills_per_source <- combined_data %>%
  group_by(Source, Keyword) %>%
  summarise(Total_Count = sum(Count)) %>%
  arrange(Source, desc(Total_Count)) %>%
  top_n(10)
## `summarise()` has grouped output by 'Source'. You can override using the
## `.groups` argument.
## Selecting by Total_Count
# Combine the top 10 skills from each source into a single data frame
combined_top_skills <- top_skills_per_source %>%
  group_by(Keyword) %>%
  summarise(Total_Count = sum(Total_Count))

# Create a bar plot of the combined top skills, ordered by total count
combined_top_skills_plot <- ggplot(combined_top_skills, aes(x = reorder(Keyword, Total_Count), y = Total_Count, fill = Keyword)) +
  geom_bar(stat = "identity", width = 0.7) +
  labs(title = "Top 10 Data Science Skills Across All Sources", x = "Data Science Skills", y = "Total Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(combined_top_skills_plot)
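Because the code above keeps the top 10 within each job board before summing, a keyword that makes the top 10 on only some boards contributes a partial total, and the combined chart can show more than ten keywords. An alternative sketch, which is not the method used above, sums each keyword across all boards first and then keeps the largest totals. The data frame `toy_long` is hypothetical and stands in for `combined_data`.

```r
library(dplyr)

# Hypothetical long-format data with the same columns as combined_data
toy_long <- data.frame(
  Keyword = rep(c("skill a", "skill b", "skill c"), times = 2),
  Source = rep(c("LinkedIn", "Indeed"), each = 3),
  Count = c(300, 900, 600, 200, 500, 700)
)

# Sum across boards first, then keep the n largest totals
# (n = 2 for this toy example; n = 10 would be used for the real data)
top_overall <- toy_long %>%
  group_by(Keyword) %>%
  summarise(Total_Count = sum(Count), .groups = "drop") %>%
  slice_max(Total_Count, n = 2)

print(top_overall)
```

Setting `.groups = "drop"` also avoids the grouping message printed by `summarise()`, and `slice_max()` is the current replacement for `top_n()` in dplyr.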

Data Science Computer Language Skills

ds_software <- read.csv('https://raw.githubusercontent.com/rkasa01/DATA607_PROJECT3/main/Data%20Science%20Career%20Terms%20-%20ds%20software.csv')
print(ds_software)
##                         Keyword LinkedIn Indeed SimplyHired Monster LinkedIn..
## 1                        Python    6,347  3,818       2,888   2,544        74%
## 2                             R    4,553  3,106       2,393   2,365        53%
## 3                           SQL    3,879  2,628       2,056   1,841        45%
## 4                         Spark    2,169  1,551       1,167   1,062        25%
## 5                        Hadoop    2,142  1,578       1,164   1,200        25%
## 6                         Java     1,944  1,377       1,059   1,002        23%
## 7                           SAS    1,713  1,134         910     978        20%
## 8                       Tableau    1,216  1,012         780     744        14%
## 9                          Hive    1,182    830         637     619        14%
## 10                        Scala    1,040    739         589     520        12%
## 11                          C++    1,024    765         580     439        12%
## 12                          AWS      947    791         607     467        11%
## 13                   TensorFlow      844    661         501     385        10%
## 14                       Matlab      806    677         544     419         9%
## 15                            C      795    492         384     523         9%
## 16                        Excel      701    569         438     397         8%
## 17                        Linux      601    517         364     303         7%
## 18                        NoSQL      598    436         387     362         7%
## 19                        Azure      578    416         285     272         7%
## 20                 Scikit-learn      474    402         294     212         6%
## 21                         SPSS      452    330         273     202         5%
## 22                       Pandas      421    330         282     175         5%
## 23                        Numpy      387    257         232     152         4%
## 24                          Pig      367    296         231     256         4%
## 25                           D3      353    149         113      95         4%
## 26                        Keras      329    253         205     131         4%
## 27                   Javascript      328    245         214     224         4%
## 28                          C#       324    245         182     219         4%
## 29                         Perl      309    258         202     198         4%
## 30                        Hbase      302    219         167     138         4%
## 31                       Docker      290    240         148     194         3%
## 32                          Git      282    261         186     145         3%
## 33                        MySQL      278    233         187     121         3%
## 34                      MongoDB      251    196         165     116         3%
## 35                    Cassandra      236    174         146     136         3%
## 36                      PyTorch      214    143         131      98         2%
## 37                        Caffe      206    149         113      96         2%
## 38                                                                            
## 39                        Total   38,882 27,477      21,204  19,350           
## 40       "data scientist" alone    8,610  5,138       3,829   3,746           
## 41 "data scientist" "[keyword]"                                               
## 42                Oct. 10, 2018                                               
##    Indeed.. SimplyHired.. Monster.. Avg.. GlassDoor.Self.Reported...2017
## 1       74%           75%       68%   73%                            72%
## 2       60%           62%       63%   60%                            64%
## 3       51%           54%       49%   50%                            51%
## 4       30%           30%       28%   29%                            27%
## 5       31%           30%       32%   30%                            39%
## 6       27%           28%       27%   26%                            33%
## 7       22%           24%       26%   23%                            30%
## 8       20%           20%       20%   19%                            14%
## 9       16%           17%       17%   16%                            17%
## 10      14%           15%       14%   14%                               
## 11      15%           15%       12%   13%                               
## 12      15%           16%       12%   14%                               
## 13      13%           13%       10%   12%                               
## 14      13%           14%       11%   12%                            20%
## 15      10%           10%       14%   11%                               
## 16      11%           11%       11%   10%                               
## 17      10%           10%        8%    9%                               
## 18       8%           10%       10%    9%                               
## 19       8%            7%        7%    7%                               
## 20       8%            8%        6%    7%                               
## 21       6%            7%        5%    6%                               
## 22       6%            7%        5%    6%                               
## 23       5%            6%        4%    5%                               
## 24       6%            6%        7%    6%                               
## 25       3%            3%        3%    3%                               
## 26       5%            5%        3%    4%                               
## 27       5%            6%        6%    5%                               
## 28       5%            5%        6%    5%                               
## 29       5%            5%        5%    5%                               
## 30       4%            4%        4%    4%                               
## 31       5%            4%        5%    4%                               
## 32       5%            5%        4%    4%                               
## 33       5%            5%        3%    4%                               
## 34       4%            4%        3%    4%                               
## 35       3%            4%        4%    3%                               
## 36       3%            3%        3%    3%                               
## 37       3%            3%        3%    3%                               
## 38                                                                      
## 39                                                                      
## 40                                                                      
## 41                                                                      
## 42                                                                      
##    Difference
## 1          1%
## 2         -4%
## 3         -1%
## 4          2%
## 5         -9%
## 6         -7%
## 7         -7%
## 8          5%
## 9         -1%
## 10           
## 11           
## 12           
## 13           
## 14        -8%
## 15           
## 16           
## 17           
## 18           
## 19           
## 20           
## 21           
## 22           
## 23           
## 24           
## 25           
## 26           
## 27           
## 28           
## 29           
## 30           
## 31           
## 32           
## 33           
## 34           
## 35           
## 36           
## 37           
## 38           
## 39           
## 40           
## 41           
## 42
cols_to_remove_commas <- c("LinkedIn", "Indeed", "SimplyHired", "Monster") 
for (col in cols_to_remove_commas) { #removed commas from the columns
  ds_software[[col]] <- gsub(",", "", ds_software[[col]])
}

ds_software$LinkedIn <- as.numeric(ds_software$LinkedIn) #converted each column to numeric
ds_software$Indeed <- as.numeric(ds_software$Indeed)
ds_software$SimplyHired <- as.numeric(ds_software$SimplyHired)
ds_software$Monster <- as.numeric(ds_software$Monster)

summary(ds_software)
##    Keyword             LinkedIn         Indeed       SimplyHired     
##  Length:42          Min.   :  206   Min.   :  143   Min.   :  113.0  
##  Class :character   1st Qu.:  326   1st Qu.:  249   1st Qu.:  194.5  
##  Mode  :character   Median :  598   Median :  436   Median :  364.0  
##                     Mean   : 2215   Mean   : 1541   Mean   : 1185.6  
##                     3rd Qu.: 1199   3rd Qu.:  921   3rd Qu.:  708.5  
##                     Max.   :38882   Max.   :27477   Max.   :21204.0  
##                     NA's   :3       NA's   :3       NA's   :3        
##     Monster         LinkedIn..          Indeed..         SimplyHired..     
##  Min.   :   95.0   Length:42          Length:42          Length:42         
##  1st Qu.:  163.5   Class :character   Class :character   Class :character  
##  Median :  303.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 1088.4                                                           
##  3rd Qu.:  681.5                                                           
##  Max.   :19350.0                                                           
##  NA's   :3                                                                 
##   Monster..            Avg..           GlassDoor.Self.Reported...2017
##  Length:42          Length:42          Length:42                     
##  Class :character   Class :character   Class :character              
##  Mode  :character   Mode  :character   Mode  :character              
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##   Difference       
##  Length:42         
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

I repeated the steps above, this time for data science computer language skills, to gain insight into the most useful computer languages within the field. I removed the commas and converted the count columns to numeric just as before. With the improved summary statistics, I have a more accurate view of each job board. For example, for LinkedIn, the minimum number of times that a specific computer language skill appeared was 206. The maximum was 38,882, which comes from the “Total” row rather than from a single language. I will have to remove the missing data just as before.
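Since the same cleaning steps apply to both datasets, they could be wrapped in a small helper. The function name `clean_counts()` and the data frame `toy` are hypothetical; this is a sketch of one way to avoid repeating the loop, not the code used in this project.

```r
# Hypothetical helper: strip thousands separators and convert the given columns to numeric
clean_counts <- function(df, cols) {
  for (col in cols) {
    df[[col]] <- as.numeric(gsub(",", "", df[[col]], fixed = TRUE))
  }
  df
}

# Illustrative input in the same format as the raw CSV columns
toy <- data.frame(
  Keyword = c("Python", "note row"),
  LinkedIn = c("6,347", ""),
  Indeed = c("3,818", "")
)

toy_clean <- clean_counts(toy, c("LinkedIn", "Indeed"))
str(toy_clean)
```

With such a helper, each dataset would need a single call, for example `clean_counts(ds_software, cols_to_remove_commas)`.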

Missing Data

ds_software <- ds_software %>%
  filter(!is.na(LinkedIn))
ds_software <- ds_software %>%
  filter(!is.na(Indeed))
ds_software <- ds_software %>%
  filter(!is.na(SimplyHired))
ds_software <- ds_software %>%
  filter(!is.na(Monster))

sapply(ds_software, function(x) sum(is.na(x)))
##                        Keyword                       LinkedIn 
##                              0                              0 
##                         Indeed                    SimplyHired 
##                              0                              0 
##                        Monster                     LinkedIn.. 
##                              0                              0 
##                       Indeed..                  SimplyHired.. 
##                              0                              0 
##                      Monster..                          Avg.. 
##                              0                              0 
## GlassDoor.Self.Reported...2017                     Difference 
##                              0                              0

Here, I filtered out the missing data just as before.

Graph of Job Boards Combined - Computer Language Skills

ds_software <- ds_software %>% #removing the total so it does not show in the graph
  filter(Keyword != "Total")

combined_data <- ds_software %>%
  pivot_longer(cols = c(LinkedIn, Indeed, SimplyHired, Monster), 
               names_to = "Source", 
               values_to = "Count")

combined_bar_plot <- ggplot(combined_data, aes(x = Keyword, y = Count, fill = Source)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) + #side-by-side bars so no job board is hidden behind another
  labs(title = "Data Science Computer Language Skills Count (2018)", x = "Computer Language Skills", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c(LinkedIn = "blue", Indeed = "green", SimplyHired = "orange", Monster = "red")) #named so each board keeps a consistent color

print(combined_bar_plot)

From this combined plot of data science computer language skills, we can see which job boards contain job descriptions demanding specific computer languages relative to the others. The three peaks for this plot were Python, R, and SQL. I would not include ‘“Data Scientist” alone’ as a peak, because it is not a computer language: it appears to be the number of listings returned by a search for “data scientist” with no additional keyword, which makes it a baseline rather than a skill.
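The percentage columns in the raw table appear to be each keyword count divided by the ‘“data scientist” alone’ count for the same board. This is an inference from the printed table rather than something the dataset states, but a quick check with the LinkedIn figures shown above supports it.

```r
# LinkedIn counts copied from the table printed above
linkedin_counts <- c(Python = 6347, R = 4553, SQL = 3879)

# LinkedIn count in the '"data scientist" alone' row of the same table
linkedin_baseline <- 8610

# Share of that baseline for each language, as a rounded percentage
linkedin_share <- round(100 * linkedin_counts / linkedin_baseline)

print(linkedin_share)
```

The results (74, 53, and 45) match the values in the `LinkedIn..` column for Python, R, and SQL.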

Limitations

Because these are all phrases found in “data science” job postings, I would argue that most, if not all, are important skills for a data scientist to possess; otherwise they would not appear in so many job descriptions. It would be interesting to compare this set of data to a more recent one, or to listings from other job boards. Another limitation is that the terms are broad, so different computer languages could be associated with different terms. The ‘“Data Scientist” alone’ row also does not indicate which computer languages, if any, are associated with those job descriptions.

Conclusion

To conclude, terms associated with various forms of analysis are the top keywords or skills in data scientist job descriptions. For this reason, I would say that analysis, along with related skills such as machine learning, statistics, computer science, and mathematics, is among the top data science skills. Communication is a key factor in data science as well, since it is needed to convey findings from these analyses. For the most part, NLP, or natural language processing, appears to be among the least commonly listed key skills for data scientists. NLP offers something that many other data science skills lack on their own, and that is the ability to work with human language. It is challenging to capture human sentiment in technology, which could be the reason why it ranked below most of the other keywords. It has many limitations, such as understanding sarcasm, irony, or intention in writing. In terms of computer languages, the most commonly used terms across all job boards were Python, R, and SQL. The data science master’s program at CUNY SPS is doing a great job of making sure we are immersed in these computer languages, and others as well.