Introduction

We selected the Data Science Skills option for Project 3. Here, we describe the identification, collection, database creation, and analyses of data to answer the question, “Which are the most valued data science skills?”

We divided this question into two parts—hard (technical) skills and soft skills—and analyzed each separately. We assessed the value of a skill by its frequency of occurrence (i.e., a skill mentioned many times in a dataset is considered highly valued). As an exploratory analysis, we also considered how well skills account for differences in annual compensation.


Hard skills analysis

Data overview

The hard skills analysis is based on the 2023 Stack Overflow Annual Developer Survey, which includes survey responses from more than 90,000 software developers from 185 countries. The survey was fielded from May 8, 2023 to May 19, 2023.

For analysis, we only considered respondents who indicated in the survey that they consented to share their information and answered all the required questions. Approximately 2,000 responses were excluded based on these criteria, leaving 89,184 for analysis.

We downloaded the original CSV data file from Stack Overflow and saved it in our team GitHub repository.

#Load the original .csv file
survey_2023 <- read.csv("https://media.githubusercontent.com/media/alexandersimon1/Data607/main/Project3/survey_results_public.csv", header = TRUE)

Data transformations

We filter the survey to the observations of interest—respondents who identified as a “Data scientist or machine learning specialist” and as “a developer by profession”—so that all survey results reflect our population of interest: professional, working data scientists.

# Filter for the observations of interest
datascientists_2023 <- survey_2023 %>% 
  filter(DevType == "Data scientist or machine learning specialist",
         MainBranch == "I am a developer by profession")


Database design

First, we created the ER diagram and used it to build our logical model. Once the logical model was complete, we used it to create normalized relational tables in PostgreSQL.

The normalization was done considering the following factors:

  1. Variable values that repeated across many observations were normalized into separate lookup tables by creating IDs.

  2. Once the main table was normalized, we could add further grouping attributes to the dimension tables.

We generated ID tables in PostgreSQL containing all the possible answers to each survey topic we wanted to examine. The database was hosted in Google Cloud.
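The sketch below illustrates this setup. It is a minimal sketch only: it assumes the DBI and RPostgres packages, uses placeholder connection details rather than our actual Cloud SQL credentials, and shows one ID table (id_languages) whose columns match those referenced later in the analysis.

# Minimal sketch of the database setup (placeholder host and credentials)
library(DBI)
library(RPostgres)

sql_db <- dbConnect(
  RPostgres::Postgres(),
  host = "<cloud-sql-instance-ip>",   # placeholder
  dbname = "<database-name>",         # placeholder
  user = "<user>",                    # placeholder
  password = "<password>"             # placeholder
)

# One ID (lookup) table per survey topic, e.g. programming languages
dbExecute(sql_db, "
  CREATE TABLE IF NOT EXISTS id_languages (
    language_id SERIAL PRIMARY KEY,
    language    TEXT UNIQUE NOT NULL
  )
")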


Data tidying

The survey data were tidied in R, pushed to pre-made relational tables in PostgreSQL, and then analyzed in R.

# Match df to 'employee' table format in DB
employees_df <- datascientists_2023 %>% 
  mutate(response_id = ResponseId,
         country = Country,
         job_title = as.factor(DevType),
         industry = as.factor(Industry), 
         education = as.factor(EdLevel),
         office_location = as.factor(RemoteWork),
         compensation = as.numeric(ConvertedCompYearly),
         years = as.numeric(YearsCodePro),
        language = LanguageHaveWorkedWith,
        platform = PlatformHaveWorkedWith,
        database = DatabaseHaveWorkedWith,
        collab_tools = NEWCollabToolsHaveWorkedWith,
        ai_tools = str_c(AISearchHaveWorkedWith, AIDevHaveWorkedWith, sep = ";")
  ) %>% 
  select(response_id, 
         country,
         industry,
         education,
         office_location,
         compensation,
         years,
         language,
         platform,
         database,
         collab_tools,
         ai_tools)


Normalize multiple answers

Our variables of interest all came from select-all-that-apply survey questions. To record each answer, each of the five multi-select columns must be split so that there is one answer per row in a relational (junction) table in the database.

We pivot the survey results longer so that every checked answer becomes its own observation. If we kept these long answers in the same table as the respondent attributes, we would duplicate those attributes many times over.

We convert the multiple-choice questions to factors for ease of use later.

We define functions for the steps we run repeatedly. We initially tried to write a single function that performed all the normalization steps for each survey question, but the questions differed too much for one function to handle cleanly.
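As an illustration, a hedged sketch of one such helper is shown below. It assumes each junction table pairs response_id with the topic's ID column (as in employees_languages) and that the ID lookup tables have the columns referenced elsewhere in this document; the function and argument names are illustrative, not the exact ones from our scripts.

# Illustrative helper: split one multi-select column into a long junction table
# keyed by the topic's ID lookup table (names here are illustrative)
normalize_question <- function(df, answer_col, id_lookup, id_col, name_col) {
  df %>%
    select(response_id, answer = all_of(answer_col)) %>%
    separate_longer_delim(answer, delim = ";") %>%               # one answer per row
    drop_na(answer) %>%
    inner_join(id_lookup, by = setNames(name_col, "answer")) %>% # answer -> ID
    select(response_id, all_of(id_col))
}

# Example usage for languages:
# employees_languages_df <- normalize_question(
#   employees_df, "language", language_id, "language_id", "language")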

Write the employees table to the database.

Normalize each survey question we are interested in to its own table.

We automated as much as possible in the function definitions above and repeated these steps for the other tables.

All our relational tables were created successfully and contained clean and tidy data.

We removed the functions that write to the SQL database so our tables don’t get overwritten. They were all variations of dbWriteTable(sql_db, "table_name", table_df, append = TRUE).
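For reference, the removed write step for the employees table and one junction table looked roughly like the sketch below (shown commented out so it is not re-run; employees_languages_df stands in for the junction dataframe produced by the normalization helper sketched above).

# Not evaluated, to avoid overwriting the live tables
# dbWriteTable(sql_db, "employees", employees_df, append = TRUE)
# dbWriteTable(sql_db, "employees_languages", employees_languages_df, append = TRUE)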

This is the entity-relationship diagram of our relational database:

Entity-relationship diagram

and these are the tables:

dbListTables(sql_db)
##  [1] "id_languages"           "id_databases"           "id_ai_tools"           
##  [4] "id_collab_tools"        "id_platforms"           "employees"             
##  [7] "employees_languages"    "employees_databases"    "employees_platforms"   
## [10] "employees_ai_tools"     "employees_collab_tools"


Analyses to answer the project question

Language skills

Python is the most popular language skill for data scientists, followed by SQL, Bash/Shell, JavaScript, HTML/CSS, and R.

# Map language IDs to language names
language_skills <- language_df
language_skills$language_id <- language_id$language[match(language_skills$language_id, language_id$language_id)]

# Count the occurrence of each language, group languages that occur 100 times
# or fewer into "others", and plot the counts in a descending column chart
language_skills %>% 
  count(language_id) %>% 
  mutate(language_id = if_else(n > 100, language_id, "others")) %>% 
  ggplot(aes(x = fct_reorder(language_id, n), y = n)) + 
    geom_col(fill = "lightblue") + 
    coord_flip() +
    labs(title = "Most valuable data science - language skills",
         x = "Languages", 
         y = "Count",
         caption = "https://insights.stackoverflow.com/survey/")


Principal component analysis of the most common languages

We attempted to determine which of the most common languages accounted for most of the variation in annual compensation by using principal component analysis (PCA). Please note that we are not very familiar with this method and relied heavily on a PCA tutorial.

First create the dataframe.

pca_lang_comp <- employees_df %>%
  select(compensation, language) %>%
  drop_na() %>%
  rowwise() %>%
  mutate(
    Python = if_else(grepl("Python", language), 1, 0),
    SQL = if_else(grepl("SQL", language), 1, 0),    
    Bash_Shell = if_else(grepl("Bash\\/Shell", language), 1, 0),
    JavaScript = if_else(grepl("JavaScript", language), 1, 0),    
    HTML_CSS = if_else(grepl("HTML/CSS", language), 1, 0),
    R = if_else(grepl("(^|;)R(;|$)", language), 1, 0)  # match "R" as a standalone item, not Ruby/Rust
  ) %>%
  ungroup() %>%
  subset(select = -c(language))

Remove rows with compensation values not associated with any of the most common languages.

pca_lang_comp <- pca_lang_comp %>%
  filter((Python + SQL + Bash_Shell + JavaScript + HTML_CSS + R) > 0)

Then perform PCA of the data.

res_pca_language <- PCA(pca_lang_comp, graph = FALSE)

HTML_CSS and JavaScript had the highest correlation in the first dimension.

res_desc_language <- dimdesc(res_pca_language, axes = c(1,2), proba = 0.05)
res_desc_language$Dim.1
## 
## Link between the variable and the continuous variables (R-square)
## =================================================================================
##              correlation       p.value
## HTML_CSS      0.83433027 1.284393e-216
## JavaScript    0.77967121 1.045096e-170
## SQL           0.48017272  3.779883e-49
## Bash_Shell    0.42736889  3.190426e-38
## R             0.26301267  1.292587e-14
## Python        0.08603377  1.310255e-02
## compensation  0.06997635  4.373213e-02

The correlation circle shows that HTML_CSS and JavaScript have the highest squared cosine (cos2) values; cos2 measures the quality of representation of a variable on the principal components. Their proximity to the circumference of the circle indicates they are the most important languages accounting for the variation in compensation, followed by R. In contrast, the SQL vector is closer to the center of the circle, which indicates that it is less important. However, it should be noted that dimensions 1 and 2 account for less than half of the variance in the data (Dim1 25.7% + Dim2 17.1% = 42.8%).

fviz_pca_var(res_pca_language, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
             )


Database skills

PostgreSQL is the most popular database for data scientists, followed by MySQL, SQLite, and Microsoft SQL Server.

# Create a database-skills table by mapping database IDs to database names
db_skills <- db_df
db_skills$db_id <- db_id$database[match(db_skills$db_id, db_id$db_id)]

# Filter out rows where the database is NA
db_skills <- subset(db_skills, db_id != "NA")

# Count the occurrence of each database, group databases that occur 100 times
# or fewer into "others", and plot the counts in a descending column chart
db_skills %>% 
  count(db_id) %>% 
  mutate(db_id = if_else(n > 100, db_id, "others")) %>% 
  ggplot(aes(x = fct_reorder(db_id, n), y = n)) + 
    geom_col(fill = "lightblue") + 
    coord_flip() +
    labs(title = "Most valuable data science - database skills",
         x = "Databases",
         y = "Count",
         caption = "https://insights.stackoverflow.com/survey/")


Principal component analysis of the most common databases

We applied PCA to determine which of the most common databases accounted for most of the variation in annual compensation.

First create the dataframe.

pca_db_comp <- employees_df %>%
  select(compensation, database) %>%
  drop_na() %>%
  rowwise() %>%
  mutate(
    PostgreSQL = if_else(grepl("PostgreSQL", database), 1, 0),
    MySQL = if_else(grepl("MySQL", database), 1, 0),    
    SQLite = if_else(grepl("SQLite", database), 1, 0),
    MS_SQL = if_else(grepl("Microsoft SQL server", database), 1, 0)
  ) %>%
  ungroup() %>%
  subset(select = -c(database))

Remove rows with compensation values not associated with any of the most common databases.

pca_db_comp <- pca_db_comp %>%
  filter((PostgreSQL + MySQL + SQLite + MS_SQL) > 0)

Then perform PCA of the data.

res_pca_db <- PCA(pca_db_comp, graph = FALSE)

MySQL had the highest positive correlation with the first dimension, while PostgreSQL had the strongest negative correlation.

res_desc_db <- dimdesc(res_pca_db, axes = c(1,2), proba = 0.05)
res_desc_db$Dim.1
## 
## Link between the variable and the continuous variables (R-square)
## =================================================================================
##              correlation       p.value
## <NA>                  NA            NA
## MySQL          0.6719540  1.946531e-69
## SQLite         0.4148056  5.313841e-23
## compensation  -0.1145160  9.023583e-03
## PostgreSQL    -0.8123305 3.643191e-123

The correlation circle shows that three databases (PostgreSQL, SQLite, and MySQL) had high cos2 values, and their proximity to the circumference of the circle indicates they are the most important databases accounting for the variation in compensation. In contrast, Microsoft SQL Server was very close to the center, which indicates that it is not very important. However, it should be noted that dimensions 1 and 2 only account for about 59% of the variance in the data (Dim1 32.4% + Dim2 26.8% = 59.2%).

fviz_pca_var(res_pca_db, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
             )


Collaboration tool skills

Visual Studio Code is the most popular collaboration tool for data scientists, followed by Jupyter Notebook/Lab, PyCharm, IPython, Vim, Notepad++, and RStudio.

# Create a collaboration-tool skills table by mapping tool IDs to tool names
tool_skills <- tools_df
tool_skills$tool_id <- tool_id$tool[match(tool_skills$tool_id, tool_id$tool_id)]

# Count the occurrence of each tool, group tools that occur 100 times or fewer
# into "others", and plot the counts in a descending column chart
tool_skills %>% 
  count(tool_id) %>% 
  mutate(tool_id = if_else(n > 100, tool_id, "others")) %>% 
  ggplot(aes(x = fct_reorder(tool_id, n), y = n)) + 
    geom_col(fill = "lightblue") + 
    coord_flip() +
    labs(title = "Most valuable data science - tool skills",
         x = "Tools",
         y = "Count",
         caption = "https://insights.stackoverflow.com/survey/")


Principal component analysis of the most common collaboration tools

We applied PCA to determine which of the most common collaboration tools accounted for most of the variation in annual compensation.

First create the dataframe.

pca_tools_comp <- employees_df %>%
  select(compensation, collab_tools) %>%
  drop_na() %>%
  rowwise() %>%
  mutate(
    Visual_Studio = if_else(grepl("Visual Studio", collab_tools), 1, 0),
    Jupyter = if_else(grepl("Jupyter", collab_tools), 1, 0),    
    PyCharm = if_else(grepl("PyCharm", collab_tools), 1, 0),
    IPython = if_else(grepl("IPython", collab_tools), 1, 0),
    Vim = if_else(grepl("Vim", collab_tools), 1, 0),
    Notepad = if_else(grepl("Notepad", collab_tools), 1, 0),
    RStudio = if_else(grepl("RStudio", collab_tools), 1, 0)    
  ) %>%
  ungroup() %>%
  subset(select = -c(collab_tools))

Remove rows with compensation values not associated with any of the most common tools.

pca_tools_comp <- pca_tools_comp %>%
  filter((Visual_Studio + Jupyter + PyCharm + IPython + Vim + Notepad + RStudio) > 0)

Then perform PCA of the data.

res_pca_tools <- PCA(pca_tools_comp, graph = FALSE)

IPython and Jupyter had the highest correlations in the first dimension.

res_desc_tools <- dimdesc(res_pca_tools, axes = c(1,2), proba = 0.05)
res_desc_tools$Dim.1
## 
## Link between the variable and the continuous variables (R-square)
## =================================================================================
##               correlation       p.value
## IPython         0.7354539 3.066092e-139
## Jupyter         0.7228838 2.448742e-132
## Vim             0.3143727  4.161017e-20
## Notepad         0.2810555  3.171411e-16
## PyCharm         0.2689083  6.203581e-15
## compensation    0.2597899  5.247165e-14
## RStudio         0.1537817  1.061362e-05
## Visual_Studio   0.1106642  1.576378e-03

The correlation circle shows that IPython and Jupyter have the highest cos2 values and their proximity to the circumference of the circle indicates they are the most important collaboration tools accounting for variation in annual compensation, followed by PyCharm and Visual Studio. In contrast, the RStudio and Vim vectors are closer to the center of the circle, which indicates that they are less important. However, it should be noted that dimensions 1 and 2 only account for 35% of the variance in the data (Dim1 17.7% + Dim2 16.8% = 34.5%).

fviz_pca_var(res_pca_tools, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
             )


Additional exploratory data analyses

Countries

Data scientists in the dataset were from 91 countries.

survey_countries <- employees_df %>%
  group_by(country) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  rename(region = country)
library(ggplot2)
library(maps)
library(ggmap)
mapdata <- map_data("world")
mapdata <- mapdata %>%
  mutate(
    # Align map_data country names with those from survey
    region = str_replace(region, "Hong Kong", "Hong Kong (S.A.R.)"),
    region = str_replace(region, "Iran", "Iran, Islamic Republic of"),
    region = str_replace(region, "North Macedonia", "The former Yugoslav Republic of Macedonia"),    
    region = str_replace(region, "Russia", "Russian Federation"),
    region = str_replace(region, "UK", "United Kingdom of Great Britain and Northern Ireland"),
    region = str_replace(region, "USA", "United States of America"),    
    region = str_replace(region, "Vietnam", "Viet Nam")
  )
map_surveys <- left_join(mapdata, survey_countries, by = "region")

The map shows that although survey responses were received from 91 countries, most responses were from the US. Countries without any survey responses are shown in gray.

ggplot(map_surveys, aes(x = long, y = lat, group=group)) +
  coord_fixed(1.3) +
  geom_polygon(aes(fill = n)) +
  scale_fill_distiller(palette = "RdBu", direction = -1, name = "n") +
  theme_void() + 
  theme(legend.position = c(0.1, 0.35), legend.title = element_text(hjust = 0.15))


Education level

survey_education <- employees_df %>%
  group_by(education) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  mutate(
    education = str_replace(education, "Professional.*", "Advanced"), 
    education = str_replace(education, "Master.*", "Master's"),     
    education = str_replace(education, "Bachelor.*", "Bachelor"),    
    education = str_replace(education, "Associate.*", "Associate"),    
    education = str_replace(education, "Some college.*", "Some college"),
    education = str_replace(education, "Secondary.*", "High school"),
    education = str_replace(education, "Primary.*", "Primary")    
  )

Most data scientists had a master’s degree.

ggplot(survey_education, aes(x = reorder(education, n), y = n)) +
  geom_col() + 
  coord_flip() +
  xlab("Highest degree") + theme(axis.title = element_text(face = "bold")) +
  scale_y_continuous(breaks = seq(0, 700, by = 100)) 


Industry type

survey_industry <- employees_df %>%
  drop_na(industry) %>%
  group_by(industry) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  mutate(
    industry = str_replace(industry, "Information.*", "IT"),
    industry = str_replace(industry, "Financial.*", "Finance"),
    industry = str_replace(industry, "Manufacturing.*", "Manufacturing"),
    industry = str_replace(industry, "Retail.*", "Retail"),
    industry = str_replace(industry, "Advertising.*", "Advertising"),
    industry = str_replace(industry, "Higher.*", "Education"),
    industry = str_replace(industry, "Legal.*", "Legal")
  )

Data scientists were mostly employed in IT. The next most common fields (other than “Other”) were finance, manufacturing, and healthcare.

ggplot(survey_industry, aes(x = reorder(industry, n), y = n)) +
  geom_bar(stat = 'identity') + 
  coord_flip()  +
  xlab("Industry") + theme(axis.title = element_text(face = "bold")) +
  scale_y_continuous(breaks = seq(0, 300, by = 50)) 


Annual compensation

The median annual compensation was 79,581 USD.

survey_compensation <- employees_df %>%
  select(compensation) %>%
  drop_na()

survey_compensation %>%
  summarise(
    min = min(compensation),
    max = max(compensation),
    mean = mean(compensation),
    SD = sd(compensation),
    median = median(compensation),
    IQR = IQR(compensation)
  ) %>%
  kbl() %>% 
  kable_material()
   min        max     mean         SD   median       IQR
    20  9,000,000  117,833  329,814.7   79,581  84,840.5

The distribution is unimodal and skewed to the right by some very large values. To improve the visualization, values >1,000,000 USD were aggregated into an outlier bin in the histogram.

survey_compensation <- survey_compensation %>%
  mutate(
    compensation_adjusted = if_else(compensation > 1000000, 1000000, compensation)
  )

ggplot(survey_compensation, aes(x = compensation_adjusted)) +
  geom_histogram(binwidth = 5000) +
  xlab("Compensation (USD)") + theme(axis.title = element_text(face = "bold")) +  
  scale_x_continuous(breaks = c(0, 250000, 500000, 750000, 1000000), 
                     labels = c("0", "250000", "500000", "750000", ">1000000"))


Annual compensation vs years of experience

survey_comp_experience <- employees_df %>%
  select(compensation, years) %>%
  drop_na()

Due to the large number of overlapping data points, a hexagonal heatmap was used to show the relationship between years of experience and annual compensation. In the heatmap, data points are binned into hexagons, which are filled with a color gradient corresponding to the number of data points in each hexagon. The red hexagons show that most data scientists who completed the survey had <5 years of experience and earned <100,000 USD.

Surprisingly, greater experience was not associated with higher compensation. Some data scientists with <15 years of experience earned large salaries, and those with >20 years of experience earned amounts similar to those who had just entered the field. This suggests that factors other than years of experience influence the compensation of data scientists.

ggplot(survey_comp_experience, aes(x = years, y = compensation)) +
  stat_binhex(bins = 50) +
  scale_fill_gradient(low = "lightblue", high = "red") +
  ylim(0, 1000000) + 
  xlab("Years of experience") + ylab("Compensation (USD)") + 
  theme(axis.title = element_text(face = "bold"))


Industry vs annual compensation

Omit rows with missing information, shorten industry names, then group the data by industry.

industry_comp <- employees_df %>%
  select(industry, compensation) %>%
  drop_na() %>%
  mutate(
    industry = str_replace(industry, "Information.*", "IT"),
    industry = str_replace(industry, "Financial.*", "Finance"),
    industry = str_replace(industry, "Manufacturing.*", "Manufacturing"),
    industry = str_replace(industry, "Retail.*", "Retail"),
    industry = str_replace(industry, "Advertising.*", "Advertising"),
    industry = str_replace(industry, "Higher.*", "Education"),
    industry = str_replace(industry, "Legal.*", "Legal")
  ) %>%  
  group_by(industry)  

The fields with the highest median compensation for data scientists were insurance, finance, and legal, which were all more than 100,000 USD. Surprisingly, IT, which has the largest proportion of data scientists in the survey, is associated with the third lowest median compensation (~64,000 USD).

industry_comp %>%
  summarise(
    median_compensation = median(compensation)
  ) %>%
  arrange(desc(median_compensation)) %>%
  kbl() %>%
  kable_material() %>%
  scroll_box(width = "75%", height = "500px")
industry median_compensation
Insurance 107090.0
Finance 106912.0
Legal 103800.0
Manufacturing 88348.5
Education 85935.0
Advertising 84028.0
Healthcare 83530.0
Other 80317.0
Retail 77399.0
IT 64254.0
Wholesale 48238.5
Oil & Gas 42836.0


The distribution of annual compensation by industry is shown below. Note that outliers are not plotted. Some industries, such as oil & gas and healthcare, had a relatively large spread (IQR) in compensation, while others, such as insurance and legal, had smaller spreads.

industry_comp %>%
  group_by(industry) %>%
  ggplot(aes(x = compensation, group = industry, color = industry)) +
  geom_boxplot(outlier.shape = NA) +
    labs(x = "Annual compensation (USD)") +
    theme(axis.title = element_text(face = "bold"), 
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank(),
          strip.text.x = element_text(size = 8)) +  
    scale_x_continuous(breaks = c(0, 50000, 100000), 
                       limits = c(0, 100000),
                       labels = scales::scientific) +
  facet_wrap(~ industry) +
    theme(legend.position = "none") +
    theme(panel.spacing.x = unit(1, "lines"))


Language vs annual compensation

Convert the language variable to a long format by separating the semicolon-delimited string into individual languages. Then group the data by language.

language_comp <- employees_df %>%
  select(language, compensation) %>%
  drop_na() %>%
  separate_longer_delim(language, delim = ";") %>%
  group_by(language)  

COBOL was associated with the highest median compensation, which was surprising because it is a very old language. It is possible that knowledge of and proficiency in COBOL are rare nowadays and therefore highly valued by companies with legacy systems.

The median annual compensation associated with the languages commonly used in data science (e.g., Python, R, SQL) was lower than expected—around 80,000 USD.

language_comp %>%
  summarise(
    median_compensation = median(compensation)
  ) %>%
  arrange(desc(median_compensation)) %>%
  kbl() %>%
  kable_material() %>%
  scroll_box(width = "75%", height = "500px")
language median_compensation
Cobol 753545.0
Erlang 253507.5
Zig 195344.0
F# 192939.5
Elixir 190000.0
Apex 146484.5
Delphi 130649.0
Lisp 130649.0
Perl 130000.0
Go 110000.0
Rust 107090.0
Clojure 103800.0
Ruby 103190.5
Nim 99244.5
Scala 98045.5
Crystal 92939.0
TypeScript 92679.5
Raku 90000.0
Julia 89955.0
Fortran 85935.0
Bash/Shell (all shells) 85672.0
R 85000.0
Lua 81787.0
SQL 80000.0
Python 79714.0
SAS 77958.5
HTML/CSS 77603.0
JavaScript 77275.0
C++ 77104.0
C# 75312.5
C 73862.0
PowerShell 72761.0
Swift 72468.0
GDScript 72202.5
Groovy 69608.0
Kotlin 66036.0
Java 64254.0
Dart 61041.0
VBA 59892.0
MATLAB 59481.0
Objective-C 58546.5
OCaml 57973.5
Assembly 56705.5
PHP 54080.5
Visual Basic (.Net) 54080.5
Ada 53545.0
Haskell 52138.0
Prolog 39623.0
Solidity 21418.0


The distribution of annual compensation by language is shown below. Note that outliers are not plotted.

language_comp %>%
  group_by(language) %>%
  ggplot(aes(x = compensation, group = language, color = language)) +
  geom_boxplot(outlier.shape = NA) +
    labs(x = "Annual compensation (USD)") +
    theme(axis.title = element_text(face = "bold"), 
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank(),
          strip.text.x = element_text(size = 8)) +  
    scale_x_continuous(breaks = c(0, 100000, 200000), limits = c(0, 250000)) +
  facet_wrap(~ language, ncol = 4) +
    theme(legend.position = "none") +
    theme(panel.spacing.x = unit(1, "lines"))  


Soft skills analysis

This analysis was separate from the hard skills analysis and was not included in the relational database because data were scarce and this analysis was initially exploratory.

Data

Soft skills data were collected from two different sources.

  1. Informational websites that ranked the most important soft skills for data scientists. A total of 65 websites were identified from the following Google searches (URLs for all sites are in the data file):

    • “soft skills” “data science”

    • companies that hire data scientists “soft skills”

    • data scientist recruit “soft skills”

  2. Kaggle dataset of 12,217 data science-related job postings on LinkedIn during 2024. These job postings included soft skills that employers considered important for the job.

All of the original data were CSV files saved to our team GitHub repository.

soft_skills_raw <- read_csv('https://media.githubusercontent.com/media/evelynbartley/Project-3-Data-607/main/data/data_science_soft_skills.csv', show_col_types = FALSE)
job_postings_raw <- read_csv('https://media.githubusercontent.com/media/evelynbartley/Project-3-Data-607/main/data/job_postings_all.csv', show_col_types = FALSE)


Data transformations

Data from informational websites

soft_skills <- soft_skills_raw %>%
  select(Company, Author_name_last, Author_name_first, Field, Audience, Skill, Rank, Date_publish)

No other transformations were necessary with this dataframe because long data formatting and tidy data principles were applied at the time of data collection.

The dataframe looks like this:

glimpse(soft_skills)
## Rows: 345
## Columns: 8
## $ Company           <chr> "KD Nuggets", "KD Nuggets", "KD Nuggets", "KD Nugget…
## $ Author_name_last  <chr> NA, NA, NA, NA, "Berge", "Berge", "Berge", "Berge", …
## $ Author_name_first <chr> NA, NA, NA, NA, "Eirik", "Eirik", "Eirik", "Eirik", …
## $ Field             <chr> "Blog", "Blog", "Blog", "Blog", "Blog", "Blog", "Blo…
## $ Audience          <chr> "General", "General", "General", "General", "General…
## $ Skill             <chr> "Communication", "Adaptability", "Collaboration", "C…
## $ Rank              <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4…
## $ Date_publish      <date> 2023-11-22, 2023-11-22, 2023-11-22, 2023-11-22, 202…

Data from job postings

Select relevant columns and convert case
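The chunk for this step is not shown here; the following is a minimal sketch of what it presumably does, assuming the raw file already carries the job_title, job_level, and job_skills columns seen in the glimpse further down.

# Assumed selection/case-conversion step (sketch): keep the title, level, and
# skills columns and convert them to upper case for consistent matching
job_postings <- job_postings_raw %>%
  select(job_title, job_level, job_skills) %>%
  mutate(across(everything(), str_to_upper))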

The data were limited to jobs with “Data Scientist” in the job title. A total of 809 job postings met this criterion.

job_postings <- job_postings %>%
  filter(grepl("DATA SCIENTIST", job_title))

nrow(job_postings)
## [1] 809

The dataframe looks like this:

glimpse(job_postings)
## Rows: 809
## Columns: 3
## $ job_title  <chr> "DATA SCIENTIST", "DATA SCIENTIST", "SR. DATA SCIENTIST", "…
## $ job_level  <chr> "MID SENIOR", "MID SENIOR", "MID SENIOR", "MID SENIOR", "MI…
## $ job_skills <chr> "DATA SCIENCE, MACHINE LEARNING, ARTIFICIAL INTELLIGENCE, D…


Analyses

Informational websites

Finding #1: Communication is the most common (valued) soft skill in websites about important soft skills for data scientists

top_skills_overall <- soft_skills %>%
  group_by(Skill) %>%
  tally() %>%
  arrange(desc(n))


We show this with 2 different visualizations. The first is a word cloud, which scales the font size of words to their frequency in a dataset. Of note, word clouds work best with single words, so skills described by more than one word were reworded with a synonym or hyphenated.

Communication is the biggest word in the cloud, which means that it is the most common skill in the dataset. On the other hand, skills with small font sizes, such as intuition and empathy, were not very common.

set.seed(1234)
wordcloud(words = top_skills_overall$Skill, 
          freq = top_skills_overall$n, min.freq = 3,
          max.words = 50, random.order = FALSE, rot.per = 0,
          colors = brewer.pal(5, "Dark2"))

The word cloud is visually appealing but isn’t quantitative, so we also show the frequency of the soft skills with a barplot. Only skills that occurred more than twice in the dataset are shown.

ggplot(filter(top_skills_overall, n > 2), aes(x = reorder(Skill, n), y = n)) +
  geom_bar(stat = 'identity') + 
  coord_flip() +  
  xlab("Soft skill") + ylab("Count") + theme(axis.title = element_text(face = "bold")) +
  scale_y_continuous(breaks = seq(0, 60, by = 10))


Finding #2: Communication has consistently been the most common (valued) soft skill over time

Group data by year, drop rows without year info, and keep only skills that occur more than twice.

top_skills_year <- soft_skills %>%
  group_by(year = lubridate::year(Date_publish), Skill) %>%
  tally() %>%
  drop_na(year) %>%
  filter(n > 2) %>%
  arrange(desc(n))

The stacked barplot shows that communication has been the most frequent soft skill since 2021. In addition, the number of skills of interest has increased each year since 2021, which may indicate that the data science field is becoming increasingly competitive.

ggplot(top_skills_year, aes(x = factor(year), y = n, fill = Skill)) +
  geom_bar(stat = 'identity', position = 'fill') +
  xlab("Year") + ylab("Proportion") + theme(axis.title = element_text(face = "bold")) +
  labs(caption = "Data as of March 2024")


Finding #3: The most common (valued) soft skills vary by target audience

Group data by target audience and skill, then calculate frequencies. To prevent the barplot from being too crowded, only skills that occurred more than 3 times were included in this analysis.

top_skills_audience <- soft_skills %>%
  group_by(Audience, Skill) %>%
  summarise(n = n()) %>%
  filter(n > 3) %>%  
  mutate(freq = n / sum(n)) %>%
  arrange(desc(freq))

The stacked barplot shows that websites targeted at students mention the largest number of distinct soft skills, whereas websites targeted at job seekers mention the fewest. Websites targeted at employees and companies (generally human resources departments) are in between. These results suggest that students are encouraged to cultivate many skills, whereas graduates (i.e., job seekers or employees) are encouraged to hone a few specific skills.

The difference in skills between employees and companies makes intuitive sense. Employees, who are most likely working on problems and interacting with colleagues, need to have problem-solving and analytical skills. On the other hand, companies that are seeking growth and profit would benefit from job candidates who are curious and have entrepreneurial skills.

But the most common skill across all audiences is communication. This also makes sense because it is a general skill and not much can be achieved without it.

order <- c("General", "Students", "Job seekers", "Employees", "Companies")
ggplot(top_skills_audience, aes(x = Audience, y = freq, fill = Skill)) +
  geom_col(position = "fill") +
  scale_x_discrete(limits = order) +
  xlab("Target audience") + ylab("Proportion") + 
  theme(axis.title = element_text(face = "bold"))


Job postings

We first need to define the soft skills of interest. This was done by using the soft skills identified by the web search above and adding some common synonyms.

soft_skills_terms <- c("ACCOUNTABILITY", "ADAPTABILITY|FLEXIBILITY|RESOURCEFUL", 
                       "ANALYTICAL|CRITICAL|LOGIC", 
                       "ATTENTION TO DETAIL", 
                       "ACUMEN|BUSINESS INTELLIGENCE|ENTREPRENEUR",
                       "COLLABORATION|TEAM", "COMMUNICATION|PRESENTATION|STORY", 
                       "CREATIVITY|INNOVATION", 
                       "CURIOSITY",
                       "EMOTIONAL INTELLIGENCE|EMPATHY", 
                       "INDEPENDENCE|INITIATIVE|MOTIVAT", "INTEGRITY|CREDIBLE|TRUSTWORTHY", 
                       "INTERPERSONAL", "INTUITION",
                       "JUDGMENT",
                       "LEADERSHIP", "LISTEN",
                       "ORGANIZATION|PLANNING", "OWNERSHIP",
                       "PASSION", "PERSUASION", "PERSEVERANCE|PERSISTANCE",
                       "PROBLEM", "PROFESSIONALISM", 
                       "RELIABILITY", 
                       "MULTITASKING|TIME MANAGEMENT|PRIORITIZATION")

Now we can count the number of postings with these skills.

soft_skills_freq <- as_tibble(soft_skills_terms)
soft_skills_freq <- soft_skills_freq %>%
  rename(Skill = value) %>%
  rowwise() %>%
  mutate(
    n = sum(grepl(Skill, job_postings$job_skills))
  ) %>%
  mutate(
    Skill = str_replace(Skill, "([A-Z]*)[ \\|].*", "\\1"),
    Skill = str_to_title(Skill),
    Skill = str_replace(Skill, "Acumen", "Entrepreneurial"),
    Skill = str_replace(Skill, "Emotional Intelligence", "EQ"),
    Skill = str_replace(Skill, "Problem", "Problem-solving")    
  ) %>%  
  arrange(desc(n))


Communication is the most common (valued) soft skill in data scientist job postings

As before, we show this with a word cloud and a barplot.

set.seed(1234)
wordcloud(words = soft_skills_freq$Skill, 
          freq = soft_skills_freq$n, min.freq = 3,
          max.words = 50, random.order = FALSE, rot.per = 0,
          colors = brewer.pal(5, "Dark2"))

ggplot(filter(soft_skills_freq, n > 2), aes(x = reorder(Skill, n), y = n)) +
  geom_bar(stat = 'identity') + 
  coord_flip() +  
  xlab("Soft skill") + ylab("Count") + theme(axis.title = element_text(face = "bold")) +
  scale_y_continuous(breaks = seq(0, 400, by = 50))

Overall, these results look similar to the website analysis. In both datasets, communication and collaboration were the #1 and #2 most common soft skills. Is there a correlation between the skill rankings in the two datasets? Let’s find out!


Correlation between skill rankings from informational websites vs job postings

There is a strong correlation between the rankings of the most common (valued) soft skills for data scientists in informational websites and job postings

To evaluate the correlation, we first combined the rankings from the two dataframes.

# Rankings from job postings
soft_skills_freq <- soft_skills_freq %>%
  mutate(
    rank_job_postings = which(soft_skills_freq$Skill == Skill)
  )
# Rankings from soft skills websites
top_skills_overall <- top_skills_overall %>%
  mutate(
    rank_websites = which(top_skills_overall$Skill == Skill)
  )
# Combine rankings
rankings <- full_join(soft_skills_freq, top_skills_overall, by = "Skill")
rankings <- rankings %>%
  select(Skill, rank_job_postings, rank_websites) %>%
  drop_na()

Then we plotted the rankings with a scatterplot and a line of best fit. The Pearson correlation coefficient between the two rankings was 0.79, which indicates a relatively strong association.

coeff <- round(cor(rankings$rank_job_postings, rankings$rank_websites, 
                   method = c("pearson")), 2)
ggplot(rankings, aes(x = rank_job_postings, y = rank_websites)) +
  geom_point() +
  geom_smooth(method=lm, se=FALSE) +
  xlab("Ranking in job postings") + ylab("Ranking on informational websites") +
    theme(axis.title = element_text(face = "bold")) +
  geom_text(x = 20, y = 32,
            label = paste0('r = ', coeff),
            color = 'blue')  

The top 5 soft skills are:

head(rankings, 5)
## # A tibble: 5 × 3
##   Skill           rank_job_postings rank_websites
##   <chr>                       <int>         <int>
## 1 Communication                   1             1
## 2 Collaboration                   2             2
## 3 Problem-solving                 3             6
## 4 Analytical                      4             3
## 5 Entrepreneurial                 5             4

In summary, we found similar results from two different data sources, informational websites and job postings. This dual-source approach strengthens the conclusion that the most valued soft skills for data scientists are communication and collaboration, followed by problem-solving, analytical thinking, and an entrepreneurial mindset.


Overall conclusions

We successfully identified data sources to address the project question and designed and implemented a relational database in Google Cloud to store and access the data. Our main findings from our exploratory data analyses are summarized below.

1. Technical (hard) skills analysis: Python was the most popular language among professional data scientists, PostgreSQL the most popular database, and Visual Studio Code the most popular collaboration tool. In the exploratory analyses, compensation was not strongly related to years of experience, and the highest median compensation was in the insurance, finance, and legal industries.

2. Soft skills analysis: Communication and collaboration were the most frequently cited soft skills in both informational websites and data scientist job postings, followed by problem-solving, analytical thinking, and an entrepreneurial mindset, and the skill rankings from the two sources were strongly correlated (r = 0.79).