In this project, we set out to identify the most valuable skills for data science jobs. To do so, we take data from online sources such as Kaggle and build a data set of the skills mentioned in job postings. We then parse the data into data frames and plot it to support our analysis. In the end, we should have a clearer picture of which skills the data science job market demands, so that we can prioritize them in our own skill sets.
First Data Source
For the first data source, we used a Kaggle dataset on data scientist skills.
We created a data frame from the first data source by reading in the .csv file.
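The report does not show the actual file name or import call, so the following is only a minimal sketch of reading a .csv into a data frame with readr. The inline text, the company names, and the object name ds_raw are hypothetical stand-ins for the Kaggle file and its contents.

```r
library(readr)

# Hypothetical stand-in for the Kaggle .csv file (toy rows, not real data)
csv_text <- paste(
  'Company Name,Location,Skills',
  'Company A,"Albuquerque, NM",Python',
  'Company B,"Arlington, VA",Excel',
  sep = "\n"
)

# I() marks the string as literal data rather than a file path;
# with a real file this would be read_csv("path/to/file.csv")
ds_raw <- read_csv(I(csv_text), show_col_types = FALSE)
ds_raw
```

Column names containing spaces, such as Company Name, are kept as-is and can be referenced with backticks.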
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Next, we added a column named Job Title and assigned a value of Data Scientist to each of the rows because the data source is about data scientist jobs.
# A tibble: 6 × 4
`Job Title` `Company Name` Location Skills
<chr> <chr> <chr> <chr>
1 Data Scientist "Tecolote Research\n3.8" Albuque… Python
2 Data Scientist "Tecolote Research\n3.8" Albuque… Excel
3 Healthcare Data Scientist "University of Maryland Medical Sys… Linthic… Python
4 Data Scientist "KnowBe4\n4.8" Clearwa… Python
5 Data Scientist "KnowBe4\n4.8" Clearwa… Spark
6 Data Scientist "KnowBe4\n4.8" Clearwa… Excel
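The step of adding a constant Job Title column can be sketched with dplyr's mutate(). The toy data frame and the names ds_toy and ds_with_title are assumptions for illustration, not the objects used in the report.

```r
library(dplyr)
library(tibble)

# Toy rows standing in for the first data source (hypothetical values)
ds_toy <- tibble(
  `Company Name` = c("Company A", "Company B"),
  Skills = c("Python", "Excel")
)

# Assign the same job title to every row, then move the new column to the front
ds_with_title <- ds_toy %>%
  mutate(`Job Title` = "Data Scientist") %>%
  relocate(`Job Title`)

ds_with_title
```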
Next, to keep only relevant listings, we filtered jobs by whether the Job Title contains the word "data". We did this with the str_detect function from the stringr library.
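As a rough illustration of that filter, the sketch below keeps rows whose Job Title contains "data". The toy job titles are made up, and matching case-insensitively is an assumption; the report does not say whether case was ignored.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy listings (hypothetical) with a mix of relevant and irrelevant titles
jobs_toy <- tibble(
  `Job Title` = c("Data Scientist", "Healthcare Data Scientist",
                  "Software Engineer", "data analyst"),
  Skills = c("Python", "Python", "Java", "Excel")
)

# Keep rows whose title contains "data"; ignore_case = TRUE is an assumption
relevant_jobs <- jobs_toy %>%
  filter(str_detect(`Job Title`, regex("data", ignore_case = TRUE)))

relevant_jobs
```

Dropping the regex() wrapper and passing "data" directly would make the match case-sensitive, which would exclude titles such as "Data Scientist".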
combined_df %>%
  count(Skills, sort = TRUE) %>%
  head(10) %>%
  ggplot(aes(x = reorder(Skills, n), y = n, fill = Skills)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = n), hjust = -0.05, size = 4, color = "black") +
  coord_flip() +
  labs(title = "Top 10 Most Demanded Skills", x = "Skill", y = "Frequency") +
  theme_minimal()
Based on the plot above for the combined data set, we can see the 10 most in-demand skills. Python is the clear leader, with Excel and Spark in second and third place, respectively.
Data Science Skills Clustering by Category
library(ggplot2)
library(dplyr)

# Let's Define the 2026 Data Science Skills Framework
skill_framework <- data.frame(
  Skills = c("AI", "ML", "DL", "NLP", "Pytorch", "Tensorflow", "Keras",
             "AWS", "Azure", "Docker", "Spark", "Hadoop",
             "Python", "R", "SQL", "Statistics", "Analytics", "Visualization",
             "Matlab", "SAS", "Excel", "Optimization"),
  Cluster = c(rep("Agentic & Advanced AI", 7),
              rep("Cloud & Infrastructure", 5),
              rep("Core Data Foundations", 10))
)

# 1. Aggregate frequencies from the raw combined data
# This ensures we have both the "Skills" and "Frequency" columns
skill_counts <- combined_df %>%
  group_by(Skills) %>%
  summarise(Frequency = n(), .groups = "drop")

# 2. Join with the 2026 Skill Framework to add the Cluster variable
plot_ready_data <- skill_counts %>%
  left_join(skill_framework, by = "Skills") %>%
  # Remove any skills not categorized in our 2026 clusters
  filter(!is.na(Cluster))

# 3. Run the Visualization
ggplot(plot_ready_data, aes(x = reorder(Skills, Frequency), y = Frequency, fill = Cluster)) +
  geom_bar(stat = "identity", alpha = 0.85) +
  geom_text(aes(label = Frequency), hjust = -0.2, size = 3.5, color = "gray30") +
  coord_flip() +
  facet_grid(Cluster ~ ., scales = "free_y", space = "free_y") +
  scale_fill_manual(values = c(
    "Agentic & Advanced AI" = "#E63946",
    "Cloud & Infrastructure" = "#1D3557",
    "Core Data Foundations" = "#A8DADC"
  )) +
  # Use expansion to ensure the numeric labels fit on the screen
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Valuable Skills Demand By Category",
    subtitle = "Strategic clustering of data science requirements ",
    x = "Specific Skill",
    y = "Frequency in Dataset"
  ) +
  theme_minimal(base_size = 13) +
  theme(strip.text.y = element_text(angle = 0, face = "bold"),
        legend.position = "none")
Based on this chart, core foundational skills such as Python and Excel remain the most frequent requirements in job postings, while the postings also show demand for the more specialized Agentic & Advanced AI and Cloud & Infrastructure clusters, which suggests the market is pivoting toward them.
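One way to compare the clusters directly, rather than skill by skill, is to total the frequencies per cluster. The sketch below does this on made-up counts; the numbers, the uncategorized skill "Tableau", and the object names are illustrative assumptions and not results from the report.

```r
library(dplyr)

# Toy frequencies (made up for illustration only)
skill_counts_toy <- data.frame(
  Skills = c("Python", "Excel", "AWS", "ML", "Tableau"),
  Frequency = c(50, 30, 10, 15, 5)
)

# A small subset of the cluster framework defined in the report
framework_toy <- data.frame(
  Skills = c("Python", "Excel", "AWS", "ML"),
  Cluster = c("Core Data Foundations", "Core Data Foundations",
              "Cloud & Infrastructure", "Agentic & Advanced AI")
)

cluster_totals <- skill_counts_toy %>%
  left_join(framework_toy, by = "Skills") %>%
  filter(!is.na(Cluster)) %>%                       # drops the uncategorized skill
  group_by(Cluster) %>%
  summarise(Total = sum(Frequency), .groups = "drop") %>%
  mutate(Share = Total / sum(Total)) %>%            # each cluster's share of all counts
  arrange(desc(Total))

cluster_totals
```

Note that a single snapshot of totals like this describes current demand only; judging a shift over time would need the same totals computed per year.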
Skills Valued in 2020 vs 2024
# Add year columns to final data frames for both sources
df1_final <- ds_final %>%
  mutate(Year = "2020")
df2_final <- glassdoor_final %>%
  mutate(Year = "2024")

# Find skills shared between both data sources
shared_skills <- intersect(df1_final$Skills, df2_final$Skills)

# Create new data frame with years & only keeping the shared skills
dfForPlot <- bind_rows(df1_final, df2_final) %>%
  filter(Skills %in% shared_skills)

# Create another new data frame with proportions
dfForPlotProp <- dfForPlot %>%
  count(Year, Skills) %>%
  group_by(Year) %>%
  mutate(prop = n / sum(n))

# Plot
ggplot(dfForPlotProp, aes(x = reorder(Skills, prop), y = prop, fill = Year)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(title = "Proportion of Jobs Requiring Skills (2020 vs 2024)",
       x = "Skill",
       y = "Proportion of Job Listings") +
  scale_fill_manual(values = c("2020" = "orchid", "2024" = "lightpink")) +
  geom_text(aes(label = round(prop, 2)),
            position = position_dodge(width = 0.9),
            hjust = -0.1,
            size = 3)
The plot shows the skills found in both data sets and the proportion associated with each. The first data source is from 2020, while the second is from 2024. Because prop is computed as n / sum(n) within each year, each value is a skill's share of the shared-skill mentions in that year, and it serves here as a proxy for the proportion of jobs requiring that skill. In both years, about half of the jobs required Python. The proportion for Spark rose from about 9% in 2020 to about 24% in 2024. Similarly, the proportion for AWS rose from 5% to 23%. On the other hand, the proportion for R fell from 33% in 2020 to 0% in 2024, based on this selection of jobs.
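To make the proportion calculation concrete, here is a small worked sketch on toy data that follows the same steps: keep only skills present in both years, count them, and divide by the yearly total. The data frames df1_toy and df2_toy are hypothetical; they are not the report's data.

```r
library(dplyr)

# Toy skill mentions for each year (hypothetical)
df1_toy <- data.frame(Year = "2020", Skills = c("Python", "Python", "R", "Spark"))
df2_toy <- data.frame(Year = "2024", Skills = c("Python", "Python", "Spark", "AWS"))

# Only Python and Spark appear in both years
shared_toy <- intersect(df1_toy$Skills, df2_toy$Skills)

props_toy <- bind_rows(df1_toy, df2_toy) %>%
  filter(Skills %in% shared_toy) %>%
  count(Year, Skills) %>%
  group_by(Year) %>%
  mutate(prop = n / sum(n)) %>%   # share of shared-skill mentions within the year
  ungroup()

# Shares within each year add up to 1
year_totals <- props_toy %>%
  group_by(Year) %>%
  summarise(total = sum(prop), .groups = "drop")

props_toy
year_totals
```

Because the denominator is the number of skill mentions rather than the number of postings, a true per-job proportion would instead divide by the count of distinct job listings in each year, which requires a job identifier that this sketch does not assume.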
A Comparative Analysis of Skills Needs Across Different Countries
library(tidytext)
# A tibble: 156 × 1
Location
<chr>
1 Agoura Hills, CA
2 Albuquerque, NM
3 Alexandria, VA
4 Aliso Viejo, CA
5 Allentown, PA
6 Ann Arbor, MI
7 Annapolis Junction, MD
8 Arlington, VA
9 Armonk, NY
10 Arvada, CO
# ℹ 146 more rows
# A tibble: 35 × 3
Country Skills n
<chr> <chr> <int>
1 Austria Python 21
2 Austria ML 15
3 Austria R 15
4 Austria SQL 13
5 Austria Analytics 11
6 Austria Statistics 8
7 Austria Hadoop 6
8 Austria Spark 6
9 Austria Tensorflow 6
10 Austria AI 4
# ℹ 25 more rows
library(tidytext)

ggplot(
  top10_skills_by_country,
  aes(x = n, y = reorder_within(Skills, n, Country), fill = Country)
) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = n), hjust = -0.1, size = 3) +
  facet_wrap(~ Country, scales = "free_y") +
  scale_y_reordered() +
  labs(
    title = "Top 10 Skill Requirements by Country",
    x = "Count",
    y = "Skill"
  ) +
  theme_minimal() +
  expand_limits(x = max(top10_skills_by_country$n) * 1.05)
After cleaning the data, one data set covers Germany, Switzerland, and Austria, while the other covers states in the United States. We took the top 10 skills for each of the three countries, then compiled the top 10 skills for each U.S. state to represent the U.S. data.
Comparing the skills across the four countries shows that, although skill requirements differ, Python is clearly the most common.
In the U.S., only two states reported a need for R, and Excel ranked second. In the other three countries, R ranked relatively high, placing second in both Austria and Germany.
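The report does not show how the per-country top skills were selected, so the following is one plausible sketch using count() and slice_max(). The toy rows are hypothetical, and the example keeps the top 2 skills per country instead of 10 to stay small.

```r
library(dplyr)

# Toy skill mentions per country (hypothetical)
country_skills_toy <- data.frame(
  Country = c(rep("Austria", 5), rep("Germany", 4)),
  Skills = c("Python", "Python", "R", "ML", "SQL",
             "Python", "R", "R", "SQL")
)

top_by_country <- country_skills_toy %>%
  count(Country, Skills) %>%
  group_by(Country) %>%
  # with_ties = FALSE returns exactly 2 rows per country, even when counts tie
  slice_max(order_by = n, n = 2, with_ties = FALSE) %>%
  ungroup()

top_by_country
```

With the default with_ties = TRUE, tied skills at the cutoff are all kept, so a "top 10" table can contain more than 10 rows per group.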
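A count such as "how many states list R" can be sketched by pulling the two-letter state code from the Location column and counting distinct states. The pairing of locations with skills below is toy data, and extracting the state with a regular expression is an assumption about how the step could be done, not the report's method.

```r
library(dplyr)
library(stringr)

# Toy U.S. listings (locations follow the "City, ST" pattern; skills are made up)
us_toy <- data.frame(
  Location = c("Albuquerque, NM", "Arlington, VA", "Alexandria, VA", "Ann Arbor, MI"),
  Skills = c("Python", "R", "R", "Excel")
)

states_with_r <- us_toy %>%
  mutate(State = str_extract(Location, "[A-Z]{2}$")) %>%  # trailing two-letter code
  filter(Skills == "R") %>%
  summarise(n_states = n_distinct(State)) %>%
  pull(n_states)

states_with_r
```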
Conclusion
The results of our exploratory analysis suggest that Python is the most valued data science skill. It ranks first overall, in each of the four countries we looked at, in both 2020 and 2024, and within the Core Data Foundations category, which is the most in-demand skill category. Excel is also highly valued, with over half of the jobs in our data requiring it. By category, core data foundations, including Python, Excel, R, and SQL, were in highest demand. However, there were also jobs requiring agentic and advanced AI skills, such as ML. Since these jobs were posted in either 2020 or 2024, postings from 2025 or 2026 might offer more insight into the value of this skill category, given the fast-paced evolution of these technologies and shifting employer expectations.
With these results, we are better equipped to focus on gaining the skills the job market wants. As the market evolves, the data set and analysis can be updated with newer postings.