Approach -
This project explores the question, “Which data science skills are most valued?” through a collaborative process with a group of our choosing. The goal is to work effectively as a team on an exploratory analysis that answers this question. Because we worked together on a past assignment and found that we collaborate well, we (Desiree Thomas, Denise Atherley and Kiera Griffiths) decided to work together on this project as well. We plan to fulfill the project requirements by breaking down each key deliverable.
Collaboration:
We will use GitHub and Slack as our primary communication channels. GitHub will host our shared repository, and we will use GitHub Projects to track issues, manage our workflow, assign tasks to specific members and track their completion. We will use Slack to coordinate meetings and Teams for video calls. GitHub issues may be used for handling bugs and errors. We have also created a README.md for documentation.
Data Acquisition:
To find a relevant data source, we asked Google Gemini, an LLM, to suggest data sources that could highlight which data science skills are most valued. It suggested several options, and we ultimately chose a Kaggle data set titled “Data Science Job Postings with Salaries 2025”, which appeared to be the most compelling. The data is a processed version of scraped data collected in 2025, with transformations made to respect company privacy and avoid redistribution of raw proprietary content.
Logical model for normalized database:
To design a normalized relational schema, we will break down the Data Science Job Postings table into four interconnected tables.
companies Table: Stores unique information about each company.
company_id (Primary Key)
company_name (from the company column)
headquarter, industry, ownership, company_size, revenue
jobs Table: Stores the specific job postings and links to the company.
job_id (Primary Key)
company_id (Foreign Key -> companies.company_id)
job_title, seniority_level, status, location, post_date, salary
skills Table: Stores a unique list of all possible skills.
skill_id (Primary Key)
skill_name
job_skills Table: A mapping (bridge) table to handle the many-to-many relationship between jobs and skills (since a job requires multiple skills, and a skill is required by multiple jobs).
An Entity-Relationship (ER) diagram will be produced to document this design, illustrating how entities relate to one another. We can use resources such as draw.io, Excalidraw or Lucidchart.
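As a rough sketch, the four-table design described above could be written as SQLite DDL along the following lines. The table and column names come from the design; the column types, the UNIQUE constraints, and the two columns of job_skills (job_id, skill_id) are assumptions inferred from its role as a bridge table. Python's built-in sqlite3 module is used here only to run the SQL; the same statements could be sent from R through DBI and RSQLite.

```python
import sqlite3

# DDL for the four tables in the logical model.
# Assumptions: column types, UNIQUE constraints, and the job_skills columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS companies (
    company_id   INTEGER PRIMARY KEY,
    company_name TEXT NOT NULL UNIQUE,
    headquarter  TEXT,
    industry     TEXT,
    ownership    TEXT,
    company_size TEXT,
    revenue      TEXT
);
CREATE TABLE IF NOT EXISTS jobs (
    job_id          INTEGER PRIMARY KEY,
    company_id      INTEGER NOT NULL REFERENCES companies(company_id),
    job_title       TEXT,
    seniority_level TEXT,
    status          TEXT,
    location        TEXT,
    post_date       TEXT,
    salary          TEXT
);
CREATE TABLE IF NOT EXISTS skills (
    skill_id   INTEGER PRIMARY KEY,
    skill_name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS job_skills (
    job_id   INTEGER NOT NULL REFERENCES jobs(job_id),
    skill_id INTEGER NOT NULL REFERENCES skills(skill_id),
    PRIMARY KEY (job_id, skill_id)
);
"""


def create_schema(conn):
    """Create the four tables and turn on foreign key enforcement."""
    # SQLite only enforces foreign keys when this pragma is enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")  # in-memory database for illustration
create_schema(conn)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The composite primary key on job_skills keeps the same job and skill pair from being stored twice, which is what lets the bridge table represent the many-to-many relationship cleanly.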
How we will load the data:
We will load the .csv file into a database, most likely SQLite, and use SQL code to generate our structured tables with the appropriate primary and foreign keys so that we can link them.
Once the relational database is established, we can use the DBI, RSQLite and tidyverse libraries to perform all data tidying, transformations and exploratory data analysis in R. We hope to store the database in the cloud so that every team member can connect to it. We will pull only the data that we need for our analysis and address any data quality issues, such as missing data and outliers that deviate from the standard format.
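The loading step could look something like the sketch below, which reads rows from a .csv and splits them across the normalized tables. Everything about the input is an assumption made for illustration: the sample rows are made up, the schema is trimmed to a few columns, and the "skills" column is assumed to hold a comma-separated list. The real Kaggle file may be laid out differently. Python's standard library is used here only to keep the sketch self-contained; the same SQL could be issued from R with DBI and RSQLite.

```python
import csv
import io
import sqlite3

# Made-up sample rows standing in for the real .csv (column names are assumptions).
SAMPLE_CSV = '''company,industry,job_title,salary,skills
Acme,Technology,Data Scientist,120000,"python, sql"
Acme,Technology,ML Engineer,150000,"python, scala"
Globex,Finance,Data Scientist,,"sql"
'''

# Trimmed-down version of the four-table schema, enough to show the load.
SCHEMA = """
CREATE TABLE companies (company_id INTEGER PRIMARY KEY,
                        company_name TEXT UNIQUE, industry TEXT);
CREATE TABLE jobs (job_id INTEGER PRIMARY KEY,
                   company_id INTEGER REFERENCES companies(company_id),
                   job_title TEXT, salary REAL);
CREATE TABLE skills (skill_id INTEGER PRIMARY KEY, skill_name TEXT UNIQUE);
CREATE TABLE job_skills (job_id INTEGER REFERENCES jobs(job_id),
                         skill_id INTEGER REFERENCES skills(skill_id),
                         PRIMARY KEY (job_id, skill_id));
"""


def load_postings(conn, rows):
    """Insert each posting, reusing existing company and skill rows."""
    for row in rows:
        # Companies: insert once, then look up the key.
        conn.execute(
            "INSERT OR IGNORE INTO companies (company_name, industry) VALUES (?, ?)",
            (row["company"], row["industry"]))
        company_id = conn.execute(
            "SELECT company_id FROM companies WHERE company_name = ?",
            (row["company"],)).fetchone()[0]

        # Missing salaries are stored as NULL rather than guessed.
        salary = float(row["salary"]) if row["salary"].strip() else None
        job_id = conn.execute(
            "INSERT INTO jobs (company_id, job_title, salary) VALUES (?, ?, ?)",
            (company_id, row["job_title"], salary)).lastrowid

        # Skills: normalize case and whitespace, drop duplicates within a posting.
        skill_names = {s.strip().lower() for s in row["skills"].split(",") if s.strip()}
        for name in skill_names:
            conn.execute("INSERT OR IGNORE INTO skills (skill_name) VALUES (?)", (name,))
            skill_id = conn.execute(
                "SELECT skill_id FROM skills WHERE skill_name = ?", (name,)).fetchone()[0]
            conn.execute(
                "INSERT INTO job_skills (job_id, skill_id) VALUES (?, ?)",
                (job_id, skill_id))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load_postings(conn, csv.DictReader(io.StringIO(SAMPLE_CSV)))

counts = {table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
          for table in ("companies", "jobs", "skills", "job_skills")}
print(counts)
```

Lowercasing and trimming skill names before inserting them is one simple way to keep variants such as "Python" and "python " from becoming separate rows in the skills table.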
Analysis strategy:
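To answer the question of which data science skills are most valued, we will likely analyze the data through two main lenses: demand (how frequently a skill appears in job postings) and value (the financial return associated with the skill).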
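A minimal sketch of how skill frequency and pay could be summarized from the normalized tables is shown below. It assumes salary is already stored as a single numeric value per posting, and the rows are made up purely for illustration; they are not taken from the real data set. Python's sqlite3 module is used only to run the query, which could equally be sent from R.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, salary REAL);
CREATE TABLE skills (skill_id INTEGER PRIMARY KEY, skill_name TEXT UNIQUE);
CREATE TABLE job_skills (job_id INTEGER, skill_id INTEGER,
                         PRIMARY KEY (job_id, skill_id));
-- Made-up rows for illustration only.
INSERT INTO jobs VALUES (1, 100000), (2, 140000), (3, NULL);
INSERT INTO skills VALUES (1, 'python'), (2, 'sql'), (3, 'scala');
INSERT INTO job_skills VALUES (1, 1), (1, 2), (2, 1), (2, 3), (3, 2);
""")

# One row per skill: number of distinct postings and mean salary.
# AVG() skips NULL salaries, so postings without pay data still count
# toward the posting total but do not distort the average.
SKILL_SUMMARY = """
SELECT s.skill_name,
       COUNT(DISTINCT js.job_id) AS postings,
       AVG(j.salary)             AS avg_salary
FROM skills AS s
JOIN job_skills AS js ON js.skill_id = s.skill_id
JOIN jobs AS j        ON j.job_id = js.job_id
GROUP BY s.skill_name
ORDER BY postings DESC, s.skill_name
"""

summary = conn.execute(SKILL_SUMMARY).fetchall()

# Same rows re-ranked by average salary (skills with no salary data sort last).
by_value = sorted(summary, key=lambda row: row[2] or 0, reverse=True)

for skill_name, postings, avg_salary in summary:
    print(skill_name, postings, avg_salary)
```

Ranking the same summary two ways keeps the posting count and the salary figure separate, so a skill that is rare but well paid is not hidden by one that is merely common.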
Conclusion -
This project began by identifying a data source of jobs posted in 2025 with “data scientist” or “machine learning” in the position title, along with keywords for the skills each job requires. Because the data set was so robust, the team decided to gather the information, tidy it and normalize it in a relational database. Once the distinct tables were created and their relationships identified, we were able to analyze the data further. The analysis categorizes data science skills into distinct tiers based on their frequency in job postings (demand) and their financial return (value). Our findings show that, in terms of demand, Python is the most frequently requested data science skill, with over 600 distinct job postings requiring it. Python’s popularity stems from its blend of simple, readable syntax and versatility across numerous domains.
In terms of financial return, Scala is the highest-paying skill. Scala, short for “Scalable Language,” is a high-level, general-purpose programming language designed to integrate object-oriented and functional programming paradigms. In data science, Scala is primarily used for big data engineering, large-scale data processing, and distributed computing, largely because Apache Spark is written in Scala and exposes a native Scala API.
These findings are interesting because they show the market hiring people skilled in a flexible, general-purpose language while paying a premium for a scalable language that can support big data. This suggests that employers want to grow their data and generate meaningful insights from it.
An LLM was used to help enhance the code:
Google DeepMind. (2025). Gemini 3 Flash [Large language model]. https://gemini.google.com. Accessed March 16-22, 2026.