Project 3 - Week 8
Approach
Introduction
For this project, our goal is to answer the question: Which data science skills are most valued? Since this is an exploratory project, there is no single correct answer. Instead, we will investigate patterns in available data to understand what the data suggest about demand in the data science job market.
We selected a dataset of data-related job postings from Kaggle. The dataset contains information about job titles, categories, salaries, experience levels, company size, work setting and company location.
We noticed that the dataset does not include a column listing specific technical skills such as Python or SQL. Because of this, we will use the experience level required in each job posting as an indicator of the level of expertise employers are looking for.
Data Source
The dataset comes from Kaggle:
https://www.kaggle.com/datasets/hummaamqaasim/jobs-in-data
This dataset contains information about jobs in the data field, including roles such as Data Scientist, Data Analyst, Machine Learning Engineer, and other related positions.
The dataset includes the following columns:
- work_year
- job_title
- job_category
- salary_currency
- salary
- salary_in_usd
- employee_residence
- experience_level
- employment_type
- work_setting
- company_location
- company_size
It provides structured information about data-related jobs, including the level of experience required and salary information, which we believe makes it useful for this analysis.
Defining Our Measure of Demand
The project question is which data science skills are most valued. However, because our dataset does not list technical skills, we will need to define an operational proxy for skill demand.
In this project, we use experience level required by employers as a proxy for skill demand.
For example:
- Entry-level jobs typically require basic knowledge of data tools and programming.
- Mid-level jobs often require stronger technical skills and practical experience.
- Senior-level jobs typically require advanced technical expertise and leadership abilities.
Data Acquisition and Storage
The dataset was downloaded from Kaggle as a CSV file. It will be loaded into R and then stored in a database using normalized tables, with the data separated into logical components to reduce redundancy.
The database structure will include tables such as:
- A jobs table containing job-related information such as job title, category, and work year.
- A company table containing company information such as location and company size.
- Additional attributes such as salary and experience level will be stored in the jobs table.
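The jobs and company tables described above could be derived along these lines. This is a minimal sketch, not the final schema: the rows are toy values invented for illustration, and `company_id` and `job_id` are hypothetical surrogate keys (the dataset itself has no company identifier, so a "company" here is just a distinct combination of location and size).

```r
library(dplyr)
library(tibble)

# Toy rows, invented for illustration only; column names follow the dataset.
postings <- tribble(
  ~work_year, ~job_title,                  ~job_category,               ~experience_level, ~salary_in_usd, ~company_location, ~company_size,
  2023,       "Data Scientist",            "Data Science and Research", "Senior",          150000,         "United States",   "M",
  2023,       "Data Analyst",              "Data Analysis",             "Entry-level",     70000,          "United States",   "M",
  2022,       "Machine Learning Engineer", "Machine Learning and AI",   "Mid-level",       120000,         "Germany",         "L"
)

# Company table: one row per distinct (location, size) pair.
# company_id is a hypothetical surrogate key added for this sketch.
company <- postings %>%
  distinct(company_location, company_size) %>%
  mutate(company_id = row_number())

# Jobs table: job attributes, salary, and experience level,
# plus a foreign key pointing at the company table.
jobs <- postings %>%
  left_join(company, by = c("company_location", "company_size")) %>%
  mutate(job_id = row_number()) %>%
  select(job_id, company_id, work_year, job_title, job_category,
         experience_level, salary_in_usd)
```

The plan does not name a database backend. If one such as SQLite were chosen, the two data frames could then be written to it, for example with `DBI::dbWriteTable()`.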
Data Preparation and Cleaning
All data preparation and transformations will be performed in R using packages from the tidyverse.
The data preparation process will include:
- Loading the dataset into R
- Inspecting the dataset structure and variable types
- Checking for missing values
- Converting variables into appropriate formats
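The four steps above might look roughly like this in R. The sketch writes a small toy CSV to a temporary file so that it runs on its own; in the project, `path` would point to the downloaded Kaggle file instead. The toy rows are invented, and the experience-level labels in `level_order` are an assumption about how the dataset spells them.

```r
library(readr)
library(dplyr)

# Toy CSV (invented values) standing in for the Kaggle download.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "work_year,job_title,experience_level,salary_in_usd,company_size",
  "2023,Data Scientist,Senior,150000,M",
  "2023,Data Analyst,Entry-level,,M",
  "2022,Data Engineer,Mid-level,120000,L"
), path)

# 1. Load the dataset into R.
jobs_raw <- read_csv(path, show_col_types = FALSE)

# 2. Inspect structure and variable types.
glimpse(jobs_raw)

# 3. Count missing values in each column.
missing_counts <- colSums(is.na(jobs_raw))

# 4. Convert variables into appropriate formats.
# Assumed label spelling; check against the real data before relying on it.
level_order <- c("Entry-level", "Mid-level", "Senior", "Executive")
jobs_clean <- jobs_raw %>%
  mutate(
    work_year        = as.integer(work_year),
    experience_level = factor(experience_level, levels = level_order, ordered = TRUE),
    company_size     = factor(company_size)
  )
```

Storing experience level as an ordered factor keeps the levels in seniority order in later tables and charts rather than in alphabetical order.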
Exploratory Data Analysis
After cleaning the data, we will perform exploratory data analysis to understand patterns in the dataset.
The analysis will include:
- Counting the number of jobs for each experience level
- Examining the distribution of job categories
- Calculating average salary by experience level
- Visualizing these patterns using charts
For example, we will create bar charts showing the number of job postings by experience level to identify which levels appear most frequently.
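As a rough illustration of the counts and averages listed above, the summaries could be computed with dplyr as follows. The data frame is a toy example with invented values; only the column names are taken from the dataset.

```r
library(dplyr)
library(tibble)

# Toy data, invented for illustration only.
jobs <- tribble(
  ~job_category,               ~experience_level, ~salary_in_usd,
  "Data Analysis",             "Entry-level",     60000,
  "Data Analysis",             "Mid-level",       90000,
  "Data Engineering",          "Senior",          150000,
  "Data Science and Research", "Senior",          170000,
  "Data Science and Research", "Mid-level",       110000
)

# Number of postings for each experience level, most frequent first.
level_counts <- jobs %>%
  count(experience_level, sort = TRUE)

# Distribution of job categories.
category_counts <- jobs %>%
  count(job_category, sort = TRUE)

# Average salary (USD) by experience level, ignoring missing salaries.
salary_by_level <- jobs %>%
  group_by(experience_level) %>%
  summarise(
    mean_salary = mean(salary_in_usd, na.rm = TRUE),
    postings    = n(),
    .groups     = "drop"
  )
```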
Visualization Plan
The level of seniority in the job market will be visualized with:
- Bar charts showing the frequency of job postings by experience level
- Charts comparing salaries across experience levels
- Visual summaries of job categories within the dataset:
- A pie chart depicting the distribution of job demand by required experience level (Entry-Level, Mid-Level, Senior, and Executive)
- The distribution of data science fields by experience level. As an example, Machine Learning & AI postings often lean toward senior-level roles.
- A lollipop chart showing the demand for each skill at each stage of a data scientist's professional career.
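Two of the charts listed above, the bar chart and the lollipop chart, could be sketched in ggplot2 as shown here. The data are toy values invented for illustration. Because the dataset has no skills column, the lollipop chart uses job category as a stand-in and shows posting counts per category within each experience level; that substitution is an assumption of this sketch.

```r
library(ggplot2)
library(dplyr)
library(tibble)

# Toy data, invented for illustration only.
jobs <- tribble(
  ~job_category,               ~experience_level,
  "Data Analysis",             "Entry-level",
  "Data Analysis",             "Mid-level",
  "Data Engineering",          "Senior",
  "Machine Learning and AI",   "Senior",
  "Machine Learning and AI",   "Senior",
  "Data Science and Research", "Mid-level"
)

# Keep levels in seniority order (assumed label spelling).
level_order <- c("Entry-level", "Mid-level", "Senior", "Executive")
jobs <- jobs %>%
  mutate(experience_level = factor(experience_level, levels = level_order))

# Bar chart: frequency of postings by experience level.
p_bar <- ggplot(jobs, aes(x = experience_level)) +
  geom_bar() +
  labs(title = "Job postings by experience level",
       x = "Experience level", y = "Number of postings")

# Lollipop chart: a thin segment from zero plus a point at the count,
# drawn per job category and split into one panel per experience level.
category_by_level <- jobs %>%
  count(experience_level, job_category)

p_lollipop <- ggplot(category_by_level, aes(x = n, y = job_category)) +
  geom_segment(aes(x = 0, xend = n, yend = job_category)) +
  geom_point(size = 3) +
  facet_wrap(~ experience_level) +
  labs(title = "Postings per job category by experience level",
       x = "Number of postings", y = "Job category")
```

Calling `print(p_bar)` or `print(p_lollipop)` renders the charts, and `ggsave()` can write them to a file for the slides.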
Exploratory Data Analysis: Scope and Supplementary Skills
The dataset covers job postings from around the world, with the majority in the United States. For our analysis of data science skills, we will focus on jobs offered in the United States. In addition to the experience-level proxy, we will collect the skills most commonly requested in each job category in our Kaggle dataset and use them as a supplement to the analysis. For example, BI development roles typically require statistical analysis, data mining, and the creation of dashboards using programs such as Tableau.
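Restricting the analysis to United States postings could be done with a simple filter, as in the sketch below. It assumes that `company_location` stores full country names such as "United States"; if the real file uses country codes instead, the filter value would need to change. The rows are toy values invented for illustration.

```r
library(dplyr)
library(tibble)

# Toy data, invented for illustration only.
jobs <- tribble(
  ~job_title,       ~company_location,
  "Data Scientist", "United States",
  "Data Analyst",   "United States",
  "Data Engineer",  "Germany",
  "BI Developer",   "United States",
  "Data Analyst",   "Canada"
)

# Share of postings per country, to check where most postings are located.
location_share <- jobs %>%
  count(company_location, sort = TRUE) %>%
  mutate(share = n / sum(n))

# Keep only postings from companies located in the United States.
# "United States" is an assumed label; verify it against the real data.
us_jobs <- jobs %>%
  filter(company_location == "United States")
```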
Collaboration Tools
We plan to use:
- GitHub for storing the project code and dataset
- Shared documents for writing the project approach and notes
- RStudio for performing the data analysis and visualization
- Google Slides to present graphs for visual analysis
Expected Findings
Through this analysis, we expect to identify patterns in the demand for different levels of experience in data-related jobs.
For example, we may find that:
- Mid-level and senior-level roles appear more frequently in job postings
- Entry-level roles appear less often
- Higher experience levels are associated with higher salaries
These findings can provide insight into how experience and expertise are valued in the data science job market.