Project 3 - Week 8

Author

Sinem K Moschos

Approach

Introduction

For this project, our goal is to answer the question: Which data science skills are most valued? Since this is an exploratory project, there is no single correct answer. Instead, we will investigate patterns in the available data to understand what they suggest about demand in the data science job market.

We selected a dataset of data-related job postings from Kaggle. The dataset contains information about job titles, categories, salaries, experience levels, company size, work setting and company location.

We noticed that the dataset does not include a column listing specific technical skills such as Python or SQL. Because of this, we will analyze the experience level required in job postings as an indicator of the level of expertise employers are seeking.

Data Source

The dataset comes from Kaggle:

https://www.kaggle.com/datasets/hummaamqaasim/jobs-in-data

This dataset contains information about jobs in the data field, including roles such as Data Scientist, Data Analyst, Machine Learning Engineer, and other related positions.

The dataset includes the following columns:

• work_year
• job_title
• job_category
• salary_currency
• salary
• salary_in_usd
• employee_residence
• experience_level
• employment_type
• work_setting
• company_location
• company_size

It provides structured information about data-related jobs, including the level of experience required and salary information. We believe these fields are useful for our analysis.

Defining Our Measure of Demand

The project question is which data science skills are most valued. However, because our dataset does not list technical skills, we will need to define an operational proxy for skill demand.

In this project, we use experience level required by employers as a proxy for skill demand.

For example:

Entry level jobs require basic knowledge of data tools and programming.
Mid level jobs often require stronger technical skills and practical experience.
Senior level jobs require advanced technical expertise and leadership abilities.

Data Acquisition and Storage

The dataset was downloaded from Kaggle as a CSV file. It will be loaded into R and then stored in a database using normalized tables, with the data separated into logical components to reduce redundancy.

The database structure will include tables like:

A jobs table containing job related information such as job title, category and work year.
A company table containing company information such as location and company size.
Additional attributes such as salary and experience level will be stored in the jobs table.
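
As a rough illustration of the split described above, the sketch below separates a few made-up rows into a company table and a jobs table using dplyr. The toy values and the job_id and company_id key columns are assumptions introduced for the example; the source data has no company identifier, so here a "company" is simply a distinct location and size pair.

```r
library(dplyr)

# Toy rows standing in for the Kaggle CSV; all values are made up for illustration.
postings <- tibble(
  work_year        = c(2023, 2023, 2022, 2023),
  job_title        = c("Data Scientist", "Data Analyst", "Data Engineer", "Data Analyst"),
  job_category     = c("Data Science", "Data Analysis", "Data Engineering", "Data Analysis"),
  experience_level = c("Senior", "Entry-level", "Mid-level", "Senior"),
  salary_in_usd    = c(150000, 70000, 110000, 120000),
  company_location = c("United States", "United States", "Germany", "United States"),
  company_size     = c("M", "M", "L", "M")
)

# Company table: one row per distinct location/size pair.
# company_id is a hypothetical surrogate key, not a column in the dataset.
company <- postings %>%
  distinct(company_location, company_size) %>%
  mutate(company_id = row_number(), .before = 1)

# Jobs table: job attributes plus a foreign key pointing at the company table.
# job_id is likewise a hypothetical surrogate key.
jobs <- postings %>%
  left_join(company, by = c("company_location", "company_size")) %>%
  mutate(job_id = row_number(), .before = 1) %>%
  select(job_id, company_id, work_year, job_title, job_category,
         experience_level, salary_in_usd)
```

Joining jobs back to company on company_id recovers the original rows, which is one quick way to check that nothing was lost in the split.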

Data Preparation and Cleaning

All data preparation and transformations will be performed in R using packages from the tidyverse.

The data preparation process will include:

Loading the dataset into R
Inspecting the dataset structure and variable types
Checking for missing values
Converting variables into appropriate formats
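
The four steps above could look roughly like the sketch below. It reads a tiny inline CSV so that it runs on its own; in the project, the same call would point at the downloaded Kaggle file instead. The sample rows are made up, and the experience level labels are an assumption about how the column is coded.

```r
library(readr)
library(dplyr)

# Made-up sample with the same column names as the dataset (one salary left blank).
csv_text <- "work_year,job_title,experience_level,salary_in_usd,company_size
2023,Data Scientist,Senior,150000,M
2023,Data Analyst,Entry-level,70000,M
2022,Data Engineer,Mid-level,,L
"

# Loading the dataset (I() marks the string as literal data rather than a path).
jobs_raw <- read_csv(I(csv_text), show_col_types = FALSE)

# Inspecting the structure and variable types.
glimpse(jobs_raw)

# Checking for missing values: number of NA entries in each column.
missing_counts <- jobs_raw %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Converting variables: experience_level becomes an ordered factor.
# The level labels below are assumed; adjust them to match the actual data.
level_order <- c("Entry-level", "Mid-level", "Senior", "Executive")
jobs_clean <- jobs_raw %>%
  mutate(
    experience_level = factor(experience_level, levels = level_order, ordered = TRUE),
    company_size     = factor(company_size)
  )
```

Treating experience_level as an ordered factor keeps later tables and charts in career order rather than alphabetical order.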

Exploratory Data Analysis

After cleaning the data, we will perform exploratory analysis to understand patterns in the dataset.

The analysis will include:

Counting the number of jobs for each experience level
Examining the distribution of job categories
Calculating average salary by experience level
Visualizing these patterns using charts

For example, we will create bar charts showing the number of job postings by experience level to identify which levels appear most frequently.
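
A minimal sketch of the counting, averaging, and bar chart steps is shown below, assuming the cleaned data has experience_level and salary_in_usd columns. The six rows are invented so the example runs by itself and do not reflect the real dataset.

```r
library(dplyr)
library(ggplot2)

# Invented rows for illustration only.
jobs <- tibble(
  experience_level = c("Senior", "Senior", "Mid-level", "Entry-level", "Senior", "Mid-level"),
  salary_in_usd    = c(150000, 130000, 100000, 60000, 140000, 90000)
)

# Number of postings and average salary for each experience level.
level_summary <- jobs %>%
  group_by(experience_level) %>%
  summarise(
    n_postings  = n(),
    mean_salary = mean(salary_in_usd),
    .groups     = "drop"
  )

# Bar chart of posting counts by experience level.
postings_bar <- ggplot(level_summary, aes(x = experience_level, y = n_postings)) +
  geom_col() +
  labs(x = "Experience level", y = "Number of postings")
```

Printing postings_bar draws the chart; swapping n_postings for mean_salary in the aesthetic mapping gives the salary comparison from the same summary table.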

Visualization Plan

The level of seniority in the job market will be visualized with:

Bar charts showing the frequency of job postings by experience level
Charts comparing salaries across experience levels
Visual summaries of job categories within the dataset:
A pie chart depicting the distribution of job demand by required experience level (Entry-Level, Mid-Level, Senior, and Executive)
The distribution of data science fields by experience level; for example, we may find that Machine Learning & AI postings lean toward senior-level roles
A lollipop chart showing the demand for each skill at each stage of a data scientist's professional career
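
ggplot2 has no dedicated lollipop geom, so one common way to build the chart is to combine geom_segment() for the stems with geom_point() for the heads. The sketch below does this for posting counts by job category; the category names and counts are placeholders, not results from the dataset.

```r
library(dplyr)
library(ggplot2)

# Placeholder counts for illustration only.
category_counts <- tibble(
  job_category = c("Data Analysis", "Data Engineering",
                   "Machine Learning and AI", "BI and Visualization"),
  n_postings   = c(40, 55, 30, 15)
)

# Order categories by count so the chart reads from smallest to largest.
ordered_counts <- category_counts %>%
  mutate(job_category = reorder(job_category, n_postings))

# Stem from zero to the count, with a point at the end of each stem.
lollipop <- ggplot(ordered_counts, aes(x = n_postings, y = job_category)) +
  geom_segment(aes(x = 0, xend = n_postings, yend = job_category)) +
  geom_point(size = 3) +
  labs(x = "Number of postings", y = NULL)
```

Adding facet_wrap(~ experience_level) to data that also carries an experience_level column would give one panel per career stage.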

Exploratory Data Analysis

The dataset includes job postings from around the world, with the majority in the United States. For our analysis of data science skills, we will focus on jobs offered in the United States. In addition to the experience-level proxy, we will collect the skills most commonly requested in each job category in our Kaggle dataset and use them to supplement the analysis. For example, BI development roles typically call for statistical analysis, data mining, and building dashboards with tools such as Tableau.
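
The United States filter and the supplementary skills lookup might be combined as in the sketch below. It assumes company_location stores the full country name, and it uses "BI Development" as a placeholder category label; both should be checked against the actual column values. The postings are made up, and the skills listed are only the ones named in the example above.

```r
library(dplyr)

# Made-up postings for illustration only.
postings <- tibble(
  job_category     = c("BI Development", "Data Analysis", "BI Development", "Data Analysis"),
  company_location = c("United States", "United States", "Germany", "United States")
)

# Keep only postings from companies located in the United States.
us_postings <- postings %>%
  filter(company_location == "United States")

# Number of US postings in each job category.
us_counts <- us_postings %>%
  count(job_category, name = "n_postings")

# Hand-collected lookup of skills per category (hypothetical table, one row per skill).
category_skills <- tibble(
  job_category = "BI Development",
  skill        = c("Statistical analysis", "Data mining", "Dashboards (Tableau)")
)

# Attach the US posting count for the category to each skill.
skills_by_demand <- category_skills %>%
  left_join(us_counts, by = "job_category")
```

Counting postings per category before the join keeps one row per skill and avoids multiplying postings by skills.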

Collaboration Tools

We plan to use:

GitHub for storing the project code and dataset
Shared documents for writing the project approach and notes
RStudio for performing the data analysis and visualization
Google Slides for presenting graphs for visual analysis

Expected Findings

Through this analysis, we expect to identify patterns in the demand for different levels of experience in data-related jobs.

For example, we may find that:

Mid-level and senior-level roles appear more frequently in job postings
Entry-level roles appear less often
Higher experience levels are associated with higher salaries

These findings can provide insight into how experience and expertise are valued in the data science job market.