Project # 3

Author

Michael Mayne & Sinem Moschos

Approach

Introduction

For this project, our goal is to answer the question: Which data science skills are most valued? Since this is an exploring project, there is no single correct answer. Instead, we will investigate patterns in available data to understand what the data suggest about demand in the data science job market.

We selected a dataset of data-related job postings from Kaggle. The dataset contains information about job titles, categories, salaries, experience levels, company size, work setting and company location.

We noticed that the dataset does not include a direct column listing specific technical skills such as Python or SQL. Because of this, we will analyze experience level requirements in job postings as an indicator of the level of expertise.

Data Source

The dataset source is from Kaggle:

https://www.kaggle.com/datasets/hummaamqaasim/jobs-in-data

This dataset contains information about jobs in the data field, including roles such as Data Scientist, Data Analyst, Machine Learning Engineer, and other related positions.

The dataset includes  also the following items:

• work_year

• job_title

• job_category

• salary_currency

• salary

• salary_in_usd

• employee_residence

• experience_level

• employment_type

• work_setting

• company_location

• company_size

It provides structured information about data related jobs,also including the level of experience required and salary information.I believe these are useful for analysis.

Defining Our Measure of Demand

The project question is which data science skills are most valued. However, because our dataset does not list technical skills, we will need to define an operational proxy for skill demand.

In this project, we use experience level required by employers as a proxy for skill demand.

For example:

  • Entry level jobs require basic knowledge of data tools and programming.

  • Mid level jobs often require stronger technical skills and practical experience.

  • Senior level jobs require advanced technical expertise and leadership abilities.

Data Acquisition and Storage

The dataset was from Kaggle and downloaded as a CSV file. The data will be loaded into R. The dataset will be stored in a database using normalized tables. The data will be separated into logical components so that redundancy is reduced.

The database structure will include tables like:

  • A jobs table containing job related information such as job title, category and work year.

  • A company table containing company information such as location and company size.

  • Additional attributes such as salary and experience level will be stored in the jobs table.

Data Preparation and Cleaning

All data preparation and transformations will be analyzed using R, and packages from the tidyverse.

The data preparation process will include:

  • Loading the dataset into R

  • Inspecting the dataset structure and variable types

  • Checking for missing values

  • Converting variables into appropriate formats

Exploratory Data Analysis

After cleaning the data, we will perform data analysis to understand patterns in the dataset.

The analysis will include:

  • Counting the number of jobs for each experience level

  • Examining the distribution of job categories

  • Calculating average salary by experience level

  • Visualizing these patterns using charts

For example, we will create bar charts showing the number of job postings by experience level to identify which levels appear most frequently.

Visualization Plan

For the visualization for level of seniority in the job market will be shown by

Bar charts showing the frequency of job postings by experience level

Charts comparing salaries across experience levels

Visual summaries of job categories within the dataset: 

Pie chart dictating the distribution of job demand by required experience level between Entry-Level, Mid-Level , Senior and Executive

Distribution of Data Science fields based on Experience Level: As an example, Machine Learning & AI posting often lean toward senior level roles. 

“Lollipop” chart in order to see the demand for each career skill in each point of a data scientist professional career. 

Exploratory Data Analysis: 

#Data mentions job requests from all over the world with the majority in America. For the purpose of our analysis on data science skills, we will focus on the jobs offered in the United States. 

In addition to proxy data,  we will use and collect the most effective skills in each subcategory provided in our Kaggle dataset. The most common skills requested in said field will be to use a supplement for analysis. For example BI Development requires statistical analysis, data mining and creation of dashboards using programs such as Tableau. 

Collaboration Tools

We plan to use:

  • GitHub for storing the project code and dataset

  • Shared documents for writing the project approach and notes

  • RStudio for performing the data analysis and visualization

  • Google slides to present graphs for visual analysis 

Expected Findings

Through this analysis, we expect to identify patterns in the demand for different levels of experience in data-related jobs.

For example, we may find that:

  • Mid-level and senior-level roles appear more frequently in job postings

  • Entry-level roles appear less often

  • Higher experience levels are associated with higher salaries

These findings can provide insight into how experience and expertise are valued in the data science job market.

Project #3 Code Base Analysis

Code Base

Load Libraries

We load the libraries. I use tidyverse for cleaning and analyzing the data, DBI and RSQLite for creating a relational database, and ggplot2 for visualizations

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI)
library(RSQLite)

Load the Dataset

We load the CSV file directly from GitHub using the raw link.

data_url <- "https://raw.githubusercontent.com/sinemkilicdere/Data607/refs/heads/main/Week8-Project3/jobs_in_data.csv"

jobs_raw <- read_csv(data_url)
Rows: 9355 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): job_title, job_category, salary_currency, employee_residence, exper...
dbl (3): work_year, salary, salary_in_usd

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A first look at the dataset

head(jobs_raw)
# A tibble: 6 × 12
  work_year job_title          job_category salary_currency salary salary_in_usd
      <dbl> <chr>              <chr>        <chr>            <dbl>         <dbl>
1      2023 Data DevOps Engin… Data Engine… EUR              88000         95012
2      2023 Data Architect     Data Archit… USD             186000        186000
3      2023 Data Architect     Data Archit… USD              81800         81800
4      2023 Data Scientist     Data Scienc… USD             212000        212000
5      2023 Data Scientist     Data Scienc… USD              93300         93300
6      2023 Data Scientist     Data Scienc… USD             130000        130000
# ℹ 6 more variables: employee_residence <chr>, experience_level <chr>,
#   employment_type <chr>, work_setting <chr>, company_location <chr>,
#   company_size <chr>

Inspect the Structure of the Data

We check the structure and variable types to understand what kind of data we are working with.

glimpse(jobs_raw)
Rows: 9,355
Columns: 12
$ work_year          <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
$ job_title          <chr> "Data DevOps Engineer", "Data Architect", "Data Arc…
$ job_category       <chr> "Data Engineering", "Data Architecture and Modeling…
$ salary_currency    <chr> "EUR", "USD", "USD", "USD", "USD", "USD", "USD", "U…
$ salary             <dbl> 88000, 186000, 81800, 212000, 93300, 130000, 100000…
$ salary_in_usd      <dbl> 95012, 186000, 81800, 212000, 93300, 130000, 100000…
$ employee_residence <chr> "Germany", "United States", "United States", "Unite…
$ experience_level   <chr> "Mid-level", "Senior", "Senior", "Senior", "Senior"…
$ employment_type    <chr> "Full-time", "Full-time", "Full-time", "Full-time",…
$ work_setting       <chr> "Hybrid", "In-person", "In-person", "In-person", "I…
$ company_location   <chr> "Germany", "United States", "United States", "Unite…
$ company_size       <chr> "L", "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
colnames(jobs_raw) #checking column names
 [1] "work_year"          "job_title"          "job_category"      
 [4] "salary_currency"    "salary"             "salary_in_usd"     
 [7] "employee_residence" "experience_level"   "employment_type"   
[10] "work_setting"       "company_location"   "company_size"      

Check for Missing Values

Before cleaning, we check if there are missing values in each column.

colSums(is.na(jobs_raw))
         work_year          job_title       job_category    salary_currency 
                 0                  0                  0                  0 
            salary      salary_in_usd employee_residence   experience_level 
                 0                  0                  0                  0 
   employment_type       work_setting   company_location       company_size 
                 0                  0                  0                  0 

Understand the Experience Level Codes

The dataset uses coded values for experience level, so we first check the unique values.

unique(jobs_raw$experience_level)
[1] "Mid-level"   "Senior"      "Executive"   "Entry-level"

These are: - EN = Entry-level - MI = Mid-level - SE = Senior-level - EX = Executive-level

Clean and Prepare the Data

Now we clean the dataset and these steps next: - rename experience level values - focus on jobs in the United States - make some text values easier to read

jobs_clean <- jobs_raw %>%
  filter(company_location == "United States") %>%
  mutate(
    experience_level = recode(
      experience_level,
      "EN" = "Entry-Level",
      "MI" = "Mid-Level",
      "SE" = "Senior-Level",
      "EX" = "Executive-Level"
    ),
    company_size = recode(
      company_size,
      "S" = "Small",
      "M" = "Medium",
      "L" = "Large"
    )
  )

checking clean data

head(jobs_clean)
# A tibble: 6 × 12
  work_year job_title      job_category     salary_currency salary salary_in_usd
      <dbl> <chr>          <chr>            <chr>            <dbl>         <dbl>
1      2023 Data Architect Data Architectu… USD             186000        186000
2      2023 Data Architect Data Architectu… USD              81800         81800
3      2023 Data Scientist Data Science an… USD             212000        212000
4      2023 Data Scientist Data Science an… USD              93300         93300
5      2023 Data Scientist Data Science an… USD             130000        130000
6      2023 Data Scientist Data Science an… USD             100000        100000
# ℹ 6 more variables: employee_residence <chr>, experience_level <chr>,
#   employment_type <chr>, work_setting <chr>, company_location <chr>,
#   company_size <chr>

Check the Number of U.S. Job Postings

We count how many rows remain after filtering.

nrow(jobs_clean)
[1] 8132

Check the Distribution of Experience Levels

We count the number of jobs in each experience level.

jobs_clean %>%
  count(experience_level, sort = TRUE)
# A tibble: 4 × 2
  experience_level     n
  <chr>            <int>
1 Senior            6154
2 Mid-level         1389
3 Entry-level        343
4 Executive          246

Creating a Relational Database

We store the data in a relational database using normalized tables.

We will create: • a company table • a jobs table

The jobs table will connect to the company table through a company_id.

Create the Company Table

Since the dataset does not include company names, we define companies based on the combination of: • company location • company size • work setting

company_table <- jobs_clean %>%
  distinct(company_location, company_size, work_setting) %>%
  mutate(company_id = row_number()) %>%
  select(company_id, everything())

Create the Jobs Table

We join the company table back to the jobs data so each job gets a company_id.

jobs_table <- jobs_clean %>%
  left_join(company_table, by = c("company_location", "company_size", "work_setting")) %>%
  mutate(job_id = row_number()) %>%
  select(
    job_id,
    work_year,
    job_title,
    job_category,
    salary_currency,
    salary,
    salary_in_usd,
    employee_residence,
    experience_level,
    employment_type,
    company_id
  )

preview both tables:

head(company_table)
# A tibble: 6 × 4
  company_id company_location company_size work_setting
       <int> <chr>            <chr>        <chr>       
1          1 United States    Medium       In-person   
2          2 United States    Medium       Remote      
3          3 United States    Large        In-person   
4          4 United States    Small        In-person   
5          5 United States    Large        Remote      
6          6 United States    Large        Hybrid      
head(jobs_table)
# A tibble: 6 × 11
  job_id work_year job_title   job_category salary_currency salary salary_in_usd
   <int>     <dbl> <chr>       <chr>        <chr>            <dbl>         <dbl>
1      1      2023 Data Archi… Data Archit… USD             186000        186000
2      2      2023 Data Archi… Data Archit… USD              81800         81800
3      3      2023 Data Scien… Data Scienc… USD             212000        212000
4      4      2023 Data Scien… Data Scienc… USD              93300         93300
5      5      2023 Data Scien… Data Scienc… USD             130000        130000
6      6      2023 Data Scien… Data Scienc… USD             100000        100000
# ℹ 4 more variables: employee_residence <chr>, experience_level <chr>,
#   employment_type <chr>, company_id <int>

Create the SQLite Database

We create a SQLite database file and write both normalized tables into it.

con <- dbConnect(SQLite(), "jobs_in_data.sqlite")

dbWriteTable(con, "company", company_table, overwrite = TRUE)
dbWriteTable(con, "jobs", jobs_table, overwrite = TRUE)
dbListTables(con)
[1] "company" "jobs"   
dbReadTable(con, "company") %>% head()
  company_id company_location company_size work_setting
1          1    United States       Medium    In-person
2          2    United States       Medium       Remote
3          3    United States        Large    In-person
4          4    United States        Small    In-person
5          5    United States        Large       Remote
6          6    United States        Large       Hybrid
dbReadTable(con, "jobs") %>% head()
  job_id work_year      job_title                   job_category
1      1      2023 Data Architect Data Architecture and Modeling
2      2      2023 Data Architect Data Architecture and Modeling
3      3      2023 Data Scientist      Data Science and Research
4      4      2023 Data Scientist      Data Science and Research
5      5      2023 Data Scientist      Data Science and Research
6      6      2023 Data Scientist      Data Science and Research
  salary_currency salary salary_in_usd employee_residence experience_level
1             USD 186000        186000      United States           Senior
2             USD  81800         81800      United States           Senior
3             USD 212000        212000      United States           Senior
4             USD  93300         93300      United States           Senior
5             USD 130000        130000      United States           Senior
6             USD 100000        100000      United States           Senior
  employment_type company_id
1       Full-time          1
2       Full-time          1
3       Full-time          1
4       Full-time          1
5       Full-time          2
6       Full-time          2
dbDisconnect(con)

Exploratory Data Analysis

After cleaning and storing the data, we begin the exploratory analysis.

1.Count the Number of Jobs by Experience Level

We first count how many job postings belong to each experience level.

experience_counts <- jobs_clean %>%
  count(experience_level, sort = TRUE)

experience_counts
# A tibble: 4 × 2
  experience_level     n
  <chr>            <int>
1 Senior            6154
2 Mid-level         1389
3 Entry-level        343
4 Executive          246

Bar Chart: Frequency of Job Postings by Experience Level

This bar chart shows which experience level appears most often.

ggplot(experience_counts, aes(x = reorder(experience_level, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Frequency of U.S. Job Postings by Experience Level",
    x = "Experience Level",
    y = "Number of Job Postings"
  )

2. Distribution of Job Categories

Next, we examine which job categories appear the most.

job_category_counts <- jobs_clean %>%
  count(job_category, sort = TRUE)

job_category_counts
# A tibble: 10 × 2
   job_category                       n
   <chr>                          <int>
 1 Data Science and Research       2635
 2 Data Engineering                1977
 3 Data Analysis                   1252
 4 Machine Learning and AI         1190
 5 Leadership and Management        442
 6 BI and Visualization             288
 7 Data Architecture and Modeling   237
 8 Data Management and Strategy      56
 9 Data Quality and Operations       50
10 Cloud and Database                 5

Bar Chart: Top Job Categories

job_category_counts %>%
  slice_max(order_by = n, n = 10) %>%
  ggplot(aes(x = reorder(job_category, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 10 Job Categories in U.S. Data Jobs",
    x = "Job Category",
    y = "Number of Job Postings"
  )

3. Career Titles based on Experience & Demand

Another way to express the difference in skills demanded is by showing the difference in Career titles. Multiple people can work under the same field and these but require unique skills sets, thus categories do not express the whole picture. We can organize the most common job title requested by recruiters. Then we can outline the type of experienced needed in each field by using secondary sources to figure out what unique skills and experiences are commonly needed in each category.

#Data below shows the top ten job title expressed from the data
jobs_clean %>%
  count(job_title, sort = TRUE)
# A tibble: 105 × 2
   job_title                          n
   <chr>                          <int>
 1 Data Engineer                   1937
 2 Data Scientist                  1720
 3 Data Analyst                    1219
 4 Machine Learning Engineer        869
 5 Applied Scientist                271
 6 Research Scientist               244
 7 Analytics Engineer               224
 8 Data Architect                   199
 9 Business Intelligence Engineer   140
10 Research Engineer                131
# ℹ 95 more rows
job_title_count <- jobs_clean %>%
  count(job_title, sort = TRUE)%>%
  head(10)

#Data on the least poular roles- Contains Overlap with prior set ups. 

jobs_clean %>%
  count(job_title, sort = TRUE) %>%
  tail(20)
# A tibble: 20 × 2
   job_title                                n
   <chr>                                <int>
 1 Staff Data Scientist                     2
 2 AI Research Engineer                     1
 3 AWS Data Architect                       1
 4 BI Data Engineer                         1
 5 Big Data Engineer                        1
 6 Business Intelligence Data Analyst       1
 7 Cloud Data Architect                     1
 8 Compliance Data Analyst                  1
 9 Data Analytics Lead                      1
10 Data Science Tech Lead                   1
11 Data Scientist Lead                      1
12 Finance Data Analyst                     1
13 Machine Learning Developer               1
14 Machine Learning Operations Engineer     1
15 Manager Data Management                  1
16 Managing Director Data Science           1
17 Principal Data Analyst                   1
18 Principal Machine Learning Engineer      1
19 Software Data Engineer                   1
20 Staff Machine Learning Engineer          1

Although they are similar looking closely at the jobs we can see that difference experiences is needed. The top common fields in data science have different skills in which they focus on and different priorities.

Data Analyst/ Scientist :

  • Entry: Basic (SQL/ R/ Python) & Data Wrangling & Visualizations
  • Mid/High : Predictive Analytics and Models, Leading Research Initiatives

Data Engineer

  • Entry: ETL tools / Advanced Data Based knowledge and storage
  • Mid/High: Big Data Technologies, Cloud Servers,

Machine Learning Engineer :

  • Entry: Basic ML algorithms, Basic (Python/R),
  • Mid/High: “Deep Learning” Systems, Tensor Flow, Keras, Neural Networks, ML system development

BI/Business Intelligence Engineer

  • Entry: Data Analysis, Dashboard Creations(Tableu, SPSS) , Report Creation
  • Mid/High: Data Mining, Business Intelligence Strategy, Business Soft Skills (Presentations, Management, Proposals)

Bar Chart Expressing Frequency of Career Titles

ggplot(job_title_count, aes(x = reorder(job_title, n), y = n)) +
  geom_col(fill = "darkblue") +
  coord_flip()+
  labs(title = "Popular Career Titles by Frequency", x = "Category", y = "# of Jobs")

A best way to establish experience is to see the difference goal each individual will have based on their roles

Career Titles base on Salary

Job Titles Hiring Per Level

First we organize or data by Job title similar as before but instead we will go from the most popular title per experience level.

title_by_experience <-jobs_clean %>%
   group_by(experience_level,job_title) %>%
  count(job_title, sort= TRUE)

toptitle_experience <- title_by_experience %>% 
  group_by(experience_level) %>%
  slice_max(n, n = 10) %>%
  ungroup()


print(toptitle_experience)
# A tibble: 42 × 3
   experience_level job_title                       n
   <chr>            <chr>                       <int>
 1 Entry-level      Data Analyst                   96
 2 Entry-level      Data Engineer                  71
 3 Entry-level      Data Scientist                 60
 4 Entry-level      Research Scientist             20
 5 Entry-level      Machine Learning Engineer      12
 6 Entry-level      Research Engineer              12
 7 Entry-level      Applied Scientist               8
 8 Entry-level      Research Analyst                8
 9 Entry-level      Data Integration Specialist     6
10 Entry-level      Analytics Engineer              5
# ℹ 32 more rows

Multi- Bar Chart expressing Career Titles by Experience Level

ggplot(toptitle_experience, aes(x= reorder(job_title, n), y = (n), fill= experience_level)) +
  geom_col(position = "dodge", show.legend = FALSE)+
  facet_wrap(~experience_level, scales = "free_y") +
  scale_y_log10()+
  coord_flip() +
  scale_fill_manual(values = c(
    "Entry-level" = "lightblue",
    "Mid-level" = "steelblue", 
    "Senior" = "purple", 
    "Executive"= "maroon")) +
  labs( title = "Top Career Titles by Experience Level",
        x = "Job Titles",
        y= "# of Total Jobs in Experience Level")

We realized that the data for the top job tittles are fairly similar for most roles. In term of entry and mid level individuals there is a need for support roles usually in which Data science is used to assist in another individuals role( Research Scientist/Engineers). There is significant increase of Business Intelligence Engineers for individuals with in the Senior level which suggests a need for skill that focus on Dashboards & using data to implement Data Strategy. There is also jobs requesting the role of Data Architect as companies may ask senior employees to work on database systems for an entire company, which requires extreme proficiency in SQL & nonSQL data structures.

Executive roles are obvious,as they focus on leadership and running a department of individuals for a company/group this includes Head of Data,Data Science Manager, Data Lead. Although there more general roles like Data Engineer/Scientist and Analyst than we initially expected. This could be a fault of the raw data or this could also imply a high value role that do not expressly converted to traditional director role.

4. Average Salary by Experience Level

Now we calculate the average salary in U.S. dollars for each experience level.

salary_by_experience <- jobs_clean %>%
  group_by(experience_level) %>%
  summarise(
    average_salary_usd = mean(salary_in_usd, na.rm = TRUE)
  ) %>%
  arrange(desc(average_salary_usd))

salary_by_experience
# A tibble: 4 × 2
  experience_level average_salary_usd
  <chr>                         <dbl>
1 Executive                   195627.
2 Senior                      166163.
3 Mid-level                   129678.
4 Entry-level                 103009.

Bar Chart: Average Salary by Experience Level

ggplot(salary_by_experience, aes(x = reorder(experience_level, average_salary_usd), y = average_salary_usd, fill = experience_level)) +
  geom_col() +
  geom_text(aes(label = paste0("$",(round(average_salary_usd)))), 
            hjust = 1.1,
            size = 4,
            color = "darkgray") +
  coord_flip() +
   scale_fill_manual(values = c(
    "Entry-level" = "lightblue",
    "Mid-level" = "steelblue", 
    "Senior" = "darkblue", 
    "Executive"= "maroon")) +
  labs(
    title = "Average Salary by Experience Level",
    x = "Experience Level",
    y = "Average Salary in USD"
  )

Checking for Outliers

Salary data often contains very high or very low values. These can affect the average, so we check for outliers.

Boxplot of Salary by Experience Level

ggplot(jobs_clean, aes(x = experience_level, y = salary_in_usd)) +
  geom_boxplot() +
  labs(
    title = "Salary Distribution by Experience Level",
    x = "Experience Level",
    y = "Salary in USD"
  )

Create a Salary Dataset Without Extreme Outliers

To reduce the effect of extreme salaries, we use the IQR method.

q1 <- quantile(jobs_clean$salary_in_usd, 0.25, na.rm = TRUE)
q3 <- quantile(jobs_clean$salary_in_usd, 0.75, na.rm = TRUE)
iqr_value <- q3 - q1

lower_bound <- q1 - 1.5 * iqr_value
upper_bound <- q3 + 1.5 * iqr_value

#we keep only salaries within that range.
jobs_no_outliers <- jobs_clean %>%
  filter(salary_in_usd >= lower_bound, salary_in_usd <= upper_bound)

Recalculate Average Salary Without Outliers

salary_no_outliers <- jobs_no_outliers %>%
  group_by(experience_level) %>%
  summarise(
    average_salary_usd = mean(salary_in_usd, na.rm = TRUE)
  ) %>%
  arrange(desc(average_salary_usd))

salary_no_outliers
# A tibble: 4 × 2
  experience_level average_salary_usd
  <chr>                         <dbl>
1 Executive                   182913.
2 Senior                      162279.
3 Mid-level                   128885.
4 Entry-level                 103009.

Bar Chart: Average Salary by Experience Level Without Outliers

ggplot(salary_no_outliers, aes(x = reorder(experience_level, average_salary_usd), y = average_salary_usd, fill= experience_level)) +
  geom_col() +
  geom_text(aes(label = paste0("$",(round(average_salary_usd)))), 
            hjust = 1.1,
            size = 4,
            color = "darkgray") +
  coord_flip() +
     scale_fill_manual(values = c(
    "Entry-level" = "lightblue",
    "Mid-level" = "steelblue", 
    "Senior" = "darkblue", 
    "Executive"= "maroon")) +
  labs(
    title = "Average Salary by Experience Level (Without Extreme Outliers)",
    x = "Experience Level",
    y = "Average Salary in USD"
  )

Pie Chart of Experience Level Distribution

We also create a pie chart to show how job demand is distributed across experience levels.

ggplot(experience_counts, aes(x = "", y = n, fill = experience_level)) +
  geom_col(width = 1) +
  coord_polar("y", start = 0) +
  labs(
    title = "Distribution of Job Demand by Experience Level",
    fill = "Experience Level"
  ) +
   scale_fill_manual(values = c(
    "Entry-level" = "lightblue",
    "Mid-level" = "steelblue", 
    "Senior" = "darkblue", 
    "Executive"= "red")) +
  theme(axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank())

Distribution of Job Categories by Experience Level

Next, we compare job categories across experience levels.

category_experience <- jobs_clean %>%
  count(job_category, experience_level, sort = TRUE)

head(category_experience)
# A tibble: 6 × 3
  job_category              experience_level     n
  <chr>                     <chr>            <int>
1 Data Science and Research Senior            2099
2 Data Engineering          Senior            1466
3 Machine Learning and AI   Senior             986
4 Data Analysis             Senior             821
5 Data Science and Research Mid-level          365
6 Data Engineering          Mid-level          337

Stacked Bar Chart: Job Categories by Experience Level

jobs_clean %>%
  count(job_category, experience_level) %>%
  group_by(job_category) %>%
  summarise(total = sum(n)) %>%
  slice_max(order_by = total, n = 8) %>%
  left_join(
    jobs_clean %>% count(job_category, experience_level),
    by = "job_category"
  ) %>%
  ggplot(aes(x = reorder(job_category, total), y = n, fill = experience_level)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Job Categories by Experience Level",
    x = "Job Category",
    y = "Number of Job Postings",
    fill = "Experience Level"
  )

Salary By Category

We can organize the salary by category to see which fields are more in demand, and thus recruiters are willing to pay individuals higher.

ggplot(experience_counts, aes(x = reorder(experience_level, n), y = n)) +
  geom_segment(aes(xend = reorder(experience_level, n), y = 0, yend = n)) +
  geom_point(size = 3) +
  coord_flip() +
  labs(
    title = "Lollipop Chart of Experience Level Demand",
    x = "Experience Level",
    y = "Number of Job Postings"
  )

salary_by_category <- jobs_clean %>%
  group_by(job_category) %>%
  summarise(
    average_salary_usd = mean(salary_in_usd, na.rm = TRUE)
  ) %>%
  arrange(desc(average_salary_usd))

salary_by_category
# A tibble: 10 × 2
   job_category                   average_salary_usd
   <chr>                                       <dbl>
 1 Machine Learning and AI                   191054.
 2 Data Science and Research                 173070.
 3 Data Architecture and Modeling            160576.
 4 Cloud and Database                        155000 
 5 Data Engineering                          154231.
 6 Leadership and Management                 148341.
 7 BI and Visualization                      139048.
 8 Data Analysis                             113702.
 9 Data Quality and Operations               104350.
10 Data Management and Strategy              103986.
ggplot(salary_by_category,aes(x = reorder(job_category, average_salary_usd), y = average_salary_usd)) +
  geom_segment(aes(xend = reorder(job_category, average_salary_usd), y = 0, yend = average_salary_usd)) +
  geom_point(size = 3) +
  geom_text(aes(label = paste0("$",(round(average_salary_usd)))), 
            hjust = 1.1,
            size = 4,
            color = "darkred") +
  coord_flip() +
  labs(
    title = "Average Salary by Job Field",
    x = "Job Field",
    y = "Average Salary in USD($)"
  )+
  theme_light()

As of the data shown, Machine Leaning/AI & Data Science Research tends to have the highest average salary. This is a reasonable conclusion due to the boom in Artificial Intelligence Research , so it is unclear if the amount of salary will increase. We have also seen that data quality is significantly lower which might be because it is usually considered and early career role. We need to note that since most of the openings are for senior positions the data is very skewed.

Findings and Analysis

After cleaning the data and creating the visualizations, we looked at the patterns in the dataset to better understand the experience level that companies are looking for in data related jobs.

First, we counted how many job postings exist for each experience level. From the charts we created, we can see that mid-level and senior-level roles appear the most often in the dataset. Entry-level roles appear less frequently and executive level roles appear the least. This may suggest that many companies prefer hiring people who already have some experience in the field.

We also looked at the salary differences between experience levels. The results show that salary tends to increase as the experience level increases. Entry level roles usually have lower salaries, while senior and executive roles have higher salaries. This is expected because more experienced workers usually have stronger technical knowledge and more responsibility.

Another thing we looked at was the distribution of job categories. Some data related fields appear more often than others in the dataset. When we compare job categories with experience levels, we can see that some areas such as machine learning and AI related jobs appear more often in senior level positions. This may be because these roles usually require more advanced knowledge.

We also looked at the distribution of Career titles in this data, by breaking down and clearer specific job titles, we can see the roles that employers are looking for in the industry. We found that most employers looking for early career applicants are focusing on roles for Data Engineer/Scientist but also a large amount of support roles for science and research. For Experienced candidates there is a search for business orientated roles such as Business Intelligence Specialist and Data Architect.

Overall, the dataset suggests that experience level is an important factor in the data job market. Many employers seem to look for candidates who already have some experience working with data tools and technologies. There is a small amount of employers who prefer a specific type of experience including (AWS, Cloud, etc.)

Since the dataset does not include specific technical skills like Python or SQL, we used experience level and career roles and titles as a proxy for skill demand. Even though this does not directly measure skills, it still helps us understand the overall type of skills and backgrounds employers desire in data related jobs.