For this project, our goal is to answer the question: Which data science skills are most valued? Since this is an exploring project, there is no single correct answer. Instead, we will investigate patterns in available data to understand what the data suggest about demand in the data science job market.
We selected a dataset of data-related job postings from Kaggle. The dataset contains information about job titles, categories, salaries, experience levels, company size, work setting and company location.
We noticed that the dataset does not include a direct column listing specific technical skills such as Python or SQL. Because of this, we will analyze experience level requirements in job postings as an indicator of the level of expertise.
This dataset contains information about jobs in the data field, including roles such as Data Scientist, Data Analyst, Machine Learning Engineer, and other related positions.
The dataset includes also the following items:
• work_year
• job_title
• job_category
• salary_currency
• salary
• salary_in_usd
• employee_residence
• experience_level
• employment_type
• work_setting
• company_location
• company_size
It provides structured information about data related jobs,also including the level of experience required and salary information.I believe these are useful for analysis.
Defining Our Measure of Demand
The project question is which data science skills are most valued. However, because our dataset does not list technical skills, we will need to define an operational proxy for skill demand.
In this project, we use experience level required by employers as a proxy for skill demand.
For example:
Entry level jobs require basic knowledge of data tools and programming.
Mid level jobs often require stronger technical skills and practical experience.
Senior level jobs require advanced technical expertise and leadership abilities.
Data Acquisition and Storage
The dataset was from Kaggle and downloaded as a CSV file. The data will be loaded into R. The dataset will be stored in a database using normalized tables. The data will be separated into logical components so that redundancy is reduced.
The database structure will include tables like:
A jobs table containing job related information such as job title, category and work year.
A company table containing company information such as location and company size.
Additional attributes such as salary and experience level will be stored in the jobs table.
Data Preparation and Cleaning
All data preparation and transformations will be analyzed using R, and packages from the tidyverse.
The data preparation process will include:
Loading the dataset into R
Inspecting the dataset structure and variable types
Checking for missing values
Converting variables into appropriate formats
Exploratory Data Analysis
After cleaning the data, we will perform data analysis to understand patterns in the dataset.
The analysis will include:
Counting the number of jobs for each experience level
Examining the distribution of job categories
Calculating average salary by experience level
Visualizing these patterns using charts
For example, we will create bar charts showing the number of job postings by experience level to identify which levels appear most frequently.
Visualization Plan
For the visualization for level of seniority in the job market will be shown by
Bar charts showing the frequency of job postings by experience level
Charts comparing salaries across experience levels
Visual summaries of job categories within the dataset:
Pie chart dictating the distribution of job demand by required experience level between Entry-Level, Mid-Level , Senior and Executive
Distribution of Data Science fields based on Experience Level: As an example, Machine Learning & AI posting often lean toward senior level roles.
“Lollipop” chart in order to see the demand for each career skill in each point of a data scientist professional career.
Exploratory Data Analysis:
#Data mentions job requests from all over the world with the majority in America. For the purpose of our analysis on data science skills, we will focus on the jobs offered in the United States.
In addition to proxy data, we will use and collect the most effective skills in each subcategory provided in our Kaggle dataset. The most common skills requested in said field will be to use a supplement for analysis. For example BI Development requires statistical analysis, data mining and creation of dashboards using programs such as Tableau.
Collaboration Tools
We plan to use:
GitHub for storing the project code and dataset
Shared documents for writing the project approach and notes
RStudio for performing the data analysis and visualization
Google slides to present graphs for visual analysis
Expected Findings
Through this analysis, we expect to identify patterns in the demand for different levels of experience in data-related jobs.
For example, we may find that:
Mid-level and senior-level roles appear more frequently in job postings
Entry-level roles appear less often
Higher experience levels are associated with higher salaries
These findings can provide insight into how experience and expertise are valued in the data science job market.
Project #3 Code Base Analysis
Code Base
Load Libraries
We load the libraries. I use tidyverse for cleaning and analyzing the data, DBI and RSQLite for creating a relational database, and ggplot2 for visualizations
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DBI)library(RSQLite)
Load the Dataset
We load the CSV file directly from GitHub using the raw link.
Rows: 9355 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): job_title, job_category, salary_currency, employee_residence, exper...
dbl (3): work_year, salary, salary_in_usd
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A first look at the dataset
head(jobs_raw)
# A tibble: 6 × 12
work_year job_title job_category salary_currency salary salary_in_usd
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 2023 Data DevOps Engin… Data Engine… EUR 88000 95012
2 2023 Data Architect Data Archit… USD 186000 186000
3 2023 Data Architect Data Archit… USD 81800 81800
4 2023 Data Scientist Data Scienc… USD 212000 212000
5 2023 Data Scientist Data Scienc… USD 93300 93300
6 2023 Data Scientist Data Scienc… USD 130000 130000
# ℹ 6 more variables: employee_residence <chr>, experience_level <chr>,
# employment_type <chr>, work_setting <chr>, company_location <chr>,
# company_size <chr>
Inspect the Structure of the Data
We check the structure and variable types to understand what kind of data we are working with.
These are: - EN = Entry-level - MI = Mid-level - SE = Senior-level - EX = Executive-level
Clean and Prepare the Data
Now we clean the dataset and these steps next: - rename experience level values - focus on jobs in the United States - make some text values easier to read
# A tibble: 6 × 4
company_id company_location company_size work_setting
<int> <chr> <chr> <chr>
1 1 United States Medium In-person
2 2 United States Medium Remote
3 3 United States Large In-person
4 4 United States Small In-person
5 5 United States Large Remote
6 6 United States Large Hybrid
head(jobs_table)
# A tibble: 6 × 11
job_id work_year job_title job_category salary_currency salary salary_in_usd
<int> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1 2023 Data Archi… Data Archit… USD 186000 186000
2 2 2023 Data Archi… Data Archit… USD 81800 81800
3 3 2023 Data Scien… Data Scienc… USD 212000 212000
4 4 2023 Data Scien… Data Scienc… USD 93300 93300
5 5 2023 Data Scien… Data Scienc… USD 130000 130000
6 6 2023 Data Scien… Data Scienc… USD 100000 100000
# ℹ 4 more variables: employee_residence <chr>, experience_level <chr>,
# employment_type <chr>, company_id <int>
Create the SQLite Database
We create a SQLite database file and write both normalized tables into it.
company_id company_location company_size work_setting
1 1 United States Medium In-person
2 2 United States Medium Remote
3 3 United States Large In-person
4 4 United States Small In-person
5 5 United States Large Remote
6 6 United States Large Hybrid
dbReadTable(con, "jobs") %>%head()
job_id work_year job_title job_category
1 1 2023 Data Architect Data Architecture and Modeling
2 2 2023 Data Architect Data Architecture and Modeling
3 3 2023 Data Scientist Data Science and Research
4 4 2023 Data Scientist Data Science and Research
5 5 2023 Data Scientist Data Science and Research
6 6 2023 Data Scientist Data Science and Research
salary_currency salary salary_in_usd employee_residence experience_level
1 USD 186000 186000 United States Senior
2 USD 81800 81800 United States Senior
3 USD 212000 212000 United States Senior
4 USD 93300 93300 United States Senior
5 USD 130000 130000 United States Senior
6 USD 100000 100000 United States Senior
employment_type company_id
1 Full-time 1
2 Full-time 1
3 Full-time 1
4 Full-time 1
5 Full-time 2
6 Full-time 2
dbDisconnect(con)
Exploratory Data Analysis
After cleaning and storing the data, we begin the exploratory analysis.
1.Count the Number of Jobs by Experience Level
We first count how many job postings belong to each experience level.
Bar Chart: Frequency of Job Postings by Experience Level
This bar chart shows which experience level appears most often.
ggplot(experience_counts, aes(x =reorder(experience_level, n), y = n)) +geom_col() +coord_flip() +labs(title ="Frequency of U.S. Job Postings by Experience Level",x ="Experience Level",y ="Number of Job Postings" )
2. Distribution of Job Categories
Next, we examine which job categories appear the most.
# A tibble: 10 × 2
job_category n
<chr> <int>
1 Data Science and Research 2635
2 Data Engineering 1977
3 Data Analysis 1252
4 Machine Learning and AI 1190
5 Leadership and Management 442
6 BI and Visualization 288
7 Data Architecture and Modeling 237
8 Data Management and Strategy 56
9 Data Quality and Operations 50
10 Cloud and Database 5
Bar Chart: Top Job Categories
job_category_counts %>%slice_max(order_by = n, n =10) %>%ggplot(aes(x =reorder(job_category, n), y = n)) +geom_col() +coord_flip() +labs(title ="Top 10 Job Categories in U.S. Data Jobs",x ="Job Category",y ="Number of Job Postings" )
3. Career Titles based on Experience & Demand
Another way to express the difference in skills demanded is by showing the difference in Career titles. Multiple people can work under the same field and these but require unique skills sets, thus categories do not express the whole picture. We can organize the most common job title requested by recruiters. Then we can outline the type of experienced needed in each field by using secondary sources to figure out what unique skills and experiences are commonly needed in each category.
#Data below shows the top ten job title expressed from the datajobs_clean %>%count(job_title, sort =TRUE)
# A tibble: 105 × 2
job_title n
<chr> <int>
1 Data Engineer 1937
2 Data Scientist 1720
3 Data Analyst 1219
4 Machine Learning Engineer 869
5 Applied Scientist 271
6 Research Scientist 244
7 Analytics Engineer 224
8 Data Architect 199
9 Business Intelligence Engineer 140
10 Research Engineer 131
# ℹ 95 more rows
job_title_count <- jobs_clean %>%count(job_title, sort =TRUE)%>%head(10)#Data on the least poular roles- Contains Overlap with prior set ups. jobs_clean %>%count(job_title, sort =TRUE) %>%tail(20)
# A tibble: 20 × 2
job_title n
<chr> <int>
1 Staff Data Scientist 2
2 AI Research Engineer 1
3 AWS Data Architect 1
4 BI Data Engineer 1
5 Big Data Engineer 1
6 Business Intelligence Data Analyst 1
7 Cloud Data Architect 1
8 Compliance Data Analyst 1
9 Data Analytics Lead 1
10 Data Science Tech Lead 1
11 Data Scientist Lead 1
12 Finance Data Analyst 1
13 Machine Learning Developer 1
14 Machine Learning Operations Engineer 1
15 Manager Data Management 1
16 Managing Director Data Science 1
17 Principal Data Analyst 1
18 Principal Machine Learning Engineer 1
19 Software Data Engineer 1
20 Staff Machine Learning Engineer 1
Although they are similar looking closely at the jobs we can see that difference experiences is needed. The top common fields in data science have different skills in which they focus on and different priorities.
Data Analyst/ Scientist :
Entry: Basic (SQL/ R/ Python) & Data Wrangling & Visualizations
Mid/High : Predictive Analytics and Models, Leading Research Initiatives
Data Engineer
Entry: ETL tools / Advanced Data Based knowledge and storage
Mid/High: Big Data Technologies, Cloud Servers,
Machine Learning Engineer :
Entry: Basic ML algorithms, Basic (Python/R),
Mid/High: “Deep Learning” Systems, Tensor Flow, Keras, Neural Networks, ML system development
BI/Business Intelligence Engineer
Entry: Data Analysis, Dashboard Creations(Tableu, SPSS) , Report Creation
Mid/High: Data Mining, Business Intelligence Strategy, Business Soft Skills (Presentations, Management, Proposals)
Bar Chart Expressing Frequency of Career Titles
ggplot(job_title_count, aes(x =reorder(job_title, n), y = n)) +geom_col(fill ="darkblue") +coord_flip()+labs(title ="Popular Career Titles by Frequency", x ="Category", y ="# of Jobs")
A best way to establish experience is to see the difference goal each individual will have based on their roles
Career Titles base on Salary
Job Titles Hiring Per Level
First we organize or data by Job title similar as before but instead we will go from the most popular title per experience level.
# A tibble: 42 × 3
experience_level job_title n
<chr> <chr> <int>
1 Entry-level Data Analyst 96
2 Entry-level Data Engineer 71
3 Entry-level Data Scientist 60
4 Entry-level Research Scientist 20
5 Entry-level Machine Learning Engineer 12
6 Entry-level Research Engineer 12
7 Entry-level Applied Scientist 8
8 Entry-level Research Analyst 8
9 Entry-level Data Integration Specialist 6
10 Entry-level Analytics Engineer 5
# ℹ 32 more rows
Multi- Bar Chart expressing Career Titles by Experience Level
ggplot(toptitle_experience, aes(x=reorder(job_title, n), y = (n), fill= experience_level)) +geom_col(position ="dodge", show.legend =FALSE)+facet_wrap(~experience_level, scales ="free_y") +scale_y_log10()+coord_flip() +scale_fill_manual(values =c("Entry-level"="lightblue","Mid-level"="steelblue", "Senior"="purple", "Executive"="maroon")) +labs( title ="Top Career Titles by Experience Level",x ="Job Titles",y="# of Total Jobs in Experience Level")
We realized that the data for the top job tittles are fairly similar for most roles. In term of entry and mid level individuals there is a need for support roles usually in which Data science is used to assist in another individuals role( Research Scientist/Engineers). There is significant increase of Business Intelligence Engineers for individuals with in the Senior level which suggests a need for skill that focus on Dashboards & using data to implement Data Strategy. There is also jobs requesting the role of Data Architect as companies may ask senior employees to work on database systems for an entire company, which requires extreme proficiency in SQL & nonSQL data structures.
Executive roles are obvious,as they focus on leadership and running a department of individuals for a company/group this includes Head of Data,Data Science Manager, Data Lead. Although there more general roles like Data Engineer/Scientist and Analyst than we initially expected. This could be a fault of the raw data or this could also imply a high value role that do not expressly converted to traditional director role.
4. Average Salary by Experience Level
Now we calculate the average salary in U.S. dollars for each experience level.
ggplot(salary_by_experience, aes(x =reorder(experience_level, average_salary_usd), y = average_salary_usd, fill = experience_level)) +geom_col() +geom_text(aes(label =paste0("$",(round(average_salary_usd)))), hjust =1.1,size =4,color ="darkgray") +coord_flip() +scale_fill_manual(values =c("Entry-level"="lightblue","Mid-level"="steelblue", "Senior"="darkblue", "Executive"="maroon")) +labs(title ="Average Salary by Experience Level",x ="Experience Level",y ="Average Salary in USD" )
Checking for Outliers
Salary data often contains very high or very low values. These can affect the average, so we check for outliers.
Boxplot of Salary by Experience Level
ggplot(jobs_clean, aes(x = experience_level, y = salary_in_usd)) +geom_boxplot() +labs(title ="Salary Distribution by Experience Level",x ="Experience Level",y ="Salary in USD" )
Create a Salary Dataset Without Extreme Outliers
To reduce the effect of extreme salaries, we use the IQR method.
# A tibble: 6 × 3
job_category experience_level n
<chr> <chr> <int>
1 Data Science and Research Senior 2099
2 Data Engineering Senior 1466
3 Machine Learning and AI Senior 986
4 Data Analysis Senior 821
5 Data Science and Research Mid-level 365
6 Data Engineering Mid-level 337
Stacked Bar Chart: Job Categories by Experience Level
jobs_clean %>%count(job_category, experience_level) %>%group_by(job_category) %>%summarise(total =sum(n)) %>%slice_max(order_by = total, n =8) %>%left_join( jobs_clean %>%count(job_category, experience_level),by ="job_category" ) %>%ggplot(aes(x =reorder(job_category, total), y = n, fill = experience_level)) +geom_col() +coord_flip() +labs(title ="Job Categories by Experience Level",x ="Job Category",y ="Number of Job Postings",fill ="Experience Level" )
Salary By Category
We can organize the salary by category to see which fields are more in demand, and thus recruiters are willing to pay individuals higher.
ggplot(experience_counts, aes(x =reorder(experience_level, n), y = n)) +geom_segment(aes(xend =reorder(experience_level, n), y =0, yend = n)) +geom_point(size =3) +coord_flip() +labs(title ="Lollipop Chart of Experience Level Demand",x ="Experience Level",y ="Number of Job Postings" )
# A tibble: 10 × 2
job_category average_salary_usd
<chr> <dbl>
1 Machine Learning and AI 191054.
2 Data Science and Research 173070.
3 Data Architecture and Modeling 160576.
4 Cloud and Database 155000
5 Data Engineering 154231.
6 Leadership and Management 148341.
7 BI and Visualization 139048.
8 Data Analysis 113702.
9 Data Quality and Operations 104350.
10 Data Management and Strategy 103986.
ggplot(salary_by_category,aes(x =reorder(job_category, average_salary_usd), y = average_salary_usd)) +geom_segment(aes(xend =reorder(job_category, average_salary_usd), y =0, yend = average_salary_usd)) +geom_point(size =3) +geom_text(aes(label =paste0("$",(round(average_salary_usd)))), hjust =1.1,size =4,color ="darkred") +coord_flip() +labs(title ="Average Salary by Job Field",x ="Job Field",y ="Average Salary in USD($)" )+theme_light()
As of the data shown, Machine Leaning/AI & Data Science Research tends to have the highest average salary. This is a reasonable conclusion due to the boom in Artificial Intelligence Research , so it is unclear if the amount of salary will increase. We have also seen that data quality is significantly lower which might be because it is usually considered and early career role. We need to note that since most of the openings are for senior positions the data is very skewed.
Findings and Analysis
After cleaning the data and creating the visualizations, we looked at the patterns in the dataset to better understand the experience level that companies are looking for in data related jobs.
First, we counted how many job postings exist for each experience level. From the charts we created, we can see that mid-level and senior-level roles appear the most often in the dataset. Entry-level roles appear less frequently and executive level roles appear the least. This may suggest that many companies prefer hiring people who already have some experience in the field.
We also looked at the salary differences between experience levels. The results show that salary tends to increase as the experience level increases. Entry level roles usually have lower salaries, while senior and executive roles have higher salaries. This is expected because more experienced workers usually have stronger technical knowledge and more responsibility.
Another thing we looked at was the distribution of job categories. Some data related fields appear more often than others in the dataset. When we compare job categories with experience levels, we can see that some areas such as machine learning and AI related jobs appear more often in senior level positions. This may be because these roles usually require more advanced knowledge.
We also looked at the distribution of Career titles in this data, by breaking down and clearer specific job titles, we can see the roles that employers are looking for in the industry. We found that most employers looking for early career applicants are focusing on roles for Data Engineer/Scientist but also a large amount of support roles for science and research. For Experienced candidates there is a search for business orientated roles such as Business Intelligence Specialist and Data Architect.
Overall, the dataset suggests that experience level is an important factor in the data job market. Many employers seem to look for candidates who already have some experience working with data tools and technologies. There is a small amount of employers who prefer a specific type of experience including (AWS, Cloud, etc.)
Since the dataset does not include specific technical skills like Python or SQL, we used experience level and career roles and titles as a proxy for skill demand. Even though this does not directly measure skills, it still helps us understand the overall type of skills and backgrounds employers desire in data related jobs.