I chose the AI Job Market dataset to explore because AI is on the rise, yet there are many competing ideas about what will happen next and whether AI is going to take over everyone's jobs. The rise of AI has also created high demand for jobs that require programming, machine learning, and data engineering skills. The data contains 2,000 AI-related job listings with details about each job: company name, industry, company size, job title, required skills, experience level, employment type (full-time, contract, internship, or remote), location, salary range, preferred tools, and posting date.
I want to show the relationship between jobs here in Arizona and their salaries, so the data cleaning is shown below, followed by the first few rows (the head) of the AI job postings.
# load the packages used below (assumes df has already been read in)
library(tidyverse)
# getting the min and max salary
df_num <- df %>%
  separate(salary_range_usd, into = c("min_salary", "max_salary"),
           sep = "-", remove = FALSE) %>%
  mutate(
    min_salary = as.numeric(min_salary),
    max_salary = as.numeric(max_salary)
  )
# get the valid states (built-in vector of the 50 U.S. state abbreviations)
valid_states <- state.abb
# extracting the states: two uppercase letters after ", " at the end of location
df_states <- df %>%
  mutate(state = str_extract(location, "(?<=, )[:upper:]{2}$")) %>%
  filter(!is.na(state), state %in% valid_states)
# filter top states for bar plot
state_counts <- df_states %>%
  count(state, sort = TRUE) %>%
  slice_max(n, n = 10) %>%
  as.data.frame()
head(df)
## job_id company_name industry job_title
## 1 1 Foster and Sons Healthcare Data Analyst
## 2 2 Boyd, Myers and Ramirez Tech Computer Vision Engineer
## 3 3 King Inc Tech Quant Researcher
## 4 4 Cooper, Archer and Lynch Tech AI Product Manager
## 5 5 Hall LLC Finance Data Scientist
## 6 6 Ellis PLC E-commerce AI Product Manager
## skills_required
## 1 NumPy, Reinforcement Learning, PyTorch, Scikit-learn, GCP, FastAPI
## 2 Scikit-learn, CUDA, SQL, Pandas
## 3 MLflow, FastAPI, Azure, PyTorch, SQL, GCP
## 4 Scikit-learn, C++, Pandas, LangChain, AWS, R
## 5 Excel, Keras, SQL, Hugging Face
## 6 GCP, Excel, Scikit-learn, MLflow
## experience_level employment_type location salary_range_usd
## 1 Mid Full-time Tracybury, AR 92860-109598
## 2 Senior Full-time Lake Scott, CU 78523-144875
## 3 Entry Full-time East Paige, CM 124496-217204
## 4 Mid Full-time Perezview, FI 50908-123743
## 5 Senior Contract North Desireeland, NE 98694-135413
## 6 Senior Remote South Kevin, TZ 92632-180718
## posted_date company_size tools_preferred
## 1 2025-08-20 Large KDB+, LangChain
## 2 2024-03-22 Large FastAPI, KDB+, TensorFlow
## 3 2025-09-18 Large BigQuery, PyTorch, Scikit-learn
## 4 2024-05-08 Large TensorFlow, BigQuery, MLflow
## 5 2025-02-24 Large PyTorch, LangChain
## 6 2025-08-07 Large PyTorch, TensorFlow, FastAPI
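The salary split used in the cleaning step can be tried in isolation on a few rows. This is a minimal sketch, not the author's code: `toy`, `mid_salary`, and `salary_spread` are hypothetical names introduced here for illustration, and the three salary strings are copied from the head output above.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy table; the ranges are the first three values printed by head(df)
toy <- data.frame(
  job_id = 1:3,
  salary_range_usd = c("92860-109598", "78523-144875", "124496-217204"),
  stringsAsFactors = FALSE
)

toy_num <- toy %>%
  # split "min-max" into two character columns, keeping the original column
  separate(salary_range_usd, into = c("min_salary", "max_salary"),
           sep = "-", remove = FALSE) %>%
  mutate(
    min_salary = as.numeric(min_salary),
    max_salary = as.numeric(max_salary),
    mid_salary = (min_salary + max_salary) / 2,  # midpoint of the range
    salary_spread = max_salary - min_salary      # width of the range
  )

toy_num
```

Converting to numeric right after the split matters because `separate()` returns character columns by default, and plotting or averaging them would otherwise fail or sort alphabetically.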
# keep only the postings whose location contains ", AZ" (Arizona)
az <- df[grepl(", AZ", df$location), ]
az
## job_id company_name industry job_title
## 81 81 Johnson Inc Education AI Researcher
## 561 561 Dixon-Sanchez Tech AI Product Manager
## 597 597 Erickson-Hill Finance ML Engineer
## 1035 1035 Peterson Ltd E-commerce Quant Researcher
## 1206 1206 Richards-Adams Finance Quant Researcher
## 1272 1272 Johnson-Peters Tech NLP Engineer
## 1401 1401 Grant, Rosario and Williams Retail AI Researcher
## 1420 1420 Cook-Francis Automotive Quant Researcher
## 1569 1569 Walls, Young and Cook E-commerce Data Scientist
## 1682 1682 Schmitt, James and Campbell E-commerce Computer Vision Engineer
## skills_required
## 81 Reinforcement Learning, Python, Pandas, TensorFlow
## 561 Excel, Azure, Pandas, PyTorch, SQL
## 597 LangChain, Reinforcement Learning, CUDA, R, SQL, Excel
## 1035 Reinforcement Learning, Python, Pandas, Excel, LangChain, Azure
## 1206 FastAPI, C++, NumPy, Flask
## 1272 LangChain, GCP, NumPy, Python, R
## 1401 Keras, AWS, SQL, CUDA, Python, GCP
## 1420 SQL, GCP, Excel, CUDA, R
## 1569 Reinforcement Learning, Keras, MLflow, Flask, Pandas, AWS
## 1682 FastAPI, SQL, Pandas, TensorFlow, MLflow
## experience_level employment_type location salary_range_usd
## 81 Mid Internship Lake Kristen, AZ 111184-166588
## 561 Mid Remote Adamsshire, AZ 95822-165764
## 597 Senior Internship Aaronview, AZ 48914-60214
## 1035 Mid Contract Staffordstad, AZ 147838-240312
## 1206 Entry Contract Lake Kathleenville, AZ 67422-166675
## 1272 Entry Contract North Daniel, AZ 74842-147593
## 1401 Mid Remote Lake Ryanville, AZ 127351-220767
## 1420 Entry Remote Bridgesberg, AZ 94512-159116
## 1569 Senior Contract New Nicole, AZ 55477-100381
## 1682 Mid Contract Port Joanne, AZ 74357-107271
## posted_date company_size tools_preferred
## 81 2025-08-03 Large Scikit-learn, MLflow, PyTorch
## 561 2025-05-21 Startup PyTorch, TensorFlow, KDB+
## 597 2024-10-13 Mid KDB+
## 1035 2025-04-30 Large PyTorch, LangChain, TensorFlow
## 1206 2024-08-17 Large MLflow, LangChain
## 1272 2025-04-18 Startup LangChain
## 1401 2024-12-28 Large TensorFlow, MLflow, PyTorch
## 1420 2024-10-06 Startup BigQuery, KDB+
## 1569 2024-02-12 Startup MLflow
## 1682 2024-09-15 Large KDB+, TensorFlow, LangChain
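The histogram shows the distribution of maximum salaries across all the AI job postings. The blue density curve is a smoothed estimate of the same distribution, which makes its overall shape easier to see than the bars alone. The summary printed below describes that estimate: a Gaussian kernel with a bandwidth of about 8,210, computed from 2,000 postings with no missing values.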
## List of 8
## $ x : num [1:512] 29532 30007 30482 30957 31432 ...
## $ y : num [1:512] 8.40e-10 1.02e-09 1.23e-09 1.49e-09 1.79e-09 ...
## $ bw : num 8210
## $ n : int 2000
## $ old.coords: logi FALSE
## $ call : language density.default(x = df_num$max_salary, kernel = "gaussian")
## $ data.name : chr "df_num$max_salary"
## $ has.na : logi FALSE
## - attr(*, "class")= chr "density"
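The plotting code for this figure is not shown, so the following is only a sketch of one way to draw a histogram with a density overlay in base R. It assumes simulated salaries in place of `df_num$max_salary`; the mean, standard deviation, bin count, and axis labels are all made-up illustration values, not taken from the dataset.

```r
# Simulated stand-in for df_num$max_salary (hypothetical values, not the real data)
set.seed(1)
max_salary <- round(rnorm(2000, mean = 150000, sd = 40000))

# Kernel density estimate with a Gaussian kernel, matching the printed call
dens <- density(max_salary, kernel = "gaussian")

# Draw the histogram on the density scale so bars and curve share a y-axis
hist(max_salary, freq = FALSE, breaks = 30,
     main = "Distribution of Maximum Salary",
     xlab = "Maximum Salary (USD)")
lines(dens, col = "blue", lwd = 2)

# Inspect the estimate: grid points (x), density values (y), bandwidth (bw)
str(dens)
```

Setting `freq = FALSE` is the key design choice: with counts on the y-axis the density curve would sit flat along the bottom of the plot.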
## Top 10 Skills Required

I wanted to show the top 10 skills that these AI jobs require. Each posting lists several skills in a single skills_required column, so I first had to separate that column into one row per skill and then count the frequency of each skill.
skills_list <- df %>%
  separate_rows(skills_required, sep = ",\\s*")
top_skills <- skills_list %>%
  count(skills_required, sort = TRUE) %>%
  slice_head(n = 10)
top_skills
## # A tibble: 10 × 2
## skills_required n
## <chr> <int>
## 1 TensorFlow 452
## 2 Excel 432
## 3 Pandas 427
## 4 FastAPI 419
## 5 NumPy 416
## 6 Reinforcement Learning 414
## 7 Azure 413
## 8 Hugging Face 408
## 9 SQL 408
## 10 Keras 406
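The separate-then-count step above can be seen more clearly on a tiny example. This sketch uses a hypothetical three-row table (`toy`) with made-up skill lists; the `share` column is an addition for illustration and is not part of the original analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical postings, each with a comma-separated skill list
toy <- data.frame(
  job_id = 1:3,
  skills_required = c("SQL, Pandas", "SQL, CUDA, Pandas", "SQL"),
  stringsAsFactors = FALSE
)

# One row per (posting, skill); the regex also swallows the space after each comma
toy_long <- toy %>%
  separate_rows(skills_required, sep = ",\\s*")

# Frequency of each skill, plus the share of postings that mention it
skill_counts <- toy_long %>%
  count(skills_required, sort = TRUE) %>%
  mutate(share = n / nrow(toy))

skill_counts
```

Dividing by the number of postings rather than the number of skill rows gives a share that reads as "fraction of jobs asking for this skill".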
The scatter plot shows a positive relationship between the minimum and maximum salaries across all the AI job listings. Points are colored by industry, which lets us see which industries tend to offer higher or lower salaries.
df_num %>%
  ggplot(aes(x = min_salary, y = max_salary, color = industry)) +
  geom_point(alpha = 0.7, size = 2.5) +
  # black linear trend line drawn over the points
  geom_smooth(method = "lm", se = FALSE, color = "black", linewidth = 1.2) +
  labs(
    title = "Relationship Between Minimum and Maximum Salary (By Industry)",
    x = "Minimum Salary (USD)",
    y = "Maximum Salary (USD)",
    color = "Industry"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
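The relationship in the scatter plot can also be put into numbers with a correlation and a per-industry summary. The sketch below is an illustration only: it uses just the six rows printed by `head(df)` earlier, which are far too few to draw conclusions from, and the names `head_rows`, `salary_cor`, and `by_industry` are hypothetical. On the full data the same code would run on `df_num`.

```r
library(dplyr)

# The six postings printed by head(df); illustrative only, not representative
head_rows <- data.frame(
  industry   = c("Healthcare", "Tech", "Tech", "Tech", "Finance", "E-commerce"),
  min_salary = c(92860, 78523, 124496, 50908, 98694, 92632),
  max_salary = c(109598, 144875, 217204, 123743, 135413, 180718),
  stringsAsFactors = FALSE
)

# Strength of the linear relationship between the two salary bounds
salary_cor <- cor(head_rows$min_salary, head_rows$max_salary)

# Average bounds per industry, highest average maximum first
by_industry <- head_rows %>%
  group_by(industry) %>%
  summarise(
    postings = n(),
    mean_min = mean(min_salary),
    mean_max = mean(max_salary)
  ) %>%
  arrange(desc(mean_max))

salary_cor
by_industry
```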
I wanted to show the U.S. states with the highest number of AI-related jobs. I first extracted the two-letter abbreviation at the end of the location column and then kept only the values that are valid U.S. state abbreviations. After that I counted the postings for each state and drew a bar plot. We can see that the state with the most AI job postings is South Carolina and the state with the fewest is South Dakota. Arizona is also on the list, with only about 10 postings (the Arizona subset shown earlier has 10 rows).
# recount postings for every valid state (this replaces the earlier top-10 table)
state_counts <- df_states %>%
  count(state, sort = TRUE)
ggplot(state_counts, aes(x = reorder(state, n), y = n)) +
  geom_col(fill = "pink", color = "black", width = 0.7) +
  coord_flip() +
  labs(
    title = "Top U.S. States for AI Job Postings",
    x = "State",
    y = "Number of Job Postings"
  ) +
  theme_minimal()
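The state extraction described above can be checked on a handful of locations. This is a minimal sketch: the first three strings come from the tables printed earlier, "Remote" is a made-up value with no abbreviation, and the pattern uses `[A-Z]` in place of `[:upper:]` as a simplifying assumption.

```r
library(stringr)

# Three locations from the printed output plus one hypothetical value
locations <- c("Tracybury, AR", "Lake Scott, CU", "Lake Kristen, AZ", "Remote")

# Two capital letters at the end of the string, preceded by ", "
state <- str_extract(locations, "(?<=, )[A-Z]{2}$")

# Keep only codes that are real U.S. state abbreviations (state.abb is built in)
is_state <- !is.na(state) & state %in% state.abb

data.frame(locations, state, is_state)
```

Here "AR" and "AZ" pass the check while "CU" and the missing value are dropped, which is why the `state %in% valid_states` filter is needed in addition to the pattern match.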
Through this data manipulation we were able to see which states have the most AI job postings and which skills are required most often. We were also able to see the relationship between the minimum and maximum salaries of AI jobs and how it varies by industry.