Source: Statista - AI Market Growth, AI Tool Users, and Software Market Size datasets
Web Link: https://www.statista.com/
Who Collected the Data: Statista is a leading provider of market and consumer data, compiling statistics from over 22,500 sources, including market research reports, industry associations, and official statistics. The AI market data is drawn from major technology market research firms, including Gartner and IDC, and from Statista’s own research.
Why the Source is Reliable: Statista is widely used by Fortune 500 companies, academic institutions, and government agencies for market intelligence. It applies rigorous data verification processes and cites its original sources. The company has been operating since 2007 and has partnerships with major research organizations worldwide.
Demonstration of Real Data: These datasets represent actual market revenue figures (in billions of USD) and user adoption metrics (in millions of users) collected from the following:
- Enterprise software licensing data
- Market research surveys of businesses
- Industry financial reports
- Technology adoption studies
The data includes historical figures (2020-2024) that can be verified against published industry reports and forward-looking projections (2025-2030) based on industry growth models.
This dataset represents the financial scale and user adoption of AI technologies compared to the software industry. It contains three metrics: AI market revenue (billions USD), AI tool users (millions), and total software market size (billions USD).
Relevance: This data is important for understanding AI’s economic impact and provides the context for further analysis of job market trends. By comparing AI’s market growth to the total software industry, we can assess whether AI represents a sustainable growth sector or a temporary trend. This context is also important for interpreting job posting data in Part II.
Features to Join On: The primary join key will be Year (for the Statista datasets) and Date (converted to Year for the Indeed datasets in Part II).
Data Transformations Needed:
- The Statista datasets are in “wide” format (years as columns) and need to be converted to “long” format
- The Indeed datasets use daily dates that will need to be aggregated or converted to yearly data for joining
- Year formats need to be standardized
Potential Joining Challenges:
- The Statista files cover different spans (2020-2031 for the two AI series, 2016-2030 for the software market), and years from 2025 onward are projections
- Indeed data consists of historical daily observations
- Need to filter to overlapping years for meaningful comparisons
library(tidyverse)
library(lubridate)
library(scales)
library(knitr)
# AI Market Growth Data
ai_market <- read_csv("~/Downloads/Statista_AIMarketGrowth - Sheet1.csv")
# AI Tool Users Data
ai_users <- read_csv("~/Downloads/Statista_AIToolsUsers - Sheet1.csv")
# Software Market Size Data
software_market <- read_csv("~/Downloads/Statista_SoftwareMarketSize - Sheet1-2.csv")
# Display structure of each dataset
glimpse(ai_market)
Rows: 1
Columns: 13
$ Year <chr> "Total (Billions USD)"
$ `2020` <dbl> 16.87
$ `2021` <dbl> 36.09
$ `2022` <dbl> 23.61
$ `2023` <dbl> 25.6
$ `2024` <dbl> 34.9
$ `2025` <dbl> 46.99
$ `2026` <dbl> 62.62
$ `2027` <dbl> 84.25
$ `2028` <dbl> 114.16
$ `2029` <dbl> 159.28
$ `2030` <dbl> 223.52
$ `2031` <dbl> 307.56
glimpse(ai_users)
Rows: 1
Columns: 13
$ Year <chr> "AI Tool Users (millions)"
$ `2020` <dbl> 48.13
$ `2021` <dbl> 59.72
$ `2022` <dbl> 75.07
$ `2023` <dbl> 84.1
$ `2024` <dbl> 104.84
$ `2025` <dbl> 129.08
$ `2026` <dbl> 158.15
$ `2027` <dbl> 193.36
$ `2028` <dbl> 236.41
$ `2029` <dbl> 289.41
$ `2030` <dbl> 355.12
$ `2031` <dbl> 437.05
glimpse(software_market)
Rows: 5
Columns: 16
$ Year <chr> "Total (billions USD)", "Application Development Software", "En…
$ `2016` <dbl> 211.72, 48.91, 82.62, 53.65, 26.53
$ `2017` <dbl> 226.41, 53.63, 88.80, 55.72, 28.25
$ `2018` <dbl> 245.14, 59.18, 96.75, 58.91, 30.30
$ `2019` <dbl> 263.37, 64.24, 104.98, 62.08, 32.07
$ `2020` <dbl> 270.86, 65.38, 108.23, 64.02, 33.23
$ `2021` <dbl> 286.85, 70.60, 114.89, 66.80, 34.56
$ `2022` <dbl> 313.56, 78.27, 126.96, 71.29, 37.04
$ `2023` <dbl> 338.22, 85.66, 139.22, 74.21, 39.13
$ `2024` <dbl> 363.39, 91.95, 150.50, 80.08, 40.87
$ `2025` <dbl> 379.29, 97.64, 159.39, 80.63, 41.62
$ `2026` <dbl> 395.00, 103.28, 168.00, 81.26, 42.46
$ `2027` <dbl> 410.14, 108.95, 176.73, 81.28, 43.18
$ `2028` <dbl> 427.24, 114.83, 186.16, 82.26, 43.98
$ `2029` <dbl> 445.40, 121.37, 195.50, 83.73, 44.79
$ `2030` <dbl> 462.04, 127.32, 204.41, 84.83, 45.48
Column Names:
- ai_market: Contains a “Year” column (actually a category label) and columns for years 2020-2031 with revenue values
- ai_users: Contains a “Year” column (category label) and columns for years 2020-2031 with user count values
- software_market: Contains a “Year” column with category labels, including “Total (billions USD)” and four software subcategories, with year columns 2016-2030
Number of Rows: ai_market and ai_users each have 1 row and 13 columns (one label + 12 year columns). software_market has 5 rows and 16 columns (one label + 15 year columns).
Missing Values:
# Check for missing values in each dataset
cat("AI Market missing values:\n")
AI Market missing values:
sum(is.na(ai_market))
[1] 0
cat("\nAI Users missing values:\n")
AI Users missing values:
sum(is.na(ai_users))
[1] 0
cat("\nSoftware Market missing values:\n")
Software Market missing values:
sum(is.na(software_market))
[1] 0
Potential Issues:
1. Projection Data: Years 2025 onward are projections, not actual observations. This introduces uncertainty.
2. Data Format: The “wide” format requires transformation for analysis.
3. Aggregation Level: Data is at annual level only, limiting granularity.
4. Source Attribution: While Statista aggregates from multiple sources, the specific methodology for projections is not fully transparent.
Impact on Reliability: Despite the projection uncertainties, the historical data is based on actual market figures and can be considered reliable. The projections are useful for trend analysis but should be interpreted as what they are: estimates. For our analysis, we’ll focus primarily on historical data and clearly label any projections.
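One way to make the historical/projection split explicit is to carry a label column alongside the values. The sketch below is a minimal base R illustration; `ai_market_toy`, `last_observed_year`, and `Data_Type` are hypothetical names introduced here, and the 2024 cutoff is the assumption stated in this report.

```r
# Toy subset of the annual AI market series (values from the Statista file above)
ai_market_toy <- data.frame(
  Year = c(2023, 2024, 2025, 2026),
  AI_Market_Billions = c(25.60, 34.90, 46.99, 62.62)
)

# Assumption: 2024 is the last year of observed data, as stated in this report
last_observed_year <- 2024

# Label each row so projections can be filtered or styled differently in plots
ai_market_toy$Data_Type <- ifelse(
  ai_market_toy$Year <= last_observed_year, "Historical", "Projection"
)

print(ai_market_toy)
```

A label like this can later be mapped to a line type or used in a filter, so projected years are never mixed silently with observed ones.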
Source: Indeed Hiring Lab - Job Postings Data
Web Link: https://www.hiringlab.org/
Who Collected the Data: Indeed is one of the world’s largest job search engines, processing millions of job postings daily. The Indeed Hiring Lab is its economic research arm, which collects job posting data from the platform to produce labor market insights.
Why the Source is Reliable: Indeed hosts over 250 million unique visitors per month and aggregates job postings from thousands of companies. Hiring Lab data is frequently cited by major media outlets (Wall Street Journal, New York Times, CNBC) and government agencies (Federal Reserve, Bureau of Labor Statistics). The data represents actual employer behavior in near real time.
Demonstration of Real Data: This dataset contains:
- Daily snapshots of job posting volumes indexed to a baseline date
- Sector-specific metrics (Software Development, in our case)
- Share of job postings mentioning “AI” or related terms
- Daily data beginning in January 2019 (AI share) and February 2020 (sector index) and extending past 2024
The data is “real” because it’s derived from actual job postings that employers listed on Indeed’s platform, representing genuine hiring demand.
This dataset represents labor market demand in the software development sector and the adoption of AI-related skills in job requirements.
Two Important Files:
1. job-postings-sector-index: Tracks the volume of software development job postings over time (indexed to February 1, 2020 = 100)
2. ai-headline-share: Measures the percentage of all job postings that mention “AI” or related terms in job titles or descriptions
Relevance to First Dataset: While Statista shows the economic growth of AI (revenue and users), Indeed data shows the labor market response to that growth. If AI markets are growing, we’d expect to see:
- Increased demand for software developers (higher job posting index)
- Higher percentage of jobs requiring AI skills (higher AI share)
This provides validation that the market growth in Dataset 1 is translating into real employment opportunities.
Features to Join On:
- Primary join: Date (will be converted to Year/Month for aggregation)
- Filter criteria: countryName (“United States”) and sectorName (“Software Development”)
Between the Two Datasets (Statista + Indeed): We’ll need to aggregate Indeed’s daily data to annual averages to join with Statista’s annual data on Year.
Data Transformations Needed:
- Fix column names: dateString → date, countryName, sectorName, value
- Convert date strings to Date objects
- Filter to US and Software Development sector
- Aggregate daily data to annual averages for joining with Statista
- Ensure year formats match (numeric 2020, 2021, etc.)
Potential Challenges:
- Indeed data is daily; Statista is annual (need aggregation strategy)
- Indeed data does not extend to 2030 (the joined series ends in 2025)
- The “AI share” metric is a percentage, while Statista tracks absolute revenue
# Software sector job postings index
software_data <- read_csv("~/Downloads/job-postings-sector-index-2.csv")
# AI headline share data
ai_data <- read_csv("~/Downloads/ai-headline-share.csv")
# Display structure
glimpse(software_data)
Rows: 2,198
Columns: 8
$ `__typename` <chr> "HiringLabSectoralPosting", "HiringLabSectoralPosting", "…
$ dateString <date> 2020-02-01, 2020-02-02, 2020-02-03, 2020-02-04, 2020-02-…
$ countryCode <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "US…
$ countryName <chr> "United States", "United States", "United States", "Unite…
$ sectorCode <chr> "techsoftware", "techsoftware", "techsoftware", "techsoft…
$ sectorName <chr> "Software Development", "Software Development", "Software…
$ postingType <chr> "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TO…
$ value <dbl> 100.00, 99.84, 99.73, 99.55, 99.47, 99.46, 99.45, 99.52, …
glimpse(ai_data)
Rows: 2,526
Columns: 6
$ `__typename` <chr> "HiringLabNationalAI", "HiringLabNationalAI", "HiringLabN…
$ dateString <date> 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019-01-…
$ countryCode <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "US…
$ countryName <chr> "United States", "United States", "United States", "Unite…
$ aiType <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ value <dbl> 1.71, 1.71, 1.71, 1.71, 1.71, 1.71, 1.71, 1.72, 1.72, 1.7…
Column Names:
- software_data: __typename, dateString, countryCode, countryName, sectorCode, sectorName, postingType, value
- ai_data: __typename, dateString, countryCode, countryName, aiType, value
Number of Rows:
- software_data: 2,198 rows (approximately 6 years of daily data)
- ai_data: 2,526 rows (approximately 6.9 years of daily data)
Example Data: The software index starts at 100.00 in February 2020 and shows the dramatic impact of COVID-19 and subsequent recovery. AI share values start around 1.71% in 2019 and grow over time.
# Check for missing values
cat("Software data missing values:\n")
Software data missing values:
colSums(is.na(software_data))
__typename dateString countryCode countryName sectorCode sectorName
0 0 0 0 0 0
postingType value
0 0
cat("\nAI data missing values:\n")
AI data missing values:
colSums(is.na(ai_data))
__typename dateString countryCode countryName aiType value
0 0 0 0 2526 0
Potential Issues:
1. Missing Values: The aiType column in ai_data is entirely NA (may be reserved for future use)
2. Date Gaps: Need to verify if there are any missing dates in the time series
3. Index Baseline: Software index baseline (Feb 1, 2020) was chosen just before COVID-19, which may create unusual patterns
4. Sample Bias: Indeed data only reflects jobs posted on their platform, not the entire labor market
# Check for date gaps in software data (US, Software Development only)
software_us <- software_data %>%
  filter(countryName == "United States", sectorName == "Software Development") %>%
  arrange(dateString)
date_range <- seq(min(software_us$dateString), max(software_us$dateString), by = "day")
missing_dates <- setdiff(date_range, software_us$dateString)
cat("Number of missing dates:", length(missing_dates), "\n")
Number of missing dates: 0
cat("Date range:", format(min(software_us$dateString), "%Y-%m-%d"), "to",
    format(max(software_us$dateString), "%Y-%m-%d"))
Date range: 2020-02-01 to 2026-02-06
Impact on Reliability: The data quality is generally high. The check above found no missing dates in the software series, so no interpolation is needed here; gaps in a future extract could be interpolated. The COVID-19 baseline is a known factor that we’ll account for in interpretation. Indeed’s market coverage is substantial enough that their data is widely accepted as representative of broader labor market trends.
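If a later extract did contain gaps, linear interpolation is one simple option. The following is a minimal sketch on a toy series, not part of our pipeline; `observed`, `filled`, and the `interpolated` flag are hypothetical names, and the values are illustrative. It uses base R's `approx()`.

```r
# Toy daily series with one missing day (2020-02-03); values are illustrative
observed <- data.frame(
  date  = as.Date(c("2020-02-01", "2020-02-02", "2020-02-04")),
  value = c(100.00, 99.84, 99.55)
)

# Full daily calendar between the first and last observed dates
full_dates <- seq(min(observed$date), max(observed$date), by = "day")

# Linear interpolation on the numeric representation of the dates
filled <- data.frame(
  date  = full_dates,
  value = approx(
    x = as.numeric(observed$date),
    y = observed$value,
    xout = as.numeric(full_dates)
  )$y
)

# Mark which rows were interpolated so they can be reported separately
filled$interpolated <- !(filled$date %in% observed$date)

print(filled)
```

Keeping the `interpolated` flag makes it possible to report how many values were filled in and to rerun the annual averages with and without them.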
To create a comprehensive dataset, we’ll perform multiple joins:
# 1. AI Market Growth - wide to long
ai_market_long <- ai_market %>%
rename(Category = Year) %>%
pivot_longer(
cols = -Category,
names_to = "Year",
values_to = "AI_Market_Billions"
) %>%
mutate(Year = as.numeric(Year)) %>%
select(Year, AI_Market_Billions)
# 2. AI Tool Users - wide to long
ai_users_long <- ai_users %>%
rename(Category = Year) %>%
pivot_longer(
cols = -Category,
names_to = "Year",
values_to = "AI_Users_Millions"
) %>%
mutate(Year = as.numeric(Year)) %>%
select(Year, AI_Users_Millions)
# 3. Software Market Size - extract TOTAL only, then wide to long
software_total <- software_market %>%
filter(Year == "Total (billions USD)") %>%
rename(Category = Year) %>%
pivot_longer(
cols = -Category,
names_to = "Year",
values_to = "Software_Total_Billions"
) %>%
mutate(Year = as.numeric(Year)) %>%
select(Year, Software_Total_Billions)
# Join all three Statista datasets
statista_combined <- ai_market_long %>%
inner_join(ai_users_long, by = "Year") %>%
inner_join(software_total, by = "Year") %>%
arrange(Year)
glimpse(statista_combined)
Rows: 11
Columns: 4
$ Year <dbl> 2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027…
$ AI_Market_Billions <dbl> 16.87, 36.09, 23.61, 25.60, 34.90, 46.99, 62.6…
$ AI_Users_Millions <dbl> 48.13, 59.72, 75.07, 84.10, 104.84, 129.08, 15…
$ Software_Total_Billions <dbl> 270.86, 286.85, 313.56, 338.22, 363.39, 379.29…
Changes Made:
- Renamed “Year” column to “Category” to avoid confusion during pivot
- Used pivot_longer() to convert from wide to long format
- Converted Year to numeric type for proper joining
- Filtered software_market to only the “Total (billions USD)” row to get aggregate software market size
- Used inner_join() to combine all three datasets on Year
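Because inner_join() keeps only years present in every table, some years in the raw files drop out without warning. The sketch below, in base R, shows which years are affected; the year ranges are read off the glimpse() output above, and the object names are hypothetical.

```r
# Year columns as they appear in the three Statista files (from glimpse() above)
ai_market_years <- 2020:2031
ai_users_years  <- 2020:2031
software_years  <- 2016:2030

# An inner join on Year keeps only years present in all three tables
kept_years <- Reduce(intersect, list(ai_market_years, ai_users_years, software_years))

# Years silently dropped from each table by the inner join
dropped <- list(
  ai_market = setdiff(ai_market_years, kept_years),
  ai_users  = setdiff(ai_users_years, kept_years),
  software  = setdiff(software_years, kept_years)
)

print(range(kept_years))
print(dropped)
```

This matches the 11 rows reported for statista_combined: 2031 is lost from the AI series and 2016-2019 from the software market series.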
# Clean Software Development data
software_clean <- software_data %>%
filter(
countryName == "United States",
sectorName == "Software Development"
) %>%
mutate(date = as.Date(dateString)) %>%
select(date, software_index = value)
# Clean AI share data
ai_clean <- ai_data %>%
filter(countryName == "United States") %>%
mutate(date = as.Date(dateString)) %>%
select(date, ai_share = value)
# Join Indeed datasets on date
indeed_combined <- inner_join(ai_clean, software_clean, by = "date") %>%
arrange(date)
# Create normalized AI index (base = first value = 100)
indeed_combined <- indeed_combined %>%
mutate(
ai_index = (ai_share / first(ai_share)) * 100
)
glimpse(indeed_combined)
Rows: 2,130
Columns: 4
$ date <date> 2020-02-01, 2020-02-02, 2020-02-03, 2020-02-04, 2020-0…
$ ai_share <dbl> 1.93, 1.93, 1.93, 1.93, 1.93, 1.92, 1.92, 1.93, 1.93, 1…
$ software_index <dbl> 100.00, 99.84, 99.73, 99.55, 99.47, 99.46, 99.45, 99.52…
$ ai_index <dbl> 100.00000, 100.00000, 100.00000, 100.00000, 100.00000, …
Changes Made:
- Filtered both datasets to United States only
- Filtered software data to Software Development sector only
- Converted dateString to proper Date objects
- Renamed value column to meaningful names (software_index, ai_share)
- Used inner_join() to keep only overlapping dates
- Created ai_index to normalize AI share to base 100 (base = first joined date, February 1, 2020) for easier comparison with software index
# Aggregate Indeed data to annual averages
indeed_annual <- indeed_combined %>%
mutate(Year = year(date)) %>%
group_by(Year) %>%
summarize(
avg_software_index = mean(software_index, na.rm = TRUE),
avg_ai_share = mean(ai_share, na.rm = TRUE),
avg_ai_index = mean(ai_index, na.rm = TRUE),
observations = n(),
.groups = "drop"
)
glimpse(indeed_annual)
Rows: 6
Columns: 5
$ Year <dbl> 2020, 2021, 2022, 2023, 2024, 2025
$ avg_software_index <dbl> 78.92090, 148.42016, 195.70079, 90.46466, 69.58863,…
$ avg_ai_share <dbl> 1.722537, 2.363562, 2.829425, 1.838712, 2.207923, 3…
$ avg_ai_index <dbl> 89.25064, 122.46433, 146.60231, 95.27007, 114.40018…
$ observations <int> 335, 365, 365, 365, 366, 334
Changes Made:
- Extracted year from date using year() function
- Used group_by() and summarize() to calculate annual averages
- Included observation count to show how many daily data points contributed to each year
- Removed missing values from calculations using na.rm = TRUE
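The observation counts show that 2020 and 2025 are partial years, so their averages are not strictly comparable with full-year averages. A minimal base R sketch of how that could be flagged follows; the `coverage` data frame and its derived columns are hypothetical names, with counts copied from the summary above.

```r
# Daily observation counts per year, as reported in the annual summary above
coverage <- data.frame(
  Year = 2020:2025,
  observations = c(335, 365, 365, 365, 366, 334)
)

# Number of calendar days in each year (leap years have 366)
is_leap <- (coverage$Year %% 4 == 0 & coverage$Year %% 100 != 0) |
  coverage$Year %% 400 == 0
coverage$days_in_year <- ifelse(is_leap, 366, 365)

# Share of the year covered; values below 1 mark partial years
coverage$coverage_share <- coverage$observations / coverage$days_in_year
coverage$partial_year <- coverage$coverage_share < 1

print(coverage)
```

A flag like `partial_year` could be used to footnote affected rows in tables or to exclude them from year-over-year comparisons.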
# Join Statista and Indeed data on Year
final_combined <- statista_combined %>%
left_join(indeed_annual, by = "Year") %>%
arrange(Year)
# Display the final combined dataset
glimpse(final_combined)
Rows: 11
Columns: 8
$ Year <dbl> 2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027…
$ AI_Market_Billions <dbl> 16.87, 36.09, 23.61, 25.60, 34.90, 46.99, 62.6…
$ AI_Users_Millions <dbl> 48.13, 59.72, 75.07, 84.10, 104.84, 129.08, 15…
$ Software_Total_Billions <dbl> 270.86, 286.85, 313.56, 338.22, 363.39, 379.29…
$ avg_software_index <dbl> 78.92090, 148.42016, 195.70079, 90.46466, 69.5…
$ avg_ai_share <dbl> 1.722537, 2.363562, 2.829425, 1.838712, 2.2079…
$ avg_ai_index <dbl> 89.25064, 122.46433, 146.60231, 95.27007, 114.…
$ observations <int> 335, 365, 365, 365, 366, 334, NA, NA, NA, NA, …
# Show a sample of the data
kable(final_combined,
caption = "Combined AI Market and Job Postings Data (2020-2030)",
digits = 2)
| Year | AI_Market_Billions | AI_Users_Millions | Software_Total_Billions | avg_software_index | avg_ai_share | avg_ai_index | observations |
|---|---|---|---|---|---|---|---|
| 2020 | 16.87 | 48.13 | 270.86 | 78.92 | 1.72 | 89.25 | 335 |
| 2021 | 36.09 | 59.72 | 286.85 | 148.42 | 2.36 | 122.46 | 365 |
| 2022 | 23.61 | 75.07 | 313.56 | 195.70 | 2.83 | 146.60 | 365 |
| 2023 | 25.60 | 84.10 | 338.22 | 90.46 | 1.84 | 95.27 | 365 |
| 2024 | 34.90 | 104.84 | 363.39 | 69.59 | 2.21 | 114.40 | 366 |
| 2025 | 46.99 | 129.08 | 379.29 | 64.91 | 3.09 | 160.16 | 334 |
| 2026 | 62.62 | 158.15 | 395.00 | NA | NA | NA | NA |
| 2027 | 84.25 | 193.36 | 410.14 | NA | NA | NA | NA |
| 2028 | 114.16 | 236.41 | 427.24 | NA | NA | NA | NA |
| 2029 | 159.28 | 289.41 | 445.40 | NA | NA | NA | NA |
| 2030 | 223.52 | 355.12 | 462.04 | NA | NA | NA | NA |
Final Dataset Summary:
- Rows: 11 (covering years 2020-2030)
- Columns: 8 (Year, AI market metrics, software market metrics, job posting metrics)
- Join Type: Left join to preserve all Statista years, including future projections (2025-2030)
- Missing Data: Years 2026-2030 have NA values for Indeed data because the joined Indeed series ends in 2025. The 2020 and 2025 averages are based on partial years (335 and 334 daily observations).
Primary Variables:
| Variable | Description | Unit | Source |
|---|---|---|---|
| Year | Calendar year | Numeric (2020-2030) | Both |
| AI_Market_Billions | Total AI market revenue | Billions USD | Statista |
| AI_Users_Millions | Number of AI tool users | Millions of users | Statista |
| Software_Total_Billions | Total software industry revenue | Billions USD | Statista |
| avg_software_index | Average software job posting index | Index (Feb 1, 2020 = 100) | Indeed |
| avg_ai_share | Average % of job postings mentioning AI | Percentage (0-100) | Indeed |
| avg_ai_index | AI share normalized to base 100 | Index (Feb 1, 2020 = 100, the first joined date) | Indeed (calculated) |
| observations | Daily data points per year | Count | Indeed (calculated) |
Most Relevant Variables for Analysis:
AI_Market_Billions & AI_Users_Millions: These show the economic scale and adoption velocity of AI technology. The relationship between revenue and users can reveal whether growth is driven by more users or higher spending per user.
avg_software_index: This indicates overall demand for software developers. A rising index suggests strong hiring demand; a falling index suggests contraction.
avg_ai_share: This is the most direct measure of AI’s impact on the job market available to us: the percentage of all U.S. job postings that mention AI or related terms. Note that it is not restricted to software development postings.
Comparison of growth rates: By indexing all variables to base 100, we can compare the rate of change in market size vs. job demand.
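To illustrate the revenue-versus-users point above, revenue per user can be computed directly from the two Statista series. This is a rough sketch in base R: it assumes the two series describe comparable populations, which the source does not confirm, and `Revenue_Per_User_USD` is a name introduced here.

```r
# Annual values from the combined table in this report (historical years only)
ai <- data.frame(
  Year = 2020:2024,
  AI_Market_Billions = c(16.87, 36.09, 23.61, 25.60, 34.90),
  AI_Users_Millions  = c(48.13, 59.72, 75.07, 84.10, 104.84)
)

# Convert to common units (USD and users), then divide
ai$Revenue_Per_User_USD <- (ai$AI_Market_Billions * 1e9) /
  (ai$AI_Users_Millions * 1e6)

print(data.frame(Year = ai$Year, Revenue_Per_User_USD = round(ai$Revenue_Per_User_USD)))
```

A rising ratio would point to higher spending per user, while a flat or falling ratio alongside rising revenue would point to growth driven by new users.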
Codebook Links:
- Statista: Full methodology available at https://www.statista.com/markets/methodology/
- Indeed: Hiring Lab methodology at https://www.hiringlab.org/about/
- Note: Detailed variable-level codebooks are not publicly available for these commercial datasets, but source documentation explains data collection methods
Potential Biases and Limitations:
Our combined dataset reflects several important biases that must be considered:
Geographic Bias: The Indeed data is filtered to United States only, while Statista data represents global markets. This creates a mismatch - we’re comparing U.S. job market trends to global economic trends. This may mask important regional differences in AI adoption.
Platform Bias: Indeed data only captures jobs posted on their platform, which skews toward certain industries and company sizes. Smaller companies and certain sectors may be underrepresented. Additionally, companies with internal hiring processes or those using specialized tech recruiting platforms (like LinkedIn, Hired, or Triplebyte) are not fully represented.
Industry Bias: We filtered to “Software Development” jobs specifically, excluding other tech roles (data science, ML engineering, AI research) that might be even more AI-focused. This narrow focus may underestimate AI’s total impact on the job market.
Temporal Bias: The software job index is baselined to February 2020, immediately before COVID-19. This baseline choice amplifies the appearance of recovery and growth while masking pre-pandemic trends.
Projection Uncertainty: Years 2025-2030 in the Statista data are projections, not observations. These projections reflect current assumptions about AI growth that may not materialize.
Missing Features: Our dataset lacks:
- Salary information (which would show if AI skills command premium pay)
- Educational requirements (which would show if AI is creating or eliminating entry-level opportunities)
- Company size data (which would show if AI adoption differs by organization size)
- Demographic information about who is being hired
Alignment with Human Rights Principles:
Critical Concern: Our dataset cannot answer whether AI’s economic growth is creating broadly accessible opportunities or concentrating benefits among already-privileged groups. The lack of demographic and socioeconomic data prevents us from assessing whether this technological transition is exacerbating or reducing inequality.
In this section, we perform data cleaning, transformation, and exploratory analysis on our combined dataset.
Purpose: To enable direct comparison of growth rates across variables with different scales. By indexing all variables to their 2020 values = 100, we can see which metrics are growing fastest in percentage terms, not just absolute terms.
# Calculate growth indices (2020 = 100 for all variables)
growth_indexed <- final_combined %>%
mutate(
AI_Market_Index = (AI_Market_Billions / first(AI_Market_Billions)) * 100,
AI_Users_Index = (AI_Users_Millions / first(AI_Users_Millions)) * 100,
Software_Market_Index = (Software_Total_Billions / first(Software_Total_Billions)) * 100,
Software_Jobs_Index = (avg_software_index / first(avg_software_index[!is.na(avg_software_index)])) * 100,
AI_JobShare_Index = (avg_ai_share / first(avg_ai_share[!is.na(avg_ai_share)])) * 100
)
# Display the indexed data
kable(growth_indexed %>%
select(Year, AI_Market_Index, AI_Users_Index, Software_Market_Index,
Software_Jobs_Index, AI_JobShare_Index) %>%
filter(!is.na(Software_Jobs_Index)), # Only show years with Indeed data
caption = "Growth Indices (2020 = 100) for All Metrics",
digits = 1)
| Year | AI_Market_Index | AI_Users_Index | Software_Market_Index | Software_Jobs_Index | AI_JobShare_Index |
|---|---|---|---|---|---|
| 2020 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| 2021 | 213.9 | 124.1 | 105.9 | 188.1 | 137.2 |
| 2022 | 140.0 | 156.0 | 115.8 | 248.0 | 164.3 |
| 2023 | 151.7 | 174.7 | 124.9 | 114.6 | 106.7 |
| 2024 | 206.9 | 217.8 | 134.2 | 88.2 | 128.2 |
| 2025 | 278.5 | 268.2 | 140.0 | 82.2 | 179.5 |
Reasoning: This transformation allows us to answer questions like “Is the AI job share growing faster than the AI market size?” or “Are software job postings keeping pace with overall software market growth?” Without indexing, the different scales (billions vs percentages vs indices) make visual comparison difficult.
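As a worked check of the indexing arithmetic, the first three AI market values can be indexed by hand. The sketch below uses base R with values from the table above; the vector names are illustrative.

```r
# AI market revenue in billions USD for 2020-2022 (from the table above)
ai_market_billions <- c(`2020` = 16.87, `2021` = 36.09, `2022` = 23.61)

# Index each year to the 2020 value: index_t = value_t / value_2020 * 100
ai_market_index <- ai_market_billions / ai_market_billions[["2020"]] * 100

print(round(ai_market_index, 1))
```

An index of 213.9 for 2021 means the market was about 2.14 times its 2020 size, which is the same information as a 113.9% increase.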
Purpose: To classify years into distinct phases of AI development, allowing us to analyze how different growth phases correspond to job market changes.
# Create categorical variable for AI market growth phases
growth_phases <- growth_indexed %>%
mutate(
Growth_Phase = case_when(
Year <= 2021 ~ "Early Stage (2020-2021)",
Year >= 2022 & Year <= 2024 ~ "Rapid Expansion (2022-2024)",
Year >= 2025 ~ "Projected Maturity (2025+)",
TRUE ~ "Other"
),
Growth_Phase = factor(Growth_Phase,
levels = c("Early Stage (2020-2021)",
"Rapid Expansion (2022-2024)",
"Projected Maturity (2025+)"))
)
# Show breakdown by phase
kable(growth_phases %>%
group_by(Growth_Phase) %>%
summarize(
Years = paste(Year, collapse = ", "),
Avg_AI_Market_Growth = mean(AI_Market_Index, na.rm = TRUE),
Avg_Jobs_Growth = mean(Software_Jobs_Index, na.rm = TRUE),
.groups = "drop"
),
caption = "AI Growth Phases and Corresponding Job Market Trends",
digits = 1)
| Growth_Phase | Years | Avg_AI_Market_Growth | Avg_Jobs_Growth |
|---|---|---|---|
| Early Stage (2020-2021) | 2020, 2021 | 157.0 | 144.0 |
| Rapid Expansion (2022-2024) | 2022, 2023, 2024 | 166.2 | 150.3 |
| Projected Maturity (2025+) | 2025, 2026, 2027, 2028, 2029, 2030 | 682.5 | 82.2 |
Reasoning: The AI industry has gone through distinct phases:
- 2020-2021: Foundation models emerge, but enterprise adoption is limited
- 2022-2024: ChatGPT launches (Nov 2022), triggering rapid enterprise adoption
- 2025+: Projected maturation and mainstream integration
By categorizing years into phases, we can test whether job market responses lag behind or align with market growth phases. For example, did the “Rapid Expansion” phase (2022-2024) correspond to a spike in AI-related job postings?
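One simple way to probe the lag question is to compare same-year and one-year-lagged correlations of year-over-year changes. The sketch below is illustrative only: with five annual changes, the correlations are descriptive and not statistical evidence. It uses base R, values from the combined table, and a helper `pct_change()` defined here.

```r
# Annual levels for 2020-2025, from the combined table in this report
ai_market  <- c(16.87, 36.09, 23.61, 25.60, 34.90, 46.99)   # billions USD
jobs_index <- c(78.92, 148.42, 195.70, 90.46, 69.59, 64.91) # average index

# Percentage change from the previous year
pct_change <- function(x) (x[-1] / x[-length(x)] - 1) * 100
market_yoy <- pct_change(ai_market)   # changes for 2021-2025
jobs_yoy   <- pct_change(jobs_index)  # changes for 2021-2025

# Same-year relationship vs. job postings responding one year after the market
n <- length(market_yoy)
cor_same_year    <- cor(market_yoy, jobs_yoy)
cor_one_year_lag <- cor(market_yoy[-n], jobs_yoy[-1])

print(c(same_year = cor_same_year, one_year_lag = cor_one_year_lag))
```

If the lagged correlation were clearly stronger than the same-year one across a longer series, that would be consistent with hiring trailing market growth.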
Purpose: To measure the rate of growth from one year to the next, not just the cumulative growth. Changes in that rate help identify inflection points where AI adoption accelerated or decelerated.
# Calculate year-over-year percentage change
yoy_growth <- growth_phases %>%
arrange(Year) %>%
mutate(
AI_Market_YoY = (AI_Market_Billions / lag(AI_Market_Billions) - 1) * 100,
AI_Users_YoY = (AI_Users_Millions / lag(AI_Users_Millions) - 1) * 100,
Software_Market_YoY = (Software_Total_Billions / lag(Software_Total_Billions) - 1) * 100,
Software_Jobs_YoY = (avg_software_index / lag(avg_software_index) - 1) * 100,
AI_Share_YoY = (avg_ai_share / lag(avg_ai_share) - 1) * 100
)
# Display YoY growth rates
kable(yoy_growth %>%
select(Year, Growth_Phase, AI_Market_YoY, AI_Users_YoY,
Software_Jobs_YoY, AI_Share_YoY) %>%
filter(!is.na(AI_Market_YoY)),
caption = "Year-over-Year Growth Rates (%)",
digits = 1)
| Year | Growth_Phase | AI_Market_YoY | AI_Users_YoY | Software_Jobs_YoY | AI_Share_YoY |
|---|---|---|---|---|---|
| 2021 | Early Stage (2020-2021) | 113.9 | 24.1 | 88.1 | 37.2 |
| 2022 | Rapid Expansion (2022-2024) | -34.6 | 25.7 | 31.9 | 19.7 |
| 2023 | Rapid Expansion (2022-2024) | 8.4 | 12.0 | -53.8 | -35.0 |
| 2024 | Rapid Expansion (2022-2024) | 36.3 | 24.7 | -23.1 | 20.1 |
| 2025 | Projected Maturity (2025+) | 34.6 | 23.1 | -6.7 | 40.0 |
| 2026 | Projected Maturity (2025+) | 33.3 | 22.5 | NA | NA |
| 2027 | Projected Maturity (2025+) | 34.5 | 22.3 | NA | NA |
| 2028 | Projected Maturity (2025+) | 35.5 | 22.3 | NA | NA |
| 2029 | Projected Maturity (2025+) | 39.5 | 22.4 | NA | NA |
| 2030 | Projected Maturity (2025+) | 40.3 | 22.7 | NA | NA |
Reasoning: This calculation reveals the velocity of change. For example:
- If the AI market grows 50% in 2023 and 100% in 2024, the YoY growth shows acceleration
- If job postings grow 20% annually but the AI share of jobs grows 50% annually, it indicates AI is capturing a larger portion of new jobs
This helps us identify whether we’re in an acceleration phase (growth rate increasing) or deceleration phase (growth rate slowing).
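Acceleration can be made explicit as the change in the YoY rate between consecutive years, measured in percentage points. The following base R sketch applies this to the AI tool users series; the `acceleration` and `phase` names are introduced here for illustration.

```r
# AI tool users in millions, 2020-2024 (from the combined table in this report)
users <- c(`2020` = 48.13, `2021` = 59.72, `2022` = 75.07,
           `2023` = 84.10, `2024` = 104.84)

# Year-over-year growth in percent (the first year has no predecessor)
yoy <- (users[-1] / users[-length(users)] - 1) * 100

# Acceleration: change in the growth rate, in percentage points
acceleration <- diff(yoy)

# Positive values mean growth sped up; negative values mean it slowed
phase <- ifelse(acceleration > 0, "accelerating", "decelerating")

print(round(yoy, 1))
print(data.frame(
  Year = names(acceleration),
  pp_change = unname(round(acceleration, 1)),
  phase = unname(phase)
))
```

Reading the result by year shows where the growth rate itself rose or fell, which the cumulative index alone does not reveal.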
Purpose: To provide aggregate statistics for each growth phase, making it easier to compare phases and write conclusions.
# Create comprehensive summary by growth phase
phase_summary <- yoy_growth %>%
filter(!is.na(Software_Jobs_Index)) %>% # Only use years with actual job data
group_by(Growth_Phase) %>%
summarize(
Years_in_Phase = n(),
Avg_AI_Market_Billions = mean(AI_Market_Billions, na.rm = TRUE),
Total_AI_Market_Growth = max(AI_Market_Index, na.rm = TRUE) - min(AI_Market_Index, na.rm = TRUE),
Avg_Software_Jobs_Index = mean(avg_software_index, na.rm = TRUE),
Avg_AI_Job_Share = mean(avg_ai_share, na.rm = TRUE),
Max_YoY_AI_Market_Growth = max(AI_Market_YoY, na.rm = TRUE),
Max_YoY_Jobs_Growth = max(Software_Jobs_YoY, na.rm = TRUE),
.groups = "drop"
)
kable(phase_summary,
caption = "Summary Statistics by AI Growth Phase",
digits = 2)
| Growth_Phase | Years_in_Phase | Avg_AI_Market_Billions | Total_AI_Market_Growth | Avg_Software_Jobs_Index | Avg_AI_Job_Share | Max_YoY_AI_Market_Growth | Max_YoY_Jobs_Growth |
|---|---|---|---|---|---|---|---|
| Early Stage (2020-2021) | 2 | 26.48 | 113.93 | 113.67 | 2.04 | 113.93 | 88.06 |
| Rapid Expansion (2022-2024) | 3 | 28.04 | 66.92 | 118.58 | 2.29 | 36.33 | 31.86 |
| Projected Maturity (2025+) | 1 | 46.99 | 0.00 | 64.91 | 3.09 | 34.64 | -6.73 |
Reasoning: This aggregation allows us to make statements like:
- “During the Rapid Expansion phase, the AI market grew by an average of X% annually”
- “Job market response lagged by Y percentage points compared to market growth”
- “AI’s share of job postings increased most dramatically during the [phase] phase”
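For a statement such as "grew by an average of X% annually", a compound annual growth rate (CAGR) is one common choice; it is not computed in the summary table above. The sketch below defines a hypothetical helper `cagr()` in base R and applies it to values from the combined table.

```r
# Compound annual growth rate between a start and end value, in percent
# n_years is the number of yearly steps between the two observations
cagr <- function(start_value, end_value, n_years) {
  ((end_value / start_value)^(1 / n_years) - 1) * 100
}

# AI market revenue (billions USD) at the edges of the 2022-2024 phase
rapid_expansion_cagr <- cagr(start_value = 23.61, end_value = 34.90, n_years = 2)

# Same calculation for the full historical span, 2020-2024
historical_cagr <- cagr(start_value = 16.87, end_value = 34.90, n_years = 4)

print(round(c(rapid_expansion = rapid_expansion_cagr,
              historical = historical_cagr), 1))
```

Unlike a simple mean of YoY rates, CAGR depends only on the first and last values, so a single volatile year (such as 2021 in the AI market series) does not distort it.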
Purpose: To create a visual representation of how different metrics have grown relative to each other over time.
# Prepare data for visualization
growth_long <- growth_indexed %>%
filter(!is.na(Software_Jobs_Index)) %>% # Only years with job data
select(Year, AI_Market_Index, AI_Users_Index, Software_Market_Index,
Software_Jobs_Index, AI_JobShare_Index) %>%
pivot_longer(
cols = -Year,
names_to = "Metric",
values_to = "Index_Value"
) %>%
mutate(
Metric = case_when(
Metric == "AI_Market_Index" ~ "AI Market Revenue",
Metric == "AI_Users_Index" ~ "AI Tool Users",
Metric == "Software_Market_Index" ~ "Total Software Market",
Metric == "Software_Jobs_Index" ~ "Software Job Postings",
Metric == "AI_JobShare_Index" ~ "AI Share of Job Postings",
TRUE ~ Metric
),
Category = case_when(
Metric %in% c("AI Market Revenue", "AI Tool Users", "AI Share of Job Postings") ~ "AI Metrics",
Metric %in% c("Total Software Market", "Software Job Postings") ~ "Software Metrics",
TRUE ~ "Other"
)
)
# Create the visualization
ggplot(growth_long, aes(x = Year, y = Index_Value, color = Metric, linetype = Category)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2.5) +
labs(
title = "Comparative Growth: AI vs Software Industry (2020-2025)",
subtitle = "All metrics indexed to 2020 = 100",
x = "Year",
y = "Growth Index (2020 = 100)",
color = "Metric",
linetype = "Category"
) +
scale_color_manual(values = c(
"AI Market Revenue" = "#2E86AB",
"AI Tool Users" = "#A23B72",
"AI Share of Job Postings" = "#E63946",
"Total Software Market" = "#F18F01",
"Software Job Postings" = "#06A77D"
)) +
scale_y_continuous(labels = comma, breaks = seq(0, 500, 50)) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom",
legend.box = "vertical"
)Reasoning: This visualization reveals:
- AI metrics (blue/purple/red) show steeper growth than software metrics (orange/green); AI is a high-growth segment within a mature industry
- AI User Growth tracks closely with Market Revenue Growth, suggesting growth is driven by adoption, not just price increases
- Software Job Postings growth is moderate compared to AI market growth, suggesting:
- AI Share of Job Postings is growing, but not as much as market size, indicating AI is spreading across existing roles rather than creating entirely new role categories
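To make the "indexed to 2020 = 100" scale concrete, here is a minimal sketch of how a single series can be rebased to a base year. The data frame `toy` and its values are hypothetical and are not taken from the Statista or Indeed data; the sketch only illustrates the general technique, not necessarily the exact code used to build `growth_indexed`.

```r
library(dplyr)

# Hypothetical values for illustration only (not from the report's datasets)
toy <- data.frame(
  Year  = 2020:2024,
  Value = c(50, 60, 75, 100, 140)
)

# Rebase so the 2020 value becomes 100; every other year is then read as
# "percent of the 2020 level"
toy_indexed <- toy %>%
  arrange(Year) %>%
  mutate(Index = Value / Value[Year == 2020] * 100)

toy_indexed
```

Because every line starts at 100, the slopes on the chart compare relative growth rather than absolute size, which is what lets revenue (billions of USD), users (millions), and job posting indices share one axis.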
Purpose: To approximate the volume of AI-related software job postings (as an index, not a raw count) by combining the software job postings index with AI share %.
# Create AI-adjusted hiring metric
ai_adjusted <- yoy_growth %>%
  filter(!is.na(avg_software_index)) %>%
  mutate(
    AI_Adjusted_Hiring = avg_software_index * (avg_ai_share / 100)
  )
# Calculate growth in AI-adjusted hiring
ai_hiring_growth <- ai_adjusted %>%
  select(Year, Growth_Phase, avg_software_index, avg_ai_share, AI_Adjusted_Hiring) %>%
  mutate(
    AI_Hiring_YoY = (AI_Adjusted_Hiring / lag(AI_Adjusted_Hiring) - 1) * 100
  )
kable(ai_hiring_growth,
      caption = "AI-Adjusted Software Hiring Index",
      digits = 2,
      col.names = c("Year", "Growth Phase", "Software Jobs Index",
                    "AI Share (%)", "AI-Adjusted Hiring", "YoY Growth (%)"))

| Year | Growth Phase | Software Jobs Index | AI Share (%) | AI-Adjusted Hiring | YoY Growth (%) |
|---|---|---|---|---|---|
| 2020 | Early Stage (2020-2021) | 78.92 | 1.72 | 1.36 | NA |
| 2021 | Early Stage (2020-2021) | 148.42 | 2.36 | 3.51 | 158.05 |
| 2022 | Rapid Expansion (2022-2024) | 195.70 | 2.83 | 5.54 | 57.85 |
| 2023 | Rapid Expansion (2022-2024) | 90.46 | 1.84 | 1.66 | -69.96 |
| 2024 | Rapid Expansion (2022-2024) | 69.59 | 2.21 | 1.54 | -7.63 |
| 2025 | Projected Maturity (2025+) | 64.91 | 3.09 | 2.01 | 30.59 |
# Visualize AI-adjusted hiring
ggplot(ai_adjusted, aes(x = Year, y = AI_Adjusted_Hiring)) +
  geom_line(linewidth = 1.2, color = "#E63946") +
  geom_point(size = 3, color = "#E63946") +
  geom_area(alpha = 0.3, fill = "#E63946") +
  labs(
    title = "Estimated AI-Related Software Hiring Volume",
    subtitle = "Software Job Index × AI Share (%)",
    x = "Year",
    y = "AI-Adjusted Hiring Index",
    caption = "Higher values indicate more AI-related software jobs posted"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))

Reasoning: This metric approximates the volume of AI-related software job postings (as an index, not a raw count) by multiplying:
- Software job postings index (how many software jobs are posted, relative to the index baseline)
- AI share % (what portion mention AI)

Roughly, this answers: “How many AI-related software job postings are there?”

Key insights from this metric:
- If it’s growing faster than either component alone, it indicates both more jobs AND higher AI penetration
- If it plateaus despite AI market growth, it may point to supply constraints (not enough AI talent)
- Comparing its growth rate to AI market revenue growth reveals whether job creation is keeping pace with economic growth
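As a quick check on the multiplication described above, the sketch below recomputes the metric from the rounded values printed in the "AI-Adjusted Software Hiring Index" table and splits each year's growth into its two components. The names `tbl`, `check`, `index_factor`, and `share_factor` are ours, introduced for illustration; because the inputs are rounded to two decimals, results can differ from the table by a few tenths of a percentage point.

```r
library(dplyr)

# Rounded values copied from the "AI-Adjusted Software Hiring Index" table
tbl <- data.frame(
  Year               = 2020:2025,
  avg_software_index = c(78.92, 148.42, 195.70, 90.46, 69.59, 64.91),
  avg_ai_share       = c(1.72, 2.36, 2.83, 1.84, 2.21, 3.09)
)

check <- tbl %>%
  arrange(Year) %>%  # lag() relies on rows being in year order
  mutate(
    AI_Adjusted_Hiring = avg_software_index * (avg_ai_share / 100),
    # Year-over-year growth factor of each component (1 = no change)
    index_factor = avg_software_index / lag(avg_software_index),
    share_factor = avg_ai_share / lag(avg_ai_share),
    # Growth of a product equals the product of the growth factors
    AI_Hiring_YoY = (index_factor * share_factor - 1) * 100
  )

print(check, digits = 4)
```

Reading the factors side by side shows which component drives each year. In 2021 both factors are above 1 (about 1.88 for the job index and 1.37 for AI share), which is the "more jobs AND higher AI penetration" case. In 2025 the job index factor is below 1 while the AI share factor is well above 1, so the metric rises even though overall software postings fall.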
Topic: Analyzing the relationship between AI market growth & software development job market trends
Group Section: BG-5
Teaching Assistant: Lexeigh Kolakowski
Our group is investigating how the growth of AI technologies is reshaping the software development job market. Specifically, we’re examining whether AI’s economic expansion (measured by market revenue and user adoption) corresponds to changes in hiring demand & skill requirements in the software industry.
Lauren Hughes (Applied Math, lhughes@uw.edu) is interested in how actual market trends compare with what is portrayed in the news, and is particularly drawn to the mathematical modeling behind labor market indices & growth curves. Lauren is passionate about applying quantitative methods to real-world economic questions (and seeing how others work with data) and is pursuing a career in either Applied Mathematics/tech or Medicine, both of which emphasize accurate, readily conveyable information.
Jonah Calague (Informatics, jcalague@uw.edu) is interested in seeing how the growth of AI is transforming the software industry and influencing the job market.
Oliver Boctor N/A
This project primarily used concepts and tools covered in INFO 201 course materials, along with a few outside resources:
- dplyr verbs (filter, select, mutate, group_by, summarize, arrange)
- tidyr functions (pivot_longer, pivot_wider)
- inner_join, left_join concepts from relational data lectures
- ggplot2 with geom_line, geom_point, geom_area
- lubridate package for date extraction (year), plus base R as.Date for date conversion

Advanced ggplot2 Customization: For creating more polished visualizations, we consulted:
- Resource: https://ggplot2-book.org/themes.html
- Used for: Custom color palettes with scale_color_manual(), adjusting legend positioning, and aesthetics :)
Index Calculation Methodology: To properly calculate growth indices, we referenced:
- Resource: Indeed Hiring Lab methodology documentation (https://www.hiringlab.org/about/)
- Used for: Understanding how to properly baseline indices and calculate year-over-year growth rates
Many YouTube videos - Along with some 2007 coding forums…
For completeness, we include visualizations of the daily-level Indeed data before aggregation to annual averages.
# Reload and clean daily Indeed data for visualization
software_daily <- software_data %>%
  filter(countryName == "United States", sectorName == "Software Development") %>%
  mutate(date = as.Date(dateString))
ai_daily <- ai_data %>%
  filter(countryName == "United States") %>%
  mutate(date = as.Date(dateString))
# Join daily data
indeed_daily <- inner_join(
  ai_daily %>% select(date, ai_share = value),
  software_daily %>% select(date, software_index = value),
  by = "date"
) %>%
  arrange(date) %>% # Make sure first() picks the earliest date
  mutate(ai_index = (ai_share / first(ai_share)) * 100)
# Plot both series on a single shared index axis
p1 <- ggplot(indeed_daily, aes(x = date)) +
  geom_line(aes(y = ai_index, color = "AI Mentions (Indexed)"), linewidth = 0.8) +
  geom_line(aes(y = software_index, color = "Software Job Postings"), linewidth = 0.8) +
  labs(
    title = "Daily Trends: AI Job Mentions vs Software Job Postings (U.S.)",
    subtitle = "Both series indexed to first observation = 100",
    x = "Date",
    y = "Index (First Date = 100)",
    color = ""
  ) +
  scale_color_manual(values = c("AI Mentions (Indexed)" = "#E63946",
                                "Software Job Postings" = "#06A77D")) +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 12)
  )
print(p1)
# Calculate and plot AI-adjusted daily hiring
indeed_daily <- indeed_daily %>%
  mutate(ai_adjusted_hiring = software_index * (ai_share / 100))
p2 <- ggplot(indeed_daily, aes(x = date, y = ai_adjusted_hiring)) +
  geom_line(color = "#A23B72", linewidth = 0.8) +
  geom_smooth(method = "loess", se = TRUE, color = "#2E86AB", fill = "#2E86AB", alpha = 0.2) +
  labs(
    title = "Estimated Daily AI-Related Software Hiring Volume",
    subtitle = "Software Job Index × AI Share (%) with smoothed trend",
    x = "Date",
    y = "AI-Adjusted Hiring Index"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12))
print(p2)

Observations from Daily Data:
- High volatility: Daily data shows significant day-to-day fluctuations, which is why we aggregated to annual averages for the main analysis
- Seasonal patterns: There appear to be weekly and potentially seasonal patterns in job posting volume (fewer postings around holidays)
- COVID-19 impact: The software job posting index shows a dramatic drop in March-April 2020, followed by a recovery
- Accelerating AI mentions: The AI share trend line shows relatively steady growth from 2019-2022, then accelerates sharply in 2023-2024
- Smoothed trend: The LOESS smoothing reveals the underlying trend while filtering out noise, making the acceleration post-2022 even more apparent
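The first observation above mentions aggregating daily values to annual averages. Here is a minimal sketch of that step on a synthetic daily series; the data frame `daily`, its values, and the column names `avg_value` and `n_days` are made up for illustration and are not Indeed data. The resulting `Year` column is the kind of key that can then be joined to annual Statista figures.

```r
library(dplyr)
library(lubridate)

# Synthetic daily series for illustration only (not Indeed data):
# a slow upward trend plus a repeating 7-day pattern
daily <- data.frame(
  date = seq(as.Date("2022-01-01"), as.Date("2023-12-31"), by = "day")
)
day_number  <- seq_len(nrow(daily))
daily$value <- 100 + 0.05 * day_number + 5 * sin(2 * pi * day_number / 7)

# Collapse to one row per calendar year
annual <- daily %>%
  mutate(Year = year(date)) %>%
  group_by(Year) %>%
  summarise(
    avg_value = mean(value),  # annual average smooths out the weekly swings
    n_days    = n(),          # how many daily observations went into each average
    .groups   = "drop"
  )

annual
```

Keeping `n_days` alongside the average is a small design choice: a year with only a few months of observations would otherwise look comparable to a full year in the annual table.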