AI Market Growth & Software Job Postings Analysis

Author

Lauren Hughes, Jonah Calague, Oliver Boctor

Published

February 16, 2026

Section I: First Data Set - Statista AI & Software Market Data

Data Source

Source: Statista - AI Market Growth, AI Tool Users, and Software Market Size datasets

Web Link: https://www.statista.com/

Who Collected the Data: Statista is a leading provider of market and consumer data, compiling statistics from over 22,500 sources, including market research reports, industry associations, and official statistics. The AI market data is drawn from major technology market research firms, including Gartner and IDC, as well as Statista’s own research.

Why the Source is Reliable: Statista is widely used by Fortune 500 companies, academic institutions, and government agencies for market intelligence. It employs rigorous data verification processes and cites its original sources. The company has been operating since 2007 and has partnerships with major research organizations worldwide.

Demonstration of Real Data: These datasets represent actual market revenue figures (in billions USD) and user adoption metrics (in millions of users) collected from the following:
  - Enterprise software licensing data
  - Market research surveys of businesses
  - Industry financial reports
  - Technology adoption studies

The data includes historical figures (2020-2024) that can be verified against published industry reports and forward-looking projections (2025-2030) based on industry growth models.

Data Meaning

This data set represents the financial scale and user adoption of AI technologies compared to the software industry. It contains three metrics:

  1. AI Market Revenue: Total global revenue generated by AI products and services (in billions $USD)
  2. AI Tool Users: Number of individuals and organizations actively using AI tools (in millions)
  3. Total Software Market Size: The entire software industry revenue globally (in billions $USD)

Relevance: This data is important for understanding AI’s economic impact and provides context for further analysis of job market trends. By comparing AI’s market growth to the total software industry, we can assess whether AI represents a sustainable growth sector or a temporary trend. This context is also important for interpreting the job posting data in Section II.

Data Join Strategy

Features to Join On: The primary join key will be Year (for the Statista datasets) and Date (converted to Year for the Indeed datasets in Section II).

Data Transformations Needed:
  - The Statista datasets are in “wide” format (years as columns) and need to be converted to “long” format
  - The Indeed datasets use daily dates that need to be aggregated to yearly values for joining
  - Year formats need to be standardized

Potential Joining Challenges:
  - The Statista AI datasets cover 2020-2031 and the software dataset covers 2016-2030 (both including projections)
  - Indeed data includes historical daily data
  - Need to filter to overlapping years for meaningful comparisons

Data Observations

Code
library(tidyverse)
library(lubridate)
library(scales)
library(knitr)
Code
# AI Market Growth Data
ai_market <- read_csv("~/Downloads/Statista_AIMarketGrowth - Sheet1.csv")

# AI Tool Users Data
ai_users <- read_csv("~/Downloads/Statista_AIToolsUsers - Sheet1.csv")

# Software Market Size Data
software_market <- read_csv("~/Downloads/Statista_SoftwareMarketSize - Sheet1-2.csv")

# Display structure of each dataset
glimpse(ai_market)
Rows: 1
Columns: 13
$ Year   <chr> "Total (Billions USD)"
$ `2020` <dbl> 16.87
$ `2021` <dbl> 36.09
$ `2022` <dbl> 23.61
$ `2023` <dbl> 25.6
$ `2024` <dbl> 34.9
$ `2025` <dbl> 46.99
$ `2026` <dbl> 62.62
$ `2027` <dbl> 84.25
$ `2028` <dbl> 114.16
$ `2029` <dbl> 159.28
$ `2030` <dbl> 223.52
$ `2031` <dbl> 307.56
Code
glimpse(ai_users)
Rows: 1
Columns: 13
$ Year   <chr> "AI Tool Users (millions)"
$ `2020` <dbl> 48.13
$ `2021` <dbl> 59.72
$ `2022` <dbl> 75.07
$ `2023` <dbl> 84.1
$ `2024` <dbl> 104.84
$ `2025` <dbl> 129.08
$ `2026` <dbl> 158.15
$ `2027` <dbl> 193.36
$ `2028` <dbl> 236.41
$ `2029` <dbl> 289.41
$ `2030` <dbl> 355.12
$ `2031` <dbl> 437.05
Code
glimpse(software_market)
Rows: 5
Columns: 16
$ Year   <chr> "Total (billions USD)", "Application Development Software", "En…
$ `2016` <dbl> 211.72, 48.91, 82.62, 53.65, 26.53
$ `2017` <dbl> 226.41, 53.63, 88.80, 55.72, 28.25
$ `2018` <dbl> 245.14, 59.18, 96.75, 58.91, 30.30
$ `2019` <dbl> 263.37, 64.24, 104.98, 62.08, 32.07
$ `2020` <dbl> 270.86, 65.38, 108.23, 64.02, 33.23
$ `2021` <dbl> 286.85, 70.60, 114.89, 66.80, 34.56
$ `2022` <dbl> 313.56, 78.27, 126.96, 71.29, 37.04
$ `2023` <dbl> 338.22, 85.66, 139.22, 74.21, 39.13
$ `2024` <dbl> 363.39, 91.95, 150.50, 80.08, 40.87
$ `2025` <dbl> 379.29, 97.64, 159.39, 80.63, 41.62
$ `2026` <dbl> 395.00, 103.28, 168.00, 81.26, 42.46
$ `2027` <dbl> 410.14, 108.95, 176.73, 81.28, 43.18
$ `2028` <dbl> 427.24, 114.83, 186.16, 82.26, 43.98
$ `2029` <dbl> 445.40, 121.37, 195.50, 83.73, 44.79
$ `2030` <dbl> 462.04, 127.32, 204.41, 84.83, 45.48

Column Names:
  - ai_market: Contains a “Year” column (actually a category label) and columns for years 2020-2031 with revenue values
  - ai_users: Contains a “Year” column (category label) and columns for years 2020-2031 with user count values
  - software_market: Contains a “Year” column with category labels, including “Total (billions USD)” and four software subcategories, with year columns 2016-2030

Number of Rows: ai_market and ai_users each have 1 row and 13 columns (one label column plus 12 year columns). software_market has 5 rows (the total plus four subcategories) and 16 columns (one label column plus 15 year columns).

Data Issues

Missing Values:

Code
# Check for missing values in each dataset
cat("AI Market missing values:\n")
AI Market missing values:
Code
sum(is.na(ai_market))
[1] 0
Code
cat("\nAI Users missing values:\n")

AI Users missing values:
Code
sum(is.na(ai_users))
[1] 0
Code
cat("\nSoftware Market missing values:\n")

Software Market missing values:
Code
sum(is.na(software_market))
[1] 0

Potential Issues:
  1. Projection Data: Years 2025-2030 are projections, not actual observations. This introduces uncertainty.
  2. Data Format: The “wide” format requires transformation for analysis
  3. Aggregation Level: Data is at annual level only, limiting granularity
  4. Source Attribution: While Statista aggregates from multiple sources, the specific methodology for projections is not fully transparent

Impact on Reliability: Despite the projection uncertainties, the historical data is based on actual market figures and can be considered reliable. The projections are useful for trend analysis but should be interpreted as estimates. For our analysis, we’ll focus primarily on historical data and clearly label any projections.
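
One way to keep projections clearly labeled is to add an explicit flag column once the data is in long format. The sketch below is illustrative only: the tibble `ai_market_example`, the column `Data_Type`, and the cutoff `first_projected_year` are names introduced here, with values copied from the glimpse() output above.

```r
library(dplyr)

# Hypothetical long-format excerpt; values copied from the ai_market output above
ai_market_example <- tibble(
  Year = c(2023, 2024, 2025, 2026),
  AI_Market_Billions = c(25.60, 34.90, 46.99, 62.62)
)

# Assumption: 2025 is the first projected year, as stated in this report
first_projected_year <- 2025

ai_market_flagged <- ai_market_example %>%
  mutate(Data_Type = if_else(Year >= first_projected_year, "Projected", "Historical"))

print(ai_market_flagged)
```

Carrying the flag through later joins would let tables and charts distinguish observed values from estimates without relying on the reader to remember the cutoff.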


Section II: Second Data Set - Indeed Job Postings Data

Data Source

Source: Indeed Hiring Lab - Job Postings Data

Web Link: https://www.hiringlab.org/

Who Collected the Data: Indeed is one of the world’s largest job search engines, processing millions of job postings daily. The Indeed Hiring Lab is their economic research section, which collects job posting data from their platform to produce labor market insights.

Why the Source is Reliable: Indeed hosts over 250 million unique visitors per month and aggregates job postings from thousands of companies. Their Hiring Lab data is frequently cited by major media outlets (such as the Wall Street Journal, New York Times, and CNBC) and by government agencies (such as the Federal Reserve and Bureau of Labor Statistics). The data reflects actual employer behavior in near real time.

Demonstration of Real Data: This dataset contains:
  - Daily snapshots of job posting volumes indexed to a baseline date
  - Sector-specific metrics (Software Development, in our case)
  - Share of job postings mentioning “AI” or related terms
  - Daily granularity, with the AI share series starting January 1, 2019 and the software index running from February 1, 2020 to February 6, 2026

The data is “real” because it’s derived from actual job postings that employers listed on Indeed’s platform, representing genuine hiring demand.

Data Meaning

This dataset represents labor market demand in the software development sector and the adoption of AI-related skills in job requirements.

Two Important Files:
  1. job-postings-sector-index: Tracks the volume of software development job postings over time (indexed to February 1, 2020 = 100)
  2. ai-headline-share: Measures the percentage of all job postings that mention “AI” or related terms in job titles or descriptions

Relevance to First Dataset: While Statista shows the economic growth of AI (revenue and users), Indeed data shows the labor market response to that growth. If AI markets are growing, we’d expect to see:
  - Increased demand for software developers (higher job posting index)
  - Higher percentage of jobs requiring AI skills (higher AI share)

This provides validation that the market growth in Dataset 1 is translating into real employment opportunities.

Data Join Strategy

Features to Join On:
  - Primary join: Date (will be converted to Year/Month for aggregation)
  - Filter criteria: countryName (“United States”) and sectorName (“Software Development”)

Between the Two Datasets (Statista + Indeed): We’ll need to aggregate Indeed’s daily data to annual averages to join with Statista’s annual data on Year.

Data Transformations Needed:
  - Rename columns for clarity (dateString to date; value to software_index or ai_share)
  - Convert date strings to Date objects
  - Filter to US and Software Development sector
  - Aggregate daily data to annual averages for joining with Statista
  - Ensure year formats match (numeric 2020, 2021, etc.)

Potential Challenges:
  - Indeed data is daily; Statista is annual (need aggregation strategy)
  - Indeed data does not extend to 2030 (the software index currently ends on February 6, 2026)
  - The “AI share” metric is a percentage, while Statista tracks absolute revenue
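
Annual averaging is one aggregation strategy; monthly averaging is a finer-grained alternative if a monthly view is needed later. A minimal sketch, assuming a daily tibble shaped like the cleaned Indeed data (the toy values and the names `daily_example` and `monthly_example` are made up for illustration):

```r
library(dplyr)
library(lubridate)

# Toy daily series (made-up values), two days in each of two months
daily_example <- tibble(
  date = as.Date(c("2020-02-01", "2020-02-02", "2020-03-01", "2020-03-02")),
  value = c(100, 98, 90, 92)
)

monthly_example <- daily_example %>%
  mutate(month = floor_date(date, unit = "month")) %>%  # first day of each month
  group_by(month) %>%
  summarize(avg_value = mean(value), days = n(), .groups = "drop")

print(monthly_example)
```

Keeping the `days` count alongside each average makes it easy to spot months with incomplete coverage.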

Data Observations

Code
# Software sector job postings index
software_data <- read_csv("~/Downloads/job-postings-sector-index-2.csv")

# AI headline share data
ai_data <- read_csv("~/Downloads/ai-headline-share.csv")

# Display structure
glimpse(software_data)
Rows: 2,198
Columns: 8
$ `__typename` <chr> "HiringLabSectoralPosting", "HiringLabSectoralPosting", "…
$ dateString   <date> 2020-02-01, 2020-02-02, 2020-02-03, 2020-02-04, 2020-02-…
$ countryCode  <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "US…
$ countryName  <chr> "United States", "United States", "United States", "Unite…
$ sectorCode   <chr> "techsoftware", "techsoftware", "techsoftware", "techsoft…
$ sectorName   <chr> "Software Development", "Software Development", "Software…
$ postingType  <chr> "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TOTAL", "TO…
$ value        <dbl> 100.00, 99.84, 99.73, 99.55, 99.47, 99.46, 99.45, 99.52, …
Code
glimpse(ai_data)
Rows: 2,526
Columns: 6
$ `__typename` <chr> "HiringLabNationalAI", "HiringLabNationalAI", "HiringLabN…
$ dateString   <date> 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019-01-…
$ countryCode  <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "US…
$ countryName  <chr> "United States", "United States", "United States", "Unite…
$ aiType       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ value        <dbl> 1.71, 1.71, 1.71, 1.71, 1.71, 1.71, 1.71, 1.72, 1.72, 1.7…

Column Names:
  - software_data: __typename, dateString, countryCode, countryName, sectorCode, sectorName, postingType, value
  - ai_data: __typename, dateString, countryCode, countryName, aiType, value

Number of Rows:
  - software_data: 2,198 rows (approximately 6.0 years of daily data)
  - ai_data: 2,526 rows (approximately 6.9 years of daily data)
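
The approximate spans can be derived from the row counts. This is a rough sketch that assumes one row per calendar day with no gaps, which this report verifies only for the software series:

```r
# Convert daily row counts to approximate years (365.25 days per year on average)
rows <- c(software_data = 2198, ai_data = 2526)
approx_years <- round(rows / 365.25, 1)
print(approx_years)
```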

Example Data: The software index starts at 100.00 in February 2020 and shows the dramatic impact of COVID-19 and subsequent recovery. AI share values start around 1.71% in 2019 and grow over time.

Data Issues

Code
# Check for missing values
cat("Software data missing values:\n")
Software data missing values:
Code
colSums(is.na(software_data))
 __typename  dateString countryCode countryName  sectorCode  sectorName 
          0           0           0           0           0           0 
postingType       value 
          0           0 
Code
cat("\nAI data missing values:\n")

AI data missing values:
Code
colSums(is.na(ai_data))
 __typename  dateString countryCode countryName      aiType       value 
          0           0           0           0        2526           0 

Potential Issues:
  1. Missing Values: The aiType column in ai_data is entirely NA (may be reserved for future use)
  2. Date Gaps: Need to verify if there are any missing dates in the time series
  3. Index Baseline: Software index baseline (Feb 1, 2020) was chosen just before COVID-19, which may create unusual patterns
  4. Sample Bias: Indeed data only reflects jobs posted on their platform, not the entire labor market

Code
# Check for date gaps in software data (US, Software Development only)
software_us <- software_data %>%
  filter(countryName == "United States", sectorName == "Software Development") %>%
  arrange(dateString)

date_range <- seq(min(software_us$dateString), max(software_us$dateString), by = "day")
missing_dates <- setdiff(date_range, software_us$dateString)

cat("Number of missing dates:", length(missing_dates), "\n")
Number of missing dates: 0 
Code
cat("Date range:", format(min(software_us$dateString), "%Y-%m-%d"), "to", 
    format(max(software_us$dateString), "%Y-%m-%d"))
Date range: 2020-02-01 to 2026-02-06

Impact on Reliability: The data quality is generally high. The check above found no missing dates in the software series, so no interpolation is needed there; the AI share series was not checked in the same way. The COVID-19 baseline is a known factor that we’ll account for in interpretation. Indeed’s market coverage is substantial enough that its data is widely used as an indicator of broader labor market trends.
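
The same gap check can be applied to any daily series, including the AI share data. A minimal sketch, assuming a vector of Date values; the helper name `find_missing_dates` and the toy dates are hypothetical:

```r
# Hypothetical helper: return calendar days absent from a vector of dates
find_missing_dates <- function(dates) {
  dates <- sort(unique(as.Date(dates)))
  full_range <- seq(min(dates), max(dates), by = "day")
  # Compare as character so the result does not depend on how setdiff treats Date class
  missing_chr <- setdiff(format(full_range), format(dates))
  as.Date(missing_chr)
}

# Toy example with one deliberate gap (2020-02-03 is absent)
toy_dates <- as.Date(c("2020-02-01", "2020-02-02", "2020-02-04"))
missing <- find_missing_dates(toy_dates)
print(missing)
```

Returning the dates themselves, rather than only a count, makes it possible to see whether any gaps cluster around particular periods.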


Section III: Joining the Data

Data Join Process

To create a comprehensive dataset, we’ll perform multiple joins:

  1. Within Statista data: Join AI market, AI users, and software market data on Year
  2. Within Indeed data: Join software postings and AI share data on Date
  3. Between sources: Aggregate Indeed data to annual level and join with Statista on Year

Step 1: Transform Statista Data from Wide to Long Format

Code
# 1. AI Market Growth - wide to long
ai_market_long <- ai_market %>%
  rename(Category = Year) %>%
  pivot_longer(
    cols = -Category,
    names_to = "Year",
    values_to = "AI_Market_Billions"
  ) %>%
  mutate(Year = as.numeric(Year)) %>%
  select(Year, AI_Market_Billions)

# 2. AI Tool Users - wide to long
ai_users_long <- ai_users %>%
  rename(Category = Year) %>%
  pivot_longer(
    cols = -Category,
    names_to = "Year",
    values_to = "AI_Users_Millions"
  ) %>%
  mutate(Year = as.numeric(Year)) %>%
  select(Year, AI_Users_Millions)

# 3. Software Market Size - extract TOTAL only, then wide to long
software_total <- software_market %>%
  filter(Year == "Total (billions USD)") %>%
  rename(Category = Year) %>%
  pivot_longer(
    cols = -Category,
    names_to = "Year",
    values_to = "Software_Total_Billions"
  ) %>%
  mutate(Year = as.numeric(Year)) %>%
  select(Year, Software_Total_Billions)

# Join all three Statista datasets
statista_combined <- ai_market_long %>%
  inner_join(ai_users_long, by = "Year") %>%
  inner_join(software_total, by = "Year") %>%
  arrange(Year)

glimpse(statista_combined)
Rows: 11
Columns: 4
$ Year                    <dbl> 2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027…
$ AI_Market_Billions      <dbl> 16.87, 36.09, 23.61, 25.60, 34.90, 46.99, 62.6…
$ AI_Users_Millions       <dbl> 48.13, 59.72, 75.07, 84.10, 104.84, 129.08, 15…
$ Software_Total_Billions <dbl> 270.86, 286.85, 313.56, 338.22, 363.39, 379.29…

Changes Made:
  - Renamed “Year” column to “Category” to avoid confusion during pivot
  - Used pivot_longer() to convert from wide to long format
  - Converted Year to numeric type for proper joining
  - Filtered software_market to only the “Total (billions USD)” row to get aggregate software market size
  - Used inner_join() to combine all three datasets on Year, which keeps only the years present in all three (2020-2030)
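
Because the Statista tables cover different year ranges in the glimpse() output (the AI tables have year columns 2020-2031, the software table 2016-2030), the inner joins drop non-overlapping years without warning. A small sketch of how those years could be listed; the stand-in tibbles below carry only the Year column and are assumptions for illustration:

```r
library(dplyr)

# Stand-ins with only the Year key; ranges taken from the glimpse() output
ai_years <- tibble(Year = 2020:2031)
software_years <- tibble(Year = 2016:2030)

# Years present in one table but absent from the other
dropped_from_ai <- anti_join(ai_years, software_years, by = "Year")
dropped_from_software <- anti_join(software_years, ai_years, by = "Year")

print(dropped_from_ai$Year)        # AI years with no software match
print(dropped_from_software$Year)  # software years with no AI match
```

Running an anti_join() before an inner_join() is a cheap way to confirm that the rows being discarded are the ones intended.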

Step 2: Transform and Join Indeed Data

Code
# Clean Software Development data
software_clean <- software_data %>%
  filter(
    countryName == "United States",
    sectorName == "Software Development"
  ) %>%
  mutate(date = as.Date(dateString)) %>%
  select(date, software_index = value)

# Clean AI share data
ai_clean <- ai_data %>%
  filter(countryName == "United States") %>%
  mutate(date = as.Date(dateString)) %>%
  select(date, ai_share = value)

# Join Indeed datasets on date
indeed_combined <- inner_join(ai_clean, software_clean, by = "date") %>%
  arrange(date)

# Create normalized AI index (base = first value = 100)
indeed_combined <- indeed_combined %>%
  mutate(
    ai_index = (ai_share / first(ai_share)) * 100
  )

glimpse(indeed_combined)
Rows: 2,130
Columns: 4
$ date           <date> 2020-02-01, 2020-02-02, 2020-02-03, 2020-02-04, 2020-0…
$ ai_share       <dbl> 1.93, 1.93, 1.93, 1.93, 1.93, 1.92, 1.92, 1.93, 1.93, 1…
$ software_index <dbl> 100.00, 99.84, 99.73, 99.55, 99.47, 99.46, 99.45, 99.52…
$ ai_index       <dbl> 100.00000, 100.00000, 100.00000, 100.00000, 100.00000, …

Changes Made:
  - Filtered both datasets to United States only
  - Filtered software data to Software Development sector only
  - Converted dateString to proper Date objects
  - Renamed value column to meaningful names (software_index, ai_share)
  - Used inner_join() to keep only overlapping dates
  - Created ai_index to normalize AI share to base 100 (base = February 1, 2020, the first date after the join) for easier comparison with software index

Step 3: Aggregate Indeed Data to Annual Level

Code
# Aggregate Indeed data to annual averages
indeed_annual <- indeed_combined %>%
  mutate(Year = year(date)) %>%
  group_by(Year) %>%
  summarize(
    avg_software_index = mean(software_index, na.rm = TRUE),
    avg_ai_share = mean(ai_share, na.rm = TRUE),
    avg_ai_index = mean(ai_index, na.rm = TRUE),
    observations = n(),
    .groups = "drop"
  )

glimpse(indeed_annual)
Rows: 6
Columns: 5
$ Year               <dbl> 2020, 2021, 2022, 2023, 2024, 2025
$ avg_software_index <dbl> 78.92090, 148.42016, 195.70079, 90.46466, 69.58863,…
$ avg_ai_share       <dbl> 1.722537, 2.363562, 2.829425, 1.838712, 2.207923, 3…
$ avg_ai_index       <dbl> 89.25064, 122.46433, 146.60231, 95.27007, 114.40018…
$ observations       <int> 335, 365, 365, 365, 366, 334

Changes Made:
  - Extracted year from date using year() function
  - Used group_by() and summarize() to calculate annual averages
  - Included observation count to show how many daily data points contributed to each year
  - Removed missing values from calculations using na.rm = TRUE
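
Annual averages for years with incomplete coverage are not directly comparable to full-year averages. A sketch of one way to flag them using the observation counts; the column names `days_in_year` and `Full_Year` are assumptions introduced here, and the counts are copied from the output above:

```r
library(dplyr)
library(lubridate)

# Observation counts copied from the indeed_annual output above
coverage <- tibble(
  Year = 2020:2025,
  observations = c(335L, 365L, 365L, 365L, 366L, 334L)
)

coverage_flagged <- coverage %>%
  mutate(
    days_in_year = if_else(leap_year(Year), 366L, 365L),
    Full_Year = observations == days_in_year
  )

print(coverage_flagged)
```

Years flagged as incomplete could be footnoted in tables or excluded from year-over-year comparisons.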

Step 4: Create Final Combined Dataset

Code
# Join Statista and Indeed data on Year
final_combined <- statista_combined %>%
  left_join(indeed_annual, by = "Year") %>%
  arrange(Year)

# Display the final combined dataset
glimpse(final_combined)
Rows: 11
Columns: 8
$ Year                    <dbl> 2020, 2021, 2022, 2023, 2024, 2025, 2026, 2027…
$ AI_Market_Billions      <dbl> 16.87, 36.09, 23.61, 25.60, 34.90, 46.99, 62.6…
$ AI_Users_Millions       <dbl> 48.13, 59.72, 75.07, 84.10, 104.84, 129.08, 15…
$ Software_Total_Billions <dbl> 270.86, 286.85, 313.56, 338.22, 363.39, 379.29…
$ avg_software_index      <dbl> 78.92090, 148.42016, 195.70079, 90.46466, 69.5…
$ avg_ai_share            <dbl> 1.722537, 2.363562, 2.829425, 1.838712, 2.2079…
$ avg_ai_index            <dbl> 89.25064, 122.46433, 146.60231, 95.27007, 114.…
$ observations            <int> 335, 365, 365, 365, 366, 334, NA, NA, NA, NA, …
Code
# Show a sample of the data
kable(final_combined, 
      caption = "Combined AI Market and Job Postings Data (2020-2030)",
      digits = 2)
Combined AI Market and Job Postings Data (2020-2030)
Year AI_Market_Billions AI_Users_Millions Software_Total_Billions avg_software_index avg_ai_share avg_ai_index observations
2020 16.87 48.13 270.86 78.92 1.72 89.25 335
2021 36.09 59.72 286.85 148.42 2.36 122.46 365
2022 23.61 75.07 313.56 195.70 2.83 146.60 365
2023 25.60 84.10 338.22 90.46 1.84 95.27 365
2024 34.90 104.84 363.39 69.59 2.21 114.40 366
2025 46.99 129.08 379.29 64.91 3.09 160.16 334
2026 62.62 158.15 395.00 NA NA NA NA
2027 84.25 193.36 410.14 NA NA NA NA
2028 114.16 236.41 427.24 NA NA NA NA
2029 159.28 289.41 445.40 NA NA NA NA
2030 223.52 355.12 462.04 NA NA NA NA

Final Dataset Summary:
  - Rows: 11 (covering years 2020-2030)
  - Columns: 8 (Year, AI market metrics, software market metrics, job posting metrics)
  - Join Type: Left join to preserve all Statista years, including projections (2025-2030)
  - Missing Data: Years 2026-2030 have NA values for Indeed data because the joined Indeed series ends in 2025; the 2020 and 2025 averages are based on partial years (335 and 334 daily observations)
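
The years without Indeed coverage can be read directly off the joined table rather than assumed. A sketch using a stand-in for final_combined (only two columns reproduced, with rounded values copied from the table above; `final_example` is a hypothetical name):

```r
library(dplyr)

# Stand-in for final_combined with just the key and one Indeed column
final_example <- tibble(
  Year = 2020:2030,
  avg_ai_share = c(1.72, 2.36, 2.83, 1.84, 2.21, 3.09, rep(NA_real_, 5))
)

years_without_indeed <- final_example %>%
  filter(is.na(avg_ai_share)) %>%
  pull(Year)

print(years_without_indeed)
```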

Data Documentation

Codebook and Variable Descriptions

Primary Variables:

Variable | Description | Unit | Source
Year | Calendar year | Numeric (2020-2030) | Both
AI_Market_Billions | Total AI market revenue | Billions USD | Statista
AI_Users_Millions | Number of AI tool users | Millions of users | Statista
Software_Total_Billions | Total software industry revenue | Billions USD | Statista
avg_software_index | Average software job posting index | Index (Feb 1, 2020 = 100) | Indeed
avg_ai_share | Average % of jobs mentioning AI | Percentage (0-100) | Indeed
avg_ai_index | AI share normalized to base 100 | Index (Feb 1, 2020 = 100) | Indeed (calculated)
observations | Daily data points per year | Count | Indeed (calculated)

Most Relevant Variables for Analysis:

  1. AI_Market_Billions & AI_Users_Millions: These show the economic scale and adoption velocity of AI technology. The relationship between revenue and users can reveal whether growth is driven by more users or higher spending per user.

  2. avg_software_index: This indicates overall demand for software developers. A rising index suggests strong hiring demand; a falling index suggests contraction.

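  3. avg_ai_share: This is the most direct measure of AI’s impact on the job market in our data: the percentage of all U.S. job postings that mention AI or related terms. Note that this series is national and is not restricted to software jobs.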

  4. Comparison of growth rates: By indexing all variables to base 100, we can compare the rate of change in market size vs. job demand.

Codebook Links:
  - Statista: Full methodology available at https://www.statista.com/markets/methodology/
  - Indeed: Hiring Lab methodology at https://www.hiringlab.org/about/
  - Note: Detailed variable-level codebooks are not publicly available for these commercial datasets, but source documentation explains data collection methods

Bias Statement

Potential Biases and Limitations:

Our combined dataset reflects several important biases that must be considered:

Geographic Bias: The Indeed data is filtered to United States only, while Statista data represents global markets. This creates a mismatch - we’re comparing U.S. job market trends to global economic trends. This may mask important regional differences in AI adoption.

Platform Bias: Indeed data only captures jobs posted on their platform, which skews toward certain industries and company sizes. Smaller companies and certain sectors may be underrepresented. Additionally, companies with internal hiring processes or those recruiting through other platforms (such as LinkedIn, Hired, or Triplebyte) are not fully represented.

Industry Bias: We filtered to “Software Development” jobs specifically, excluding other tech roles (data science, ML engineering, AI research) that might be even more AI-focused. This narrow focus may underestimate AI’s total impact on the job market.

Temporal Bias: The software job index is baselined to February 2020, immediately before COVID-19. This baseline choice amplifies the appearance of recovery and growth while masking pre-pandemic trends.

Projection Uncertainty: Years 2025-2030 in the Statista data are projections, not observations. These projections reflect current assumptions about AI growth that may not materialize.

Missing Features: Our dataset lacks:
  - Salary information (which would show if AI skills command premium pay)
  - Educational requirements (which would show if AI is creating or eliminating entry-level opportunities)
  - Company size data (which would show if AI adoption differs by organization size)
  - Demographic information about who is being hired

Alignment with Human Rights Principles:

  1. Privacy: Both datasets use aggregated data without individual identifiers
  2. Accountability: Limited - commercial datasets don’t fully disclose source attribution
  3. Safety and Security: No security concerns with market-level data
  4. Transparency and Explainability: Partial - methodologies are documented but proprietary algorithms are not disclosed
  5. Fairness and Non-Discrimination: Cannot assess - dataset lacks demographic breakdowns to evaluate equitable access to AI opportunities
  6. Human Control of Technology: N/A for market data
  7. Professional Responsibility: Data providers have commercial incentives that may influence presentation
  8. Promotion of Human Values: Dataset doesn’t capture whether AI growth is improving work quality, job satisfaction, or economic equity

Critical Concern: Our dataset cannot answer whether AI’s economic growth is creating broadly accessible opportunities or concentrating benefits among already-privileged groups. The lack of demographic and socioeconomic data prevents us from assessing whether this technological transition is exacerbating or reducing inequality.


Section IV: Data Wrangling

In this section, we perform data cleaning, transformation, and exploratory analysis on our combined dataset.

Wrangling Step 1: Create Growth Index for All Variables

Purpose: To enable direct comparison of growth rates across variables with different scales. By indexing all variables to their 2020 values = 100, we can see which metrics are growing fastest in percentage terms, not just absolute terms.

Code
# Calculate growth indices (2020 = 100 for all variables)
growth_indexed <- final_combined %>%
  mutate(
    AI_Market_Index = (AI_Market_Billions / first(AI_Market_Billions)) * 100,
    AI_Users_Index = (AI_Users_Millions / first(AI_Users_Millions)) * 100,
    Software_Market_Index = (Software_Total_Billions / first(Software_Total_Billions)) * 100,
    Software_Jobs_Index = (avg_software_index / first(avg_software_index[!is.na(avg_software_index)])) * 100,
    AI_JobShare_Index = (avg_ai_share / first(avg_ai_share[!is.na(avg_ai_share)])) * 100
  )

# Display the indexed data
kable(growth_indexed %>% 
        select(Year, AI_Market_Index, AI_Users_Index, Software_Market_Index, 
               Software_Jobs_Index, AI_JobShare_Index) %>%
        filter(!is.na(Software_Jobs_Index)),  # Only show years with Indeed data
      caption = "Growth Indices (2020 = 100) for All Metrics",
      digits = 1)
Growth Indices (2020 = 100) for All Metrics
Year AI_Market_Index AI_Users_Index Software_Market_Index Software_Jobs_Index AI_JobShare_Index
2020 100.0 100.0 100.0 100.0 100.0
2021 213.9 124.1 105.9 188.1 137.2
2022 140.0 156.0 115.8 248.0 164.3
2023 151.7 174.7 124.9 114.6 106.7
2024 206.9 217.8 134.2 88.2 128.2
2025 278.5 268.2 140.0 82.2 179.5

Reasoning: This transformation allows us to answer questions like “Is the AI job share growing faster than the AI market size?” or “Are software job postings keeping pace with overall software market growth?” Without indexing, the different scales (billions vs percentages vs indices) make visual comparison difficult.
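
A complementary way to compare growth across metrics is the compound annual growth rate (CAGR), which condenses the indexed change into a single annualized figure. This is not part of the original analysis; the helper `cagr` is a hypothetical name, and the inputs are the 2020 and 2025 values from the combined table above (note that the 2025 AI market figure is a projection and the 2025 AI share covers a partial year).

```r
# Hypothetical helper: compound annual growth rate between two values
cagr <- function(start_value, end_value, n_years) {
  (end_value / start_value)^(1 / n_years) - 1
}

# 2020 and 2025 values copied from the combined table above
ai_market_cagr <- cagr(16.87, 46.99, n_years = 5)  # AI market revenue
ai_share_cagr  <- cagr(1.72, 3.09, n_years = 5)    # AI share of job postings

print(round(100 * c(AI_Market = ai_market_cagr, AI_Share = ai_share_cagr), 1))
```

Unlike the index, CAGR depends only on the first and last values, so it hides the year-to-year swings visible in the indexed table.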

Wrangling Step 2: Create Categorical Variable for Growth Phase

Purpose: To classify years into distinct phases of AI development, allowing us to analyze how different growth phases correspond to job market changes.

Code
# Create categorical variable for AI market growth phases
growth_phases <- growth_indexed %>%
  mutate(
    Growth_Phase = case_when(
      Year <= 2021 ~ "Early Stage (2020-2021)",
      Year >= 2022 & Year <= 2024 ~ "Rapid Expansion (2022-2024)",
      Year >= 2025 ~ "Projected Maturity (2025+)",
      TRUE ~ "Other"
    ),
    Growth_Phase = factor(Growth_Phase, 
                          levels = c("Early Stage (2020-2021)", 
                                   "Rapid Expansion (2022-2024)", 
                                   "Projected Maturity (2025+)"))
  )

# Show breakdown by phase
kable(growth_phases %>%
        group_by(Growth_Phase) %>%
        summarize(
          Years = paste(Year, collapse = ", "),
          Avg_AI_Market_Growth = mean(AI_Market_Index, na.rm = TRUE),
          Avg_Jobs_Growth = mean(Software_Jobs_Index, na.rm = TRUE),
          .groups = "drop"
        ),
      caption = "AI Growth Phases and Corresponding Job Market Trends",
      digits = 1)
AI Growth Phases and Corresponding Job Market Trends
Growth_Phase Years Avg_AI_Market_Growth Avg_Jobs_Growth
Early Stage (2020-2021) 2020, 2021 157.0 144.0
Rapid Expansion (2022-2024) 2022, 2023, 2024 166.2 150.3
Projected Maturity (2025+) 2025, 2026, 2027, 2028, 2029, 2030 682.5 82.2

Reasoning: The AI industry has gone through distinct phases:
  - 2020-2021: Foundation models emerge, but enterprise adoption is limited
  - 2022-2024: ChatGPT launches (Nov 2022), triggering rapid enterprise adoption
  - 2025+: Projected maturation and mainstream integration

By categorizing years into phases, we can test whether job market responses lag behind or align with market growth phases. For example, did the “Rapid Expansion” phase (2022-2024) correspond to a spike in AI-related job postings?

Wrangling Step 3: Calculate Year-over-Year Growth Rates

Purpose: To measure the annual rate of growth, not just the cumulative growth. Comparing consecutive rates helps identify inflection points where AI adoption accelerated or decelerated.

Code
# Calculate year-over-year percentage change
yoy_growth <- growth_phases %>%
  arrange(Year) %>%
  mutate(
    AI_Market_YoY = (AI_Market_Billions / lag(AI_Market_Billions) - 1) * 100,
    AI_Users_YoY = (AI_Users_Millions / lag(AI_Users_Millions) - 1) * 100,
    Software_Market_YoY = (Software_Total_Billions / lag(Software_Total_Billions) - 1) * 100,
    Software_Jobs_YoY = (avg_software_index / lag(avg_software_index) - 1) * 100,
    AI_Share_YoY = (avg_ai_share / lag(avg_ai_share) - 1) * 100
  )

# Display YoY growth rates
kable(yoy_growth %>%
        select(Year, Growth_Phase, AI_Market_YoY, AI_Users_YoY, 
               Software_Jobs_YoY, AI_Share_YoY) %>%
        filter(!is.na(AI_Market_YoY)),
      caption = "Year-over-Year Growth Rates (%)",
      digits = 1)
Year-over-Year Growth Rates (%)
Year Growth_Phase AI_Market_YoY AI_Users_YoY Software_Jobs_YoY AI_Share_YoY
2021 Early Stage (2020-2021) 113.9 24.1 88.1 37.2
2022 Rapid Expansion (2022-2024) -34.6 25.7 31.9 19.7
2023 Rapid Expansion (2022-2024) 8.4 12.0 -53.8 -35.0
2024 Rapid Expansion (2022-2024) 36.3 24.7 -23.1 20.1
2025 Projected Maturity (2025+) 34.6 23.1 -6.7 40.0
2026 Projected Maturity (2025+) 33.3 22.5 NA NA
2027 Projected Maturity (2025+) 34.5 22.3 NA NA
2028 Projected Maturity (2025+) 35.5 22.3 NA NA
2029 Projected Maturity (2025+) 39.5 22.4 NA NA
2030 Projected Maturity (2025+) 40.3 22.7 NA NA

Reasoning: This calculation reveals the velocity of change. For example:
  - If AI market grows 50% in 2023 and 100% in 2024, the YoY growth shows acceleration
  - If job postings grow 20% annually but AI share in jobs grows 50% annually, it indicates AI is capturing a larger portion of new jobs

This helps us identify whether we’re in an acceleration phase (growth rate increasing) or deceleration phase (growth rate slowing).
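
Acceleration is the change in the growth rate from one year to the next, so it can be computed as the difference between consecutive YoY values. A minimal sketch using the AI market figures for 2020-2025 from the combined table; the column names `YoY`, `YoY_Change`, and `Trend` are assumptions for illustration:

```r
library(dplyr)

# AI market revenue (billions USD), copied from the combined table
ai_market_example <- tibble(
  Year = 2020:2025,
  AI_Market_Billions = c(16.87, 36.09, 23.61, 25.60, 34.90, 46.99)
)

acceleration <- ai_market_example %>%
  arrange(Year) %>%
  mutate(
    YoY = (AI_Market_Billions / lag(AI_Market_Billions) - 1) * 100,
    YoY_Change = YoY - lag(YoY),  # percentage-point change in the growth rate
    Trend = case_when(
      is.na(YoY_Change) ~ NA_character_,
      YoY_Change > 0 ~ "Accelerating",
      TRUE ~ "Decelerating"
    )
  )

print(acceleration)
```

The first two years have no Trend value because a change in growth rate needs two prior YoY figures.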

Wrangling Step 4: Create Summary Statistics by Growth Phase

Purpose: To provide aggregate statistics for each growth phase, making it easier to compare phases and write conclusions.

Code
# Create comprehensive summary by growth phase
phase_summary <- yoy_growth %>%
  filter(!is.na(Software_Jobs_Index)) %>%  # Only use years with actual job data
  group_by(Growth_Phase) %>%
  summarize(
    Years_in_Phase = n(),
    Avg_AI_Market_Billions = mean(AI_Market_Billions, na.rm = TRUE),
    Total_AI_Market_Growth = max(AI_Market_Index, na.rm = TRUE) - min(AI_Market_Index, na.rm = TRUE),
    Avg_Software_Jobs_Index = mean(avg_software_index, na.rm = TRUE),
    Avg_AI_Job_Share = mean(avg_ai_share, na.rm = TRUE),
    Max_YoY_AI_Market_Growth = max(AI_Market_YoY, na.rm = TRUE),
    Max_YoY_Jobs_Growth = max(Software_Jobs_YoY, na.rm = TRUE),
    .groups = "drop"
  )

kable(phase_summary,
      caption = "Summary Statistics by AI Growth Phase",
      digits = 2)
Summary Statistics by AI Growth Phase

| Growth_Phase | Years_in_Phase | Avg_AI_Market_Billions | Total_AI_Market_Growth | Avg_Software_Jobs_Index | Avg_AI_Job_Share | Max_YoY_AI_Market_Growth | Max_YoY_Jobs_Growth |
|---|---|---|---|---|---|---|---|
| Early Stage (2020-2021) | 2 | 26.48 | 113.93 | 113.67 | 2.04 | 113.93 | 88.06 |
| Rapid Expansion (2022-2024) | 3 | 28.04 | 66.92 | 118.58 | 2.29 | 36.33 | 31.86 |
| Projected Maturity (2025+) | 1 | 46.99 | 0.00 | 64.91 | 3.09 | 34.64 | -6.73 |

Reasoning: This aggregation allows us to make statements like:

  • “During the Rapid Expansion phase, the AI market grew by an average of X% annually”
  • “Job market response lagged by Y percentage points compared to market growth”
  • “AI’s share of job postings increased most dramatically during the [phase] phase”
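
To make the aggregation concrete, here is a small sketch that recomputes two of the summary columns (Avg_Software_Jobs_Index and Avg_AI_Job_Share) from the rounded per-year values shown in this document's AI-Adjusted Software Hiring Index table. It uses base R's `aggregate()` rather than the dplyr pipeline above, purely so it runs without packages. The data frame `jobs` and the result `phase_means` are illustrative names, not objects from the original analysis.

```r
# Per-year values copied (rounded) from the report's tables
jobs <- data.frame(
  Year = 2020:2025,
  Growth_Phase = c(rep("Early Stage (2020-2021)", 2),
                   rep("Rapid Expansion (2022-2024)", 3),
                   "Projected Maturity (2025+)"),
  avg_software_index = c(78.92, 148.42, 195.70, 90.46, 69.59, 64.91),
  avg_ai_share = c(1.72, 2.36, 2.83, 1.84, 2.21, 3.09)
)

# Mean of both metrics within each growth phase
phase_means <- aggregate(
  cbind(avg_software_index, avg_ai_share) ~ Growth_Phase,
  data = jobs,
  FUN = mean
)

print(phase_means)
```

The phase means should agree with the summary table to two decimals (for example, roughly 118.58 and 2.29 for Rapid Expansion). A statement like “AI’s share of postings averaged about 2.3% during Rapid Expansion” can then be read directly from the output.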

Wrangling Step 5: Visualize Growth Index Comparison

Purpose: To create a visual representation of how different metrics have grown relative to each other over time.

Code
# Prepare data for visualization
growth_long <- growth_indexed %>%
  filter(!is.na(Software_Jobs_Index)) %>%  # Only years with job data
  select(Year, AI_Market_Index, AI_Users_Index, Software_Market_Index, 
         Software_Jobs_Index, AI_JobShare_Index) %>%
  pivot_longer(
    cols = -Year,
    names_to = "Metric",
    values_to = "Index_Value"
  ) %>%
  mutate(
    Metric = case_when(
      Metric == "AI_Market_Index" ~ "AI Market Revenue",
      Metric == "AI_Users_Index" ~ "AI Tool Users",
      Metric == "Software_Market_Index" ~ "Total Software Market",
      Metric == "Software_Jobs_Index" ~ "Software Job Postings",
      Metric == "AI_JobShare_Index" ~ "AI Share of Job Postings",
      TRUE ~ Metric
    ),
    Category = case_when(
      Metric %in% c("AI Market Revenue", "AI Tool Users", "AI Share of Job Postings") ~ "AI Metrics",
      Metric %in% c("Total Software Market", "Software Job Postings") ~ "Software Metrics",
      TRUE ~ "Other"
    )
  )

# Create the visualization
ggplot(growth_long, aes(x = Year, y = Index_Value, color = Metric, linetype = Category)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2.5) +
  labs(
    title = "Comparative Growth: AI vs Software Industry (2020-2025)",
    subtitle = "All metrics indexed to 2020 = 100",
    x = "Year",
    y = "Growth Index (2020 = 100)",
    color = "Metric",
    linetype = "Category"
  ) +
  scale_color_manual(values = c(
    "AI Market Revenue" = "#2E86AB",
    "AI Tool Users" = "#A23B72",
    "AI Share of Job Postings" = "#E63946",
    "Total Software Market" = "#F18F01",
    "Software Job Postings" = "#06A77D"
  )) +
  scale_y_continuous(labels = comma, breaks = seq(0, 500, 50)) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    legend.position = "bottom",
    legend.box = "vertical"
  )

Reasoning: This visualization reveals:

  1. AI metrics (blue/purple/red) show steeper growth than software metrics (orange/green), which is consistent with AI being a high-growth segment within a mature industry

  2. AI User Growth tracks closely with Market Revenue Growth, suggesting growth is driven by adoption, not just price increases

  3. Software Job Postings do not keep pace with AI market growth (the jobs index peaks in 2022 and then falls below its 2020 level), which could mean:

    • AI growth may be productivity-enhancing (doing more with same workforce)
    • Or there’s a lag between market growth and hiring response
    • Or AI jobs are being captured in other job categories

  4. AI Share of Job Postings is growing, but not as much as market size, which may indicate that AI is spreading across existing roles rather than creating entirely new role categories
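
The comparison above only works because every series is rebased to 2020 = 100. The index columns themselves are built earlier in the report, so the snippet below is only a sketch of the idea. It assumes each index is the series divided by its 2020 value, times 100, and that the rows are sorted with 2020 first. The helper `rebase` is a hypothetical name, and the input values are the rounded figures from this document's tables.

```r
# Rounded per-year values from the report's tables (2020 is the first row)
idx <- data.frame(
  Year = 2020:2025,
  avg_software_index = c(78.92, 148.42, 195.70, 90.46, 69.59, 64.91),
  avg_ai_share = c(1.72, 2.36, 2.83, 1.84, 2.21, 3.09)
)

# Hypothetical helper: express a series relative to its first value (= 100)
rebase <- function(x) x / x[1] * 100

idx$Software_Jobs_Index <- rebase(idx$avg_software_index)
idx$AI_JobShare_Index   <- rebase(idx$avg_ai_share)

print(round(idx, 1))
```

After rebasing, a value of 150 means “50% above the 2020 level” for any metric, which is what lets revenue (in billions), users (in millions), and percentages share one y-axis.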

Wrangling Step 6: Analyze AI-Adjusted Hiring Metric

Purpose: To estimate the relative volume of AI-related software jobs by combining the job posting index with AI share %.

Code
# Create AI-adjusted hiring metric
ai_adjusted <- yoy_growth %>%
  filter(!is.na(avg_software_index)) %>%
  mutate(
    AI_Adjusted_Hiring = avg_software_index * (avg_ai_share / 100)
  )

# Calculate growth in AI-adjusted hiring
ai_hiring_growth <- ai_adjusted %>%
  select(Year, Growth_Phase, avg_software_index, avg_ai_share, AI_Adjusted_Hiring) %>%
  mutate(
    AI_Hiring_YoY = (AI_Adjusted_Hiring / lag(AI_Adjusted_Hiring) - 1) * 100
  )

kable(ai_hiring_growth,
      caption = "AI-Adjusted Software Hiring Index",
      digits = 2,
      col.names = c("Year", "Growth Phase", "Software Jobs Index", 
                    "AI Share (%)", "AI-Adjusted Hiring", "YoY Growth (%)"))
AI-Adjusted Software Hiring Index

| Year | Growth Phase | Software Jobs Index | AI Share (%) | AI-Adjusted Hiring | YoY Growth (%) |
|---|---|---|---|---|---|
| 2020 | Early Stage (2020-2021) | 78.92 | 1.72 | 1.36 | NA |
| 2021 | Early Stage (2020-2021) | 148.42 | 2.36 | 3.51 | 158.05 |
| 2022 | Rapid Expansion (2022-2024) | 195.70 | 2.83 | 5.54 | 57.85 |
| 2023 | Rapid Expansion (2022-2024) | 90.46 | 1.84 | 1.66 | -69.96 |
| 2024 | Rapid Expansion (2022-2024) | 69.59 | 2.21 | 1.54 | -7.63 |
| 2025 | Projected Maturity (2025+) | 64.91 | 3.09 | 2.01 | 30.59 |
Code
# Visualize AI-adjusted hiring
ggplot(ai_adjusted, aes(x = Year, y = AI_Adjusted_Hiring)) +
  geom_line(linewidth = 1.2, color = "#E63946") +
  geom_point(size = 3, color = "#E63946") +
  geom_area(alpha = 0.3, fill = "#E63946") +
  labs(
    title = "Estimated AI-Related Software Hiring Volume",
    subtitle = "Software Job Index × AI Share (%)",
    x = "Year",
    y = "AI-Adjusted Hiring Index",
    caption = "Higher values indicate more AI-related software jobs posted"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14))

Reasoning: This metric attempts to approximate the volume of AI-related software jobs by multiplying:

  • Software job postings index (how many software jobs exist)
  • AI share % (what portion mention AI)

Roughly, this gives: “How many AI-related software job postings are there?” Because the first factor is an index rather than a raw count, the result is a relative measure, not a literal number of postings.

Key insights from this metric:

  • If it’s growing faster than either component alone, it indicates both more jobs AND higher AI penetration
  • If it plateaus despite AI market growth, it may suggest supply constraints (not enough AI talent)
  • Comparing its growth rate to AI market revenue growth reveals whether job creation is keeping pace with economic growth
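
As a worked check of this metric, the sketch below rebuilds AI_Adjusted_Hiring from the rounded table values and splits its year-over-year growth into the two component growth rates. It is base R and self-contained. The helper `yoy_pct` and the columns `Jobs_YoY` and `Share_YoY` are hypothetical names added for illustration. Because the inputs are rounded to two decimals, results can differ slightly from the table above, which was computed from unrounded data.

```r
# Rounded per-year values from the report's tables
hiring <- data.frame(
  Year = 2020:2025,
  avg_software_index = c(78.92, 148.42, 195.70, 90.46, 69.59, 64.91),
  avg_ai_share = c(1.72, 2.36, 2.83, 1.84, 2.21, 3.09)
)

# Same formula as the report: jobs index times AI share as a fraction
hiring$AI_Adjusted_Hiring <- hiring$avg_software_index * (hiring$avg_ai_share / 100)

# Hypothetical helper: percent change versus the previous row (NA for the first)
yoy_pct <- function(x) c(NA, (x[-1] / x[-length(x)] - 1) * 100)

hiring$Jobs_YoY      <- yoy_pct(hiring$avg_software_index)
hiring$Share_YoY     <- yoy_pct(hiring$avg_ai_share)
hiring$AI_Hiring_YoY <- yoy_pct(hiring$AI_Adjusted_Hiring)

print(round(hiring, 2))
```

Since the metric is a product, its growth factor is the product of the component growth factors: (1 + AI_Hiring_YoY/100) = (1 + Jobs_YoY/100) × (1 + Share_YoY/100). This is why the metric can rise even while job postings fall, as in 2025, when a drop in postings is outweighed by a larger rise in AI share.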


Section V: Group Information

Project Overview

Topic: Analyzing the relationship between AI market growth & software development job market trends

Group Section: BG-5

Teaching Assistant: Lexeigh Kolakowski

Our group is investigating how the growth of AI technologies is reshaping the software development job market. Specifically, we’re examining whether AI’s economic expansion (measured by market revenue and user adoption) corresponds to changes in hiring demand & skill requirements in the software industry.

Team Members

Lauren Hughes (Applied Math, lhughes@uw.edu) is interested in how actual market trends compare with their portrayal in the news, and is particularly drawn to the mathematical modeling behind labor market indices & growth curves. Lauren is passionate about applying quantitative methods to real-world economic questions (and seeing how others work with data), and is pursuing a career in either Applied Mathematics/tech or Medicine, both of which emphasize accurate, readily conveyable information.

Jonah Calague (Informatics, jcalague@uw.edu) is interested in seeing how the growth of AI is transforming the software industry and influencing the job market.

Oliver Boctor N/A

Coding Notes

Tools Used from Class

This project primarily utilized concepts and tools covered in INFO 201 course materials, supplemented by the outside resources listed under Additional Resources:

  • Data manipulation: dplyr verbs (filter, select, mutate, group_by, summarize, arrange)
  • Data reshaping: tidyr functions (pivot_longer, pivot_wider)
  • Data joining: inner_join, left_join concepts from relational data lectures
  • Visualization: ggplot2 with geom_line, geom_point, geom_area
  • Date handling: base R as.Date for date conversion and the lubridate package’s year for extracting the year
  • R Markdown: Document formatting with YAML headers, code chunks, and narrative text

Additional Resources

Advanced ggplot2 Customization: For creating more polished visualizations, we consulted:

  • Resource: https://ggplot2-book.org/themes.html
  • Used for: Custom color palettes with scale_color_manual(), adjusting legend positioning, and other aesthetics

Index Calculation Methodology: To properly calculate growth indices, we referenced:

  • Resource: Indeed Hiring Lab methodology documentation (https://www.hiringlab.org/about/)
  • Used for: Understanding how to properly baseline indices and calculate year-over-year growth rates

Many YouTube videos - Along with some 2007 coding forums…


Appendix: Additional Visualizations