Untitled

The impact of artificial intelligence (AI) and automation on the job market has been a topic of significant interest. This project explores a dataset of job market insights, focusing on understanding the relationship between salary levels, AI adoption, and other job characteristics. The study aims to uncover patterns in the data and build predictive models to estimate salary levels and job growth projections.

This report provides an in-depth analysis of the “AI-Powered Job Market Insights” dataset. The analysis aims to answer two key research questions related to salary predictions and job growth projections based on factors like AI adoption, automation risk, and company size. We will explore the data’s structure, examine key variables, perform exploratory data analysis, and use predictive modeling techniques like Random Forests and Decision Trees to analyze the data.

Number of Cases (Rows)

case_count <- nrow(data)
case_count

## [1] 500

The dataset contains 500 rows, each representing a job entry.

Number of Variables (Columns)

variable_count <- ncol(data)
variable_count

## [1] 10

The dataset includes 10 variables, which capture different characteristics of the job market data.

First 10 Instances

head(data, 10) %>% kable() %>% kable_styling()

Job_Title	Industry	Company_Size	Location	AI_Adoption_Level	Automation_Risk	Required_Skills	Salary_USD	Remote_Friendly	Job_Growth_Projection
Cybersecurity Analyst	Entertainment	Small	Dubai	Medium	High	UX/UI Design	111392.17	Yes	Growth
Marketing Specialist	Technology	Large	Singapore	Medium	High	Marketing	93792.56	No	Decline
AI Researcher	Technology	Large	Singapore	Medium	High	UX/UI Design	107170.26	Yes	Growth
Sales Manager	Retail	Small	Berlin	Low	High	Project Management	93027.95	No	Growth
Cybersecurity Analyst	Entertainment	Small	Tokyo	Low	Low	JavaScript	87752.92	Yes	Decline
UX Designer	Education	Large	San Francisco	Medium	Medium	Cybersecurity	102825.01	No	Growth
HR Manager	Finance	Medium	Singapore	Low	High	Sales	102065.72	Yes	Growth
Cybersecurity Analyst	Technology	Small	Dubai	Medium	Low	Machine Learning	86607.32	Yes	Decline
AI Researcher	Retail	Large	London	High	Low	JavaScript	75015.86	No	Stable
Sales Manager	Entertainment	Medium	Singapore	High	Low	Cybersecurity	96834.58	Yes	Decline

The first 10 rows of the dataset show the basic job information, including job titles, company size, and AI adoption level.

Missing Values Analysis

missing_summary <- sapply(data, function(x) sum(is.na(x)))
missing_summary <- missing_summary[missing_summary > 0]
missing_summary

## named integer(0)

The dataset contains no missing values, simplifying preprocessing.

Research Questions and Hypotheses

Question 1: Does the AI_Adoption_Level of a company impact the Salary_USD for jobs within that company?

Hypothesis 1: Higher AI_Adoption_Level is associated with higher Salary_USD.

Question 2: Can the Automation_Risk level of a job predict its Job_Growth_Projection?

Hypothesis 2: Jobs with higher Automation_Risk are more likely to have a “Decline” in Job_Growth_Projection.

Relevant Variables for Each Research Question

Question 1: Relevant Variables - AI_Adoption_Level, Salary_USD

Question 2: Relevant Variables - Automation_Risk, Job_Growth_Projection

Identify Response Variables

Question 1: Response Variable: Salary_USD

Question 2: Response Variable: Job_Growth_Projection

Missing Values for Response Variables

missing_salary <- sum(is.na(data$Salary_USD))
missing_growth <- sum(is.na(data$Job_Growth_Projection))

missing_values_responses <- list(Salary_USD = missing_salary, Job_Growth_Projection = missing_growth)
missing_values_responses

## $Salary_USD
## [1] 0
## 
## $Job_Growth_Projection
## [1] 0

There are no missing values for the response variables, ensuring clean analysis.

Distribution of Response Variables

Salary Distribution

ggplot(data, aes(x = Salary_USD)) +
  geom_histogram(binwidth = 5000) +
  labs(title = "Salary Distribution", x = "Salary (USD)", y = "Count")

Job Growth Projection Distribution

ggplot(data, aes(x = Job_Growth_Projection)) +
  geom_bar() +
  labs(title = "Job Growth Projection Distribution", x = "Job Growth Projection", y = "Count")

Job growth projections are almost evenly balanced across the three categories: the root node of the decision tree fitted later in this report counts 169 “Decline,” 169 “Growth,” and 162 “Stable” jobs out of 500.
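The balance of the three categories can be checked numerically. The following is a minimal sketch that uses the class counts printed at the root node of the decision tree summary later in this report; the names `growth_counts` and `growth_share` are illustrative and not part of the original analysis.

```r
# Class counts copied from the root node of the rpart summary in this report.
growth_counts <- c(Decline = 169, Growth = 169, Stable = 162)

# Share of each category among the 500 jobs.
growth_share <- growth_counts / sum(growth_counts)
print(round(growth_share, 3))

# With the full data frame loaded, the equivalent call would be:
# prop.table(table(data$Job_Growth_Projection))
```

Each share is close to one third, which matters later: a classifier that always predicts one class is already right about a third of the time.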

Relationship Between Response and Explanatory Variables

Relationship between Salary and AI Adoption Level

ggplot(data, aes(x = AI_Adoption_Level, y = Salary_USD, fill = AI_Adoption_Level)) +
  geom_boxplot() +
  labs(title = "Salary by AI Adoption Level", x = "AI Adoption Level", y = "Salary (USD)") +
  theme_minimal()

The boxplot reveals that “Low AI Adoption” positions tend to have higher median salaries compared to “High AI Adoption” jobs.

Relationship between Salary and Company Size

ggplot(data, aes(x = Company_Size, y = Salary_USD, fill = Company_Size)) +
  geom_boxplot() +
  labs(title = "Salary by Company Size", x = "Company Size", y = "Salary (USD)") +
  theme_minimal()

Salary ranges overlap substantially across company sizes, so any difference between the group medians is small relative to the spread within each group.
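Group medians can be read off numerically rather than from the plot. The sketch below runs on only the ten rows displayed at the top of this report, so its output illustrates the call and says nothing reliable about the full dataset; `first_rows` and `median_by_size` are illustrative names.

```r
# Company size and salary from the first ten rows shown in this report.
first_rows <- data.frame(
  Company_Size = c("Small", "Large", "Large", "Small", "Small",
                   "Large", "Medium", "Small", "Large", "Medium"),
  Salary_USD = c(111392.17, 93792.56, 107170.26, 93027.95, 87752.92,
                 102825.01, 102065.72, 86607.32, 75015.86, 96834.58)
)

# Median salary per company size.
median_by_size <- aggregate(Salary_USD ~ Company_Size, data = first_rows, FUN = median)
print(median_by_size)

# On the full data frame the same call would be:
# aggregate(Salary_USD ~ Company_Size, data = data, FUN = median)
```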

Relationship between Job Growth Projection and Automation Risk

ggplot(data, aes(x = Automation_Risk, fill = Job_Growth_Projection)) +
  geom_bar(position = "dodge") +
  labs(title = "Job Growth Projection by Automation Risk", x = "Automation Risk", y = "Count")

Jobs with “High Automation Risk” are more likely to experience a decline in job growth, whereas low automation risk jobs tend to show more stability or growth.

Relationship between Job Growth Projection and Industry

ggplot(data, aes(x = Industry, fill = Job_Growth_Projection)) +
  geom_bar(position = "dodge") +
  labs(title = "Job Growth Projection by Industry", x = "Industry", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Certain industries, such as education, show higher growth projections, while others, like manufacturing, show more declines.

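data <- data %>%
  # scale() returns a one-column matrix, hence the <dbl[,1]> type in the output below
  mutate(Salary_Normalized = scale(Salary_USD)) %>%
  # Bucket salaries into three bands
  mutate(Salary_Range = case_when(
    Salary_USD < 60000 ~ "Low",
    Salary_USD >= 60000 & Salary_USD < 100000 ~ "Medium",
    Salary_USD >= 100000 ~ "High"
  )) %>%
  # Binary flag: 1 for high automation risk, 0 otherwise
  mutate(High_Automation_Risk = if_else(Automation_Risk == "High", 1, 0))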

glimpse(data)

## Rows: 500
## Columns: 13
## $ Job_Title             <chr> "Cybersecurity Analyst", "Marketing Specialist",…
## $ Industry              <chr> "Entertainment", "Technology", "Technology", "Re…
## $ Company_Size          <fct> Small, Large, Large, Small, Small, Large, Medium…
## $ Location              <chr> "Dubai", "Singapore", "Singapore", "Berlin", "To…
## $ AI_Adoption_Level     <fct> Medium, Medium, Medium, Low, Low, Medium, Low, M…
## $ Automation_Risk       <fct> High, High, High, High, Low, Medium, High, Low, …
## $ Required_Skills       <chr> "UX/UI Design", "Marketing", "UX/UI Design", "Pr…
## $ Salary_USD            <dbl> 111392.17, 93792.56, 107170.26, 93027.95, 87752.…
## $ Remote_Friendly       <chr> "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "Y…
## $ Job_Growth_Projection <fct> Growth, Decline, Growth, Growth, Decline, Growth…
## $ Salary_Normalized     <dbl[,1]> <matrix[26 x 1]>
## $ Salary_Range          <chr> "High", "Medium", "High", "Medium", "Medium"…
## $ High_Automation_Risk  <dbl> 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, …

I created three new features: a normalized salary (z-score), a salary range (Low, Medium, High), and a binary indicator for high automation risk jobs.
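One alternative way to build the same salary bands is sketched below on the ten salaries shown in the first rows of the dataset. Base R's `cut()` yields an ordered factor, so the bands sort as Low, Medium, High in tables and plots, and wrapping `scale()` in `as.numeric()` gives a plain numeric vector instead of a one-column matrix. The names `salary`, `salary_range`, and `salary_z` are illustrative, and the z-scores here are relative to these ten rows only, not the full dataset.

```r
# Salaries from the first ten rows displayed in this report.
salary <- c(111392.17, 93792.56, 107170.26, 93027.95, 87752.92,
            102825.01, 102065.72, 86607.32, 75015.86, 96834.58)

# Same cut points as the case_when() call: below 60k, 60k to under 100k, 100k and up.
salary_range <- cut(
  salary,
  breaks = c(-Inf, 60000, 100000, Inf),
  labels = c("Low", "Medium", "High"),
  right = FALSE,            # intervals closed on the left, matching >= in the original
  ordered_result = TRUE
)

# z-scores as a plain numeric vector (scale() alone returns a matrix).
salary_z <- as.numeric(scale(salary))

print(data.frame(salary, salary_range, salary_z = round(salary_z, 2)))
```

The first five bands come out as High, Medium, High, Medium, Medium, matching the `Salary_Range` values in the `glimpse()` output.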

summary_stats <- data %>%
  group_by(AI_Adoption_Level) %>%
  summarise(
    Mean_Salary = mean(Salary_USD, na.rm = TRUE),
    SD_Salary = sd(Salary_USD, na.rm = TRUE),
    Min_Salary = min(Salary_USD, na.rm = TRUE),
    Q1_Salary = quantile(Salary_USD, 0.25, na.rm = TRUE),
    Median_Salary = median(Salary_USD, na.rm = TRUE),
    Q3_Salary = quantile(Salary_USD, 0.75, na.rm = TRUE),
    Max_Salary = max(Salary_USD, na.rm = TRUE)
  )

summary_stats %>% 
  kable(caption = "Summary Statistics for Salary_USD by AI Adoption Level") %>% 
  kable_styling(full_width = FALSE)

Summary Statistics for Salary_USD by AI Adoption Level
AI_Adoption_Level	Mean_Salary	SD_Salary	Min_Salary	Q1_Salary	Median_Salary	Q3_Salary	Max_Salary
High	87583.42	21021.20	41298.73	74216.26	86379.88	101357.0	155209.8
Low	93353.60	20864.83	31969.53	79016.97	95700.14	105165.4	140476.0
Medium	92139.14	19412.02	35963.30	81055.99	92891.89	105665.0	134822.7

The summary statistics for Salary_USD grouped by AI_Adoption_Level reveal the following patterns:

Mean and Median Salaries:

Low AI Adoption has the highest mean salary ($93,353.60) and median salary ($95,700.14), suggesting that these jobs generally offer the best compensation. Medium AI Adoption follows with a mean salary of $92,139.14 and a median of $92,891.89, slightly below the “Low” group. High AI Adoption has the lowest mean ($87,583.42) and median ($86,379.88) salaries, despite having the highest maximum salary ($155,209.80).

Insight: Higher AI adoption levels do not correlate with higher average salaries. This could indicate that high-AI adoption jobs are more common in industries with entry-level or support roles, while low-AI adoption jobs are likely in traditional, well-compensated fields.

Salary Variability:

High AI Adoption has the largest standard deviation ($21,021.20) and range ($41,298.73 to $155,209.80), indicating significant variability in pay. This reflects a mix of both high-paying specialized roles and lower-paying positions. Low AI Adoption also has substantial variability (SD: $20,864.83) but shows more consistent compensation in the higher salary ranges (Q3: $105,165.40). Medium AI Adoption has the lowest standard deviation ($19,412.02) and a tighter range of salaries, suggesting more consistency in pay structures.

Insight: High salary variability in “High AI Adoption” jobs reflects diverse roles, ranging from lower-tier positions to highly specialized, high-paying roles.

Five-Number Summary:

Medium AI Adoption has the highest Q1 ($81,055.99) and Q3 ($105,665.00) and the narrowest interquartile range, so its salaries cluster most tightly around the median. Low AI Adoption has the highest median, with Q1 ($79,016.97) and Q3 ($105,165.40) close behind the Medium group. High AI Adoption has the lowest Q1 ($74,216.26), median, and Q3 ($101,357.00), so much of its distribution sits in the lower range despite having the highest maximum.
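The spread comparisons can be verified directly from the summary table. The sketch below copies the quartiles and extremes from that table and derives the interquartile range and full range per group; the column names `IQR_Salary` and `Range_Salary` are illustrative.

```r
# Quartiles and extremes copied from the summary table in this report.
salary_summary <- data.frame(
  AI_Adoption_Level = c("High", "Low", "Medium"),
  Min = c(41298.73, 31969.53, 35963.30),
  Q1  = c(74216.26, 79016.97, 81055.99),
  Q3  = c(101357.0, 105165.4, 105665.0),
  Max = c(155209.8, 140476.0, 134822.7)
)

# Interquartile range (middle 50% of salaries) and full range per group.
salary_summary$IQR_Salary   <- salary_summary$Q3 - salary_summary$Q1
salary_summary$Range_Salary <- salary_summary$Max - salary_summary$Min

print(salary_summary[, c("AI_Adoption_Level", "IQR_Salary", "Range_Salary")])
```

On these figures the Medium group has the smallest interquartile range and the High group the largest full range.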

ggplot(data, aes(x = AI_Adoption_Level, y = Salary_USD, fill = AI_Adoption_Level)) +
  geom_boxplot() +
  labs(title = "Salary by AI Adoption Level", x = "AI Adoption Level", y = "Salary (USD)") +
  theme_minimal()

Salary by AI Adoption Level (Boxplot): Median salaries are highest for “Low AI Adoption” and lowest for “High AI Adoption.”

Insight: Traditional, low-AI industries may prioritize human expertise, resulting in better compensation, while high-AI adoption jobs might include more entry-level roles.
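Whether the gap between groups exceeds what chance variation would produce can be checked with a rank-based test. The sketch below uses Kruskal-Wallis, a technique that is not part of the original analysis, and runs it on only the ten rows displayed at the top of this report, so the result itself carries no weight. On the full data frame the call would be `kruskal.test(Salary_USD ~ AI_Adoption_Level, data = data)`.

```r
# AI adoption level and salary from the first ten rows shown in this report.
first_rows <- data.frame(
  AI_Adoption_Level = factor(c("Medium", "Medium", "Medium", "Low", "Low",
                               "Medium", "Low", "Medium", "High", "High")),
  Salary_USD = c(111392.17, 93792.56, 107170.26, 93027.95, 87752.92,
                 102825.01, 102065.72, 86607.32, 75015.86, 96834.58)
)

# Kruskal-Wallis compares the salary distributions across the three groups
# without assuming normality.
kw <- kruskal.test(Salary_USD ~ AI_Adoption_Level, data = first_rows)
print(kw)
```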

ggplot(data, aes(x = Company_Size, y = Salary_USD, fill = Company_Size)) +
  geom_boxplot() +
  labs(title = "Salary by Company Size", x = "Company Size", y = "Salary (USD)") +
  theme_minimal()

Salary by Company Size (Boxplot):

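Salary ranges overlap heavily across company sizes, so differences between the group medians are small relative to the within-group spread.

Insight: Company size alone is a weak predictor of salary in this dataset; its %IncMSE in the random forest model reported later is negative (-3.378399).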

ggplot(data, aes(x = Automation_Risk, fill = Job_Growth_Projection)) +
  geom_bar(position = "dodge") +
  labs(title = "Job Growth Projection by Automation Risk", x = "Automation Risk", y = "Count") +
  theme_minimal()

Job Growth Projection by Automation Risk (Bar Chart):

Jobs with “High Automation Risk” have the highest likelihood of decline, while “Low Automation Risk” jobs exhibit stronger growth potential.

Insight: Automation risk inversely correlates with job growth.
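A dodged bar chart compares raw counts, which can mislead when risk levels have different group sizes. A row-share table puts the levels on a common scale. The sketch below runs on only the ten rows displayed at the top of this report, so it illustrates the calls and not the full-data result; `first_rows`, `risk_by_growth`, and `row_share` are illustrative names.

```r
# Automation risk and growth projection from the first ten rows shown in this report.
first_rows <- data.frame(
  Automation_Risk = c("High", "High", "High", "High", "Low",
                      "Medium", "High", "Low", "Low", "Low"),
  Job_Growth_Projection = c("Growth", "Decline", "Growth", "Growth", "Decline",
                            "Growth", "Growth", "Decline", "Stable", "Decline")
)

# Cross-tabulate, then convert each row to shares so risk levels are comparable.
risk_by_growth <- table(first_rows$Automation_Risk, first_rows$Job_Growth_Projection)
row_share <- prop.table(risk_by_growth, margin = 1)
print(round(row_share, 2))

# On the full data frame, a chi-squared test of independence would be:
# chisq.test(table(data$Automation_Risk, data$Job_Growth_Projection))
```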

ggplot(data, aes(x = AI_Adoption_Level, y = Salary_USD, color = AI_Adoption_Level)) +
  geom_jitter(width = 0.2) +
  labs(title = "Scatter Plot of Salary vs AI Adoption Level", x = "AI Adoption Level", y = "Salary (USD)") +
  theme_minimal()

Scatter Plot of Salary vs AI Adoption Level:

Salaries in “High AI Adoption” environments show significant variability, including some very high-paying roles.

Insight: High-AI adoption environments may include both entry-level and specialized roles, leading to this spread in pay.

Random Forest for Salary_USD

rf_model <- randomForest(Salary_USD ~ AI_Adoption_Level + Company_Size + Automation_Risk + Industry, data = data, importance = TRUE, ntree = 500)

rf_model

## 
## Call:
##  randomForest(formula = Salary_USD ~ AI_Adoption_Level + Company_Size +      Automation_Risk + Industry, data = data, importance = TRUE,      ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 425760681
##                     % Var explained: -1.47

importance(rf_model)

##                     %IncMSE IncNodePurity
## AI_Adoption_Level  1.206883    6285507205
## Company_Size      -3.378399    4715096300
## Automation_Risk    5.249569    7016933148
## Industry          -3.253647   11481707881

varImpPlot(rf_model)

Decision Tree for Job_Growth_Projection

tree_model <- rpart(Job_Growth_Projection ~ Automation_Risk + AI_Adoption_Level + Company_Size + Industry, data = data, method = "class")

summary(tree_model)

## Call:
## rpart(formula = Job_Growth_Projection ~ Automation_Risk + AI_Adoption_Level + 
##     Company_Size + Industry, data = data, method = "class")
##   n= 500 
## 
##           CP nsplit rel error   xerror       xstd
## 1 0.07854985      0 1.0000000 1.066465 0.03077748
## 2 0.02719033      1 0.9214502 1.009063 0.03181371
## 3 0.01510574      2 0.8942598 1.048338 0.03113131
## 4 0.01000000      4 0.8640483 1.021148 0.03161570
## 
## Variable importance
##        Industry Automation_Risk    Company_Size 
##              57              23              20 
## 
## Node number 1: 500 observations,    complexity param=0.07854985
##   predicted class=Decline  expected loss=0.662  P(node) =1
##     class counts:   169   169   162
##    probabilities: 0.338 0.338 0.324 
##   left son=2 (200 obs) right son=3 (300 obs)
##   Primary splits:
##       Industry          splits as  RLLRRLLRRR, improve=3.3946670, (0 missing)
##       Company_Size      splits as  RRL,        improve=2.0689030, (0 missing)
##       AI_Adoption_Level splits as  LRL,        improve=0.6396945, (0 missing)
##       Automation_Risk   splits as  LRR,        improve=0.4751721, (0 missing)
## 
## Node number 2: 200 observations,    complexity param=0.01510574
##   predicted class=Decline  expected loss=0.58  P(node) =0.4
##     class counts:    84    58    58
##    probabilities: 0.420 0.290 0.290 
##   left son=4 (135 obs) right son=5 (65 obs)
##   Primary splits:
##       Automation_Risk   splits as  RLL,        improve=2.3392590, (0 missing)
##       Industry          splits as  -RL--LR---, improve=1.9892730, (0 missing)
##       AI_Adoption_Level splits as  RRL,        improve=1.1026510, (0 missing)
##       Company_Size      splits as  RLL,        improve=0.8667647, (0 missing)
## 
## Node number 3: 300 observations,    complexity param=0.02719033
##   predicted class=Growth   expected loss=0.63  P(node) =0.6
##     class counts:    85   111   104
##    probabilities: 0.283 0.370 0.347 
##   left son=6 (99 obs) right son=7 (201 obs)
##   Primary splits:
##       Company_Size      splits as  RRL,        improve=1.9668600, (0 missing)
##       AI_Adoption_Level splits as  LRL,        improve=1.3932170, (0 missing)
##       Industry          splits as  L--LL--LRR, improve=1.1959250, (0 missing)
##       Automation_Risk   splits as  LLR,        improve=0.5134799, (0 missing)
## 
## Node number 4: 135 observations
##   predicted class=Decline  expected loss=0.5407407  P(node) =0.27
##     class counts:    62    31    42
##    probabilities: 0.459 0.230 0.311 
## 
## Node number 5: 65 observations,    complexity param=0.01510574
##   predicted class=Growth   expected loss=0.5846154  P(node) =0.13
##     class counts:    22    27    16
##    probabilities: 0.338 0.415 0.246 
##   left son=10 (19 obs) right son=11 (46 obs)
##   Primary splits:
##       Industry          splits as  -RR--LR---, improve=2.35194500, (0 missing)
##       Company_Size      splits as  RLR,        improve=1.14444400, (0 missing)
##       AI_Adoption_Level splits as  RLL,        improve=0.01333333, (0 missing)
## 
## Node number 6: 99 observations
##   predicted class=Growth   expected loss=0.5858586  P(node) =0.198
##     class counts:    33    41    25
##    probabilities: 0.333 0.414 0.253 
## 
## Node number 7: 201 observations
##   predicted class=Stable   expected loss=0.6069652  P(node) =0.402
##     class counts:    52    70    79
##    probabilities: 0.259 0.348 0.393 
## 
## Node number 10: 19 observations
##   predicted class=Decline  expected loss=0.4210526  P(node) =0.038
##     class counts:    11     6     2
##    probabilities: 0.579 0.316 0.105 
## 
## Node number 11: 46 observations
##   predicted class=Growth   expected loss=0.5434783  P(node) =0.092
##     class counts:    11    21    14
##    probabilities: 0.239 0.457 0.304

rpart.plot(tree_model, main = "Decision Tree for Job Growth Projection")
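The complexity table in the summary can guide pruning. The sketch below assumes the common practice of choosing the row with the lowest cross-validated error; the report itself does not prune. The values are copied from the summary output, and on a fitted model the same matrix is available as `tree_model$cptable`.

```r
# Complexity table copied from summary(tree_model) in this report.
cp_table <- data.frame(
  CP     = c(0.07854985, 0.02719033, 0.01510574, 0.01000000),
  nsplit = c(0, 1, 2, 4),
  xerror = c(1.066465, 1.009063, 1.048338, 1.021148),
  xstd   = c(0.03077748, 0.03181371, 0.03113131, 0.03161570)
)

# Row with the lowest cross-validated relative error.
best <- cp_table[which.min(cp_table$xerror), ]
print(best)

# With the fitted model available, pruning to that size would look like:
# cp_best <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]
# pruned_tree <- rpart::prune(tree_model, cp = cp_best)
```

The selected row is the one-split tree, and its xerror (1.009) is still above 1, so even the best-sized tree does not beat the root node under cross-validation.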

Model Evaluation and Comparison

rf_predictions <- predict(rf_model, newdata = data)
rf_mse <- mean((rf_predictions - data$Salary_USD)^2)
rf_r_squared <- 1 - (sum((rf_predictions - data$Salary_USD)^2) / sum((mean(data$Salary_USD) - data$Salary_USD)^2))

tree_predictions <- predict(tree_model, newdata = data, type = "class")
tree_accuracy <- mean(tree_predictions == data$Job_Growth_Projection)

rf_mse

## [1] 365281207

rf_r_squared

## [1] 0.1294229

tree_accuracy

## [1] 0.428
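The 42.8% accuracy is easier to judge against a baseline. The sketch below uses the root-node class counts and the relative errors from the rpart summary to compare three figures: always predicting the most frequent class, the tree's in-sample accuracy, and the accuracy implied by cross-validation. Variable names are illustrative.

```r
# Class counts at the root node and relative errors from the rpart summary.
class_counts   <- c(Decline = 169, Growth = 169, Stable = 162)
rel_error_full <- 0.8640483   # rel error of the full tree (4 splits)
xerror_full    <- 1.021148    # cross-validated relative error (4 splits)

# Baseline: always predict the most frequent class.
baseline_accuracy <- max(class_counts) / sum(class_counts)
root_error <- 1 - baseline_accuracy   # the root node's expected loss

# rpart reports errors relative to the root error, so rescale them.
insample_accuracy <- 1 - rel_error_full * root_error
cv_accuracy       <- 1 - xerror_full * root_error

print(round(c(baseline = baseline_accuracy,
              in_sample = insample_accuracy,
              cross_validated = cv_accuracy), 3))
```

The in-sample figure reproduces `tree_accuracy`, while the cross-validated figure falls slightly below the baseline, which suggests the in-sample gain comes from fitting noise.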

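Random Forest Model (Salary Prediction)

Model Quality:

Mean Squared Residuals (MSE): 425,760,681, measured on out-of-bag predictions. This corresponds to a typical error of roughly $20,600, about the same size as the salary standard deviations in the summary table, so the model fits the data poorly.

% of Variance Explained: -1.47%. The negative value means the model performs worse than a baseline that predicts the mean salary for every case, so it is not suitable for explaining salary variation in this dataset. The in-sample R-squared of 0.129 reported above is higher only because it scores the forest on the same rows it was trained on. These figures change slightly from run to run unless a seed is fixed with set.seed().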
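The gap between the out-of-bag figure and the in-sample R-squared can be reproduced from the numbers printed earlier. The sketch below assumes that `randomForest` computes “% Var explained” as one minus the out-of-bag MSE divided by the divide-by-n variance of the response, which is the same denominator used in `rf_r_squared`. Variable names are illustrative.

```r
# Values copied from the output in this report.
oob_mse      <- 425760681   # "Mean of squared residuals" from print(rf_model)
insample_mse <- 365281207   # rf_mse
insample_r2  <- 0.1294229   # rf_r_squared

# rf_r_squared = 1 - insample_mse / var_y, so the salary variance can be recovered.
var_y <- insample_mse / (1 - insample_r2)

# Same formula applied to the out-of-bag error.
oob_r2 <- 1 - oob_mse / var_y

print(round(100 * c(in_sample = insample_r2, out_of_bag = oob_r2), 2))
```

Under that assumption the calculation recovers the -1.47% shown in the model printout, which confirms that the two metrics differ only in which predictions they score.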

Predictors’ Impact on Salary (Outcome):

%IncMSE is a permutation importance: the increase in out-of-bag MSE when a predictor’s values are randomly shuffled, scaled by its standard error by default. It measures how much the model relies on a predictor, not the direction of that predictor’s effect on salary. Values near or below zero mean the predictor contributes no more than noise.

AI_Adoption_Level: %IncMSE = 1.206883. The value is only slightly positive, so shuffling this variable barely changes the error and the forest gains little from it.

Company_Size: %IncMSE = -3.378399. The negative value means predictions were no worse, and marginally better, with company size shuffled, so the model finds no usable salary signal in it.

Automation_Risk: %IncMSE = 5.249569. This is the largest value and the only clearly positive one, although with negative variance explained overall even this signal is weak.

Industry: %IncMSE = -3.253647. The value is negative despite Industry having the largest IncNodePurity (11,481,707,881). Node purity tends to favor predictors with many categories, and Industry has ten, so the permutation measure is the more reliable guide here.

Decision Tree Model (Job Growth Projection)

Model Quality:

Cross-Validation Error (xerror): The cross-validated relative error is 1.066 with no splits, 1.009 after one split, and 1.021 for the full four-split tree. It stays above 1 at every size, meaning that under cross-validation the tree's error is at least as high as the root node's error rate of 0.662.

Accuracy: The model classifies job growth projection into three categories: Growth, Decline, and Stable. Its in-sample accuracy is 42.8%, compared with a 33.8% baseline from always predicting the most frequent class (169 of 500). Because this accuracy is measured on the training data and xerror exceeds 1, the gain is unlikely to hold on new data.

Predictors’ Impact on Job Growth Projection (Outcome):

Industry: Industry is the most important factor (Importance = 57) for predicting job growth projections. The splits in the tree primarily happen based on the industry, suggesting that the sector a job belongs to has a substantial effect on its projected growth.

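Automation_Risk: This predictor ranks second (Importance = 23). Its only split in the fitted tree is at node 2, shown as “RLL.” Assuming R’s default alphabetical level order (High, Low, Medium), high-risk jobs go right to node 5, where Growth is the most common class (41.5%), while low- and medium-risk jobs go left to node 4, where Decline is the most common class (45.9%). The tree therefore does not reproduce the pattern described for the bar chart, in which high automation risk goes with decline, and Hypothesis 2 needs a direct test on the full data.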

Company_Size: Company size also plays a role (Importance = 20) in determining job growth projections. Smaller companies tend to show a higher probability of growth, while larger companies might be more stable or even declining, though the exact interpretation would require further exploration.

AI_Adoption_Level: This variable does not appear in the variable-importance table and is never chosen as a split in the fitted tree, indicating that it does not influence job growth projections in this model.

The random forest model for salary prediction performs poorly, as reflected in the negative out-of-bag variance explained and the high MSE. Automation_Risk is the only predictor with a clearly positive permutation importance; Company_Size and Industry have negative values. The decision tree model for job growth projection reaches 42.8% in-sample accuracy, with Industry the most important predictor of job growth categories, followed by Automation_Risk.

Salary Insights:

Low AI adoption jobs command higher salaries on average, possibly due to reliance on specialized human expertise.

High AI adoption roles show significant variability, reflecting a mix of entry-level and high-paying jobs.

Job Growth Trends:

Jobs with high automation risks are more likely to experience declines in growth.

Industry is the most significant factor in predicting job growth trends.

Model Performance:

The Random Forest model struggled with salary prediction, highlighting potential limitations in the dataset or the need for additional features.

The Decision Tree model predicted job growth projections only modestly better than the majority-class baseline in-sample (42.8% versus 33.8%), with industry being a key driver.

Limitations:

The dataset lacks experience-level information, and the models do not use the Location, Job_Title, or Required_Skills columns that it does contain; any of these may significantly influence salary predictions.

The sample size may not fully represent global job market trends.

Future Work:

Incorporate additional features to improve model accuracy: Location, which is already in the dataset but unused by the models, and years of experience, which would need to be collected.

Explore advanced machine learning techniques, such as gradient boosting, for better predictions.

References

Dataset: Kaggle - AI-Powered Job Market Insights

Oma Tonukari

2024-12-03