The impact of artificial intelligence (AI) and automation on the job market has been a topic of significant interest. This project explores a dataset of job market insights, focusing on understanding the relationship between salary levels, AI adoption, and other job characteristics. The study aims to uncover patterns in the data and build predictive models to estimate salary levels and job growth projections.
This report provides an in-depth analysis of the “AI-Powered Job Market Insights” dataset. The analysis aims to answer two key research questions related to salary predictions and job growth projections based on factors like AI adoption, automation risk, and company size. We will explore the data’s structure, examine key variables, perform exploratory data analysis, and use predictive modeling techniques like Random Forests and Decision Trees to analyze the data.
Number of Cases (Rows)
case_count <- nrow(data)
case_count
## [1] 500
The dataset contains 500 rows, each representing a job entry.
Number of Variables (Columns)
variable_count <- ncol(data)
variable_count
## [1] 10
The dataset includes 10 variables, which capture different characteristics of the job market data.
First 10 Instances
head(data, 10) %>% kable() %>% kable_styling()
| Job_Title | Industry | Company_Size | Location | AI_Adoption_Level | Automation_Risk | Required_Skills | Salary_USD | Remote_Friendly | Job_Growth_Projection |
|---|---|---|---|---|---|---|---|---|---|
| Cybersecurity Analyst | Entertainment | Small | Dubai | Medium | High | UX/UI Design | 111392.17 | Yes | Growth |
| Marketing Specialist | Technology | Large | Singapore | Medium | High | Marketing | 93792.56 | No | Decline |
| AI Researcher | Technology | Large | Singapore | Medium | High | UX/UI Design | 107170.26 | Yes | Growth |
| Sales Manager | Retail | Small | Berlin | Low | High | Project Management | 93027.95 | No | Growth |
| Cybersecurity Analyst | Entertainment | Small | Tokyo | Low | Low | JavaScript | 87752.92 | Yes | Decline |
| UX Designer | Education | Large | San Francisco | Medium | Medium | Cybersecurity | 102825.01 | No | Growth |
| HR Manager | Finance | Medium | Singapore | Low | High | Sales | 102065.72 | Yes | Growth |
| Cybersecurity Analyst | Technology | Small | Dubai | Medium | Low | Machine Learning | 86607.32 | Yes | Decline |
| AI Researcher | Retail | Large | London | High | Low | JavaScript | 75015.86 | No | Stable |
| Sales Manager | Entertainment | Medium | Singapore | High | Low | Cybersecurity | 96834.58 | Yes | Decline |
The first 10 rows of the dataset show the basic job information, including job titles, company size, and AI adoption level.
Missing Values Analysis
missing_summary <- sapply(data, function(x) sum(is.na(x)))
missing_summary <- missing_summary[missing_summary > 0]
missing_summary
## named integer(0)
The dataset contains no missing values, simplifying preprocessing.
Research Questions and Hypotheses
Question 1: Does the AI_Adoption_Level of a company impact the Salary_USD for jobs within that company?
Hypothesis 1: Higher AI_Adoption_Level is associated with higher Salary_USD.
Question 2: Can the Automation_Risk level of a job predict its Job_Growth_Projection?
Hypothesis 2: Jobs with higher Automation_Risk are more likely to have a “Decline” in Job_Growth_Projection.
Relevant Variables for Each Research Question
Question 1: Relevant Variables - AI_Adoption_Level, Salary_USD
Question 2: Relevant Variables - Automation_Risk, Job_Growth_Projection
Identify Response Variables
Question 1: Response Variable: Salary_USD
Question 2: Response Variable: Job_Growth_Projection
Missing Values for Response Variables
missing_salary <- sum(is.na(data$Salary_USD))
missing_growth <- sum(is.na(data$Job_Growth_Projection))
missing_values_responses <- list(Salary_USD = missing_salary, Job_Growth_Projection = missing_growth)
missing_values_responses
## $Salary_USD
## [1] 0
##
## $Job_Growth_Projection
## [1] 0
There are no missing values for the response variables, ensuring clean analysis.
Part 2e: Distribution of Response Variables
Salary Distribution
ggplot(data, aes(x = Salary_USD)) +
geom_histogram(binwidth = 5000) +
labs(title = "Salary Distribution", x = "Salary (USD)", y = "Count")
Job Growth Projection Distribution
ggplot(data, aes(x = Job_Growth_Projection)) +
geom_bar() +
labs(title = "Job Growth Projection Distribution", x = "Job Growth Projection", y = "Count")
The three job growth categories are nearly balanced: the class counts reported in the decision tree summary later in this report are 169 “Decline,” 169 “Growth,” and 162 “Stable.”
Part 2f: Relationship Between Response and Explanatory Variables
Relationship between Salary and AI Adoption Level
ggplot(data, aes(x = AI_Adoption_Level, y = Salary_USD, fill = AI_Adoption_Level)) +
geom_boxplot() +
labs(title = "Salary by AI Adoption Level", x = "AI Adoption Level", y = "Salary (USD)") +
theme_minimal()
The boxplot reveals that “Low AI Adoption” positions tend to have higher median salaries compared to “High AI Adoption” jobs.
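A boxplot comparison like this one is usually backed by a formal test. The sketch below is one way to do that, not part of the original analysis: it runs a Kruskal-Wallis rank-sum test on a simulated stand-in data frame (`toy`, hypothetical values) that reuses the column names `AI_Adoption_Level` and `Salary_USD`.

```r
# Simulated stand-in for the real data frame; these values are NOT from the dataset.
set.seed(1)
toy <- data.frame(
  AI_Adoption_Level = factor(rep(c("Low", "Medium", "High"), each = 50),
                             levels = c("Low", "Medium", "High")),
  Salary_USD = rnorm(150, mean = 90000, sd = 20000)
)

# Group medians: the quantity the boxplot compares visually.
group_medians <- tapply(toy$Salary_USD, toy$AI_Adoption_Level, median)

# Kruskal-Wallis test: do the three groups share a common salary distribution?
kw <- kruskal.test(Salary_USD ~ AI_Adoption_Level, data = toy)

print(group_medians)
print(kw$p.value)
```

Replacing `toy` with the report's `data` object would test Hypothesis 1 directly; a small p-value would suggest that at least one adoption level differs in salary.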
Relationship between Salary and Company Size
ggplot(data, aes(x = Company_Size, y = Salary_USD, fill = Company_Size)) +
geom_boxplot() +
labs(title = "Salary by Company Size", x = "Company Size", y = "Salary (USD)") +
theme_minimal()
Smaller companies tend to offer higher salaries, although there is substantial overlap in salary ranges across company sizes.
Relationship between Job Growth Projection and Automation Risk
ggplot(data, aes(x = Automation_Risk, fill = Job_Growth_Projection)) +
geom_bar(position = "dodge") +
labs(title = "Job Growth Projection by Automation Risk", x = "Automation Risk", y = "Count")
Jobs with “High Automation Risk” are more likely to experience a decline in job growth, whereas low automation risk jobs tend to show more stability or growth.
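Whether two categorical variables are associated can be checked with row proportions and a chi-squared test of independence. The sketch below shows that approach on a simulated stand-in data frame (`toy`, hypothetical values); it is an illustration under those assumptions rather than a result from the dataset.

```r
# Simulated stand-in data; these values are NOT from the dataset.
set.seed(2)
levels_risk   <- c("Low", "Medium", "High")
levels_growth <- c("Decline", "Growth", "Stable")
toy <- data.frame(
  Automation_Risk = factor(sample(levels_risk, 300, replace = TRUE),
                           levels = levels_risk),
  Job_Growth_Projection = factor(sample(levels_growth, 300, replace = TRUE),
                                 levels = levels_growth)
)

# Contingency table of counts, the same numbers a dodged bar chart displays.
counts <- table(toy$Automation_Risk, toy$Job_Growth_Projection)

# Row proportions: share of each growth category within each risk level.
row_shares <- prop.table(counts, margin = 1)

# Chi-squared test of independence between the two factors.
chi <- chisq.test(counts)

print(round(row_shares, 3))
print(chi$p.value)
```

Row proportions make groups of different sizes comparable, which raw counts in a bar chart do not.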
Relationship between Job Growth Projection and Industry
ggplot(data, aes(x = Industry, fill = Job_Growth_Projection)) +
geom_bar(position = "dodge") +
labs(title = "Job Growth Projection by Industry", x = "Industry", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Certain industries, such as education, show higher growth projections, while others, like manufacturing, show more declines.
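# Derive three features: a z-scored salary, a salary band, and a high-risk flag.
data <- data %>%
  # scale() returns a one-column matrix, so this column prints as <dbl[,1]>
  mutate(Salary_Normalized = scale(Salary_USD)) %>%
  mutate(Salary_Range = case_when(
    Salary_USD < 60000 ~ "Low",
    Salary_USD >= 60000 & Salary_USD < 100000 ~ "Medium",
    Salary_USD >= 100000 ~ "High"
  )) %>%
  mutate(High_Automation_Risk = if_else(Automation_Risk == "High", 1, 0))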
glimpse(data)
## Rows: 500
## Columns: 13
## $ Job_Title <chr> "Cybersecurity Analyst", "Marketing Specialist",…
## $ Industry <chr> "Entertainment", "Technology", "Technology", "Re…
## $ Company_Size <fct> Small, Large, Large, Small, Small, Large, Medium…
## $ Location <chr> "Dubai", "Singapore", "Singapore", "Berlin", "To…
## $ AI_Adoption_Level <fct> Medium, Medium, Medium, Low, Low, Medium, Low, M…
## $ Automation_Risk <fct> High, High, High, High, Low, Medium, High, Low, …
## $ Required_Skills <chr> "UX/UI Design", "Marketing", "UX/UI Design", "Pr…
## $ Salary_USD <dbl> 111392.17, 93792.56, 107170.26, 93027.95, 87752.…
## $ Remote_Friendly <chr> "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "Y…
## $ Job_Growth_Projection <fct> Growth, Decline, Growth, Growth, Decline, Growth…
## $ Salary_Normalized <dbl[,1]> <matrix[26 x 1]>
## $ Salary_Range <chr> "High", "Medium", "High", "Medium", "Medium"…
## $ High_Automation_Risk <dbl> 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, …
I have created new features like normalized salary, salary range (Low, Medium, High), and a binary variable for high automation risk jobs.
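The `glimpse()` output lists `Salary_Normalized` as `<dbl[,1]>` because `scale()` returns a one-column matrix. The base-R sketch below, run on a small hypothetical data frame (`toy`), shows one possible way to build the same three features while keeping the normalized column a plain numeric vector. The use of `cut()` in place of `case_when()` is a substitution made for this illustration.

```r
# Small hypothetical data frame; these values are NOT from the dataset.
toy <- data.frame(
  Salary_USD      = c(45000, 60000, 87500, 100000, 120000),
  Automation_Risk = c("High", "Low", "Medium", "High", "Low")
)

# scale() returns a matrix; as.numeric() flattens it to an ordinary vector.
z_matrix <- scale(toy$Salary_USD)
toy$Salary_Normalized <- as.numeric(z_matrix)

# right = FALSE gives [lower, upper) intervals, matching the thresholds
# "< 60000", ">= 60000 & < 100000", and ">= 100000".
toy$Salary_Range <- cut(toy$Salary_USD,
                        breaks = c(-Inf, 60000, 100000, Inf),
                        labels = c("Low", "Medium", "High"),
                        right = FALSE)

# Binary flag for high automation risk.
toy$High_Automation_Risk <- as.integer(toy$Automation_Risk == "High")

str(toy)
```

A plain numeric column tends to be easier to pass to modeling functions than a matrix column.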
summary_stats <- data %>%
group_by(AI_Adoption_Level) %>%
summarise(
Mean_Salary = mean(Salary_USD, na.rm = TRUE),
SD_Salary = sd(Salary_USD, na.rm = TRUE),
Min_Salary = min(Salary_USD, na.rm = TRUE),
Q1_Salary = quantile(Salary_USD, 0.25, na.rm = TRUE),
Median_Salary = median(Salary_USD, na.rm = TRUE),
Q3_Salary = quantile(Salary_USD, 0.75, na.rm = TRUE),
Max_Salary = max(Salary_USD, na.rm = TRUE)
)
summary_stats %>%
kable(caption = "Summary Statistics for Salary_USD by AI Adoption Level") %>%
kable_styling(full_width = FALSE)
| AI_Adoption_Level | Mean_Salary | SD_Salary | Min_Salary | Q1_Salary | Median_Salary | Q3_Salary | Max_Salary |
|---|---|---|---|---|---|---|---|
| High | 87583.42 | 21021.20 | 41298.73 | 74216.26 | 86379.88 | 101357.0 | 155209.8 |
| Low | 93353.60 | 20864.83 | 31969.53 | 79016.97 | 95700.14 | 105165.4 | 140476.0 |
| Medium | 92139.14 | 19412.02 | 35963.30 | 81055.99 | 92891.89 | 105665.0 | 134822.7 |
The summary statistics for Salary_USD grouped by AI_Adoption_Level reveal the following patterns:
Mean and Median Salaries:
Low AI Adoption has the highest mean salary ($93,353.60) and median salary ($95,700.14), suggesting that these jobs generally offer the best compensation.
Medium AI Adoption follows with a mean salary of $92,139.14 and median salary of $92,891.89, showing slightly lower pay than the “Low” group.
High AI Adoption has the lowest mean ($87,583.42) and median ($86,379.88) salaries, despite having the highest maximum salary ($155,209.80).
Insight: Higher AI adoption levels do not correlate with higher average salaries. This could indicate that high-AI adoption jobs are more common in industries with entry-level or support roles, while low-AI adoption jobs are likely in traditional, well-compensated fields.
Salary Variability:
High AI Adoption has the largest standard deviation ($21,021.20) and range ($41,298.73 to $155,209.80), indicating significant variability in pay. This reflects a mix of both high-paying specialized roles and lower-paying positions.
Low AI Adoption also has substantial variability (SD: $20,864.83) but shows more consistent compensation in the higher salary ranges (Q3: $105,165.40).
Medium AI Adoption has the lowest standard deviation ($19,412.02) and a tighter range of salaries, suggesting more consistency in pay structures.
Insight: High salary variability in “High AI Adoption” jobs reflects diverse roles, ranging from lower-tier positions to highly specialized, high-paying roles.
Five-Number Summary:
Medium AI Adoption has the highest Q1 ($81,055.99) and Q3 ($105,665.00), with Low AI Adoption close behind (Q1 $79,016.97, Q3 $105,165.40); Low AI Adoption has the highest median ($95,700.14).
High AI Adoption has the lowest Q1 ($74,216.26), median, and Q3 ($101,357.00), so most of its salaries sit lower than in the other groups despite its having the highest maximum.
Medium AI Adoption offers consistent middle-range pay, with the narrowest gap between Q1 and Q3 of the three groups.
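The spread of the middle half of each group can be read off the summary table as the interquartile range (Q3 minus Q1). A short sketch using the quartiles reported in the table above; the data frame name `quartiles` is introduced here for illustration only.

```r
# Quartiles copied from the summary statistics table above.
quartiles <- data.frame(
  AI_Adoption_Level = c("High", "Low", "Medium"),
  Q1 = c(74216.26, 79016.97, 81055.99),
  Q3 = c(101357.0, 105165.4, 105665.0)
)

# Interquartile range: width of the middle 50% of salaries in each group.
quartiles$IQR <- quartiles$Q3 - quartiles$Q1

# Sort from the tightest to the widest middle range.
print(quartiles[order(quartiles$IQR), ])
```

The resulting ranges are about $24,609 for Medium, $26,148 for Low, and $27,141 for High.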
ggplot(data, aes(x = AI_Adoption_Level, y = Salary_USD, fill = AI_Adoption_Level)) +
geom_boxplot() +
labs(title = "Salary by AI Adoption Level", x = "AI Adoption Level", y = "Salary (USD)") +
theme_minimal()
Salary by AI Adoption Level (Boxplot): Median salaries are highest for “Low AI Adoption” and lowest for “High AI Adoption.”
Insight: Traditional, low-AI industries may prioritize human expertise, resulting in better compensation, while high-AI adoption jobs might include more entry-level roles.
ggplot(data, aes(x = Company_Size, y = Salary_USD, fill = Company_Size)) +
geom_boxplot() +
labs(title = "Salary by Company Size", x = "Company Size", y = "Salary (USD)") +
theme_minimal()
Salary by Company Size (Boxplot):
Larger companies offer higher median salaries and a broader salary range.
Insight: Larger organizations likely invest in specialized, high-paying roles.
ggplot(data, aes(x = Automation_Risk, fill = Job_Growth_Projection)) +
geom_bar(position = "dodge") +
labs(title = "Job Growth Projection by Automation Risk", x = "Automation Risk", y = "Count") +
theme_minimal()
Job Growth Projection by Automation Risk (Bar Chart):
Jobs with “High Automation Risk” have the highest likelihood of decline, while “Low Automation Risk” jobs exhibit stronger growth potential.
Insight: Automation risk inversely correlates with job growth.
ggplot(data, aes(x = AI_Adoption_Level, y = Salary_USD, color = AI_Adoption_Level)) +
geom_jitter(width = 0.2) +
labs(title = "Scatter Plot of Salary vs AI Adoption Level", x = "AI Adoption Level", y = "Salary (USD)") +
theme_minimal()
Scatter Plot of Salary vs AI Adoption Level:
Salaries in “High AI Adoption” environments show significant variability, including some very high-paying roles.
Insight: High-AI adoption environments may include both entry-level and specialized roles, leading to this spread in pay.
Random Forest Regression for Salary_USD
rf_model <- randomForest(Salary_USD ~ AI_Adoption_Level + Company_Size + Automation_Risk + Industry, data = data, importance = TRUE, ntree = 500)
rf_model
##
## Call:
## randomForest(formula = Salary_USD ~ AI_Adoption_Level + Company_Size + Automation_Risk + Industry, data = data, importance = TRUE, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 425760681
## % Var explained: -1.47
importance(rf_model)
## %IncMSE IncNodePurity
## AI_Adoption_Level 1.206883 6285507205
## Company_Size -3.378399 4715096300
## Automation_Risk 5.249569 7016933148
## Industry -3.253647 11481707881
varImpPlot(rf_model)
Decision Tree for Job_Growth_Projection
tree_model <- rpart(Job_Growth_Projection ~ Automation_Risk + AI_Adoption_Level + Company_Size + Industry, data = data, method = "class")
summary(tree_model)
## Call:
## rpart(formula = Job_Growth_Projection ~ Automation_Risk + AI_Adoption_Level +
## Company_Size + Industry, data = data, method = "class")
## n= 500
##
## CP nsplit rel error xerror xstd
## 1 0.07854985 0 1.0000000 1.066465 0.03077748
## 2 0.02719033 1 0.9214502 1.009063 0.03181371
## 3 0.01510574 2 0.8942598 1.048338 0.03113131
## 4 0.01000000 4 0.8640483 1.021148 0.03161570
##
## Variable importance
## Industry Automation_Risk Company_Size
## 57 23 20
##
## Node number 1: 500 observations, complexity param=0.07854985
## predicted class=Decline expected loss=0.662 P(node) =1
## class counts: 169 169 162
## probabilities: 0.338 0.338 0.324
## left son=2 (200 obs) right son=3 (300 obs)
## Primary splits:
## Industry splits as RLLRRLLRRR, improve=3.3946670, (0 missing)
## Company_Size splits as RRL, improve=2.0689030, (0 missing)
## AI_Adoption_Level splits as LRL, improve=0.6396945, (0 missing)
## Automation_Risk splits as LRR, improve=0.4751721, (0 missing)
##
## Node number 2: 200 observations, complexity param=0.01510574
## predicted class=Decline expected loss=0.58 P(node) =0.4
## class counts: 84 58 58
## probabilities: 0.420 0.290 0.290
## left son=4 (135 obs) right son=5 (65 obs)
## Primary splits:
## Automation_Risk splits as RLL, improve=2.3392590, (0 missing)
## Industry splits as -RL--LR---, improve=1.9892730, (0 missing)
## AI_Adoption_Level splits as RRL, improve=1.1026510, (0 missing)
## Company_Size splits as RLL, improve=0.8667647, (0 missing)
##
## Node number 3: 300 observations, complexity param=0.02719033
## predicted class=Growth expected loss=0.63 P(node) =0.6
## class counts: 85 111 104
## probabilities: 0.283 0.370 0.347
## left son=6 (99 obs) right son=7 (201 obs)
## Primary splits:
## Company_Size splits as RRL, improve=1.9668600, (0 missing)
## AI_Adoption_Level splits as LRL, improve=1.3932170, (0 missing)
## Industry splits as L--LL--LRR, improve=1.1959250, (0 missing)
## Automation_Risk splits as LLR, improve=0.5134799, (0 missing)
##
## Node number 4: 135 observations
## predicted class=Decline expected loss=0.5407407 P(node) =0.27
## class counts: 62 31 42
## probabilities: 0.459 0.230 0.311
##
## Node number 5: 65 observations, complexity param=0.01510574
## predicted class=Growth expected loss=0.5846154 P(node) =0.13
## class counts: 22 27 16
## probabilities: 0.338 0.415 0.246
## left son=10 (19 obs) right son=11 (46 obs)
## Primary splits:
## Industry splits as -RR--LR---, improve=2.35194500, (0 missing)
## Company_Size splits as RLR, improve=1.14444400, (0 missing)
## AI_Adoption_Level splits as RLL, improve=0.01333333, (0 missing)
##
## Node number 6: 99 observations
## predicted class=Growth expected loss=0.5858586 P(node) =0.198
## class counts: 33 41 25
## probabilities: 0.333 0.414 0.253
##
## Node number 7: 201 observations
## predicted class=Stable expected loss=0.6069652 P(node) =0.402
## class counts: 52 70 79
## probabilities: 0.259 0.348 0.393
##
## Node number 10: 19 observations
## predicted class=Decline expected loss=0.4210526 P(node) =0.038
## class counts: 11 6 2
## probabilities: 0.579 0.316 0.105
##
## Node number 11: 46 observations
## predicted class=Growth expected loss=0.5434783 P(node) =0.092
## class counts: 11 21 14
## probabilities: 0.239 0.457 0.304
rpart.plot(tree_model, main = "Decision Tree for Job Growth Projection")
Model Evaluation and Comparison
# In-sample evaluation: both models are scored on the same data they were trained on,
# so these figures are optimistic compared with out-of-bag or cross-validated error.
rf_predictions <- predict(rf_model, newdata = data)
rf_mse <- mean((rf_predictions - data$Salary_USD)^2)
rf_r_squared <- 1 - (sum((rf_predictions - data$Salary_USD)^2) / sum((mean(data$Salary_USD) - data$Salary_USD)^2))
tree_predictions <- predict(tree_model, newdata = data, type = "class")
tree_accuracy <- mean(tree_predictions == data$Job_Growth_Projection)
rf_mse
## [1] 365281207
rf_r_squared
## [1] 0.1294229
tree_accuracy
## [1] 0.428
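An accuracy figure is easier to judge against a majority-class baseline. The sketch below takes the class counts and the CP table from the `rpart` summary above and converts the relative errors into accuracies. The variable names are introduced here for illustration.

```r
# Class counts at the root node (Node number 1 in the rpart summary).
class_counts <- c(Decline = 169, Growth = 169, Stable = 162)
n <- sum(class_counts)

# Always predicting the most common class gives the baseline accuracy.
baseline_accuracy <- max(class_counts) / n
root_error <- 1 - baseline_accuracy   # matches "expected loss=0.662"

# CP table values copied from the summary; both error columns are
# expressed relative to the root node error.
cp_table <- data.frame(
  nsplit    = c(0, 1, 2, 4),
  rel_error = c(1.0000000, 0.9214502, 0.8942598, 0.8640483),
  xerror    = c(1.066465, 1.009063, 1.048338, 1.021148)
)

# Convert relative errors to absolute accuracies.
cp_table$train_accuracy <- 1 - cp_table$rel_error * root_error
cp_table$cv_accuracy    <- 1 - cp_table$xerror * root_error

print(baseline_accuracy)
print(round(cp_table, 3))
```

The last row's training accuracy reproduces the 0.428 reported above, while every cross-validated accuracy falls below the 0.338 baseline.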
Random Forest Model (Salary Prediction)
Model Quality:
Mean of Squared Residuals (MSE): 425,760,681, computed on out-of-bag observations. This corresponds to a root-mean-square error of roughly $20,600, about the same as the salary standard deviation within each AI adoption group, so the model adds little beyond predicting the mean salary.
% of Variance Explained: -1.47%. The negative value indicates that the out-of-bag predictions are worse than a baseline that predicts the mean salary for every case. The in-sample figures from the evaluation code (MSE 365,281,207, R-squared 0.129) look better only because the model is scored on its own training data. Overall, the model is not suitable for explaining salary variation in this dataset.
Predictors’ Impact on Salary (Outcome):
%IncMSE is a permutation importance: it measures how much the out-of-bag error rises when a predictor's values are shuffled. It reflects predictive usefulness, not the direction of an effect, and values near zero or below zero mean the predictor is no more useful than noise.
AI_Adoption_Level: A small positive importance (%IncMSE = 1.21) suggests that AI adoption level contributes marginally, if at all, to salary prediction.
Company_Size: The negative importance (%IncMSE = -3.38) means that shuffling company size does not worsen predictions, so the model gains no predictive value from it. It does not imply that larger companies pay less.
Automation_Risk: The highest importance (%IncMSE = 5.25) makes this the most useful predictor in the model, although the value says nothing about whether higher risk goes with higher or lower pay.
Industry: The negative importance (%IncMSE = -3.25) indicates no measurable predictive contribution. Its large IncNodePurity likely reflects that a ten-level factor offers many more candidate splits, not that it carries real signal.
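The idea behind a permutation-based score such as %IncMSE can be sketched in base R. The example below is an illustration under stated assumptions, not the `randomForest` implementation (which works per tree on out-of-bag data and scales the result by default). It fits a linear model to simulated data with one informative predictor (`signal`) and one irrelevant predictor (`noise`), both hypothetical, then measures how much the test MSE rises when each is shuffled.

```r
# Simulated data: 'signal' drives the response, 'noise' is unrelated to it.
set.seed(42)
n <- 200
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
toy$y <- 3 * toy$signal + rnorm(n)

# Hold out a test set so importance is measured on unseen data.
train_idx <- sample(n, 150)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

fit <- lm(y ~ signal + noise, data = train)
mse <- function(df) mean((df$y - predict(fit, newdata = df))^2)
base_mse <- mse(test)

# Average increase in test MSE after shuffling one predictor's values.
permutation_increase <- function(var, n_rep = 20) {
  increases <- replicate(n_rep, {
    shuffled <- test
    shuffled[[var]] <- sample(shuffled[[var]])
    mse(shuffled) - base_mse
  })
  mean(increases)
}

importance_sketch <- sapply(c("signal", "noise"), permutation_increase)
print(importance_sketch)
```

Shuffling `signal` raises the error sharply, while shuffling `noise` leaves it close to unchanged and can even lower it by chance, which is how a negative importance arises.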
Decision Tree Model (Job Growth Projection)
Model Quality:
Cross-Validation Error (xerror): The cross-validated relative error is 1.066 at the root, 1.009 after one split, 1.048 after two splits, and 1.021 after four. It never drops below 1, so under cross-validation the tree does not outperform simply predicting the most common class.
Accuracy: The tree classifies 42.8% of the training observations correctly across the three categories (Growth, Decline, and Stable). Because the classes are nearly balanced, always predicting the largest class would already reach 33.8% (169 of 500), so the in-sample gain is modest and, given the xerror values, unlikely to carry over to new data.
Predictors’ Impact on Job Growth Projection (Outcome):
Industry: Industry is the most important factor (Importance = 57) for predicting job growth projections. The splits in the tree primarily happen based on the industry, suggesting that the sector a job belongs to has a substantial effect on its projected growth.
Automation_Risk: This predictor ranks second (Importance = 23), but the tree splits on it only once, at Node 2, within a 200-observation group of industries. The importance score does not indicate direction, so this single split does not by itself establish that higher automation risk leads to a “Decline” classification.
Company_Size: Company size also plays a role (Importance = 20) in determining job growth projections. Smaller companies tend to show a higher probability of growth, while larger companies might be more stable or even declining, though the exact interpretation would require further exploration.
AI_Adoption_Level: This variable does not appear in the variable importance list and is never chosen as a split, indicating that it does not influence job growth projections in this model.
The random forest model for salary prediction performs poorly, as reflected in the negative out-of-bag % Var explained and the high MSE. Automation_Risk is the only predictor with a clearly positive permutation importance, while Company_Size and Industry contribute no measurable predictive value. The decision tree model for job growth projection does only slightly better than the majority-class baseline in-sample, with Industry the most important predictor, followed by Automation_Risk.
Salary Insights:
Low AI adoption jobs command higher salaries on average, possibly due to reliance on specialized human expertise.
High AI adoption roles show significant variability, reflecting a mix of entry-level and high-paying jobs.
Job Growth Trends:
Jobs with high automation risks are more likely to experience declines in growth.
Industry is the most significant factor in predicting job growth trends.
Model Performance:
The Random Forest model struggled with salary prediction, highlighting potential limitations in the dataset or the need for additional features.
The Decision Tree model reached 42.8% in-sample accuracy on job growth projections, only modestly above the 33.8% majority-class baseline, with industry being the key driver.
Limitations:
The models do not use Location, Job_Title, or Required_Skills, and the dataset contains no experience-level information; any of these may significantly influence salary predictions.
With only 500 rows, the sample may not fully represent global job market trends.
Future Work:
Incorporate additional features, such as the unused Location column and, if it can be obtained, years of experience, to improve model accuracy.
Explore advanced machine learning techniques, such as gradient boosting, for better predictions.
References
Dataset: Kaggle - AI-Powered Job Market Insights