Random Forest for Salary_USD
# Train a random forest model for Salary prediction
rf_model <- randomForest(Salary_USD ~ AI_Adoption_Level + Company_Size + Automation_Risk + Industry, data = data, importance = TRUE, ntree = 500)
# Print the model summary
rf_model
##
## Call:
## randomForest(formula = Salary_USD ~ AI_Adoption_Level + Company_Size + Automation_Risk + Industry, data = data, importance = TRUE, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 425077527
## % Var explained: -1.31
# Variable importance
importance(rf_model)
##                     %IncMSE IncNodePurity
## AI_Adoption_Level  1.787137    6123371856
## Company_Size      -2.591380    4484123132
## Automation_Risk    6.257217    6951805109
## Industry          -2.959176   11331896055
# Plotting the importance
varImpPlot(rf_model)
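The %IncMSE column is a permutation importance: it reflects how much prediction error grows when one predictor's values are shuffled. The sketch below illustrates the idea only. It uses synthetic data and a linear model as stand-ins (the names `toy`, `x_signal`, `x_noise`, and `perm_importance` are hypothetical), whereas randomForest computes the measure per tree on out-of-bag samples and, by default, scales it.

```r
# Illustrative sketch: synthetic data and lm() stand in for the report's
# data and random forest. All names here are hypothetical.
set.seed(42)  # arbitrary seed so the sketch is reproducible
n <- 200
toy <- data.frame(x_signal = rnorm(n), x_noise = rnorm(n))
toy$y <- 3 * toy$x_signal + rnorm(n)  # only x_signal drives y

fit <- lm(y ~ x_signal + x_noise, data = toy)
mse <- function(model, d) mean((predict(model, newdata = d) - d$y)^2)
base_mse <- mse(fit, toy)

# Mean increase in MSE after shuffling one predictor, over several repeats
perm_importance <- function(var, repeats = 20) {
  increases <- replicate(repeats, {
    shuffled <- toy
    shuffled[[var]] <- sample(shuffled[[var]])
    mse(fit, shuffled) - base_mse
  })
  mean(increases)
}

imp <- sapply(c("x_signal", "x_noise"), perm_importance)
print(imp)
```

A predictor that carries real signal shows a large increase, while an uninformative one stays near zero. The score says nothing about whether the predictor raises or lowers the outcome.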
Decision Tree for Job_Growth_Projection
# Train a decision tree model for Job Growth Projection
tree_model <- rpart(Job_Growth_Projection ~ Automation_Risk + AI_Adoption_Level + Company_Size + Industry, data = data, method = "class")
# Print the summary of the tree model
summary(tree_model)
## Call:
## rpart(formula = Job_Growth_Projection ~ Automation_Risk + AI_Adoption_Level +
## Company_Size + Industry, data = data, method = "class")
## n= 500
##
##           CP nsplit rel error    xerror       xstd
## 1 0.07854985      0 1.0000000 1.0785498 0.03052735
## 2 0.02719033      1 0.9214502 0.9969789 0.03200136
## 3 0.01510574      2 0.8942598 1.0513595 0.03107409
## 4 0.01000000      4 0.8640483 1.0120846 0.03176518
##
## Variable importance
##        Industry Automation_Risk    Company_Size
##              57              23              20
##
## Node number 1: 500 observations, complexity param=0.07854985
## predicted class=Decline expected loss=0.662 P(node) =1
## class counts: 169 169 162
## probabilities: 0.338 0.338 0.324
## left son=2 (200 obs) right son=3 (300 obs)
## Primary splits:
## Industry splits as RLLRRLLRRR, improve=3.3946670, (0 missing)
## Company_Size splits as RRL, improve=2.0689030, (0 missing)
## AI_Adoption_Level splits as LRL, improve=0.6396945, (0 missing)
## Automation_Risk splits as LRR, improve=0.4751721, (0 missing)
##
## Node number 2: 200 observations, complexity param=0.01510574
## predicted class=Decline expected loss=0.58 P(node) =0.4
## class counts: 84 58 58
## probabilities: 0.420 0.290 0.290
## left son=4 (135 obs) right son=5 (65 obs)
## Primary splits:
## Automation_Risk splits as RLL, improve=2.3392590, (0 missing)
## Industry splits as -RL--LR---, improve=1.9892730, (0 missing)
## AI_Adoption_Level splits as RRL, improve=1.1026510, (0 missing)
## Company_Size splits as RLL, improve=0.8667647, (0 missing)
##
## Node number 3: 300 observations, complexity param=0.02719033
## predicted class=Growth expected loss=0.63 P(node) =0.6
## class counts: 85 111 104
## probabilities: 0.283 0.370 0.347
## left son=6 (99 obs) right son=7 (201 obs)
## Primary splits:
## Company_Size splits as RRL, improve=1.9668600, (0 missing)
## AI_Adoption_Level splits as LRL, improve=1.3932170, (0 missing)
## Industry splits as L--LL--LRR, improve=1.1959250, (0 missing)
## Automation_Risk splits as LLR, improve=0.5134799, (0 missing)
##
## Node number 4: 135 observations
## predicted class=Decline expected loss=0.5407407 P(node) =0.27
## class counts: 62 31 42
## probabilities: 0.459 0.230 0.311
##
## Node number 5: 65 observations, complexity param=0.01510574
## predicted class=Growth expected loss=0.5846154 P(node) =0.13
## class counts: 22 27 16
## probabilities: 0.338 0.415 0.246
## left son=10 (19 obs) right son=11 (46 obs)
## Primary splits:
## Industry splits as -RR--LR---, improve=2.35194500, (0 missing)
## Company_Size splits as RLR, improve=1.14444400, (0 missing)
## AI_Adoption_Level splits as RLL, improve=0.01333333, (0 missing)
##
## Node number 6: 99 observations
## predicted class=Growth expected loss=0.5858586 P(node) =0.198
## class counts: 33 41 25
## probabilities: 0.333 0.414 0.253
##
## Node number 7: 201 observations
## predicted class=Stable expected loss=0.6069652 P(node) =0.402
## class counts: 52 70 79
## probabilities: 0.259 0.348 0.393
##
## Node number 10: 19 observations
## predicted class=Decline expected loss=0.4210526 P(node) =0.038
## class counts: 11 6 2
## probabilities: 0.579 0.316 0.105
##
## Node number 11: 46 observations
## predicted class=Growth expected loss=0.5434783 P(node) =0.092
## class counts: 11 21 14
## probabilities: 0.239 0.457 0.304
# Plot the decision tree
rpart.plot(tree_model, main = "Decision Tree for Job Growth Projection")
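The CP table printed by summary(tree_model) can also guide pruning. The sketch below copies that table by hand and picks a tree size using the minimum cross-validated error and the common one-standard-error rule. The data frame `cp_table` and the variables `best` and `one_se` are illustrative names, and the last comment assumes the fitted `tree_model` object is available.

```r
# CP table copied from the summary(tree_model) output above
cp_table <- data.frame(
  CP     = c(0.07854985, 0.02719033, 0.01510574, 0.01000000),
  nsplit = c(0, 1, 2, 4),
  xerror = c(1.0785498, 0.9969789, 1.0513595, 1.0120846),
  xstd   = c(0.03052735, 0.03200136, 0.03107409, 0.03176518)
)

# Row with the lowest cross-validated relative error
best <- which.min(cp_table$xerror)

# One-standard-error rule: the smallest tree whose xerror is within
# one xstd of the minimum
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se <- min(which(cp_table$xerror <= threshold))

print(cp_table[best, ])
print(cp_table[one_se, ])

# With the fitted model in memory, pruning could then look like:
# pruned_tree <- prune(tree_model, cp = cp_table$CP[one_se])
```

Both rules select the one-split tree here, and even that tree has an xerror of about 1, so pruning simplifies the model without making it predictive.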
Model Evaluation and Comparison
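# In-sample predictions from the random forest (same rows used for training)
rf_predictions <- predict(rf_model, newdata = data)
# Mean squared error and R-squared on the training data
rf_mse <- mean((rf_predictions - data$Salary_USD)^2)
rf_r_squared <- 1 - (sum((rf_predictions - data$Salary_USD)^2) / sum((mean(data$Salary_USD) - data$Salary_USD)^2))
# In-sample class predictions and accuracy for the decision tree
tree_predictions <- predict(tree_model, newdata = data, type = "class")
tree_accuracy <- mean(tree_predictions == data$Job_Growth_Projection)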
rf_mse
## [1] 366712214
rf_r_squared
## [1] 0.1260123
tree_accuracy
## [1] 0.428
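An accuracy figure is easier to judge next to a majority-class baseline. The sketch below computes that baseline from the root-node class counts in the tree summary; the class labels are inferred from the predicted classes in the summary, and `reported_accuracy` is simply the value printed above, entered by hand.

```r
# Class counts at the root node, copied from the summary(tree_model) output.
# Label order (Decline, Growth, Stable) is inferred from the node predictions.
class_counts <- c(Decline = 169, Growth = 169, Stable = 162)

# Accuracy of always predicting the most common class
baseline_accuracy <- max(class_counts) / sum(class_counts)

# In-sample accuracy reported above, entered by hand
reported_accuracy <- 0.428

# Improvement of the tree over the baseline, in accuracy points
lift <- reported_accuracy - baseline_accuracy

print(baseline_accuracy)
print(lift)
```

The baseline is 0.338, so the tree gains about nine percentage points on the data it was trained on.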
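Random Forest Model (Salary Prediction)
Model Quality:
Mean of Squared Residuals (MSE): 425,077,527 in the run shown above. This is the out-of-bag (OOB) error, and it is slightly larger than the variance of Salary_USD itself, so the model does not fit the data well. The exact figure varies from run to run unless a seed is fixed with set.seed().
% of Variance Explained: -1.31%. A negative value means the model's out-of-bag predictions are worse than a baseline that predicts the mean salary for every case, so the model is not suitable for explaining salary variation in this dataset. The in-sample R-squared of 0.126 from the evaluation step is higher only because it is measured on the same rows the forest was trained on.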
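The link between the reported MSE and the percentage of variance explained can be checked with a short calculation. The sketch below assumes the figure is defined as 1 - MSE / Var(y), with the variance taken over all n rows, which is my understanding of how randomForest reports it. The salary variance is not printed anywhere in the output, so it is recovered here from the in-sample MSE and R-squared.

```r
# Values copied from the output above
oob_mse      <- 425077527  # "Mean of squared residuals" in the printed model
insample_mse <- 366712214  # rf_mse from the evaluation step
insample_r2  <- 0.1260123  # rf_r_squared from the evaluation step

# R^2 = 1 - MSE / Var(y), so Var(y) = MSE / (1 - R^2)
salary_variance <- insample_mse / (1 - insample_r2)

# Percentage of variance explained, using the out-of-bag MSE
pct_var_explained <- 100 * (1 - oob_mse / salary_variance)
print(round(pct_var_explained, 2))  # -1.31
```

The result matches the printed "% Var explained", which confirms that the negative value arises because the out-of-bag MSE exceeds the variance of the salaries.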
Predictors’ Impact on Salary (Outcome):
%IncMSE measures how much the out-of-bag error increases when a predictor's values are randomly permuted. It shows how useful a predictor is for prediction, not whether it raises or lowers salary. Values near or below zero mean the predictor contributes no more than noise.
AI_Adoption_Level: A small positive importance (%IncMSE = 1.787137), so it adds little to prediction accuracy.
Company_Size: A negative importance (%IncMSE = -2.591380). Permuting company size did not worsen the predictions, so the model gains nothing from this variable. This says nothing about whether larger firms pay more or less.
Automation_Risk: The largest importance (%IncMSE = 6.257217) and the only clearly positive one. Automation risk carries some predictive signal for salary, but the direction of the relationship cannot be read from this score.
Industry: A negative importance (%IncMSE = -2.959176), despite the largest IncNodePurity (11,331,896,055). Node purity tends to favor factors with many levels, and Industry has ten, so the permutation measure is the more reliable guide here.
Decision Tree Model (Job Growth Projection)
Model Quality:
Cross-Validation Error (xerror): The relative cross-validated error is 1.0785 with no splits, 0.9970 after one split, 1.0514 after two splits, and 1.0121 for the final tree with four splits. It does not fall steadily as the tree grows, and every value is close to 1, the error of a root node that always predicts the majority class. On held-out folds the tree therefore classifies about as well as that baseline.
Accuracy: On the training data the tree classifies 42.8% of cases correctly across the three categories (Decline, Growth, and Stable), compared with 33.8% for always predicting the most common class. This is an in-sample figure, so it overstates how the tree would perform on new data.
Predictors’ Impact on Job Growth Projection (Outcome):
Industry: Industry is the most important predictor (Importance = 57). The root split and the split at node 5 are both on Industry, so the sector a job belongs to drives most of the tree's structure.
Automation_Risk: The second most important predictor (Importance = 23). It is used at node 2, where one risk level is sent to node 5 and the other two to node 4, which is 45.9% Decline. Which levels these are depends on the factor's level order, which the output does not show, so the tree alone does not establish that higher automation risk leads to Decline.
Company_Size: Company size also plays a role (Importance = 20). It is used at node 3, where one size level goes to node 6 (41.4% Growth) and the other two go to node 7 (39.3% Stable). Identifying which size category is associated with growth requires checking the level order.
AI_Adoption_Level: This variable does not appear in the variable importance table and is not used in any split of the fitted tree, so it does not influence job growth projections in this model.
The random forest model for salary prediction performs poorly: its out-of-bag R-squared is negative (-1.31%) and its MSE is high. Automation_Risk is the only predictor with a clearly positive permutation importance. Industry and Company_Size have negative values, meaning they add no predictive value in this model. The decision tree for job growth projection reaches 42.8% accuracy on the training data against a 33.8% majority-class baseline, but its cross-validated error is no better than that baseline. Industry accounts for most of the tree's variable importance, followed by Automation_Risk and Company_Size.