proj part 4

Random Forest for Salary_USD

# Train a random forest model for Salary prediction
library(randomForest)
rf_model <- randomForest(Salary_USD ~ AI_Adoption_Level + Company_Size + Automation_Risk + Industry, data = data, importance = TRUE, ntree = 500)

# Print the model summary
rf_model

## 
## Call:
##  randomForest(formula = Salary_USD ~ AI_Adoption_Level + Company_Size +      Automation_Risk + Industry, data = data, importance = TRUE,      ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 425077527
##                     % Var explained: -1.31

# Variable importance
importance(rf_model)

##                     %IncMSE IncNodePurity
## AI_Adoption_Level  1.787137    6123371856
## Company_Size      -2.591380    4484123132
## Automation_Risk    6.257217    6951805109
## Industry          -2.959176   11331896055

# Plotting the importance
varImpPlot(rf_model)

Decision Tree for Job_Growth_Projection

# Train a decision tree model for Job Growth Projection
library(rpart)
tree_model <- rpart(Job_Growth_Projection ~ Automation_Risk + AI_Adoption_Level + Company_Size + Industry, data = data, method = "class")

# Print the summary of the tree model
summary(tree_model)

## Call:
## rpart(formula = Job_Growth_Projection ~ Automation_Risk + AI_Adoption_Level + 
##     Company_Size + Industry, data = data, method = "class")
##   n= 500 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.07854985      0 1.0000000 1.0785498 0.03052735
## 2 0.02719033      1 0.9214502 0.9969789 0.03200136
## 3 0.01510574      2 0.8942598 1.0513595 0.03107409
## 4 0.01000000      4 0.8640483 1.0120846 0.03176518
## 
## Variable importance
##        Industry Automation_Risk    Company_Size 
##              57              23              20 
## 
## Node number 1: 500 observations,    complexity param=0.07854985
##   predicted class=Decline  expected loss=0.662  P(node) =1
##     class counts:   169   169   162
##    probabilities: 0.338 0.338 0.324 
##   left son=2 (200 obs) right son=3 (300 obs)
##   Primary splits:
##       Industry          splits as  RLLRRLLRRR, improve=3.3946670, (0 missing)
##       Company_Size      splits as  RRL,        improve=2.0689030, (0 missing)
##       AI_Adoption_Level splits as  LRL,        improve=0.6396945, (0 missing)
##       Automation_Risk   splits as  LRR,        improve=0.4751721, (0 missing)
## 
## Node number 2: 200 observations,    complexity param=0.01510574
##   predicted class=Decline  expected loss=0.58  P(node) =0.4
##     class counts:    84    58    58
##    probabilities: 0.420 0.290 0.290 
##   left son=4 (135 obs) right son=5 (65 obs)
##   Primary splits:
##       Automation_Risk   splits as  RLL,        improve=2.3392590, (0 missing)
##       Industry          splits as  -RL--LR---, improve=1.9892730, (0 missing)
##       AI_Adoption_Level splits as  RRL,        improve=1.1026510, (0 missing)
##       Company_Size      splits as  RLL,        improve=0.8667647, (0 missing)
## 
## Node number 3: 300 observations,    complexity param=0.02719033
##   predicted class=Growth   expected loss=0.63  P(node) =0.6
##     class counts:    85   111   104
##    probabilities: 0.283 0.370 0.347 
##   left son=6 (99 obs) right son=7 (201 obs)
##   Primary splits:
##       Company_Size      splits as  RRL,        improve=1.9668600, (0 missing)
##       AI_Adoption_Level splits as  LRL,        improve=1.3932170, (0 missing)
##       Industry          splits as  L--LL--LRR, improve=1.1959250, (0 missing)
##       Automation_Risk   splits as  LLR,        improve=0.5134799, (0 missing)
## 
## Node number 4: 135 observations
##   predicted class=Decline  expected loss=0.5407407  P(node) =0.27
##     class counts:    62    31    42
##    probabilities: 0.459 0.230 0.311 
## 
## Node number 5: 65 observations,    complexity param=0.01510574
##   predicted class=Growth   expected loss=0.5846154  P(node) =0.13
##     class counts:    22    27    16
##    probabilities: 0.338 0.415 0.246 
##   left son=10 (19 obs) right son=11 (46 obs)
##   Primary splits:
##       Industry          splits as  -RR--LR---, improve=2.35194500, (0 missing)
##       Company_Size      splits as  RLR,        improve=1.14444400, (0 missing)
##       AI_Adoption_Level splits as  RLL,        improve=0.01333333, (0 missing)
## 
## Node number 6: 99 observations
##   predicted class=Growth   expected loss=0.5858586  P(node) =0.198
##     class counts:    33    41    25
##    probabilities: 0.333 0.414 0.253 
## 
## Node number 7: 201 observations
##   predicted class=Stable   expected loss=0.6069652  P(node) =0.402
##     class counts:    52    70    79
##    probabilities: 0.259 0.348 0.393 
## 
## Node number 10: 19 observations
##   predicted class=Decline  expected loss=0.4210526  P(node) =0.038
##     class counts:    11     6     2
##    probabilities: 0.579 0.316 0.105 
## 
## Node number 11: 46 observations
##   predicted class=Growth   expected loss=0.5434783  P(node) =0.092
##     class counts:    11    21    14
##    probabilities: 0.239 0.457 0.304

# Plot the decision tree
library(rpart.plot)
rpart.plot(tree_model, main = "Decision Tree for Job Growth Projection")

Model Evaluation and Comparison

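# In-sample predictions: the model is scored on the same rows it was trained on,
# so these metrics are optimistic compared with the out-of-bag figures printed earlier
rf_predictions <- predict(rf_model, newdata = data)
rf_mse <- mean((rf_predictions - data$Salary_USD)^2)
# R-squared relative to a baseline that always predicts the mean salary
rf_r_squared <- 1 - (sum((rf_predictions - data$Salary_USD)^2) / sum((mean(data$Salary_USD) - data$Salary_USD)^2))

# In-sample classification accuracy of the decision tree
tree_predictions <- predict(tree_model, newdata = data, type = "class")
tree_accuracy <- mean(tree_predictions == data$Job_Growth_Projection)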

rf_mse

## [1] 366712214

rf_r_squared

## [1] 0.1260123

tree_accuracy

## [1] 0.428
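
Because the metrics above are computed on the same rows used for training, they tend to overstate performance. One common alternative is to score the model on a held-out test set. The sketch below illustrates the idea with rpart on a small synthetic data frame; the toy data, the 70/30 split, and the seed are assumptions made for illustration and are not part of the project data.

```r
library(rpart)

set.seed(42)  # arbitrary seed so the split is repeatable

# Synthetic stand-in for the project data (hypothetical values)
n <- 300
toy <- data.frame(
  Company_Size = factor(sample(c("Small", "Medium", "Large"), n, replace = TRUE)),
  Automation_Risk = factor(sample(c("Low", "Medium", "High"), n, replace = TRUE))
)
# The outcome depends on Company_Size so the tree has a pattern to find
growth_prob <- ifelse(toy$Company_Size == "Small", 0.7, 0.3)
toy$Job_Growth_Projection <- factor(ifelse(runif(n) < growth_prob, "Growth", "Decline"))

# 70/30 train/test split
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_set <- toy[train_idx, ]
test_set <- toy[-train_idx, ]

# Fit on the training rows only
fit <- rpart(Job_Growth_Projection ~ Company_Size + Automation_Risk,
             data = train_set, method = "class")

# Accuracy on rows the model has seen versus rows it has not
train_acc <- mean(predict(fit, newdata = train_set, type = "class") == train_set$Job_Growth_Projection)
test_acc <- mean(predict(fit, newdata = test_set, type = "class") == test_set$Job_Growth_Projection)
c(train = train_acc, test = test_acc)
```

The test-set figure is the one to report; a large gap between the two numbers is a sign of overfitting.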

Random Forest Model (Salary Prediction)

Model Quality:

Mean Squared Residuals (MSE): 425,077,527 (out-of-bag). This corresponds to a root mean squared error of about 20,600 USD, indicating that the model does not fit the data well.

% of Variance Explained: -1.31%. A negative value means the out-of-bag predictions are worse than a baseline that predicts the mean salary for every case, so the model is not suitable for explaining salary variation in this dataset. The in-sample R-squared of 0.126 from the evaluation code is higher only because it is computed on the training rows.
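
To make the baseline comparison concrete, the sketch below defines R-squared and RMSE as small base R helpers and applies them to a handful of hypothetical salaries (the four values are made up for illustration). It shows that predicting the mean gives an R-squared of exactly 0 and that predictions worse than the mean give a negative value.

```r
# R-squared relative to a mean-only baseline; negative when the model's
# squared error exceeds that of always predicting the mean
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Root mean squared error, in the same units as the outcome (USD here)
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical salaries, for illustration only
actual <- c(60000, 80000, 100000, 120000)
baseline <- rep(mean(actual), length(actual))  # always predict the mean
worse <- rev(actual)                           # predictions in reversed order

r_squared(actual, baseline)  # 0: no better and no worse than the mean
r_squared(actual, worse)     # -3: worse than the mean
rmse(actual, baseline)

# RMSE implied by the out-of-bag MSE printed by rf_model
sqrt(425077527)
```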

Predictors’ Impact on Salary (Outcome):

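AI_Adoption_Level: %IncMSE = 1.787137. %IncMSE measures how much the out-of-bag error grows when a predictor’s values are randomly permuted; it reflects predictive usefulness, not the direction of the relationship with salary. The small positive value indicates that AI adoption level contributes little to prediction accuracy.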

Company_Size: %IncMSE = -2.591380. A negative value means that permuting company size did not hurt out-of-bag accuracy and even improved it slightly, so the variable carries no usable signal for salary in this model. It does not mean that larger companies pay less.

Automation_Risk: %IncMSE = 6.257217, the highest of the four predictors. Permuting it degrades predictions the most, so it is the most useful predictor here, although with a negative overall variance explained even this signal is weak. The importance score alone does not show whether higher automation risk goes with higher or lower salaries.

Industry: %IncMSE = -2.959176, so industry also adds no predictive value by the permutation measure. Its IncNodePurity is the largest of the four (11,331,896,055), but node-purity importance tends to favor factors with many levels, and Industry has ten, so that figure should be read with caution.
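
%IncMSE is a permutation-based importance: it asks how much the prediction error grows when one predictor’s values are shuffled, and it says nothing about the sign of the relationship. The sketch below shows the idea in simplified form, using a linear model on synthetic data rather than a random forest with out-of-bag samples; the variables x1 and x2, the seed, and the number of repeats are all assumptions for illustration.

```r
set.seed(1)  # arbitrary seed

# Synthetic data: y depends on x1 only; x2 is pure noise
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 * toy$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = toy)
mse <- function(d) mean((d$y - predict(fit, newdata = d))^2)
base_mse <- mse(toy)

# Increase in MSE after shuffling one column, averaged over several shuffles
perm_importance <- function(var, repeats = 20) {
  mean(replicate(repeats, {
    shuffled <- toy
    shuffled[[var]] <- sample(shuffled[[var]])
    mse(shuffled) - base_mse
  }))
}

imp <- sapply(c("x1", "x2"), perm_importance)
imp  # large for x1, near zero (possibly slightly negative) for x2
```

A value near zero or below zero, as for x2 here, indicates a predictor the model does not rely on, which is how the negative %IncMSE values for Company_Size and Industry should be read.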

Decision Tree Model (Job Growth Projection)

Model Quality:

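Cross-Validation Error (xerror): The relative cross-validated error is 1.079 for the root node, 0.997 after one split, 1.051 after two splits, and 1.012 for the final four-split tree. It does not decrease steadily as the tree grows, and the lowest value (0.997) is well within one standard error (xstd ≈ 0.03) of 1, so cross-validation gives little evidence that the tree classifies better than always predicting the most common class.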
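
One way to act on the xerror column is to pick the tree size from the CP table, either at the minimum xerror or by the one-standard-error rule. The sketch below applies both rules to the table values copied from the summary output; the commented-out prune() call at the end assumes the fitted tree_model object is available.

```r
# CP table copied from the summary(tree_model) output
cp_table <- data.frame(
  CP     = c(0.07854985, 0.02719033, 0.01510574, 0.01000000),
  nsplit = c(0, 1, 2, 4),
  xerror = c(1.0785498, 0.9969789, 1.0513595, 1.0120846),
  xstd   = c(0.03052735, 0.03200136, 0.03107409, 0.03176518)
)

# Row with the lowest cross-validated error
best <- which.min(cp_table$xerror)

# One-standard-error rule: the smallest tree whose xerror is within
# one xstd of the minimum
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se <- min(which(cp_table$xerror <= threshold))

cp_table[best, ]
cp_table[one_se, ]

# With the fitted object available, pruning would look like:
# pruned <- prune(tree_model, cp = cp_table$CP[one_se])
```

Both rules select the one-split tree here, which suggests the extra splits in the reported model are not supported by cross-validation.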

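Accuracy: The model classifies job growth projection into three categories: Decline, Growth, and Stable. The classes are nearly balanced (169, 169, and 162 observations), so always predicting the largest class would already be correct 33.8% of the time.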
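
To put the accuracy figure in context, it helps to compare it with a majority-class baseline. The arithmetic below uses only numbers from the summary output: the class counts at the root node and the rel error and xerror of the final tree, both of which rpart reports relative to the root node error.

```r
# Class counts at the root node, from the summary output
counts <- c(Decline = 169, Growth = 169, Stable = 162)

root_error <- 1 - max(counts) / sum(counts)  # error of always predicting one class
baseline_acc <- 1 - root_error

# rel error and xerror of the final tree (4 splits) are multiples of root_error
train_acc <- 1 - 0.8640483 * root_error
cv_acc <- 1 - 1.0120846 * root_error

round(c(baseline = baseline_acc, train = train_acc, cv = cv_acc), 3)
```

The training accuracy reproduces the 0.428 computed in the evaluation code, while the cross-validated estimate comes out at about 0.33, slightly below the baseline.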

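The model’s in-sample accuracy is 42.8%, about nine percentage points above the 33.8% majority-class baseline (169 of 500). Because it is measured on the training data, it likely overstates performance on new data, so there is substantial room for improvement.

Predictors’ Impact on Job Growth Projection (Outcome):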

Industry: Industry is the most important factor (Importance = 57) for predicting job growth projections. It is used for the root split and again at node 5, suggesting that the sector a job belongs to has the largest effect on its projected growth in this tree.

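Automation_Risk: This predictor is the second most important (Importance = 23) and is used to split node 2. A plausible reading is that jobs with higher automation risk are more often classified as “Decline,” possibly because automation threatens job stability in certain sectors, but this depends on which risk levels fall on each side of that split (shown as RLL, in factor-level order) and should be checked against the plotted tree.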

Company_Size: Company size also plays a role (Importance = 20) and is used to split node 3. One company-size level is routed to a node predicted as “Growth” (99 observations), while the other two go to a node predicted as “Stable” (201 observations). The interpretation that smaller companies tend toward growth while larger ones are more stable should be confirmed against the level labels in the plotted tree.

AI_Adoption_Level: This variable does not appear in the variable importance table and is never chosen for a split, indicating that it does not meaningfully influence job growth projections in this model.

The random forest model for salary prediction performs poorly, as reflected in the negative out-of-bag variance explained (-1.31%) and the high MSE. Automation_Risk is the only predictor with clearly positive permutation importance; Company_Size and Industry have negative %IncMSE, meaning they add no predictive value. The decision tree model for job growth projection performs only modestly (42.8% in-sample accuracy against a 33.8% baseline), with Industry being the most important predictor of job growth categories, followed by Automation_Risk and Company_Size.

Oma Tonukari

2024-12-02