Introduction

The dataset shows how AI has affected the labor market and how factors such as experience level, company type, technical skills, and the option of remote work influence salaries.

The aim of this work is to develop and compare two machine learning models, Logistic Regression and Decision Trees (CART), for classifying jobs as high salary or low salary.


1. Load Dataset

# read.csv2 assumes ';' as the field separator and ',' as the decimal mark
jobs <- read.csv2("ai_job_market_dataset.csv")

2. Target Variable

# Split at the median salary: strictly above the median counts as high salary (1)
median_salary <- median(jobs$Salary_USD)

jobs$High_Salary <- ifelse(jobs$Salary_USD > median_salary, 1, 0)

jobs$High_Salary <- as.factor(jobs$High_Salary)

table(jobs$High_Salary)
## 
##    0    1 
## 1000 1000
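
As a sanity check on the median split above, here is a minimal sketch on a made-up salary vector (the values and the `toy_` names are illustrative and not taken from the dataset). It shows that with a strict `>` comparison any salary exactly equal to the median lands in class 0, so the two classes come out balanced only when no ties occur at the median.

```r
# Hypothetical toy salaries; not from ai_job_market_dataset.csv
toy_salary <- c(30000, 45000, 60000, 60000, 90000, 120000)

toy_median <- median(toy_salary)   # mean of the two middle values here
toy_high   <- ifelse(toy_salary > toy_median, 1, 0)

# Ties at the median fall into class 0 because the comparison is strict
toy_counts <- table(factor(toy_high, levels = c(0, 1)))
print(toy_median)
print(toy_counts)
```

In the report itself the table shows an even 1000 / 1000 split, so ties at the median do not appear to distort the target.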

3. Data Preprocessing

jobs$Experience_Level <- as.factor(jobs$Experience_Level)
jobs$Company_Type <- as.factor(jobs$Company_Type)
jobs$Remote <- as.factor(jobs$Remote)
jobs$Top_Skill <- as.factor(jobs$Top_Skill)
jobs$Country <- as.factor(jobs$Country)

str(jobs)
## 'data.frame':    2000 obs. of  9 variables:
##  $ Year            : int  2022 2025 2023 2025 2024 2026 2024 2024 2023 2023 ...
##  $ Job_Title       : chr  "Data Scientist" "Data Scientist" "NLP Engineer" "AI Engineer" ...
##  $ Country         : Factor w/ 5 levels "Canada","Germany",..: 2 5 3 2 1 1 4 4 3 2 ...
##  $ Company_Type    : Factor w/ 3 levels "Big Tech","Freelance",..: 3 3 3 3 1 1 3 2 1 1 ...
##  $ Experience_Level: Factor w/ 3 levels "Entry","Mid",..: 1 1 2 1 1 3 2 2 2 2 ...
##  $ Salary_USD      : int  32058 45821 31292 46264 57624 143241 98685 109912 29756 81208 ...
##  $ Remote          : Factor w/ 2 levels "No","Yes": 2 2 2 1 1 2 1 2 1 1 ...
##  $ Top_Skill       : Factor w/ 5 levels "NLP","Python",..: 1 1 4 2 2 4 4 1 2 4 ...
##  $ High_Salary     : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 2 1 2 ...

4. Train / Test Split

library(caTools)
## Warning: package 'caTools' was built under R version 4.5.3
set.seed(3000)

spl <- sample.split(jobs$High_Salary, SplitRatio = 0.7)

Train <- subset(jobs, spl == TRUE)
Test <- subset(jobs, spl == FALSE)
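
The split above is stratified on High_Salary, which is why the test set later contains 300 jobs of each class. The following base R sketch illustrates the idea of a stratified 70/30 split on stand-in labels; it is an assumption-laden illustration of the concept, not the internal implementation of `sample.split`, and the names `y`, `train_idx`, and `in_train` are hypothetical.

```r
set.seed(3000)

# Hypothetical balanced labels standing in for jobs$High_Salary
y <- factor(rep(c(0, 1), each = 1000))

# Draw 70% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

in_train     <- seq_along(y) %in% train_idx
train_counts <- table(y[in_train])
test_counts  <- table(y[!in_train])
print(train_counts)
print(test_counts)
```

On the real data, `table(Train$High_Salary)` and `table(Test$High_Salary)` would give the equivalent check.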

5. Logistic Regression Model

LogModel <- glm(
  High_Salary ~ Experience_Level +
  Company_Type +
  Remote +
  Top_Skill +
  Country,
  data = Train,
  family = "binomial"
)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(LogModel)
## 
## Call:
## glm(formula = High_Salary ~ Experience_Level + Company_Type + 
##     Remote + Top_Skill + Country, family = "binomial", data = Train)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)   
## (Intercept)             -37.0898  2206.0054  -0.017  0.98659   
## Experience_LevelMid      40.5795  2206.0054   0.018  0.98532   
## Experience_LevelSenior   60.5464  2708.4782   0.022  0.98217   
## Company_TypeFreelance    -0.9001     0.4700  -1.915  0.05548 . 
## Company_TypeStartup      -0.2295     0.4971  -0.462  0.64427   
## RemoteYes                 0.1370     0.3696   0.371  0.71088   
## Top_SkillPython          -1.8328     0.7040  -2.603  0.00923 **
## Top_SkillPyTorch         -1.2011     0.7112  -1.689  0.09123 . 
## Top_SkillSQL             -1.3164     0.7055  -1.866  0.06204 . 
## Top_SkillTensorFlow      -0.8464     0.7469  -1.133  0.25712   
## CountryGermany           -0.6764     0.4600  -1.470  0.14146   
## CountryIndia            -42.9999  2494.3228  -0.017  0.98625   
## CountryUK                -0.2110     0.4641  -0.455  0.64934   
## CountryUSA               18.0676  1497.1503   0.012  0.99037   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1940.81  on 1399  degrees of freedom
## Residual deviance:  197.99  on 1386  degrees of freedom
## AIC: 225.99
## 
## Number of Fisher Scoring iterations: 21
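
The glm warning above, together with the very large standard errors (for example about 2206 on Experience_LevelMid) and the 21 Fisher scoring iterations, is typical of (quasi-)complete separation, where some factor level predicts one class perfectly. A minimal way to look for this is to cross-tabulate each predictor against the target and search for empty cells. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical) to show the pattern.

```r
# Hypothetical toy data in which every "Entry" job is low salary
toy <- data.frame(
  Experience_Level = factor(rep(c("Entry", "Mid", "Senior"), times = c(6, 6, 6))),
  High_Salary      = factor(c(0, 0, 0, 0, 0, 0,   # Entry: all 0
                              0, 0, 1, 1, 1, 1,   # Mid: mixed
                              0, 1, 1, 1, 1, 1))  # Senior: mixed
)

# A zero cell means that level predicts one class perfectly
xt <- table(toy$Experience_Level, toy$High_Salary)
print(xt)

# Levels with at least one empty cell are candidates for separation
separated_levels <- rownames(xt)[apply(xt == 0, 1, any)]
print(separated_levels)
```

Applied to this report, the same check would be `table(Train$Experience_Level, Train$High_Salary)` and `table(Train$Country, Train$High_Salary)`; whether empty cells actually occur there is not shown in the output above.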

6. Logistic Regression Evaluation

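# Predicted probability of class 1 for each test row
LogPred <- predict(LogModel, newdata = Test, type = "response")

# Classify as high salary (1) when the predicted probability exceeds 0.5
LogPredClass <- ifelse(LogPred > 0.5, 1, 0)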

table(Test$High_Salary, LogPredClass)
##    LogPredClass
##       0   1
##   0 286  14
##   1   0 300
log_acc <- mean(LogPredClass == Test$High_Salary)
log_acc
## [1] 0.9766667
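
Accuracy alone hides how the errors are distributed. As a worked example, the sketch below re-enters the confusion matrix printed above and derives sensitivity, specificity, and precision from it (the object name `cm` is only for illustration).

```r
# Confusion matrix reported above: rows = actual class, columns = predicted class
cm <- matrix(c(286,  14,
                 0, 300),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)          # (286 + 300) / 600
sensitivity <- cm["1", "1"] / sum(cm["1", ])    # true positive rate
specificity <- cm["0", "0"] / sum(cm["0", ])    # true negative rate
precision   <- cm["1", "1"] / sum(cm[, "1"])    # positive predictive value

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```

All 14 errors are low-salary jobs predicted as high salary; no high-salary job is missed.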

7. CART Model

library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.5.3
TreeModel <- rpart(
  High_Salary ~ Experience_Level +
  Company_Type +
  Remote +
  Top_Skill +
  Country,
  data = Train,
  method = "class",
  minbucket = 20
)

prp(
  TreeModel,
  type = 2,
  extra = 104,
  fallen.leaves = TRUE,
  shadow.col = "gray",
  branch.lty = 3,
  faclen = 0,
  tweak = 1.2
)


8. CART Evaluation

PredictCART <- predict(TreeModel, newdata = Test, type = "class")

table(Test$High_Salary, PredictCART)
##    PredictCART
##       0   1
##   0 286  14
##   1   0 300
cart_acc <- mean(PredictCART == Test$High_Salary)
cart_acc
## [1] 0.9766667

9. ROC Curve & AUC (CART)

library(ROCR)
## Warning: package 'ROCR' was built under R version 4.5.3
# Without type = "class", predict() returns class probabilities; column 2 is P(High_Salary = 1)
PredictROC <- predict(TreeModel, newdata = Test)

pred <- prediction(PredictROC[,2], Test$High_Salary)

perf <- performance(pred, "tpr", "fpr")

plot(perf)

auc <- as.numeric(performance(pred, "auc")@y.values)
auc
## [1] 0.9766667
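
The AUC above equals the CART test accuracy to seven decimals. One possible explanation, assuming the tree's predicted probabilities separate the test set at a single effective operating point matching the confusion matrix, is that the ROC curve then has only one interior point, and on a balanced test set its area equals the balanced accuracy. The sketch below works through that arithmetic; it is an illustration under that assumption, not a reconstruction of the ROCR computation.

```r
# Operating point taken from the CART confusion matrix above
tpr <- 300 / 300   # true positive rate
fpr <- 14 / 300    # false positive rate

# ROC curve through (0, 0), (fpr, tpr), (1, 1); area by the trapezoid rule
x <- c(0, fpr, 1)
y <- c(0, tpr, 1)
auc_single <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# Balanced accuracy: mean of sensitivity and specificity
balanced_acc <- (tpr + (1 - fpr)) / 2
print(auc_single)
print(balanced_acc)
```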

10. Pruned CART (Cross Validation)

library(caret)
## Warning: package 'caret' was built under R version 4.5.3
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(3000)

train_control <- trainControl(method = "cv", number = 5)

cp_grid <- expand.grid(.cp = seq(0.001, 0.05, 0.002))

cart_cv <- train(
  High_Salary ~ Experience_Level +
  Company_Type +
  Remote +
  Top_Skill +
  Country,
  data = Train,
  method = "rpart",
  trControl = train_control,
  tuneGrid = cp_grid
)

cart_cv
## CART 
## 
## 1400 samples
##    5 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1120, 1120, 1120, 1120, 1120 
## Resampling results across tuning parameters:
## 
##   cp     Accuracy   Kappa    
##   0.001  0.9728571  0.9457143
##   0.003  0.9728571  0.9457143
##   0.005  0.9728571  0.9457143
##   0.007  0.9728571  0.9457143
##   0.009  0.9728571  0.9457143
##   0.011  0.9728571  0.9457143
##   0.013  0.9728571  0.9457143
##   0.015  0.9728571  0.9457143
##   0.017  0.9728571  0.9457143
##   0.019  0.9728571  0.9457143
##   0.021  0.9728571  0.9457143
##   0.023  0.9728571  0.9457143
##   0.025  0.9728571  0.9457143
##   0.027  0.9728571  0.9457143
##   0.029  0.9728571  0.9457143
##   0.031  0.9728571  0.9457143
##   0.033  0.9728571  0.9457143
##   0.035  0.9728571  0.9457143
##   0.037  0.9728571  0.9457143
##   0.039  0.9728571  0.9457143
##   0.041  0.9728571  0.9457143
##   0.043  0.9728571  0.9457143
##   0.045  0.9728571  0.9457143
##   0.047  0.9728571  0.9457143
##   0.049  0.9728571  0.9457143
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.049.
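
The resampling table is flat: every cp in the grid yields the same cross-validated accuracy, so the data do not distinguish between the candidates and the reported cp = 0.049 is one of 25 tied values. The sketch below re-enters the printed results (the data frame `cv_results` is built by hand here, not extracted from `cart_cv`) and counts the ties.

```r
# Resampling results re-entered from the caret output above
cv_results <- data.frame(
  cp       = seq(0.001, 0.05, 0.002),
  Accuracy = rep(0.9728571, 25)
)

# With a flat accuracy profile every cp in the grid is tied for best
tied_cp <- cv_results$cp[cv_results$Accuracy == max(cv_results$Accuracy)]
print(length(tied_cp))   # number of tied candidates
print(range(tied_cp))    # smallest and largest tied cp
```

With the fitted object available, `cart_cv$results` would give the same table directly.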

11. Best CP Model

best_cp <- cart_cv$bestTune$cp
best_cp
## [1] 0.049
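# Refit with the cross-validated cp; minbucket is left at its rpart default here, unlike the first tree
TreeModel_cv <- rpart(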
  High_Salary ~ Experience_Level +
  Company_Type +
  Remote +
  Top_Skill +
  Country,
  data = Train,
  method = "class",
  cp = best_cp
)

prp(TreeModel_cv)


12. Pruned CART Evaluation

PredictCART_cv <- predict(TreeModel_cv, newdata = Test, type = "class")

cart_cv_acc <- mean(PredictCART_cv == Test$High_Salary)
cart_cv_acc
## [1] 0.9766667

13. Model Comparison

results <- data.frame(
  Model = c("Logistic Regression", "CART", "Pruned CART"),
  Accuracy = c(log_acc, cart_acc, cart_cv_acc)
)

results
##                 Model  Accuracy
## 1 Logistic Regression 0.9766667
## 2                CART 0.9766667
## 3         Pruned CART 0.9766667

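Conclusions

The analysis shows that machine learning models can classify AI-sector jobs as high salary or low salary with very high accuracy: Logistic Regression, CART, and the pruned CART all reach a test accuracy of 0.9767 (586 of 600 jobs), and the two confusion matrices shown above are identical.

Although Logistic Regression offers statistical interpretability, Decision Trees (CART) make the decision process more transparent, which makes them better suited to operational use and to interpretation by non-technical users. In this fit, the glm warning and the very large standard errors on the Experience_Level, CountryIndia, and CountryUSA coefficients suggest (quasi-)complete separation, so those coefficients should be interpreted with caution.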