The dataset captures how AI has reshaped the job market and how factors such as experience level, company type, technical skills, and the option of remote work influence salaries.
The goal of this study is to develop and compare two machine learning models, Logistic Regression and Decision Trees (CART), for classifying job positions as high- or low-salary.
# Load the data; read.csv2() expects a semicolon-separated file
jobs <- read.csv2("ai_job_market_dataset.csv")
# Binary target: 1 if the salary exceeds the sample median, 0 otherwise
median_salary <- median(jobs$Salary_USD)
jobs$High_Salary <- ifelse(jobs$Salary_USD > median_salary, 1, 0)
jobs$High_Salary <- as.factor(jobs$High_Salary)
table(jobs$High_Salary)
##
## 0 1
## 1000 1000
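By construction, the median split produces perfectly balanced classes (1,000 observations each), so the no-information baseline is 50% and plain accuracy is a meaningful evaluation metric in what follows.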
# Encode the categorical predictors as factors
jobs$Experience_Level <- as.factor(jobs$Experience_Level)
jobs$Company_Type <- as.factor(jobs$Company_Type)
jobs$Remote <- as.factor(jobs$Remote)
jobs$Top_Skill <- as.factor(jobs$Top_Skill)
jobs$Country <- as.factor(jobs$Country)
str(jobs)
## 'data.frame': 2000 obs. of 9 variables:
## $ Year : int 2022 2025 2023 2025 2024 2026 2024 2024 2023 2023 ...
## $ Job_Title : chr "Data Scientist" "Data Scientist" "NLP Engineer" "AI Engineer" ...
## $ Country : Factor w/ 5 levels "Canada","Germany",..: 2 5 3 2 1 1 4 4 3 2 ...
## $ Company_Type : Factor w/ 3 levels "Big Tech","Freelance",..: 3 3 3 3 1 1 3 2 1 1 ...
## $ Experience_Level: Factor w/ 3 levels "Entry","Mid",..: 1 1 2 1 1 3 2 2 2 2 ...
## $ Salary_USD : int 32058 45821 31292 46264 57624 143241 98685 109912 29756 81208 ...
## $ Remote : Factor w/ 2 levels "No","Yes": 2 2 2 1 1 2 1 2 1 1 ...
## $ Top_Skill : Factor w/ 5 levels "NLP","Python",..: 1 1 4 2 2 4 4 1 2 4 ...
## $ High_Salary : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 2 2 1 2 ...
library(caTools)
set.seed(3000)
# Stratified 70/30 split that preserves the High_Salary class balance
spl <- sample.split(jobs$High_Salary, SplitRatio = 0.7)
Train <- subset(jobs, spl == TRUE)
Test <- subset(jobs, spl == FALSE)
LogModel <- glm(
High_Salary ~ Experience_Level +
Company_Type +
Remote +
Top_Skill +
Country,
data = Train,
family = "binomial"
)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(LogModel)
##
## Call:
## glm(formula = High_Salary ~ Experience_Level + Company_Type +
## Remote + Top_Skill + Country, family = "binomial", data = Train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.0898 2206.0054 -0.017 0.98659
## Experience_LevelMid 40.5795 2206.0054 0.018 0.98532
## Experience_LevelSenior 60.5464 2708.4782 0.022 0.98217
## Company_TypeFreelance -0.9001 0.4700 -1.915 0.05548 .
## Company_TypeStartup -0.2295 0.4971 -0.462 0.64427
## RemoteYes 0.1370 0.3696 0.371 0.71088
## Top_SkillPython -1.8328 0.7040 -2.603 0.00923 **
## Top_SkillPyTorch -1.2011 0.7112 -1.689 0.09123 .
## Top_SkillSQL -1.3164 0.7055 -1.866 0.06204 .
## Top_SkillTensorFlow -0.8464 0.7469 -1.133 0.25712
## CountryGermany -0.6764 0.4600 -1.470 0.14146
## CountryIndia -42.9999 2494.3228 -0.017 0.98625
## CountryUK -0.2110 0.4641 -0.455 0.64934
## CountryUSA 18.0676 1497.1503 0.012 0.99037
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1940.81 on 1399 degrees of freedom
## Residual deviance: 197.99 on 1386 degrees of freedom
## AIC: 225.99
##
## Number of Fisher Scoring iterations: 21
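The glm.fit warning, together with the very large coefficients and standard errors for the Experience_Level and some Country terms (e.g. Experience_LevelMid: 40.6 with a standard error of 2206), points to quasi-complete separation: some predictor combinations identify the class almost perfectly, so the maximum-likelihood estimates for those terms diverge and their p-values are not interpretable, even though the fitted probabilities remain usable for prediction. A standard remedy is a bias-reduced (Firth-type) fit; a minimal sketch, assuming the brglm2 package is installed (it is not used in the original analysis):
# Firth-type bias reduction keeps coefficients finite under separation
library(brglm2)
FirthModel <- glm(
  High_Salary ~ Experience_Level + Company_Type + Remote + Top_Skill + Country,
  data = Train,
  family = binomial,
  method = "brglmFit"  # replaces the default IWLS fitter with bias reduction
)
summary(FirthModel)    # standard errors should now be finite and interpretable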
# Predicted probabilities on the test set; classify as high salary above 0.5
LogPred <- predict(LogModel, newdata = Test, type = "response")
LogPredClass <- ifelse(LogPred > 0.5, 1, 0)
table(Test$High_Salary, LogPredClass)
## LogPredClass
## 0 1
## 0 286 14
## 1 0 300
log_acc <- mean(LogPredClass == Test$High_Salary)
log_acc
## [1] 0.9766667
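Even with a balanced test set, accuracy can hide asymmetric errors, so the confusion matrix is worth reading by class. A short sketch computing sensitivity and specificity, treating class "1" (high salary) as positive; the figures in the comments follow directly from the matrix above:
cm <- table(Test$High_Salary, LogPredClass)
sens <- cm["1", "1"] / sum(cm["1", ])  # true positive rate: 300/300 = 1.000
spec <- cm["0", "0"] / sum(cm["0", ])  # true negative rate: 286/300 ≈ 0.953
c(sensitivity = sens, specificity = spec)
All 14 test-set errors are low-salary positions misclassified as high-salary.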
library(rpart)
library(rpart.plot)
TreeModel <- rpart(
High_Salary ~ Experience_Level +
Company_Type +
Remote +
Top_Skill +
Country,
data = Train,
method = "class",
minbucket = 20  # require at least 20 training observations per leaf
)
prp(
TreeModel,
type = 2,
extra = 104,
fallen.leaves = TRUE,
shadow.col = "gray",
branch.lty = 3,
faclen = 0,
tweak = 1.2
)
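The plot shows which splits the tree actually uses; rpart additionally stores an aggregate importance score per predictor, which can be inspected directly:
# Importance aggregated over primary and surrogate splits
TreeModel$variable.importance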
PredictCART <- predict(TreeModel, newdata = Test, type = "class")
table(Test$High_Salary, PredictCART)
## PredictCART
## 0 1
## 0 286 14
## 1 0 300
cart_acc <- mean(PredictCART == Test$High_Salary)
cart_acc
## [1] 0.9766667
library(ROCR)
# ROC for CART: column 2 of the probability matrix is P(High_Salary = 1)
PredictROC <- predict(TreeModel, newdata = Test)
pred <- prediction(PredictROC[,2], Test$High_Salary)
perf <- performance(pred, "tpr", "fpr")
plot(perf)
auc <- as.numeric(performance(pred, "auc")@y.values)
auc
## [1] 0.9766667
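For a like-for-like comparison with the CART AUC, the same ROCR machinery can be applied to the logistic model's predicted probabilities (a sketch; this run is not part of the original output):
pred_log <- prediction(LogPred, Test$High_Salary)
perf_log <- performance(pred_log, "tpr", "fpr")
plot(perf_log)
as.numeric(performance(pred_log, "auc")@y.values)  # logistic AUC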
library(caret)
set.seed(3000)
# 5-fold cross-validation over a grid of complexity parameter (cp) values
train_control <- trainControl(method = "cv", number = 5)
cp_grid <- expand.grid(.cp = seq(0.001, 0.05, 0.002))
cart_cv <- train(
High_Salary ~ Experience_Level +
Company_Type +
Remote +
Top_Skill +
Country,
data = Train,
method = "rpart",
trControl = train_control,
tuneGrid = cp_grid
)
cart_cv
## CART
##
## 1400 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1120, 1120, 1120, 1120, 1120
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.001 0.9728571 0.9457143
## 0.003 0.9728571 0.9457143
## 0.005 0.9728571 0.9457143
## 0.007 0.9728571 0.9457143
## 0.009 0.9728571 0.9457143
## 0.011 0.9728571 0.9457143
## 0.013 0.9728571 0.9457143
## 0.015 0.9728571 0.9457143
## 0.017 0.9728571 0.9457143
## 0.019 0.9728571 0.9457143
## 0.021 0.9728571 0.9457143
## 0.023 0.9728571 0.9457143
## 0.025 0.9728571 0.9457143
## 0.027 0.9728571 0.9457143
## 0.029 0.9728571 0.9457143
## 0.031 0.9728571 0.9457143
## 0.033 0.9728571 0.9457143
## 0.035 0.9728571 0.9457143
## 0.037 0.9728571 0.9457143
## 0.039 0.9728571 0.9457143
## 0.041 0.9728571 0.9457143
## 0.043 0.9728571 0.9457143
## 0.045 0.9728571 0.9457143
## 0.047 0.9728571 0.9457143
## 0.049 0.9728571 0.9457143
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.049.
best_cp <- cart_cv$bestTune$cp
best_cp
## [1] 0.049
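Cross-validated accuracy is identical across the entire cp grid, so the tie-breaking simply keeps the largest value, cp = 0.049, which yields the smallest tree. This indicates the dominant splits are so strong that pruning cannot alter the predictions; the complexity table of the unpruned tree shows where the error actually drops:
# Cross-validated error at each pruning level of the original tree
printcp(TreeModel)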
TreeModel_cv <- rpart(
High_Salary ~ Experience_Level +
Company_Type +
Remote +
Top_Skill +
Country,
data = Train,
method = "class",
cp = best_cp
)
prp(TreeModel_cv)
PredictCART_cv <- predict(TreeModel_cv, newdata = Test, type = "class")
cart_cv_acc <- mean(PredictCART_cv == Test$High_Salary)
cart_cv_acc
## [1] 0.9766667
results <- data.frame(
Model = c("Logistic Regression", "CART", "Pruned CART"),
Accuracy = c(log_acc, cart_acc, cart_cv_acc)
)
results
## Model Accuracy
## 1 Logistic Regression 0.9766667
## 2 CART 0.9766667
## 3 Pruned CART 0.9766667
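Since all three models report the same accuracy, it is natural to ask whether they also make the same individual predictions. A quick cross-tabulation of the test-set predictions (a sketch, not part of the original run) answers this; any off-diagonal counts would mark cases where the models disagree:
table(Logistic = LogPredClass, CART = PredictCART)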
The analysis shows that machine learning models can classify AI-sector job positions as high- or low-salary with very high accuracy: all three models reach the same test-set accuracy of 97.7%.
While Logistic Regression offers statistical interpretability through its coefficients, Decision Trees (CART) make the decision process itself transparent, which makes them better suited for operational use and for interpretation by non-technical users.