The goal of this assignment is to fit simple and multiple linear
regression models to the AI Job Dataset, which contains
salary information for the artificial intelligence sector.
The approach is inspired by Cvičenie 5 (Exercise 5).
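# Packages used below (assumed; the original does not show them being loaded)
library(dplyr); library(ggplot2); library(ggcorrplot); library(broom)
# Load the data
df <- read.csv("C:/Users/Lenovo/Documents/School/Mgr/R - Ekonometria/Cvicenia/My dataset/ai_job_dataset.csv")
# Overview of the data
glimpse(df)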
## Rows: 15,000
## Columns: 19
## $ job_id <chr> "AI00001", "AI00002", "AI00003", "AI00004", "AI…
## $ job_title <chr> "AI Research Scientist", "AI Software Engineer"…
## $ salary_usd <int> 90376, 61895, 152626, 80215, 54624, 123574, 796…
## $ salary_currency <chr> "USD", "USD", "USD", "USD", "EUR", "EUR", "GBP"…
## $ experience_level <chr> "SE", "EN", "MI", "SE", "EN", "SE", "MI", "EN",…
## $ employment_type <chr> "CT", "CT", "FL", "FL", "PT", "CT", "FL", "FL",…
## $ company_location <chr> "China", "Canada", "Switzerland", "India", "Fra…
## $ company_size <chr> "M", "M", "L", "M", "S", "M", "S", "L", "L", "M…
## $ employee_residence <chr> "China", "Ireland", "South Korea", "India", "Si…
## $ remote_ratio <int> 50, 100, 0, 50, 100, 50, 0, 0, 0, 0, 100, 0, 10…
## $ required_skills <chr> "Tableau, PyTorch, Kubernetes, Linux, NLP", "De…
## $ education_required <chr> "Bachelor", "Master", "Associate", "PhD", "Mast…
## $ years_experience <int> 9, 1, 2, 7, 0, 7, 3, 0, 7, 5, 8, 15, 5, 0, 6, 0…
## $ industry <chr> "Automotive", "Media", "Education", "Consulting…
## $ posting_date <chr> "2024-10-18", "2024-11-20", "2025-03-18", "2024…
## $ application_deadline <chr> "2024-11-07", "2025-01-11", "2025-04-07", "2025…
## $ job_description_length <int> 1076, 1268, 1974, 1345, 1989, 819, 1936, 1286, …
## $ benefits_score <dbl> 5.9, 5.2, 9.4, 8.6, 6.6, 5.9, 6.3, 7.6, 9.3, 5.…
## $ company_name <chr> "Smart Analytics", "TechCorp Inc", "Autonomous …
# Clean up the column names
names(df) <- make.names(names(df))
# Basic summary statistics
summary(df)
## job_id job_title salary_usd salary_currency
## Length:15000 Length:15000 Min. : 32519 Length:15000
## Class :character Class :character 1st Qu.: 70180 Class :character
## Mode :character Mode :character Median : 99705 Mode :character
## Mean :115349
## 3rd Qu.:146409
## Max. :399095
## experience_level employment_type company_location company_size
## Length:15000 Length:15000 Length:15000 Length:15000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## employee_residence remote_ratio required_skills education_required
## Length:15000 Min. : 0.00 Length:15000 Length:15000
## Class :character 1st Qu.: 0.00 Class :character Class :character
## Mode :character Median : 50.00 Mode :character Mode :character
## Mean : 49.48
## 3rd Qu.:100.00
## Max. :100.00
## years_experience industry posting_date application_deadline
## Min. : 0.000 Length:15000 Length:15000 Length:15000
## 1st Qu.: 2.000 Class :character Class :character Class :character
## Median : 5.000 Mode :character Mode :character Mode :character
## Mean : 6.253
## 3rd Qu.:10.000
## Max. :19.000
## job_description_length benefits_score company_name
## Min. : 500 Min. : 5.000 Length:15000
## 1st Qu.:1004 1st Qu.: 6.200 Class :character
## Median :1512 Median : 7.500 Mode :character
## Mean :1503 Mean : 7.504
## 3rd Qu.:2000 3rd Qu.: 8.800
## Max. :2499 Max. :10.000
Next, we look at the relationships between the numeric variables in the dataset.
# Keep only the numeric columns and compute pairwise Pearson correlations
num_cols <- df %>% select(where(is.numeric))
cor_matrix <- cor(num_cols, use = "complete.obs")
# Correlation heatmap
ggcorrplot(cor_matrix, lab = TRUE, title = "Correlation matrix of the numeric variables")
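Interpretation:
Correlation coefficients closer to 1 in absolute value indicate a stronger
linear relationship between two variables. For example, a strong positive
correlation between salary_usd and years_experience suggests that experience
is closely associated with salary, although correlation alone does not
show that experience causes higher pay.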
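To examine one pair from the matrix more formally, `cor.test()` reports the Pearson coefficient together with a confidence interval and a p-value. The following is a minimal sketch on simulated data: the data frame `sim` and the parameters used to generate it are made up for illustration and are not taken from the AI Job Dataset.

```r
# Simulated stand-in data; NOT the AI Job Dataset.
set.seed(42)
n <- 500
sim <- data.frame(years_experience = sample(0:19, n, replace = TRUE))
sim$salary_usd <- 65000 + 8000 * sim$years_experience + rnorm(n, sd = 40000)

# Pearson correlation with a 95% confidence interval and p-value
ct <- cor.test(sim$salary_usd, sim$years_experience, method = "pearson")
ct$estimate   # sample correlation r
ct$conf.int   # 95% confidence interval for r
ct$p.value    # test of H0: true correlation is zero

# In a simple regression with one predictor, R-squared equals r squared
fit <- lm(salary_usd ~ years_experience, data = sim)
all.equal(unname(ct$estimate)^2, summary(fit)$r.squared)
```

The last line illustrates why the heatmap value for a pair of variables and the R² of the corresponding one-predictor regression carry the same information.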
We examine how years of experience relate to salary in USD.
model1 <- lm(salary_usd ~ years_experience, data = df)
summary(model1)
##
## Call:
## lm(formula = salary_usd ~ years_experience, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -129178 -24881 -5950 18845 253718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65233.53 500.78 130.3 <2e-16 ***
## years_experience 8014.37 59.92 133.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40690 on 14998 degrees of freedom
## Multiple R-squared: 0.544, Adjusted R-squared: 0.544
## F-statistic: 1.789e+04 on 1 and 14998 DF, p-value: < 2.2e-16
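The coefficient on years_experience shows how much the average salary
changes for each additional year of experience: here about 8,014 USD per
year, starting from about 65,234 USD at zero years of experience. The model
explains roughly 54% of the variance in salary (R² = 0.544).
# Scatter plot of salary against experience with the fitted regression line
ggplot(df, aes(x = years_experience, y = salary_usd)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  geom_smooth(method = "lm", color = "red") +
  labs(
    title = "Salary as a function of years of experience",
    x = "Years of experience",
    y = "Salary (USD)"
  ) +
  theme_minimal()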
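Written out with the estimates reported in the summary(model1) output, the fitted line and a worked prediction look as follows. The value of 5 years is chosen only as an example (it is the sample median of years_experience):

```latex
\[
\widehat{\text{salary\_usd}} = 65233.53 + 8014.37 \cdot \text{years\_experience}
\]
\[
\widehat{\text{salary\_usd}}\,\Big|_{\text{years\_experience}=5}
  = 65233.53 + 8014.37 \cdot 5
  = 65233.53 + 40071.85
  = 105305.38
\]
```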
We add further variables to the model, for example remote_ratio and benefits_score.
model2 <- lm(salary_usd ~ years_experience + remote_ratio + benefits_score, data = df)
summary(model2)
##
## Call:
## lm(formula = salary_usd ~ years_experience + remote_ratio + benefits_score,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128846 -24823 -5999 18829 253315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63079.996 1835.080 34.375 <2e-16 ***
## years_experience 8014.467 59.926 133.739 <2e-16 ***
## remote_ratio 3.554 8.143 0.436 0.663
## benefits_score 263.459 229.034 1.150 0.250
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40700 on 14996 degrees of freedom
## Multiple R-squared: 0.544, Adjusted R-squared: 0.5439
## F-statistic: 5964 on 3 and 14996 DF, p-value: < 2.2e-16
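The remote_ratio coefficient is positive (3.554) but not statistically
significant (p = 0.663), so the model gives no evidence that the share of
remote work is associated with salary once experience is accounted for. The
same holds for benefits_score (p = 0.250). A negative and significant
remote_ratio coefficient would instead have meant that a larger share of
remote work is associated with a slightly lower salary on average.
glance(model1)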
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.544 0.544 40695. 17892. 0 1 -180491. 360988. 3.61e5
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
glance(model2)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.544 0.544 40695. 5964. 0 3 -180490. 360990. 3.61e5
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
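A higher adjusted R² for model2 would mean that the added variables improved the model's explanatory power. That is not the case here: the adjusted R² is essentially unchanged (0.5439 for model2 versus 0.544 for model1 in the summaries above) and the AIC is slightly higher (360990 versus 360988), so remote_ratio and benefits_score add no explanatory power beyond years_experience.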
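A more formal way to compare nested models of this kind is a partial F-test via `anova()`, read alongside AIC and adjusted R². The sketch below runs on simulated data in which, by construction, only years_experience affects salary. The data frame `sim`, the models `m1` and `m2`, and all generating parameters are assumptions for illustration, not the AI Job Dataset.

```r
# Simulated stand-in data; NOT the AI Job Dataset.
set.seed(1)
n <- 1000
sim <- data.frame(
  years_experience = sample(0:19, n, replace = TRUE),
  remote_ratio     = sample(c(0, 50, 100), n, replace = TRUE),
  benefits_score   = round(runif(n, 5, 10), 1)
)
# Only years_experience drives salary in this simulation
sim$salary_usd <- 65000 + 8000 * sim$years_experience + rnorm(n, sd = 40000)

m1 <- lm(salary_usd ~ years_experience, data = sim)
m2 <- lm(salary_usd ~ years_experience + remote_ratio + benefits_score, data = sim)

# Partial F-test; H0: both added coefficients are zero
cmp <- anova(m1, m2)
print(cmp)

# Information criterion penalises the extra parameters; lower is better
AIC(m1, m2)

# Adjusted R-squared of both models side by side
c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
```

A large p-value in the `anova()` table, together with a higher AIC for the larger model, points towards keeping the simpler model.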
# Fitted values from the multiple regression model
df$predicted_salary <- predict(model2)
# Actual vs. predicted salaries; points on the red 45-degree line are predicted exactly
ggplot(df, aes(x = salary_usd, y = predicted_salary)) +
  geom_point(alpha = 0.6, color = "darkgreen") +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(
    title = "Actual vs. predicted salaries",
    x = "Actual salary (USD)",
    y = "Predicted salary (USD)"
  ) +
  theme_minimal()