V tejto časti si precvičíme prácu s maticami na niekoľkých praktických úlohách.
Vytvoríme dve matice 3x3 a vypočítame ich súčet, rozdiel a súčin.
# Vytvorenie dvoch matíc 3x3
M1 <- matrix(c(2, 5, 7, 4, 8, 6, 1, 3, 9), nrow = 3, byrow = TRUE)
M2 <- matrix(c(9, 2, 4, 7, 5, 1, 6, 8, 3), nrow = 3, byrow = TRUE)
M1
## [,1] [,2] [,3]
## [1,] 2 5 7
## [2,] 4 8 6
## [3,] 1 3 9
M2
## [,1] [,2] [,3]
## [1,] 9 2 4
## [2,] 7 5 1
## [3,] 6 8 3
# Súčet matíc
sum_mat <- M1 + M2
sum_mat
## [,1] [,2] [,3]
## [1,] 11 7 11
## [2,] 11 13 7
## [3,] 7 11 12
# Rozdiel matíc
diff_mat <- M1 - M2
diff_mat
## [,1] [,2] [,3]
## [1,] -7 3 3
## [2,] -3 3 5
## [3,] -5 -5 6
# Matematické násobenie matíc
prod_mat <- M1 %*% M2
prod_mat
## [,1] [,2] [,3]
## [1,] 95 85 34
## [2,] 128 96 42
## [3,] 84 89 34
➡️ Komentár: Vidíme, že výsledné matice majú rovnaké
rozmery. Násobenie %*% vykonáva klasický maticový
súčin.
Zistíme determinant matice M1 a pokúsime sa vypočítať
jej inverznú maticu.
# Výpočet determinantov
det_M1 <- det(M1)
det_M1
## [1] -14
# Ak determinant nie je nulový, môžeme nájsť inverznú maticu
if (det_M1 != 0) {
inv_M1 <- solve(M1)
inv_M1
} else {
print("Matica nie je invertovateľná (determinant = 0).")
}
## [,1] [,2] [,3]
## [1,] -3.8571429 1.71428571 1.8571429
## [2,] 2.1428571 -0.78571429 -1.1428571
## [3,] -0.2857143 0.07142857 0.2857143
➡️ Komentár: Ak je determinant rôzny od nuly, matica je regulárna a môžeme ju invertovať. V opačnom prípade nie.
Spojíme dve matice po stĺpcoch a vypíšeme rozmery novej matice.
# Spojenie matíc po stĺpcoch
combined <- cbind(M1, M2)
combined
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 2 5 7 9 2 4
## [2,] 4 8 6 7 5 1
## [3,] 1 3 9 6 8 3
# Zistenie rozmerov
dim(combined)
## [1] 3 6
➡️ Komentár: Nová matica má rovnaký počet riadkov ako pôvodné matice, ale dvojnásobný počet stĺpcov.
Vytvoríme maticu s náhodnými číslami a zistíme, kde sa nachádza jej najväčšia hodnota.
# Generovanie náhodnej matice 4x4 s hodnotami od 0 do 20
set.seed(123) # pre reprodukovateľnosť
R <- matrix(runif(16, 0, 20), nrow = 4)
R
## [,1] [,2] [,3] [,4]
## [1,] 5.751550 18.80935 11.028700 13.551413
## [2,] 15.766103 0.91113 9.132295 11.452668
## [3,] 8.179538 10.56211 19.136667 2.058494
## [4,] 17.660348 17.84838 9.066683 17.996499
# Nájdeme najväčšiu hodnotu
max_val <- max(R)
max_val
## [1] 19.13667
# Zistíme pozíciu tejto hodnoty
max_index <- which(R == max_val, arr.ind = TRUE)
max_index
## row col
## [1,] 3 3
➡️ Komentár: Funkcia which(..., arr.ind = TRUE)
vráti pozíciu (riadok, stĺpec), kde sa daná hodnota nachádza.
V tejto časti sme si ukázali: - ako vytvárať a upravovať matice, - základné operácie a štatistiky, - prácu s transpozíciou, inverziou a determinantom.
Matice tvoria základ pre lineárnu algebru a sú veľmi dôležité pri dátovej analýze a modelovaní.
# Priradenie konštanty do premennej
a <- 20
b <- 5.5
# Arithmetic
sum_ab <- a + b # sucet
diff_ab <- a - b # rozdiel
prod_ab <- a * b # násobenie
quot_ab <- a / b # delenie
power_ab <- a ^ b # umocňovanie
mod_ab <- a %% 3 # zbytok po delení tromi (tzv modulo)
# Rounding
round_b <- round(b) # zaokruhlovanie smerom k najblizsiemu celemu cislu
ceil_b <- ceiling(b) # najblizsie vyssie cele cislo
floor_b <- floor(b) # najblizsie nizsie cele cislo
a; b
## [1] 20
## [1] 5.5
sum_ab; diff_ab; prod_ab; quot_ab; power_ab; mod_ab;
## [1] 25.5
## [1] 14.5
## [1] 110
## [1] 3.636364
## [1] 14310835
## [1] 2
round_b; ceil_b; floor_b
## [1] 6
## [1] 6
## [1] 5
Poznámky
^ operátor umocňovania.%% je modulo, teda zbytok po delení,round(x, digits = 0) zaokrúhľovanie na určitý počet
desatinných miest (digits=). ak digits = 0, potom ide o celočíselné
zaokrúhľovanieknitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(dplyr)
# Import CSV súboru
nhl <- read_csv("nhlplayoffs.csv")
# Zobrazenie prvých riadkov a názvov stĺpcov
head(nhl)
## # A tibble: 6 × 13
## rank team year games wins losses ties shootout_wins shootout_losses
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Colorado A… 2022 20 16 4 0 5 1
## 2 2 Tampa Bay … 2022 23 14 9 0 1 2
## 3 3 New York R… 2022 20 10 10 0 1 2
## 4 4 Edmonton O… 2022 16 8 8 0 1 2
## 5 5 Carolina H… 2022 14 7 7 0 1 0
## 6 6 St. Louis … 2022 12 6 6 0 1 1
## # ℹ 4 more variables: win_loss_percentage <dbl>, goals_scored <dbl>,
## # goals_against <dbl>, goal_differential <dbl>
colnames(nhl)
## [1] "rank" "team" "year"
## [4] "games" "wins" "losses"
## [7] "ties" "shootout_wins" "shootout_losses"
## [10] "win_loss_percentage" "goals_scored" "goals_against"
## [13] "goal_differential"
nhl <- nhl %>%
rename(
Year = year,
Team = team,
Wins = wins,
Losses = losses,
GoalsAgainst = goals_against,
) %>%
mutate(
) %>%
filter(!is.na(Year))
summary(nhl)
## rank Team Year games
## Min. : 1.000 Length:1009 Min. :1918 Min. : 2.000
## 1st Qu.: 3.000 Class :character 1st Qu.:1972 1st Qu.: 5.000
## Median : 6.000 Mode :character Median :1990 Median : 7.000
## Mean : 7.067 Mean :1986 Mean : 9.364
## 3rd Qu.:11.000 3rd Qu.:2007 3rd Qu.:12.000
## Max. :24.000 Max. :2022 Max. :27.000
## Wins Losses ties shootout_wins
## Min. : 0.000 Min. : 0.000 Min. :0.00000 Min. : 0.0000
## 1st Qu.: 1.000 1st Qu.: 4.000 1st Qu.:0.00000 1st Qu.: 0.0000
## Median : 3.000 Median : 4.000 Median :0.00000 Median : 1.0000
## Mean : 4.657 Mean : 4.657 Mean :0.04955 Mean : 0.9326
## 3rd Qu.: 7.000 3rd Qu.: 6.000 3rd Qu.:0.00000 3rd Qu.: 1.0000
## Max. :18.000 Max. :12.000 Max. :4.00000 Max. :10.0000
## shootout_losses win_loss_percentage goals_scored GoalsAgainst
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:0.3330 1st Qu.:11.00 1st Qu.:16.00
## Median :1.0000 Median :0.4290 Median :20.00 Median :22.00
## Mean :0.9326 Mean :0.4112 Mean :26.63 Mean :26.63
## 3rd Qu.:1.0000 3rd Qu.:0.5450 3rd Qu.:37.00 3rd Qu.:35.00
## Max. :4.0000 Max. :1.0000 Max. :98.00 Max. :91.00
## goal_differential
## Min. :-27
## 1st Qu.: -6
## Median : -2
## Mean : 0
## 3rd Qu.: 3
## Max. : 49
library(ggplot2)
ggplot(nhl, aes(x = goals_scored, y = win_loss_percentage)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_smooth(method = "lm", color = "red", se = FALSE) +
theme_minimal() +
labs(
title = "Vzťah medzi počtom gólov a výherným percentom",
x = "Počet strelených gólov",
y = "Percento výhier (%)"
)
ggplot(nhl, aes(x = factor(Year), y = Wins)) +
geom_boxplot(fill = "lightblue", color = "darkblue") +
theme_minimal() +
labs(
title = "Rozdelenie počtu výhier v play-off podľa rokov",
x = "Rok",
y = "Počet výhier"
)
nhl.trend <- nhl %>%
group_by(Year) %>%
summarise(AvgWinPercent = mean(win_loss_percentage, na.rm = TRUE))
ggplot(nhl.trend, aes(x = Year, y = AvgWinPercent)) +
geom_line(color = "darkred", size = 1) +
geom_point(color = "black") +
theme_minimal() +
labs(
title = "Priemerné výherné percento tímov NHL (2006–2022)",
x = "Rok",
y = "Priemerné % výhier"
)
# Nainštaluj balíčky, ak ešte nie sú nainštalované
needed <- c("tidyverse", "broom", "lmtest", "car", "tseries", "sandwich", "ggplot2")
new <- needed[!(needed %in% installed.packages()[, "Package"]) ]
if(length(new)) install.packages(new)
library(tidyverse)
library(broom)
library(lmtest)
library(car)
library(tseries)
library(sandwich)
library(ggplot2)
# Uprav cestu podľa potreby. Tu sa predpokladá "/mnt/data/nhlplayoffs.csv".
nhl <- read_csv("nhlplayoffs.csv")
# Zobrazenie základných informácií
glimpse(nhl)
## Rows: 1,009
## Columns: 13
## $ rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ team <chr> "Colorado Avalanche", "Tampa Bay Lightning", "New …
## $ year <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 20…
## $ games <dbl> 20, 23, 20, 16, 14, 12, 12, 10, 7, 7, 7, 7, 7, 6, …
## $ wins <dbl> 16, 14, 10, 8, 7, 6, 5, 4, 3, 3, 3, 3, 3, 2, 2, 0,…
## $ losses <dbl> 4, 9, 10, 8, 7, 6, 7, 6, 4, 4, 4, 4, 4, 4, 4, 4, 7…
## $ ties <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ shootout_wins <dbl> 5, 1, 1, 1, 1, 1, 1, 2, 0, 1, 0, 0, 1, 0, 0, 0, 0,…
## $ shootout_losses <dbl> 1, 2, 2, 2, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 2, 1, 4,…
## $ win_loss_percentage <dbl> 0.800, 0.609, 0.500, 0.500, 0.500, 0.500, 0.417, 0…
## $ goals_scored <dbl> 85, 67, 62, 65, 37, 40, 35, 23, 20, 17, 24, 14, 29…
## $ goals_against <dbl> 55, 61, 58, 59, 40, 38, 39, 32, 24, 27, 23, 15, 28…
## $ goal_differential <dbl> 30, 6, 4, 6, -3, 2, -4, -9, -4, -10, 1, -1, 1, -6,…
summary(nhl)
## rank team year games
## Min. : 1.000 Length:1009 Min. :1918 Min. : 2.000
## 1st Qu.: 3.000 Class :character 1st Qu.:1972 1st Qu.: 5.000
## Median : 6.000 Mode :character Median :1990 Median : 7.000
## Mean : 7.067 Mean :1986 Mean : 9.364
## 3rd Qu.:11.000 3rd Qu.:2007 3rd Qu.:12.000
## Max. :24.000 Max. :2022 Max. :27.000
## wins losses ties shootout_wins
## Min. : 0.000 Min. : 0.000 Min. :0.00000 Min. : 0.0000
## 1st Qu.: 1.000 1st Qu.: 4.000 1st Qu.:0.00000 1st Qu.: 0.0000
## Median : 3.000 Median : 4.000 Median :0.00000 Median : 1.0000
## Mean : 4.657 Mean : 4.657 Mean :0.04955 Mean : 0.9326
## 3rd Qu.: 7.000 3rd Qu.: 6.000 3rd Qu.:0.00000 3rd Qu.: 1.0000
## Max. :18.000 Max. :12.000 Max. :4.00000 Max. :10.0000
## shootout_losses win_loss_percentage goals_scored goals_against
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:0.3330 1st Qu.:11.00 1st Qu.:16.00
## Median :1.0000 Median :0.4290 Median :20.00 Median :22.00
## Mean :0.9326 Mean :0.4112 Mean :26.63 Mean :26.63
## 3rd Qu.:1.0000 3rd Qu.:0.5450 3rd Qu.:37.00 3rd Qu.:35.00
## Max. :4.0000 Max. :1.0000 Max. :98.00 Max. :91.00
## goal_differential
## Min. :-27
## 1st Qu.: -6
## Median : -2
## Mean : 0
## 3rd Qu.: 3
## Max. : 49
# Prvých 10 riadkov
head(nhl, 10)
## # A tibble: 10 × 13
## rank team year games wins losses ties shootout_wins shootout_losses
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Colorado … 2022 20 16 4 0 5 1
## 2 2 Tampa Bay… 2022 23 14 9 0 1 2
## 3 3 New York … 2022 20 10 10 0 1 2
## 4 4 Edmonton … 2022 16 8 8 0 1 2
## 5 5 Carolina … 2022 14 7 7 0 1 0
## 6 6 St. Louis… 2022 12 6 6 0 1 1
## 7 7 Calgary F… 2022 12 5 7 0 1 1
## 8 8 Florida P… 2022 10 4 6 0 2 0
## 9 9 Boston Br… 2022 7 3 4 0 0 0
## 10 10 Los Angel… 2022 7 3 4 0 1 0
## # ℹ 4 more variables: win_loss_percentage <dbl>, goals_scored <dbl>,
## # goals_against <dbl>, goal_differential <dbl>
# Vyberieme len premenné relevantné pre model
nhl_model <- nhl %>%
select(win_loss_percentage, goals_scored, goals_against, games, team, year) %>%
mutate(across(c(win_loss_percentage, goals_scored, goals_against, games), as.numeric))
# Kontrola NA
nhl_model %>% summarise(across(everything(), ~ sum(is.na(.))))
## # A tibble: 1 × 6
## win_loss_percentage goals_scored goals_against games team year
## <int> <int> <int> <int> <int> <int>
## 1 0 0 0 0 0 0
# Ak by boli NA, jednoduchá imputácia priemerom (tu len ako postup)
if(any(is.na(nhl_model))) {
nhl_model <- nhl_model %>%
mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
}
# Základné štatistiky
nhl_model %>% select(win_loss_percentage, goals_scored, goals_against, games) %>% summary()
## win_loss_percentage goals_scored goals_against games
## Min. :0.0000 Min. : 0.00 Min. : 0.00 Min. : 2.000
## 1st Qu.:0.3330 1st Qu.:11.00 1st Qu.:16.00 1st Qu.: 5.000
## Median :0.4290 Median :20.00 Median :22.00 Median : 7.000
## Mean :0.4112 Mean :26.63 Mean :26.63 Mean : 9.364
## 3rd Qu.:0.5450 3rd Qu.:37.00 3rd Qu.:35.00 3rd Qu.:12.000
## Max. :1.0000 Max. :98.00 Max. :91.00 Max. :27.000
# Korelačná matica
cor_mat <- nhl_model %>% select(win_loss_percentage, goals_scored, goals_against, games) %>% cor()
cor_mat
## win_loss_percentage goals_scored goals_against games
## win_loss_percentage 1.0000000 0.7009174 0.5133309 0.6829670
## goals_scored 0.7009174 1.0000000 0.9088340 0.9401903
## goals_against 0.5133309 0.9088340 1.0000000 0.8948069
## games 0.6829670 0.9401903 0.8948069 1.0000000
# Heatmap korelácií
library(ggplot2)
cor_df <- as.data.frame(as.table(cor_mat))
ggplot(cor_df, aes(Var1, Var2, fill = Freq)) +
geom_tile() +
geom_text(aes(label = round(Freq, 2))) +
labs(title = "Korelačná matica") +
theme_minimal()
model <- lm(win_loss_percentage ~ goals_scored + goals_against + games, data = nhl_model)
summary(model)
##
## Call:
## lm(formula = win_loss_percentage ~ goals_scored + goals_against +
## games, data = nhl_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30101 -0.06681 0.01477 0.07457 0.65960
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2767744 0.0095021 29.13 < 2e-16 ***
## goals_scored 0.0100937 0.0006642 15.20 < 2e-16 ***
## goals_against -0.0113519 0.0006818 -16.65 < 2e-16 ***
## games 0.0179319 0.0022057 8.13 1.26e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1325 on 1005 degrees of freedom
## Multiple R-squared: 0.6052, Adjusted R-squared: 0.604
## F-statistic: 513.5 on 3 and 1005 DF, p-value: < 2.2e-16
# Harmonizovaný tidy výstup
tidy(model)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.277 0.00950 29.1 1.00e-135
## 2 goals_scored 0.0101 0.000664 15.2 4.15e- 47
## 3 goals_against -0.0114 0.000682 -16.7 3.65e- 55
## 4 games 0.0179 0.00221 8.13 1.26e- 15
glance(model)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.605 0.604 0.132 513. 3.12e-202 3 610. -1210. -1185.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Rezidua a fitted hodnoty
nhl_model <- nhl_model %>%
mutate(fitted = fitted(model), residuals = residuals(model), std_resid = rstandard(model))
# Graf: rezidua vs fitted
ggplot(nhl_model, aes(x = fitted, y = residuals)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Rezidua vs Fitted hodnoty", x = "Fitted", y = "Rezidua") +
theme_minimal()
# Histogram rezíduí + normálna krivka
ggplot(nhl_model, aes(x = residuals)) +
geom_histogram(aes(y = ..density..), bins = 30) +
stat_function(fun = dnorm, args = list(mean = mean(nhl_model$residuals), sd = sd(nhl_model$residuals))) +
labs(title = "Histogram rezíduí s normálnou krivkou") +
theme_minimal()
# QQ-plot
qqPlot(model, main = "QQ-plot rezíduí")
## [1] 996 998
# Jarque-Bera test (normálnosť)
jarque.test <- tseries::jarque.bera.test(nhl_model$residuals)
jarque.test
##
## Jarque Bera Test
##
## data: nhl_model$residuals
## X-squared = 162.83, df = 2, p-value < 2.2e-16
# Breusch-Pagan test
bptest_res <- bptest(model)
bptest_res
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 72.531, df = 3, p-value = 1.225e-15
# White test (approx pomocou bptest na kvadratické termíny)
# Pridáme kvadratické a interakčné termíny pre aproximáciu White testu
model_white <- lm(residuals(model)^2 ~ fitted(model) + I(fitted(model)^2), data = nhl_model)
summary(model_white)
##
## Call:
## lm(formula = residuals(model)^2 ~ fitted(model) + I(fitted(model)^2),
## data = nhl_model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.03452 -0.01482 -0.00935 0.00232 0.41837
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.064579 0.007487 8.625 < 2e-16 ***
## fitted(model) -0.206015 0.032159 -6.406 2.29e-10 ***
## I(fitted(model)^2) 0.192016 0.030854 6.223 7.13e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03406 on 1006 degrees of freedom
## Multiple R-squared: 0.03925, Adjusted R-squared: 0.03734
## F-statistic: 20.55 on 2 and 1006 DF, p-value: 1.795e-09
# Durbin-Watson test
dw_res <- lmtest::dwtest(model)
dw_res
##
## Durbin-Watson test
##
## data: model
## DW = 1.4977, p-value = 5.168e-16
## alternative hypothesis: true autocorrelation is greater than 0
vifs <- car::vif(model)
vifs
## goals_scored goals_against games
## 10.735066 6.249835 9.372463
nhl <- read.csv("nhlplayoffs.csv")
cat("Rozmery datasetu:", dim(nhl), "\n")
## Rozmery datasetu: 1009 13
str(nhl)
## 'data.frame': 1009 obs. of 13 variables:
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ team : chr "Colorado Avalanche" "Tampa Bay Lightning" "New York Rangers" "Edmonton Oilers" ...
## $ year : int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
## $ games : int 20 23 20 16 14 12 12 10 7 7 ...
## $ wins : int 16 14 10 8 7 6 5 4 3 3 ...
## $ losses : int 4 9 10 8 7 6 7 6 4 4 ...
## $ ties : int 0 0 0 0 0 0 0 0 0 0 ...
## $ shootout_wins : int 5 1 1 1 1 1 1 2 0 1 ...
## $ shootout_losses : int 1 2 2 2 0 1 1 0 0 0 ...
## $ win_loss_percentage: num 0.8 0.609 0.5 0.5 0.5 0.5 0.417 0.4 0.429 0.429 ...
## $ goals_scored : int 85 67 62 65 37 40 35 23 20 17 ...
## $ goals_against : int 55 61 58 59 40 38 39 32 24 27 ...
## $ goal_differential : int 30 6 4 6 -3 2 -4 -9 -4 -10 ...
summary(nhl)
## rank team year games
## Min. : 1.000 Length:1009 Min. :1918 Min. : 2.000
## 1st Qu.: 3.000 Class :character 1st Qu.:1972 1st Qu.: 5.000
## Median : 6.000 Mode :character Median :1990 Median : 7.000
## Mean : 7.067 Mean :1986 Mean : 9.364
## 3rd Qu.:11.000 3rd Qu.:2007 3rd Qu.:12.000
## Max. :24.000 Max. :2022 Max. :27.000
## wins losses ties shootout_wins
## Min. : 0.000 Min. : 0.000 Min. :0.00000 Min. : 0.0000
## 1st Qu.: 1.000 1st Qu.: 4.000 1st Qu.:0.00000 1st Qu.: 0.0000
## Median : 3.000 Median : 4.000 Median :0.00000 Median : 1.0000
## Mean : 4.657 Mean : 4.657 Mean :0.04955 Mean : 0.9326
## 3rd Qu.: 7.000 3rd Qu.: 6.000 3rd Qu.:0.00000 3rd Qu.: 1.0000
## Max. :18.000 Max. :12.000 Max. :4.00000 Max. :10.0000
## shootout_losses win_loss_percentage goals_scored goals_against
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:0.3330 1st Qu.:11.00 1st Qu.:16.00
## Median :1.0000 Median :0.4290 Median :20.00 Median :22.00
## Mean :0.9326 Mean :0.4112 Mean :26.63 Mean :26.63
## 3rd Qu.:1.0000 3rd Qu.:0.5450 3rd Qu.:37.00 3rd Qu.:35.00
## Max. :4.0000 Max. :1.0000 Max. :98.00 Max. :91.00
## goal_differential
## Min. :-27
## 1st Qu.: -6
## Median : -2
## Mean : 0
## 3rd Qu.: 3
## Max. : 49
# Počet chýbajúcich hodnôt
sapply(nhl, function(x) sum(is.na(x)))
## rank team year games
## 0 0 0 0
## wins losses ties shootout_wins
## 0 0 0 0
## shootout_losses win_loss_percentage goals_scored goals_against
## 0 0 0 0
## goal_differential
## 0
# Základné štatistiky pre číselné premenné
numvars <- nhl[sapply(nhl, is.numeric)]
summary(numvars) # oprava: round() odstránené kvôli chybe pri knitovaní
## rank year games wins
## Min. : 1.000 Min. :1918 Min. : 2.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.:1972 1st Qu.: 5.000 1st Qu.: 1.000
## Median : 6.000 Median :1990 Median : 7.000 Median : 3.000
## Mean : 7.067 Mean :1986 Mean : 9.364 Mean : 4.657
## 3rd Qu.:11.000 3rd Qu.:2007 3rd Qu.:12.000 3rd Qu.: 7.000
## Max. :24.000 Max. :2022 Max. :27.000 Max. :18.000
## losses ties shootout_wins shootout_losses
## Min. : 0.000 Min. :0.00000 Min. : 0.0000 Min. :0.0000
## 1st Qu.: 4.000 1st Qu.:0.00000 1st Qu.: 0.0000 1st Qu.:0.0000
## Median : 4.000 Median :0.00000 Median : 1.0000 Median :1.0000
## Mean : 4.657 Mean :0.04955 Mean : 0.9326 Mean :0.9326
## 3rd Qu.: 6.000 3rd Qu.:0.00000 3rd Qu.: 1.0000 3rd Qu.:1.0000
## Max. :12.000 Max. :4.00000 Max. :10.0000 Max. :4.0000
## win_loss_percentage goals_scored goals_against goal_differential
## Min. :0.0000 Min. : 0.00 Min. : 0.00 Min. :-27
## 1st Qu.:0.3330 1st Qu.:11.00 1st Qu.:16.00 1st Qu.: -6
## Median :0.4290 Median :20.00 Median :22.00 Median : -2
## Mean :0.4112 Mean :26.63 Mean :26.63 Mean : 0
## 3rd Qu.:0.5450 3rd Qu.:37.00 3rd Qu.:35.00 3rd Qu.: 3
## Max. :1.0000 Max. :98.00 Max. :91.00 Max. : 49
cor_mat <- cor(numvars, use="pairwise.complete.obs")
print(round(cor_mat,3))
## rank year games wins losses ties shootout_wins
## rank 1.000 0.458 -0.589 -0.652 -0.278 -0.158 -0.420
## year 0.458 1.000 0.318 0.226 0.477 -0.362 0.211
## games -0.589 0.318 1.000 0.967 0.822 -0.128 0.672
## wins -0.652 0.226 0.967 1.000 0.652 -0.098 0.670
## losses -0.278 0.477 0.822 0.652 1.000 -0.293 0.506
## ties -0.158 -0.362 -0.128 -0.098 -0.293 1.000 -0.065
## shootout_wins -0.420 0.211 0.672 0.670 0.506 -0.065 1.000
## shootout_losses -0.095 0.275 0.366 0.300 0.426 -0.129 0.237
## win_loss_percentage -0.697 -0.009 0.683 0.785 0.277 0.070 0.496
## goals_scored -0.572 0.268 0.940 0.944 0.700 -0.131 0.577
## goals_against -0.438 0.361 0.895 0.818 0.845 -0.204 0.569
## goal_differential -0.548 0.000 0.613 0.747 0.160 0.046 0.343
## shootout_losses win_loss_percentage goals_scored
## rank -0.095 -0.697 -0.572
## year 0.275 -0.009 0.268
## games 0.366 0.683 0.940
## wins 0.300 0.785 0.944
## losses 0.426 0.277 0.700
## ties -0.129 0.070 -0.131
## shootout_wins 0.237 0.496 0.577
## shootout_losses 1.000 0.154 0.313
## win_loss_percentage 0.154 1.000 0.701
## goals_scored 0.313 0.701 1.000
## goals_against 0.279 0.513 0.909
## goal_differential 0.236 0.712 0.723
## goals_against goal_differential
## rank -0.438 -0.548
## year 0.361 0.000
## games 0.895 0.613
## wins 0.818 0.747
## losses 0.845 0.160
## ties -0.204 0.046
## shootout_wins 0.569 0.343
## shootout_losses 0.279 0.236
## win_loss_percentage 0.513 0.712
## goals_scored 0.909 0.723
## goals_against 1.000 0.369
## goal_differential 0.369 1.000
pairs(nhl[, c("goal_differential","goals_scored","goals_against","wins","games")], gap=0.4, main="Pairs plot")
V pôvodnom súbore sa uvažoval lineárny model pre Life.expectancy. Tu ako cieľ používam goal_differential. Preukážem postupy, ktoré sú v pôvodnom cvičení: základný OLS, diagnostika, riešenie multikolinearity, robustné SE, transformácie a pod.
nhl <- read_csv("nhlplayoffs.csv")
glimpse(nhl)
## Rows: 1,009
## Columns: 13
## $ rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ team <chr> "Colorado Avalanche", "Tampa Bay Lightning", "New …
## $ year <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 20…
## $ games <dbl> 20, 23, 20, 16, 14, 12, 12, 10, 7, 7, 7, 7, 7, 6, …
## $ wins <dbl> 16, 14, 10, 8, 7, 6, 5, 4, 3, 3, 3, 3, 3, 2, 2, 0,…
## $ losses <dbl> 4, 9, 10, 8, 7, 6, 7, 6, 4, 4, 4, 4, 4, 4, 4, 4, 7…
## $ ties <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ shootout_wins <dbl> 5, 1, 1, 1, 1, 1, 1, 2, 0, 1, 0, 0, 1, 0, 0, 0, 0,…
## $ shootout_losses <dbl> 1, 2, 2, 2, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 2, 1, 4,…
## $ win_loss_percentage <dbl> 0.800, 0.609, 0.500, 0.500, 0.500, 0.500, 0.417, 0…
## $ goals_scored <dbl> 85, 67, 62, 65, 37, 40, 35, 23, 20, 17, 24, 14, 29…
## $ goals_against <dbl> 55, 61, 58, 59, 40, 38, 39, 32, 24, 27, 23, 15, 28…
## $ goal_differential <dbl> 30, 6, 4, 6, -3, 2, -4, -9, -4, -10, 1, -1, 1, -6,…
vars <- nhl %>% select(games, wins, losses, win_loss_percentage, goals_scored, goals_against, goal_differential)
descr_tidy <- bind_rows(lapply(names(vars), function(v){
col <- vars[[v]]
tibble(variable = v,
n = sum(!is.na(col)),
mean = mean(col, na.rm=TRUE),
sd = sd(col, na.rm=TRUE),
min = min(col, na.rm=TRUE),
q1 = quantile(col, .25, na.rm=TRUE),
median = median(col, na.rm=TRUE),
q3 = quantile(col, .75, na.rm=TRUE),
max = max(col, na.rm=TRUE))
}))
Interpretácia (popisná štatistika — celkové
dáta).
Z tabuľky vidíme priemerný počet gólov na zápas (stĺpec
goals_scored) a rozptyl (sd). Medziročné a
medzi-tímové rozdiely sú vysoké. Tieto zistenia naznačujú, že pri
modelovaní by sme mali rátať s veľkou variabilitou.
# knitr options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
# balíčky
if (!requireNamespace("car", quietly = TRUE)) install.packages("car")
library(car)
if (!requireNamespace("psych", quietly = TRUE)) install.packages("psych")
library(psych)
# načítanie dát
nhl <- read.csv("nhlplayoffs.csv", dec = ".", sep = ",", header = TRUE)
# kontrola
dim(nhl)
## [1] 1009 13
str(nhl)
## 'data.frame': 1009 obs. of 13 variables:
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ team : chr "Colorado Avalanche" "Tampa Bay Lightning" "New York Rangers" "Edmonton Oilers" ...
## $ year : int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
## $ games : int 20 23 20 16 14 12 12 10 7 7 ...
## $ wins : int 16 14 10 8 7 6 5 4 3 3 ...
## $ losses : int 4 9 10 8 7 6 7 6 4 4 ...
## $ ties : int 0 0 0 0 0 0 0 0 0 0 ...
## $ shootout_wins : int 5 1 1 1 1 1 1 2 0 1 ...
## $ shootout_losses : int 1 2 2 2 0 1 1 0 0 0 ...
## $ win_loss_percentage: num 0.8 0.609 0.5 0.5 0.5 0.5 0.417 0.4 0.429 0.429 ...
## $ goals_scored : int 85 67 62 65 37 40 35 23 20 17 ...
## $ goals_against : int 55 61 58 59 40 38 39 32 24 27 ...
## $ goal_differential : int 30 6 4 6 -3 2 -4 -9 -4 -10 ...
summary(nhl[c("games","wins","losses","goals_scored","goals_against","goal_differential")])
## games wins losses goals_scored
## Min. : 2.000 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 5.000 1st Qu.: 1.000 1st Qu.: 4.000 1st Qu.:11.00
## Median : 7.000 Median : 3.000 Median : 4.000 Median :20.00
## Mean : 9.364 Mean : 4.657 Mean : 4.657 Mean :26.63
## 3rd Qu.:12.000 3rd Qu.: 7.000 3rd Qu.: 6.000 3rd Qu.:37.00
## Max. :27.000 Max. :18.000 Max. :12.000 Max. :98.00
## goals_against goal_differential
## Min. : 0.00 Min. :-27
## 1st Qu.:16.00 1st Qu.: -6
## Median :22.00 Median : -2
## Mean :26.63 Mean : 0
## 3rd Qu.:35.00 3rd Qu.: 3
## Max. :91.00 Max. : 49
# vyber dát pre analýzu (odstránime identifikačné polia ako team, year ak sú nepotrebné)
nhl2 <- nhl[, c("wins","games","goals_scored","goals_against","goal_differential")]
head(nhl2)
## wins games goals_scored goals_against goal_differential
## 1 16 20 85 55 30
## 2 14 23 67 61 6
## 3 10 20 62 58 4
## 4 8 16 65 59 6
## 5 7 14 37 40 -3
## 6 6 12 40 38 2
Základný model (východiskový) odhadneme takto:
model_base <- lm(wins ~ goals_scored + goals_against + games, data = nhl2)
summary(model_base)
##
## Call:
## lm(formula = wins ~ goals_scored + goals_against + games, data = nhl2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3909 -0.4314 -0.0324 0.3591 3.1894
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.001460 0.050968 -19.65 <2e-16 ***
## goals_scored 0.113663 0.003562 31.91 <2e-16 ***
## goals_against -0.117244 0.003657 -32.06 <2e-16 ***
## games 0.614489 0.011831 51.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7105 on 1005 degrees of freedom
## Multiple R-squared: 0.9727, Adjusted R-squared: 0.9727
## F-statistic: 1.195e+04 on 3 and 1005 DF, p-value: < 2.2e-16
# uložíme si koeficienty a štandardné chyby pre neskoršie porovnanie
coef_base <- summary(model_base)$coefficients
coef_base
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0014605 0.050967749 -19.64891 5.395933e-73
## goals_scored 0.1136630 0.003562468 31.90569 7.548804e-155
## goals_against -0.1172439 0.003656798 -32.06190 6.331893e-156
## games 0.6144894 0.011830852 51.93957 7.517920e-287
Skontrolujeme korelácie medzi vysvetľujúcimi premennými — vysoké korelácie naznačujú multikolinearitu.
# korelačná matica (Pearson)
preds <- nhl2[, c("goals_scored","goals_against","games","goal_differential")]
cor_matrix <- cor(preds, use = "pairwise.complete.obs")
cor_matrix
## goals_scored goals_against games goal_differential
## goals_scored 1.0000000 0.9088340 0.9401903 0.7229132
## goals_against 0.9088340 1.0000000 0.8948069 0.3687775
## games 0.9401903 0.8948069 1.0000000 0.6128774
## goal_differential 0.7229132 0.3687775 0.6128774 1.0000000
# doplnkový prehľad pomocou psych::pairs.panels
pairs.panels(preds)