Matice

Malé cvičenie

V tejto časti si precvičíme prácu s maticami na niekoľkých praktických úlohách.

Úloha 1: Základné operácie s maticami

Vytvoríme dve matice 3x3 a vypočítame ich súčet, rozdiel a súčin.

# Vytvorenie dvoch matíc 3x3
M1 <- matrix(c(2, 5, 7, 4, 8, 6, 1, 3, 9), nrow = 3, byrow = TRUE)
M2 <- matrix(c(9, 2, 4, 7, 5, 1, 6, 8, 3), nrow = 3, byrow = TRUE)

M1
##      [,1] [,2] [,3]
## [1,]    2    5    7
## [2,]    4    8    6
## [3,]    1    3    9
M2
##      [,1] [,2] [,3]
## [1,]    9    2    4
## [2,]    7    5    1
## [3,]    6    8    3
# Súčet matíc
sum_mat <- M1 + M2
sum_mat
##      [,1] [,2] [,3]
## [1,]   11    7   11
## [2,]   11   13    7
## [3,]    7   11   12
# Rozdiel matíc
diff_mat <- M1 - M2
diff_mat
##      [,1] [,2] [,3]
## [1,]   -7    3    3
## [2,]   -3    3    5
## [3,]   -5   -5    6
# Matematické násobenie matíc
prod_mat <- M1 %*% M2
prod_mat
##      [,1] [,2] [,3]
## [1,]   95   85   34
## [2,]  128   96   42
## [3,]   84   89   34

➡️ Komentár: Vidíme, že výsledné matice majú rovnaké rozmery. Násobenie %*% vykonáva klasický maticový súčin.


Úloha 2: Determinant a inverzia matice

Zistíme determinant matice M1 a pokúsime sa vypočítať jej inverznú maticu.

# Výpočet determinantov
det_M1 <- det(M1)
det_M1
## [1] -14
# Ak determinant nie je nulový, môžeme nájsť inverznú maticu
if (det_M1 != 0) {
  inv_M1 <- solve(M1)
  inv_M1
} else {
  print("Matica nie je invertovateľná (determinant = 0).")
}
##            [,1]        [,2]       [,3]
## [1,] -3.8571429  1.71428571  1.8571429
## [2,]  2.1428571 -0.78571429 -1.1428571
## [3,] -0.2857143  0.07142857  0.2857143

➡️ Komentár: Ak je determinant rôzny od nuly, matica je regulárna a môžeme ju invertovať. V opačnom prípade nie.


Úloha 3: Spájanie matíc

Spojíme dve matice po stĺpcoch a vypíšeme rozmery novej matice.

# Spojenie matíc po stĺpcoch
combined <- cbind(M1, M2)
combined
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    2    5    7    9    2    4
## [2,]    4    8    6    7    5    1
## [3,]    1    3    9    6    8    3
# Zistenie rozmerov
dim(combined)
## [1] 3 6

➡️ Komentár: Nová matica má rovnaký počet riadkov ako pôvodné matice, ale dvojnásobný počet stĺpcov.


Úloha 4: Náhodné matice a vyhľadávanie hodnôt

Vytvoríme maticu s náhodnými číslami a zistíme, kde sa nachádza jej najväčšia hodnota.

# Generovanie náhodnej matice 4x4 s hodnotami od 0 do 20
set.seed(123)  # pre reprodukovateľnosť
R <- matrix(runif(16, 0, 20), nrow = 4)
R
##           [,1]     [,2]      [,3]      [,4]
## [1,]  5.751550 18.80935 11.028700 13.551413
## [2,] 15.766103  0.91113  9.132295 11.452668
## [3,]  8.179538 10.56211 19.136667  2.058494
## [4,] 17.660348 17.84838  9.066683 17.996499
# Nájdeme najväčšiu hodnotu
max_val <- max(R)
max_val
## [1] 19.13667
# Zistíme pozíciu tejto hodnoty
max_index <- which(R == max_val, arr.ind = TRUE)
max_index
##      row col
## [1,]   3   3

➡️ Komentár: Funkcia which(..., arr.ind = TRUE) vráti pozíciu (riadok, stĺpec), kde sa daná hodnota nachádza.


Zhrnutie

V tejto časti sme si ukázali: - ako vytvárať a upravovať matice, - základné operácie a štatistiky, - prácu s transpozíciou, inverziou a determinantom.

Matice tvoria základ pre lineárnu algebru a sú veľmi dôležité pri dátovej analýze a modelovaní.

Skaláre (jednočíselné hodnoty)

Numerické skaláre

# Priradenie konštanty do premennej
a <- 20
b <- 5.5

# Arithmetic
sum_ab      <- a + b        # sucet
diff_ab     <- a - b        # rozdiel
prod_ab     <- a * b        # násobenie
quot_ab     <- a / b        # delenie
power_ab    <- a ^ b        # umocňovanie
mod_ab      <- a %% 3       # zbytok po delení tromi (tzv modulo)


# Rounding
round_b   <- round(b)       # zaokruhlovanie smerom k najblizsiemu celemu cislu
ceil_b    <- ceiling(b)     # najblizsie vyssie cele cislo
floor_b   <- floor(b)       # najblizsie nizsie cele cislo

a; b
## [1] 20
## [1] 5.5
sum_ab; diff_ab; prod_ab; quot_ab; power_ab; mod_ab;
## [1] 25.5
## [1] 14.5
## [1] 110
## [1] 3.636364
## [1] 14310835
## [1] 2
round_b; ceil_b; floor_b
## [1] 6
## [1] 6
## [1] 5

Poznámky

  • ^ operátor umocňovania.
  • %% je modulo, teda zbytok po delení,
  • round(x, digits = 0) zaokrúhľovanie na určitý počet desatinných miest (digits=). ak digits = 0, potom ide o celočíselné zaokrúhľovanie

Analýza výkonov NHL tímov v play-off (2006–2022)

knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE
)
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(dplyr)

# Import CSV súboru

nhl <- read_csv("nhlplayoffs.csv")

# Zobrazenie prvých riadkov a názvov stĺpcov

head(nhl)
## # A tibble: 6 × 13
##    rank team         year games  wins losses  ties shootout_wins shootout_losses
##   <dbl> <chr>       <dbl> <dbl> <dbl>  <dbl> <dbl>         <dbl>           <dbl>
## 1     1 Colorado A…  2022    20    16      4     0             5               1
## 2     2 Tampa Bay …  2022    23    14      9     0             1               2
## 3     3 New York R…  2022    20    10     10     0             1               2
## 4     4 Edmonton O…  2022    16     8      8     0             1               2
## 5     5 Carolina H…  2022    14     7      7     0             1               0
## 6     6 St. Louis …  2022    12     6      6     0             1               1
## # ℹ 4 more variables: win_loss_percentage <dbl>, goals_scored <dbl>,
## #   goals_against <dbl>, goal_differential <dbl>
colnames(nhl)
##  [1] "rank"                "team"                "year"               
##  [4] "games"               "wins"                "losses"             
##  [7] "ties"                "shootout_wins"       "shootout_losses"    
## [10] "win_loss_percentage" "goals_scored"        "goals_against"      
## [13] "goal_differential"
nhl <- nhl %>%
rename(
Year = year,
Team = team,
Wins = wins,
Losses = losses,
GoalsAgainst = goals_against,
) %>%
mutate(
) %>%
filter(!is.na(Year))

summary(nhl)
##       rank            Team                Year          games       
##  Min.   : 1.000   Length:1009        Min.   :1918   Min.   : 2.000  
##  1st Qu.: 3.000   Class :character   1st Qu.:1972   1st Qu.: 5.000  
##  Median : 6.000   Mode  :character   Median :1990   Median : 7.000  
##  Mean   : 7.067                      Mean   :1986   Mean   : 9.364  
##  3rd Qu.:11.000                      3rd Qu.:2007   3rd Qu.:12.000  
##  Max.   :24.000                      Max.   :2022   Max.   :27.000  
##       Wins            Losses            ties         shootout_wins    
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.00000   Min.   : 0.0000  
##  1st Qu.: 1.000   1st Qu.: 4.000   1st Qu.:0.00000   1st Qu.: 0.0000  
##  Median : 3.000   Median : 4.000   Median :0.00000   Median : 1.0000  
##  Mean   : 4.657   Mean   : 4.657   Mean   :0.04955   Mean   : 0.9326  
##  3rd Qu.: 7.000   3rd Qu.: 6.000   3rd Qu.:0.00000   3rd Qu.: 1.0000  
##  Max.   :18.000   Max.   :12.000   Max.   :4.00000   Max.   :10.0000  
##  shootout_losses  win_loss_percentage  goals_scored    GoalsAgainst  
##  Min.   :0.0000   Min.   :0.0000      Min.   : 0.00   Min.   : 0.00  
##  1st Qu.:0.0000   1st Qu.:0.3330      1st Qu.:11.00   1st Qu.:16.00  
##  Median :1.0000   Median :0.4290      Median :20.00   Median :22.00  
##  Mean   :0.9326   Mean   :0.4112      Mean   :26.63   Mean   :26.63  
##  3rd Qu.:1.0000   3rd Qu.:0.5450      3rd Qu.:37.00   3rd Qu.:35.00  
##  Max.   :4.0000   Max.   :1.0000      Max.   :98.00   Max.   :91.00  
##  goal_differential
##  Min.   :-27      
##  1st Qu.: -6      
##  Median : -2      
##  Mean   :  0      
##  3rd Qu.:  3      
##  Max.   : 49
library(ggplot2)

ggplot(nhl, aes(x = goals_scored, y = win_loss_percentage)) +
geom_point(alpha = 0.6, color = "steelblue") +
geom_smooth(method = "lm", color = "red", se = FALSE) +
theme_minimal() +
labs(
title = "Vzťah medzi počtom gólov a výherným percentom",
x = "Počet strelených gólov",
y = "Percento výhier (%)"
)

ggplot(nhl, aes(x = factor(Year), y = Wins)) +
geom_boxplot(fill = "lightblue", color = "darkblue") +
theme_minimal() +
labs(
title = "Rozdelenie počtu výhier v play-off podľa rokov",
x = "Rok",
y = "Počet výhier"
)

nhl.trend <- nhl %>%
group_by(Year) %>%
summarise(AvgWinPercent = mean(win_loss_percentage, na.rm = TRUE))

ggplot(nhl.trend, aes(x = Year, y = AvgWinPercent)) +
geom_line(color = "darkred", size = 1) +
geom_point(color = "black") +
theme_minimal() +
labs(
title = "Priemerné výherné percento tímov NHL (2006–2022)",
x = "Rok",
y = "Priemerné % výhier"
)

Balíčky

# Nainštaluj balíčky, ak ešte nie sú nainštalované
needed <- c("tidyverse", "broom", "lmtest", "car", "tseries", "sandwich", "ggplot2")
new <- needed[!(needed %in% installed.packages()[, "Package"]) ]
if(length(new)) install.packages(new)

library(tidyverse)
library(broom)
library(lmtest)
library(car)
library(tseries)
library(sandwich)
library(ggplot2)

Načítanie dát

# Uprav cestu podľa potreby. Tu sa predpokladá "/mnt/data/nhlplayoffs.csv".
nhl <- read_csv("nhlplayoffs.csv")

# Zobrazenie základných informácií
glimpse(nhl)
## Rows: 1,009
## Columns: 13
## $ rank                <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ team                <chr> "Colorado Avalanche", "Tampa Bay Lightning", "New …
## $ year                <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 20…
## $ games               <dbl> 20, 23, 20, 16, 14, 12, 12, 10, 7, 7, 7, 7, 7, 6, …
## $ wins                <dbl> 16, 14, 10, 8, 7, 6, 5, 4, 3, 3, 3, 3, 3, 2, 2, 0,…
## $ losses              <dbl> 4, 9, 10, 8, 7, 6, 7, 6, 4, 4, 4, 4, 4, 4, 4, 4, 7…
## $ ties                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ shootout_wins       <dbl> 5, 1, 1, 1, 1, 1, 1, 2, 0, 1, 0, 0, 1, 0, 0, 0, 0,…
## $ shootout_losses     <dbl> 1, 2, 2, 2, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 2, 1, 4,…
## $ win_loss_percentage <dbl> 0.800, 0.609, 0.500, 0.500, 0.500, 0.500, 0.417, 0…
## $ goals_scored        <dbl> 85, 67, 62, 65, 37, 40, 35, 23, 20, 17, 24, 14, 29…
## $ goals_against       <dbl> 55, 61, 58, 59, 40, 38, 39, 32, 24, 27, 23, 15, 28…
## $ goal_differential   <dbl> 30, 6, 4, 6, -3, 2, -4, -9, -4, -10, 1, -1, 1, -6,…
summary(nhl)
##       rank            team                year          games       
##  Min.   : 1.000   Length:1009        Min.   :1918   Min.   : 2.000  
##  1st Qu.: 3.000   Class :character   1st Qu.:1972   1st Qu.: 5.000  
##  Median : 6.000   Mode  :character   Median :1990   Median : 7.000  
##  Mean   : 7.067                      Mean   :1986   Mean   : 9.364  
##  3rd Qu.:11.000                      3rd Qu.:2007   3rd Qu.:12.000  
##  Max.   :24.000                      Max.   :2022   Max.   :27.000  
##       wins            losses            ties         shootout_wins    
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.00000   Min.   : 0.0000  
##  1st Qu.: 1.000   1st Qu.: 4.000   1st Qu.:0.00000   1st Qu.: 0.0000  
##  Median : 3.000   Median : 4.000   Median :0.00000   Median : 1.0000  
##  Mean   : 4.657   Mean   : 4.657   Mean   :0.04955   Mean   : 0.9326  
##  3rd Qu.: 7.000   3rd Qu.: 6.000   3rd Qu.:0.00000   3rd Qu.: 1.0000  
##  Max.   :18.000   Max.   :12.000   Max.   :4.00000   Max.   :10.0000  
##  shootout_losses  win_loss_percentage  goals_scored   goals_against  
##  Min.   :0.0000   Min.   :0.0000      Min.   : 0.00   Min.   : 0.00  
##  1st Qu.:0.0000   1st Qu.:0.3330      1st Qu.:11.00   1st Qu.:16.00  
##  Median :1.0000   Median :0.4290      Median :20.00   Median :22.00  
##  Mean   :0.9326   Mean   :0.4112      Mean   :26.63   Mean   :26.63  
##  3rd Qu.:1.0000   3rd Qu.:0.5450      3rd Qu.:37.00   3rd Qu.:35.00  
##  Max.   :4.0000   Max.   :1.0000      Max.   :98.00   Max.   :91.00  
##  goal_differential
##  Min.   :-27      
##  1st Qu.: -6      
##  Median : -2      
##  Mean   :  0      
##  3rd Qu.:  3      
##  Max.   : 49
# Prvých 10 riadkov
head(nhl, 10)
## # A tibble: 10 × 13
##     rank team        year games  wins losses  ties shootout_wins shootout_losses
##    <dbl> <chr>      <dbl> <dbl> <dbl>  <dbl> <dbl>         <dbl>           <dbl>
##  1     1 Colorado …  2022    20    16      4     0             5               1
##  2     2 Tampa Bay…  2022    23    14      9     0             1               2
##  3     3 New York …  2022    20    10     10     0             1               2
##  4     4 Edmonton …  2022    16     8      8     0             1               2
##  5     5 Carolina …  2022    14     7      7     0             1               0
##  6     6 St. Louis…  2022    12     6      6     0             1               1
##  7     7 Calgary F…  2022    12     5      7     0             1               1
##  8     8 Florida P…  2022    10     4      6     0             2               0
##  9     9 Boston Br…  2022     7     3      4     0             0               0
## 10    10 Los Angel…  2022     7     3      4     0             1               0
## # ℹ 4 more variables: win_loss_percentage <dbl>, goals_scored <dbl>,
## #   goals_against <dbl>, goal_differential <dbl>

Predspracovanie

# Vyberieme len premenné relevantné pre model
nhl_model <- nhl %>%
  select(win_loss_percentage, goals_scored, goals_against, games, team, year) %>%
  mutate(across(c(win_loss_percentage, goals_scored, goals_against, games), as.numeric))

# Kontrola NA
nhl_model %>% summarise(across(everything(), ~ sum(is.na(.))))
## # A tibble: 1 × 6
##   win_loss_percentage goals_scored goals_against games  team  year
##                 <int>        <int>         <int> <int> <int> <int>
## 1                   0            0             0     0     0     0
# Ak by boli NA, jednoduchá imputácia priemerom (tu len ako postup)
if(any(is.na(nhl_model))) {
  nhl_model <- nhl_model %>%
    mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
}

Deskriptívna štatistika a korelácie

# Základné štatistiky
nhl_model %>% select(win_loss_percentage, goals_scored, goals_against, games) %>% summary()
##  win_loss_percentage  goals_scored   goals_against       games       
##  Min.   :0.0000      Min.   : 0.00   Min.   : 0.00   Min.   : 2.000  
##  1st Qu.:0.3330      1st Qu.:11.00   1st Qu.:16.00   1st Qu.: 5.000  
##  Median :0.4290      Median :20.00   Median :22.00   Median : 7.000  
##  Mean   :0.4112      Mean   :26.63   Mean   :26.63   Mean   : 9.364  
##  3rd Qu.:0.5450      3rd Qu.:37.00   3rd Qu.:35.00   3rd Qu.:12.000  
##  Max.   :1.0000      Max.   :98.00   Max.   :91.00   Max.   :27.000
# Korelačná matica
cor_mat <- nhl_model %>% select(win_loss_percentage, goals_scored, goals_against, games) %>% cor()
cor_mat
##                     win_loss_percentage goals_scored goals_against     games
## win_loss_percentage           1.0000000    0.7009174     0.5133309 0.6829670
## goals_scored                  0.7009174    1.0000000     0.9088340 0.9401903
## goals_against                 0.5133309    0.9088340     1.0000000 0.8948069
## games                         0.6829670    0.9401903     0.8948069 1.0000000
# Heatmap korelácií
library(ggplot2)
cor_df <- as.data.frame(as.table(cor_mat))
ggplot(cor_df, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = round(Freq, 2))) +
  labs(title = "Korelačná matica") +
  theme_minimal()

Odhad regresného modelu

model <- lm(win_loss_percentage ~ goals_scored + goals_against + games, data = nhl_model)
summary(model)
## 
## Call:
## lm(formula = win_loss_percentage ~ goals_scored + goals_against + 
##     games, data = nhl_model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30101 -0.06681  0.01477  0.07457  0.65960 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.2767744  0.0095021   29.13  < 2e-16 ***
## goals_scored   0.0100937  0.0006642   15.20  < 2e-16 ***
## goals_against -0.0113519  0.0006818  -16.65  < 2e-16 ***
## games          0.0179319  0.0022057    8.13 1.26e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1325 on 1005 degrees of freedom
## Multiple R-squared:  0.6052, Adjusted R-squared:  0.604 
## F-statistic: 513.5 on 3 and 1005 DF,  p-value: < 2.2e-16
# Harmonizovaný tidy výstup
tidy(model)
## # A tibble: 4 × 5
##   term          estimate std.error statistic   p.value
##   <chr>            <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)     0.277   0.00950      29.1  1.00e-135
## 2 goals_scored    0.0101  0.000664     15.2  4.15e- 47
## 3 goals_against  -0.0114  0.000682    -16.7  3.65e- 55
## 4 games           0.0179  0.00221       8.13 1.26e- 15
glance(model)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.605         0.604 0.132      513. 3.12e-202     3   610. -1210. -1185.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Diagnostika rezíduí

# Rezidua a fitted hodnoty
nhl_model <- nhl_model %>%
  mutate(fitted = fitted(model), residuals = residuals(model), std_resid = rstandard(model))

# Graf: rezidua vs fitted
ggplot(nhl_model, aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Rezidua vs Fitted hodnoty", x = "Fitted", y = "Rezidua") +
  theme_minimal()

# Histogram rezíduí + normálna krivka
ggplot(nhl_model, aes(x = residuals)) +
  geom_histogram(aes(y = ..density..), bins = 30) +
  stat_function(fun = dnorm, args = list(mean = mean(nhl_model$residuals), sd = sd(nhl_model$residuals))) +
  labs(title = "Histogram rezíduí s normálnou krivkou") +
  theme_minimal()

# QQ-plot
qqPlot(model, main = "QQ-plot rezíduí")

## [1] 996 998
# Jarque-Bera test (normálnosť)
jarque.test <- tseries::jarque.bera.test(nhl_model$residuals)
jarque.test
## 
##  Jarque Bera Test
## 
## data:  nhl_model$residuals
## X-squared = 162.83, df = 2, p-value < 2.2e-16

Test heteroskedasticity

# Breusch-Pagan test
bptest_res <- bptest(model)
bptest_res
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 72.531, df = 3, p-value = 1.225e-15
# White test (approx pomocou bptest na kvadratické termíny)
# Pridáme kvadratické a interakčné termíny pre aproximáciu White testu
model_white <- lm(residuals(model)^2 ~ fitted(model) + I(fitted(model)^2), data = nhl_model)
summary(model_white)
## 
## Call:
## lm(formula = residuals(model)^2 ~ fitted(model) + I(fitted(model)^2), 
##     data = nhl_model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03452 -0.01482 -0.00935  0.00232  0.41837 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.064579   0.007487   8.625  < 2e-16 ***
## fitted(model)      -0.206015   0.032159  -6.406 2.29e-10 ***
## I(fitted(model)^2)  0.192016   0.030854   6.223 7.13e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03406 on 1006 degrees of freedom
## Multiple R-squared:  0.03925,    Adjusted R-squared:  0.03734 
## F-statistic: 20.55 on 2 and 1006 DF,  p-value: 1.795e-09

Autokorelácia

# Durbin-Watson test
dw_res <- lmtest::dwtest(model)
dw_res
## 
##  Durbin-Watson test
## 
## data:  model
## DW = 1.4977, p-value = 5.168e-16
## alternative hypothesis: true autocorrelation is greater than 0

Multikolinearita

vifs <- car::vif(model)
vifs
##  goals_scored goals_against         games 
##     10.735066      6.249835      9.372463

1. Načítanie dát

nhl <- read.csv("nhlplayoffs.csv")
cat("Rozmery datasetu:", dim(nhl), "\n")
## Rozmery datasetu: 1009 13
str(nhl)
## 'data.frame':    1009 obs. of  13 variables:
##  $ rank               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ team               : chr  "Colorado Avalanche" "Tampa Bay Lightning" "New York Rangers" "Edmonton Oilers" ...
##  $ year               : int  2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
##  $ games              : int  20 23 20 16 14 12 12 10 7 7 ...
##  $ wins               : int  16 14 10 8 7 6 5 4 3 3 ...
##  $ losses             : int  4 9 10 8 7 6 7 6 4 4 ...
##  $ ties               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ shootout_wins      : int  5 1 1 1 1 1 1 2 0 1 ...
##  $ shootout_losses    : int  1 2 2 2 0 1 1 0 0 0 ...
##  $ win_loss_percentage: num  0.8 0.609 0.5 0.5 0.5 0.5 0.417 0.4 0.429 0.429 ...
##  $ goals_scored       : int  85 67 62 65 37 40 35 23 20 17 ...
##  $ goals_against      : int  55 61 58 59 40 38 39 32 24 27 ...
##  $ goal_differential  : int  30 6 4 6 -3 2 -4 -9 -4 -10 ...
summary(nhl)
##       rank            team                year          games       
##  Min.   : 1.000   Length:1009        Min.   :1918   Min.   : 2.000  
##  1st Qu.: 3.000   Class :character   1st Qu.:1972   1st Qu.: 5.000  
##  Median : 6.000   Mode  :character   Median :1990   Median : 7.000  
##  Mean   : 7.067                      Mean   :1986   Mean   : 9.364  
##  3rd Qu.:11.000                      3rd Qu.:2007   3rd Qu.:12.000  
##  Max.   :24.000                      Max.   :2022   Max.   :27.000  
##       wins            losses            ties         shootout_wins    
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.00000   Min.   : 0.0000  
##  1st Qu.: 1.000   1st Qu.: 4.000   1st Qu.:0.00000   1st Qu.: 0.0000  
##  Median : 3.000   Median : 4.000   Median :0.00000   Median : 1.0000  
##  Mean   : 4.657   Mean   : 4.657   Mean   :0.04955   Mean   : 0.9326  
##  3rd Qu.: 7.000   3rd Qu.: 6.000   3rd Qu.:0.00000   3rd Qu.: 1.0000  
##  Max.   :18.000   Max.   :12.000   Max.   :4.00000   Max.   :10.0000  
##  shootout_losses  win_loss_percentage  goals_scored   goals_against  
##  Min.   :0.0000   Min.   :0.0000      Min.   : 0.00   Min.   : 0.00  
##  1st Qu.:0.0000   1st Qu.:0.3330      1st Qu.:11.00   1st Qu.:16.00  
##  Median :1.0000   Median :0.4290      Median :20.00   Median :22.00  
##  Mean   :0.9326   Mean   :0.4112      Mean   :26.63   Mean   :26.63  
##  3rd Qu.:1.0000   3rd Qu.:0.5450      3rd Qu.:37.00   3rd Qu.:35.00  
##  Max.   :4.0000   Max.   :1.0000      Max.   :98.00   Max.   :91.00  
##  goal_differential
##  Min.   :-27      
##  1st Qu.: -6      
##  Median : -2      
##  Mean   :  0      
##  3rd Qu.:  3      
##  Max.   : 49

2. Popis dát a EDA (všetky originálne kroky aplikované na nhlplayoffs.csv)

# Počet chýbajúcich hodnôt
sapply(nhl, function(x) sum(is.na(x)))
##                rank                team                year               games 
##                   0                   0                   0                   0 
##                wins              losses                ties       shootout_wins 
##                   0                   0                   0                   0 
##     shootout_losses win_loss_percentage        goals_scored       goals_against 
##                   0                   0                   0                   0 
##   goal_differential 
##                   0
# Základné štatistiky pre číselné premenné
numvars <- nhl[sapply(nhl, is.numeric)]
summary(numvars) # oprava: round() odstránené kvôli chybe pri knitovaní
##       rank             year          games             wins       
##  Min.   : 1.000   Min.   :1918   Min.   : 2.000   Min.   : 0.000  
##  1st Qu.: 3.000   1st Qu.:1972   1st Qu.: 5.000   1st Qu.: 1.000  
##  Median : 6.000   Median :1990   Median : 7.000   Median : 3.000  
##  Mean   : 7.067   Mean   :1986   Mean   : 9.364   Mean   : 4.657  
##  3rd Qu.:11.000   3rd Qu.:2007   3rd Qu.:12.000   3rd Qu.: 7.000  
##  Max.   :24.000   Max.   :2022   Max.   :27.000   Max.   :18.000  
##      losses            ties         shootout_wins     shootout_losses 
##  Min.   : 0.000   Min.   :0.00000   Min.   : 0.0000   Min.   :0.0000  
##  1st Qu.: 4.000   1st Qu.:0.00000   1st Qu.: 0.0000   1st Qu.:0.0000  
##  Median : 4.000   Median :0.00000   Median : 1.0000   Median :1.0000  
##  Mean   : 4.657   Mean   :0.04955   Mean   : 0.9326   Mean   :0.9326  
##  3rd Qu.: 6.000   3rd Qu.:0.00000   3rd Qu.: 1.0000   3rd Qu.:1.0000  
##  Max.   :12.000   Max.   :4.00000   Max.   :10.0000   Max.   :4.0000  
##  win_loss_percentage  goals_scored   goals_against   goal_differential
##  Min.   :0.0000      Min.   : 0.00   Min.   : 0.00   Min.   :-27      
##  1st Qu.:0.3330      1st Qu.:11.00   1st Qu.:16.00   1st Qu.: -6      
##  Median :0.4290      Median :20.00   Median :22.00   Median : -2      
##  Mean   :0.4112      Mean   :26.63   Mean   :26.63   Mean   :  0      
##  3rd Qu.:0.5450      3rd Qu.:37.00   3rd Qu.:35.00   3rd Qu.:  3      
##  Max.   :1.0000      Max.   :98.00   Max.   :91.00   Max.   : 49

2.2 Korelácie a párové grafy (originálne kroky)

cor_mat <- cor(numvars, use="pairwise.complete.obs")
print(round(cor_mat,3))
##                       rank   year  games   wins losses   ties shootout_wins
## rank                 1.000  0.458 -0.589 -0.652 -0.278 -0.158        -0.420
## year                 0.458  1.000  0.318  0.226  0.477 -0.362         0.211
## games               -0.589  0.318  1.000  0.967  0.822 -0.128         0.672
## wins                -0.652  0.226  0.967  1.000  0.652 -0.098         0.670
## losses              -0.278  0.477  0.822  0.652  1.000 -0.293         0.506
## ties                -0.158 -0.362 -0.128 -0.098 -0.293  1.000        -0.065
## shootout_wins       -0.420  0.211  0.672  0.670  0.506 -0.065         1.000
## shootout_losses     -0.095  0.275  0.366  0.300  0.426 -0.129         0.237
## win_loss_percentage -0.697 -0.009  0.683  0.785  0.277  0.070         0.496
## goals_scored        -0.572  0.268  0.940  0.944  0.700 -0.131         0.577
## goals_against       -0.438  0.361  0.895  0.818  0.845 -0.204         0.569
## goal_differential   -0.548  0.000  0.613  0.747  0.160  0.046         0.343
##                     shootout_losses win_loss_percentage goals_scored
## rank                         -0.095              -0.697       -0.572
## year                          0.275              -0.009        0.268
## games                         0.366               0.683        0.940
## wins                          0.300               0.785        0.944
## losses                        0.426               0.277        0.700
## ties                         -0.129               0.070       -0.131
## shootout_wins                 0.237               0.496        0.577
## shootout_losses               1.000               0.154        0.313
## win_loss_percentage           0.154               1.000        0.701
## goals_scored                  0.313               0.701        1.000
## goals_against                 0.279               0.513        0.909
## goal_differential             0.236               0.712        0.723
##                     goals_against goal_differential
## rank                       -0.438            -0.548
## year                        0.361             0.000
## games                       0.895             0.613
## wins                        0.818             0.747
## losses                      0.845             0.160
## ties                       -0.204             0.046
## shootout_wins               0.569             0.343
## shootout_losses             0.279             0.236
## win_loss_percentage         0.513             0.712
## goals_scored                0.909             0.723
## goals_against               1.000             0.369
## goal_differential           0.369             1.000
pairs(nhl[, c("goal_differential","goals_scored","goals_against","wins","games")], gap=0.4, main="Pairs plot")

3. Modelovanie - krok za krokom

V pôvodnom súbore sa uvažoval lineárny model pre Life.expectancy. Tu ako cieľ používam goal_differential. Preukážem postupy, ktoré sú v pôvodnom cvičení: základný OLS, diagnostika, riešenie multikolinearity, robustné SE, transformácie a pod.

Načítanie dát

nhl <- read_csv("nhlplayoffs.csv")
glimpse(nhl)
## Rows: 1,009
## Columns: 13
## $ rank                <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ team                <chr> "Colorado Avalanche", "Tampa Bay Lightning", "New …
## $ year                <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 20…
## $ games               <dbl> 20, 23, 20, 16, 14, 12, 12, 10, 7, 7, 7, 7, 7, 6, …
## $ wins                <dbl> 16, 14, 10, 8, 7, 6, 5, 4, 3, 3, 3, 3, 3, 2, 2, 0,…
## $ losses              <dbl> 4, 9, 10, 8, 7, 6, 7, 6, 4, 4, 4, 4, 4, 4, 4, 4, 7…
## $ ties                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ shootout_wins       <dbl> 5, 1, 1, 1, 1, 1, 1, 2, 0, 1, 0, 0, 1, 0, 0, 0, 0,…
## $ shootout_losses     <dbl> 1, 2, 2, 2, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 2, 1, 4,…
## $ win_loss_percentage <dbl> 0.800, 0.609, 0.500, 0.500, 0.500, 0.500, 0.417, 0…
## $ goals_scored        <dbl> 85, 67, 62, 65, 37, 40, 35, 23, 20, 17, 24, 14, 29…
## $ goals_against       <dbl> 55, 61, 58, 59, 40, 38, 39, 32, 24, 27, 23, 15, 28…
## $ goal_differential   <dbl> 30, 6, 4, 6, -3, 2, -4, -9, -4, -10, 1, -1, 1, -6,…

Popisná štatistika (celé dáta)

vars <- nhl %>% select(games, wins, losses, win_loss_percentage, goals_scored, goals_against, goal_differential)
descr_tidy <- bind_rows(lapply(names(vars), function(v){
  col <- vars[[v]]
  tibble(variable = v,
         n = sum(!is.na(col)),
         mean = mean(col, na.rm=TRUE),
         sd = sd(col, na.rm=TRUE),
         min = min(col, na.rm=TRUE),
         q1 = quantile(col, .25, na.rm=TRUE),
         median = median(col, na.rm=TRUE),
         q3 = quantile(col, .75, na.rm=TRUE),
         max = max(col, na.rm=TRUE))
}))

Interpretácia (popisná štatistika — celkové dáta).
Z tabuľky vidíme priemerný počet gólov na zápas (stĺpec goals_scored) a rozptyl (sd). Medziročné a medzi-tímové rozdiely sú vysoké. Tieto zistenia naznačujú, že pri modelovaní by sme mali rátať s veľkou variabilitou.

# knitr options
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)


# balíčky
if (!requireNamespace("car", quietly = TRUE)) install.packages("car")
library(car)


if (!requireNamespace("psych", quietly = TRUE)) install.packages("psych")
library(psych)


# načítanie dát
nhl <- read.csv("nhlplayoffs.csv", dec = ".", sep = ",", header = TRUE)


# kontrola
dim(nhl)
## [1] 1009   13
str(nhl)
## 'data.frame':    1009 obs. of  13 variables:
##  $ rank               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ team               : chr  "Colorado Avalanche" "Tampa Bay Lightning" "New York Rangers" "Edmonton Oilers" ...
##  $ year               : int  2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
##  $ games              : int  20 23 20 16 14 12 12 10 7 7 ...
##  $ wins               : int  16 14 10 8 7 6 5 4 3 3 ...
##  $ losses             : int  4 9 10 8 7 6 7 6 4 4 ...
##  $ ties               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ shootout_wins      : int  5 1 1 1 1 1 1 2 0 1 ...
##  $ shootout_losses    : int  1 2 2 2 0 1 1 0 0 0 ...
##  $ win_loss_percentage: num  0.8 0.609 0.5 0.5 0.5 0.5 0.417 0.4 0.429 0.429 ...
##  $ goals_scored       : int  85 67 62 65 37 40 35 23 20 17 ...
##  $ goals_against      : int  55 61 58 59 40 38 39 32 24 27 ...
##  $ goal_differential  : int  30 6 4 6 -3 2 -4 -9 -4 -10 ...
summary(nhl[c("games","wins","losses","goals_scored","goals_against","goal_differential")])
##      games             wins            losses        goals_scored  
##  Min.   : 2.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.: 5.000   1st Qu.: 1.000   1st Qu.: 4.000   1st Qu.:11.00  
##  Median : 7.000   Median : 3.000   Median : 4.000   Median :20.00  
##  Mean   : 9.364   Mean   : 4.657   Mean   : 4.657   Mean   :26.63  
##  3rd Qu.:12.000   3rd Qu.: 7.000   3rd Qu.: 6.000   3rd Qu.:37.00  
##  Max.   :27.000   Max.   :18.000   Max.   :12.000   Max.   :98.00  
##  goals_against   goal_differential
##  Min.   : 0.00   Min.   :-27      
##  1st Qu.:16.00   1st Qu.: -6      
##  Median :22.00   Median : -2      
##  Mean   :26.63   Mean   :  0      
##  3rd Qu.:35.00   3rd Qu.:  3      
##  Max.   :91.00   Max.   : 49
# vyber dát pre analýzu (odstránime identifikačné polia ako team, year ak sú nepotrebné)
nhl2 <- nhl[, c("wins","games","goals_scored","goals_against","goal_differential")]
head(nhl2)
##   wins games goals_scored goals_against goal_differential
## 1   16    20           85            55                30
## 2   14    23           67            61                 6
## 3   10    20           62            58                 4
## 4    8    16           65            59                 6
## 5    7    14           37            40                -3
## 6    6    12           40            38                 2

4. Odhad základného regresného modelu

Základný model (východiskový) odhadneme takto:

model_base <- lm(wins ~ goals_scored + goals_against + games, data = nhl2)
summary(model_base)
## 
## Call:
## lm(formula = wins ~ goals_scored + goals_against + games, data = nhl2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3909 -0.4314 -0.0324  0.3591  3.1894 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.001460   0.050968  -19.65   <2e-16 ***
## goals_scored   0.113663   0.003562   31.91   <2e-16 ***
## goals_against -0.117244   0.003657  -32.06   <2e-16 ***
## games          0.614489   0.011831   51.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7105 on 1005 degrees of freedom
## Multiple R-squared:  0.9727, Adjusted R-squared:  0.9727 
## F-statistic: 1.195e+04 on 3 and 1005 DF,  p-value: < 2.2e-16
# uložíme si koeficienty a štandardné chyby pre neskoršie porovnanie
coef_base <- summary(model_base)$coefficients
coef_base
##                 Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept)   -1.0014605 0.050967749 -19.64891  5.395933e-73
## goals_scored   0.1136630 0.003562468  31.90569 7.548804e-155
## goals_against -0.1172439 0.003656798 -32.06190 6.331893e-156
## games          0.6144894 0.011830852  51.93957 7.517920e-287

5. Korelačná matica

Skontrolujeme korelácie medzi vysvetľujúcimi premennými — vysoké korelácie naznačujú multikolinearitu.

# korelačná matica (Pearson)
preds <- nhl2[, c("goals_scored","goals_against","games","goal_differential")]
cor_matrix <- cor(preds, use = "pairwise.complete.obs")
cor_matrix
##                   goals_scored goals_against     games goal_differential
## goals_scored         1.0000000     0.9088340 0.9401903         0.7229132
## goals_against        0.9088340     1.0000000 0.8948069         0.3687775
## games                0.9401903     0.8948069 1.0000000         0.6128774
## goal_differential    0.7229132     0.3687775 0.6128774         1.0000000
# doplnkový prehľad pomocou psych::pairs.panels
pairs.panels(preds)