V tomto R Markdown dokumente sú úlohy z vypracovania Cvicenie7.
Všetky analýzy sú aplikované na súbor nhlplayoffs.csv.
nhl <- read.csv("nhlplayoffs.csv", stringsAsFactors = FALSE)
glimpse(nhl)
## Rows: 1,009
## Columns: 13
## $ rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ team <chr> "Colorado Avalanche", "Tampa Bay Lightning", "New …
## $ year <int> 2022, 2022, 2022, 2022, 2022, 2022, 2022, 2022, 20…
## $ games <int> 20, 23, 20, 16, 14, 12, 12, 10, 7, 7, 7, 7, 7, 6, …
## $ wins <int> 16, 14, 10, 8, 7, 6, 5, 4, 3, 3, 3, 3, 3, 2, 2, 0,…
## $ losses <int> 4, 9, 10, 8, 7, 6, 7, 6, 4, 4, 4, 4, 4, 4, 4, 4, 7…
## $ ties <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ shootout_wins <int> 5, 1, 1, 1, 1, 1, 1, 2, 0, 1, 0, 0, 1, 0, 0, 0, 0,…
## $ shootout_losses <int> 1, 2, 2, 2, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 2, 1, 4,…
## $ win_loss_percentage <dbl> 0.800, 0.609, 0.500, 0.500, 0.500, 0.500, 0.417, 0…
## $ goals_scored <int> 85, 67, 62, 65, 37, 40, 35, 23, 20, 17, 24, 14, 29…
## $ goals_against <int> 55, 61, 58, 59, 40, 38, 39, 32, 24, 27, 23, 15, 28…
## $ goal_differential <int> 30, 6, 4, 6, -3, 2, -4, -9, -4, -10, 1, -1, 1, -6,…
team_counts <- nhl %>% count(team, sort = TRUE)
team_counts %>% slice(1:10) %>% kable() %>% kable_styling(full_width = FALSE)
| team | n |
|---|---|
| Montreal Canadiens | 85 |
| Boston Bruins | 75 |
| Toronto Maple Leafs | 67 |
| Detroit Red Wings | 62 |
| New York Rangers | 61 |
| St. Louis Blues | 45 |
| Chicago Black Hawks | 41 |
| Philadelphia Flyers | 40 |
| Pittsburgh Penguins | 37 |
| Washington Capitals | 32 |
top_team <- team_counts$team[1]
top_team
## [1] "Montreal Canadiens"
vars <- nhl %>% select(games, wins, losses, win_loss_percentage, goals_scored, goals_against, goal_differential)
descr_tidy <- bind_rows(lapply(names(vars), function(v){
col <- vars[[v]]
tibble(variable = v,
n = sum(!is.na(col)),
mean = mean(col, na.rm=TRUE),
sd = sd(col, na.rm=TRUE),
min = min(col, na.rm=TRUE),
q1 = quantile(col, .25, na.rm=TRUE),
median = median(col, na.rm=TRUE),
q3 = quantile(col, .75, na.rm=TRUE),
max = max(col, na.rm=TRUE))
}))
kable(descr_tidy, digits=3) %>% kable_styling(full_width = FALSE)
| variable | n | mean | sd | min | q1 | median | q3 | max |
|---|---|---|---|---|---|---|---|---|
| games | 1009 | 9.364 | 5.791 | 2 | 5.000 | 7.000 | 12.000 | 27 |
| wins | 1009 | 4.657 | 4.296 | 0 | 1.000 | 3.000 | 7.000 | 18 |
| losses | 1009 | 4.657 | 2.037 | 0 | 4.000 | 4.000 | 6.000 | 12 |
| win_loss_percentage | 1009 | 0.411 | 0.210 | 0 | 0.333 | 0.429 | 0.545 | 1 |
| goals_scored | 1009 | 26.631 | 20.582 | 0 | 11.000 | 20.000 | 37.000 | 98 |
| goals_against | 1009 | 26.631 | 15.299 | 0 | 16.000 | 22.000 | 35.000 | 91 |
| goal_differential | 1009 | 0.000 | 9.237 | -27 | -6.000 | -2.000 | 3.000 | 49 |
Interpretácia (popisná štatistika — celkové
dáta).
Z tabuľky vidíme priemerný počet gólov na zápas (stĺpec
goals_scored) a rozptyl (sd). Medziročné a
medzi-tímové rozdiely sú vysoké. Tieto zistenia naznačujú, že pri
modelovaní by sme mali rátať s veľkou variabilitou.
Nižšie uvedené grafy a štatistiky sú vypočítané len pre tím Montreal Canadiens a porovnanie s ostatnými tímami.
team_data <- nhl %>% filter(team == top_team)
other_data <- nhl %>% filter(team != top_team)
nrow(team_data)
## [1] 85
summary(team_data)
## rank team year games
## Min. : 1.000 Length:85 Min. :1918 Min. : 2.000
## 1st Qu.: 1.000 Class :character 1st Qu.:1945 1st Qu.: 5.000
## Median : 3.000 Mode :character Median :1967 Median :10.000
## Mean : 4.435 Mean :1968 Mean : 9.035
## 3rd Qu.: 6.000 3rd Qu.:1989 3rd Qu.:12.000
## Max. :16.000 Max. :2021 Max. :22.000
## wins losses ties shootout_wins
## Min. : 0.000 Min. : 0.000 Min. :0.00000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.:0.00000 1st Qu.: 0.000
## Median : 5.000 Median : 4.000 Median :0.00000 Median : 1.000
## Mean : 5.176 Mean : 3.776 Mean :0.08235 Mean : 1.024
## 3rd Qu.: 8.000 3rd Qu.: 5.000 3rd Qu.:0.00000 3rd Qu.: 1.000
## Max. :16.000 Max. :10.000 Max. :2.00000 Max. :10.000
## shootout_losses win_loss_percentage goals_scored goals_against
## Min. :0.0000 Min. :0.0000 Min. : 2.00 Min. : 2.00
## 1st Qu.:0.0000 1st Qu.:0.3330 1st Qu.: 9.00 1st Qu.:14.00
## Median :1.0000 Median :0.5000 Median :24.00 Median :20.00
## Mean :0.7765 Mean :0.5101 Mean :26.94 Mean :22.85
## 3rd Qu.:1.0000 3rd Qu.:0.6670 3rd Qu.:39.00 3rd Qu.:29.00
## Max. :4.0000 Max. :1.0000 Max. :75.00 Max. :63.00
## goal_differential
## Min. :-11.000
## 1st Qu.: -3.000
## Median : 0.000
## Mean : 4.094
## 3rd Qu.: 11.000
## Max. : 31.000
p1 <- ggplot(team_data, aes(x = goals_scored)) +
geom_histogram(binwidth = 2) +
labs(title = paste("Histogram goals_scored —", top_team), x = "Goals scored", y = "Count")
p2 <- ggplot(nhl %>% filter(team %in% c(top_team, team_counts$team[2])), aes(x = team, y = goals_scored)) +
geom_boxplot() +
labs(title = "Boxplot goals_scored: Top 2 teams", y = "Goals scored", x = "Team") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
cowplot::plot_grid(p1, p2, ncol = 2)
Interpretácia:
Histogram ukazuje distribúciu gólov pre tím Montreal Canadiens. Boxplot
porovnáva rozdelenie gólov medzi dvoma najfrekventovanejšími tímami.
Sledujeme, či má top tím systematicky vyšší medián alebo väčší
rozptyl.
trend_team <- team_data %>%
group_by(year) %>%
summarise(mean_goals = mean(goals_scored, na.rm=TRUE), n = n())
ggplot(trend_team, aes(x = year, y = mean_goals)) +
geom_line() + geom_point() +
labs(title = paste("Priemerné goals_scored podľa rokov —", top_team),
x = "Year", y = "Mean goals scored")
Interpretácia:
Trend graf ukazuje, či sa priemer gólov daného tímu v play-off menil v
priebehu rokov. Hľadáme stabilný rast, pokles alebo veľké výkyvy viazané
na konkrétne roky.
# Vykonáme t-test medzi top tímom a zvyškom datasetu
tt <- t.test(goals_scored ~ I(team == top_team), data = nhl)
tt
##
## Welch Two Sample t-test
##
## data: goals_scored by I(team == top_team)
## t = -0.15435, df = 102.82, p-value = 0.8776
## alternative hypothesis: true difference in means between group FALSE and group TRUE is not equal to 0
## 95 percent confidence interval:
## -4.685978 4.009253
## sample estimates:
## mean in group FALSE mean in group TRUE
## 26.60281 26.94118
Interpretácia (t-test):
- Štatistika t: -0.154
- p-hodnota: 0.878
- Priemery: top tím = 26.94, ostatní = 26.6.
Ak je p-hodnota menšia než 0.05, môžeme konštatovať, že rozdiel v priemernom počte gólov medzi tímom Montreal Canadiens a zvyškom je štatisticky významný (na úrovni 5%). Inak rozdiel nie je štatisticky významný.
Model predikuje goals_scored pomocou games,
wins, win_loss_percentage,
goals_against, shootout_wins.
model <- lm(goals_scored ~ games + wins + win_loss_percentage + goals_against + shootout_wins, data = nhl)
summary(model)
##
## Call:
## lm(formula = goals_scored ~ games + wins + win_loss_percentage +
## goals_against + shootout_wins, data = nhl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.6545 -2.4184 0.1083 2.0582 19.7893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9593 0.5476 1.752 0.0801 .
## games -1.4302 0.1325 -10.796 <2e-16 ***
## wins 4.6182 0.1664 27.749 <2e-16 ***
## win_loss_percentage -0.7820 1.1397 -0.686 0.4928
## goals_against 0.7305 0.0210 34.791 <2e-16 ***
## shootout_wins -1.6890 0.1409 -11.990 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.151 on 1003 degrees of freedom
## Multiple R-squared: 0.9595, Adjusted R-squared: 0.9593
## F-statistic: 4757 on 5 and 1003 DF, p-value: < 2.2e-16
Interpretácia (regresia):
- R² modelu: 0.96 — udáva, akú časť variability
goals_scored model vysvetľuje.
- Dôležité koeficienty (pozrieme summary(model)): pozitívny
koeficient znamená, že pri zvýšení danej premenné očakávame nárast
priemerného počtu gólov (pri ostatných premenných fixných). Negatívny
koeficient znamená opačný efekt.
- Skontrolujeme p-hodnoty pre jednotlivé koeficienty — tie ukazujú,
ktoré premenné sú v modeli štatisticky významné.
model_team <- lm(goals_scored ~ games + wins + win_loss_percentage + goals_against + shootout_wins, data = team_data)
summary(model_team)
##
## Call:
## lm(formula = goals_scored ~ games + wins + win_loss_percentage +
## goals_against + shootout_wins, data = team_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.8400 -2.6780 -0.1857 1.6369 15.6735
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1796 2.1898 0.539 0.59161
## games -1.6351 0.5369 -3.046 0.00315 **
## wins 4.4065 0.5726 7.696 3.40e-11 ***
## win_loss_percentage 0.5838 4.0808 0.143 0.88661
## goals_against 0.8269 0.1101 7.507 7.89e-11 ***
## shootout_wins -1.4317 0.4379 -3.269 0.00160 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.492 on 79 degrees of freedom
## Multiple R-squared: 0.9486, Adjusted R-squared: 0.9453
## F-statistic: 291.5 on 5 and 79 DF, p-value: < 2.2e-16
Interpretácia (regresia pre top tím):
Porovnávame koeficienty a R² s celkovým modelom: ak sa líšia, môže to
znamenať, že vzťahy pre tento tím sú odlišné od priemeru všetkých
tímov.
par(mfrow=c(2,2))
plot(model)
par(mfrow=c(1,1))
Interpretácia diagnostiky:
Skontrolujeme normalitu reziduálov, heteroskedasticitu a prítomnosť
vplyvných bodov. Ak sú podmienky porušené, zvážime transformácie alebo
robustné metódy/regresie s náhodnými efektmi.
## Nelineárna štatistická analýza (porovnanie modelov)
# --- NELINEÁRNA REGRESIA ---
# Kvadratický (polynomiálny) model: goals_scored ~ wins + wins^2
nlm_quad <- lm(goals_scored ~ poly(wins, 2, raw = TRUE), data = nhl)
summary(nlm_quad)
##
## Call:
## lm(formula = goals_scored ~ poly(wins, 2, raw = TRUE), data = nhl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.390 -3.593 -0.338 3.323 34.241
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.95566 0.41072 12.066 <2e-16 ***
## poly(wins, 2, raw = TRUE)1 4.86864 0.15549 31.311 <2e-16 ***
## poly(wins, 2, raw = TRUE)2 -0.02487 0.01054 -2.359 0.0185 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.793 on 1006 degrees of freedom
## Multiple R-squared: 0.8913, Adjusted R-squared: 0.8911
## F-statistic: 4124 on 2 and 1006 DF, p-value: < 2.2e-16
# Logaritmický model: goals_scored ~ log(wins + 1)
nlm_log <- lm(goals_scored ~ log(wins + 1), data = nhl)
summary(nlm_log)
##
## Call:
## lm(formula = goals_scored ~ log(wins + 1), data = nhl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.668 -6.686 -2.314 6.020 41.607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.0200 0.6262 -8.016 3.01e-15 ***
## log(wins + 1) 22.1500 0.3806 58.193 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.859 on 1007 degrees of freedom
## Multiple R-squared: 0.7708, Adjusted R-squared: 0.7706
## F-statistic: 3386 on 1 and 1007 DF, p-value: < 2.2e-16
# Exponenciálny model: goals_scored = a * exp(b * wins)
nlm_exp <- nls(goals_scored ~ a * exp(b * wins),
data = nhl,
start = list(a = 1, b = 0.05))
summary(nlm_exp)
##
## Formula: goals_scored ~ a * exp(b * wins)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a 14.080914 0.278477 50.56 <2e-16 ***
## b 0.113164 0.001701 66.52 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.998 on 1007 degrees of freedom
##
## Number of iterations to convergence: 10
## Achieved convergence tolerance: 7.869e-06
# Porovnanie modelov podľa AIC (nižšia hodnota = lepší model)
AIC(model, nlm_quad, nlm_log, nlm_exp)
## df AIC
## model 7 5743.490
## nlm_quad 4 6734.640
## nlm_log 3 7485.316
## nlm_exp 3 7300.979
# Vizualizácia všetkých modelov pre porovnanie
# Vytvoríme rozsah pre wins a doplníme priemerné hodnoty ostatných premenných
wins_range <- data.frame(
wins = seq(min(nhl$wins, na.rm=TRUE),
max(nhl$wins, na.rm=TRUE), length.out = 100),
games = mean(nhl$games, na.rm=TRUE),
win_loss_percentage = mean(nhl$win_loss_percentage, na.rm=TRUE),
goals_against = mean(nhl$goals_against, na.rm=TRUE),
shootout_wins = mean(nhl$shootout_wins, na.rm=TRUE)
)
# Predikcie pre všetky modely
wins_range$pred_lin <- predict(model, newdata = wins_range)
wins_range$pred_quad <- predict(nlm_quad, newdata = wins_range)
wins_range$pred_log <- predict(nlm_log, newdata = wins_range)
wins_range$pred_exp <- predict(nlm_exp, newdata = wins_range)
wins_range$pred_quad <- predict(nlm_quad, newdata = wins_range)
wins_range$pred_log <- predict(nlm_log, newdata = wins_range)
wins_range$pred_exp <- predict(nlm_exp, newdata = wins_range)
ggplot(nhl, aes(x = wins, y = goals_scored)) +
geom_point(alpha = 0.5) +
geom_line(data = wins_range, aes(y = pred_lin, color = "Lineárny")) +
geom_line(data = wins_range, aes(y = pred_quad, color = "Kvadratický")) +
geom_line(data = wins_range, aes(y = pred_log, color = "Logaritmický")) +
geom_line(data = wins_range, aes(y = pred_exp, color = "Exponenciálny")) +
labs(title = "Porovnanie lineárneho a nelineárnych modelov",
x = "Počet výhier", y = "Počet gólov",
color = "Model") +
theme_minimal()
Interpretácia (nelineárne modely):
Na rozdiel od lineárneho modelu, ktorý predpokladá priamu úmernosť medzi
počtom výhier (wins) a počtom gólov
(goals_scored), nelineárne modely skúmajú, či sa tento
vzťah zakrivuje alebo mení intenzita
vplyvu:
wins^2 štatisticky významný, vzťah nie je
lineárny.b kladný a signifikantný, znamená to, že čím viac tím
vyhráva, tým prudšie sa zvyšuje počet gólov.Graf vizuálne ukazuje, ktorý model najlepšie vystihuje reálne dáta — najpresnejší model by mal ležať najbližšie k bodom a dobre kopírovať ich tvar.
```