This file works through a set of exercises applied to the nhlplayoffs.csv dataset. The analysis covers a description of the data, regression modelling, diagnostic tests (heteroskedasticity, autocorrelation, VIF) and recommendations.
nhl <- read.csv("nhlplayoffs.csv", stringsAsFactors = FALSE)
head(nhl)
summary(nhl)
## rank team year games
## Min. : 1.000 Length:1009 Min. :1918 Min. : 2.000
## 1st Qu.: 3.000 Class :character 1st Qu.:1972 1st Qu.: 5.000
## Median : 6.000 Mode :character Median :1990 Median : 7.000
## Mean : 7.067 Mean :1986 Mean : 9.364
## 3rd Qu.:11.000 3rd Qu.:2007 3rd Qu.:12.000
## Max. :24.000 Max. :2022 Max. :27.000
## wins losses ties shootout_wins
## Min. : 0.000 Min. : 0.000 Min. :0.00000 Min. : 0.0000
## 1st Qu.: 1.000 1st Qu.: 4.000 1st Qu.:0.00000 1st Qu.: 0.0000
## Median : 3.000 Median : 4.000 Median :0.00000 Median : 1.0000
## Mean : 4.657 Mean : 4.657 Mean :0.04955 Mean : 0.9326
## 3rd Qu.: 7.000 3rd Qu.: 6.000 3rd Qu.:0.00000 3rd Qu.: 1.0000
## Max. :18.000 Max. :12.000 Max. :4.00000 Max. :10.0000
## shootout_losses win_loss_percentage goals_scored goals_against
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:0.3330 1st Qu.:11.00 1st Qu.:16.00
## Median :1.0000 Median :0.4290 Median :20.00 Median :22.00
## Mean :0.9326 Mean :0.4112 Mean :26.63 Mean :26.63
## 3rd Qu.:1.0000 3rd Qu.:0.5450 3rd Qu.:37.00 3rd Qu.:35.00
## Max. :4.0000 Max. :1.0000 Max. :98.00 Max. :91.00
## goal_differential
## Min. :-27
## 1st Qu.: -6
## Median : -2
## Mean : 0
## 3rd Qu.: 3
## Max. : 49
The target variable is goal_differential (the goal difference).
The main predictors are the hockey-relevant variables
goals_scored, goals_against,
wins, win_loss_percentage and
games.
# basic overview
table(is.na(nhl$goal_differential))
##
## FALSE
## 1009
sapply(nhl[c("goals_scored","goals_against","wins","win_loss_percentage","games")], function(x) round(mean(x, na.rm=TRUE),2))
## goals_scored goals_against wins win_loss_percentage
## 26.63 26.63 4.66 0.41
## games
## 9.36
We first estimate a model with the predictors goals_scored,
goals_against, wins,
win_loss_percentage and games:
model1 <- lm(goal_differential ~ goals_scored + goals_against + wins + win_loss_percentage + games, data=nhl)
summary(model1)
##
## Call:
## lm(formula = goal_differential ~ goals_scored + goals_against +
## wins + win_loss_percentage + games, data = nhl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.754e-14 -8.140e-16 -5.400e-17 6.470e-16 1.879e-13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.579e-15 8.503e-16 -4.209e+00 2.79e-05 ***
## goals_scored 1.000e+00 4.581e-17 2.183e+16 < 2e-16 ***
## goals_against -1.000e+00 4.714e-17 -2.121e+16 < 2e-16 ***
## wins 5.151e-16 3.297e-16 1.563e+00 0.1185
## win_loss_percentage -3.249e-15 1.768e-15 -1.837e+00 0.0665 .
## games -2.302e-16 2.171e-16 -1.060e+00 0.2894
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439e-15 on 1003 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.148e+32 on 5 and 1003 DF, p-value: < 2.2e-16
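Note: in this dataset the identity
goal_differential = goals_scored - goals_against holds.
Including both goals_scored and goals_against therefore lets the model
reproduce the response exactly: their coefficients are 1 and -1, R^2 = 1,
and the residuals are rounding error (at most about 1.9e-13 in absolute value).
This is an exact fit to an identity rather than perfect multicollinearity among
the predictors (the VIF values reported below are finite). It makes the estimates
for the remaining variables, and any residual-based diagnostics of this model,
uninformative.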
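The identity noted above can be checked directly. The following is a minimal sketch on a small synthetic data frame: the name `toy` and all simulated values are hypothetical and are not taken from nhlplayoffs.csv. It shows that regressing a response on its own components returns coefficients of 1 and -1 with residuals at rounding-error level.

```r
# Synthetic stand-in for the playoff data (hypothetical values, not nhlplayoffs.csv)
set.seed(1)
n <- 200
toy <- data.frame(
  goals_scored  = rpois(n, 27),
  goals_against = rpois(n, 27),
  wins          = rpois(n, 5)
)
toy$goal_differential <- toy$goals_scored - toy$goals_against

# The identity holds row by row (integer arithmetic, so == is safe here)
identity_holds <- all(toy$goal_differential == toy$goals_scored - toy$goals_against)

# Regressing the response on its own components reproduces the identity
fit <- lm(goal_differential ~ goals_scored + goals_against + wins, data = toy)
coef_gs <- unname(coef(fit)["goals_scored"])
coef_ga <- unname(coef(fit)["goals_against"])
max_abs_resid <- max(abs(residuals(fit)))

print(identity_holds)
print(round(c(coef_gs, coef_ga), 6))
print(max_abs_resid < 1e-8)
```

On the real data the same check would presumably be `all(nhl$goal_differential == nhl$goals_scored - nhl$goals_against)`, assuming the columns contain no missing values.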
# Breusch-Pagan test for heteroskedasticity
library(lmtest)
bptest(model1)
##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 9.0401, df = 5, p-value = 0.1075
# Durbin-Watson
dwtest(model1)
##
## Durbin-Watson test
##
## data: model1
## DW = 1.5693, p-value = 2.658e-12
## alternative hypothesis: true autocorrelation is greater than 0
# VIF
library(car)
vif(model1)
## goals_scored goals_against wins win_loss_percentage
## 21.615570 12.646316 48.778562 3.368474
## games
## 38.435238
The output above shows very high VIF values for several variables (wins 48.8, games 38.4, goals_scored 21.6, goals_against 12.6), which indicates multicollinearity.
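For numeric predictors, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The sketch below computes this by hand in base R. The helper `vif_by_hand` and the synthetic data frame `toy` are hypothetical illustrations, not part of the original analysis.

```r
# Synthetic predictors that mimic the structure of the playoff data (hypothetical values)
set.seed(2)
n <- 300
games <- sample(4:27, n, replace = TRUE)
wins  <- rbinom(n, size = games, prob = 0.5)
toy <- data.frame(
  games = games,
  wins = wins,
  win_loss_percentage = wins / games
)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns
vif_by_hand <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    aux <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_by_hand(toy)
print(round(vifs, 2))
```

Applied to the predictor columns of a fitted model, this should agree with `vif()` from car when every term is a single numeric variable.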
To avoid regressing goal_differential on its own components, and to reduce the multicollinearity, we estimate the model without
goals_scored and goals_against:
model2 <- lm(goal_differential ~ wins + win_loss_percentage + games, data=nhl)
summary(model2)
##
## Call:
## lm(formula = goal_differential ~ wins + win_loss_percentage +
## games, data = nhl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.8840 -2.4688 0.0154 2.6990 19.5153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.30438 0.62038 2.103 0.0358 *
## wins 5.07489 0.18043 28.127 <2e-16 ***
## win_loss_percentage -0.07867 1.29233 -0.061 0.9515
## games -2.65986 0.11360 -23.413 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.711 on 1005 degrees of freedom
## Multiple R-squared: 0.7407, Adjusted R-squared: 0.7399
## F-statistic: 956.7 on 3 and 1005 DF, p-value: < 2.2e-16
# robust HC3 standard errors (vcovHC comes from the sandwich package)
library(sandwich)
coeftest(model2, vcov = vcovHC(model2, type = "HC3"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.304377 0.666204 1.9579 0.05052 .
## wins 5.074891 0.255608 19.8542 < 2e-16 ***
## win_loss_percentage -0.078674 1.289918 -0.0610 0.95138
## games -2.659864 0.155062 -17.1535 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
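To make the HC3 correction less of a black box, the sandwich formula (X'X)^-1 X' diag(e_i^2 / (1 - h_i)^2) X (X'X)^-1 can be written out in base R. This is a sketch on synthetic data whose error variance grows with the predictor by construction. The variable names and simulated values are assumptions for illustration only.

```r
# Synthetic data with heteroskedastic errors (hypothetical values)
set.seed(3)
n <- 250
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)  # error spread grows with x
fit <- lm(y ~ x)

X <- model.matrix(fit)  # design matrix including the intercept column
e <- residuals(fit)     # OLS residuals
h <- hatvalues(fit)     # leverages (diagonal of the hat matrix)

bread <- solve(crossprod(X))           # (X'X)^{-1}
meat  <- crossprod(X * (e / (1 - h)))  # X' diag(e^2 / (1 - h)^2) X
vcov_hc3 <- bread %*% meat %*% bread

se_classic <- sqrt(diag(vcov(fit)))
se_hc3     <- sqrt(diag(vcov_hc3))
print(round(cbind(classic = se_classic, HC3 = se_hc3), 4))
```

The matrix `vcov_hc3` plays the role of `vcovHC(fit, type = "HC3")` and could be passed to `coeftest()` through its `vcov` argument.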
# diagnostics
bptest(model2)
##
## studentized Breusch-Pagan test
##
## data: model2
## BP = 109.82, df = 3, p-value < 2.2e-16
dwtest(model2)
##
## Durbin-Watson test
##
## data: model2
## DW = 2.198, p-value = 0.9991
## alternative hypothesis: true autocorrelation is greater than 0
vif(model2)
## wins win_loss_percentage games
## 27.293099 3.360903 19.657305
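The two residual diagnostics above can also be computed by hand, which shows what they measure. The studentized Breusch-Pagan statistic is n times the R^2 from regressing the squared residuals on the predictors. The Durbin-Watson statistic is the sum of squared successive residual differences divided by the residual sum of squares. The sketch uses synthetic data with hypothetical values and does not reproduce the numbers reported for model2.

```r
# Synthetic data with heteroskedastic errors (hypothetical values)
set.seed(4)
n <- 250
x1 <- runif(n, 1, 10)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.5 * x1)
fit <- lm(y ~ x1 + x2)
e <- residuals(fit)

# Studentized (Koenker) Breusch-Pagan: n * R^2 of squared residuals on the predictors
e2  <- e^2
aux <- lm(e2 ~ x1 + x2)
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# Durbin-Watson: values near 2 indicate no first-order autocorrelation
dw_stat <- sum(diff(e)^2) / sum(e^2)

print(c(BP = bp_stat, df = bp_df, p = bp_p, DW = dw_stat))
```

The Durbin-Watson statistic depends on the row order of the data, so it is only meaningful when the rows have a natural sequence such as time.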
The key results computed in the automated analysis are reproduced below.
Coefficients (robust HC3 standard errors):
| Predictor | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | 1.3044 | 0.6662 | 1.9579 | 0.0505 |
| wins | 5.0749 | 0.2556 | 19.8542 | < 2e-16 |
| win_loss_percentage | -0.0787 | 1.2899 | -0.0610 | 0.9514 |
| games | -2.6599 | 0.1551 | -17.1535 | < 2e-16 |
VIF (model2); the const row refers to the intercept, for which vif(model2) above reports no value:
| Variable | VIF |
|---|---|
| const | 17.4973 |
| wins | 27.2931 |
| win_loss_percentage | 3.3609 |
| games | 19.6573 |
wins has a positive and highly significant coefficient: with games and win_loss_percentage held fixed, each additional win is associated with an increase of about 5.07 in goal_differential.

games has a negative and significant coefficient in this model (about -2.66). The summary means are consistent with games = wins + losses + ties, in which case an additional game at a fixed number of wins is a loss or a tie, so a negative sign is plausible. The coefficient should still be interpreted with caution.

win_loss_percentage is not significant in this model once the other variables are controlled for (p ≈ 0.95).

The VIF values for wins and games are high (27.3 and 19.7), which indicates multicollinearity between the number of games and the number of wins (as expected). Possible remedies: remove one of the collinear variables, use PCA, or apply a regularisation method.

I worked through the exercise following the content of the original Rmd and applied its steps to the file nhlplayoffs.csv.
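The overlap between wins and games, and the PCA remedy mentioned above, can be illustrated with a short base R sketch. The data are synthetic (hypothetical values in which wins are drawn as a share of games), so the numbers only indicate the kind of result to expect and are not estimates for nhlplayoffs.csv.

```r
# Synthetic wins and games with a built-in dependence (hypothetical values)
set.seed(5)
n <- 300
games <- sample(4:27, n, replace = TRUE)
wins  <- rbinom(n, size = games, prob = 0.5)

# Correlation between the two collinear predictors
r_wins_games <- cor(wins, games)

# PCA on the standardized predictors; the first component captures their shared variation
pca <- prcomp(cbind(wins, games), center = TRUE, scale. = TRUE)
var_share <- pca$sdev^2 / sum(pca$sdev^2)

print(round(r_wins_games, 3))
print(round(var_share, 3))
```

If the first component carries most of the variance, its scores (`pca$x[, 1]`) could replace the two original predictors in the regression. The cost is that the coefficient no longer refers to wins or games individually.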