From \(\texttt{summary(full)}\), we can see that the most significant coefficient is of PGA, which represents whether a film won at the Producers Guild Awards. The extremely small p-value of 8.69e-13 indicates that we should reject the null hypothesis \(H_0 : \hat{\beta}_{\text{PGA}} = 0\) at any reasonable significance level, and that PGA should therefore be included in the model.
This coefficient has an estimated value of 3.751. Becuase we are using logistic regression, we exponentiate to get the odds, 42.56. This means that a film with a PGA win (all else equal) is approximately 43 times more likely to win Best Picture than a film without.
Recall that an approximate \(100(1-\alpha)\%\) confidence interval for \(\beta_j\) is \[\begin{equation*} \hat{\beta} _j \pm Z_{\alpha /2} \textrm{se}\left(\hat{\beta}_j\right). \end{equation*}\] Thus an approximate \(99\%\) confidence interval for the PGA coefficient is \[\begin{equation*} 3.751 \pm Z_{0.01 /2} \left(0.4994\right) = 3.751 \pm 2.33 \left(0.4994\right), \end{equation*}\] where we have used the Cambridge Statistical Tables to obtain an approximate value for \(Z_{\alpha /2}\). In interval notation, this is \[\begin{equation*} \left[2.587, 4.915\right]. \end{equation*}\]
We perform stepwise selection to find the optimal model. (“Optimal” in the sense of having the smallest AIC.)
This results in the inclusion of Dir, Edi, Dan, Gdr, Gd, PGA, DGA, Romance, SciFi, Days, PG, PG13, R, NSFC and WR, and the exclusion of all others.
This model has an AIC of 313.92, compared to 388.01 for the full model.
Using the \(\texttt{pROC}\) package, we find the AUC to be 0.9275. Being close to 1, this indicates that the optimal model is very good at discrimination.
The optimal threshold is 0.1512 (4 dp), with a sensitivity of 0.8333.
See the table in the appendix. As Anora has the highest probability, we predict that Anora will win Best Picture.
# Preamble
setwd("/Users/douglas/Documents/Oscars")
library(leaps)
library(pROC)
oscars <- read.csv("oscars.csv",header = TRUE)
oscars1 <- subset(oscars, Ch != 0)
oscars1 <- oscars1[,-c(1,2,5)]
oscars1$Win <- ifelse(oscars1 $ Ch == 1, 1, 0)
# Question 1
full <- glm(Win ~ . -Ch, data = oscars1, family="binomial")
summary(full)
Call:
glm(formula = Win ~ . - Ch, family = "binomial", data = oscars1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.459e+00 3.063e+00 -3.088 0.00202 **
Nom -1.143e-02 3.910e-01 -0.029 0.97668
Dir 1.780e+00 6.959e-01 2.557 0.01055 *
Aml -1.211e-02 4.934e-01 -0.025 0.98041
Afl 2.905e-01 5.629e-01 0.516 0.60578
Ams 6.048e-01 4.765e-01 1.269 0.20438
Afs 1.481e-01 4.990e-01 0.297 0.76660
Scr 3.845e-01 6.442e-01 0.597 0.55063
Cin -1.487e-01 5.846e-01 -0.254 0.79921
Art 2.350e-01 5.889e-01 0.399 0.68985
Cos -5.047e-01 6.502e-01 -0.776 0.43754
Sco 2.931e-01 5.493e-01 0.534 0.59357
Son -1.610e-01 8.498e-01 -0.190 0.84970
Edi 1.207e+00 5.636e-01 2.142 0.03223 *
Sou -5.270e-01 6.362e-01 -0.828 0.40744
For 4.318e-01 1.465e+00 0.295 0.76824
Anf 1.065e+00 4.817e+03 0.000 0.99982
Eff 2.175e-01 6.460e-01 0.337 0.73630
Mak 1.457e+00 9.438e-01 1.544 0.12257
Dan 2.769e+00 1.628e+00 1.701 0.08895 .
AD 1.683e+00 1.314e+00 1.281 0.20003
Gdr 1.305e+00 4.610e-01 2.830 0.00465 **
Gmc -1.290e-01 7.790e-01 -0.166 0.86850
Gd -2.950e+00 1.770e+00 -1.667 0.09556 .
Gm1 1.467e+00 1.269e+00 1.157 0.24740
Gm2 -1.193e+00 3.497e+00 -0.341 0.73293
Gf1 -1.558e+00 1.807e+00 -0.862 0.38856
Gf2 -1.435e+01 1.772e+03 -0.008 0.99354
PGA 3.571e+00 4.994e-01 7.150 8.69e-13 ***
DGA 1.905e+00 1.331e+00 1.431 0.15240
Action -6.718e-01 7.969e-01 -0.843 0.39925
Adventure -2.477e-01 7.548e-01 -0.328 0.74279
Animation -1.172e+01 3.956e+03 -0.003 0.99764
Biography -5.795e-01 6.509e-01 -0.890 0.37334
Comedy -1.364e-01 5.373e-01 -0.254 0.79953
Crime 1.021e+00 6.313e-01 1.617 0.10594
Docu -1.333e+01 3.956e+03 -0.003 0.99731
Drama -9.871e-01 6.238e-01 -1.582 0.11359
Family 1.222e+00 8.718e-01 1.402 0.16102
Fantasy -1.039e+00 1.169e+00 -0.888 0.37443
Film.noir -6.244e-01 1.425e+00 -0.438 0.66130
History 3.740e-01 7.037e-01 0.531 0.59508
Horror -8.303e-01 2.053e+00 -0.404 0.68593
Music 7.516e-01 9.649e-01 0.779 0.43604
Musical 8.186e-01 8.582e-01 0.954 0.34012
Mystery 6.285e-01 7.954e-01 0.790 0.42944
Romance 4.362e-01 4.183e-01 1.043 0.29704
SciFi -1.084e+00 1.637e+00 -0.662 0.50792
Sport 5.060e-01 1.178e+00 0.430 0.66745
Thriller -8.074e-01 7.014e-01 -1.151 0.24973
War 9.575e-01 6.376e-01 1.502 0.13317
Western -3.627e-02 9.434e-01 -0.038 0.96933
Length 1.789e-03 8.491e-03 0.211 0.83311
Days 2.381e-03 1.621e-03 1.469 0.14174
G -1.985e+00 2.029e+00 -0.978 0.32803
PG -1.135e+00 6.883e-01 -1.649 0.09922 .
PG13 -1.424e+00 8.666e-01 -1.643 0.10038
R -1.438e+00 7.088e-01 -2.030 0.04241 *
U -1.384e+01 1.989e+03 -0.007 0.99445
Ebert 1.325e-01 1.683e-01 0.787 0.43115
NYFCC 1.660e-01 4.684e-01 0.354 0.72304
LAFCA -8.886e-01 7.154e-01 -1.242 0.21423
NSFC 1.880e+00 7.502e-01 2.506 0.01221 *
NBR -1.391e-01 4.917e-01 -0.283 0.77719
WR 5.265e-01 3.997e-01 1.317 0.18771
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 528.99 on 603 degrees of freedom
Residual deviance: 258.01 on 539 degrees of freedom
AIC: 388.01
Number of Fisher Scoring iterations: 16
# Question 2
optimal = step(full, direction="both", trace = 0)
summary(optimal)
Call:
glm(formula = Win ~ Dir + Edi + Dan + Gdr + Gd + PGA + DGA +
Romance + SciFi + Days + PG + PG13 + R + NSFC + WR, family = "binomial",
data = oscars1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.428192 2.462220 -3.829 0.000129 ***
Dir 1.574330 0.494778 3.182 0.001463 **
Edi 1.169340 0.373104 3.134 0.001724 **
Dan 3.131095 1.300739 2.407 0.016077 *
Gdr 1.170446 0.393905 2.971 0.002965 **
Gd -2.476237 1.353463 -1.830 0.067316 .
PGA 3.320186 0.408168 8.134 4.14e-16 ***
DGA 1.733160 0.996429 1.739 0.081969 .
Romance 0.545207 0.363354 1.500 0.133490
SciFi -2.375553 1.384500 -1.716 0.086195 .
Days 0.002247 0.001310 1.715 0.086365 .
PG -0.865544 0.546307 -1.584 0.113113
PG13 -0.980035 0.580730 -1.688 0.091489 .
R -1.041853 0.446238 -2.335 0.019557 *
NSFC 1.348854 0.523822 2.575 0.010023 *
WR 0.590864 0.320739 1.842 0.065447 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 528.99 on 603 degrees of freedom
Residual deviance: 281.92 on 588 degrees of freedom
AIC: 313.92
Number of Fisher Scoring iterations: 6
# Question 3
prob <- predict(optimal, type="response")
ROC1 <- roc(oscars1$Win, prob)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
str(ROC1)
List of 15
$ percent : logi FALSE
$ sensitivities : num [1:605] 1 1 1 1 1 1 1 1 1 1 ...
$ specificities : num [1:605] 0 0.00197 0.00394 0.00591 0.00787 ...
$ thresholds : num [1:605] -Inf 0.000351 0.000465 0.000601 0.000713 ...
$ direction : chr "<"
$ cases : Named num [1:96] 0.573 0.579 0.136 0.875 0.28 ...
..- attr(*, "names")= chr [1:96] "16" "24" "35" "45" ...
$ controls : Named num [1:508] 0.731998 0.43912 0.000417 0.172277 0.006449 ...
..- attr(*, "names")= chr [1:508] "11" "12" "13" "14" ...
$ fun.sesp :function (thresholds, controls, cases, direction)
$ auc : 'auc' num 0.926
..- attr(*, "partial.auc")= logi FALSE
..- attr(*, "percent")= logi FALSE
..- attr(*, "roc")=List of 15
.. ..$ percent : logi FALSE
.. ..$ sensitivities : num [1:605] 1 1 1 1 1 1 1 1 1 1 ...
.. ..$ specificities : num [1:605] 0 0.00197 0.00394 0.00591 0.00787 ...
.. ..$ thresholds : num [1:605] -Inf 0.000351 0.000465 0.000601 0.000713 ...
.. ..$ direction : chr "<"
.. ..$ cases : Named num [1:96] 0.573 0.579 0.136 0.875 0.28 ...
.. .. ..- attr(*, "names")= chr [1:96] "16" "24" "35" "45" ...
.. ..$ controls : Named num [1:508] 0.731998 0.43912 0.000417 0.172277 0.006449 ...
.. .. ..- attr(*, "names")= chr [1:508] "11" "12" "13" "14" ...
.. ..$ fun.sesp :function (thresholds, controls, cases, direction)
.. ..$ auc : 'auc' num 0.926
.. .. ..- attr(*, "partial.auc")= logi FALSE
.. .. ..- attr(*, "percent")= logi FALSE
.. .. ..- attr(*, "roc")=List of 8
.. .. .. ..$ percent : logi FALSE
.. .. .. ..$ sensitivities: num [1:605] 1 1 1 1 1 1 1 1 1 1 ...
.. .. .. ..$ specificities: num [1:605] 0 0.00197 0.00394 0.00591 0.00787 ...
.. .. .. ..$ thresholds : num [1:605] -Inf 0.000351 0.000465 0.000601 0.000713 ...
.. .. .. ..$ direction : chr "<"
.. .. .. ..$ cases : Named num [1:96] 0.573 0.579 0.136 0.875 0.28 ...
.. .. .. .. ..- attr(*, "names")= chr [1:96] "16" "24" "35" "45" ...
.. .. .. ..$ controls : Named num [1:508] 0.731998 0.43912 0.000417 0.172277 0.006449 ...
.. .. .. .. ..- attr(*, "names")= chr [1:508] "11" "12" "13" "14" ...
.. .. .. ..$ fun.sesp :function (thresholds, controls, cases, direction)
.. .. .. ..- attr(*, "class")= chr "roc"
.. ..$ call : language roc.default(response = oscars1$Win, predictor = prob)
.. ..$ original.predictor: Named num [1:604] 0.731998 0.43912 0.000417 0.172277 0.006449 ...
.. .. ..- attr(*, "names")= chr [1:604] "11" "12" "13" "14" ...
.. ..$ original.response : num [1:604] 0 0 0 0 0 1 0 0 0 0 ...
.. ..$ predictor : Named num [1:604] 0.731998 0.43912 0.000417 0.172277 0.006449 ...
.. .. ..- attr(*, "names")= chr [1:604] "11" "12" "13" "14" ...
.. ..$ response : num [1:604] 0 0 0 0 0 1 0 0 0 0 ...
.. ..$ levels : chr [1:2] "0" "1"
.. ..- attr(*, "class")= chr "roc"
$ call : language roc.default(response = oscars1$Win, predictor = prob)
$ original.predictor: Named num [1:604] 0.731998 0.43912 0.000417 0.172277 0.006449 ...
..- attr(*, "names")= chr [1:604] "11" "12" "13" "14" ...
$ original.response : num [1:604] 0 0 0 0 0 1 0 0 0 0 ...
$ predictor : Named num [1:604] 0.731998 0.43912 0.000417 0.172277 0.006449 ...
..- attr(*, "names")= chr [1:604] "11" "12" "13" "14" ...
$ response : num [1:604] 0 0 0 0 0 1 0 0 0 0 ...
$ levels : chr [1:2] "0" "1"
- attr(*, "class")= chr "roc"
par(pty = "s")
plot(ROC1, main="ROC for optimal model", print.auc=TRUE)
ROC1$auc
Area under the curve: 0.9257
ind <- which.min( (ROC1$sensitivities-1)^2 + (ROC1$specificities-1)^2 )[[1]]
thres <- ROC1$thresholds[ind]
thres
[1] 0.1512266
coords(ROC1, x = thres, input="threshold", ret="sensitivity")
# Question 4
oscars2024 <- subset(oscars, Ch==0)
probs <- predict(optimal, newdata=oscars2024, type="response")
probs_norm <- probs/sum(probs)
results <- data.frame(Movie = oscars2024$Name, Probability_of_winning = probs_norm)
results <- results[order(-results$Probability_of_winning),]
results