\( \frac{n-1}{n} \). There are n possible choices for the first bootstrap observation, \( \{1, 2, 3, \dotsc, n\} \), and n − 1 of those choices are not the jth observation from the original data set. Since each draw is made uniformly at random, the probability that the first bootstrap observation is not the jth observation is
\[ \frac{\text{number of options that are not the jth observation}}{\text{total number of options for the first bootstrap observation}} = \frac{n-1}{n}. \]
The probability that the jth observation is not the first bootstrap observation is \( \frac{n-1}{n} \);
the probability that the jth observation is not the second bootstrap observation is \( \frac{n-1}{n} \); \( \dotsc \); the probability that the jth observation is not the nth bootstrap observation is \( \frac{n-1}{n} \).
Because each draw is independent and we draw n times, the probability that the jth observation is not in the bootstrap sample is \( \left(\frac{n-1}{n}\right)^n \), that is, \( (1 − 1/n)^n \).
\( 1-(1 − 1/n)^n=1-(1 − 1/5)^5=0.67232 \)
\( 1-(1 − 1/n)^n=1-(1 − 1/100)^{100}=0.6339677 \)
\( 1-(1 − 1/n)^n=1-(1 − 1/10000)^{10000}=0.632139 \)
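These three values can be verified numerically in R with a one-line vectorized check; a minimal sketch:
n <- c(5, 100, 10000)
1 - (1 - 1/n)^n  # prints approximately 0.67232 0.63397 0.63214, matching the answers above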
Create a plot that displays, for each integer value of n from 1 to 100,000, the probability that the jth observation is in the bootstrap sample. Comment on what you observe.
By the standard limit, \( (1-1/n)^n \) converges to \( 1/e \) as n grows infinitely large, so \( 1 - (1-1/n)^n \) converges to \( 1-1/e \approx 0.6321 \). The plot shows this convergence.
# Compute 1 - (1 - 1/n)^n for each n from 1 to 100,000
prob <- numeric(1e+05)
x <- numeric(1e+05)
for (i in 1:1e+05) {
    prob[i] <- 1 - ((i - 1)/i)^i  # same as 1 - (1 - 1/i)^i
    x[i] <- i
}
plot(x, prob, type = "l")
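To make the limit visible, one can overlay the asymptote \( 1-1/e \) on the plot; a small addition, assuming the plot above is still the active graphics device:
# Dashed horizontal reference line at the limiting value 1 - 1/e (about 0.6321)
abline(h = 1 - exp(-1), lty = 2, col = "red")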
The mean is 0.6348, which is very close to the answer from (e). More replications would give a more accurate estimate.
set.seed(1232321)
store = rep(NA, 10000)
for (i in 1:10000) {
store[i] = sum(sample(1:100, replace = TRUE) == 4) > 0
}
mean(store)
## [1] 0.6348
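The estimate tightens with more replications; a sketch with ten times as many (the 1e+05 replication count is my choice, not part of the original):
set.seed(1232321)
store2 = rep(NA, 1e+05)
for (i in 1:1e+05) {
    # TRUE when observation 4 appears at least once in a bootstrap sample of size 100
    store2[i] = sum(sample(1:100, replace = TRUE) == 4) > 0
}
mean(store2)  # should land nearer 1 - (1 - 1/100)^100 = 0.6340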
These three questions are hand written on a separate sheet of paper.
I created two linearly well-separated clusters as required, with c equal to 3.0: if x was less than 3.0, the observation fell in category "0"; otherwise it fell in category "1".
First I made the cluster corresponding to category "0":
require(ISLR)
## Loading required package: ISLR
# 30 x-grid values on [0, 2.9]; y values are 100 uniform draws on [-5, 5]
cluster1x <- seq(0, 2.9, 0.1)
cluster1y <- runif(100, -5, 5)
cluser1bigx <- numeric(3000)
cluser1bigy <- numeric(3000)
count <- 1
for (i in 1:30) {
for (j in 1:100) {
cluser1bigx[count] = cluster1x[i]
cluser1bigy[count] = cluster1y[j]
count <- count + 1
}
}
cluster1Res <- rep("0", 3000)
Here is the plot of the first cluster:
plot(cluser1bigx, cluser1bigy)
Second I made the cluster corresponding to category “1”:
# 30 x-grid values on [3.1, 6]; y values are 100 uniform draws on [-5, 5]
cluster2x <- seq(3.1, 6, 0.1)
cluster2y <- runif(100, -5, 5)
cluser2bigx <- numeric(3000)
cluser2bigy <- numeric(3000)
count2 <- 1
for (i in 1:30) {
for (j in 1:100) {
cluser2bigx[count2] = cluster2x[i]
cluser2bigy[count2] = cluster2y[j]
count2 <- count2 + 1
}
}
cluster2Res <- rep("1", 3000)
plot(cluser2bigx, cluser2bigy)
Then I combined all the x, y, and categorical responses into one single data frame:
bigclusterX <- c(cluser1bigx, cluser2bigx)
bigclusterY <- c(cluser1bigy, cluser2bigy)
bigclusterRes <- c(cluster1Res, cluster2Res)
wholeData <- data.frame(bigclusterX, bigclusterY, bigclusterRes)
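As an aside, the nested loops above can be replaced by a vectorized construction; a sketch using expand.grid, assuming the same cluster1x/cluster1y/cluster2x/cluster2y draws as above (the row order differs from the loop version, but the point set and labels are identical):
# Cartesian product of the x grid and the y draws for each cluster, then stack
grid1 <- expand.grid(x = cluster1x, y = cluster1y)  # 30 x 100 = 3000 points
grid2 <- expand.grid(x = cluster2x, y = cluster2y)
wholeData2 <- data.frame(bigclusterX = c(grid1$x, grid2$x),
    bigclusterY = c(grid1$y, grid2$y),
    bigclusterRes = rep(c("0", "1"), each = 3000))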
Then I plotted the data as a whole:
plot(bigclusterX, bigclusterY, col = ifelse(bigclusterRes == "0", "red", "blue"))
You can see that the data are well separated at c = 3. I then fit a logistic regression model to the data; the warnings occur as described in the question:
glm.fit = glm(bigclusterRes ~ bigclusterX + bigclusterY, data = wholeData, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
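One way to see the perfect-separation problem directly is to inspect the fitted values, which pile up at (numerically) 0 and 1 while the coefficient estimates blow up; a quick check:
# Under perfect separation the MLE does not exist: fitted probabilities
# are numerically 0 or 1 and the coefficient estimates are huge
summary(glm.fit$fitted.values)
coef(glm.fit)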
Teamed with Marcello Hasegawa.
Fit the logistic regression model above to the data and examine the rankings. What happened to make the teams saint-mary-saint-mary and st.-thomas-(tx)-celts look so good? Can you explain it in terms of your answer to the previous question?
First, the logistic regression model gives far less credit than the linear regression does for running up the score, since it uses only who won rather than the margin of victory. This lowers the rank of teams that ran up the score against their opponents.
Second, the teams saint-mary-saint-mary and st.-thomas-(tx)-celts each played only one game, and each finished with a relatively wide margin (+332 and −86 respectively). In terms of question 3, we maximize the likelihood function, and a team that appears in only one game is barely constrained by the data: the likelihood is increased by driving that team's coefficient to a very large value. This is why these two teams rank so high in the logistic regression.
# 1. loading the data
games <- read.csv("http://www.stanford.edu/~wfithian/games.csv", as.is = TRUE)
teams <- read.csv("http://www.stanford.edu/~wfithian/teams.csv", as.is = TRUE)
head(games)
## date home away
## 1 2012-11-09 albany-great-danes duquesne-dukes
## 2 2012-11-11 ohio-state-buckeyes albany-great-danes
## 3 2012-11-13 washington-huskies albany-great-danes
## 4 2012-11-17 albany-great-danes umkc-kangaroos
## 5 2012-11-18 albany-great-danes loyola-(md)-greyhounds
## 6 2012-11-20 south-carolina-state-bulldogs albany-great-danes
## homeScore awayScore neutralLocation gameType
## 1 69 66 0 REG
## 2 82 60 0 REG
## 3 62 63 0 REG
## 4 62 59 1 REG
## 5 64 67 1 REG
## 6 55 83 0 REG
head(teams)
## team conference inTourney apRank usaTodayRank
## 1 stony-brook-seawolves america-east 0 NA NA
## 2 vermont-catamounts america-east 0 NA NA
## 3 boston-university-terriers america-east 0 NA NA
## 4 hartford-hawks america-east 0 NA NA
## 5 albany-great-danes america-east 1 NA NA
## 6 maine-black-bears america-east 0 NA NA
all.teams <- sort(unique(c(teams$team, games$home, games$away)))
games$home <- factor(games$home, levels = all.teams)
games$away <- factor(games$away, levels = all.teams)
teams$team <- factor(teams$team, levels = all.teams)
teams$conference <- factor(teams$conference)
# Rank the teams.
total.margin <- function(team) {
with(games, sum(homeScore[home == team]) + sum(awayScore[away == team]) -
sum(homeScore[away == team]) - sum(awayScore[home == team]))
}
margins <- sapply(teams$team, total.margin)
names(margins) <- teams$team
rank.table <- cbind(Margin = margins, `Margin Rank` = rank(-margins, ties = "min"),
`AP Rank` = teams$apRank, `USAT Rank` = teams$usaTodayRank)
margin.top25 <- order(margins, decreasing = TRUE)[1:25]
rank.table[margin.top25, ]
## Margin Margin Rank AP Rank USAT Rank
## florida-gators 630 1 14 12
## louisville-cardinals 627 2 2 2
## indiana-hoosiers 595 3 4 5
## gonzaga-bulldogs 554 4 1 1
## kansas-jayhawks 490 5 3 3
## syracuse-orange 469 6 16 18
## weber-state-wildcats 467 7 NA NA
## michigan-wolverines 465 8 10 11
## virginia-commonwealth-rams 447 9 NA 23
## pittsburgh-panthers 435 10 20 22
## middle-tennessee-blue-raiders 431 11 NA NA
## duke-blue-devils 428 12 6 7
## creighton-bluejays 406 13 22 21
## ohio-state-buckeyes 402 14 7 6
## saint-mary's-gaels 381 15 NA 25
## davidson-wildcats 380 16 NA NA
## ole-miss-rebels 370 17 NA NA
## saint-louis-billikens 351 18 13 13
## belmont-bruins 351 18 NA NA
## memphis-tigers 344 20 19 15
## arizona-wildcats 344 20 21 20
## wichita-state-shockers 335 22 NA NA
## stephen-f.-austin-lumberjacks 332 23 NA NA
## miami-(fl)-hurricanes 328 24 5 4
## missouri-tigers 320 25 NA NA
# Linear Regression for Ranking team
y <- with(games, homeScore - awayScore)
X0 <- as.data.frame(matrix(0, nrow(games), length(all.teams)))
names(X0) <- all.teams
for (tm in all.teams) {
X0[[tm]] <- 1 * (games$home == tm) - 1 * (games$away == tm)
}
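Each row of X0 contains exactly one +1 (the home team) and one −1 (the away team), so every row sums to zero; the columns are therefore linearly dependent, which is the identifiability problem handled next. A quick sanity check (a sketch):
# Every game contributes one +1 and one -1, so each row of X0 sums to zero
stopifnot(all(rowSums(X0) == 0))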
# 4. Identifiability Problem
stanf.ix <- which(names(X0) == "stanford-cardinal")
X <- X0[, -stanf.ix]
reg.season.games <- which(games$gameType == "REG")
mod <- lm(y ~ 0 + ., data = X, subset = reg.season.games)
head(coef(summary(mod)))
## Estimate Std. Error t value Pr(>|t|)
## `air-force-falcons` -6.688 2.898 -2.3080 2.104e-02
## `akron-zips` -2.558 2.841 -0.9002 3.681e-01
## `alabama-a&m-bulldogs` -30.590 2.909 -10.5167 1.338e-25
## `alabama-crimson-tide` -2.851 2.802 -1.0173 3.090e-01
## `alabama-state-hornets` -29.958 2.849 -10.5143 1.372e-25
## `albany-great-danes` -13.335 2.818 -4.7313 2.292e-06
summary(mod)$r.squared
## [1] 0.5512
# Home Court Advantage
homeAdv <- 1 - games$neutralLocation
homeAdv.mod <- lm(y ~ 0 + homeAdv + ., data = X, subset = reg.season.games)
homeAdv.coef <- coef(homeAdv.mod)[paste("`", all.teams, "`", sep = "")]
names(homeAdv.coef) <- all.teams
head(coef(summary(homeAdv.mod)))
## Estimate Std. Error t value Pr(>|t|)
## homeAdv 3.5285 0.1569 22.4920 8.655e-107
## `air-force-falcons` -5.2991 2.7621 -1.9185 5.510e-02
## `akron-zips` -0.9668 2.7085 -0.3570 7.211e-01
## `alabama-a&m-bulldogs` -27.1437 2.7761 -9.7777 2.227e-22
## `alabama-crimson-tide` -2.6111 2.6706 -0.9777 3.283e-01
## `alabama-state-hornets` -26.3396 2.7200 -9.6836 5.519e-22
# Home Court Advantage by Logistic Regression
newY <- ifelse(y > 0, 1, 0)
homeAdv.mod.logit <- glm(newY ~ 0 + homeAdv + ., data = X, subset = reg.season.games,
family = binomial)
homeAdv.coef.logit <- coef(homeAdv.mod.logit)[paste("`", all.teams, "`", sep = "")]
names(homeAdv.coef.logit) <- all.teams
head(coef(summary(homeAdv.mod.logit)))
## Estimate Std. Error z value Pr(>|z|)
## homeAdv 0.679812 0.04032 16.860919 8.722e-64
## `air-force-falcons` 0.117271 0.70171 0.167122 8.673e-01
## `akron-zips` 0.228891 0.73603 0.310980 7.558e-01
## `alabama-a&m-bulldogs` -4.576627 0.83025 -5.512339 3.541e-08
## `alabama-crimson-tide` -0.004103 0.66055 -0.006211 9.950e-01
## `alabama-state-hornets` -4.590446 0.77545 -5.919731 3.225e-09
# Model Ranking Table
rank.table <- cbind(`Model Score` = homeAdv.coef, `Model Rank` = rank(-homeAdv.coef,
ties = "min"), margins = margins)
## Warning: number of rows of result is not a multiple of vector length (arg
## 3)
rank.table[order(homeAdv.coef, decreasing = TRUE)[1:25], ]
## Model Score Model Rank margins
## indiana-hoosiers 13.255 1 17
## florida-gators 12.639 2 277
## louisville-cardinals 12.081 3 298
## gonzaga-bulldogs 9.988 4 1
## duke-blue-devils 9.394 5 -236
## kansas-jayhawks 8.632 6 -208
## ohio-state-buckeyes 7.897 7 56
## michigan-wolverines 7.505 8 -366
## pittsburgh-panthers 7.356 9 155
## syracuse-orange 7.070 10 242
## wisconsin-badgers 6.831 11 286
## michigan-state-spartans 6.176 12 -263
## creighton-bluejays 5.284 13 -106
## miami-(fl)-hurricanes 5.192 14 -288
## virginia-commonwealth-rams 4.957 15 -53
## arizona-wildcats 4.685 16 196
## minnesota-golden-gophers 4.422 17 335
## georgetown-hoyas 4.259 18 -134
## missouri-tigers 3.962 19 -25
## oklahoma-state-cowboys 3.691 20 294
## saint-mary's-gaels 3.372 21 224
## colorado-state-rams 3.152 22 93
## new-mexico-lobos 3.004 23 5
## north-carolina-tar-heels 2.715 24 -163
## ole-miss-rebels 2.645 25 3
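Note the recycling warning above: margins has one entry per row of teams (named by teams$team), while homeAdv.coef has one entry per element of all.teams, so cbind recycles the shorter vector and the margins column in this table is misaligned (compare florida-gators' margin of 630 in the earlier table with the 277 shown here). One way to align the two by name before binding (a sketch):
# Align margins to the order of all.teams; teams absent from teams.csv get NA
margins.aligned <- margins[all.teams]
rank.table <- cbind(`Model Score` = homeAdv.coef, `Model Rank` = rank(-homeAdv.coef,
    ties = "min"), margins = margins.aligned)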
# Model Ranking Table for Logit
rank.table.logit <- cbind(`Model Linear Score` = homeAdv.coef, `Model Logistic Score` = homeAdv.coef.logit,
margins = margins, `Model Linear Rank` = rank(-homeAdv.coef, ties = "min"),
`Model Logistic Rank` = rank(-homeAdv.coef.logit, ties = "min"), `AP Rank` = teams$apRank,
`USAT Rank` = teams$usaTodayRank)
## Warning: number of rows of result is not a multiple of vector length (arg
## 3)
rank.table.logit[order(homeAdv.coef.logit, decreasing = TRUE)[1:25], ]
## Model Linear Score Model Logistic Score margins
## saint-mary-saint-mary -10.4184 14.127 332
## st.-thomas-(tx)-celts -5.6607 13.270 -86
## gonzaga-bulldogs 9.9875 2.771 1
## louisville-cardinals 12.0810 2.412 298
## kansas-jayhawks 8.6319 2.266 -208
## indiana-hoosiers 13.2552 2.255 17
## new-mexico-lobos 3.0044 2.239 5
## ohio-state-buckeyes 7.8965 2.137 56
## duke-blue-devils 9.3937 2.107 -236
## georgetown-hoyas 4.2590 2.032 -134
## michigan-state-spartans 6.1762 1.976 -263
## michigan-wolverines 7.5054 1.960 -366
## miami-(fl)-hurricanes 5.1916 1.761 -288
## kansas-state-wildcats 1.6525 1.749 -257
## syracuse-orange 7.0696 1.605 242
## memphis-tigers 1.0144 1.532 -23
## saint-louis-billikens 2.6373 1.522 -353
## marquette-golden-eagles 2.2985 1.508 -134
## butler-bulldogs -0.7322 1.460 207
## wisconsin-badgers 6.8313 1.451 286
## florida-gators 12.6386 1.316 277
## oklahoma-state-cowboys 3.6909 1.304 294
## unlv-rebels 2.1699 1.294 105
## arizona-wildcats 4.6849 1.292 196
## pittsburgh-panthers 7.3563 1.266 155
## Model Linear Rank Model Logistic Rank AP Rank
## saint-mary-saint-mary 162 1 NA
## st.-thomas-(tx)-celts 98 2 NA
## gonzaga-bulldogs 4 3 NA
## louisville-cardinals 3 4 NA
## kansas-jayhawks 6 5 NA
## indiana-hoosiers 1 6 NA
## new-mexico-lobos 23 7 NA
## ohio-state-buckeyes 7 8 NA
## duke-blue-devils 5 9 NA
## georgetown-hoyas 18 10 NA
## michigan-state-spartans 12 11 NA
## michigan-wolverines 8 12 NA
## miami-(fl)-hurricanes 14 13 NA
## kansas-state-wildcats 36 14 NA
## syracuse-orange 10 15 NA
## memphis-tigers 42 16 NA
## saint-louis-billikens 26 17 NA
## marquette-golden-eagles 31 18 NA
## butler-bulldogs 51 19 NA
## wisconsin-badgers 11 20 NA
## florida-gators 2 21 NA
## oklahoma-state-cowboys 20 22 NA
## unlv-rebels 32 23 NA
## arizona-wildcats 16 24 NA
## pittsburgh-panthers 9 25 NA
## USAT Rank
## saint-mary-saint-mary NA
## st.-thomas-(tx)-celts NA
## gonzaga-bulldogs NA
## louisville-cardinals NA
## kansas-jayhawks NA
## indiana-hoosiers NA
## new-mexico-lobos NA
## ohio-state-buckeyes NA
## duke-blue-devils NA
## georgetown-hoyas NA
## michigan-state-spartans NA
## michigan-wolverines NA
## miami-(fl)-hurricanes NA
## kansas-state-wildcats NA
## syracuse-orange NA
## memphis-tigers NA
## saint-louis-billikens NA
## marquette-golden-eagles NA
## butler-bulldogs NA
## wisconsin-badgers NA
## florida-gators NA
## oklahoma-state-cowboys NA
## unlv-rebels NA
## arizona-wildcats NA
## pittsburgh-panthers NA
Get rid of teams that played less than five games and refit the model. Make a rank table like the ones we made in class, where you compare the logistic regression rankings to the linear regression rankings, the AP Rankings, and the USA Today rankings. Which model seems to correspond better to the voters’ decisions, the linear regression or logistic regression?
The logistic regression corresponds better to the voters’ decisions.
## Calculate the number of games for each team.
numGames <- vector()
for (i in 1:length(all.teams)) {
numGames[i] <- length(which(games$home == all.teams[i])) + length(which(games$away ==
all.teams[i]))
}
names(numGames) <- all.teams
## Check - Teams with less than 5 games
numGames[which(numGames < 5)]
## arcadia-university-arcadia-university
## 1
## averett-cougars
## 1
## benedictine-eagles
## 2
## benedictine-ravens
## 1
## byu-hawaii-seasiders
## 1
## cal-state-san-marcos-cougars
## 2
## chadron-state-eagles
## 1
## college-of-coastal-georgia-mariner
## 1
## concordia-ill-cougars
## 1
## district-of-columbia-district-of-columbia
## 4
## east-texas-bapt-tigers
## 1
## emory-eagles
## 1
## fairmont-state-falcons
## 1
## florida-christian-college-suns
## 2
## florida-tech-panthers
## 1
## fontbonne-fontbonne
## 1
## geneva-golden-tornadoes
## 1
## georgia-southwestern-hurricanes
## 1
## hannibal-lagrange-hannibal-lagrange
## 2
## hanover-panthers
## 1
## hiram-terriers
## 1
## holy-cross-saints
## 1
## indiana-east-indiana-east
## 1
## john-carroll-blue-streaks
## 1
## lenoir-rhyne-bears
## 2
## lewis-&-clark-pioneers
## 1
## lincoln-(mo)-lincoln-(mo)
## 1
## lsu-shreveport-pilots
## 2
## mansfield-mountaineers
## 1
## marymount-va-saints
## 1
## methodist-monarchs
## 1
## minot-state-beavers
## 1
## missouri-valley-vikings
## 1
## montana-tech-orediggers
## 1
## muhlenberg-mules
## 1
## new-jersey-lions
## 1
## north-alabama-lions
## 1
## oglethorpe-stormy-petrels
## 1
## ohio-dominican-panthers
## 1
## olivet-college-comets
## 1
## pacific-oregon-boxers
## 1
## pacific-union-pacific-union
## 2
## panhandle-state-aggies
## 2
## saint-mary-saint-mary
## 1
## slippery-rock-the-rock
## 1
## southern-virginia-knights
## 2
## southwest-mustang
## 2
## spalding-pelicans
## 1
## st-gregory's-cavaliers
## 2
## st.-thomas-(tx)-celts
## 1
## tabor-college-tabor-college
## 1
## union-ky-college-bulldogs
## 3
## university-of-mary-marauders
## 1
## wabash-wabash
## 1
## wartburg-knights
## 1
## westmont-warriors
## 1
## whitman-missionaries
## 1
## whittier-poets
## 1
## willamette-bearcats
## 1
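For reference, the loop above can be avoided: since home and away are already factors with levels all.teams, table() counts each team's appearances directly (a sketch):
# Home appearances plus away appearances = games played per team
numGames.alt <- table(games$home) + table(games$away)
stopifnot(all(numGames.alt == numGames))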
## Eliminate teams with less than 5 games
all.teams.m5 <- all.teams[-which(numGames < 5)]
## Recalculate matrix
X0.m5 <- as.data.frame(matrix(0, nrow(games), length(all.teams.m5)))
names(X0.m5) <- all.teams.m5
## Fill in the columns, one by one
for (tm in all.teams.m5) {
X0.m5[[tm]] <- 1 * (games$home == tm) - 1 * (games$away == tm)
}
## Stanford is the baseline.
X.m5 <- X0.m5[, names(X0.m5) != "stanford-cardinal"]
y.l <- with(games, homeScore - awayScore)
y.lr <- ifelse(y > 0, 1, 0)
## Model including home advantage
homeAdv.mod.l.m5 <- lm(y.l ~ 0 + homeAdv + ., data = X.m5, subset = reg.season.games)
homeAdv.mod.lr.m5 <- glm(y.lr ~ 0 + homeAdv + ., family = binomial, data = X.m5,
subset = reg.season.games)
## Check. summary(homeAdv.mod.l.m5)[1:25,] summary(homeAdv.mod.lr.m5)[1:25,]
## Calculate Rank.
homeAdv.coef.l.m5 <- coef(homeAdv.mod.l.m5)[paste("`", teams$team, "`", sep = "")]
names(homeAdv.coef.l.m5) <- teams$team
homeAdv.coef.lr.m5 <- coef(homeAdv.mod.lr.m5)[paste("`", teams$team, "`", sep = "")]
names(homeAdv.coef.lr.m5) <- teams$team
## Check.
head(coef(summary(homeAdv.mod.l.m5)))
## Estimate Std. Error t value Pr(>|t|)
## homeAdv 3.8398 0.1612 23.82672 5.279e-119
## `air-force-falcons` 20.7865 2.3456 8.86199 1.072e-18
## `akron-zips` 26.1639 2.2026 11.87837 4.092e-32
## `alabama-a&m-bulldogs` -0.7100 2.3592 -0.30093 7.635e-01
## `alabama-crimson-tide` 23.7965 2.2268 10.68663 2.246e-26
## `alabama-state-hornets` 0.1463 2.2826 0.06411 9.489e-01
head(coef(summary(homeAdv.mod.lr.m5)))
## Estimate Std. Error z value Pr(>|z|)
## homeAdv 0.7082 0.04013 17.647 1.071e-69
## `air-force-falcons` 3.2084 0.61464 5.220 1.790e-07
## `akron-zips` 3.4950 0.63577 5.497 3.858e-08
## `alabama-a&m-bulldogs` -1.3650 0.74467 -1.833 6.679e-02
## `alabama-crimson-tide` 3.1509 0.56691 5.558 2.729e-08
## `alabama-state-hornets` -1.3552 0.68459 -1.980 4.774e-02
## Build rank table.
rank.table.m5 <- cbind(`Model linear Score` = homeAdv.coef.l.m5, `Model log. reg. Score` = homeAdv.coef.lr.m5,
`Model linear Rank` = rank(-homeAdv.coef.l.m5, ties = "min"), `Model log. reg. Rank` = rank(-homeAdv.coef.lr.m5,
ties = "min"), `AP Rank` = teams$apRank, `USAT Rank` = teams$usaTodayRank)
rank.table.m5[order(homeAdv.coef.lr.m5, decreasing = TRUE)[1:25], ]
## Model linear Score Model log. reg. Score
## gonzaga-bulldogs 36.18 5.943
## louisville-cardinals 38.67 5.591
## kansas-jayhawks 34.70 5.403
## indiana-hoosiers 39.52 5.374
## new-mexico-lobos 29.08 5.354
## duke-blue-devils 35.75 5.273
## ohio-state-buckeyes 34.03 5.246
## georgetown-hoyas 30.71 5.185
## michigan-state-spartans 32.34 5.092
## michigan-wolverines 33.72 5.080
## miami-(fl)-hurricanes 31.56 4.917
## kansas-state-wildcats 28.08 4.903
## syracuse-orange 33.52 4.778
## memphis-tigers 27.75 4.721
## saint-louis-billikens 29.16 4.690
## marquette-golden-eagles 28.73 4.673
## butler-bulldogs 25.64 4.640
## wisconsin-badgers 32.98 4.555
## oklahoma-state-cowboys 30.00 4.460
## florida-gators 38.95 4.453
## pittsburgh-panthers 33.96 4.445
## notre-dame-fighting-irish 28.43 4.426
## unlv-rebels 28.04 4.363
## colorado-state-rams 29.41 4.305
## north-carolina-tar-heels 29.10 4.225
## Model linear Rank Model log. reg. Rank AP Rank
## gonzaga-bulldogs 4 1 1
## louisville-cardinals 3 2 2
## kansas-jayhawks 6 3 3
## indiana-hoosiers 1 4 4
## new-mexico-lobos 24 5 11
## duke-blue-devils 5 6 6
## ohio-state-buckeyes 7 7 7
## georgetown-hoyas 16 8 8
## michigan-state-spartans 12 9 9
## michigan-wolverines 9 10 10
## miami-(fl)-hurricanes 15 11 5
## kansas-state-wildcats 35 12 12
## syracuse-orange 10 13 16
## memphis-tigers 39 14 19
## saint-louis-billikens 22 15 13
## marquette-golden-eagles 28 16 15
## butler-bulldogs 50 17 NA
## wisconsin-badgers 11 18 18
## oklahoma-state-cowboys 17 19 17
## florida-gators 2 20 14
## pittsburgh-panthers 8 21 20
## notre-dame-fighting-irish 31 22 23
## unlv-rebels 36 23 NA
## colorado-state-rams 21 24 NA
## north-carolina-tar-heels 23 25 NA
## USAT Rank
## gonzaga-bulldogs 1
## louisville-cardinals 2
## kansas-jayhawks 3
## indiana-hoosiers 5
## new-mexico-lobos 10
## duke-blue-devils 7
## ohio-state-buckeyes 6
## georgetown-hoyas 8
## michigan-state-spartans 9
## michigan-wolverines 11
## miami-(fl)-hurricanes 4
## kansas-state-wildcats 14
## syracuse-orange 18
## memphis-tigers 15
## saint-louis-billikens 13
## marquette-golden-eagles 16
## butler-bulldogs NA
## wisconsin-badgers 17
## oklahoma-state-cowboys 19
## florida-gators 12
## pittsburgh-panthers 22
## notre-dame-fighting-irish NA
## unlv-rebels NA
## colorado-state-rams NA
## north-carolina-tar-heels NA
When we ignore the actual value of yi and instead only use whether yi > 0, we are discarding information, so we might expect our model standard errors to be larger relative to the effect sizes. If we use the linear regression model, for what fraction of teams are we confident (p < 0.05) that the team is better (or worse) than Stanford? For what fraction are we confident if we instead use the logistic regression model?
Using all the teams, the linear regression model is confident (p < 0.05) that a team is better (or worse) than Stanford for a fraction 0.793 of the teams, while the logistic regression model is confident for only a fraction 0.5295567 of them. This is the lower confidence for the logistic model anticipated in the question.
Restricting to the teams with at least 5 games, the linear regression is confident for a fraction 0.9164265 of the teams and the logistic regression for a fraction 0.6224784. The pattern is the same, but the model that excludes teams with fewer than 5 games is confident about more teams.
## p-value fractions
## All teams
fraction.homeAdv.mod.l <- length(which(summary(homeAdv.mod)$coefficients[,
    4] < 0.05))/length(summary(homeAdv.mod)$coefficients[, 4])
fraction.homeAdv.mod.lr <- length(which(summary(homeAdv.mod.logit)$coefficients[,
    4] < 0.05))/length(summary(homeAdv.mod.logit)$coefficients[, 4])
# Linear model fraction:
fraction.homeAdv.mod.l
## [1] 0.793
# Logistic regression fraction:
fraction.homeAdv.mod.lr
## [1] 0.5296
## Excluding teams with less than 5 games
fraction.homeAdv.mod.l.m5 <- length(which(summary(homeAdv.mod.l.m5)$coefficients[,
4] < 0.05))/length(summary(homeAdv.mod.l.m5)$coefficients[, 4])
fraction.homeAdv.mod.lr.m5 <- length(which(summary(homeAdv.mod.lr.m5)$coefficients[,
4] < 0.05))/length(summary(homeAdv.mod.lr.m5)$coefficients[, 4])
# Linear model fraction (excluding teams with less than 5 games):
fraction.homeAdv.mod.l.m5
## [1] 0.9164
# Logistic regression (excluding teams with less than 5 games):
fraction.homeAdv.mod.lr.m5
## [1] 0.6225
Each model makes a prediction about which team will win any given game. We can use ten-fold cross-validation to estimate the test error rate for these predictions, and also try to determine whether one model is better than the other. For each game in a given test set, there are four possible outcomes: both models are right in their prediction, both are wrong, only logistic regression is right, or only linear regression is right. Make a 2 × 2 contingency table displaying how often each of these four outcomes occur, over all the test examples in all ten folds of cross-validation. Your table should look like:
According to the outputs below, we can get (rounded to the closest whole number):
$$n_{11}=381, n_{12}=27, n_{21}=24, n_{22}=122$$
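These counts can be arranged into the required 2 × 2 table; a sketch using the rounded counts above, with rows indexing the linear model and columns the logistic model:
outcome.tab <- matrix(c(381, 27, 24, 122), nrow = 2, byrow = TRUE,
    dimnames = list(linear = c("right", "wrong"), logistic = c("right", "wrong")))
outcome.tab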
## Calculate prediction error using 10-fold CV
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(games), replace = T)
cv.error.l <- vector()
cv.error.lr <- vector()
cv.n11 <- vector()
cv.n12 <- vector()
cv.n21 <- vector()
cv.n22 <- vector()
cv.n <- vector()
for (j in 1:k) {
print(j)
## Model including home advantage
y.l.cv <- with(games[folds != j, ], homeScore - awayScore)
y.lr.cv <- as.factor(with(games[folds != j, ], ifelse(homeScore > awayScore,
1, 0)))
reg.season.games.cv <- which(games[folds != j, ]$gameType == "REG")
homeAdv <- 1 - games[folds != j, ]$neutralLocation
homeAdv.mod.l.cv <- lm(y.l.cv ~ 0 + homeAdv + ., data = X[folds != j, ],
subset = reg.season.games.cv)
homeAdv.mod.lr.cv <- glm(y.lr.cv ~ 0 + homeAdv + ., family = binomial, data = X[folds !=
j, ], subset = reg.season.games.cv)
homeAdv.coef.l.cv <- coef(homeAdv.mod.l.cv)[paste("`", all.teams, "`", sep = "")]
names(homeAdv.coef.l.cv) <- all.teams
homeAdv.coef.lr.cv <- coef(homeAdv.mod.lr.cv)[paste("`", all.teams, "`",
sep = "")]
names(homeAdv.coef.lr.cv) <- all.teams
## Calculate predictions, table data and error rate
resids.l.cv <- homeAdv.mod.l.cv$resid
resids.lr.cv <- homeAdv.mod.lr.cv$resid ## Residuals for the logistic regression deviate from the normal distribution
sigma.l.cv <- sd(resids.l.cv)
sigma.lr.cv <- sd(resids.lr.cv)
error.rate.l.cv <- 0
error.rate.lr.cv <- 0
n11 <- 0
n12 <- 0
n21 <- 0
n22 <- 0
home.wins <- with(games, ifelse(homeScore > awayScore, 1, 0))
for (i in which(folds == j)) {
mu.l.cv <- coef(homeAdv.mod.l.cv)["homeAdv"] + homeAdv.coef.l.cv[games$home[i]] -
homeAdv.coef.l.cv[games$away[i]]
mu.lr.cv <- coef(homeAdv.mod.lr.cv)["homeAdv"] + homeAdv.coef.lr.cv[games$home[i]] -
homeAdv.coef.lr.cv[games$away[i]]
if (games$home[i] == "stanford-cardinal") {
mu.l.cv <- coef(homeAdv.mod.l.cv)["homeAdv"] - homeAdv.coef.l.cv[games$away[i]]
mu.lr.cv <- coef(homeAdv.mod.lr.cv)["homeAdv"] - homeAdv.coef.lr.cv[games$away[i]]
}
if (games$away[i] == "stanford-cardinal") {
mu.l.cv <- coef(homeAdv.mod.l.cv)["homeAdv"] + homeAdv.coef.l.cv[games$home[i]]
mu.lr.cv <- coef(homeAdv.mod.lr.cv)["homeAdv"] + homeAdv.coef.lr.cv[games$home[i]]
}
## Get error rate and table data for the linear model
prob.away.win.l <- pnorm(-mu.l.cv/sigma.l.cv)
home.win.l <- 0
if (!is.na(prob.away.win.l)) {
if (prob.away.win.l < 0.5) {
home.win.l <- 1
}
if (home.wins[i] != home.win.l) {
error.rate.l.cv <- error.rate.l.cv + 1
}
} else {
print("Missing coefficient: Linear Model.")
}
## Get error rate and table data for the Logistic regression model
prob.away.win.lr <- pnorm(-mu.lr.cv/sigma.lr.cv)
home.win.lr <- 0
if (!is.na(prob.away.win.lr)) {
if (prob.away.win.lr < 0.5) {
home.win.lr <- 1
}
if (home.wins[i] != home.win.lr) {
error.rate.lr.cv <- error.rate.lr.cv + 1
}
} else {
print("Missing coefficient: Logistic regression Model.")
}
## Build table ***********************************************
if (home.win.l == home.wins[i] && home.win.lr == home.wins[i]) {
n11 <- n11 + 1
}
if (home.win.l == home.wins[i] && home.win.lr != home.wins[i]) {
n12 <- n12 + 1
}
if (home.win.l != home.wins[i] && home.win.lr == home.wins[i]) {
n21 <- n21 + 1
}
if (home.win.l != home.wins[i] && home.win.lr != home.wins[i]) {
n22 <- n22 + 1
}
}
error.rate.l.cv <- error.rate.l.cv/nrow(games[folds == j, ])
error.rate.lr.cv <- error.rate.lr.cv/nrow(games[folds == j, ])
cv.error.l[j] <- error.rate.l.cv
cv.error.lr[j] <- error.rate.lr.cv
cv.n11[j] <- n11/length(which(folds == j))
cv.n12[j] <- n12/length(which(folds == j))
cv.n21[j] <- n21/length(which(folds == j))
cv.n22[j] <- n22/length(which(folds == j))
cv.n[j] <- length(which(folds == j))
}
## [1] 1
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## ... (folds 2 through 10 print similarly: the fold number, followed by a
## pair of 'Missing coefficient' messages for each test game involving a
## team whose coefficient is NA in the models fit on the other nine folds)
mean(cv.n11) * mean(cv.n)
## [1] 381.4
mean(cv.n12) * mean(cv.n)
## [1] 27.28
mean(cv.n21) * mean(cv.n)
## [1] 23.64
mean(cv.n22) * mean(cv.n)
## [1] 121.8
Use the normal approximation to carry out a test of the hypothesis that both models are equally good at predicting games. What is the conclusion of your test?
\( n = n_{11} + n_{12} + n_{21} + n_{22} \)
\( p_{12} = n_{12}/n \) \( p_{21} = n_{21}/n \)
\[ H_0: p_{12} = p_{21} \] \[ H_1: p_{12} \neq p_{21} \]
Because \( n_{12} + n_{21} > 25 \), we can perform the chi-square (McNemar) test.
I used a continuity correction of 1:
\[ \chi^2_{df=1} = \frac{(|n_{21} - n_{12}|-1)^2}{n_{21}+n_{12}} = \frac{(|24 - 27|-1)^2}{24 + 27} = 0.0784313 \]
The two-tailed p-value corresponding to this test statistic is 0.7794, so the result is not significant: we fail to reject the null hypothesis and conclude that neither model predicts significantly better than the other.
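The same test, with the same continuity correction, is available in base R as mcnemar.test; a sketch using the rounded counts from above:
# McNemar's test on the 2x2 outcome table; correct = TRUE applies the
# continuity correction of 1 used in the hand calculation
mcnemar.test(matrix(c(381, 27, 24, 122), nrow = 2, byrow = TRUE), correct = TRUE)
## Should report X-squared = 0.0784, df = 1, p-value = 0.7794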