Homework #2

STATS 216, Shijia Bian, Deadline: Feb 12, 2014

Question 1

(a) iv

(b) ii

(c.) iii

(d) iv

(e) v

Question 2

(a)

\( \frac{n-1}{n} \). There are n equally likely choices for the first bootstrap observation, {1, 2, 3, …, n}, and n − 1 of them are not the jth observation from the original data set. Since the draw is made uniformly at random, the probability that the first bootstrap observation is not the jth observation is

\[ \frac{\text{number of options that are not the jth observation}}{\text{total number of options for the first bootstrap observation}} = \frac{n-1}{n}. \]

(b)

\( \frac{n-1}{n} \). Bootstrap sampling is with replacement, so the second draw has the same n equally likely options as the first draw, and therefore the same probability of missing the jth observation.

(c.)

The probability that the first bootstrap observation is not the jth observation is \( \frac{n-1}{n} \);

the probability that the second bootstrap observation is not the jth observation is \( \frac{n-1}{n} \); \( \dotsc \); the probability that the nth bootstrap observation is not the jth observation is \( \frac{n-1}{n} \).

Because each of the n draws is independent, and we draw n observations, the probability that the jth observation is not in the bootstrap sample is \( (\frac{n-1}{n})^n \), that is, \( (1 − 1/n)^n \).
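Written out, by independence of the n draws,

\[ \Pr(\text{jth observation not in the bootstrap sample}) = \underbrace{\frac{n-1}{n}\cdot\frac{n-1}{n}\cdots\frac{n-1}{n}}_{n \text{ factors}} = \left(1-\frac{1}{n}\right)^{n}. \]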

(d)

\( 1-(1 − 1/n)^n=1-(1 − 1/5)^5=0.67232 \)

(e)

\( 1-(1 − 1/n)^n=1-(1 − 1/100)^{100}=0.6339677 \)

(f)

\( 1-(1 − 1/n)^n=1-(1 − 1/10000)^{10000}=0.632139 \)
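The three values above can be checked with a quick one-liner in R (a sanity check, not part of the required output):

n <- c(5, 100, 10000)
# Probability that the jth observation appears in a bootstrap sample of size n.
1 - (1 - 1/n)^n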

(g)

Create a plot that displays, for each integer value of n from 1 to 100, 000, the probability that the jth observation is in the bootstrap sample. Comment on what you observe.

By the standard limit, \( (1-1/n)^n \) converges to \( 1/e \) as n grows infinitely large, so \( 1 - (1-1/n)^n \) converges to \( 1 - 1/e \approx 0.6321 \). The plot below shows this behavior.
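For reference, the limit used here is

\[ \lim_{n\to\infty}\left(1-\frac{1}{n}\right)^{n} = e^{-1}, \qquad \text{so} \qquad \lim_{n\to\infty}\left[1-\left(1-\frac{1}{n}\right)^{n}\right] = 1 - e^{-1} \approx 0.6321. \]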
prob <- numeric(1e+05)
x <- numeric(1e+05)
# For each n from 1 to 100,000, compute the probability that the jth
# observation appears in a bootstrap sample of size n.
for (i in 1:1e+05) {
    prob[i] <- 1 - ((i - 1)/i)^i
    x[i] <- i
}
plot(x, prob, type = "l")

[Figure: probability that the jth observation is in the bootstrap sample, as a function of n]

(h)

The mean is 0.6348, which is very close to the value we obtained in (e). We would need more simulation repetitions to get a more accurate estimate.
set.seed(1232321)
store = rep(NA, 10000)
# Draw a bootstrap sample of size 100 from 1:100 and record whether
# the 4th observation appears in it; repeat 10,000 times.
for (i in 1:10000) {
    store[i] = sum(sample(1:100, replace = TRUE) == 4) > 0
}
mean(store)
## [1] 0.6348

Question 3

(a) - (c)

These three questions are hand written on a separate sheet of paper. 

(d)

I created two linearly well-separated clusters as required, with c equal to 3.0: if x was less than 3.0, the point fell in category "0"; otherwise, it fell in category "1".

First I made the cluster corresponding to category "0":
require(ISLR)
## Loading required package: ISLR
cluster1x <- seq(0, 2.9, 0.1)
cluster1y <- runif(100, -5, 5)
cluser1bigx <- numeric(3000)
cluser1bigy <- numeric(3000)

count <- 1
for (i in 1:30) {
    for (j in 1:100) {
        cluser1bigx[count] = cluster1x[i]
        cluser1bigy[count] = cluster1y[j]
        count <- count + 1
    }
}

cluster1Res <- rep("0", 3000)

Here is the plot of the first cluster:

plot(cluser1bigx, cluser1bigy)

[Figure: the first cluster (category "0")]

Second I made the cluster corresponding to category “1”:

cluster2x <- seq(3.1, 6, 0.1)
cluster2y <- runif(100, -5, 5)
cluser2bigx <- numeric(3000)
cluser2bigy <- numeric(3000)

count2 <- 1
for (i in 1:30) {
    for (j in 1:100) {
        cluser2bigx[count2] = cluster2x[i]
        cluser2bigy[count2] = cluster2y[j]
        count2 <- count2 + 1
    }
}

cluster2Res <- rep("1", 3000)
plot(cluser2bigx, cluser2bigy)

[Figure: the second cluster (category "1")]

Then I combined all the x, y values and categorical responses into a single data frame:

bigclusterX <- c(cluser1bigx, cluser2bigx)
bigclusterY <- c(cluser1bigy, cluser2bigy)
bigclusterRes <- c(cluster1Res, cluster2Res)

wholeData <- data.frame(bigclusterX, bigclusterY, bigclusterRes)
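As an aside, the same set of points can be built more compactly with expand.grid; this is only an equivalent sketch (grid0, grid1 and wholeData2 are new names), not the code used above.

# expand.grid pairs every x with every y, giving the same 3000 points per cluster.
grid0 <- expand.grid(bigclusterX = cluster1x, bigclusterY = cluster1y)
grid0$bigclusterRes <- "0"
grid1 <- expand.grid(bigclusterX = cluster2x, bigclusterY = cluster2y)
grid1$bigclusterRes <- "1"
wholeData2 <- rbind(grid0, grid1)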

Then I plotted the data as a whole:

plot(bigclusterX, bigclusterY, col = ifelse(bigclusterRes == "0", "red", "blue"))

[Figure: both clusters, colored by category (red = "0", blue = "1")]

You can see that the data are well separated at c = 3. I then fit a logistic regression model to the data; the warnings occur as described in the question:

glm.fit = glm(bigclusterRes ~ bigclusterX + bigclusterY, data = wholeData, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
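The second warning can be checked directly: with perfectly separated classes the fitted probabilities collapse to (numerically) 0 and 1. A quick look, not part of the required output:

# Fitted probabilities are essentially 0 for one cluster and 1 for the other.
range(fitted(glm.fit))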

Question 4

Teamed up with Marcello Hasegawa.

(a)

Fit the logistic regression model above to the data and examine the rankings. What happened to make the teams saint-mary-saint-mary and st.-thomas-(tx)-celts look so good? Can you explain it in terms of your answer to the previous question?

First, the logistic regression model gives much less credit than the linear regression does for running up the score against opponents, which lowers the rank of some teams that win by very large margins.

Second, the teams saint-mary-saint-mary and st.-thomas-(tx)-celts each played only one game, and finished that game with a relatively wide margin, 332 and −86 respectively. As in Question 3, logistic regression maximizes the likelihood, and when a team's single game can be separated perfectly from the rest of the data, the likelihood keeps increasing as that team's coefficient grows in magnitude, so the fitted coefficients for these two teams become very large. That is why they rank so highly under the logistic regression; a short check after the ranking tables below confirms that both teams played only one game.
# 1. loading the data

games <- read.csv("http://www.stanford.edu/~wfithian/games.csv", as.is = TRUE)
teams <- read.csv("http://www.stanford.edu/~wfithian/teams.csv", as.is = TRUE)

head(games)
##         date                          home                   away
## 1 2012-11-09            albany-great-danes         duquesne-dukes
## 2 2012-11-11           ohio-state-buckeyes     albany-great-danes
## 3 2012-11-13            washington-huskies     albany-great-danes
## 4 2012-11-17            albany-great-danes         umkc-kangaroos
## 5 2012-11-18            albany-great-danes loyola-(md)-greyhounds
## 6 2012-11-20 south-carolina-state-bulldogs     albany-great-danes
##   homeScore awayScore neutralLocation gameType
## 1        69        66               0      REG
## 2        82        60               0      REG
## 3        62        63               0      REG
## 4        62        59               1      REG
## 5        64        67               1      REG
## 6        55        83               0      REG
head(teams)
##                         team   conference inTourney apRank usaTodayRank
## 1      stony-brook-seawolves america-east         0     NA           NA
## 2         vermont-catamounts america-east         0     NA           NA
## 3 boston-university-terriers america-east         0     NA           NA
## 4             hartford-hawks america-east         0     NA           NA
## 5         albany-great-danes america-east         1     NA           NA
## 6          maine-black-bears america-east         0     NA           NA

all.teams <- sort(unique(c(teams$team, games$home, games$away)))

games$home <- factor(games$home, levels = all.teams)
games$away <- factor(games$away, levels = all.teams)
teams$team <- factor(teams$team, levels = all.teams)
teams$conference <- factor(teams$conference)

# Rank the teams.

total.margin <- function(team) {
    with(games, sum(homeScore[home == team]) + sum(awayScore[away == team]) - 
        sum(homeScore[away == team]) - sum(awayScore[home == team]))
}

margins <- sapply(teams$team, total.margin)
names(margins) <- teams$team

rank.table <- cbind(Margin = margins, `Margin Rank` = rank(-margins, ties = "min"), 
    `AP Rank` = teams$apRank, `USAT Rank` = teams$usaTodayRank)

margin.top25 <- order(margins, decreasing = TRUE)[1:25]
rank.table[margin.top25, ]
##                               Margin Margin Rank AP Rank USAT Rank
## florida-gators                   630           1      14        12
## louisville-cardinals             627           2       2         2
## indiana-hoosiers                 595           3       4         5
## gonzaga-bulldogs                 554           4       1         1
## kansas-jayhawks                  490           5       3         3
## syracuse-orange                  469           6      16        18
## weber-state-wildcats             467           7      NA        NA
## michigan-wolverines              465           8      10        11
## virginia-commonwealth-rams       447           9      NA        23
## pittsburgh-panthers              435          10      20        22
## middle-tennessee-blue-raiders    431          11      NA        NA
## duke-blue-devils                 428          12       6         7
## creighton-bluejays               406          13      22        21
## ohio-state-buckeyes              402          14       7         6
## saint-mary's-gaels               381          15      NA        25
## davidson-wildcats                380          16      NA        NA
## ole-miss-rebels                  370          17      NA        NA
## saint-louis-billikens            351          18      13        13
## belmont-bruins                   351          18      NA        NA
## memphis-tigers                   344          20      19        15
## arizona-wildcats                 344          20      21        20
## wichita-state-shockers           335          22      NA        NA
## stephen-f.-austin-lumberjacks    332          23      NA        NA
## miami-(fl)-hurricanes            328          24       5         4
## missouri-tigers                  320          25      NA        NA

# Linear Regression for Ranking team

y <- with(games, homeScore - awayScore)

X0 <- as.data.frame(matrix(0, nrow(games), length(all.teams)))
names(X0) <- all.teams
for (tm in all.teams) {
    X0[[tm]] <- 1 * (games$home == tm) - 1 * (games$away == tm)
}

# 4. Identifiability Problem

stanf.ix <- which(names(X0) == "stanford-cardinal")
X <- X0[, -stanf.ix]
reg.season.games <- which(games$gameType == "REG")

mod <- lm(y ~ 0 + ., data = X, subset = reg.season.games)
head(coef(summary(mod)))
##                         Estimate Std. Error  t value  Pr(>|t|)
## `air-force-falcons`       -6.688      2.898  -2.3080 2.104e-02
## `akron-zips`              -2.558      2.841  -0.9002 3.681e-01
## `alabama-a&m-bulldogs`   -30.590      2.909 -10.5167 1.338e-25
## `alabama-crimson-tide`    -2.851      2.802  -1.0173 3.090e-01
## `alabama-state-hornets`  -29.958      2.849 -10.5143 1.372e-25
## `albany-great-danes`     -13.335      2.818  -4.7313 2.292e-06

summary(mod)$r.squared
## [1] 0.5512


## Court Advantage

# Home Court Advantage

homeAdv <- 1 - games$neutralLocation
homeAdv.mod <- lm(y ~ 0 + homeAdv + ., data = X, subset = reg.season.games)
homeAdv.coef <- coef(homeAdv.mod)[paste("`", all.teams, "`", sep = "")]
names(homeAdv.coef) <- all.teams

head(coef(summary(homeAdv.mod)))
##                         Estimate Std. Error t value   Pr(>|t|)
## homeAdv                   3.5285     0.1569 22.4920 8.655e-107
## `air-force-falcons`      -5.2991     2.7621 -1.9185  5.510e-02
## `akron-zips`             -0.9668     2.7085 -0.3570  7.211e-01
## `alabama-a&m-bulldogs`  -27.1437     2.7761 -9.7777  2.227e-22
## `alabama-crimson-tide`   -2.6111     2.6706 -0.9777  3.283e-01
## `alabama-state-hornets` -26.3396     2.7200 -9.6836  5.519e-22

# Home Court Advantage by Logistic Regression

newY <- ifelse(y > 0, 1, 0)
homeAdv.mod.logit <- glm(newY ~ 0 + homeAdv + ., data = X, subset = reg.season.games, 
    family = binomial)
homeAdv.coef.logit <- coef(homeAdv.mod.logit)[paste("`", all.teams, "`", sep = "")]
names(homeAdv.coef.logit) <- all.teams

head(coef(summary(homeAdv.mod.logit)))
##                          Estimate Std. Error   z value  Pr(>|z|)
## homeAdv                  0.679812    0.04032 16.860919 8.722e-64
## `air-force-falcons`      0.117271    0.70171  0.167122 8.673e-01
## `akron-zips`             0.228891    0.73603  0.310980 7.558e-01
## `alabama-a&m-bulldogs`  -4.576627    0.83025 -5.512339 3.541e-08
## `alabama-crimson-tide`  -0.004103    0.66055 -0.006211 9.950e-01
## `alabama-state-hornets` -4.590446    0.77545 -5.919731 3.225e-09


# Model Ranking Table
rank.table <- cbind(`Model Score` = homeAdv.coef, `Model Rank` = rank(-homeAdv.coef, 
    ties = "min"), margins = margins)
## Warning: number of rows of result is not a multiple of vector length (arg
## 3)
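## Note: the recycling warning arises because `homeAdv.coef` is indexed by
## all.teams while `margins` is indexed by teams$team, and the two vectors
## have different lengths (a quick check, not part of the original output):
length(homeAdv.coef)
length(margins)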

rank.table[order(homeAdv.coef, decreasing = TRUE)[1:25], ]
##                            Model Score Model Rank margins
## indiana-hoosiers                13.255          1      17
## florida-gators                  12.639          2     277
## louisville-cardinals            12.081          3     298
## gonzaga-bulldogs                 9.988          4       1
## duke-blue-devils                 9.394          5    -236
## kansas-jayhawks                  8.632          6    -208
## ohio-state-buckeyes              7.897          7      56
## michigan-wolverines              7.505          8    -366
## pittsburgh-panthers              7.356          9     155
## syracuse-orange                  7.070         10     242
## wisconsin-badgers                6.831         11     286
## michigan-state-spartans          6.176         12    -263
## creighton-bluejays               5.284         13    -106
## miami-(fl)-hurricanes            5.192         14    -288
## virginia-commonwealth-rams       4.957         15     -53
## arizona-wildcats                 4.685         16     196
## minnesota-golden-gophers         4.422         17     335
## georgetown-hoyas                 4.259         18    -134
## missouri-tigers                  3.962         19     -25
## oklahoma-state-cowboys           3.691         20     294
## saint-mary's-gaels               3.372         21     224
## colorado-state-rams              3.152         22      93
## new-mexico-lobos                 3.004         23       5
## north-carolina-tar-heels         2.715         24    -163
## ole-miss-rebels                  2.645         25       3

# Model Ranking Table for Logit
rank.table.logit <- cbind(`Model Linear Score` = homeAdv.coef, `Model Logistic Score` = homeAdv.coef.logit, 
    margins = margins, `Model Linear Rank` = rank(-homeAdv.coef, ties = "min"), 
    `Model Logistic Rank` = rank(-homeAdv.coef.logit, ties = "min"), `AP Rank` = teams$apRank, 
    `USAT Rank` = teams$usaTodayRank)
## Warning: number of rows of result is not a multiple of vector length (arg
## 3)
rank.table.logit[order(homeAdv.coef.logit, decreasing = TRUE)[1:25], ]
##                         Model Linear Score Model Logistic Score margins
## saint-mary-saint-mary             -10.4184               14.127     332
## st.-thomas-(tx)-celts              -5.6607               13.270     -86
## gonzaga-bulldogs                    9.9875                2.771       1
## louisville-cardinals               12.0810                2.412     298
## kansas-jayhawks                     8.6319                2.266    -208
## indiana-hoosiers                   13.2552                2.255      17
## new-mexico-lobos                    3.0044                2.239       5
## ohio-state-buckeyes                 7.8965                2.137      56
## duke-blue-devils                    9.3937                2.107    -236
## georgetown-hoyas                    4.2590                2.032    -134
## michigan-state-spartans             6.1762                1.976    -263
## michigan-wolverines                 7.5054                1.960    -366
## miami-(fl)-hurricanes               5.1916                1.761    -288
## kansas-state-wildcats               1.6525                1.749    -257
## syracuse-orange                     7.0696                1.605     242
## memphis-tigers                      1.0144                1.532     -23
## saint-louis-billikens               2.6373                1.522    -353
## marquette-golden-eagles             2.2985                1.508    -134
## butler-bulldogs                    -0.7322                1.460     207
## wisconsin-badgers                   6.8313                1.451     286
## florida-gators                     12.6386                1.316     277
## oklahoma-state-cowboys              3.6909                1.304     294
## unlv-rebels                         2.1699                1.294     105
## arizona-wildcats                    4.6849                1.292     196
## pittsburgh-panthers                 7.3563                1.266     155
##                         Model Linear Rank Model Logistic Rank AP Rank
## saint-mary-saint-mary                 162                   1      NA
## st.-thomas-(tx)-celts                  98                   2      NA
## gonzaga-bulldogs                        4                   3      NA
## louisville-cardinals                    3                   4      NA
## kansas-jayhawks                         6                   5      NA
## indiana-hoosiers                        1                   6      NA
## new-mexico-lobos                       23                   7      NA
## ohio-state-buckeyes                     7                   8      NA
## duke-blue-devils                        5                   9      NA
## georgetown-hoyas                       18                  10      NA
## michigan-state-spartans                12                  11      NA
## michigan-wolverines                     8                  12      NA
## miami-(fl)-hurricanes                  14                  13      NA
## kansas-state-wildcats                  36                  14      NA
## syracuse-orange                        10                  15      NA
## memphis-tigers                         42                  16      NA
## saint-louis-billikens                  26                  17      NA
## marquette-golden-eagles                31                  18      NA
## butler-bulldogs                        51                  19      NA
## wisconsin-badgers                      11                  20      NA
## florida-gators                          2                  21      NA
## oklahoma-state-cowboys                 20                  22      NA
## unlv-rebels                            32                  23      NA
## arizona-wildcats                       16                  24      NA
## pittsburgh-panthers                     9                  25      NA
##                         USAT Rank
## saint-mary-saint-mary          NA
## st.-thomas-(tx)-celts          NA
## gonzaga-bulldogs               NA
## louisville-cardinals           NA
## kansas-jayhawks                NA
## indiana-hoosiers               NA
## new-mexico-lobos               NA
## ohio-state-buckeyes            NA
## duke-blue-devils               NA
## georgetown-hoyas               NA
## michigan-state-spartans        NA
## michigan-wolverines            NA
## miami-(fl)-hurricanes          NA
## kansas-state-wildcats          NA
## syracuse-orange                NA
## memphis-tigers                 NA
## saint-louis-billikens          NA
## marquette-golden-eagles        NA
## butler-bulldogs                NA
## wisconsin-badgers              NA
## florida-gators                 NA
## oklahoma-state-cowboys         NA
## unlv-rebels                    NA
## arizona-wildcats               NA
## pittsburgh-panthers            NA
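A quick check with the objects defined above (not part of the original output) confirms that each of the two teams at the top of the logistic ranking played only a single game:

odd.teams <- c("saint-mary-saint-mary", "st.-thomas-(tx)-celts")
# Number of games each team appears in (home or away), and its total scoring margin.
sapply(odd.teams, function(tm) sum(games$home == tm | games$away == tm))
sapply(odd.teams, total.margin)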

(b)

Get rid of teams that played less than five games and refit the model. Make a rank table like the ones we made in class, where you compare the logistic regression rankings to the linear regression rankings, the AP Rankings, and the USA Today rankings. Which model seems to correspond better to the voters’ decisions, the linear regression or logistic regression?

The logistic regression corresponds better to the voters' decisions: for example, its top four (gonzaga, louisville, kansas, indiana) match the AP top four exactly, and the overall ordering of its top 25 tracks the AP and USA Today rankings more closely than the linear regression ordering does.
## Calculate the number of games for each team.
numGames <- vector()

for (i in 1:length(all.teams)) {
    numGames[i] <- length(which(games$home == all.teams[i])) + length(which(games$away == 
        all.teams[i]))
}
names(numGames) <- all.teams
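## A more compact equivalent (just a sketch; `numGames.alt` is a new name):
## count how often each team appears as home or away, keeping zero-game teams
## via the factor levels.
numGames.alt <- table(factor(c(as.character(games$home), as.character(games$away)),
    levels = all.teams))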

## Check - Teams with less than 5 games
numGames[which(numGames < 5)]
##     arcadia-university-arcadia-university 
##                                         1 
##                           averett-cougars 
##                                         1 
##                        benedictine-eagles 
##                                         2 
##                        benedictine-ravens 
##                                         1 
##                      byu-hawaii-seasiders 
##                                         1 
##              cal-state-san-marcos-cougars 
##                                         2 
##                      chadron-state-eagles 
##                                         1 
##        college-of-coastal-georgia-mariner 
##                                         1 
##                     concordia-ill-cougars 
##                                         1 
## district-of-columbia-district-of-columbia 
##                                         4 
##                    east-texas-bapt-tigers 
##                                         1 
##                              emory-eagles 
##                                         1 
##                    fairmont-state-falcons 
##                                         1 
##            florida-christian-college-suns 
##                                         2 
##                     florida-tech-panthers 
##                                         1 
##                       fontbonne-fontbonne 
##                                         1 
##                   geneva-golden-tornadoes 
##                                         1 
##           georgia-southwestern-hurricanes 
##                                         1 
##       hannibal-lagrange-hannibal-lagrange 
##                                         2 
##                          hanover-panthers 
##                                         1 
##                            hiram-terriers 
##                                         1 
##                         holy-cross-saints 
##                                         1 
##                 indiana-east-indiana-east 
##                                         1 
##                 john-carroll-blue-streaks 
##                                         1 
##                        lenoir-rhyne-bears 
##                                         2 
##                    lewis-&-clark-pioneers 
##                                         1 
##                 lincoln-(mo)-lincoln-(mo) 
##                                         1 
##                     lsu-shreveport-pilots 
##                                         2 
##                    mansfield-mountaineers 
##                                         1 
##                       marymount-va-saints 
##                                         1 
##                        methodist-monarchs 
##                                         1 
##                       minot-state-beavers 
##                                         1 
##                   missouri-valley-vikings 
##                                         1 
##                   montana-tech-orediggers 
##                                         1 
##                          muhlenberg-mules 
##                                         1 
##                          new-jersey-lions 
##                                         1 
##                       north-alabama-lions 
##                                         1 
##                 oglethorpe-stormy-petrels 
##                                         1 
##                   ohio-dominican-panthers 
##                                         1 
##                     olivet-college-comets 
##                                         1 
##                     pacific-oregon-boxers 
##                                         1 
##               pacific-union-pacific-union 
##                                         2 
##                    panhandle-state-aggies 
##                                         2 
##                     saint-mary-saint-mary 
##                                         1 
##                    slippery-rock-the-rock 
##                                         1 
##                 southern-virginia-knights 
##                                         2 
##                         southwest-mustang 
##                                         2 
##                         spalding-pelicans 
##                                         1 
##                    st-gregory's-cavaliers 
##                                         2 
##                     st.-thomas-(tx)-celts 
##                                         1 
##               tabor-college-tabor-college 
##                                         1 
##                 union-ky-college-bulldogs 
##                                         3 
##              university-of-mary-marauders 
##                                         1 
##                             wabash-wabash 
##                                         1 
##                          wartburg-knights 
##                                         1 
##                         westmont-warriors 
##                                         1 
##                      whitman-missionaries 
##                                         1 
##                            whittier-poets 
##                                         1 
##                       willamette-bearcats 
##                                         1

## Eliminate teams with less than 5 games
all.teams.m5 <- all.teams[-which(numGames < 5)]

## Recalculate matrix
X0.m5 <- as.data.frame(matrix(0, nrow(games), length(all.teams.m5)))
names(X0.m5) <- all.teams.m5

## Fill in the columns, one by one
for (tm in all.teams.m5) {
    X0.m5[[tm]] <- 1 * (games$home == tm) - 1 * (games$away == tm)
}

## Stanford is the baseline.
X.m5 <- X0.m5[, names(X0.m5) != "stanford-cardinal"]

y.l <- with(games, homeScore - awayScore)
y.lr <- ifelse(y > 0, 1, 0)

## Model including home advantage
homeAdv.mod.l.m5 <- lm(y.l ~ 0 + homeAdv + ., data = X.m5, subset = reg.season.games)
homeAdv.mod.lr.m5 <- glm(y.lr ~ 0 + homeAdv + ., family = binomial, data = X.m5, 
    subset = reg.season.games)

## Check. summary(homeAdv.mod.l.m5)[1:25,] summary(homeAdv.mod.lr.m5)[1:25,]

## Calculate Rank.
homeAdv.coef.l.m5 <- coef(homeAdv.mod.l.m5)[paste("`", teams$team, "`", sep = "")]
names(homeAdv.coef.l.m5) <- teams$team

homeAdv.coef.lr.m5 <- coef(homeAdv.mod.lr.m5)[paste("`", teams$team, "`", sep = "")]
names(homeAdv.coef.lr.m5) <- teams$team

## Check.
head(coef(summary(homeAdv.mod.l.m5)))
##                         Estimate Std. Error  t value   Pr(>|t|)
## homeAdv                   3.8398     0.1612 23.82672 5.279e-119
## `air-force-falcons`      20.7865     2.3456  8.86199  1.072e-18
## `akron-zips`             26.1639     2.2026 11.87837  4.092e-32
## `alabama-a&m-bulldogs`   -0.7100     2.3592 -0.30093  7.635e-01
## `alabama-crimson-tide`   23.7965     2.2268 10.68663  2.246e-26
## `alabama-state-hornets`   0.1463     2.2826  0.06411  9.489e-01
head(coef(summary(homeAdv.mod.lr.m5)))
##                         Estimate Std. Error z value  Pr(>|z|)
## homeAdv                   0.7082    0.04013  17.647 1.071e-69
## `air-force-falcons`       3.2084    0.61464   5.220 1.790e-07
## `akron-zips`              3.4950    0.63577   5.497 3.858e-08
## `alabama-a&m-bulldogs`   -1.3650    0.74467  -1.833 6.679e-02
## `alabama-crimson-tide`    3.1509    0.56691   5.558 2.729e-08
## `alabama-state-hornets`  -1.3552    0.68459  -1.980 4.774e-02

## Build rank table.
rank.table.m5 <- cbind(`Model linear Score` = homeAdv.coef.l.m5, `Model log. reg. Score` = homeAdv.coef.lr.m5, 
    `Model linear Rank` = rank(-homeAdv.coef.l.m5, ties = "min"), `Model log. reg. Rank` = rank(-homeAdv.coef.lr.m5, 
        ties = "min"), `AP Rank` = teams$apRank, `USAT Rank` = teams$usaTodayRank)
rank.table.m5[order(homeAdv.coef.lr.m5, decreasing = TRUE)[1:25], ]
##                           Model linear Score Model log. reg. Score
## gonzaga-bulldogs                       36.18                 5.943
## louisville-cardinals                   38.67                 5.591
## kansas-jayhawks                        34.70                 5.403
## indiana-hoosiers                       39.52                 5.374
## new-mexico-lobos                       29.08                 5.354
## duke-blue-devils                       35.75                 5.273
## ohio-state-buckeyes                    34.03                 5.246
## georgetown-hoyas                       30.71                 5.185
## michigan-state-spartans                32.34                 5.092
## michigan-wolverines                    33.72                 5.080
## miami-(fl)-hurricanes                  31.56                 4.917
## kansas-state-wildcats                  28.08                 4.903
## syracuse-orange                        33.52                 4.778
## memphis-tigers                         27.75                 4.721
## saint-louis-billikens                  29.16                 4.690
## marquette-golden-eagles                28.73                 4.673
## butler-bulldogs                        25.64                 4.640
## wisconsin-badgers                      32.98                 4.555
## oklahoma-state-cowboys                 30.00                 4.460
## florida-gators                         38.95                 4.453
## pittsburgh-panthers                    33.96                 4.445
## notre-dame-fighting-irish              28.43                 4.426
## unlv-rebels                            28.04                 4.363
## colorado-state-rams                    29.41                 4.305
## north-carolina-tar-heels               29.10                 4.225
##                           Model linear Rank Model log. reg. Rank AP Rank
## gonzaga-bulldogs                          4                    1       1
## louisville-cardinals                      3                    2       2
## kansas-jayhawks                           6                    3       3
## indiana-hoosiers                          1                    4       4
## new-mexico-lobos                         24                    5      11
## duke-blue-devils                          5                    6       6
## ohio-state-buckeyes                       7                    7       7
## georgetown-hoyas                         16                    8       8
## michigan-state-spartans                  12                    9       9
## michigan-wolverines                       9                   10      10
## miami-(fl)-hurricanes                    15                   11       5
## kansas-state-wildcats                    35                   12      12
## syracuse-orange                          10                   13      16
## memphis-tigers                           39                   14      19
## saint-louis-billikens                    22                   15      13
## marquette-golden-eagles                  28                   16      15
## butler-bulldogs                          50                   17      NA
## wisconsin-badgers                        11                   18      18
## oklahoma-state-cowboys                   17                   19      17
## florida-gators                            2                   20      14
## pittsburgh-panthers                       8                   21      20
## notre-dame-fighting-irish                31                   22      23
## unlv-rebels                              36                   23      NA
## colorado-state-rams                      21                   24      NA
## north-carolina-tar-heels                 23                   25      NA
##                           USAT Rank
## gonzaga-bulldogs                  1
## louisville-cardinals              2
## kansas-jayhawks                   3
## indiana-hoosiers                  5
## new-mexico-lobos                 10
## duke-blue-devils                  7
## ohio-state-buckeyes               6
## georgetown-hoyas                  8
## michigan-state-spartans           9
## michigan-wolverines              11
## miami-(fl)-hurricanes             4
## kansas-state-wildcats            14
## syracuse-orange                  18
## memphis-tigers                   15
## saint-louis-billikens            13
## marquette-golden-eagles          16
## butler-bulldogs                  NA
## wisconsin-badgers                17
## oklahoma-state-cowboys           19
## florida-gators                   12
## pittsburgh-panthers              22
## notre-dame-fighting-irish        NA
## unlv-rebels                      NA
## colorado-state-rams              NA
## north-carolina-tar-heels         NA

(c.)

When we ignore the actual value of yi and instead only use whether yi > 0, we are discarding information, so we might expect our model standard errors to be larger relative to the effect sizes. If we use the linear regression model, for what fraction of teams are we confident (p < 0.05) that the team is better (or worse) than Stanford? For what fraction are we confident if we instead use the logistic regression model?

Using all the teams, we are confident (p < 0.05) that a team is better (or worse) than Stanford for a fraction 0.793 of the teams under the linear regression model, but only for a fraction 0.5295567 under the logistic regression model. As the question suggests, discarding the margin information makes the standard errors larger relative to the effect sizes, so the logistic regression gives us confidence about fewer teams.

Restricting to the teams that played at least 5 games, the fractions are 0.9164265 for the linear regression model and 0.6224784 for the logistic regression model. The ordering is the same, but both fractions are higher once the teams with fewer than 5 games are removed.
## p-value fractions

## All teams (the all-team models fitted above are homeAdv.mod and homeAdv.mod.logit)
fraction.homeAdv.mod.l <- length(which(summary(homeAdv.mod)$coefficients[, 
    4] < 0.05))/length(summary(homeAdv.mod)$coefficients[, 4])
fraction.homeAdv.mod.lr <- length(which(summary(homeAdv.mod.logit)$coefficients[, 
    4] < 0.05))/length(summary(homeAdv.mod.logit)$coefficients[, 4])

# Linear model fraction:
fraction.homeAdv.mod.l

# Logistic regression fraction:
fraction.homeAdv.mod.lr

## Excluding teams with less than 5 games
fraction.homeAdv.mod.l.m5 <- length(which(summary(homeAdv.mod.l.m5)$coefficients[, 
    4] < 0.05))/length(summary(homeAdv.mod.l.m5)$coefficients[, 4])
fraction.homeAdv.mod.lr.m5 <- length(which(summary(homeAdv.mod.lr.m5)$coefficients[, 
    4] < 0.05))/length(summary(homeAdv.mod.lr.m5)$coefficients[, 4])

# Linear model fraction (excluding teams with less than 5 games):
fraction.homeAdv.mod.l.m5
## [1] 0.9164

# Logistic regression (excluding teams with less than 5 games):
fraction.homeAdv.mod.lr.m5
## [1] 0.6225

(d)

Each model makes a prediction about which team will win any given game. We can use ten-fold cross-validation to estimate the test error rate for these predictions, and also try to determine whether one model is better than the other. For each game in a given test set, there are four possible outcomes: both models are right in their prediction, both are wrong, only logistic regression is right, or only linear regression is right. Make a 2 × 2 contingency table displaying how often each of these four outcomes occur, over all the test examples in all ten folds of cross-validation. Your table should look like:

According to the outputs below, we get (rounded to the nearest whole number):

$$n_{11}=381, n_{12}=27, n_{21}=24, n_{22}=122$$
## Calculate prediction error using 10-fold CV.

k <- 10
set.seed(1)
folds <- sample(1:k, nrow(games), replace = T)
cv.error.l <- vector()
cv.error.lr <- vector()
cv.n11 <- vector()
cv.n12 <- vector()
cv.n21 <- vector()
cv.n22 <- vector()
cv.n <- vector()

for (j in 1:k) {

    print(j)

    ## Model including home advantage
    y.l.cv <- with(games[folds != j, ], homeScore - awayScore)
    y.lr.cv <- as.factor(with(games[folds != j, ], ifelse(homeScore > awayScore, 
        1, 0)))
    reg.season.games.cv <- which(games[folds != j, ]$gameType == "REG")
    homeAdv <- 1 - games[folds != j, ]$neutralLocation
    homeAdv.mod.l.cv <- lm(y.l.cv ~ 0 + homeAdv + ., data = X[folds != j, ], 
        subset = reg.season.games.cv)
    homeAdv.mod.lr.cv <- glm(y.lr.cv ~ 0 + homeAdv + ., family = binomial, data = X[folds != 
        j, ], subset = reg.season.games.cv)

    homeAdv.coef.l.cv <- coef(homeAdv.mod.l.cv)[paste("`", all.teams, "`", sep = "")]
    names(homeAdv.coef.l.cv) <- all.teams

    homeAdv.coef.lr.cv <- coef(homeAdv.mod.lr.cv)[paste("`", all.teams, "`", 
        sep = "")]
    names(homeAdv.coef.lr.cv) <- all.teams

    ## Calculate predictions, table data and error rate
    resids.l.cv <- homeAdv.mod.l.cv$resid
    resids.lr.cv <- homeAdv.mod.lr.cv$resid  ## Residuals for the logistic regression deviate from the normal distribution

    sigma.l.cv <- sd(resids.l.cv)
    sigma.lr.cv <- sd(resids.lr.cv)

    error.rate.l.cv <- 0
    error.rate.lr.cv <- 0
    n11 <- 0
    n12 <- 0
    n21 <- 0
    n22 <- 0

    home.wins <- with(games, ifelse(homeScore > awayScore, 1, 0))

    for (i in which(folds == j)) {
        mu.l.cv <- coef(homeAdv.mod.l.cv)["homeAdv"] + homeAdv.coef.l.cv[games$home[i]] - 
            homeAdv.coef.l.cv[games$away[i]]
        mu.lr.cv <- coef(homeAdv.mod.lr.cv)["homeAdv"] + homeAdv.coef.lr.cv[games$home[i]] - 
            homeAdv.coef.lr.cv[games$away[i]]

        if (games$home[i] == "stanford-cardinal") {
            mu.l.cv <- coef(homeAdv.mod.l.cv)["homeAdv"] - homeAdv.coef.l.cv[games$away[i]]
            mu.lr.cv <- coef(homeAdv.mod.lr.cv)["homeAdv"] - homeAdv.coef.lr.cv[games$away[i]]
        }

        if (games$away[i] == "stanford-cardinal") {
            mu.l.cv <- coef(homeAdv.mod.l.cv)["homeAdv"] + homeAdv.coef.l.cv[games$home[i]]
            mu.lr.cv <- coef(homeAdv.mod.lr.cv)["homeAdv"] + homeAdv.coef.lr.cv[games$home[i]]
        }

        ## Get error rate and table data for the linear model
        prob.away.win.l <- pnorm(-mu.l.cv/sigma.l.cv)
        home.win.l <- 0

        if (!is.na(prob.away.win.l)) {

            if (prob.away.win.l < 0.5) {
                home.win.l <- 1
            }

            if (home.wins[i] != home.win.l) {
                error.rate.l.cv <- error.rate.l.cv + 1
            }
        } else {
            print("Missing coefficient: Linear Model.")
        }

        ## Get error rate and table data for the Logistic regression model
        prob.away.win.lr <- pnorm(-mu.lr.cv/sigma.lr.cv)
        home.win.lr <- 0

        if (!is.na(prob.away.win.lr)) {

            if (prob.away.win.lr < 0.5) {
                home.win.lr <- 1
            }

            if (home.wins[i] != home.win.lr) {
                error.rate.lr.cv <- error.rate.lr.cv + 1
            }
        } else {
            print("Missing coefficient: Logistic regression Model.")
        }

        ## Build table ***********************************************

        if (home.win.l == home.wins[i] && home.win.lr == home.wins[i]) {
            n11 <- n11 + 1
        }

        if (home.win.l == home.wins[i] && home.win.lr != home.wins[i]) {
            n12 <- n12 + 1
        }

        if (home.win.l != home.wins[i] && home.win.lr == home.wins[i]) {
            n21 <- n21 + 1
        }

        if (home.win.l != home.wins[i] && home.win.lr != home.wins[i]) {
            n22 <- n22 + 1
        }

    }
    error.rate.l.cv <- error.rate.l.cv/nrow(games[folds == j, ])
    error.rate.lr.cv <- error.rate.lr.cv/nrow(games[folds == j, ])

    cv.error.l[j] <- error.rate.l.cv
    cv.error.lr[j] <- error.rate.lr.cv

    cv.n11[j] <- n11/length(which(folds == j))
    cv.n12[j] <- n12/length(which(folds == j))
    cv.n21[j] <- n21/length(which(folds == j))
    cv.n22[j] <- n22/length(which(folds == j))
    cv.n[j] <- length(which(folds == j))
}
## [1] 1
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] 2
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] 3
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] 4
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] 5
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] 6
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] 7
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] 8
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] 9
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] 10
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
## [1] "Missing coefficient: Linear Model."
## [1] "Missing coefficient: Logistic regression Model."
mean(cv.n11) * mean(cv.n)
## [1] 381.4
mean(cv.n12) * mean(cv.n)
## [1] 27.28
mean(cv.n21) * mean(cv.n)
## [1] 23.64
mean(cv.n22) * mean(cv.n)
## [1] 121.8
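For readability, the four averaged counts above can be arranged into the required 2 × 2 table (a sketch; cont.table is a new name):

# Rows: linear regression right / wrong; columns: logistic regression right / wrong.
cont.table <- round(matrix(c(mean(cv.n11), mean(cv.n12), mean(cv.n21), mean(cv.n22)) * mean(cv.n),
    nrow = 2, byrow = TRUE,
    dimnames = list(linear = c("right", "wrong"), logistic = c("right", "wrong"))))
cont.table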

(e)

Use the normal approximation to carry out a test of the hypothesis that both models are equally good at predicting games. What is the conclusion of your test?

\( n = n_{11} + n_{12} + n_{21} + n_{22} \)

\( p_{12} = n_{12}/n \) \( p_{21} = n_{21}/n \)

\[ H_0: p_{12} = p_{21} \] \[ H_1: p_{12} \neq p_{21} \]

Because \( n_{12} + n_{21} > 25 \), we can use the chi-square (McNemar) test rather than an exact binomial test.

I used a continuity correction of 1:

\[ \chi^2_{df=1} = \frac{(|n_{21} - n_{12}|-1)^2}{n_{21}+n_{12}} = \frac{(|24 - 27|-1)^2}{24 + 27} = 0.0784313 \]

The two-tailed p-value corresponding to this test statistic (chi-square with 1 degree of freedom) is 0.7794. The test is therefore not significant: we fail to reject the null hypothesis, and we conclude that neither model predicts games significantly better than the other.
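For reference, the same calculation in R (a sketch using the rounded counts from part (d)):

# McNemar-style test with continuity correction.
n12 <- 27
n21 <- 24
chisq <- (abs(n21 - n12) - 1)^2/(n21 + n12)
pchisq(chisq, df = 1, lower.tail = FALSE)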