Linear Regression Analysis of NCAA Basketball Data

In this in-class session, we will analyze a data set containing the outcomes of every game in the 2012-2013 regular season and postseason, including the NCAA tournament. There are 5541 games and 347 teams. Our goals are to rank the teams, to build and interpret a linear regression model of game outcomes, and to use that model to predict wins and losses.

1. Loading the Data

First, read in the files games.csv and teams.csv into the data frames games and teams. We will load them “as is” (i.e. strings not converted to factors).

games <- read.csv("http://www.stanford.edu/~wfithian/games.csv",as.is=TRUE)
teams <- read.csv("http://www.stanford.edu/~wfithian/teams.csv",as.is=TRUE)

The games data has an entry for each game played, and the teams data has an entry for each Division 1 team (there are a few non-D1 teams represented in the games data).

head(games)
##         date                          home                   away homeScore awayScore neutralLocation gameType
## 1 2012-11-09            albany-great-danes         duquesne-dukes        69        66               0      REG
## 2 2012-11-11           ohio-state-buckeyes     albany-great-danes        82        60               0      REG
## 3 2012-11-13            washington-huskies     albany-great-danes        62        63               0      REG
## 4 2012-11-17            albany-great-danes         umkc-kangaroos        62        59               1      REG
## 5 2012-11-18            albany-great-danes loyola-(md)-greyhounds        64        67               1      REG
## 6 2012-11-20 south-carolina-state-bulldogs     albany-great-danes        55        83               0      REG

The column neutralLocation is 1 if the game was not a real home game for the nominal home team, and 0 otherwise. gameType is REG for regular-season games, NCAA for NCAA tournament games, and POST for other postseason games (not the NCAA tournament).

The teams data has an entry for each team coding its name, conference, whether it made the NCAA tournament, and its AP and USA Today ranks at the end of the regular season (1-25 or NA if unranked).

head(teams)
##                         team   conference inTourney apRank usaTodayRank
## 1      stony-brook-seawolves america-east         0     NA           NA
## 2         vermont-catamounts america-east         0     NA           NA
## 3 boston-university-terriers america-east         0     NA           NA
## 4             hartford-hawks america-east         0     NA           NA
## 5         albany-great-danes america-east         1     NA           NA
## 6          maine-black-bears america-east         0     NA           NA

Finally, we make one vector containing all the team names, because the three columns do not perfectly overlap:

all.teams <- sort(unique(c(teams$team,games$home,games$away)))

2. How to Rank the Teams?

Now, spend a few minutes to come up with a way to rank the teams, based on all of the regular-season games. Try your method, or a couple of different methods, and compare the ranks you get to the official end-season rankings. Which rankings do you find more credible?

Bonus Question

Think about the following bonus question if you have extra time. I will not be going over the answers to bonus questions in class, but you can ask the circulating staff.

Show Answers

One simple way to rank teams is by comparing their total score to their opponents' total score.

## Function to compute a team's total margin of victory
total.margin <- function(team) {
  with(games,
       sum(homeScore[home==team])
       + sum(awayScore[away==team])
       - sum(homeScore[away==team])
       - sum(awayScore[home==team]))
}

## Compute total margin for each team
margins <- sapply(teams$team, total.margin)
names(margins) <- teams$team

rank.table <- cbind("Margin"      = margins,
                    "Margin Rank" = rank(-margins,ties="min"),
                    "AP Rank"     = teams$apRank,
                    "USAT Rank"   = teams$usaTodayRank)

margin.top25 <- order(margins,decreasing=TRUE)[1:25]
rank.table[margin.top25,]
##                               Margin Margin Rank AP Rank USAT Rank
## florida-gators                   630           1      14        12
## louisville-cardinals             627           2       2         2
## indiana-hoosiers                 595           3       4         5
## gonzaga-bulldogs                 554           4       1         1
## kansas-jayhawks                  490           5       3         3
## syracuse-orange                  469           6      16        18
## weber-state-wildcats             467           7      NA        NA
## michigan-wolverines              465           8      10        11
## virginia-commonwealth-rams       447           9      NA        23
## pittsburgh-panthers              435          10      20        22
## middle-tennessee-blue-raiders    431          11      NA        NA
## duke-blue-devils                 428          12       6         7
## creighton-bluejays               406          13      22        21
## ohio-state-buckeyes              402          14       7         6
## saint-mary's-gaels               381          15      NA        25
## davidson-wildcats                380          16      NA        NA
## ole-miss-rebels                  370          17      NA        NA
## saint-louis-billikens            351          18      13        13
## belmont-bruins                   351          18      NA        NA
## memphis-tigers                   344          20      19        15
## arizona-wildcats                 344          20      21        20
## wichita-state-shockers           335          22      NA        NA
## stephen-f.-austin-lumberjacks    332          23      NA        NA
## miami-(fl)-hurricanes            328          24       5         4
## missouri-tigers                  320          25      NA        NA

Some of the highly-ranked teams look about right (Louisville, Indiana, Gonzaga), but others appear to be rather overrated by this metric; Weber State, for example, went 26-6 in the Big Sky conference and comes in 7th here.

Looking at the page for Weber State's 2012-2013 season, it appears that they benefited from an easy schedule with no ranked opponents, including a 65-point victory over the Southwest Minnesota State Mustangs.

In the next section, we will see how to modify this method to account for strength of schedule.

3. A Linear Regression Model for Ranking Teams

Statistical modeling is a powerful tool for learning from data. Our meta-strategy is to define some statistical model whose parameters correspond to whatever quantities we want to estimate.

Our response variable for today is the margin of victory (or defeat) for the home team in a particular game. That is, define \[ \begin{equation} y_i = \text{home score minus away score in game } i \end{equation} \]

Now, we want to define a linear regression model that explains the response, \( y_i \), in terms of both teams' merits. The simplest such model will look something like \[ \begin{equation} y_i = \text{quality of home} (i) - \text{quality of away} (i) + \text{noise} \end{equation} \] where \( \text{home}(i) \) and \( \text{away}(i) \) are the home and away teams for game \( i \).

To formulate this model as a linear regression in standard form, we need to come up with a definition for the predictors \( x \) and coefficients \( \beta \) so that estimating \( \beta \) amounts to estimating the quality of each team. That is, we want a definition for \( x_{ij} \) and \( \beta_j \) for which \[ \begin{equation} y_i = \sum_j x_{ij} \beta_j + \varepsilon_i \end{equation} \]

Now, with your group, try to formulate our model as a linear regression. How many predictor variables are there? How many coefficients? How is \( x_{ij} \) defined?

Bonus Questions

Show Answers

Define one predictor variable for each team, which is a sort of “signed dummy variable.” For game \( i \) and team \( j \), let \[ \begin{equation} x_{ij} = \left\{\begin{matrix} +1 & j \text{ is home}(i)\\ -1 & j \text{ is away}(i) \\ 0 & j\text{ didn't play}\end{matrix}\right. \end{equation} \] For example, if game \( i \) consists of team 1 visiting team 2, then \( x_i = (-1,1,0,0,\ldots,0) \).

Now we can check that \[ \begin{equation} \sum_j x_{ij} \beta_j = \beta_{\text{home}(i)} - \beta_{\text{away}(i)} \end{equation} \] so the coefficient \( \beta_j \) corresponds exactly to the quality of team \( j \) in our model.

We can generate \( y \) and the full predictor matrix (called X0 below) as follows:

y <- with(games, homeScore-awayScore)

## Construct a data frame of the right dimensions, with all zeros
X0 <- as.data.frame(matrix(0,nrow(games),length(all.teams)))
names(X0) <- all.teams

## Fill in the columns, one by one
for(tm in all.teams) {
  X0[[tm]] <- 1*(games$home==tm) - 1*(games$away==tm)
}
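
As a quick sanity check of the construction, the first game in games (albany-great-danes hosting duquesne-dukes, from the head(games) output above) should get +1 and -1 in the corresponding columns:

## Row 1 of X0 should be +1 for the home team and -1 for the away team
X0[1, c("albany-great-danes", "duquesne-dukes")]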

Now, we are in good shape, because we can use all our tools from linear regression to estimate \( \beta \), carry out hypothesis tests, construct confidence intervals, etc.

4. An Identifiability Problem

When we fit our model, we will ask R to find the best-fitting \( \beta \) vector. There is a small problem, however: for any candidate value of \( \beta \), there are infinitely many other values \( \tilde\beta \) that make exactly the same predictions. So the “best \( \beta \)” is not uniquely defined.

For any constant \( c \), suppose that I redefine \( \tilde \beta_j = \beta_j + c \). Then for every game \( i \), \[ \begin{equation} \tilde \beta_{\text{home}(i)} - \tilde \beta_{\text{away}(i)} = \beta_{\text{home}(i)} - \beta_{\text{away}(i)} \end{equation} \] so the distribution of \( y \) is identical for parameters \( \tilde\beta \) and \( \beta \), no matter what \( c \) is. We can never distinguish these two models from each other, because the models make identical predictions no matter what. In statistical lingo, this is called an identifiability problem. It very often arises with dummy variables.
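
We can see this concretely with the X0 matrix built above: every game row contains exactly one +1 and one -1, so adding the same constant to every coefficient cancels out of the fitted values.

## Each row of X0 has one +1 (home) and one -1 (away), so for any constant c,
## X0 %*% (beta + c) = X0 %*% beta + c * rowSums(X0) = X0 %*% beta
all(rowSums(X0) == 0)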

To fix it, we need to add some linear constraint on \( \beta \) to resolve the ambiguity. For example \( \sum_j \beta_j = 0 \), or we can pick some “special” baseline team \( j \) and require that \( \beta_j = 0 \). We will use the latter strategy, and we will take Stanford's team as the baseline. Now, with your team, figure out how to modify the \( X \) matrix to implement this.

(Actually, lm is smart enough to fix this automatically for you by arbitrarily picking one team to be the baseline, but let's not blindly rely on that.)

Bonus Questions

Suppose that we had chosen a different team as our baseline.

Under what circumstances would we still have an identifiability problem even after constraining \( \beta_{\text{Stanford}}=0 \)?

Show Answers

We can effectively force \( \beta_j = 0 \) for the \( j \) corresponding to Stanford by eliminating that column from the predictor matrix.

X <- X0[,names(X0) != "stanford-cardinal"]
reg.season.games <- which(games$gameType=="REG")

Now, let's fit our model. There is no intercept in the model, so we explicitly exclude it from the formula.

mod <- lm(y ~ 0 + ., data=X, subset=reg.season.games)
head(coef(summary(mod)))
##                         Estimate Std. Error t value Pr(>|t|)
## `air-force-falcons`        -6.69       2.90   -2.31 2.10e-02
## `akron-zips`               -2.56       2.84   -0.90 3.68e-01
## `alabama-a&m-bulldogs`    -30.59       2.91  -10.52 1.34e-25
## `alabama-crimson-tide`     -2.85       2.80   -1.02 3.09e-01
## `alabama-state-hornets`   -29.96       2.85  -10.51 1.37e-25
## `albany-great-danes`      -13.33       2.82   -4.73 2.29e-06
summary(mod)$r.squared
## [1] 0.551

It looks like our model explains about half of the variability in basketball scores.

5. Interpreting the Model

Next, let's try to interpret the model that we just fit. With your group, take a few moments to answer the following questions:

5.1. Based on this model, what would be a reasonable point spread if Alabama (alabama-crimson-tide) played against Air Force?

5.2. What would be a reasonable point spread if Alabama played against Stanford?

5.3. Can we be confident that Stanford is better than Alabama? Better than Air Force?

5.4. How can we test whether Alabama is better than Air Force?

Show Answers

5.1. If Alabama were the home team, then the expected score difference is \[ \begin{equation} \hat\beta_{\text{Alabama}} - \hat\beta_{\text{Air Force}} \end{equation} \] So we can answer the question by comparing their coefficients:

coef(mod)["`alabama-crimson-tide`"] - coef(mod)["`air-force-falcons`"]
## `alabama-crimson-tide` 
##                   3.84

so a fair point spread would be about 3.84 in favor of Alabama.

5.2. To answer this, we need to use the fact that \( \hat\beta_{\text{Stanford}} \) is 0 by definition. Hence, the expected score difference is \( \hat\beta_{\text{Alabama}} - 0 \), or -2.85 (a 2.85 point spread in favor of Stanford).
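
In code, that expected margin is just Alabama's coefficient, since Stanford's is fixed at zero:

coef(mod)["`alabama-crimson-tide`"]
## `alabama-crimson-tide` 
##                  -2.85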

5.3. Since \( \beta_{\text{Stanford}} = 0 \), the hypothesis that Stanford is better than Alabama is equivalent to the hypothesis that \( \beta_{\text{Alabama}} < 0 \) (and similarly for Air Force). If we use a two-sided test, the t statistic needs to be bigger than 2 in absolute value. It is not for Alabama (t = -1.02), but it is for Air Force (t = -2.31); hence the answers are "no" for Alabama and "yes" for Air Force.

5.4. We cannot answer this question based only on the coefficient table printed above. To test the null hypothesis that \[ \theta = \beta_{\text{Alabama}} - \beta_{\text{Air Force}} = 0 \] we would need to know the standard error of \[ \hat\theta = \hat\beta_{\text{Alabama}} - \hat\beta_{\text{Air Force}} \]

The easiest way to find this out would be to simply re-fit the model with Alabama (or Air Force) as the baseline.
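
For instance, here is a minimal sketch of that refit, assuming X0, y, and reg.season.games from above are still in the workspace:

## Drop Alabama's column instead of Stanford's, making Alabama the baseline
X.alabama <- X0[, names(X0) != "alabama-crimson-tide"]
mod.alabama <- lm(y ~ 0 + ., data=X.alabama, subset=reg.season.games)

## Air Force's coefficient now estimates beta_AirForce - beta_Alabama, and its
## t statistic tests the null hypothesis that the two teams are equally good
coef(summary(mod.alabama))["`air-force-falcons`", ]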

6. Home Court Advantage

In a moment, we will look at our model's rankings. But first, let's check for whether we should have included a “home-court advantage” term in our model. With your team, carry out a t-test for the hypothesis that there is no home-court advantage, and report a 95% confidence interval for the number of points that home-court advantage is worth.

Show Answers

Define the variable \( h_i \), a binary indicator whose value is 1 if game \( i \) is not played at a neutral location (so the home team enjoys a home-court advantage), and 0 otherwise. Then we can modify our model as follows: \[ \begin{equation} y_i = \beta_0 h_i + \beta_{\text{home}(i)} - \beta_{\text{away}(i)} + \varepsilon_i \end{equation} \]

homeAdv <- 1 - games$neutralLocation
homeAdv.mod <- lm(y ~ 0 + homeAdv + ., data=X, subset=reg.season.games)
homeAdv.coef <- coef(homeAdv.mod)[paste("`",teams$team,"`",sep="")]
names(homeAdv.coef) <- teams$team

head(coef(summary(homeAdv.mod)))
##                         Estimate Std. Error t value  Pr(>|t|)
## homeAdv                    3.528      0.157  22.492 8.65e-107
## `air-force-falcons`       -5.299      2.762  -1.919  5.51e-02
## `akron-zips`              -0.967      2.708  -0.357  7.21e-01
## `alabama-a&m-bulldogs`   -27.144      2.776  -9.778  2.23e-22
## `alabama-crimson-tide`    -2.611      2.671  -0.978  3.28e-01
## `alabama-state-hornets`  -26.340      2.720  -9.684  5.52e-22
homeAdv.estimate <- coef(homeAdv.mod)["homeAdv"]
homeAdv.se <- coef(summary(homeAdv.mod))["homeAdv",2]
(homeAdv.CI <- c("CI Lower"=homeAdv.estimate - 2*homeAdv.se,
                 "CI Upper"=homeAdv.estimate + 2*homeAdv.se))
## CI Lower.homeAdv CI Upper.homeAdv 
##             3.21             3.84

We see from the above that the t statistic for \( \beta_0 \) is 22.5, which is extremely significant. The confidence interval is \( \beta_0 \pm 2\text{s.e.}(\beta_0) \), roughly \( [3.2, 3.8] \), so home-court advantage is worth about 3.5 points.

7. Predicting Wins and Losses

If we wanted to run a bookie business, we would not only have to set point spreads, but also set odds in advance of each game. In this section we will see that our model can give us odds as well as point spreads.

Our model assumed normally distributed errors, so we should check that assumption. Let's examine the residuals from the last model (the one with home-court advantage) and see whether they look reasonably normal.

resids <- homeAdv.mod$resid
par(mfrow=c(1,2))
hist(resids,freq=FALSE)
curve(dnorm(x,mean(resids),sd(resids)),add=TRUE,col="blue",lwd=2)
qqnorm(resids)

[Figure: histogram of the residuals with the fitted normal density overlaid (left), and a normal Q-Q plot of the residuals (right).]

Since our normal-errors assumption passes this basic sanity check, let's take our model seriously. Assuming the errors are truly normal, how can we use our model to predict the win/loss outcome of a particular game?

Bonus Question

You can use this function to get data in a slightly more manageable form:

schedule <- function(team, game.type) {
  ## Games where the team was at home: date, opponent (away team), own score, opponent's score
  home.sch <- with(games, games[home==team & gameType==game.type, c(1,3,4,5)])
  ## Games where the team was away: same columns, with the two scores swapped
  away.sch <- with(games, games[away==team & gameType==game.type, c(1,2,5,4)])
  names(home.sch) <- names(away.sch) <- c("date","opponent","score","oppoScore")
  sch <- rbind(home.sch, away.sch)

  sch$margin <- with(sch, score - oppoScore)
  ## Opponent quality, taken from the home-advantage model's coefficients
  sch$oppoQuality <- homeAdv.coef[as.character(sch$opponent)]

  sch <- sch[order(sch$date),]
  rownames(sch) <- NULL
  return(sch)
}
schedule("wichita-state-shockers","NCAA")
##         date             opponent score oppoScore margin oppoQuality
## 1 2013-03-21  pittsburgh-panthers    73        55     18       7.356
## 2 2013-03-23     gonzaga-bulldogs    76        70      6       9.988
## 3 2013-03-28   la-salle-explorers    72        58     14      -0.773
## 4 2013-03-30  ohio-state-buckeyes    70        66      4       7.897
## 5 2013-04-06 louisville-cardinals    68        72     -4      12.081

Show Answers

Our model says that \( y_i \sim N(\mu_i, \sigma^2) \), where \( \sigma^2 \) is the noise variance and \( \mu_i \) is the home-away quality difference (possibly adjusted for home-court advantage).

To calculate the probability the home team loses \( (y_i < 0) \), we can exploit the fact that \( (y_i - \mu_i)/\sigma \sim N(0,1) \). If \( \Phi \) denotes the cumulative distribution function of a \( N(0,1) \) variable, then: \[ \begin{align} P(y_i < 0) &= P\left( \frac{y_i - \mu_i}{\sigma} < -\mu_i / \sigma \right)\\ &= \Phi( -\mu_i / \sigma ) \end{align} \]

We can estimate \( \sigma \) with the standard deviation of the residuals (technically we should make an adjustment for degrees of freedom, but let's ignore that).

(sigma <- sd(resids))
## [1] 9.82
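
If we did want the degrees-of-freedom adjustment, it is the residual standard error that summary(lm) already reports; because the model spends several hundred degrees of freedom on team coefficients, it comes out slightly larger than sd(resids):

## Degrees-of-freedom-adjusted estimate of sigma
summary(homeAdv.mod)$sigma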

So, for example, we can compute the probability that Stanford would beat Berkeley (a slightly worse team, according to our model), at Berkeley:

mu <- coef(homeAdv.mod)["homeAdv"] + coef(homeAdv.mod)["`california-golden-bears`"] - 0
pnorm(-mu/sigma)
## homeAdv 
##    0.42

According to our model, that game would be roughly a toss-up.

Next, let's find out how improbable Wichita State's run was. Note that all the tournament games were played at neutral locations.

wst.schedule <- schedule("wichita-state-shockers","NCAA")
mu <- wst.schedule$oppoQuality - homeAdv.coef["wichita-state-shockers"]
names(mu) <- wst.schedule$opponent
(p.wst.win <- pnorm(-mu/sigma))
##  pittsburgh-panthers     gonzaga-bulldogs   la-salle-explorers  ohio-state-buckeyes louisville-cardinals 
##                0.292                0.207                0.610                0.273                0.152

So the probability that Wichita State would have won its first four games (conditional on the opponents it faced) is about 1%.

prod(p.wst.win[1:4])
## [1] 0.01

By contrast, the probability of Louisville winning the whole tournament (conditional on their opponents) was about 25%:

lou.schedule <- schedule("louisville-cardinals","NCAA")
mu <- lou.schedule$oppoQuality - homeAdv.coef["louisville-cardinals"]
prod(pnorm(-mu/sigma))
## [1] 0.251

8. Model Rankings vs. Official Rankings

Finally, let's inspect the rankings given by our model.

rank.table <- cbind("Model Score" = homeAdv.coef,
                    "Model Rank"  = rank(-homeAdv.coef,ties="min"),
                    "AP Rank"     = teams$apRank,
                    "USAT Rank"   = teams$usaTodayRank)
rank.table[order(homeAdv.coef,decreasing=TRUE)[1:25],]
##                            Model Score Model Rank AP Rank USAT Rank
## indiana-hoosiers                 13.26          1       4         5
## florida-gators                   12.64          2      14        12
## louisville-cardinals             12.08          3       2         2
## gonzaga-bulldogs                  9.99          4       1         1
## duke-blue-devils                  9.39          5       6         7
## kansas-jayhawks                   8.63          6       3         3
## ohio-state-buckeyes               7.90          7       7         6
## michigan-wolverines               7.51          8      10        11
## pittsburgh-panthers               7.36          9      20        22
## syracuse-orange                   7.07         10      16        18
## wisconsin-badgers                 6.83         11      18        17
## michigan-state-spartans           6.18         12       9         9
## creighton-bluejays                5.28         13      22        21
## miami-(fl)-hurricanes             5.19         14       5         4
## virginia-commonwealth-rams        4.96         15      NA        23
## arizona-wildcats                  4.68         16      21        20
## minnesota-golden-gophers          4.42         17      NA        NA
## georgetown-hoyas                  4.26         18       8         8
## missouri-tigers                   3.96         19      NA        NA
## oklahoma-state-cowboys            3.69         20      17        19
## saint-mary's-gaels                3.37         21      NA        25
## colorado-state-rams               3.15         22      NA        NA
## new-mexico-lobos                  3.00         23      11        10
## north-carolina-tar-heels          2.72         24      NA        NA
## ole-miss-rebels                   2.64         25      NA        NA

Our rankings still differ noticeably from the official rankings for several teams. To take one particularly glaring example, our model is highly confident that florida-gators is an elite team, despite its relatively low ranking by the press. By contrast, our model doesn't think too much of miami-(fl)-hurricanes, despite the press's opinion that the 'Canes were elite.

See if you can figure out why our model might beg to differ with the press.

Show Answers

Looking at the two teams' schedules, we see that Florida's 7 losses were all close (only one was by more than 6 points), whereas all 26 of their wins were by at least 10 points.

Miami had a better overall win-loss record (27-6) against a somewhat stronger group of opponents, and finished the season strong by winning the ACC tournament. However, 10 of their wins were by fewer than 10 points, and they lost three games by wide margins.
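
These claims are easy to check with the schedule() helper defined above, for example:

## Florida: margins in losses, and the smallest winning margin
fl.sch <- schedule("florida-gators", "REG")
fl.sch$margin[fl.sch$margin < 0]
min(fl.sch$margin[fl.sch$margin > 0])

## Miami: margins in losses, and the number of single-digit wins
mia.sch <- schedule("miami-(fl)-hurricanes", "REG")
mia.sch$margin[mia.sch$margin < 0]
sum(mia.sch$margin > 0 & mia.sch$margin < 10)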

Below, we plot each team's margin of victory distribution for its 33 regular season games.

hist(schedule("miami-(fl)-hurricanes", "REG")$margin,
     col="gray",border="gray",breaks=seq(-20,50,5), 
     xlab="Margin of Victory",
     main="Margins of Victory for Miami (gray) and Florida (red)",)
hist(schedule("florida-gators", "REG")$margin,add=TRUE,
     border="red",breaks=seq(-20,50,5),col="red",density=20)
abline(v=0)

[Figure: overlaid histograms of margins of victory for Miami (gray) and Florida (red), with a vertical line at zero.]

Miami did have slightly better opponents, but not by much:

mean(schedule("miami-(fl)-hurricanes", "REG")$oppoQuality)
## [1] -3.84
mean(schedule("florida-gators", "REG")$oppoQuality)
## [1] -4.87

Notice that in our model, going from a -1 margin of victory to a +1 margin has the same impact on a team's rating as going from +6 to +8, or from +40 to +42. So, we might be giving Florida a little too much credit for running up the score against its opponents.
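
One quick way to probe this (a purely illustrative sketch, not part of the original analysis) is to cap the margin of victory at, say, 15 points before refitting, so that blowout wins count no more than comfortable ones, and then see where Florida lands:

## Illustrative only: refit with margins capped at +/- 15 points
y.capped <- pmax(pmin(y, 15), -15)
capped.mod <- lm(y.capped ~ 0 + homeAdv + ., data=X, subset=reg.season.games)
capped.coef <- coef(capped.mod)[paste("`", teams$team, "`", sep="")]
names(capped.coef) <- teams$team
rank(-capped.coef, ties="min")["florida-gators"]

If Florida's rank drops noticeably under the capped response, that suggests a good part of its high rating comes from lopsided wins.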

Conversely, sports journalists typically care a lot about whether a team wins or loses, because they believe that scoring a buzzer-beater to win a game proves that a team possesses the “heart of a champion.” This elusive quality is thought to outweigh the more prosaic ability to score a lot of points.

9. The Value of Nerlens Noel

Nerlens Noel was a star center for the University of Kentucky (kentucky-wildcats in our data set). He was projected to be the #1 overall pick in the 2013 NBA draft before he tore his ACL in a February 12 game (and was eventually the #6 overall pick despite his injury).

Statistical analysts employed by sports teams are especially interested in evaluating individual players. There are various methods, usually requiring granular minute-by-minute data on scoring and substitutions. We do not have such granular data at our disposal, but for an injured player like Noel, we can estimate how much better Kentucky was with him than without him.

Re-fit your model with one additional predictor variable, to get an estimate and confidence interval for Noel's contribution to Kentucky (i.e., how much better Kentucky is with him than without him).

Bonus Questions

Show Answers

Let \( n_i \) be +1 if Noel played for the home team (i.e., all UK home games before February 12), -1 if he played for the away team (UK away games before Feb 12), and 0 if he did not play. Then we can simply fit the model: \[ \begin{equation} y_i = \beta_{-1} n_i + \beta_0 h_i + \beta_{\text{home}(i)} - \beta_{\text{away}(i)} + \varepsilon_i \end{equation} \]

nerlens <- rep(0,nrow(games))
nerlens[games$home=="kentucky-wildcats" & games$date<"2013-02-12"] <- 1
nerlens[games$away=="kentucky-wildcats" & games$date<"2013-02-12"] <- -1
nerlens.mod <- lm(y ~ 0 + homeAdv + nerlens + ., data=X, subset=reg.season.games)
head(coef(summary(nerlens.mod)))
##                        Estimate Std. Error t value  Pr(>|t|)
## homeAdv                   3.526      0.157  22.494 8.29e-107
## nerlens                  11.450      4.103   2.791  5.28e-03
## `air-force-falcons`      -5.307      2.760  -1.923  5.46e-02
## `akron-zips`             -0.937      2.707  -0.346  7.29e-01
## `alabama-a&m-bulldogs`  -27.189      2.774  -9.801  1.78e-22
## `alabama-crimson-tide`   -2.659      2.669  -0.996  3.19e-01
nerlens.estimate <- coef(nerlens.mod)["nerlens"]
nerlens.se <- coef(summary(nerlens.mod))["nerlens",2]
(nerlens.CI <- c("CI lower"=nerlens.estimate - 2*nerlens.se,
                 "CI upper"=nerlens.estimate + 2*nerlens.se))
## CI lower.nerlens CI upper.nerlens 
##             3.24            19.66

So Noel was worth about 11 points to Kentucky, but the confidence interval is wide (roughly 3 to 20 points), so we cannot place too much confidence in this figure.

coef(nerlens.mod)[c("nerlens","`kentucky-wildcats`")]
##             nerlens `kentucky-wildcats` 
##               11.45               -5.99

If we had minute-by-minute data, we could assign a coefficient \( \beta_j \) to each player. Then, letting \( y_i \) be the home-away scoring margin for minute \( i \), we could set \[ \begin{equation} X_{ij} = \left\{\begin{matrix} +1 & j \text{ played for the home team in minute } i\\ -1 & j \text{ played for the away team in minute } i \\ 0 & j\text{ didn't play in minute } i\end{matrix}\right. \end{equation} \]

This is known as the adjusted plus/minus model, and it is one of the tools used by statisticians in the NBA to evaluate players.
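
We have no minute-level data in this session, but as a purely hypothetical sketch, the design matrix could be built much like X0 above. Here minutes is an assumed data frame with list-columns homePlayers and awayPlayers (the players on the floor each minute) and a margin column (home points minus away points in that minute), and all.players is an assumed vector of player IDs:

## Hypothetical sketch only: `minutes` and `all.players` are not in our data set
Xp <- matrix(0, nrow(minutes), length(all.players),
             dimnames=list(NULL, all.players))
for (i in seq_len(nrow(minutes))) {
  Xp[i, minutes$homePlayers[[i]]] <-  1   # +1 for each home player on the floor
  Xp[i, minutes$awayPlayers[[i]]] <- -1   # -1 for each away player on the floor
}
## Fit as before; as with teams, we would need a baseline player (or some other
## constraint) to make the coefficients identifiable
apm.mod <- lm(minutes$margin ~ 0 + ., data=as.data.frame(Xp))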