In this project, I model the personal fouls (PF) of an NBA player as a function of steals (STL), blocks (BLK), and minutes played (MIN).

Linear Regression Model is used in both frequentist and bayesian aprroach.

require(coda)
## Loading required package: coda
setwd("F:/Academics/2016 Spring/Bayesian Statistics")
source("Bayesfunctions.r")  
nba <- read.csv(file="NBA2015Data.csv", header=TRUE)
dim(nba)
## [1] 492  28

Frequentist analysis

(1) Estimate the regression coeffcients and provide the associated standard errors.

fit <- lm(PF ~ STL+BLK+MIN, data=nba)
summary(fit)
## 
## Call:
## lm(formula = PF ~ STL + BLK + MIN, data = nba)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.69405 -0.30945 -0.03146  0.27926  1.53612 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.61890    0.05135  12.053   <2e-16 ***
## STL          0.12737    0.07484   1.702   0.0894 .  
## BLK          0.51324    0.05129  10.006   <2e-16 ***
## MIN          0.04143    0.00373  11.108   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4661 on 488 degrees of freedom
## Multiple R-squared:  0.5762, Adjusted R-squared:  0.5736 
## F-statistic: 221.1 on 3 and 488 DF,  p-value: < 2.2e-16

(2) Calculate 95% conffidence intervals for the regression coeffcients and interpret these intervals.

confint.lm(fit)
##                   2.5 %     97.5 %
## (Intercept)  0.51800734 0.71978727
## STL         -0.01967918 0.27441343
## BLK          0.41245597 0.61402238
## MIN          0.03410368 0.04876113

[1] Beta2
We are 95% confident that if a player have one more steal per game, his personal foul per game will change as low as decreasing by -0.01967918, and as high as increasing by 0.27441343.

[2] Beta3
We are 95% confident that if a player have one more block per game, his personal foul per game will increase by as low as 0.41245597, and as high as 0.61402238.

[3] Beta4
We are 95% confident that if a player plays one more minute per game, his personal foul per game will increase by as low as 0.03410368, and as high as 0.04876113.

(3) Predict the number of personal fouls (per game) for a player that averages 1 steal, 1 block, and 25 minutes a game. Also, provide the corresponding 95% prediction interval.

new <- data.frame(STL=1, BLK=1, MIN=25)
predict.lm(fit, newdata=new,interval="prediction")
##        fit      lwr      upr
## 1 2.295314 1.376144 3.214484

A player that averages 1 steal, 1 block, and 25 minutes a game is perdicted to get 2.295314 per game. And we are 95% confident that his personal foul per game is from 1.376144 to 3.214484.

(4) Test the model assumptions. Discuss any issues with the model.

shapiro.test(nba$PF)
## 
##  Shapiro-Wilk normality test
## 
## data:  nba$PF
## W = 0.99538, p-value = 0.1532
par(mfrow=c(2,2)) 
plot(fit)

Shapiro test and QQ-plot show that personal foul is approximately normal, in accordance with the assumption.
There is a slight “^” pattern in the residual plot, showing that variance may not be constant for all the data. Transformation may make the model better.

Bayesian analysis

(1) Find appropriate prior hyper-parameters using the method discussed in the class notes. You should talk to someone who knows about NBA basketball. (If you think you are a basketball “expert” then you can use your own prior beliefs.)

First, I believe that STL, BLK and MIN are all positively associated with PF.
For the following four players, my believes are as follows:
(1) For players with 2 steals, 1 block, 35 minutes per game (similar with Kawhi Leonard), the mean of their personal fouls per game is 2.8; 95% percentle of their personal fouls per game is 3.5.
(2) For players with 1 steal, 2 blocks, 35 minutes per game (similar with DeAndre Jordan), the mean of their personal fouls per game is 3; 95% percentle of their personal fouls per game is 3.8.
(3) For players with 0.5 steal, 0.5 block, 35 minutes per game (average defensive ability), the mean of their personal fouls per game is 2; 95% percentle of their personal fouls per game is 2.6.
(4) For players with 0.1 steal, 0.1 block, 25 minutes per game (bad defensive ability), the mean of their personal fouls per game is 1; 95% percentle of their personal fouls per game is 1.5.

Priors for beta

m <- rep(NA, 4)  # Y.tilde in the book
D <- diag(rep(NA, 4)) # D(w.tilde) in the book

find.normal(prior.mean=2.8, percentile=3.5, p=0.95)
## $mean
## [1] 2.8
## 
## $sd
## [1] 0.4255698
## 
## $precision
## [1] 5.521517
find.normal(prior.mean=3, percentile=3.8, p=0.95)
## $mean
## [1] 3
## 
## $sd
## [1] 0.4863655
## 
## $precision
## [1] 4.227412
find.normal(prior.mean=2, percentile=2.6, p=0.95)
## $mean
## [1] 2
## 
## $sd
## [1] 0.3647741
## 
## $precision
## [1] 7.515398
find.normal(prior.mean=1, percentile=1.5, p=0.95)
## $mean
## [1] 1
## 
## $sd
## [1] 0.3039784
## 
## $precision
## [1] 10.82217
m <- c(2.8, 3, 2, 1)
D <- diag( c( (0.3039784)^2, (0.4863655)^2, (0.3647741)^2, (0.3039784)^2 ) )

player1 <- c(1, 2, 1, 35)
player2 <- c(1, 1, 2, 35)
player3 <- c(1, 0.5, 0.5, 35)
player4 <- c(1, 0.1, 0.1, 25)
X.tilde <- rbind(player1, player2, player3, player4)
X.tilde
##         [,1] [,2] [,3] [,4]
## player1    1  2.0  1.0   35
## player2    1  1.0  2.0   35
## player3    1  0.5  0.5   35
## player4    1  0.1  0.1   25
qr(X.tilde)$rank # rank of X.tilde
## [1] 4
prior.mean.beta <- solve(X.tilde) %*% m 
prior.mean.beta
##        [,1]
## [1,] -0.690
## [2,]  0.350
## [3,]  0.550
## [4,]  0.064
prior.cov.beta <- solve(X.tilde) %*% D %*% t(solve(X.tilde)) 
prior.cov.beta  
##             [,1]         [,2]        [,3]         [,4]
## [1,]  2.73672363  0.230776180  0.29564302 -0.098638221
## [2,]  0.23077618  0.100026112 -0.02841389 -0.009517496
## [3,]  0.29564302 -0.028413889  0.17210038 -0.012400467
## [4,] -0.09863822 -0.009517496 -0.01240047  0.003663589
coef.names <- c("Intercept", "Steals", "Blocks", "Minutes")
beta.prior.sample <- matrix(NA, nrow=1e5, ncol=4)

par(mfrow=c(2, 2))

for(j in 1:4){
beta.prior.sample[ , j] <- rnorm(1e5, mean=prior.mean.beta[j], sd=sqrt(prior.cov.beta[j,j]))
plot(density(beta.prior.sample[, j]), type="l", main=substitute(paste("Prior Density of ", beta[a]), list(a=coef.names[j])), xlab=substitute(beta[a], list(a=coef.names[j])), ylab="density")
}

Prior for tau (sigma)

For players with 0.5 steal, 0.5 block, 35 minutes per game (average defensive ability), the mean of their personal fouls per game is 2; 95% percentle of their personal fouls per game is 2.6; 95% percentile of that percentile is 3.**

normal.percentile.to.sd(mean.value=2, percentile=2.6, p=0.95) # Returns prior mode guess for sigma
## $sd
## [1] 0.3647741
normal.percentile.to.sd(mean.value=2, percentile=3,  p=0.95) # Returns prior percentile guess for sigma
## $sd
## [1] 0.6079568
gamma.parameters <- find.tau.gamma(prior.sigma.mode=0.3647741, sigma.percentile=0.6079568, p=0.95) # Returns shape and rate parameters for the Gamma distribution of tau
gamma.parameters
## $a
## [1] 5.69
## 
## $b
## [1] 0.8901724
tau.prior.shape <- gamma.parameters$a 
tau.prior.rate <-  gamma.parameters$b 

par(mfrow=c(1, 2))
plot(density(rgamma(10000, shape=tau.prior.shape, rate=tau.prior.rate)), main=expression(paste("Prior Density of ", tau)), xlab=expression(tau), ylab="density")
plot(density(1/sqrt(rgamma(10000, shape=tau.prior.shape, rate=tau.prior.rate ))), main=expression(paste("Prior Density of ", sigma)), xlab=expression(sigma), ylab="density")

(2) Using the Gibbs sampler, obtain 50,000 MCMC iterations after discarding 100,000 as burn-in.

X <- model.matrix(fit)
y <- nba$PF
beta.hat <- solve(t(X)%*%X) %*% t(X) %*% y

MCMC <- Gibbs.lm(net.iterations=50000, n.thin=1, burn.in=100000, 
                 start.beta = beta.hat, X=X, y=y, a=tau.prior.shape, 
                 b=tau.prior.rate, prior.mean.beta, prior.cov.beta)

(3) Provide a summary of the Bayesian results and compare it to the frequentist summary.

colnames(MCMC$beta) <- names(coef(fit))

beta.summary <- round(t(apply(MCMC$beta, MARGIN=2, FUN=summary.Bayes)), digits=3) # apply the summary.Bayes function to each regression coefficient
beta.summary
##                Min    Q1 Median  Mean    Q3   Max    SD HPD-Lower
## (Intercept)  0.406 0.592  0.626 0.626 0.661 0.836 0.051     0.526
## STL         -0.154 0.118  0.165 0.165 0.213 0.448 0.071     0.028
## BLK          0.318 0.488  0.522 0.522 0.556 0.747 0.050     0.426
## MIN          0.025 0.037  0.039 0.039 0.042 0.055 0.004     0.032
##             HPD-Upper Level
## (Intercept)     0.725  0.95
## STL             0.306  0.95
## BLK             0.623  0.95
## MIN             0.046  0.95
round(summary.Bayes(MCMC$sigma, HPD.p=0.95), digits=3)
##       Min        Q1    Median      Mean        Q3       Max        SD 
##     0.410     0.455     0.465     0.465     0.475     0.534     0.015 
## HPD-Lower HPD-Upper     Level 
##     0.437     0.494     0.950
# compared to frequentist
summary(fit)
## 
## Call:
## lm(formula = PF ~ STL + BLK + MIN, data = nba)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.69405 -0.30945 -0.03146  0.27926  1.53612 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.61890    0.05135  12.053   <2e-16 ***
## STL          0.12737    0.07484   1.702   0.0894 .  
## BLK          0.51324    0.05129  10.006   <2e-16 ***
## MIN          0.04143    0.00373  11.108   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4661 on 488 degrees of freedom
## Multiple R-squared:  0.5762, Adjusted R-squared:  0.5736 
## F-statistic: 221.1 on 3 and 488 DF,  p-value: < 2.2e-16

The mean of regression coeffients for Intercept, STL, BLK and MIN are 0.626, 0.166, 0.522, 0.039, respectively. They are quite close to the ones by frequentist approach (0.61890, 0.12737, 0.51324, 0.04143).
The mean of sigma is 0.465, quite close to the one by frequentist approach (0.4661).

(4) Plot the posterior densities of the regression coeffcients. Interpret the plots. Plot also the posterior density of sigma.

posterior densities of the regression coeffcients

coef.names <- c("Intercept", "Steals", "Blocks", "Minutes")

par(mfrow=c(2, 2))
#marginal posterior for each beta_j
for(j in 1:4){
  plot(density(MCMC$beta[, j]), type="l", main=substitute(paste("Posterior Density of ", beta[a]),  list(a=coef.names[j])), xlab=substitute(beta[a], list(a=coef.names[j])), ylab="density")
  lines(HPDinterval(as.mcmc(MCMC$beta[,j])), c(0,0), col="red", cex=2, lty=3)  # add PI line
  legend("topright", legend="95% HPD interval", lty=2, col="red", cex=0.4)
}

These four plots shows posterior densities of the regression coeffcients for Intercept(topleft), Steals(Topright), Blocks(Bottomleft) and Minutes(Bottomright), respectively.
Their 95% HPD inteval are shown as red lines.

[1] Beta2
There is 95% probabililty that if a player have one more steal per game, his personal foul per game will increase by as low as 0.028, and as high as 0.307.

[2] Beta3
There is 95% probabililty that if a player have one more block per game, his personal foul per game will increase by as low as 0.423, and as high as 0.621.

[3] Beta4
There is 95% probabililty that if a player plays one more minute per game, his personal foul per game will increase by as low as 0.032, and as high as 0.046.

posterior density of sigma

par(mfrow=c(1,1))
# marginal posterior for each sigma
plot(density(MCMC$sigma), type="l", main=expression(paste("Posterior Density of ", sigma)), xlab=expression(sigma), ylab="density")
lines(HPDinterval(as.mcmc(MCMC$sigma)), c(0,0), col="red", cex=2, lty=3)  # add PI line
legend("topright", legend="95% HPD interval", lty=2, col="red", cex=0.7)

There is 95% probabililty that the standard deviation of players’ personal foul per game is as low as 0.437, and as high as 0.495.

(5) Plot the predictive density of personal fouls (per game) for a player that averages 1 steal, 1 block, and 25 minutes a game. Also, provide a numerical summary that includes an approximate 95% probability interval.

x.f <- c(1, 1, 1, 25)  # covariates of new person
pf.1s.1b.25.min <-  as.vector(MCMC$beta %*% x.f) 

summary.Bayes(pf.1s.1b.25.min)
##        Min         Q1     Median       Mean         Q3        Max 
## 2.12400000 2.27300000 2.30000000 2.30000000 2.32600000 2.45700000 
##         SD  HPD-Lower  HPD-Upper      Level 
## 0.03894983 2.22364428 2.37597990 0.95000000
par(mfrow=c(1,1))

plot(density(pf.1s.1b.25.min), type="l", main=expression(paste("Posterior Density of Personal Foul \nif Steals=1, Blocks=1, Minutes=25")), xlab="Personal Foul per game", ylab="density")
lines(HPDinterval(as.mcmc(pf.1s.1b.25.min)), c(0,0), col="red", cex=2, lty=3)  # add PI line
legend("topright", legend="95% HPD interval", lty=2, col="red", cex=0.7)