Analysis of Variance (ANOVA) is a statistical technique commonly used to study differences between two or more group means. An ANOVA in R tests the null hypothesis that all group means are equal. The method extends the t-test and is used when the factor variable has more than two groups.
Our goal is to look at the 2018 NBA Draft to understand whether minutes played per game differ by where a player was drafted. To do this, I will run an ANOVA test to see whether there is a statistically significant difference in means between the following groups: Lottery Draft Picks (1-15), Remaining Round 1 Draft Picks (16-30), and 2nd Round Draft Picks (31-60).
Before any analysis begins, I like to inspect the data to understand the column names, missing rows, and the overall structure of the data frame. First, let’s check out a quick summary of the data:
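To make the idea concrete, here is a minimal sketch of what the test computes, using a small made-up dataset (the `toy` data frame and its values are hypothetical and have nothing to do with the draft data). It builds the F statistic by hand from the between-group and within-group sums of squares and compares it with what `aov()` reports.

```r
# Hypothetical toy data: three groups of four values each (not the draft data).
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 4)),
  value = c(10, 12, 11, 13, 14, 15, 13, 16, 20, 19, 21, 22)
)

grand_mean  <- mean(toy$value)
group_means <- tapply(toy$value, toy$group, mean)
group_sizes <- tapply(toy$value, toy$group, length)

# Between-group sum of squares: how far each group mean sits from the grand mean.
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
# Within-group sum of squares: spread of observations around their own group mean.
ss_within <- sum((toy$value - group_means[as.character(toy$group)])^2)

df_between <- nlevels(toy$group) - 1
df_within  <- nrow(toy) - nlevels(toy$group)

# F is the ratio of the two mean squares.
f_by_hand <- (ss_between / df_between) / (ss_within / df_within)

# The same statistic as reported by aov().
fit   <- aov(value ~ group, data = toy)
f_aov <- summary(fit)[[1]][["F value"]][1]

print(c(by_hand = f_by_hand, aov = f_aov))
```

A large F means the group means are spread out relative to the noise inside each group, which is the evidence against equal means.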
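# Point R at the project folder (adjust this path for your own machine).
setwd("~/R Projects/Hawks Season Project")
# On R 4.0 and later, stringsAsFactors = TRUE is needed to get the factor columns shown below.
Draft2018 <- read.csv("Draft2018.csv", stringsAsFactors = TRUE)
summary(Draft2018)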
## Rk Pk Tm
## Min. : 1.00 Min. : 1.00 PHI : 6
## 1st Qu.:15.75 1st Qu.:15.75 ATL : 4
## Median :30.50 Median :30.50 PHO : 4
## Mean :30.50 Mean :30.50 BRK : 3
## 3rd Qu.:45.25 3rd Qu.:45.25 DAL : 3
## Max. :60.00 Max. :60.00 DEN : 3
## (Other):37
## Player College Yrs
## Aaron Holiday\\holidaa01 : 1 : 9 Min. :1.000
## Alize Johnson\\johnsal02 : 1 Duke : 4 1st Qu.:2.000
## Anfernee Simons\\simonan01 : 1 Kentucky : 4 Median :2.000
## Arnoldas Kulboka\\kulboar01 : 1 Villanova: 4 Mean :1.875
## Bruce Brown\\brownbr01 : 1 Kansas : 2 3rd Qu.:2.000
## Chandler Hutchison\\hutchch01: 1 Maryland : 2 Max. :2.000
## (Other) :54 (Other) :35 NA's :4
## G MP PTS TRB
## Min. : 1.00 Min. : 6.0 Min. : 0.0 Min. : 1.0
## 1st Qu.: 47.25 1st Qu.: 565.2 1st Qu.: 167.8 1st Qu.: 73.5
## Median : 86.50 Median :1481.5 Median : 539.0 Median : 203.5
## Mean : 80.50 Mean :1697.9 Mean : 739.7 Mean : 289.0
## 3rd Qu.:112.25 3rd Qu.:2800.8 3rd Qu.:1104.2 3rd Qu.: 470.8
## Max. :147.00 Max. :4748.0 Max. :3327.0 Max. :1089.0
## NA's :4 NA's :4 NA's :4 NA's :4
## AST FG. X3P. FT.
## Min. : 0.0 Min. :0.0000 Min. :0.0000 Min. :0.3330
## 1st Qu.: 32.0 1st Qu.:0.3835 1st Qu.:0.2995 1st Qu.:0.6840
## Median : 82.5 Median :0.4250 Median :0.3305 Median :0.7550
## Mean : 163.3 Mean :0.4211 Mean :0.2979 Mean :0.7173
## 3rd Qu.: 205.2 3rd Qu.:0.4670 3rd Qu.:0.3653 3rd Qu.:0.7870
## Max. :1213.0 Max. :0.7200 Max. :0.5000 Max. :0.8470
## NA's :4 NA's :5 NA's :8 NA's :7
## MP.1 PTS.1 TRB.1 AST.1
## Min. : 2.70 Min. : 0.000 Min. : 0.200 Min. :0.000
## 1st Qu.:10.90 1st Qu.: 3.600 1st Qu.: 1.400 1st Qu.:0.600
## Median :17.05 Median : 6.550 Median : 2.550 Median :1.000
## Mean :17.01 Mean : 7.123 Mean : 2.977 Mean :1.507
## 3rd Qu.:23.85 3rd Qu.: 8.575 3rd Qu.: 3.950 3rd Qu.:1.825
## Max. :32.80 Max. :24.400 Max. :10.800 Max. :8.600
## NA's :4 NA's :4 NA's :4 NA's :4
## WS WS.48 BPM VORP
## Min. :-1.200 Min. :-0.47100 Min. :-19.900 Min. :-2.7000
## 1st Qu.: 0.100 1st Qu.: 0.01100 1st Qu.: -4.825 1st Qu.:-0.2250
## Median : 1.700 Median : 0.05350 Median : -2.400 Median : 0.0000
## Mean : 2.595 Mean : 0.04516 Mean : -2.971 Mean : 0.3036
## 3rd Qu.: 3.925 3rd Qu.: 0.09550 3rd Qu.: -1.200 3rd Qu.: 0.3250
## Max. :13.000 Max. : 0.22300 Max. : 5.900 Max. : 8.2000
## NA's :4 NA's :4 NA's :4 NA's :4
Now let’s take a look at the structure:
str(Draft2018)
## 'data.frame': 60 obs. of 22 variables:
## $ Rk : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Pk : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Tm : Factor w/ 28 levels "ATL","BOS","BRK",..: 23 25 1 15 7 21 4 6 19 22 ...
## $ Player : Factor w/ 60 levels "Aaron Holiday\\holidaa01",..: 10 39 38 24 56 45 59 8 33 42 ...
## $ College: Factor w/ 37 levels "","Alabama","Arizona",..: 3 10 1 18 22 27 10 2 13 34 ...
## $ Yrs : int 2 2 2 2 2 2 2 2 2 2 ...
## $ G : int 101 75 126 112 141 107 87 147 140 147 ...
## $ MP : int 3179 1901 4117 3027 4623 1634 2366 4748 3324 4189 ...
## $ PTS : int 1729 1108 3075 1712 3327 624 939 2720 1382 1249 ...
## $ TRB : int 1089 568 1065 526 556 532 712 440 519 523 ...
## $ AST : int 182 72 899 140 1213 81 129 435 143 288 ...
## $ FG. : num 0.572 0.497 0.443 0.485 0.428 0.474 0.508 0.45 0.367 0.466 ...
## $ X3P. : num 0 0.288 0.322 0.386 0.344 0.333 0.197 0.392 0.337 0.341 ...
## $ FT. : num 0.753 0.703 0.733 0.755 0.847 0.624 0.761 0.843 0.697 0.825 ...
## $ MP.1 : num 31.5 25.3 32.7 27 32.8 15.3 27.2 32.3 23.7 28.5 ...
## $ PTS.1 : num 17.1 14.8 24.4 15.3 23.6 5.8 10.8 18.5 9.9 8.5 ...
## $ TRB.1 : num 10.8 7.6 8.5 4.7 3.9 5 8.2 3 3.7 3.6 ...
## $ AST.1 : num 1.8 1 7.1 1.3 8.6 0.8 1.5 3 1 2 ...
## $ WS : num 8.3 4 13 6.5 9.2 4.2 5.2 1.9 -1.2 7 ...
## $ WS.48 : num 0.125 0.1 0.151 0.102 0.095 0.124 0.105 0.02 -0.017 0.08 ...
## $ BPM : num 0.3 -1.4 5.9 0 1.5 -0.4 -2 -3.4 -5.2 -0.6 ...
## $ VORP : num 1.9 0.3 8.2 1.5 4 0.7 0 -1.7 -2.7 1.5 ...
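Since missing rows are one of the things I look for, a quick count of `NA` values per column can complement `summary()` and `str()`. This is only a sketch on a hypothetical mini data frame (`toy`, with invented values); the same two calls should work on `Draft2018`.

```r
# Hypothetical mini data frame standing in for Draft2018 (values invented).
toy <- data.frame(
  Pk = 1:5,
  G  = c(101, NA, 126, NA, 141),
  MP = c(3179, NA, 4117, NA, 4623)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Row numbers that contain at least one missing value.
rows_with_na <- which(!complete.cases(toy))
print(rows_with_na)
```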
Our main objective is to understand whether there is a statistical difference between the average minutes per game played by Lottery Draft Picks, Remaining 1st Round Draft Picks, and 2nd Round Draft Picks. Currently our data shows Total Minutes (MP); however, we need Minutes Per Game (mpg), which we get by dividing Total Minutes by Games Played (G).
Draft2018$mpg <- Draft2018$MP/Draft2018$G
Draft2018$mpg<-round(Draft2018$mpg,2)
head (Draft2018$mpg)
## [1] 31.48 25.35 32.67 27.03 32.79 15.27
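As a sanity check on the calculation, the sketch below recomputes mpg for the first six picks using the G and MP values printed in the `str()` output, then compares the result with the MP.1 column, which appears to hold per-game minutes already rounded to one decimal (that reading of MP.1 is my assumption, based only on how closely the numbers match).

```r
# First six values of G, MP and MP.1, copied from the str() output above.
G   <- c(101, 75, 126, 112, 141, 107)
MP  <- c(3179, 1901, 4117, 3027, 4623, 1634)
MP1 <- c(31.5, 25.3, 32.7, 27.0, 32.8, 15.3)

# Minutes per game, rounded to two decimals as in the post.
mpg <- round(MP / G, 2)
print(mpg)

# Largest gap between the computed value and the MP.1 column.
largest_gap <- max(abs(mpg - MP1))
print(largest_gap)
```

If the gap stays within rounding error, the derived column can be trusted for the rest of the analysis.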
Now that we have our “Minutes Per Game” column, we need to define the three groups that we plan to compare. Let’s add a column that specifies the Lottery, Round 1, and Round 2 groups based on Pk number.
Draft2018$Round<-ifelse(Draft2018$Pk<=15,"Lottery","Round 1")
Draft2018$Round<-ifelse(Draft2018$Pk> 30,"Round 2",Draft2018$Round)
View (Draft2018)
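The two `ifelse()` calls work fine, but the same grouping could also be written with base R’s `cut()`, which bins a numeric column in one step. The sketch below uses a stand-in data frame called `picks` (a hypothetical name) holding pick numbers 1 through 60, and checks that both approaches give the same labels.

```r
# Pick numbers 1 through 60, as in the Pk column.
picks <- data.frame(Pk = 1:60)

# Intervals are right-closed by default: (0,15], (15,30], (30,60].
picks$Round <- cut(
  picks$Pk,
  breaks = c(0, 15, 30, 60),
  labels = c("Lottery", "Round 1", "Round 2")
)

# Same labels produced by the two ifelse() steps, for comparison.
by_ifelse <- ifelse(picks$Pk <= 15, "Lottery", "Round 1")
by_ifelse <- ifelse(picks$Pk > 30, "Round 2", by_ifelse)

round_counts <- table(picks$Round)
print(round_counts)
print(all(as.character(picks$Round) == by_ifelse))
```

One practical difference is that `cut()` returns a factor with the levels in the order given, which also controls the order of the boxes in a later plot.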
There are several columns still in my dataset that I don’t need for the ANOVA test. Let’s remove those columns, along with rows 43, 44, 51, and 55, to keep things simple.
# Column positions to drop (everything except Pk, Tm, Player, G, MP, mpg and Round).
DontNeed<-c(1,5,6,9,10,11,12,13,14,15,16,17,18,19,20,21,22)
Draft2018<-Draft2018[,-DontNeed]
# Row numbers to drop before running the test.
DontNeedRows<-c(43,44,51,55)
Draft2018<-Draft2018[-DontNeedRows,]
head(Draft2018)
## Pk Tm Player G MP mpg Round
## 1 1 PHO Deandre Ayton\\aytonde01 101 3179 31.48 Lottery
## 2 2 SAC Marvin Bagley\\baglema01 75 1901 25.35 Lottery
## 3 3 ATL Luka Dončić\\doncilu01 126 4117 32.67 Lottery
## 4 4 MEM Jaren Jackson\\jacksja02 112 3027 27.03 Lottery
## 5 5 DAL Trae Young\\youngtr01 141 4623 32.79 Lottery
## 6 6 ORL Mohamed Bamba\\bambamo01 107 1634 15.27 Lottery
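Dropping by position works, but it breaks quietly if the column order ever changes. An alternative sketch is to keep columns by name and drop rows by condition. The `toy` data frame below is hypothetical (a few of the original column names with partly invented values), and filtering on a missing `mpg` is my assumption about why rows get removed, not something the data confirms.

```r
# Hypothetical mini data frame with a few of the original column names.
toy <- data.frame(
  Rk  = 1:4,
  Pk  = 1:4,
  Tm  = c("PHO", "SAC", "ATL", "MEM"),
  G   = c(101, 75, NA, 112),
  MP  = c(3179, 1901, NA, 3027),
  PTS = c(1729, 1108, NA, 1712)
)
toy$mpg <- round(toy$MP / toy$G, 2)

# Keep columns by name rather than dropping them by position.
keep <- c("Pk", "Tm", "G", "MP", "mpg")
toy  <- toy[, keep]

# Drop rows where mpg could not be computed, instead of hard-coding row numbers.
toy <- toy[!is.na(toy$mpg), ]
print(toy)
```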
Just for simplicity and ease of reading, I would like to reorder the columns.
Draft2018<-Draft2018[,c(7,1,2,3,5,4,6)]
head(Draft2018)
## Round Pk Tm Player MP G mpg
## 1 Lottery 1 PHO Deandre Ayton\\aytonde01 3179 101 31.48
## 2 Lottery 2 SAC Marvin Bagley\\baglema01 1901 75 25.35
## 3 Lottery 3 ATL Luka Dončić\\doncilu01 4117 126 32.67
## 4 Lottery 4 MEM Jaren Jackson\\jacksja02 3027 112 27.03
## 5 Lottery 5 DAL Trae Young\\youngtr01 4623 141 32.79
## 6 Lottery 6 ORL Mohamed Bamba\\bambamo01 1634 107 15.27
I would also like to rename a few of the columns.
colnames(Draft2018)[colnames(Draft2018)=="Pk"]<-"Pick"
colnames(Draft2018)[colnames(Draft2018)=="MP"]<-"Total_MP"
head(Draft2018)
## Round Pick Tm Player Total_MP G mpg
## 1 Lottery 1 PHO Deandre Ayton\\aytonde01 3179 101 31.48
## 2 Lottery 2 SAC Marvin Bagley\\baglema01 1901 75 25.35
## 3 Lottery 3 ATL Luka Dončić\\doncilu01 4117 126 32.67
## 4 Lottery 4 MEM Jaren Jackson\\jacksja02 3027 112 27.03
## 5 Lottery 5 DAL Trae Young\\youngtr01 4623 141 32.79
## 6 Lottery 6 ORL Mohamed Bamba\\bambamo01 1634 107 15.27
At this point, we are ready to perform our ANOVA test to understand whether there is a significant difference between the group means.
anova<-aov(mpg~Round, data=Draft2018)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Round 2 1597 798.7 16.97 2.01e-06 ***
## Residuals 53 2494 47.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
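To read the table, it helps to see how the columns relate to each other. The sketch below starts from the rounded values printed above, rebuilds the F value as the ratio of the two mean squares, and looks up Pr(>F) with `pf()`. Because the inputs are rounded, the results should land very close to the printed ones rather than match them digit for digit.

```r
# Values copied from the summary(anova) table above.
ms_round     <- 798.7   # Mean Sq for Round
ms_residuals <- 47.1    # Mean Sq for Residuals
df_round     <- 2       # number of groups minus 1
df_residuals <- 53      # observations minus number of groups

# F is the ratio of between-group to within-group mean squares.
f_value <- ms_round / ms_residuals

# Upper-tail probability of the F distribution gives Pr(>F).
p_value <- pf(f_value, df_round, df_residuals, lower.tail = FALSE)

# The degrees of freedom also recover the number of rows used in the test.
n_obs <- df_round + df_residuals + 1

print(f_value)
print(p_value)
print(n_obs)
```

The recovered row count of 56 lines up with the 60 picks minus the four rows removed earlier.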
Based on the ANOVA test, we can see that there is a significant difference between the groups (p = 2.01e-06); however, we still do not know which groups differ significantly.
Let’s first plot the groups so that we can see them visually, and then we will perform a post hoc analysis to dig deeper.
boxplot(mpg ~ Round, data=Draft2018)
The boxplot shows us that the Lottery group has separated itself from both Round 1 and Round 2 draft picks. The separation does not look as clear between the Round 1 and Round 2 groups.
Let’s perform our post hoc analysis, using TukeyHSD, to dig deeper.
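A boxplot draws medians, while ANOVA compares means, so it can be worth printing both next to the plot. Here is a minimal sketch with `tapply()` on a hypothetical `toy` data frame (invented values, same column names as the post).

```r
# Hypothetical toy data using the same column names as the post (values invented).
toy <- data.frame(
  Round = rep(c("Lottery", "Round 1", "Round 2"), each = 3),
  mpg   = c(30, 26, 28, 20, 18, 22, 12, 16, 14)
)

# Mean, median and group size for each draft group.
group_means   <- tapply(toy$mpg, toy$Round, mean)
group_medians <- tapply(toy$mpg, toy$Round, median)
group_sizes   <- tapply(toy$mpg, toy$Round, length)

print(rbind(mean = group_means, median = group_medians, n = group_sizes))
```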
Post Hoc Analysis
TukeyHSD(anova)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mpg ~ Round, data = Draft2018)
##
## $Round
## diff lwr upr p adj
## Round 1-Lottery -8.136667 -14.17631 -2.0970263 0.0056303
## Round 2-Lottery -12.958333 -18.32125 -7.5954131 0.0000010
## Round 2-Round 1 -4.821667 -10.18459 0.5412536 0.0861837
Based on our TukeyHSD analysis, we can verify what our boxplot showed us visually. There is a significant difference in means between Lottery picks and both Round 1 and Round 2 picks: the adjusted p-values for those two comparisons (0.0056 and 0.000001) are below .05. The Round 2 vs. Round 1 comparison is not significant at that level (adjusted p = 0.086, and its confidence interval includes 0). Because at least one pair of groups differs, we have enough evidence to reject the null hypothesis that the means for all groups are equal.
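Rather than reading the p-values off the printout, the comparison table can also be pulled out of the TukeyHSD result and flagged in code. The sketch below runs on a hypothetical `toy` data frame (invented values), so its numbers will not match the draft results; only the extraction pattern carries over.

```r
# Hypothetical toy data (values invented); Round is stored as a factor.
toy <- data.frame(
  Round = factor(rep(c("Lottery", "Round 1", "Round 2"), each = 4)),
  mpg   = c(30, 26, 28, 32, 20, 18, 22, 24, 12, 16, 14, 18)
)

fit <- aov(mpg ~ Round, data = toy)
tk  <- TukeyHSD(fit)

# The result is a list with one matrix per factor; columns are diff, lwr, upr, p adj.
pairs <- tk$Round

# Flag the comparisons whose adjusted p-value is below the 5% level.
significant <- pairs[, "p adj"] < 0.05

# A confidence interval that excludes 0 tells the same story.
excludes_zero <- pairs[, "lwr"] > 0 | pairs[, "upr"] < 0

print(cbind(pairs, significant, excludes_zero))
```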