Recipe for Descriptive Statistics

Ali Svoobda

RPI

10/8/14 V.1

1. Setting

System under test

For this recipe, a dataset of Baseball statistics for catchers will be examined.

The question under examination is how a catcher's statistics and team association affect the number of runs a team gives up, measured by earned run average (ERA).

data<-read.csv("C:/baseballdata2.csv")
summary(data)

##      Player     League       Team         Runs         RperG     
##  Min.   : 1.0   AL:34   CHC    : 3   Min.   :583   Min.   :3.60  
##  1st Qu.:21.5   NL:37   CIN    : 3   1st Qu.:651   1st Qu.:4.02  
##  Median :40.0           HOU    : 3   Median :700   Median :4.32  
##  Mean   :39.8           KC     : 3   Mean   :697   Mean   :4.30  
##  3rd Qu.:58.5           LAA    : 3   3rd Qu.:734   3rd Qu.:4.53  
##  Max.   :76.0           MIA    : 3   Max.   :808   Max.   :4.99  
##                         (Other):53                               
##       ERA            ARD               TC             PO     
##  Min.   :3.19   Min.   :-0.960   Min.   : 117   Min.   :110  
##  1st Qu.:3.71   1st Qu.:-0.170   1st Qu.: 308   1st Qu.:285  
##  Median :4.01   Median : 0.500   Median : 408   Median :379  
##  Mean   :4.00   Mean   : 0.305   Mean   : 471   Mean   :436  
##  3rd Qu.:4.30   3rd Qu.: 0.845   3rd Qu.: 574   3rd Qu.:522  
##  Max.   :5.22   Max.   : 1.180   Max.   :1053   Max.   :962  
##                                                              
##        A              E              DP              SB      
##  Min.   : 7.0   Min.   :0.00   Min.   : 0.00   Min.   : 3.0  
##  1st Qu.:18.0   1st Qu.:2.00   1st Qu.: 1.00   1st Qu.:22.5  
##  Median :25.0   Median :4.00   Median : 3.00   Median :37.0  
##  Mean   :31.7   Mean   :3.58   Mean   : 3.24   Mean   :38.9  
##  3rd Qu.:41.0   3rd Qu.:5.00   3rd Qu.: 4.50   3rd Qu.:52.5  
##  Max.   :88.0   Max.   :8.00   Max.   :12.00   Max.   :87.0  
##                                                              
##        CS           SBPct             PB             EPC       
##  Min.   : 1.0   Min.   :0.500   Min.   : 0.00   Min.   :0.970  
##  1st Qu.: 7.0   1st Qu.:0.685   1st Qu.: 2.00   1st Qu.:0.990  
##  Median :12.0   Median :0.760   Median : 3.00   Median :0.990  
##  Mean   :13.6   Mean   :0.741   Mean   : 4.18   Mean   :0.992  
##  3rd Qu.:18.0   3rd Qu.:0.805   3rd Qu.: 6.00   3rd Qu.:1.000  
##  Max.   :38.0   Max.   :0.870   Max.   :18.00   Max.   :1.000  
##                                                                
##        RF      
##  Min.   :5.29  
##  1st Qu.:6.79  
##  Median :7.38  
##  Mean   :7.30  
##  3rd Qu.:7.75  
##  Max.   :8.90  
##

Factors and Levels

The three factors being considered for this experiment are:

Errors (E): number of errors a catcher made. Levels (4): 0-3, 3-6, 6-9, and 9 and up (no catcher in this dataset reaches the last level, since the maximum is 8)

Put Outs (PO): number of opposing players put out by the catcher. Levels (4): 0-250, 250-500, 500-750, and 750 and up

Stolen Bases (SB): number of bases a catcher allows the opposing team to steal. Levels (3): 0-30, 30-60, and 60-90 (labeled "60-100" in the code and output)

Since errors, put outs, and stolen bases are currently stored as numeric counts, we will create discrete factor levels for each.

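Create discrete factor levels for the Errors factor:

# Compare against a numeric copy: the first assignment turns data$E into
# character, after which comparisons on data$E itself would be string comparisons
errors <- data$E
data$E[errors >= 0 & errors < 3] <- "0-3"
data$E[errors >= 3 & errors < 6] <- "3-6"
data$E[errors >= 6 & errors < 9] <- "6-9"
data$E[errors >= 9] <- "9 and up"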

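Create discrete factor levels for the Put Outs factor:

# Compare against a numeric copy so the bins are not decided by string comparison
putouts <- data$PO
data$PO[putouts >= 0 & putouts < 250] <- "0-250"
data$PO[putouts >= 250 & putouts < 500] <- "250-500"
data$PO[putouts >= 500 & putouts < 750] <- "500-750"
data$PO[putouts >= 750] <- "750 and up"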

Create discrete factor levels for the Stolen Bases factor:

# Compare against a numeric copy; the "60-100" label covers 60 <= SB < 90
sb <- data$SB
data$SB[sb >= 0 & sb < 30] <- "0-30"
data$SB[sb >= 30 & sb < 60] <- "30-60"
data$SB[sb >= 60 & sb < 90] <- "60-100"

Now that the levels are set, save errors (E), put outs (PO), and stolen bases allowed (SB) as factors:

data$E=as.factor(data$E)
data$PO=as.factor(data$PO)
data$SB=as.factor(data$SB)
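The binning steps above can also be expressed with base R's cut(), which bins and creates the factor in one call. The following is a minimal sketch, assuming a small hypothetical vector of error counts rather than the real data$E column; the names errors and error_level are illustrative only.

```r
# Hypothetical error counts; the real values live in data$E before recoding.
errors <- c(0, 2, 3, 5, 6, 8)

# right = FALSE makes each bin closed on the left: [0,3), [3,6), [6,9), [9,Inf)
error_level <- cut(errors,
                   breaks = c(0, 3, 6, 9, Inf),
                   labels = c("0-3", "3-6", "6-9", "9 and up"),
                   right = FALSE)

# Count how many observations fall in each level.
table(error_level)
```

Unlike the assignment approach, cut() keeps empty levels (here "9 and up"); droplevels() can remove them if that is not wanted.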

Continuous Variables

The other continuous variables in the dataset are Runs, average runs per game, earned run average, average run differential, total chances, assists, double plays, caught stealing, stolen base percentage, passed balls, errors per total chances, and a range factor.

Response Variables

We will use earned run average (ERA), the average number of earned runs a team gives up per nine innings, as the response variable to measure the catcher's performance.

The Data: How is it organized and what does it look like?

The dataset covers catchers who played in the 2012 MLB season, with up to three catchers per team (71 observations across 30 teams).

The data has the 15 continuous variables described above (three of which, errors, put outs, and stolen bases allowed, were converted to discrete levels in the Factors and Levels section), a player identification number, and league and team identification factors.

Structure of the baseball catchers dataset:

str(data)

## 'data.frame':    71 obs. of  18 variables:
##  $ Player: int  1 2 3 5 6 7 8 9 10 11 ...
##  $ League: Factor w/ 2 levels "AL","NL": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Team  : Factor w/ 30 levels "ARI","ATL","BAL",..: 26 26 22 6 6 6 16 16 16 5 ...
##  $ Runs  : int  765 765 651 669 669 669 776 776 776 613 ...
##  $ RperG : num  4.72 4.72 4.02 4.13 4.13 4.13 4.79 4.79 4.79 3.78 ...
##  $ ERA   : num  3.71 3.71 3.86 3.34 3.34 3.34 4.22 4.22 4.22 4.51 ...
##  $ ARD   : num  1.01 1.01 0.16 0.79 0.79 0.79 0.57 0.57 0.57 -0.73 ...
##  $ TC    : int  1053 280 539 393 833 141 169 755 576 388 ...
##  $ PO    : Factor w/ 4 levels "0-250","250-500",..: 4 2 2 2 4 1 1 3 3 2 ...
##  $ A     : int  88 23 41 24 48 9 7 34 47 25 ...
##  $ E     : Factor w/ 3 levels "0-3","3-6","6-9": 2 1 2 1 2 1 1 3 3 3 ...
##  $ DP    : int  12 3 1 1 6 0 0 4 6 2 ...
##  $ SB    : Factor w/ 3 levels "0-30","30-60",..: 2 1 3 2 2 1 1 3 2 2 ...
##  $ CS    : int  35 8 13 10 32 1 5 19 15 14 ...
##  $ SBPct : num  0.52 0.69 0.82 0.8 0.52 0.83 0.81 0.79 0.68 0.73 ...
##  $ PB    : int  6 3 2 3 3 1 1 2 2 1 ...
##  $ EPC   : num  1 0.99 0.99 1 1 0.99 1 0.99 0.99 0.98 ...
##  $ RF    : num  7.72 5.91 6.62 7.4 7.54 6.67 6.26 8.5 8.26 7.33 ...

First and last six observations of the dataset:

head(data)

##   Player League Team Runs RperG  ERA  ARD   TC         PO  A   E DP     SB
## 1      1     NL  STL  765  4.72 3.71 1.01 1053 750 and up 88 3-6 12  30-60
## 2      2     NL  STL  765  4.72 3.71 1.01  280    250-500 23 0-3  3   0-30
## 3      3     NL  PIT  651  4.02 3.86 0.16  539    250-500 41 3-6  1 60-100
## 4      5     NL  CIN  669  4.13 3.34 0.79  393    250-500 24 0-3  1  30-60
## 5      6     NL  CIN  669  4.13 3.34 0.79  833 750 and up 48 3-6  6  30-60
## 6      7     NL  CIN  669  4.13 3.34 0.79  141      0-250  9 0-3  0   0-30
##   CS SBPct PB  EPC   RF
## 1 35  0.52  6 1.00 7.72
## 2  8  0.69  3 0.99 5.91
## 3 13  0.82  2 0.99 6.62
## 4 10  0.80  3 1.00 7.40
## 5 32  0.52  3 1.00 7.54
## 6  1  0.83  1 0.99 6.67

tail(data)

##    Player League Team Runs RperG  ERA   ARD  TC      PO  A   E DP    SB CS
## 66     71     AL   TB  697  4.30 3.19  1.11 277 250-500 12 3-6  2  0-30  5
## 67     72     AL  TEX  808  4.99 3.99  1.00 552 500-750 37 3-6  1 30-60 11
## 68     73     AL  TEX  808  4.99 3.99  1.00 397 250-500 18 0-3  1 30-60  9
## 69     74     AL  TEX  808  4.99 3.99  1.00 371 250-500 12 0-3  1  0-30  7
## 70     75     AL  TOR  716  4.42 4.64 -0.22 683 500-750 57 3-6  1 30-60 22
## 71     76     AL  TOR  716  4.42 4.64 -0.22 473 250-500 38 0-3  5  0-30 20
##    SBPct PB  EPC   RF
## 66  0.80  0 0.99 7.03
## 67  0.79  8 0.99 7.61
## 68  0.77  4 1.00 8.06
## 69  0.80  1 1.00 8.41
## 70  0.71  9 0.99 7.22
## 71  0.59  6 1.00 7.14

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

The experiment will be a multi-factor design with 3 factors, each with multiple levels, to test whether the variation in runs against (ERA) can be explained by the catcher's errors, put outs, or stolen bases allowed.

An ANOVA model will be created and analyzed to test the null hypothesis that mean ERA is the same across the levels of each factor, so that any variation in ERA is due to chance alone.

What is the Rationale for this design?

ANOVA analyzes the difference in group means by comparing the variation between groups with the variation within groups.
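To make the between-group versus within-group comparison concrete, the following sketch computes a one-way ANOVA F statistic by hand and checks it against aov(). The ERA values and group labels are hypothetical and chosen only for illustration; they are not taken from the catchers dataset.

```r
# Hypothetical ERA values for three groups (not taken from the dataset).
era   <- c(3.2, 3.5, 3.9, 4.0, 4.1, 4.4, 4.3, 4.6, 4.9)
group <- factor(rep(c("low", "mid", "high"), each = 3))

grand_mean  <- mean(era)
group_means <- tapply(era, group, mean)
group_sizes <- tapply(era, group, length)

# Between-group sum of squares: how far each group mean sits from the grand mean.
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
# Within-group sum of squares: spread of observations around their own group mean.
ss_within  <- sum((era - group_means[as.character(group)])^2)

df_between <- nlevels(group) - 1
df_within  <- length(era) - nlevels(group)

# F is the ratio of the two mean squares.
f_by_hand  <- (ss_between / df_between) / (ss_within / df_within)
f_from_aov <- anova(aov(era ~ group))[["F value"]][1]

c(by_hand = f_by_hand, from_aov = f_from_aov)
```

A large F means the group means differ by more than the within-group scatter would suggest.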

Randomize: What is the Randomization Scheme?

This dataset has no randomization scheme: it is observational data made up of each catcher's statistics for the 2012 season.

Replicate: Are there replicates and/or repeated measures?

Some of the observations are averages, such as stolen base percentage and ERA, so the replicates can be considered the 162 games the data was averaged over.

Block: Did you use blocking in the design?

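No formal blocking was used. Only the 3 catchers per team who played the most games were included in the dataset, which excludes catchers who only played a game or two; this is a filtering step rather than blocking in the experimental-design sense.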

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and descriptive summary

Summary Statistics for entire baseball catchers dataset:

summary(data)

##      Player     League       Team         Runs         RperG     
##  Min.   : 1.0   AL:34   CHC    : 3   Min.   :583   Min.   :3.60  
##  1st Qu.:21.5   NL:37   CIN    : 3   1st Qu.:651   1st Qu.:4.02  
##  Median :40.0           HOU    : 3   Median :700   Median :4.32  
##  Mean   :39.8           KC     : 3   Mean   :697   Mean   :4.30  
##  3rd Qu.:58.5           LAA    : 3   3rd Qu.:734   3rd Qu.:4.53  
##  Max.   :76.0           MIA    : 3   Max.   :808   Max.   :4.99  
##                         (Other):53                               
##       ERA            ARD               TC                PO    
##  Min.   :3.19   Min.   :-0.960   Min.   : 117   0-250     :13  
##  1st Qu.:3.71   1st Qu.:-0.170   1st Qu.: 308   250-500   :37  
##  Median :4.01   Median : 0.500   Median : 408   500-750   :13  
##  Mean   :4.00   Mean   : 0.305   Mean   : 471   750 and up: 8  
##  3rd Qu.:4.30   3rd Qu.: 0.845   3rd Qu.: 574                  
##  Max.   :5.22   Max.   : 1.180   Max.   :1053                  
##                                                                
##        A          E            DP             SB           CS      
##  Min.   : 7.0   0-3:26   Min.   : 0.00   0-30  :26   Min.   : 1.0  
##  1st Qu.:18.0   3-6:29   1st Qu.: 1.00   30-60 :33   1st Qu.: 7.0  
##  Median :25.0   6-9:16   Median : 3.00   60-100:12   Median :12.0  
##  Mean   :31.7            Mean   : 3.24               Mean   :13.6  
##  3rd Qu.:41.0            3rd Qu.: 4.50               3rd Qu.:18.0  
##  Max.   :88.0            Max.   :12.00               Max.   :38.0  
##                                                                    
##      SBPct             PB             EPC              RF      
##  Min.   :0.500   Min.   : 0.00   Min.   :0.970   Min.   :5.29  
##  1st Qu.:0.685   1st Qu.: 2.00   1st Qu.:0.990   1st Qu.:6.79  
##  Median :0.760   Median : 3.00   Median :0.990   Median :7.38  
##  Mean   :0.741   Mean   : 4.18   Mean   :0.992   Mean   :7.30  
##  3rd Qu.:0.805   3rd Qu.: 6.00   3rd Qu.:1.000   3rd Qu.:7.75  
##  Max.   :0.870   Max.   :18.00   Max.   :1.000   Max.   :8.90  
##

Mean ERA:

mean(data$ERA)

## [1] 3.997

Boxplots of ERA by errors, put outs, and stolen bases allowed:

boxplot(data$ERA~data$E, xlab="Number of Errors", ylab="ERA")

[Figure: boxplot of ERA by number of errors]

boxplot(data$ERA~data$PO, xlab="Number of Put Outs", ylab="ERA")

[Figure: boxplot of ERA by number of put outs]

boxplot(data$ERA~data$SB, xlab="Number of Stolen Bases Allowed", ylab="ERA")

[Figure: boxplot of ERA by number of stolen bases allowed]

Examining the errors plot, it is hard to say whether the number of errors explains the variation in ERA, although the highest error level does appear to go with the highest ERA.

In the put outs boxplot, the first 3 levels do not appear to explain the variation in ERA, but the last level of 750+ put outs by the catcher appears to be associated with a lower ERA, which makes sense in the game of baseball.

From the stolen bases boxplot, it seems unlikely that the number of stolen bases allowed will explain the variation in ERA.
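The visual impressions from the boxplots can be backed up with per-group summaries. Below is a small sketch, assuming a hypothetical data frame called toy with made-up ERA values; with the real data the same tapply() calls would be applied to data$ERA and data$E.

```r
# Hypothetical ERA values and error levels (illustrative only).
toy <- data.frame(
  ERA = c(3.4, 3.8, 4.1, 4.0, 4.5, 4.7),
  E   = factor(c("0-3", "0-3", "3-6", "3-6", "6-9", "6-9"))
)

# Median and count of ERA within each error level.
era_median <- tapply(toy$ERA, toy$E, median)
era_n      <- tapply(toy$ERA, toy$E, length)

rbind(median = era_median, n = era_n)
```

Reporting the group sizes alongside the medians helps show which levels rest on only a handful of catchers.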

Testing

#Model 1: ANOVA model for the effect of number of errors on ERA

First we will test the effect that the number of errors a catcher makes has on ERA.

model1=aov(data$ERA~data$E)
anova(model1)

## Analysis of Variance Table
## 
## Response: data$ERA
##           Df Sum Sq Mean Sq F value Pr(>F)
## data$E     2   0.78   0.391    1.77   0.18
## Residuals 68  15.07   0.222

The p-value is 0.1788: if the error level had no effect on ERA, an F statistic at least this large would still arise by chance about 18% of the time. This is above the usual 0.05 threshold, so we fail to reject the null hypothesis. The number of errors by a catcher alone is not enough to explain the variation in team ERA.
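The p-value in an ANOVA table is the upper-tail area of an F distribution. As a rough check, it can be recomputed from the rounded F value and degrees of freedom printed in the Model 1 table above; the variable names below are illustrative, and the result differs slightly from 0.1788 because the printed F value is rounded.

```r
# F statistic and degrees of freedom copied from the Model 1 ANOVA table above.
f_value <- 1.77
df_num  <- 2   # Df for data$E
df_den  <- 68  # Df for Residuals

# Upper-tail area of the F distribution is the p-value.
p_value <- pf(f_value, df_num, df_den, lower.tail = FALSE)
round(p_value, 2)
```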

#Model 2: ANOVA model for the effect of number of put outs on ERA

model2=aov(data$ERA~data$PO)
anova(model2)

## Analysis of Variance Table
## 
## Response: data$ERA
##           Df Sum Sq Mean Sq F value Pr(>F)
## data$PO    3   1.02   0.339    1.53   0.21
## Residuals 67  14.84   0.221

Again, with a p-value of 0.21, we fail to reject the null hypothesis. Put outs by a catcher alone do not explain the variation in ERA.

#Model 3: ANOVA model for the effect of number of stolen bases allowed on ERA

model3=aov(data$ERA~data$SB)
anova(model3)

## Analysis of Variance Table
## 
## Response: data$ERA
##           Df Sum Sq Mean Sq F value Pr(>F)
## data$SB    2   0.37   0.185    0.81   0.45
## Residuals 68  15.48   0.228

With a p-value of 0.45, it is even less likely that stolen bases allowed can explain the variation in ERA. We fail to reject the null hypothesis that the variation is due to chance.

#Model 4: ANOVA model to test the effect of the interaction of all three factors (errors, put outs, and stolen bases) on ERA

Although each factor individually could not explain the variation among ERA, this model will test to see if it is likely that the factors together can explain the variation in ERA:

model4=aov(data$ERA~data$E*data$PO*data$SB)
anova(model4)

## Analysis of Variance Table
## 
## Response: data$ERA
##                 Df Sum Sq Mean Sq F value Pr(>F)  
## data$E           2   0.78   0.391    1.76  0.182  
## data$PO          3   1.53   0.510    2.30  0.088 .
## data$SB          2   0.89   0.443    1.99  0.146  
## data$E:data$PO   5   0.27   0.054    0.24  0.941  
## data$E:data$SB   3   0.11   0.036    0.16  0.920  
## data$PO:data$SB  2   0.51   0.254    1.14  0.327  
## Residuals       53  11.77   0.222                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

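This analysis gives statistics for each factor individually and for each pair of factors. The three-way interaction does not appear in the table, presumably because the data do not contain enough combinations of levels to estimate it. None of the terms is significant at the 0.05 level; put outs comes closest (p = 0.088). It is unlikely that any of the combinations can explain the variation in ERA.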

Diagnostics/Model Adequacy Checking

#Check Normality

We must check that the model residuals follow the normality assumption of the ANOVA model:

qqnorm(residuals(model4))
qqline(residuals(model4))

[Figure: normal Q-Q plot of the Model 4 residuals]

Based on this plot, the residuals look roughly normal, although there is a chance they are not. This means the results of the above analysis may not be accurate.
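A Q-Q plot is a visual judgment, so a formal test can be a useful complement. The sketch below applies the Shapiro-Wilk test to the residuals of a small model fitted to simulated, hypothetical data; with the real analysis the same call would be shapiro.test(residuals(model4)).

```r
set.seed(1)  # make the toy example reproducible

# Hypothetical data: 30 ERA-like values in three groups.
toy <- data.frame(
  ERA   = rnorm(30, mean = 4, sd = 0.5),
  group = factor(rep(c("a", "b", "c"), each = 10))
)
toy_model <- aov(ERA ~ group, data = toy)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(residuals(toy_model))
sw$p.value
```

A small p-value would be evidence against normality; a large one does not prove normality, it only fails to contradict it.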

Fitted vs Residuals Plot

plot(fitted(model4),residuals(model4))

[Figure: fitted values versus residuals for Model 4]

The residuals are somewhat clustered in the middle, but they still seem to be evenly scattered around the horizontal zero line, indicating a decent fit of the model.
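The fitted-versus-residuals plot is a visual check of the equal-variance assumption. As a possible complement, Bartlett's test checks formally whether the group variances are equal. This is a sketch on simulated, hypothetical data (the toy data frame is not the catchers dataset), and Bartlett's test is itself sensitive to non-normality, so it should be read alongside the plots.

```r
set.seed(2)  # make the toy example reproducible

# Hypothetical data: three groups drawn with the same spread.
toy <- data.frame(
  ERA   = rnorm(30, mean = 4, sd = 0.5),
  group = factor(rep(c("a", "b", "c"), each = 10))
)

# Null hypothesis: all groups share the same variance.
bt <- bartlett.test(ERA ~ group, data = toy)
bt$p.value
```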

#Tukey Test

The Tukey test takes the ANOVA model and compares the mean ERA of every pair of levels within each factor and within each interaction. The null hypothesis for each comparison is that there is no difference between the two means.

TukeyHSD(model4)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = data$ERA ~ data$E * data$PO * data$SB)
## 
## $`data$E`
##            diff     lwr    upr  p adj
## 3-6-0-3 -0.1298 -0.4367 0.1771 0.5677
## 6-9-0-3  0.1425 -0.2185 0.5036 0.6101
## 6-9-3-6  0.2723 -0.0815 0.6262 0.1617
## 
## $`data$PO`
##                        diff     lwr     upr  p adj
## 250-500-0-250       0.09166 -0.3113 0.49465 0.9306
## 500-750-0-250       0.14492 -0.3453 0.63517 0.8614
## 750 and up-0-250   -0.33997 -0.9016 0.22168 0.3843
## 500-750-250-500     0.05325 -0.3497 0.45624 0.9851
## 750 and up-250-500 -0.43164 -0.9190 0.05571 0.0999
## 750 and up-500-750 -0.48489 -1.0465 0.07676 0.1133
## 
## $`data$SB`
##                diff     lwr    upr  p adj
## 30-60-0-30   0.0857 -0.2123 0.3837 0.7683
## 60-100-0-30  0.2308 -0.1658 0.6273 0.3466
## 60-100-30-60 0.1451 -0.2380 0.5281 0.6344
## 
## $`data$E:data$PO`
##                                   diff     lwr    upr  p adj
## 3-6:0-250-0-3:0-250           -0.18087 -1.4284 1.0667 1.0000
## 6-9:0-250-0-3:0-250            0.22740 -1.4618 1.9166 1.0000
## 0-3:250-500-0-3:0-250          0.10747 -0.5500 0.7650 1.0000
## 3-6:250-500-0-3:0-250         -0.06502 -0.6942 0.5642 1.0000
## 6-9:250-500-0-3:0-250          0.30907 -0.7511 1.3693 0.9972
## 0-3:500-750-0-3:0-250               NA      NA     NA     NA
## 3-6:500-750-0-3:0-250         -0.08976 -0.9215 0.7419 1.0000
## 6-9:500-750-0-3:0-250          0.37489 -0.4188 1.1686 0.8960
## 0-3:750 and up-0-3:0-250      -0.71075 -2.3999 0.9784 0.9501
## 3-6:750 and up-0-3:0-250      -0.31157 -1.5591 0.9360 0.9993
## 6-9:750 and up-0-3:0-250      -0.18868 -1.0708 0.6935 0.9998
## 6-9:0-250-3-6:0-250            0.40828 -1.5643 2.3808 0.9999
## 0-3:250-500-3-6:0-250          0.28834 -0.9241 1.5007 0.9996
## 3-6:250-500-3-6:0-250          0.11585 -1.0814 1.3131 1.0000
## 6-9:250-500-3-6:0-250          0.48994 -0.9803 1.9602 0.9912
## 0-3:500-750-3-6:0-250               NA      NA     NA     NA
## 3-6:500-750-3-6:0-250          0.09111 -1.2239 1.4061 1.0000
## 6-9:500-750-3-6:0-250          0.55576 -0.7356 1.8471 0.9420
## 0-3:750 and up-3-6:0-250      -0.52988 -2.5024 1.4427 0.9986
## 3-6:750 and up-3-6:0-250      -0.13069 -1.7413 1.4799 1.0000
## 6-9:750 and up-3-6:0-250      -0.00781 -1.3553 1.3397 1.0000
## 0-3:250-500-6-9:0-250         -0.11993 -1.7833 1.5435 1.0000
## 3-6:250-500-6-9:0-250         -0.29243 -1.9448 1.3600 1.0000
## 6-9:250-500-6-9:0-250          0.08166 -1.7781 1.9414 1.0000
## 0-3:500-750-6-9:0-250               NA      NA     NA     NA
## 3-6:500-750-6-9:0-250         -0.31717 -2.0568 1.4225 1.0000
## 6-9:500-750-6-9:0-250          0.14748 -1.5743 1.8693 1.0000
## 0-3:750 and up-6-9:0-250      -0.93816 -3.2158 1.3395 0.9568
## 3-6:750 and up-6-9:0-250      -0.53897 -2.5115 1.4336 0.9984
## 6-9:750 and up-6-9:0-250      -0.41608 -2.1804 1.3482 0.9996
## 3-6:250-500-0-3:250-500       -0.17250 -0.7288 0.3838 0.9952
## 6-9:250-500-0-3:250-500        0.20160 -0.8170 1.2202 0.9999
## 0-3:500-750-0-3:250-500             NA      NA     NA     NA
## 3-6:500-750-0-3:250-500       -0.19723 -0.9752 0.5807 0.9992
## 6-9:500-750-0-3:250-500        0.26742 -0.4698 1.0046 0.9829
## 0-3:750 and up-0-3:250-500    -0.81822 -2.4816 0.8452 0.8686
## 3-6:750 and up-0-3:250-500    -0.41904 -1.6314 0.7934 0.9883
## 6-9:750 and up-0-3:250-500    -0.29615 -1.1278 0.5355 0.9852
## 6-9:250-500-3-6:250-500        0.37409 -0.6265 1.3747 0.9785
## 0-3:500-750-3-6:250-500             NA      NA     NA     NA
## 3-6:500-750-3-6:250-500       -0.02474 -0.7790 0.7295 1.0000
## 6-9:500-750-3-6:250-500        0.43991 -0.2722 1.1520 0.6172
## 0-3:750 and up-3-6:250-500    -0.64573 -2.2981 1.0067 0.9702
## 3-6:750 and up-3-6:250-500    -0.24654 -1.4438 0.9507 0.9999
## 6-9:750 and up-3-6:250-500    -0.12366 -0.9332 0.6859 1.0000
## 0-3:500-750-6-9:250-500             NA      NA     NA     NA
## 3-6:500-750-6-9:250-500       -0.39883 -1.5377 0.7400 0.9870
## 6-9:500-750-6-9:250-500        0.06582 -1.0456 1.1772 1.0000
## 0-3:750 and up-6-9:250-500    -1.01982 -2.8795 0.8399 0.7691
## 3-6:750 and up-6-9:250-500    -0.62063 -2.0909 0.8496 0.9490
## 6-9:750 and up-6-9:250-500    -0.49775 -1.6739 0.6784 0.9481
## 3-6:500-750-0-3:500-750             NA      NA     NA     NA
## 6-9:500-750-0-3:500-750             NA      NA     NA     NA
## 0-3:750 and up-0-3:500-750          NA      NA     NA     NA
## 3-6:750 and up-0-3:500-750          NA      NA     NA     NA
## 6-9:750 and up-0-3:500-750          NA      NA     NA     NA
## 6-9:500-750-3-6:500-750        0.46465 -0.4314 1.3607 0.8251
## 0-3:750 and up-3-6:500-750    -0.62099 -2.3606 1.1186 0.9849
## 3-6:750 and up-3-6:500-750    -0.22180 -1.5368 1.0932 1.0000
## 6-9:750 and up-3-6:500-750    -0.09892 -1.0742 0.8763 1.0000
## 0-3:750 and up-6-9:500-750    -1.08564 -2.8074 0.6361 0.5875
## 3-6:750 and up-6-9:500-750    -0.68645 -1.9778 0.6049 0.8015
## 6-9:750 and up-6-9:500-750    -0.56357 -1.5066 0.3795 0.6634
## 3-6:750 and up-0-3:750 and up  0.39919 -1.5734 2.3717 0.9999
## 6-9:750 and up-0-3:750 and up  0.52207 -1.2422 2.2864 0.9968
## 6-9:750 and up-3-6:750 and up  0.12288 -1.2246 1.4704 1.0000
## 
## $`data$E:data$SB`
##                           diff     lwr    upr  p adj
## 3-6:0-30-0-3:0-30     -0.16356 -0.8168 0.4897 0.9961
## 6-9:0-30-0-3:0-30      0.08809 -1.4797 1.6558 1.0000
## 0-3:30-60-0-3:0-30     0.11882 -0.5654 0.8030 0.9997
## 3-6:30-60-0-3:0-30    -0.04421 -0.5529 0.4644 1.0000
## 6-9:30-60-0-3:0-30     0.13659 -0.5476 0.8208 0.9992
## 0-3:60-100-0-3:0-30    0.44672 -0.6922 1.5857 0.9361
## 3-6:60-100-0-3:0-30    0.15422 -0.9847 1.2932 1.0000
## 6-9:60-100-0-3:0-30    0.28728 -0.3659 0.9405 0.8843
## 6-9:0-30-3-6:0-30      0.25165 -1.3644 1.8676 0.9999
## 0-3:30-60-3-6:0-30     0.28238 -0.5061 1.0709 0.9617
## 3-6:30-60-3-6:0-30     0.11935 -0.5228 0.7615 0.9995
## 6-9:30-60-3-6:0-30     0.30014 -0.4884 1.0887 0.9458
## 0-3:60-100-3-6:0-30    0.61028 -0.5942 1.8148 0.7796
## 3-6:60-100-3-6:0-30    0.31778 -0.8867 1.5223 0.9944
## 6-9:60-100-3-6:0-30    0.45084 -0.3110 1.2126 0.6078
## 0-3:30-60-6-9:0-30     0.03074 -1.5980 1.6595 1.0000
## 3-6:30-60-6-9:0-30    -0.13229 -1.6955 1.4309 1.0000
## 6-9:30-60-6-9:0-30     0.04850 -1.5803 1.6773 1.0000
## 0-3:60-100-6-9:0-30    0.35863 -1.5074 2.2246 0.9994
## 3-6:60-100-6-9:0-30    0.06614 -1.7999 1.9321 1.0000
## 6-9:60-100-6-9:0-30    0.19919 -1.4168 1.8152 1.0000
## 3-6:30-60-0-3:30-60   -0.16303 -0.8367 0.5106 0.9969
## 6-9:30-60-0-3:30-60    0.01776 -0.7966 0.8321 1.0000
## 0-3:60-100-0-3:30-60   0.32790 -0.8937 1.5495 0.9938
## 3-6:60-100-0-3:30-60   0.03540 -1.1862 1.2570 1.0000
## 6-9:60-100-0-3:30-60   0.16846 -0.6201 0.9570 0.9987
## 6-9:30-60-3-6:30-60    0.18079 -0.4928 0.8544 0.9938
## 0-3:60-100-3-6:30-60   0.49093 -0.6417 1.6235 0.8923
## 3-6:60-100-3-6:30-60   0.19843 -0.9342 1.3310 0.9997
## 6-9:60-100-3-6:30-60   0.33149 -0.3106 0.9736 0.7619
## 0-3:60-100-6-9:30-60   0.31014 -0.9114 1.5317 0.9957
## 3-6:60-100-6-9:30-60   0.01764 -1.2039 1.2392 1.0000
## 6-9:60-100-6-9:30-60   0.15070 -0.6378 0.9392 0.9994
## 3-6:60-100-0-3:60-100 -0.29250 -1.8161 1.2311 0.9994
## 6-9:60-100-0-3:60-100 -0.15944 -1.3639 1.0451 1.0000
## 6-9:60-100-3-6:60-100  0.13306 -1.0714 1.3376 1.0000
## 
## $`data$PO:data$SB`
##                                        diff     lwr    upr  p adj
## 250-500:0-30-0-250:0-30            -0.09430 -0.7390 0.5504 1.0000
## 500-750:0-30-0-250:0-30             0.47864 -1.1977 2.1550 0.9976
## 750 and up:0-30-0-250:0-30               NA      NA     NA     NA
## 0-250:30-60-0-250:0-30              0.05047 -1.6259 1.7268 1.0000
## 250-500:30-60-0-250:0-30            0.16769 -0.4151 0.7505 0.9975
## 500-750:30-60-0-250:0-30            0.03364 -0.6766 0.7438 1.0000
## 750 and up:30-60-0-250:0-30        -0.55843 -1.7885 0.6717 0.9183
## 0-250:60-100-0-250:0-30                  NA      NA     NA     NA
## 250-500:60-100-0-250:0-30           0.41325 -0.6264 1.4529 0.9664
## 500-750:60-100-0-250:0-30           0.38432 -0.6553 1.4239 0.9803
## 750 and up:60-100-0-250:0-30       -0.26198 -1.0673 0.5433 0.9928
## 500-750:0-30-250-500:0-30           0.57295 -1.0984 2.2443 0.9890
## 750 and up:0-30-250-500:0-30             NA      NA     NA     NA
## 0-250:30-60-250-500:0-30            0.14478 -1.5266 1.8161 1.0000
## 250-500:30-60-250-500:0-30          0.26199 -0.3064 0.8304 0.9102
## 500-750:30-60-250-500:0-30          0.12795 -0.5704 0.8263 1.0000
## 750 and up:30-60-250-500:0-30      -0.46413 -1.6874 0.7592 0.9760
## 0-250:60-100-250-500:0-30                NA      NA     NA     NA
## 250-500:60-100-250-500:0-30         0.50755 -0.5240 1.5391 0.8684
## 500-750:60-100-250-500:0-30         0.47862 -0.5530 1.5102 0.9066
## 750 and up:60-100-250-500:0-30     -0.16767 -0.9626 0.6272 0.9999
## 750 and up:0-30-500-750:0-30             NA      NA     NA     NA
## 0-250:30-60-500-750:0-30           -0.42817 -2.7059 1.8495 1.0000
## 250-500:30-60-500-750:0-30         -0.31096 -1.9594 1.3375 1.0000
## 500-750:30-60-500-750:0-30         -0.44500 -2.1427 1.2527 0.9989
## 750 and up:30-60-500-750:0-30      -1.03707 -3.0096 0.9355 0.8123
## 0-250:60-100-500-750:0-30                NA      NA     NA     NA
## 250-500:60-100-500-750:0-30        -0.06539 -1.9251 1.7943 1.0000
## 500-750:60-100-500-750:0-30        -0.09433 -1.9541 1.7654 1.0000
## 750 and up:60-100-500-750:0-30     -0.74062 -2.4802 0.9990 0.9460
## 0-250:30-60-750 and up:0-30              NA      NA     NA     NA
## 250-500:30-60-750 and up:0-30            NA      NA     NA     NA
## 500-750:30-60-750 and up:0-30            NA      NA     NA     NA
## 750 and up:30-60-750 and up:0-30         NA      NA     NA     NA
## 0-250:60-100-750 and up:0-30             NA      NA     NA     NA
## 250-500:60-100-750 and up:0-30           NA      NA     NA     NA
## 500-750:60-100-750 and up:0-30           NA      NA     NA     NA
## 750 and up:60-100-750 and up:0-30        NA      NA     NA     NA
## 250-500:30-60-0-250:30-60           0.11721 -1.5313 1.7657 1.0000
## 500-750:30-60-0-250:30-60          -0.01683 -1.7145 1.6809 1.0000
## 750 and up:30-60-0-250:30-60       -0.60890 -2.5814 1.3636 0.9954
## 0-250:60-100-0-250:30-60                 NA      NA     NA     NA
## 250-500:60-100-0-250:30-60          0.36278 -1.4970 2.2225 0.9999
## 500-750:60-100-0-250:30-60          0.33385 -1.5259 2.1936 1.0000
## 750 and up:60-100-0-250:30-60      -0.31245 -2.0521 1.4272 1.0000
## 500-750:30-60-250-500:30-60        -0.13404 -0.7757 0.5076 0.9999
## 750 and up:30-60-250-500:30-60     -0.72612 -1.9180 0.4657 0.6369
## 0-250:60-100-250-500:30-60               NA      NA     NA     NA
## 250-500:60-100-250-500:30-60        0.24556 -0.7485 1.2396 0.9994
## 500-750:60-100-250-500:30-60        0.21663 -0.7774 1.2107 0.9998
## 750 and up:60-100-250-500:30-60    -0.42967 -1.1752 0.3159 0.7107
## 750 and up:30-60-500-750:30-60     -0.59207 -1.8511 0.6670 0.8987
## 0-250:60-100-500-750:30-60               NA      NA     NA     NA
## 250-500:60-100-500-750:30-60        0.37961 -0.6941 1.4533 0.9860
## 500-750:60-100-500-750:30-60        0.35067 -0.7230 1.4244 0.9926
## 750 and up:60-100-500-750:30-60    -0.29562 -1.1445 0.5532 0.9875
## 0-250:60-100-750 and up:30-60            NA      NA     NA     NA
## 250-500:60-100-750 and up:30-60     0.97168 -0.4986 2.4419 0.5170
## 500-750:60-100-750 and up:30-60     0.94275 -0.5275 2.4130 0.5626
## 750 and up:60-100-750 and up:30-60  0.29645 -1.0186 1.6115 0.9997
## 250-500:60-100-0-250:60-100              NA      NA     NA     NA
## 500-750:60-100-0-250:60-100              NA      NA     NA     NA
## 750 and up:60-100-0-250:60-100           NA      NA     NA     NA
## 500-750:60-100-250-500:60-100      -0.02893 -1.3440 1.2861 1.0000
## 750 and up:60-100-250-500:60-100   -0.67523 -1.8141 0.4636 0.6740
## 750 and up:60-100-500-750:60-100   -0.64630 -1.7851 0.4925 0.7296

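A low adjusted p-value (p adj) is evidence that the two group means differ. Here no comparison has an adjusted p-value below 0.05. The NA rows presumably correspond to combinations of levels that have no observations.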

tukey<-TukeyHSD(model4)
plot(tukey)

[Figure: Tukey confidence intervals for each set of pairwise comparisons]

The first 3 plots suggest there is no difference in mean ERA between any pair of levels within each factor, since every confidence interval crosses the zero line. The other plots, which compare all the combinations between factors, are difficult to read because there are so many combinations.
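When the Tukey output is too long to scan by eye, the comparisons can be filtered programmatically. The sketch below uses hypothetical data with one clearly shifted group so that some pairs differ; the names toy, toy_tukey, and comparisons are illustrative. Each element of a TukeyHSD() result is a matrix with a "p adj" column.

```r
# Hypothetical data with one clearly shifted group (group "c").
toy <- data.frame(
  ERA   = c(3.0, 3.1, 3.2, 3.1, 3.0, 3.2, 3.1, 3.0, 5.0, 5.1, 5.2, 5.1),
  group = factor(rep(c("a", "b", "c"), each = 4))
)
toy_tukey <- TukeyHSD(aov(ERA ~ group, data = toy))

# Matrix with columns diff, lwr, upr, and "p adj"; one row per pair of levels.
comparisons <- toy_tukey$group

# Keep only the pairs whose adjusted p-value is below 0.05.
significant <- comparisons[comparisons[, "p adj"] < 0.05, , drop = FALSE]
rownames(significant)
```

Applied to the real model, the same filter on each element of TukeyHSD(model4) would return no rows, in line with the output above.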

#Interaction Plot

We create an interaction plot to view the interactions between the factors.

To run the plot, we must re-save the factors as numeric:

data$E=as.numeric(data$E)
data$PO=as.numeric(data$PO)
data$SB=as.numeric(data$SB)

interaction.plot(data$E, data$PO, data$SB)

[Figure: interaction plot]

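If there were no interaction between factors, the lines would be parallel and would not intersect. The lines here intersect and are not parallel, which suggests an interaction. Note, however, that interaction.plot() treats its third argument as the response, so the call above plots the mean stolen-base level for each combination of error and put-out levels rather than their joint effect on ERA; to examine that, data$ERA would be passed as the response.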
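The argument order of interaction.plot() is x.factor, trace.factor, response. The following is a minimal sketch with hypothetical data (the toy data frame and its ERA values are made up) that puts an ERA-like response on the y-axis and draws one line per put-out level.

```r
# Hypothetical data: two crossed factors with two replicates per cell.
toy <- expand.grid(
  E   = factor(c("0-3", "3-6", "6-9")),
  PO  = factor(c("0-250", "250-500")),
  rep = 1:2
)
toy$ERA <- c(3.8, 4.0, 4.2, 4.1, 4.0, 3.9, 3.9, 4.1, 4.3, 4.0, 3.9, 3.8)

# x.factor, trace.factor, response: mean ERA on the y-axis, one line per PO level.
interaction.plot(toy$E, toy$PO, toy$ERA,
                 xlab = "Errors", trace.label = "Put outs", ylab = "Mean ERA")

# The cell means being plotted can be inspected directly.
cell_means <- tapply(toy$ERA, list(toy$E, toy$PO), mean)
cell_means
```

The factors do not need to be converted to numeric for this call; only the response has to be numeric.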

4. References to the Literature

None used.

5. Contingencies

Since it was unclear whether the normality assumption holds, we can use the Kruskal-Wallis test, a non-parametric alternative to one-way analysis of variance, to test for differences between groups. The null hypothesis is that ERA has the same distribution in every group, so that any observed differences between groups are due to chance.

kruskal.test(data$ERA, data$E)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$ERA and data$E
## Kruskal-Wallis chi-squared = 3.675, df = 2, p-value = 0.1593

kruskal.test(data$ERA, data$PO)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$ERA and data$PO
## Kruskal-Wallis chi-squared = 5.902, df = 3, p-value = 0.1165

kruskal.test(data$ERA, data$SB)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  data$ERA and data$SB
## Kruskal-Wallis chi-squared = 1.184, df = 2, p-value = 0.5533

Since we have large p-values for each result, we fail to reject the null hypothesis. As the other tests suggested, there is no evidence that the errors, put outs, or stolen bases allowed by a catcher can explain the variation in ERA.
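The Kruskal-Wallis p-value is the upper-tail area of a chi-squared distribution with (number of groups minus 1) degrees of freedom. As a quick check, it can be recomputed from the statistic printed in the first Kruskal-Wallis output above; the variable names are illustrative.

```r
# Test statistic and degrees of freedom copied from the Kruskal-Wallis
# output for data$ERA and data$E above.
kw_stat <- 3.675
kw_df   <- 2

# Upper-tail area of the chi-squared distribution is the p-value.
p_value <- pchisq(kw_stat, df = kw_df, lower.tail = FALSE)
round(p_value, 3)
```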

6. Appendices

Complete R Code

All included above.