Recipes for the Design of Experiments

As of August 28, 2014, superseding the version of August 24. Always use the most recent version.

Analysis of Regular Season Wins in Baseball for Teams which Played from 1973-2013. (Two-Factor, Multi-Level Experiment)

Brendan Howell

Rensselaer Polytechnic Institute

10/09/14 - Version 1.0

1. Setting

Dataset of Regular Season Wins Earned by Baseball Teams that played from 1973-2013 (‘teams_forty’).

Description: The updated version of Lahman’s Baseball Database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. In this analysis, only the years ranging from 1973-2013 will be considered.

This study performs a two-factor, multi-level experiment to test whether the number of hits a team earns in a season, the number of homeruns it earns in a season, or the interaction of the two has a statistically significant effect on the number of games the team wins in that season. In the dataset, the factor ‘H’ is the number of hits that a given team earned in a given year and the factor ‘HR’ is the number of homeruns that a given team earned in a given year. The response variable, ‘W’, is the number of regular season wins that a given team earned in a given year.

##Load in the Teams Dataset
#Get dataset from Project Documents File
teams_raw <- read.csv("~/Academics (RPI)/09. Fall 2014/Design of Experiments/02. Wikibook Recipes/Recipe #03/Teams.csv", header=TRUE)
head(teams_raw)

##   yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1   1871 <NA>    PH1      PNA          1 28    NA 21  7                  Y
## 2   1871 <NA>    CH1      CNA          2 28    NA 19  9                  N
## 3   1871 <NA>    BS1      BNA          3 31    NA 20 10                  N
## 4   1871 <NA>    WS3      OLY          4 32    NA 15 15                  N
## 5   1871 <NA>    NY2      NNA          5 33    NA 16 17                  N
## 6   1871 <NA>    TRO      TRO          6 29    NA 13 15                  N
##   WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1       376 1281 410  66  27  9 46 23 56 NA  NA NA 266 137 4.95 27   0  0
## 2       302 1196 323  52  21 10 60 22 69 NA  NA NA 241  77 2.76 25   0  1
## 3       401 1372 426  70  37  3 60 19 73 NA  NA NA 303 109 3.55 22   1  3
## 4       310 1353 375  54  26  6 48 13 48 NA  NA NA 303 137 4.37 32   0  0
## 5       302 1404 403  43  21  1 33 15 46 NA  NA NA 313 121 3.72 32   1  0
## 6       351 1248 384  51  34  6 49 19 62 NA  NA NA 362 153 5.51 28   0  0
##   IPouts  HA HRA BBA SOA   E DP   FP                    name
## 1    747 329   3  53  16 194 NA 0.84  Philadelphia Athletics
## 2    753 308   6  28  22 218 NA 0.82 Chicago White Stockings
## 3    828 367   2  42  23 225 NA 0.83    Boston Red Stockings
## 4    846 371   4  45  13 217 NA 0.85     Washington Olympics
## 5    879 373   7  42  22 227 NA 0.83        New York Mutuals
## 6    750 431   4  75  12 198 NA 0.84          Troy Haymakers
##                       park attendance BPF PPF teamIDBR teamIDlahman45
## 1 Jefferson Street Grounds         NA 102  98      ATH            PH1
## 2  Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3      South End Grounds I         NA 103  98      BOS            BS1
## 4         Olympics Grounds         NA  94  98      OLY            WS3
## 5 Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6       Haymakers' Grounds         NA 101 100      TRO            TRO
##   teamIDretro
## 1         PH1
## 2         CH1
## 3         BS1
## 4         WS3
## 5         NY2
## 6         TRO

tail(teams_raw)

##      yearID lgID teamID franchID divID Rank   G Ghome  W   L DivWin WCWin
## 2740   2013   NL    MIA      FLA     E    5 162    81 62 100      N     N
## 2741   2013   NL    LAN      LAD     W    1 162    81 92  70      Y     N
## 2742   2013   NL    ARI      ARI     W    2 162    81 81  81      N     N
## 2743   2013   NL    SDN      SDP     W    3 162    81 76  86      N     N
## 2744   2013   NL    SFN      SFG     W    4 162    82 76  86      N     N
## 2745   2013   NL    COL      COL     W    5 162    81 74  88      N     N
##      LgWin WSWin   R   AB    H X2B X3B  HR  BB   SO  SB CS HBP SF  RA  ER
## 2740     N     N 513 5449 1257 219  31  95 432 1232  78 29  56 26 646 602
## 2741     N     N 649 5491 1447 281  17 138 476 1146  78 28  57 48 582 524
## 2742     N     N 685 5676 1468 302  31 130 519 1142  62 41  43 43 695 651
## 2743     N     N 618 5517 1349 246  26 146 467 1309 118 34  52 34 700 643
## 2744     N     N 629 5552 1446 280  35 107 469 1078  67 26  39 42 691 643
## 2745     N     N 706 5599 1511 283  36 159 427 1204 112 32  26 35 760 708
##       ERA CG SHO SV IPouts   HA HRA BBA  SOA   E  DP    FP
## 2740 3.71  2   1 36   4380 1376 121 526 1177  88 144 0.986
## 2741 3.25  7   4 46   4351 1321 127 460 1292 109 160 0.982
## 2742 3.92  6   2 38   4485 1460 176 485 1218  75 134 0.988
## 2743 3.98  3   1 40   4365 1407 156 525 1171  83 140 0.986
## 2744 4.00  2   2 41   4342 1380 145 521 1256 107 126 0.982
## 2745 4.44  1   0 35   4308 1545 136 517 1064  90 162 0.986
##                      name           park attendance BPF PPF teamIDBR
## 2740        Miami Marlins   Marlins Park    1586322 102 103      MIA
## 2741  Los Angeles Dodgers Dodger Stadium    3743527  95  95      LAD
## 2742 Arizona Diamondbacks    Chase Field    2134795 102 102      ARI
## 2743     San Diego Padres     Petco Park    2166691  91  91      SDP
## 2744 San Francisco Giants      AT&T Park    3326796  90  89      SFG
## 2745     Colorado Rockies    Coors Field    2793828 117 118      COL
##      teamIDlahman45 teamIDretro
## 2740            FLO         MIA
## 2741            LAN         LAN
## 2742            ARI         ARI
## 2743            SDN         SDN
## 2744            SFN         SFN
## 2745            COL         COL

Factors and Levels

This analysis considers two factors, ‘H’ and ‘HR’, each with multiple levels. In the original dataset “teams_raw”, both ‘H’ and ‘HR’ are integer variables with no categorical levels. For this analysis, they are transformed into categorical variables with manually defined levels. These factors were selected intuitively, since the analysis aims to determine whether the number of hits and/or the number of homeruns that a given team earns in a given season has a significant effect on the number of regular season wins that the team earns in that season.

#Display the summary statistics of "teams_raw".
summary(teams_raw)

##      yearID       lgID          teamID        franchID    divID   
##  Min.   :1871   AA  :  85   CHN    : 138   ATL    : 138    :1517  
##  1st Qu.:1918   AL  :1175   PHI    : 131   CHC    : 138   C: 215  
##  Median :1961   FL  :  16   PIT    : 127   CIN    : 132   E: 518  
##  Mean   :1954   NL  :1399   CIN    : 124   PIT    : 132   W: 495  
##  3rd Qu.:1990   PL  :   8   SLN    : 122   STL    : 132           
##  Max.   :2013   UA  :  12   BOS    : 113   PHI    : 131           
##                 NA's:  50   (Other):1990   (Other):1942           
##       Rank              G             Ghome            W         
##  Min.   : 1.000   Min.   :  6.0   Min.   :44.0   Min.   :  0.00  
##  1st Qu.: 2.000   1st Qu.:153.0   1st Qu.:77.0   1st Qu.: 66.00  
##  Median : 4.000   Median :157.0   Median :80.0   Median : 77.00  
##  Mean   : 4.132   Mean   :150.1   Mean   :78.4   Mean   : 74.61  
##  3rd Qu.: 6.000   3rd Qu.:162.0   3rd Qu.:81.0   3rd Qu.: 87.00  
##  Max.   :13.000   Max.   :165.0   Max.   :84.0   Max.   :116.00  
##                                   NA's   :399                    
##        L          DivWin   WCWin    LgWin    WSWin          R         
##  Min.   :  4.00    :1545    :2181    :  28    : 357   Min.   :  24.0  
##  1st Qu.: 65.00   N: 982   N: 522   N:2449   N:2274   1st Qu.: 612.0  
##  Median : 76.00   Y: 218   Y:  42   Y: 268   Y: 114   Median : 690.0  
##  Mean   : 74.61                                       Mean   : 682.1  
##  3rd Qu.: 86.00                                       3rd Qu.: 765.0  
##  Max.   :134.00                                       Max.   :1220.0  
##                                                                       
##        AB             H             X2B             X3B        
##  Min.   : 211   Min.   :  33   Min.   :  3.0   Min.   :  0.00  
##  1st Qu.:5117   1st Qu.:1297   1st Qu.:192.0   1st Qu.: 31.00  
##  Median :5379   Median :1393   Median :229.0   Median : 42.00  
##  Mean   :5134   Mean   :1345   Mean   :226.6   Mean   : 47.48  
##  3rd Qu.:5514   3rd Qu.:1468   3rd Qu.:269.0   3rd Qu.: 61.00  
##  Max.   :5781   Max.   :1783   Max.   :376.0   Max.   :150.00  
##                                                                
##        HR            BB              SO               SB       
##  Min.   :  0   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 40   1st Qu.:425.0   1st Qu.: 501.0   1st Qu.: 64.0  
##  Median :105   Median :494.0   Median : 746.0   Median : 96.0  
##  Mean   :100   Mean   :473.8   Mean   : 726.3   Mean   :112.8  
##  3rd Qu.:148   3rd Qu.:555.0   3rd Qu.: 955.0   3rd Qu.:143.0  
##  Max.   :264   Max.   :835.0   Max.   :1535.0   Max.   :581.0  
##                                NA's   :120      NA's   :144    
##        CS              HBP               SF              RA        
##  Min.   :  0.00   Min.   : 26.00   Min.   :24.00   Min.   :  34.0  
##  1st Qu.: 35.00   1st Qu.: 47.00   1st Qu.:39.00   1st Qu.: 608.0  
##  Median : 46.00   Median : 54.50   Median :44.00   Median : 688.0  
##  Mean   : 49.21   Mean   : 56.35   Mean   :45.09   Mean   : 682.1  
##  3rd Qu.: 59.00   3rd Qu.: 64.00   3rd Qu.:50.25   3rd Qu.: 765.0  
##  Max.   :191.00   Max.   :103.00   Max.   :75.00   Max.   :1252.0  
##  NA's   :859      NA's   :2325     NA's   :2325                    
##        ER              ERA              CG             SHO        
##  Min.   :  25.0   Min.   :1.220   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.: 498.0   1st Qu.:3.330   1st Qu.: 16.0   1st Qu.: 6.000  
##  Median : 590.0   Median :3.820   Median : 46.0   Median : 9.000  
##  Mean   : 569.8   Mean   :3.815   Mean   : 51.5   Mean   : 9.435  
##  3rd Qu.: 667.0   3rd Qu.:4.310   3rd Qu.: 78.0   3rd Qu.:12.000  
##  Max.   :1023.0   Max.   :8.000   Max.   :148.0   Max.   :32.000  
##                                                                   
##        SV            IPouts           HA            HRA     
##  Min.   : 0.00   Min.   : 162   Min.   :  49   Min.   :  0  
##  1st Qu.: 9.00   1st Qu.:4071   1st Qu.:1287   1st Qu.: 43  
##  Median :23.00   Median :4224   Median :1392   Median :107  
##  Mean   :23.25   Mean   :4015   Mean   :1345   Mean   :100  
##  3rd Qu.:37.00   3rd Qu.:4339   3rd Qu.:1471   3rd Qu.:147  
##  Max.   :68.00   Max.   :4518   Max.   :1993   Max.   :241  
##                                                             
##       BBA             SOA               E               DP       
##  Min.   :  0.0   Min.   :   0.0   Min.   : 47.0   Min.   : 18.0  
##  1st Qu.:426.0   1st Qu.: 499.0   1st Qu.:119.0   1st Qu.:126.0  
##  Median :496.0   Median : 721.0   Median :147.0   Median :144.0  
##  Mean   :474.1   Mean   : 719.9   Mean   :188.6   Mean   :140.1  
##  3rd Qu.:556.0   3rd Qu.: 951.0   3rd Qu.:219.0   3rd Qu.:160.0  
##  Max.   :827.0   Max.   :1428.0   Max.   :639.0   Max.   :217.0  
##                                                   NA's   :317    
##        FP                            name                       park     
##  Min.   :0.7600   Cincinnati Reds      : 123   Wrigley Field      : 100  
##  1st Qu.:0.9600   Pittsburgh Pirates   : 123   Sportsman's Park IV:  90  
##  Median :0.9700   Philadelphia Phillies: 122   Comiskey Park      :  80  
##  Mean   :0.9605   St. Louis Cardinals  : 114   Fenway Park II     :  80  
##  3rd Qu.:0.9800   Chicago White Sox    : 113   Forbes Field       :  60  
##  Max.   :0.9910   Detroit Tigers       : 113   Crosley Field      :  58  
##                   (Other)              :2037   (Other)            :2277  
##    attendance           BPF             PPF           teamIDBR   
##  Min.   :   6088   Min.   : 60.0   Min.   : 60.0   CHC    : 138  
##  1st Qu.: 518051   1st Qu.: 97.0   1st Qu.: 97.0   CIN    : 137  
##  Median :1107122   Median :100.0   Median :100.0   STL    : 135  
##  Mean   :1317241   Mean   :100.2   Mean   :100.2   PHI    : 134  
##  3rd Qu.:1950099   3rd Qu.:103.0   3rd Qu.:103.0   PIT    : 132  
##  Max.   :4483350   Max.   :129.0   Max.   :141.0   BOS    : 121  
##  NA's   :279                                       (Other):1948  
##  teamIDlahman45  teamIDretro  
##  CHN    : 138   CHN    : 138  
##  PHI    : 131   PHI    : 131  
##  PIT    : 127   PIT    : 127  
##  CIN    : 124   CIN    : 124  
##  SLN    : 122   SLN    : 122  
##  BOS    : 113   BOS    : 113  
##  (Other):1990   (Other):1990

#Display the names found in "teams_raw".
names(teams_raw)

##  [1] "yearID"         "lgID"           "teamID"         "franchID"      
##  [5] "divID"          "Rank"           "G"              "Ghome"         
##  [9] "W"              "L"              "DivWin"         "WCWin"         
## [13] "LgWin"          "WSWin"          "R"              "AB"            
## [17] "H"              "X2B"            "X3B"            "HR"            
## [21] "BB"             "SO"             "SB"             "CS"            
## [25] "HBP"            "SF"             "RA"             "ER"            
## [29] "ERA"            "CG"             "SHO"            "SV"            
## [33] "IPouts"         "HA"             "HRA"            "BBA"           
## [37] "SOA"            "E"              "DP"             "FP"            
## [41] "name"           "park"           "attendance"     "BPF"           
## [45] "PPF"            "teamIDBR"       "teamIDlahman45" "teamIDretro"

#Display the structure of "teams_raw".
str(teams_raw)

## 'data.frame':    2745 obs. of  48 variables:
##  $ yearID        : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
##  $ lgID          : Factor w/ 6 levels "AA","AL","FL",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ teamID        : Factor w/ 149 levels "ALT","ANA","ARI",..: 97 31 24 142 90 136 56 39 111 24 ...
##  $ franchID      : Factor w/ 120 levels "ALT","ANA","ARI",..: 85 36 13 77 70 109 56 25 91 13 ...
##  $ divID         : Factor w/ 4 levels "","C","E","W": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Rank          : int  1 2 3 4 5 6 7 8 9 1 ...
##  $ G             : int  28 28 31 32 33 29 19 29 25 48 ...
##  $ Ghome         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ W             : int  21 19 20 15 16 13 7 10 4 39 ...
##  $ L             : int  7 9 10 15 17 15 12 19 21 8 ...
##  $ DivWin        : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ WCWin         : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LgWin         : Factor w/ 3 levels "","N","Y": 3 2 2 2 2 2 2 2 2 3 ...
##  $ WSWin         : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ R             : int  376 302 401 310 302 351 137 249 231 521 ...
##  $ AB            : int  1281 1196 1372 1353 1404 1248 746 1186 1036 2137 ...
##  $ H             : int  410 323 426 375 403 384 178 328 274 677 ...
##  $ X2B           : int  66 52 70 54 43 51 19 35 44 114 ...
##  $ X3B           : int  27 21 37 26 21 34 8 40 25 31 ...
##  $ HR            : int  9 10 3 6 1 6 2 7 3 7 ...
##  $ BB            : int  46 60 60 48 33 49 33 26 38 28 ...
##  $ SO            : int  23 22 19 13 15 19 9 25 30 26 ...
##  $ SB            : int  56 69 73 48 46 62 16 18 53 47 ...
##  $ CS            : int  NA NA NA NA NA NA NA NA NA 14 ...
##  $ HBP           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SF            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ RA            : int  266 241 303 303 313 362 243 341 287 236 ...
##  $ ER            : int  137 77 109 137 121 153 97 116 108 95 ...
##  $ ERA           : num  4.95 2.76 3.55 4.37 3.72 5.51 5.17 4.11 4.3 1.99 ...
##  $ CG            : int  27 25 22 32 32 28 19 23 23 41 ...
##  $ SHO           : int  0 0 1 0 1 0 1 0 1 3 ...
##  $ SV            : int  0 1 3 0 0 0 0 0 0 1 ...
##  $ IPouts        : int  747 753 828 846 879 750 507 762 678 1290 ...
##  $ HA            : int  329 308 367 371 373 431 261 346 315 438 ...
##  $ HRA           : int  3 6 2 4 7 4 5 13 3 0 ...
##  $ BBA           : int  53 28 42 45 42 75 21 53 34 27 ...
##  $ SOA           : int  16 22 23 13 22 12 17 34 16 0 ...
##  $ E             : int  194 218 225 217 227 198 163 223 220 263 ...
##  $ DP            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ FP            : num  0.84 0.82 0.83 0.85 0.83 0.84 0.8 0.81 0.82 0.87 ...
##  $ name          : Factor w/ 139 levels "Altoona Mountain City",..: 97 42 17 135 93 131 63 51 111 17 ...
##  $ park          : Factor w/ 213 levels "","23rd Street Grounds",..: 87 197 170 130 199 80 77 116 4 170 ...
##  $ attendance    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BPF           : int  102 104 103 94 90 101 101 96 97 105 ...
##  $ PPF           : int  98 102 98 98 88 100 107 100 99 100 ...
##  $ teamIDBR      : Factor w/ 101 levels "ALT","ANA","ARI",..: 4 21 10 65 62 93 42 25 77 10 ...
##  $ teamIDlahman45: Factor w/ 148 levels "ALT","ANA","ARI",..: 96 31 24 140 89 135 56 39 110 24 ...
##  $ teamIDretro   : Factor w/ 149 levels "ALT","ANA","ARI",..: 96 31 24 141 89 135 56 39 110 24 ...

Continuous variables (if any)

In this dataset, a few variables can be considered continuous: those stored as numeric variables. By this standard, the continuous variables are ‘ERA’ (a team’s overall earned run average) and ‘FP’ (a team’s overall fielding percentage). The two main factors in our analysis, ‘H’ and ‘HR’, are currently stored as integer variables. Once their integer values are grouped into specific “integer-range” levels, these factors will be treated as categorical variables.

Response variables

This analysis will consider one response variable, ‘W’, which denotes the number of regular season wins that a given team earned in a given year.

The Data: How is it organized and what does it look like?

As a whole, Lahman’s Baseball Database contains information pertaining to pitching, hitting, and fielding statistics for Major League Baseball from the years 1871 through 2013. It includes data from the two current leagues (American and National), the four other “major” leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875. It contains 2,745 observations of 48 variables, which are defined below (Lahman, 1996-2014) [1]:

Variable ID [Description]

yearID [Year]

lgID [League]

teamID [Team]

franchID [Franchise (links to TeamsFranchise table)]

divID [Team’s division]

Rank [Position in final standings]

G [Games played]

Ghome [Games played at home]

W [Wins]

L [Losses]

DivWin [Division Winner (Y or N)]

WCWin [Wild Card Winner (Y or N)]

LgWin [League Champion (Y or N)]

WSWin [World Series Winner (Y or N)]

R [Runs scored]

AB [At bats]

H [Hits by batters]

X2B [Doubles (‘2B’ in the source file)]

X3B [Triples (‘3B’ in the source file)]

HR [Homeruns by batters]

BB [Walks by batters]

SO [Strikeouts by batters]

SB [Stolen bases]

CS [Caught stealing]

HBP [Batters hit by pitch]

SF [Sacrifice flies]

RA [Opponents runs scored]

ER [Earned runs allowed]

ERA [Earned run average]

CG [Complete games]

SHO [Shutouts]

SV [Saves]

IPouts [Outs Pitched (innings pitched x 3)]

HA [Hits allowed]

HRA [Homeruns allowed]

BBA [Walks allowed]

SOA [Strikeouts by pitchers]

E [Errors]

DP [Double Plays]

FP [Fielding percentage]

name [Team’s full name]

park [Name of team’s home ballpark]

attendance [Home attendance total]

BPF [Three-year park factor for batters]

PPF [Three-year park factor for pitchers]

teamIDBR [Team ID used by Baseball Reference website]

teamIDlahman45 [Team ID used in Lahman database version 4.5]

teamIDretro [Team ID used by Retrosheet]

Randomization

Because this dataset is a comprehensive statistical resource for baseball statistics gathered from 1871-2013 (in theory, it contains every observation that could be collected during those years), we assume that it exhibits randomization without bias.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

In this experiment, we want to determine whether the variation observed in the response variable (‘W’) can be explained by variation in the two treatments of the experiment (‘H’ and ‘HR’). The null hypothesis states that the number of hits and the number of homeruns earned by a given team in a given year do not have a significant effect on the number of regular season wins that the team earns in that year. To test it, we perform an analysis of variance (ANOVA) on regular season wins (‘W’) to see whether the mean of this response differs significantly across the levels of hits (‘H’) and homeruns (‘HR’) in the dataset.
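
The design described above can be written as the standard two-factor fixed-effects model with interaction. The symbols below are assumed notation introduced for illustration and do not appear in the original analysis: \alpha_i is the effect of level i of ‘H’ (three levels), \beta_j is the effect of level j of ‘HR’ (two levels), and n_{ij} is the number of team-seasons in each cell.

```latex
W_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad i = 1, 2, 3; \quad j = 1, 2; \quad k = 1, \dots, n_{ij}

H_0^{(H)}:\ \alpha_1 = \alpha_2 = \alpha_3 = 0, \qquad
H_0^{(HR)}:\ \beta_1 = \beta_2 = 0, \qquad
H_0^{(H \times HR)}:\ (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
```

Each null hypothesis corresponds to one row of an ANOVA table: one for ‘H’, one for ‘HR’, and one for their interaction.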

What is the rationale for this design?

The rationale for this design is that regular season wins are a useful and relevant metric for identifying which performance-related factors matter most to a team’s success. A two-factor, multi-level experiment analyzed with an analysis of variance lets us test whether the number of hits and the number of homeruns earned by a given team in a given year have a significant effect on the number of regular season wins that the team earns in that year. The analysis should therefore give some insight into the factors that typically lead to regular season wins for baseball teams.

Randomize: What is the Randomization Scheme?

Since our original assumption was that the entirety of Lahman’s Baseball Database exhibits randomization, we did not randomize the data any further to obtain a completely randomized design. However, to keep the analysis reasonable in scope, a chronological subset was extracted from the database: all data from 1973 through 2013, for a total of 41 seasons. This subset contains only 6 variables: ‘yearID’, ‘lgID’, ‘teamID’, ‘W’, ‘H’, and ‘HR’. The other 42 variables are not considered in this analysis, so they were dropped.

#Create subset of the dataset that only includes team data from the years 1973-2013.
teams_forty <- subset(teams_raw, teams_raw$yearID >= 1973, select=c(yearID:teamID,W,H,HR))
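
As a quick sanity check on the subsetting step, the sketch below uses a small made-up data frame (`toy_teams` is hypothetical and its values are not taken from the Lahman data) to show how `subset()` filters rows and keeps only the selected columns. It also counts the seasons from 1973 through 2013 with both endpoints included.

```r
# Hypothetical toy data standing in for teams_raw (values are made up).
toy_teams <- data.frame(
  yearID = c(1971, 1972, 1973, 1990, 2013),
  lgID   = c("AL", "NL", "AL", "NL", "AL"),
  teamID = c("BAL", "CHN", "BOS", "CIN", "DET"),
  W      = c(101, 85, 89, 91, 93),
  H      = c(1382, 1346, 1472, 1466, 1625),
  HR     = c(158, 133, 147, 125, 176),
  E      = c(112, 132, 127, 102, 76)   # an extra column that will be dropped
)

# Keep rows from 1973 onward and only the six columns used in the analysis.
toy_forty <- subset(toy_teams, yearID >= 1973,
                    select = c(yearID:teamID, W, H, HR))

# Number of seasons in the range, counting both 1973 and 2013.
n_seasons <- length(1973:2013)

print(toy_forty)
print(n_seasons)
```

Inside `subset()`, column names can be used directly in the row condition, so the `teams_raw$` prefix is optional there.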

Replicate: Are there replicates and/or repeated measures?

In this experiment, there are no replicates or repeated measures present.

Block: Did you use blocking in the design?

In order to transform our integer factor variables into categorical factor variables, blocking was used in this design. In transforming the number of hits, ‘H’, into a categorical variable, three levels were defined, designating a low, a medium, and a high number of hits by a given team in a given year. These levels were determined by calculating the first and third quartiles of the hitting data in the subsetted dataset “teams_forty” (see numeric levels in R code below). In transforming the number of homeruns, ‘HR’, into a categorical variable, two levels were defined, designating an above average and a below average number of homeruns earned by a given team in a given year. These levels were determined by calculating the mean of all of the homerun data in the subsetted dataset “teams_forty” (see numeric levels in R code below). Therefore, in this newly subsetted dataset, factor ‘H’ now has three distinct levels (“Low”, “Medium”, and “High”) and factor ‘HR’ now has two distinct levels (“Above Average” and “Below Average”).

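#Transform 'H' into categorical variables (Low, Medium, and High).
#Note: the first assignment coerces the column from integer to character, so the
#next two comparisons are string comparisons. They happen to give the intended
#result here because every remaining numeric value has four digits.
teams_forty$H[teams_forty$H > 0 & teams_forty$H <= 1376] = "Low"
teams_forty$H[teams_forty$H > 1376 & teams_forty$H <= 1494] = "Medium"
teams_forty$H[teams_forty$H > 1494 & teams_forty$H <= 1684] = "High"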
#Categorize 'H' as a factor and display its resulting levels.
teams_forty$H = as.factor(teams_forty$H)
levels(teams_forty$H)

## [1] "High"   "Low"    "Medium"
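
Overwriting a numeric column with labels one range at a time turns it into a character column partway through. An alternative, sketched here on simulated hit totals (the vector `hits` is made up for illustration and is not the Lahman data), is to bin the values in a single step with `cut()` and to compute the quartile cut points with `quantile()` instead of hard-coding them.

```r
set.seed(1)
# Simulated season hit totals (illustrative only, not the Lahman values).
hits <- round(rnorm(200, mean = 1430, sd = 80))

# Quartile-based cut points computed from the data rather than typed in.
q <- unname(quantile(hits, probs = c(0.25, 0.75)))

# cut() bins the numeric vector in one pass; intervals are (lower, upper].
hits_level <- cut(hits,
                  breaks = c(-Inf, q[1], q[2], Inf),
                  labels = c("Low", "Medium", "High"))

levels(hits_level)
table(hits_level)
```

With this approach the levels are stored in the order Low, Medium, High rather than alphabetically, which changes the order in which they appear in summaries and plots.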

#Transform 'HR' into categorical variables ("Above Average" and "Below Average").
teams_forty$HR[teams_forty$HR >= mean(teams_forty$HR)] = "Above Average"
teams_forty$HR[teams_forty$HR != "Above Average"] = "Below Average"
#Categorize 'HR' as a factor and display its resulting levels.
teams_forty$HR = as.factor(teams_forty$HR)
levels(teams_forty$HR)

## [1] "Above Average" "Below Average"
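
The same above/below-average split can be sketched without overwriting the numeric column. The example below is a minimal illustration on simulated homerun totals (the vector `hr` is made up); it stores the mean first and then builds a separate factor with `ifelse()`.

```r
set.seed(2)
# Simulated season homerun totals (illustrative only).
hr <- round(rnorm(200, mean = 150, sd = 35))

# Compute the threshold once, before anything is modified.
hr_mean <- mean(hr)

# Build a new factor instead of overwriting the numeric vector in place.
hr_level <- factor(ifelse(hr >= hr_mean, "Above Average", "Below Average"),
                   levels = c("Above Average", "Below Average"))

table(hr_level)
```

Keeping the original numeric values alongside the factor makes it possible to revisit the threshold later without reloading the data.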

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and Descriptive Summary

In beginning to display this data graphically, summary statistics were gathered for the newly created dataset, “teams_forty”. Additionally, histograms and boxplots were created to represent the different observations of regular season wins existent within this subsetted dataset that contains data from 1973-2013.

#Display the summary statistics of "teams_forty".
summary(teams_forty)

##      yearID     lgID         teamID          W               H      
##  Min.   :1973   AA:  0   ATL    : 41   Min.   : 37.00   High  :283  
##  1st Qu.:1984   AL:567   BAL    : 41   1st Qu.: 71.00   Low   :286  
##  Median :1994   FL:  0   BOS    : 41   Median : 80.00   Medium:563  
##  Mean   :1994   NL:565   CHA    : 41   Mean   : 79.47               
##  3rd Qu.:2004   PL:  0   CHN    : 41   3rd Qu.: 89.00               
##  Max.   :2013   UA:  0   CIN    : 41   Max.   :116.00               
##                          (Other):886                                
##              HR     
##  Above Average:559  
##  Below Average:573  
##                     
##                     
##                     
##                     
##

#Display the names found in "teams_forty".
names(teams_forty)

## [1] "yearID" "lgID"   "teamID" "W"      "H"      "HR"

#Display the structure of "teams_forty".
str(teams_forty)

## 'data.frame':    1132 obs. of  6 variables:
##  $ yearID: int  1973 1973 1973 1973 1973 1973 1973 1973 1973 1973 ...
##  $ lgID  : Factor w/ 6 levels "AA","AL","FL",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ teamID: Factor w/ 149 levels "ALT","ANA","ARI",..: 5 16 52 93 83 45 96 66 79 30 ...
##  $ W     : int  97 89 85 80 74 71 94 88 81 79 ...
##  $ H     : Factor w/ 3 levels "High","Low","Medium": 3 3 3 3 3 3 3 3 1 3 ...
##  $ HR    : Factor w/ 2 levels "Above Average",..: 2 1 1 2 2 1 1 2 2 2 ...

#Display the head and tail of "teams_forty".
head(teams_forty)

##      yearID lgID teamID  W      H            HR
## 1614   1973   AL    BAL 97 Medium Below Average
## 1615   1973   AL    BOS 89 Medium Above Average
## 1616   1973   AL    DET 85 Medium Above Average
## 1617   1973   AL    NYA 80 Medium Below Average
## 1618   1973   AL    ML4 74 Medium Below Average
## 1619   1973   AL    CLE 71 Medium Above Average

tail(teams_forty)

##      yearID lgID teamID  W      H            HR
## 2740   2013   NL    MIA 62    Low Below Average
## 2741   2013   NL    LAN 92 Medium Below Average
## 2742   2013   NL    ARI 81 Medium Below Average
## 2743   2013   NL    SDN 76    Low Below Average
## 2744   2013   NL    SFN 76 Medium Below Average
## 2745   2013   NL    COL 74   High Above Average

#Display the levels of 'H' and 'HR' within "teams_forty".
levels(teams_forty$H)

## [1] "High"   "Low"    "Medium"

levels(teams_forty$HR)

## [1] "Above Average" "Below Average"

par(mfrow=c(1,1))
#Create a histogram of Regular Season Wins ('W') for teams from 1973-2013.
hist(teams_forty$W, xlim=c(37,116), xlab = "Regular Season Wins", ylab = "Frequency")

par(mfrow=c(1,1))
#Create a boxplot of Regular Season Wins ('W') [for all teams from 1973-2013].
boxplot(teams_forty$W~teams_forty$teamID, main = "Regular Season Wins", ylim = c(37,116), xlab = "Teams", ylab = "Wins")
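
The structure output above shows that ‘teamID’ still carries 149 levels after subsetting, so a plot grouped by this factor may reserve a slot for teams that have no rows in “teams_forty”. The sketch below, using a made-up factor with hypothetical values, shows how `droplevels()` removes unused levels.

```r
# Toy factor with more levels than are actually present (made-up data).
team <- factor(c("BAL", "BOS", "BAL", "DET"),
               levels = c("ALT", "BAL", "BOS", "DET", "TRO"))

n_before <- nlevels(team)        # counts the unused "ALT" and "TRO" as well

# droplevels() keeps only the levels that occur in the data.
team_used <- droplevels(team)
n_after <- nlevels(team_used)

c(before = n_before, after = n_after)
levels(team_used)
```

Applied to this analysis, passing `droplevels(teams_forty$teamID)` as the grouping variable in the boxplot formula would be one way to draw only the teams that played during 1973-2013.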

Testing

To determine whether the variation observed in the response variable (the number of regular season wins) can be explained by variation in the treatments of the experiment (the number of hits and the number of homeruns earned by a given team in a given year), an analysis of variance (ANOVA) is performed. It compares mean regular season wins across the levels of hits and homeruns for teams that played from 1973-2013.

For each of the three ANOVA models in this experiment, the null hypothesis (which we will either reject or fail to reject by the end of our analysis) states that the number of hits and homeruns earned by a given team in a given year do not have a significant effect on the number of regular season wins that the team earns in that year. Under the null hypothesis, differences in mean regular season wins across the levels of hits and homeruns are solely the result of randomization. If we reject the null hypothesis, we infer that these differences are caused by something other than randomization, so the variation in mean regular season wins can be explained by variation in the levels of hits and homeruns considered in this analysis. If we fail to reject the null hypothesis, we infer that the variation in mean regular season wins cannot be explained by the levels of hits and homeruns and is likely caused by randomization.

#Perform an analysis of variance (ANOVA) for the different mean values observed in the number of regular season wins earned in a given year by a given team, given the factor 'H'.
model_hits <- aov(W~H,teams_forty)
anova(model_hits)

## Analysis of Variance Table
## 
## Response: W
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## H            2  36045 18022.3  147.26 < 2.2e-16 ***
## Residuals 1129 138172   122.4                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
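
To make the table above easier to read, the following sketch recomputes the F value and its p-value by hand from the printed sums of squares and degrees of freedom. The variable names are chosen here for illustration; because the printed sums of squares are rounded, small differences from the table are expected.

```r
# Values copied from the one-way ANOVA table for 'H' shown above.
ss_h   <- 36045    # sum of squares for H
df_h   <- 2        # 3 levels - 1
ss_res <- 138172   # residual sum of squares
df_res <- 1129     # 1132 observations - 3 levels

# Mean squares are sums of squares divided by their degrees of freedom.
ms_h   <- ss_h / df_h
ms_res <- ss_res / df_res

# The F value is the ratio of the two mean squares.
f_stat <- ms_h / ms_res

# Upper-tail probability of the F distribution under the null hypothesis.
p_value <- pf(f_stat, df_h, df_res, lower.tail = FALSE)

round(f_stat, 2)
p_value < 2.2e-16
```

The p-value is the probability, computed under the null hypothesis, of an F value at least this large; R prints it as `< 2.2e-16` when it falls below that display threshold.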

#Perform an analysis of variance (ANOVA) for the different mean values observed in the number of regular season wins earned in a given year by a given team, given the factor 'HR'.
model_homeruns <- aov(W~HR,teams_forty)
anova(model_homeruns)

## Analysis of Variance Table
## 
## Response: W
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## HR           1  18016 18016.0  130.33 < 2.2e-16 ***
## Residuals 1130 156200   138.2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Perform an analysis of variance (ANOVA) for the different mean values observed in the number of regular season wins earned in a given year by a given team, given the interaction of 'H' and 'HR'.
model_interaction <- aov(W~H*HR,teams_forty)
anova(model_interaction)

## Analysis of Variance Table
## 
## Response: W
##             Df Sum Sq Mean Sq  F value    Pr(>F)    
## H            2  36045 18022.3 153.7169 < 2.2e-16 ***
## HR           1   5648  5648.2  48.1745  6.57e-12 ***
## H:HR         2    507   253.5   2.1625    0.1155    
## Residuals 1126 132016   117.2                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow=c(1,1))
#Create an interaction plot that plots the mean values of 'W' against the interaction of both 'H' and 'HR'.
interaction.plot(teams_forty$H,teams_forty$HR,teams_forty$W)

For the analysis of variance (ANOVA) in which ‘H’ is analyzed against the response variable ‘W’, a p-value < 2.2e-16 is returned. This means that, if the null hypothesis were true, the probability of obtaining an F-value at least as large as the one observed (147.26) through randomization alone would be less than 2.2e-16. Therefore, based on this result, we reject the null hypothesis, leading us to believe that the variation observed in the mean number of regular season wins can be explained by the variation in the number of hits earned by a given team in a given year and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)

For the analysis of variance (ANOVA) in which ‘HR’ is analyzed against the response variable ‘W’, a p-value < 2.2e-16 is returned. This means that, if the null hypothesis were true, the probability of obtaining an F-value at least as large as the one observed (130.33) through randomization alone would be less than 2.2e-16. Therefore, based on this result, we reject the null hypothesis, leading us to believe that the variation observed in the mean number of regular season wins can be explained by the variation in the number of homeruns earned by a given team in a given year and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)

For the analysis of variance (ANOVA) in which the interaction of ‘H’ and ‘HR’ is analyzed against the response variable ‘W’, a p-value of approximately 0.12 is returned. This means that, if there were no interaction, the probability of obtaining an F-value at least as large as the one observed (2.1625) through randomization alone would be 0.1155. Additionally, the interaction plot of the mean number of regular season wins (‘W’) against the combinations of ‘H’ and ‘HR’ suggests that the interaction of these two factors does not have a significant effect on the response variable, since the lines displayed on the plot are roughly parallel and do not cross each other. Therefore, based on this result, we fail to reject the null hypothesis for the interaction term: the variation observed in mean wins is not explained by the interaction of hits and homeruns, and is more likely explained by the two factors acting individually. (Note: The statistical significance of the effect that ‘H’ and ‘HR’ each individually have on the response variable ‘W’ did not change upon developing an interaction model [see Analysis of Variance Table for “model_interaction”].)
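The F-values and p-values discussed above can be reproduced by hand from the printed ANOVA tables. The following is a minimal sketch that uses only the sums of squares and degrees of freedom copied from those tables; the variable names are illustrative and are not part of the original analysis.

```r
# Values copied from the ANOVA table of model_hits (no dataset needed).
ss_h   <- 36045;  df_h   <- 2      # row 'H'
ss_res <- 138172; df_res <- 1129   # row 'Residuals'

# F is the ratio of the factor mean square to the residual mean square.
f_hits <- (ss_h / df_h) / (ss_res / df_res)

# The p-value is the upper-tail area of the F distribution beyond the observed F,
# i.e. the probability of an F at least this large if the null hypothesis is true.
p_hits <- pf(f_hits, df1 = df_h, df2 = df_res, lower.tail = FALSE)

# Same calculation for the H:HR row of model_interaction (F = 2.1625 on 2 and 1126 df).
p_int <- pf(2.1625, df1 = 2, df2 = 1126, lower.tail = FALSE)

print(c(F_hits = f_hits, p_hits = p_hits, p_interaction = p_int))
```

R prints "< 2.2e-16" in the table because the exact p-value for ‘H’ is smaller than the precision it chooses to display, not because the probability is exactly zero.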

Tukey Honest Significant Differences

In further carrying out this analysis, we can compute Tukey Honest Significant Differences (via “TukeyHSD()”) to determine which specific pairs of levels of each factor differ significantly from each other in their mean value of the response variable, ‘W’.

#Perform a TukeyHSD Test for "model_hits".
Tukey_hits = TukeyHSD(model_hits, ordered = FALSE, conf.level = 0.95)
par(mfrow=c(1,1))
plot(Tukey_hits)

#Perform a TukeyHSD Test for "model_homeruns".
Tukey_homeruns = TukeyHSD(model_homeruns, ordered = FALSE, conf.level = 0.95)
par(mfrow=c(1,1))
plot(Tukey_homeruns)

#Perform a TukeyHSD Test for "model_interaction".
Tukey_interaction = TukeyHSD(model_interaction, ordered = FALSE, conf.level = 0.95)
par(mfrow=c(1,1))
plot(Tukey_interaction)

After observing the results of these Tukey Honest Significant Differences for both “model_hits” and “model_homeruns”, it seems clear that each of the level comparisons within those two models, considered individually, shows a significant difference in ‘W’ that is not due solely to randomization. The adjusted p-value for each level comparison is reported as zero (to the displayed precision), leading us to reject the null hypothesis that the number of hits and homeruns (considered separately) earned by a given team in a given year do not have a significant effect on the number of regular season wins that the team earns in that year.

After observing the results of these Tukey Honest Significant Differences for “model_interaction”, we begin to see a picture that, at first glance, doesn’t seem to align with our interaction model’s ANOVA or interaction plot. Based on the generated p-values and the confidence level of 0.95 that was set in our TukeyHSD test, some of the comparisons between combinations of ‘H’ and ‘HR’ levels show a statistically significant difference in ‘W’ that is not due solely to randomization. These comparisons include “Low:Above Average-High:Above Average”, “Medium:Above Average-High:Above Average”, “Low:Below Average-High:Above Average”, “Medium:Below Average-High:Above Average”, “Medium:Above Average-Low:Above Average”, “High:Below Average-Low:Above Average”, “Low:Below Average-Low:Above Average”, “Low:Below Average-Medium:Above Average”, “Medium:Below Average-Medium:Above Average”, “Low:Below Average-High:Below Average”, “Medium:Below Average-High:Below Average”, and “Medium:Below Average-Low:Below Average”, since all of their respective p-values are < 0.05. Therefore, the results of this test suggest that most combinations of ‘H’ and ‘HR’ levels differ significantly in their mean value of the response variable, ‘W’. Note that these are comparisons between cell means, so they reflect the main effects of ‘H’ and ‘HR’ as well as any interaction; they do not by themselves contradict the non-significant H:HR term in the ANOVA table.
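Besides plotting, the numeric TukeyHSD table can be printed and filtered for comparisons with an adjusted p-value below 0.05. The sketch below does this on simulated data, since “teams_forty” is not reproduced here: the data frame `toy`, its effect sizes, and the noise level are all invented for illustration. The simulated effects are purely additive (no true interaction), which shows that many cell-to-cell comparisons can be significant even when the H:HR term in the ANOVA is driven only by noise.

```r
# Hypothetical balanced stand-in for teams_forty: additive effects, no true interaction.
set.seed(1)
toy <- expand.grid(
  rep = 1:50,
  H   = c("Low", "Medium", "High"),
  HR  = c("Below Average", "Above Average"),
  stringsAsFactors = TRUE
)
h_eff  <- c(Low = -10, Medium = 0, High = 10)               # invented effect sizes
hr_eff <- c("Below Average" = -5, "Above Average" = 5)
toy$W <- 80 +
  unname(h_eff[as.character(toy$H)]) +
  unname(hr_eff[as.character(toy$HR)]) +
  rnorm(nrow(toy), sd = 3)

fit <- aov(W ~ H * HR, data = toy)
tk  <- TukeyHSD(fit, conf.level = 0.95)

# The "H:HR" element compares all pairs of the 3 x 2 = 6 cell means (15 comparisons).
cells <- tk[["H:HR"]]
sig   <- cells[cells[, "p adj"] < 0.05, , drop = FALSE]

print(round(sig, 4))   # comparisons whose adjusted p-value is below 0.05
print(summary(fit))    # ANOVA table, including the H:HR interaction row
```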

Estimation (of Parameters)

In estimating the different parameters of the experiment, I computed summary statistics on the relevant variables in “teams_forty”. For the number of regular season wins earned by a given team in a given year (‘W’), these are the five-number summary, the mean, and the standard deviation across all team-seasons in the dataset. For the number of hits (‘H’), these are the counts of team-seasons classified as “High”, “Medium”, and “Low”, respectively. For the number of homeruns (‘HR’), these are the counts of team-seasons classified as either “Above Average” or “Below Average”, respectively. Because ‘H’ and ‘HR’ are categorical factors, the standard deviations reported for them are those of the underlying integer level codes.

Tables of Parameter Values

#Display summary statistics of teams_forty$W.
summary(teams_forty$W)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   37.00   71.00   80.00   79.47   89.00  116.00

#Display standard deviation of teams_forty$W.
sd(teams_forty$W, na.rm = FALSE)

## [1] 12.41118

#Display summary statistics of teams_forty$H.
summary(teams_forty$H)

##   High    Low Medium 
##    283    286    563

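#Display standard deviation of the integer level codes of teams_forty$H (calling sd() directly on a factor is an error in current versions of R).
sd(as.integer(teams_forty$H), na.rm = FALSE)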

## [1] 0.8287186

#Display summary statistics of teams_forty$HR.
summary(teams_forty$HR)

## Above Average Below Average 
##           559           573

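#Display standard deviation of the integer level codes of teams_forty$HR (calling sd() directly on a factor is an error in current versions of R).
sd(as.integer(teams_forty$HR), na.rm = FALSE)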

## [1] 0.5001827
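
The summaries above describe each variable on its own. The quantities that the ANOVA models actually estimate are the mean values of ‘W’ at each factor level, which can be tabulated with “tapply()”. The following is a minimal sketch on a hand-made data frame; `toy` and its twelve values are invented for illustration and are not taken from “teams_forty”.

```r
# Hypothetical miniature stand-in for teams_forty (values invented for illustration).
toy <- data.frame(
  W  = c(68, 72, 78, 82, 88, 92, 70, 74, 80, 84, 90, 94),
  H  = factor(rep(c("Low", "Medium", "High"), each = 2, times = 2)),
  HR = factor(rep(c("Below Average", "Above Average"), each = 6))
)

# Mean and standard deviation of wins within each level of H.
mean_by_h <- tapply(toy$W, toy$H, mean)
sd_by_h   <- tapply(toy$W, toy$H, sd)

# Mean wins within each level of HR, and within each H x HR cell.
mean_by_hr <- tapply(toy$W, toy$HR, mean)
cell_means <- tapply(toy$W, list(toy$H, toy$HR), mean)

print(mean_by_h)
print(mean_by_hr)
print(cell_means)
```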

Diagnostics/Model Adequacy Checking

In verifying the results of this experiment, it’s important to ensure that the data meet the assumptions that underlie the ANOVA models that were fitted. In particular, we want to check that the residuals of each model (and, as a rough preliminary check, the response variable itself) are approximately normally distributed. Until we have checked this, we cannot say with confidence that the p-values reported above are reliable. To check for normality, we can both create Normal Quantile-Quantile (Q-Q) Plots and perform a Shapiro-Wilk Test of Normality on our data.

#Create a Normal Q-Q Plot for the numbers of regular season wins earned by a given team in a given year.
qqnorm(teams_forty[,"W"], main = "Normal Q-Q Plot of Regular Season Wins")
qqline(teams_forty[,"W"])

#Create a Normal Q-Q Plot of the residuals for "model_hits".
qqnorm(residuals(model_hits), main = "Normal Q-Q Plot of Residuals of 'model_hits'")
qqline(residuals(model_hits))

#Create a Normal Q-Q Plot of the residuals for "model_homeruns".
qqnorm(residuals(model_homeruns), main = "Normal Q-Q Plot of Residuals of 'model_homeruns'")
qqline(residuals(model_homeruns))

#Create a Normal Q-Q Plot of the residuals for "model_interaction".
qqnorm(residuals(model_interaction), main = "Normal Q-Q Plot of Residuals of 'model_interaction'")
qqline(residuals(model_interaction))

Shapiro-Wilk Test of Normality Results

#Perform Shapiro-Wilk Test of Normality on the numbers of regular season wins earned by a given team in a given year (normality is assumed if p > 0.1).
shapiro.test(teams_forty[,"W"])

## 
##  Shapiro-Wilk normality test
## 
## data:  teams_forty[, "W"]
## W = 0.9906, p-value = 1.198e-06

Upon both constructing Normal Q-Q Plots and performing a Shapiro-Wilk Test of Normality on the data in this analysis, the evidence for normality is mixed. The Shapiro-Wilk test for “teams_forty[,“W”]” returned a p-value of 1.198e-06, which is well below 0.1 and formally rejects the hypothesis of normality. However, with 1132 observations the test is sensitive to even small departures from normality, and all of the constructed Normal Q-Q Plots displayed data that aligned closely with the Normal Q-Q Line. Note that the completeness and accuracy of Lahman’s Baseball Database (which, in theory, contains every possible observation that could be collected from 1871-2013) does not by itself imply that the data are normally distributed. We therefore proceed on the assumption of approximate normality based on the Q-Q plots, and back up the results with a nonparametric test in Section 4.
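The Shapiro-Wilk test above was applied to the raw response ‘W’, whereas the ANOVA normality assumption concerns the model residuals. The sketch below shows how the same test could be applied to residuals. It runs on simulated data, because “teams_forty” is not reproduced here; the data frame `toy`, its group means, and its noise level are invented for illustration.

```r
# Hypothetical one-factor stand-in for teams_forty (simulated, for illustration only).
set.seed(42)
toy <- data.frame(H = factor(rep(c("Low", "Medium", "High"), each = 40)))
grp_mean <- c(Low = 72, Medium = 80, High = 88)   # invented group means
toy$W <- unname(grp_mean[as.character(toy$H)]) + rnorm(nrow(toy), sd = 10)

fit <- aov(W ~ H, data = toy)

# Test the raw response and the model residuals separately.
sw_raw <- shapiro.test(toy$W)
sw_res <- shapiro.test(residuals(fit))

print(sw_raw)
print(sw_res)
```

On the real data, the analogous call would be “shapiro.test(residuals(model_interaction))”. The raw response pools groups with different means, so it can look non-normal even when the residuals within each group are close to normal.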

In further backing up the confidence that we have with our results, we can generate a “quality of fit” model that plots residual error against each of the fitted models that were developed in our original analysis of variance (ANOVA).

#Create a "Quality of Fit Model" that plots the residuals of "model_hits" against its fitted model.
plot(fitted(model_hits),residuals(model_hits))

#Create a "Quality of Fit Model" that plots the residuals of "model_homeruns" against its fitted model.
plot(fitted(model_homeruns),residuals(model_homeruns))

#Create a "Quality of Fit Model" that plots the residuals of "model_interaction" against its fitted model.
plot(fitted(model_interaction),residuals(model_interaction))

Because the residuals in each of the resulting plots appear to be scattered evenly around zero, with no obvious pattern or change in spread across the fitted values, each of the three ANOVA models suggests a good fit. Thus, we can confidently rely on both the modeling approach that we carried out and the dataset that we analyzed in justifying the significance of our results.
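The visual check of residual spread can be complemented with a numeric one. The sketch below compares the standard deviation of the residuals within each factor level and runs Bartlett's test of homogeneity of variances (“bartlett.test()” from base R). Bartlett's test is not part of the original analysis and is swapped in here only as one possible check; the data frame `toy` is simulated and invented for illustration.

```r
# Hypothetical one-factor stand-in for teams_forty (simulated, equal variances by design).
set.seed(7)
toy <- data.frame(H = factor(rep(c("Low", "Medium", "High"), each = 40)))
toy$W <- 80 + rnorm(nrow(toy), sd = 10)

fit <- aov(W ~ H, data = toy)

# Residual mean and spread within each level; in a one-way model the means are zero.
res_mean <- tapply(residuals(fit), toy$H, mean)
res_sd   <- tapply(residuals(fit), toy$H, sd)

# Bartlett's test: the null hypothesis is that all groups share the same variance.
bt <- bartlett.test(W ~ H, data = toy)

print(round(res_sd, 2))
print(bt)
```

Bartlett's test is itself sensitive to non-normality, so a small p-value should be read together with the residual plots rather than on its own.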

4. Contingencies to the Experimental Design/Analysis if Modeling Assumptions Failed

If the modeling assumptions of our analysis fail, we can still err on the side of caution by performing the nonparametric Kruskal-Wallis rank sum test to back up our original results. This test helps us decide whether the population distributions are identical without requiring that they follow a normal distribution.

Kruskal-Wallis Rank Sum Test Results

#Perform Kruskal-Wallis Rank Sum Test on 'W' within the "teams_forty" dataset for both 'H' and 'HR' (identical populations are assumed if p > 0.05).
kruskal.test(teams_forty[,"W"],teams_forty$H)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  teams_forty[, "W"] and teams_forty$H
## Kruskal-Wallis chi-squared = 209.2336, df = 2, p-value < 2.2e-16

kruskal.test(teams_forty[,"W"],teams_forty$HR)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  teams_forty[, "W"] and teams_forty$HR
## Kruskal-Wallis chi-squared = 111.1244, df = 1, p-value < 2.2e-16

Since the p-values for both of the Kruskal-Wallis rank sum tests that consider the factors ‘H’ and ‘HR’ against the response variable ‘W’ are less than 0.05, we reject the hypothesis that the distribution of regular season wins is identical across the levels of ‘H’, and likewise across the levels of ‘HR’ (considered separately). This agrees with the rejection of the null hypothesis in our main experiment, leading us to believe that the number of hits and homeruns (considered separately) earned by a given team in a given year likely do have a significant effect on the number of regular season wins that the team earns in that year. Furthermore, as an alternative to a nonparametric analysis when normality cannot be assumed, a transformation such as the Box-Cox power transformation could have been applied to the response to make it approximately normal. However, such a transformation was not necessary for this analysis, since the Kruskal-Wallis rank sum tests agree with the ANOVA results and give us confidence in them.
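For reference, the Box-Cox power transformation mentioned above could be explored with “boxcox()” from the MASS package, which ships with standard R distributions. The following is a minimal sketch on simulated data; the data frame `toy` is invented for illustration and the result says nothing about which transformation, if any, would suit “teams_forty”.

```r
library(MASS)  # provides boxcox()

# Hypothetical one-factor stand-in for teams_forty (simulated; W must be positive).
set.seed(3)
toy <- data.frame(H = factor(rep(c("Low", "Medium", "High"), each = 40)))
toy$W <- 80 + rnorm(nrow(toy), sd = 10)

# Profile log-likelihood of the Box-Cox parameter lambda over a grid of values.
bc <- boxcox(W ~ H, data = toy, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)

# The lambda with the highest log-likelihood is the suggested power transformation.
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```

A value of lambda near 1 indicates that no transformation is needed, while a value near 0 points to a log transformation.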

5. References to the literature

[1] Lahman, S. (1996-2014). Lahman’s Baseball Database.

6. Appendices

A summary of, or pointer to, the raw data

complete and documented R code

The updated version of the database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more (http://www.seanlahman.com/baseball-archive/statistics/). For more details on the latest release, please read the following documentation (http://seanlahman.com/files/database/readme2012.txt). The database can be used on any platform, but please be aware that this is not a standalone application. It is a database that requires Microsoft Access or some other relational database software to be useful.

Recipe 3: Completely Randomized Designs