Recipe 9: Response Surface Methods


Recipes for the Design of Experiments

as of August 28, 2014, superseding the version of August 24. Always use the most recent version.

Using Response Surface Methods for Regular Season Major League Baseball Losses from 1983-2013. (Four-Factor, Multi-Level Experiment)

Brendan Howell

Rensselaer Polytechnic Institute

12/01/14 - Version 1.0

1. Setting

Dataset of Regular Season Losses Earned by Baseball Teams that played from 1983-2013 (‘teams_thirty’).

Description: The updated version of Lahman’s Baseball Database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. In this analysis, only the years ranging from 1983-2013 will be considered.

In this study, a four-factor, multi-level experiment is performed to see if the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), or the number of walks (‘BB’) earned by a given team in a given season has a statistically significant effect on the number of losses (‘L’) that a given team earns in a given season (which is this analysis’ response variable).

#Load in the Teams Dataset
#Get dataset from Project Documents File
raw_teams <- read.csv("~/Academics (RPI)/09. Fall 2014/Design of Experiments/02. Wikibook Recipes/Recipe #09/Teams.csv", header=TRUE)
head(raw_teams)
##   yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1   1871 <NA>    PH1      PNA          1 28    NA 21  7                  Y
## 2   1871 <NA>    CH1      CNA          2 28    NA 19  9                  N
## 3   1871 <NA>    BS1      BNA          3 31    NA 20 10                  N
## 4   1871 <NA>    WS3      OLY          4 32    NA 15 15                  N
## 5   1871 <NA>    NY2      NNA          5 33    NA 16 17                  N
## 6   1871 <NA>    TRO      TRO          6 29    NA 13 15                  N
##   WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1       376 1281 410  66  27  9 46 23 56 NA  NA NA 266 137 4.95 27   0  0
## 2       302 1196 323  52  21 10 60 22 69 NA  NA NA 241  77 2.76 25   0  1
## 3       401 1372 426  70  37  3 60 19 73 NA  NA NA 303 109 3.55 22   1  3
## 4       310 1353 375  54  26  6 48 13 48 NA  NA NA 303 137 4.37 32   0  0
## 5       302 1404 403  43  21  1 33 15 46 NA  NA NA 313 121 3.72 32   1  0
## 6       351 1248 384  51  34  6 49 19 62 NA  NA NA 362 153 5.51 28   0  0
##   IPouts  HA HRA BBA SOA   E DP   FP                    name
## 1    747 329   3  53  16 194 NA 0.84  Philadelphia Athletics
## 2    753 308   6  28  22 218 NA 0.82 Chicago White Stockings
## 3    828 367   2  42  23 225 NA 0.83    Boston Red Stockings
## 4    846 371   4  45  13 217 NA 0.85     Washington Olympics
## 5    879 373   7  42  22 227 NA 0.83        New York Mutuals
## 6    750 431   4  75  12 198 NA 0.84          Troy Haymakers
##                       park attendance BPF PPF teamIDBR teamIDlahman45
## 1 Jefferson Street Grounds         NA 102  98      ATH            PH1
## 2  Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3      South End Grounds I         NA 103  98      BOS            BS1
## 4         Olympics Grounds         NA  94  98      OLY            WS3
## 5 Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6       Haymakers' Grounds         NA 101 100      TRO            TRO
##   teamIDretro
## 1         PH1
## 2         CH1
## 3         BS1
## 4         WS3
## 5         NY2
## 6         TRO
tail(raw_teams)
##      yearID lgID teamID franchID divID Rank   G Ghome  W   L DivWin WCWin
## 2740   2013   NL    MIA      FLA     E    5 162    81 62 100      N     N
## 2741   2013   NL    LAN      LAD     W    1 162    81 92  70      Y     N
## 2742   2013   NL    ARI      ARI     W    2 162    81 81  81      N     N
## 2743   2013   NL    SDN      SDP     W    3 162    81 76  86      N     N
## 2744   2013   NL    SFN      SFG     W    4 162    82 76  86      N     N
## 2745   2013   NL    COL      COL     W    5 162    81 74  88      N     N
##      LgWin WSWin   R   AB    H X2B X3B  HR  BB   SO  SB CS HBP SF  RA  ER
## 2740     N     N 513 5449 1257 219  31  95 432 1232  78 29  56 26 646 602
## 2741     N     N 649 5491 1447 281  17 138 476 1146  78 28  57 48 582 524
## 2742     N     N 685 5676 1468 302  31 130 519 1142  62 41  43 43 695 651
## 2743     N     N 618 5517 1349 246  26 146 467 1309 118 34  52 34 700 643
## 2744     N     N 629 5552 1446 280  35 107 469 1078  67 26  39 42 691 643
## 2745     N     N 706 5599 1511 283  36 159 427 1204 112 32  26 35 760 708
##       ERA CG SHO SV IPouts   HA HRA BBA  SOA   E  DP    FP
## 2740 3.71  2   1 36   4380 1376 121 526 1177  88 144 0.986
## 2741 3.25  7   4 46   4351 1321 127 460 1292 109 160 0.982
## 2742 3.92  6   2 38   4485 1460 176 485 1218  75 134 0.988
## 2743 3.98  3   1 40   4365 1407 156 525 1171  83 140 0.986
## 2744 4.00  2   2 41   4342 1380 145 521 1256 107 126 0.982
## 2745 4.44  1   0 35   4308 1545 136 517 1064  90 162 0.986
##                      name           park attendance BPF PPF teamIDBR
## 2740        Miami Marlins   Marlins Park    1586322 102 103      MIA
## 2741  Los Angeles Dodgers Dodger Stadium    3743527  95  95      LAD
## 2742 Arizona Diamondbacks    Chase Field    2134795 102 102      ARI
## 2743     San Diego Padres     Petco Park    2166691  91  91      SDP
## 2744 San Francisco Giants      AT&T Park    3326796  90  89      SFG
## 2745     Colorado Rockies    Coors Field    2793828 117 118      COL
##      teamIDlahman45 teamIDretro
## 2740            FLO         MIA
## 2741            LAN         LAN
## 2742            ARI         ARI
## 2743            SDN         SDN
## 2744            SFN         SFN
## 2745            COL         COL

Factors and Levels

This analysis considers four factors, each with multiple levels: the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), and the number of walks (‘BB’) earned by a given team in a given season. In the original dataset “raw_teams”, the factors ‘H’, ‘HR’, ‘SO’, and ‘BB’ are stored as integer variables with no specific categorical levels. However, in carrying out this analysis, these factors will be transformed into categorical variables with manually defined levels. These factors were selected intuitively, since this analysis aims to determine whether or not the number of hits (‘H’), homeruns (‘HR’), strikeouts (‘SO’), or walks (‘BB’) earned by a given team in a given season has a significant effect on the number of regular season losses that the team earns in that season.

#Display the summary statistics of "raw_teams".
summary(raw_teams)
##      yearID       lgID          teamID        franchID    divID   
##  Min.   :1871   AA  :  85   CHN    : 138   ATL    : 138    :1517  
##  1st Qu.:1918   AL  :1175   PHI    : 131   CHC    : 138   C: 215  
##  Median :1961   FL  :  16   PIT    : 127   CIN    : 132   E: 518  
##  Mean   :1954   NL  :1399   CIN    : 124   PIT    : 132   W: 495  
##  3rd Qu.:1990   PL  :   8   SLN    : 122   STL    : 132           
##  Max.   :2013   UA  :  12   BOS    : 113   PHI    : 131           
##                 NA's:  50   (Other):1990   (Other):1942           
##       Rank              G             Ghome            W         
##  Min.   : 1.000   Min.   :  6.0   Min.   :44.0   Min.   :  0.00  
##  1st Qu.: 2.000   1st Qu.:153.0   1st Qu.:77.0   1st Qu.: 66.00  
##  Median : 4.000   Median :157.0   Median :80.0   Median : 77.00  
##  Mean   : 4.132   Mean   :150.1   Mean   :78.4   Mean   : 74.61  
##  3rd Qu.: 6.000   3rd Qu.:162.0   3rd Qu.:81.0   3rd Qu.: 87.00  
##  Max.   :13.000   Max.   :165.0   Max.   :84.0   Max.   :116.00  
##                                   NA's   :399                    
##        L          DivWin   WCWin    LgWin    WSWin          R         
##  Min.   :  4.00    :1545    :2181    :  28    : 357   Min.   :  24.0  
##  1st Qu.: 65.00   N: 982   N: 522   N:2449   N:2274   1st Qu.: 612.0  
##  Median : 76.00   Y: 218   Y:  42   Y: 268   Y: 114   Median : 690.0  
##  Mean   : 74.61                                       Mean   : 682.1  
##  3rd Qu.: 86.00                                       3rd Qu.: 765.0  
##  Max.   :134.00                                       Max.   :1220.0  
##                                                                       
##        AB             H             X2B             X3B        
##  Min.   : 211   Min.   :  33   Min.   :  3.0   Min.   :  0.00  
##  1st Qu.:5117   1st Qu.:1297   1st Qu.:192.0   1st Qu.: 31.00  
##  Median :5379   Median :1393   Median :229.0   Median : 42.00  
##  Mean   :5134   Mean   :1345   Mean   :226.6   Mean   : 47.48  
##  3rd Qu.:5514   3rd Qu.:1468   3rd Qu.:269.0   3rd Qu.: 61.00  
##  Max.   :5781   Max.   :1783   Max.   :376.0   Max.   :150.00  
##                                                                
##        HR            BB              SO               SB       
##  Min.   :  0   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 40   1st Qu.:425.0   1st Qu.: 501.0   1st Qu.: 64.0  
##  Median :105   Median :494.0   Median : 746.0   Median : 96.0  
##  Mean   :100   Mean   :473.8   Mean   : 726.3   Mean   :112.8  
##  3rd Qu.:148   3rd Qu.:555.0   3rd Qu.: 955.0   3rd Qu.:143.0  
##  Max.   :264   Max.   :835.0   Max.   :1535.0   Max.   :581.0  
##                                NA's   :120      NA's   :144    
##        CS              HBP               SF              RA        
##  Min.   :  0.00   Min.   : 26.00   Min.   :24.00   Min.   :  34.0  
##  1st Qu.: 35.00   1st Qu.: 47.00   1st Qu.:39.00   1st Qu.: 608.0  
##  Median : 46.00   Median : 54.50   Median :44.00   Median : 688.0  
##  Mean   : 49.21   Mean   : 56.35   Mean   :45.09   Mean   : 682.1  
##  3rd Qu.: 59.00   3rd Qu.: 64.00   3rd Qu.:50.25   3rd Qu.: 765.0  
##  Max.   :191.00   Max.   :103.00   Max.   :75.00   Max.   :1252.0  
##  NA's   :859      NA's   :2325     NA's   :2325                    
##        ER              ERA              CG             SHO        
##  Min.   :  25.0   Min.   :1.220   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.: 498.0   1st Qu.:3.330   1st Qu.: 16.0   1st Qu.: 6.000  
##  Median : 590.0   Median :3.820   Median : 46.0   Median : 9.000  
##  Mean   : 569.8   Mean   :3.815   Mean   : 51.5   Mean   : 9.435  
##  3rd Qu.: 667.0   3rd Qu.:4.310   3rd Qu.: 78.0   3rd Qu.:12.000  
##  Max.   :1023.0   Max.   :8.000   Max.   :148.0   Max.   :32.000  
##                                                                   
##        SV            IPouts           HA            HRA     
##  Min.   : 0.00   Min.   : 162   Min.   :  49   Min.   :  0  
##  1st Qu.: 9.00   1st Qu.:4071   1st Qu.:1287   1st Qu.: 43  
##  Median :23.00   Median :4224   Median :1392   Median :107  
##  Mean   :23.25   Mean   :4015   Mean   :1345   Mean   :100  
##  3rd Qu.:37.00   3rd Qu.:4339   3rd Qu.:1471   3rd Qu.:147  
##  Max.   :68.00   Max.   :4518   Max.   :1993   Max.   :241  
##                                                             
##       BBA             SOA               E               DP       
##  Min.   :  0.0   Min.   :   0.0   Min.   : 47.0   Min.   : 18.0  
##  1st Qu.:426.0   1st Qu.: 499.0   1st Qu.:119.0   1st Qu.:126.0  
##  Median :496.0   Median : 721.0   Median :147.0   Median :144.0  
##  Mean   :474.1   Mean   : 719.9   Mean   :188.6   Mean   :140.1  
##  3rd Qu.:556.0   3rd Qu.: 951.0   3rd Qu.:219.0   3rd Qu.:160.0  
##  Max.   :827.0   Max.   :1428.0   Max.   :639.0   Max.   :217.0  
##                                                   NA's   :317    
##        FP                            name                       park     
##  Min.   :0.7600   Cincinnati Reds      : 123   Wrigley Field      : 100  
##  1st Qu.:0.9600   Pittsburgh Pirates   : 123   Sportsman's Park IV:  90  
##  Median :0.9700   Philadelphia Phillies: 122   Comiskey Park      :  80  
##  Mean   :0.9605   St. Louis Cardinals  : 114   Fenway Park II     :  80  
##  3rd Qu.:0.9800   Chicago White Sox    : 113   Forbes Field       :  60  
##  Max.   :0.9910   Detroit Tigers       : 113   Crosley Field      :  58  
##                   (Other)              :2037   (Other)            :2277  
##    attendance           BPF             PPF           teamIDBR   
##  Min.   :   6088   Min.   : 60.0   Min.   : 60.0   CHC    : 138  
##  1st Qu.: 518051   1st Qu.: 97.0   1st Qu.: 97.0   CIN    : 137  
##  Median :1107122   Median :100.0   Median :100.0   STL    : 135  
##  Mean   :1317241   Mean   :100.2   Mean   :100.2   PHI    : 134  
##  3rd Qu.:1950099   3rd Qu.:103.0   3rd Qu.:103.0   PIT    : 132  
##  Max.   :4483350   Max.   :129.0   Max.   :141.0   BOS    : 121  
##  NA's   :279                                       (Other):1948  
##  teamIDlahman45  teamIDretro  
##  CHN    : 138   CHN    : 138  
##  PHI    : 131   PHI    : 131  
##  PIT    : 127   PIT    : 127  
##  CIN    : 124   CIN    : 124  
##  SLN    : 122   SLN    : 122  
##  BOS    : 113   BOS    : 113  
##  (Other):1990   (Other):1990
#Display the names found in "raw_teams".
names(raw_teams)
##  [1] "yearID"         "lgID"           "teamID"         "franchID"      
##  [5] "divID"          "Rank"           "G"              "Ghome"         
##  [9] "W"              "L"              "DivWin"         "WCWin"         
## [13] "LgWin"          "WSWin"          "R"              "AB"            
## [17] "H"              "X2B"            "X3B"            "HR"            
## [21] "BB"             "SO"             "SB"             "CS"            
## [25] "HBP"            "SF"             "RA"             "ER"            
## [29] "ERA"            "CG"             "SHO"            "SV"            
## [33] "IPouts"         "HA"             "HRA"            "BBA"           
## [37] "SOA"            "E"              "DP"             "FP"            
## [41] "name"           "park"           "attendance"     "BPF"           
## [45] "PPF"            "teamIDBR"       "teamIDlahman45" "teamIDretro"
#Display the structure of "raw_teams".
str(raw_teams)
## 'data.frame':    2745 obs. of  48 variables:
##  $ yearID        : int  1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
##  $ lgID          : Factor w/ 6 levels "AA","AL","FL",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ teamID        : Factor w/ 149 levels "ALT","ANA","ARI",..: 97 31 24 142 90 136 56 39 111 24 ...
##  $ franchID      : Factor w/ 120 levels "ALT","ANA","ARI",..: 85 36 13 77 70 109 56 25 91 13 ...
##  $ divID         : Factor w/ 4 levels "","C","E","W": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Rank          : int  1 2 3 4 5 6 7 8 9 1 ...
##  $ G             : int  28 28 31 32 33 29 19 29 25 48 ...
##  $ Ghome         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ W             : int  21 19 20 15 16 13 7 10 4 39 ...
##  $ L             : int  7 9 10 15 17 15 12 19 21 8 ...
##  $ DivWin        : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ WCWin         : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LgWin         : Factor w/ 3 levels "","N","Y": 3 2 2 2 2 2 2 2 2 3 ...
##  $ WSWin         : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ R             : int  376 302 401 310 302 351 137 249 231 521 ...
##  $ AB            : int  1281 1196 1372 1353 1404 1248 746 1186 1036 2137 ...
##  $ H             : int  410 323 426 375 403 384 178 328 274 677 ...
##  $ X2B           : int  66 52 70 54 43 51 19 35 44 114 ...
##  $ X3B           : int  27 21 37 26 21 34 8 40 25 31 ...
##  $ HR            : int  9 10 3 6 1 6 2 7 3 7 ...
##  $ BB            : int  46 60 60 48 33 49 33 26 38 28 ...
##  $ SO            : int  23 22 19 13 15 19 9 25 30 26 ...
##  $ SB            : int  56 69 73 48 46 62 16 18 53 47 ...
##  $ CS            : int  NA NA NA NA NA NA NA NA NA 14 ...
##  $ HBP           : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ SF            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ RA            : int  266 241 303 303 313 362 243 341 287 236 ...
##  $ ER            : int  137 77 109 137 121 153 97 116 108 95 ...
##  $ ERA           : num  4.95 2.76 3.55 4.37 3.72 5.51 5.17 4.11 4.3 1.99 ...
##  $ CG            : int  27 25 22 32 32 28 19 23 23 41 ...
##  $ SHO           : int  0 0 1 0 1 0 1 0 1 3 ...
##  $ SV            : int  0 1 3 0 0 0 0 0 0 1 ...
##  $ IPouts        : int  747 753 828 846 879 750 507 762 678 1290 ...
##  $ HA            : int  329 308 367 371 373 431 261 346 315 438 ...
##  $ HRA           : int  3 6 2 4 7 4 5 13 3 0 ...
##  $ BBA           : int  53 28 42 45 42 75 21 53 34 27 ...
##  $ SOA           : int  16 22 23 13 22 12 17 34 16 0 ...
##  $ E             : int  194 218 225 217 227 198 163 223 220 263 ...
##  $ DP            : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ FP            : num  0.84 0.82 0.83 0.85 0.83 0.84 0.8 0.81 0.82 0.87 ...
##  $ name          : Factor w/ 139 levels "Altoona Mountain City",..: 97 42 17 135 93 131 63 51 111 17 ...
##  $ park          : Factor w/ 213 levels "","23rd Street Grounds",..: 87 197 170 130 199 80 77 116 4 170 ...
##  $ attendance    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ BPF           : int  102 104 103 94 90 101 101 96 97 105 ...
##  $ PPF           : int  98 102 98 98 88 100 107 100 99 100 ...
##  $ teamIDBR      : Factor w/ 101 levels "ALT","ANA","ARI",..: 4 21 10 65 62 93 42 25 77 10 ...
##  $ teamIDlahman45: Factor w/ 148 levels "ALT","ANA","ARI",..: 96 31 24 140 89 135 56 39 110 24 ...
##  $ teamIDretro   : Factor w/ 149 levels "ALT","ANA","ARI",..: 96 31 24 141 89 135 56 39 110 24 ...

Continuous variables (if any)

In this dataset, a few variables can be considered continuous; these are the variables stored as numeric. By this standard, the continuous variables in this dataset include ‘ERA’ (a team’s overall earned run average) and ‘FP’ (a team’s overall fielding percentage). The main factors that this analysis considers, ‘H’, ‘HR’, ‘SO’, and ‘BB’, currently exist as integer variables. However, once their integer values are manually grouped into specific ranges, these factors will be treated as categorical variables.
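As a supplementary check of these variable types (a small sketch using base R, not part of the original recipe), the storage class of each variable of interest can be inspected directly:

#Check how the variables of interest are stored (integer vs. numeric).
sapply(raw_teams[, c("H", "HR", "SO", "BB", "ERA", "FP", "L")], class)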

Response variables

This analysis will consider one response variable, ‘L’, which denotes the number of regular season losses that a given team earned in a given year.

The Data: How is it organized and what does it look like?

As a whole, Lahman’s Baseball Database contains information pertaining to pitching, hitting, and fielding statistics for Major League Baseball from the years 1871 through 2013. It includes data from the two current leagues (American and National), the four other “major” leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875. It contains 2,745 observations of 48 variables, which are defined below (Lahman, 1996-2014) [1]:

Variable ID [Description]

yearID [Year]

lgID [League]

teamID [Team]

franchID [Franchise (links to TeamsFranchise table)]

divID [Team’s division]

Rank [Position in final standings]

G [Games played]

GHome [Games played at home]

W [Wins]

L [Losses]

DivWin [Division Winner (Y or N)]

WCWin [Wild Card Winner (Y or N)]

LgWin [League Champion (Y or N)]

WSWin [World Series Winner (Y or N)]

R [Runs scored]

AB [At bats]

H [Hits by batters]

2B [Doubles]

3B [Triples]

HR [Homeruns by batters]

BB [Walks by batters]

SO [Strikeouts by batters]

SB [Stolen bases]

CS [Caught stealing]

HBP [Batters hit by pitch]

SF [Sacrifice flies]

RA [Opponents runs scored]

ER [Earned runs allowed]

ERA [Earned run average]

CG [Complete games]

SHO [Shutouts]

SV [Saves]

IPOuts [Outs Pitched (innings pitched x 3)]

HA [Hits allowed]

HRA [Homeruns allowed]

BBA [Walks allowed]

SOA [Strikeouts by pitchers]

E [Errors]

DP [Double Plays]

FP [Fielding percentage]

name [Team’s full name]

park [Name of team’s home ballpark]

attendance [Home attendance total]

BPF [Three-year park factor for batters]

PPF [Three-year park factor for pitchers]

teamIDBR [Team ID used by Baseball Reference website]

teamIDlahman45 [Team ID used in Lahman database version 4.5]

teamIDretro [Team ID used by Retrosheet]

Randomization

Because this dataset is a comprehensive statistical resource for baseball statistics gathered from 1871-2013 (in theory, it contains every observation that could be collected during those years), we assume that it can be treated as exhibiting randomization without bias.

2. (Experimental) Design

How will the experiment be organized and conducted to test the hypothesis?

In this experiment, we are trying to determine whether the variation observed in the response variable (‘L’ in this analysis) can be explained by the variation in the four treatments being considered (‘H’, ‘HR’, ‘SO’, and ‘BB’). The null hypothesis being tested states that the number of hits (‘H’), homeruns (‘HR’), strikeouts (‘SO’), and walks (‘BB’) earned by a given team in a given season do not have a significant effect on the number of regular season losses that the team earns in that year. Conversely, the alternative hypothesis states that these four factors do have a significant effect on regular season losses. In carrying out this analysis, we perform an analysis of variance (ANOVA) for the number of regular season losses (‘L’) and then use Response Surface Methods to determine whether there is a significant difference in the mean response across the levels of ‘H’, ‘HR’, ‘SO’, and ‘BB’.

What is the rationale for this design?

The rationale for this design lies primarily in the fact that we are trying to determine whether the number of hits (‘H’), homeruns (‘HR’), strikeouts (‘SO’), or walks (‘BB’) earned by a given team in a given season has any effect on the number of regular season losses (‘L’) that the team earns in that year. Since regular season losses are a useful and relevant metric for identifying the performance-related factors that most strongly drive losing, this four-factor, multi-level design (carried out through an analysis of variance and response surface methods) was crafted to test whether these four factors have a significant effect on losses. Response Surface Methods are used because the response variable of interest (‘L’) is likely influenced by the four factors considered here (‘H’, ‘HR’, ‘SO’, and ‘BB’). Given this influence, the response ‘L’ can plausibly be optimized by some setting of these input variables over a specified region of interest (especially because ‘H’, ‘HR’, ‘SO’, and ‘BB’ are known and controllable), allowing us to estimate the factors’ main effects and interaction effects. Performing the analysis in this manner should provide insight into the factors that typically result in regular season losses for baseball teams.

Randomize: What is the Randomization Scheme?

Since the original assumption claimed that the entirety of Lahman’s Baseball Database exhibits randomization, we did not need to worry about randomizing our data any further to ensure that a completely randomized design is created. However, in carrying out this analysis in a reasonable and logical way, a chronologically-subsetted dataset was extracted from the entirety of Lahman’s Baseball Database (all data ranging from 1983-2013, for a grand total of 30 years of data, was used in this analysis). In determining which variables to include in this experiment’s dataframe (which have been noted previously), a pairs plot was generated to give us some insight into the different effects that various variables in the dataset have on the response variable, ‘L’. To properly generate this pairs plot, the most relevant/interesting variables were included in a subset of data extracted from the larger “raw_teams” dataset, including ‘yearID’, ‘lgID’, ‘teamID’, ‘W’, ‘L’, ‘R’, ‘AB’, ‘H’, ‘HR’, ‘BB’, ‘SO’, ‘ERA’, and ‘E’. Since the other 35 variables are not being considered further in this analysis, it did not make sense to include them anymore. In selecting four specific factors to include in the ANOVA/RSM-specific analysis, the factors ‘H’, ‘HR’, ‘SO’, and ‘BB’ were chosen intuitively.

#Create a pairs plot for 'L' against 'H'.
par(mfrow=c(1,1))
pairs(L~H, raw_teams)

#Create a pairs plot for 'L' against 'HR'.
par(mfrow=c(1,1))
pairs(L~HR, raw_teams)

#Create a pairs plot for 'L' against 'SO'.
par(mfrow=c(1,1))
pairs(L~SO, raw_teams)

#Create a pairs plot for 'L' against 'BB'.
par(mfrow=c(1,1))
pairs(L~BB, raw_teams)

#Create subset of the dataset that only includes team data from the years 1983-2013.
teams_thirty <- subset(raw_teams, raw_teams$yearID >= 1983, select=c(yearID:teamID,W,L,R:H,HR:SO,ERA,E))
#Display the summary statistics of "teams_thirty".
summary(teams_thirty)
##      yearID     lgID         teamID          W               L        
##  Min.   :1983   AA:  0   ATL    : 31   Min.   : 43.0   Min.   : 40.0  
##  1st Qu.:1991   AL:435   BAL    : 31   1st Qu.: 72.0   1st Qu.: 72.0  
##  Median :1999   FL:  0   BOS    : 31   Median : 80.0   Median : 79.0  
##  Mean   :1999   NL:445   CHA    : 31   Mean   : 79.9   Mean   : 79.9  
##  3rd Qu.:2006   PL:  0   CHN    : 31   3rd Qu.: 89.0   3rd Qu.: 88.0  
##  Max.   :2013   UA:  0   CIN    : 31   Max.   :116.0   Max.   :119.0  
##                          (Other):694                                  
##        R                AB             H              HR       
##  Min.   : 466.0   Min.   :3856   Min.   : 963   Min.   : 58.0  
##  1st Qu.: 670.8   1st Qu.:5469   1st Qu.:1385   1st Qu.:127.0  
##  Median : 731.0   Median :5524   Median :1441   Median :154.5  
##  Mean   : 731.9   Mean   :5468   Mean   :1434   Mean   :155.6  
##  3rd Qu.: 791.0   3rd Qu.:5588   3rd Qu.:1499   3rd Qu.:180.0  
##  Max.   :1009.0   Max.   :5781   Max.   :1684   Max.   :264.0  
##                                                                
##        BB              SO            ERA              E        
##  Min.   :319.0   Min.   : 568   Min.   :2.910   Min.   : 54.0  
##  1st Qu.:479.8   1st Qu.: 910   1st Qu.:3.800   1st Qu.: 99.0  
##  Median :527.0   Median :1014   Median :4.180   Median :112.0  
##  Mean   :530.2   Mean   :1011   Mean   :4.211   Mean   :112.8  
##  3rd Qu.:578.5   3rd Qu.:1107   3rd Qu.:4.580   3rd Qu.:126.0  
##  Max.   :775.0   Max.   :1535   Max.   :6.380   Max.   :179.0  
## 
#Display the names found in "teams_thirty".
names(teams_thirty)
##  [1] "yearID" "lgID"   "teamID" "W"      "L"      "R"      "AB"    
##  [8] "H"      "HR"     "BB"     "SO"     "ERA"    "E"
#Display the structure of "teams_thirty".
str(teams_thirty)
## 'data.frame':    880 obs. of  13 variables:
##  $ yearID: int  1983 1983 1983 1983 1983 1983 1983 1983 1983 1983 ...
##  $ lgID  : Factor w/ 6 levels "AA","AL","FL",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ teamID: Factor w/ 149 levels "ALT","ANA","ARI",..: 5 52 93 134 83 16 45 33 66 131 ...
##  $ W     : int  98 92 91 89 87 78 70 99 79 77 ...
##  $ L     : int  64 70 71 73 75 84 92 63 83 85 ...
##  $ R     : int  799 789 770 795 764 724 704 800 696 639 ...
##  $ AB    : int  5546 5592 5631 5581 5620 5590 5476 5484 5598 5610 ...
##  $ H     : int  1492 1530 1535 1546 1556 1512 1451 1439 1515 1429 ...
##  $ HR    : int  168 156 153 167 132 142 86 157 109 106 ...
##  $ BB    : int  601 508 533 510 475 536 605 527 397 442 ...
##  $ SO    : int  800 831 686 810 665 758 691 888 722 767 ...
##  $ ERA   : num  3.63 3.8 3.86 4.12 4.02 4.34 4.43 3.67 4.25 3.31 ...
##  $ E     : int  121 124 139 115 113 130 122 120 164 113 ...
#Create a pairs plot for "teams_thirty".
par(mfrow=c(1,1))
pairs(teams_thirty)

Upon generating these pairs plots, the number of hits (‘H’), homeruns (‘HR’), strikeouts (‘SO’), and walks (‘BB’) earned by a given team in a given season each appear to be noticeably associated with team losses (‘L’). Therefore, ‘H’, ‘HR’, ‘SO’, and ‘BB’ seem to be the most interesting and relevant factors to include in this analysis.
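As a supplementary numerical check (a small sketch using base R's cor(), not part of the original recipe), the strength and direction of these relationships can also be summarized before the factors are converted into categorical levels, since at this point ‘H’, ‘HR’, ‘SO’, and ‘BB’ are still stored as integers in “teams_thirty”:

#Correlation matrix between the response 'L' and the four candidate factors
#(computed while the factors are still numeric).
round(cor(teams_thirty[, c("L", "H", "HR", "SO", "BB")]), 2)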

Replicate: Are there replicates and/or repeated measures?

In this experiment, there are no replicates or repeated measures present.

Block: Did you use blocking in the design?

In order to transform the integer factor variables into categorical factor variables, a form of blocking was used in this design. In transforming the number of hits (‘H’), homeruns (‘HR’), strikeouts (‘SO’), and walks (‘BB’) into categorical variables, three levels were defined for each factor, designating a low, medium, or high number of hits/homeruns/strikeouts/walks earned by a given team in a given season. The cut points for each factor were taken from the first and third quartiles of that factor’s values in the subsetted dataset “teams_thirty” (see the numeric levels in the R code below). Therefore, in this newly subsetted dataset, ‘H’, ‘HR’, ‘SO’, and ‘BB’ each have three distinct levels (“Low”, “Medium”, and “High”).

#Determine quartile values for each factor.
summary(teams_thirty$H)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     963    1385    1441    1434    1499    1684
summary(teams_thirty$HR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    58.0   127.0   154.5   155.6   180.0   264.0
summary(teams_thirty$SO)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     568     910    1014    1011    1107    1535
summary(teams_thirty$BB)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   319.0   479.8   527.0   530.2   578.5   775.0
#Transform 'H' into categorical variables (Low, Medium, and High).
teams_thirty$H[teams_thirty$H > 0 & teams_thirty$H <= 1385] = "Low"
teams_thirty$H[teams_thirty$H > 1385 & teams_thirty$H <= 1499] = "Medium"
teams_thirty$H[teams_thirty$H > 1499 & teams_thirty$H <= 1684] = "High"
#Categorize 'H' as a factor and display its resulting levels.
teams_thirty$H = as.factor(teams_thirty$H)
levels(teams_thirty$H)
## [1] "High"   "Low"    "Medium"
#Transform 'HR' into categorical variables (Low, Medium, and High).
teams_thirty$HR[teams_thirty$HR > 0 & teams_thirty$HR <= 127] = "Low"
teams_thirty$HR[teams_thirty$HR > 127 & teams_thirty$HR <= 180] = "Medium"
teams_thirty$HR[teams_thirty$HR > 180 & teams_thirty$HR <= 264] = "High"
#Categorize 'HR' as a factor and display its resulting levels.
teams_thirty$HR = as.factor(teams_thirty$HR)
levels(teams_thirty$HR)
## [1] "High"   "Low"    "Medium"
#Transform 'SO' into categorical variables (Low, Medium, and High).
teams_thirty$SO[teams_thirty$SO > 0 & teams_thirty$SO <= 910] = "Low"
teams_thirty$SO[teams_thirty$SO > 1107 & teams_thirty$SO <= 1535] = "High"
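#Note: the string assignments above coerce 'SO' to character, so a numeric range
#comparison for the "Medium" band would no longer behave as intended (e.g. "911" <= "1107"
#is FALSE when compared as strings); the remaining "Medium" rows are therefore
#identified by set difference below.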
union(which(teams_thirty$SO == "High"),which(teams_thirty$SO == "Low"))
##   [1]  92  94 122 211 264 354 357 361 366 370 371 372 376 380 382 383 385
##  [18] 386 388 389 390 392 397 400 408 414 415 416 417 425 430 442 445 446
##  [35] 447 448 450 455 459 471 475 477 478 480 483 485 490 497 500 505 508
##  [52] 509 512 514 519 526 528 530 536 537 538 539 540 544 565 569 570 573
##  [69] 579 584 587 598 600 601 602 604 609 623 627 629 642 644 657 658 659
##  [86] 662 663 664 665 670 671 674 680 683 684 686 689 690 691 693 694 695
## [103] 696 697 698 703 706 712 713 715 716 719 721 723 727 728 730 732 734
## [120] 737 738 742 746 747 748 750 751 752 755 757 758 759 760 763 764 766
## [137] 767 768 769 774 775 777 779 780 782 783 785 787 788 789 790 791 792
## [154] 796 797 798 799 800 804 807 808 809 810 812 813 815 816 817 819 820
## [171] 822 826 827 828 829 830 831 833 834 835 836 837 838 839 840 841 842
## [188] 844 845 847 848 849 850 852 854 855 856 857 858 859 860 861 863 864
## [205] 865 866 867 868 869 870 871 872 873 874 875 876 877 878 880   1   2
## [222]   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
## [239]  22  23  24  28  29  30  31  32  33  34  36  37  38  39  40  45  46
## [256]  47  48  49  50  53  54  56  57  58  59  60  61  62  63  64  66  67
## [273]  68  69  72  73  74  75  76  77  79  81  82  85  86  95 104 109 112
## [290] 115 129 131 132 136 137 139 141 142 144 145 148 149 153 155 156 159
## [307] 160 161 163 164 165 168 169 170 173 179 180 183 186 188 191 194 195
## [324] 196 198 200 202 207 210 212 213 215 216 217 220 223 224 226 227 229
## [341] 236 237 238 239 241 242 243 244 246 247 248 249 252 256 257 260 261
## [358] 262 265 266 268 271 273 276 277 281 289 290 291 292 293 294 295 296
## [375] 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313
## [392] 314 315 316 317 318 319 320 323 324 326 327 328 329 338 340 343 345
## [409] 350 374 409 432 439 460 464 469 489 495 532 559 561 563 607 608 619
## [426] 621 622 638 641 647 649 668 673 681 682 687 699 700 714 765
ind <- setdiff(seq(1:880),union(which(teams_thirty$SO == "High"),which(teams_thirty$SO == "Low")))
teams_thirty$SO[ind] = "Medium"
#Categorize 'SO' as a factor and display its resulting levels.
teams_thirty$SO = as.factor(teams_thirty$SO)
levels(teams_thirty$SO)
## [1] "High"   "Low"    "Medium"
#Transform 'BB' into categorical variables (Low, Medium, and High).
teams_thirty$BB[teams_thirty$BB > 0 & teams_thirty$BB <= 479.8] = "Low"
teams_thirty$BB[teams_thirty$BB > 479.8 & teams_thirty$BB <= 578.5] = "Medium"
teams_thirty$BB[teams_thirty$BB > 578.5 & teams_thirty$BB <= 775] = "High"
#Categorize 'BB' as a factor and display its resulting levels.
teams_thirty$BB = as.factor(teams_thirty$BB)
levels(teams_thirty$BB)
## [1] "High"   "Low"    "Medium"

3. (Statistical) Analysis

(Exploratory Data Analysis) Graphics and Descriptive Summary

To begin exploring the data graphically, summary statistics were gathered for the newly modified dataset, “teams_thirty”. Additionally, a histogram and team-by-team boxplots were created to show the distribution of regular season losses within this subsetted dataset, which contains data from 1983-2013.

#Display the summary statistics of "teams_thirty".
summary(teams_thirty)
##      yearID     lgID         teamID          W               L        
##  Min.   :1983   AA:  0   ATL    : 31   Min.   : 43.0   Min.   : 40.0  
##  1st Qu.:1991   AL:435   BAL    : 31   1st Qu.: 72.0   1st Qu.: 72.0  
##  Median :1999   FL:  0   BOS    : 31   Median : 80.0   Median : 79.0  
##  Mean   :1999   NL:445   CHA    : 31   Mean   : 79.9   Mean   : 79.9  
##  3rd Qu.:2006   PL:  0   CHN    : 31   3rd Qu.: 89.0   3rd Qu.: 88.0  
##  Max.   :2013   UA:  0   CIN    : 31   Max.   :116.0   Max.   :119.0  
##                          (Other):694                                  
##        R                AB            H            HR           BB     
##  Min.   : 466.0   Min.   :3856   High  :216   High  :217   High  :220  
##  1st Qu.: 670.8   1st Qu.:5469   Low   :223   Low   :222   Low   :220  
##  Median : 731.0   Median :5524   Medium:441   Medium:441   Medium:440  
##  Mean   : 731.9   Mean   :5468                                         
##  3rd Qu.: 791.0   3rd Qu.:5588                                         
##  Max.   :1009.0   Max.   :5781                                         
##                                                                        
##       SO           ERA              E        
##  High  :219   Min.   :2.910   Min.   : 54.0  
##  Low   :221   1st Qu.:3.800   1st Qu.: 99.0  
##  Medium:440   Median :4.180   Median :112.0  
##               Mean   :4.211   Mean   :112.8  
##               3rd Qu.:4.580   3rd Qu.:126.0  
##               Max.   :6.380   Max.   :179.0  
## 
#Display the names found in "teams_thirty".
names(teams_thirty)
##  [1] "yearID" "lgID"   "teamID" "W"      "L"      "R"      "AB"    
##  [8] "H"      "HR"     "BB"     "SO"     "ERA"    "E"
#Display the structure of "teams_thirty".
str(teams_thirty)
## 'data.frame':    880 obs. of  13 variables:
##  $ yearID: int  1983 1983 1983 1983 1983 1983 1983 1983 1983 1983 ...
##  $ lgID  : Factor w/ 6 levels "AA","AL","FL",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ teamID: Factor w/ 149 levels "ALT","ANA","ARI",..: 5 52 93 134 83 16 45 33 66 131 ...
##  $ W     : int  98 92 91 89 87 78 70 99 79 77 ...
##  $ L     : int  64 70 71 73 75 84 92 63 83 85 ...
##  $ R     : int  799 789 770 795 764 724 704 800 696 639 ...
##  $ AB    : int  5546 5592 5631 5581 5620 5590 5476 5484 5598 5610 ...
##  $ H     : Factor w/ 3 levels "High","Low","Medium": 3 1 1 1 1 1 3 3 1 3 ...
##  $ HR    : Factor w/ 3 levels "High","Low","Medium": 3 3 3 3 3 3 2 3 2 2 ...
##  $ BB    : Factor w/ 3 levels "High","Low","Medium": 1 3 3 3 2 3 1 3 2 2 ...
##  $ SO    : Factor w/ 3 levels "High","Low","Medium": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ERA   : num  3.63 3.8 3.86 4.12 4.02 4.34 4.43 3.67 4.25 3.31 ...
##  $ E     : int  121 124 139 115 113 130 122 120 164 113 ...
#Display the head and tail of "teams_thirty".
head(teams_thirty)
##      yearID lgID teamID  W  L   R   AB      H     HR     BB  SO  ERA   E
## 1866   1983   AL    BAL 98 64 799 5546 Medium Medium   High Low 3.63 121
## 1867   1983   AL    DET 92 70 789 5592   High Medium Medium Low 3.80 124
## 1868   1983   AL    NYA 91 71 770 5631   High Medium Medium Low 3.86 139
## 1869   1983   AL    TOR 89 73 795 5581   High Medium Medium Low 4.12 115
## 1870   1983   AL    ML4 87 75 764 5620   High Medium    Low Low 4.02 113
## 1871   1983   AL    BOS 78 84 724 5590   High Medium Medium Low 4.34 130
tail(teams_thirty)
##      yearID lgID teamID  W   L   R   AB      H     HR     BB     SO  ERA
## 2740   2013   NL    MIA 62 100 513 5449    Low    Low    Low   High 3.71
## 2741   2013   NL    LAN 92  70 649 5491 Medium Medium    Low   High 3.25
## 2742   2013   NL    ARI 81  81 685 5676 Medium Medium Medium   High 3.92
## 2743   2013   NL    SDN 76  86 618 5517    Low Medium    Low   High 3.98
## 2744   2013   NL    SFN 76  86 629 5552 Medium    Low    Low Medium 4.00
## 2745   2013   NL    COL 74  88 706 5599   High Medium    Low   High 4.44
##        E
## 2740  88
## 2741 109
## 2742  75
## 2743  83
## 2744 107
## 2745  90
#Display the levels of 'H' within "teams_thirty".
levels(teams_thirty$H)
## [1] "High"   "Low"    "Medium"
#Display the levels of 'HR' within "teams_thirty".
levels(teams_thirty$HR)
## [1] "High"   "Low"    "Medium"
#Display the levels of 'SO' within "teams_thirty".
levels(teams_thirty$SO)
## [1] "High"   "Low"    "Medium"
#Display the levels of 'BB' within "teams_thirty".
levels(teams_thirty$BB)
## [1] "High"   "Low"    "Medium"
par(mfrow=c(1,1))
#Create a histogram of Regular Season Losses ('L') for teams from 1983-2013.
hist(teams_thirty$L, xlim=c(35,120), xlab = "Number of Regular Season Losses", main = "Histogram of MLB Regular Season Losses from 1983-2013")

par(mfrow=c(1,1))
#Create a boxplot of Regular Season Losses ('L') [for all teams from 1983-2013].
boxplot(teams_thirty$L~teams_thirty$teamID, main = "Boxplot of MLB Regular Season Losses from 1983-2013", ylim = c(35,120), xlab = "MLB Teams", ylab = "Regular Season Losses")

Testing - ANOVA

In order to determine whether the variation observed in the response variable (the number of regular season losses) can be explained by the variation in the treatments (the number of hits (‘H’), homeruns (‘HR’), strikeouts (‘SO’), and walks (‘BB’)), an analysis of variance (ANOVA) is performed to analyze the differences in regular season losses across the levels of these four factors for each team in each season from 1983-2013 contained within the dataset.

For the ANOVA model designed in this experiment, the null hypothesis being tested (which we will either reject or fail to reject by the end of the analysis) states that the number of hits (‘H’), homeruns (‘HR’), strikeouts (‘SO’), and walks (‘BB’) earned by a given team in a given year do not have a significant effect on the number of regular season losses that the team earns in that year, implying that any differences in the mean number of losses are solely the result of randomization. In other words, if we reject the null hypothesis, we infer that the differences in mean losses across the levels of ‘H’, ‘HR’, ‘SO’, and ‘BB’ are caused by something other than randomization, leading us to believe that the variation observed in mean losses can be explained by the variation in these four factors. Conversely, if we fail to reject the null hypothesis, we infer that the variation observed in mean losses cannot be explained by the variation in ‘H’, ‘HR’, ‘SO’, and ‘BB’ and, as such, is likely due to randomization.

#Perform an analysis of variance (ANOVA) for the different mean values observed in the number of regular season losses earned in a given year by a given team, given the factors 'H', 'HR', 'SO', and 'BB'.
model_losses <- aov(L~H*HR*SO*BB,teams_thirty)
anova(model_losses)
## Analysis of Variance Table
## 
## Response: L
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## H            2   3023  1511.6 13.6591 1.465e-06 ***
## HR           2   2083  1041.4  9.4103 9.118e-05 ***
## SO           2   5537  2768.6 25.0182 2.865e-11 ***
## BB           2   6330  3164.8 28.5983 9.972e-13 ***
## H:HR         4    702   175.6  1.5866  0.175868    
## H:SO         4   3793   948.2  8.5678 8.969e-07 ***
## HR:SO        4    315    78.8  0.7125  0.583516    
## H:BB         4   3232   808.0  7.3009 8.891e-06 ***
## HR:BB        4   1759   439.7  3.9730  0.003362 ** 
## SO:BB        4    978   244.5  2.2092  0.066299 .  
## H:HR:SO      7    631    90.1  0.8141  0.575620    
## H:HR:BB      8   1104   138.0  1.2468  0.268570    
## H:SO:BB      8   2229   278.6  2.5172  0.010471 *  
## HR:SO:BB     7   1346   192.3  1.7381  0.096887 .  
## H:HR:SO:BB   7    389    55.6  0.5026  0.832966    
## Residuals  810  89639   110.7                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

For the ANOVA where ‘H’ is analyzed against the response variable ‘L’, a p-value of 1.465e-06 is returned, indicating a probability of roughly 1.465e-06 that the associated F-value (13.6591) would arise from randomization alone. Based on this result, we reject the null hypothesis and conclude that the variation observed in the mean number of regular season losses earned by a given team in a given year can be explained, at least in part, by the variation in the number of hits, and is likely not caused solely by randomization.

For the ANOVA where ‘HR’ is analyzed against ‘L’, a p-value of 9.118e-05 is returned, indicating a probability of roughly 9.118e-05 that the associated F-value (9.4103) would arise from randomization alone. We therefore reject the null hypothesis and conclude that the variation in mean losses can be explained in part by the variation in the number of homeruns, and is likely not caused solely by randomization.

For the ANOVA where ‘SO’ is analyzed against ‘L’, a p-value of 2.865e-11 is returned, indicating a probability of roughly 2.865e-11 that the associated F-value (25.0182) would arise from randomization alone. We therefore reject the null hypothesis and conclude that the variation in mean losses can be explained in part by the variation in the number of strikeouts, and is likely not caused solely by randomization.

For the ANOVA where ‘BB’ is analyzed against ‘L’, a p-value of 9.972e-13 is returned, indicating a probability of roughly 9.972e-13 that the associated F-value (28.5983) would arise from randomization alone. We therefore reject the null hypothesis and conclude that the variation in mean losses can be explained in part by the variation in the number of walks, and is likely not caused solely by randomization.

For the ANOVA where the different interaction combinations of ‘H’, ‘HR’, ‘SO’, and ‘BB’ are analyzed against the response variable ‘L’, a p-value < 0.05 is returned for the interactions ‘H:SO’, ‘H:BB’, ‘HR:BB’, and ‘H:SO:BB’ (see the resulting p-values in the table). This indicates that there is less than a 0.05 probability that the associated F-values for these specific interactions (see the resulting F-values in the table) arise from randomization alone. Therefore, based on this result, we would reject the null hypothesis for these specific interactions, leading us to infer that part of the variation observed in the mean number of regular season losses earned by a given team in a given year can be explained by the variation in the factor interactions ‘H:SO’, ‘H:BB’, ‘HR:BB’, and ‘H:SO:BB’, and, as such, is likely not caused solely by randomization. (Note: the F-values for the remaining factor interactions are plausibly the result of randomization alone, so we fail to reject the null hypothesis for those interactions.)
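Rather than reading the significant terms off the printed table by eye, they can also be extracted programmatically; a small supplementary sketch (not part of the original recipe):

#List the ANOVA terms (main effects and interactions) with p-values below 0.05.
anova_table <- anova(model_losses)
rownames(anova_table)[which(anova_table[["Pr(>F)"]] < 0.05)]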

Tukey Honest Significant Differences

In further carrying out this analysis, we can compute Tukey Honest Significant Differences (via “TukeyHSD()”) to determine which pairs of levels within each of the factors ‘H’, ‘HR’, ‘SO’, and ‘BB’ differ significantly from one another in their mean effect on the response variable, ‘L’.

#Perform a TukeyHSD Test for "model_losses" when considering the factor 'H'.
Tukey_losses_H = TukeyHSD(model_losses, which = "H", ordered = FALSE, conf.level = 0.95)
Tukey_losses_H
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = L ~ H * HR * SO * BB, data = teams_thirty)
## 
## $H
##                  diff       lwr      upr     p adj
## Low-High     4.878239  2.520145 7.236332 0.0000043
## Medium-High  3.846655  1.795283 5.898028 0.0000360
## Medium-Low  -1.031583 -3.061230 0.998063 0.4574313
par(mfrow=c(1,1))
plot(Tukey_losses_H)

#Perform a TukeyHSD Test for "model_losses" when considering the factor 'HR'.
Tukey_losses_HR = TukeyHSD(model_losses, which = "HR", ordered = FALSE, conf.level = 0.95)
Tukey_losses_HR
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = L ~ H * HR * SO * BB, data = teams_thirty)
## 
## $HR
##                   diff       lwr      upr     p adj
## Low-High     3.4898172  1.131871 5.847764 0.0015579
## Medium-High  3.3010811  1.252884 5.349279 0.0004853
## Medium-Low  -0.1887361 -2.221416 1.843944 0.9741386
par(mfrow=c(1,1))
plot(Tukey_losses_HR)

#Perform a TukeyHSD Test for "model_losses" when considering the factor 'SO'.
Tukey_losses_SO = TukeyHSD(model_losses, which = "SO", ordered = FALSE, conf.level = 0.95)
Tukey_losses_SO
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = L ~ H * HR * SO * BB, data = teams_thirty)
## 
## $SO
##                  diff       lwr        upr     p adj
## Low-High    -6.455233 -8.810370 -4.1000954 0.0000000
## Medium-High -2.889600 -4.932290 -0.8469109 0.0026886
## Medium-Low   3.565632  1.529123  5.6021410 0.0001284
par(mfrow=c(1,1))
plot(Tukey_losses_SO)

#Perform a TukeyHSD Test for "model_losses" when considering the factor 'BB'.
Tukey_losses_BB = TukeyHSD(model_losses, which = "BB", ordered = FALSE, conf.level = 0.95)
Tukey_losses_BB
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = L ~ H * HR * SO * BB, data = teams_thirty)
## 
## $BB
##                  diff       lwr        upr     p adj
## Low-High     6.774948  4.419836 9.13006128 0.0000000
## Medium-High  4.830871  2.791284 6.87045874 0.0000001
## Medium-Low  -1.944077 -3.983665 0.09551026 0.0655365
par(mfrow=c(1,1))
plot(Tukey_losses_BB)

After observing the results of these Tukey Honest Significant Differences for the ANOVA model relating ‘L’ to the factors ‘H’, ‘HR’, ‘SO’, and ‘BB’, nearly every level comparison within each individual factor suggests a significant effect on ‘L’ that is not due solely to randomization, since its p-value is less than 0.05. The exceptions are the “Medium-Low” comparisons for ‘H’, ‘HR’, and ‘BB’, whose p-values exceed 0.05. For every other comparison, we would reject the null hypothesis that the number of hits, homeruns, strikeouts, and walks earned by a given team in a given year has no significant effect on the number of regular season losses that the team earns in that year.
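For convenience, the four sets of Tukey intervals can also be drawn side by side in a single graphics window (a supplementary sketch, not part of the original recipe):

#Plot all four TukeyHSD results in a 2x2 grid, then reset the layout.
par(mfrow = c(2, 2))
plot(Tukey_losses_H)
plot(Tukey_losses_HR)
plot(Tukey_losses_SO)
plot(Tukey_losses_BB)
par(mfrow = c(1, 1))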

Testing - Response Surface Methods

In order to determine the extent to which the variation observed in the response variable (the number of regular season losses) can be explained by the variation in the treatments (the number of hits (‘H’), homeruns (‘HR’), strikeouts (‘SO’), and walks (‘BB’)), response surface methods are used in this experiment to identify the optimal “operating conditions” for the response variable ‘L’.

#Create a new dataset to be used for the Response Surface Methods analysis.
teams_rsm <- teams_thirty[,]
#Convert the factor 'H' into an integer variable designated by its levels "Low", "Medium" and "High".
teams_rsm$H <- as.character(teams_rsm$H)
teams_rsm$H[teams_rsm$H == "Low"] <- 1
teams_rsm$H[teams_rsm$H == "Medium"] <- 2
teams_rsm$H[teams_rsm$H == "High"] <- 3
#Categorize 'H' as an integer variable and display its resulting structure.
teams_rsm$H = as.integer(teams_rsm$H)
str(teams_rsm$H)
##  int [1:880] 2 3 3 3 3 3 2 2 3 2 ...
#Convert the factor 'HR' into an integer variable designated by its levels "Low", "Medium" and "High".
teams_rsm$HR <- as.character(teams_rsm$HR)
teams_rsm$HR[teams_rsm$HR == "Low"] <- 1
teams_rsm$HR[teams_rsm$HR == "Medium"] <- 2
teams_rsm$HR[teams_rsm$HR == "High"] <- 3
#Categorize 'HR' as an integer variable and display its resulting structure.
teams_rsm$HR = as.integer(teams_rsm$HR)
str(teams_rsm$HR)
##  int [1:880] 2 2 2 2 2 2 1 2 1 1 ...
#Convert the factor 'SO' into an integer variable designated by its levels "Low", "Medium" and "High".
teams_rsm$SO <- as.character(teams_rsm$SO)
teams_rsm$SO[teams_rsm$SO == "Low"] <- 1
teams_rsm$SO[teams_rsm$SO == "Medium"] <- 2
teams_rsm$SO[teams_rsm$SO == "High"] <- 3
#Categorize 'SO' as an integer variable and display its resulting structure.
teams_rsm$SO = as.integer(teams_rsm$SO)
str(teams_rsm$SO)
##  int [1:880] 1 1 1 1 1 1 1 1 1 1 ...
#Convert the factor 'BB' into an integer variable designated by its levels "Low", "Medium" and "High".
teams_rsm$BB <- as.character(teams_rsm$BB)
teams_rsm$BB[teams_rsm$BB == "Low"] <- 1
teams_rsm$BB[teams_rsm$BB == "Medium"] <- 2
teams_rsm$BB[teams_rsm$BB == "High"] <- 3
#Categorize 'BB' as an integer variable and display its resulting structure.
teams_rsm$BB = as.integer(teams_rsm$BB)
str(teams_rsm$BB)
##  int [1:880] 3 2 2 2 1 2 3 2 1 1 ...
#Load in Response Surface Methods package, "rsm".
library("rsm")
## Warning: package 'rsm' was built under R version 3.1.2
#Create a second-order Response Surface Methods model.
losses.rsm <- rsm(L~SO(H, HR, SO, BB), data = teams_rsm)
summary(losses.rsm)
## 
## Call:
## rsm(formula = L ~ SO(H, HR, SO, BB), data = teams_rsm)
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 52.7320819  5.8718818  8.9804 < 2.2e-16 ***
## H           13.6730442  3.4672541  3.9435  8.68e-05 ***
## HR          -2.1817751  3.4586450 -0.6308 0.5283262    
## SO          13.0853321  3.4696870  3.7713 0.0001734 ***
## BB           7.0622695  3.3517047  2.1071 0.0353991 *  
## H:HR         0.5724484  1.0810272  0.5295 0.5965660    
## H:SO        -2.7744589  0.8948929 -3.1003 0.0019957 ** 
## H:BB        -3.1245882  0.8606482 -3.6305 0.0002995 ***
## HR:SO       -0.2002805  1.0570814 -0.1895 0.8497724    
## HR:BB        0.0038567  1.0286509  0.0037 0.9970094    
## SO:BB       -1.5605071  0.8516473 -1.8323 0.0672445 .  
## H^2         -0.8614669  0.8419113 -1.0232 0.3064863    
## HR^2        -0.3348097  0.9719714 -0.3445 0.7305807    
## SO^2        -0.0661058  0.8229784 -0.0803 0.9359973    
## BB^2        -0.3753587  0.8011825 -0.4685 0.6395409    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Multiple R-squared:  0.1758, Adjusted R-squared:  0.1625 
## F-statistic: 13.18 on 14 and 865 DF,  p-value: < 2.2e-16
## 
## Analysis of Variance Table
## 
## Response: L
##                     Df Sum Sq Mean Sq F value    Pr(>F)
## FO(H, HR, SO, BB)    4  15779  3944.6 33.6340 < 2.2e-16
## TWI(H, HR, SO, BB)   6   5708   951.4  8.1121 1.514e-08
## PQ(H, HR, SO, BB)    4    154    38.6  0.3288    0.8587
## Residuals          865 101448   117.3                  
## Lack of fit         55  11809   214.7  1.9402 8.391e-05
## Pure error         810  89639   110.7                  
## 
## Stationary point of response surface:
##          H         HR         SO         BB 
##  2.1160580 -0.8014994 -2.0712852  4.9014763 
## 
## Eigenanalysis:
## $values
## [1]  1.1991517  0.5728527 -0.3722024 -3.0375430
## 
## $vectors
##          [,1]        [,2]        [,3]        [,4]
## H  -0.7106911 -0.07638990  0.07534390  0.69527406
## HR -0.1656596  0.05499774 -0.98301059 -0.05676561
## SO  0.5149352 -0.70326528 -0.15301547  0.46566718
## BB  0.4497964  0.70466856 -0.06782137  0.54454175
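
To make the canonical-analysis quantities easier to reference in the discussion below, the stationary point and eigenvalues can also be pulled out of the summary object programmatically; this is a minimal sketch that reuses only the "losses.rsm" fit created above.

#Sketch: extract the canonical analysis (stationary point and eigenvalues) from the fitted model.
canon <- summary(losses.rsm)$canonical
canon$xs             #coordinates of the stationary point (in coded units)
canon$eigen$values   #eigenvalues of the second-order coefficient matrix
#A mix of positive and negative eigenvalues indicates a saddle point.
any(canon$eigen$values > 0) && any(canon$eigen$values < 0)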

Judging from the p-values reported for the model "losses.rsm", we would likely reject this experiment's original null hypothesis (see the ANOVA model in the section above) for the factors 'H', 'SO', and 'BB', as well as for the factor interactions 'H:SO' and 'H:BB', with respect to the significance of their effects on the response variable 'L'. (Note: the interaction 'SO:BB' has a p-value of 0.0672445, close to, but above, the significance threshold of 0.05.)

The stationary point of the response surface generated by "losses.rsm" lies at 'H' = 2.1160580, 'HR' = -0.8014994, 'SO' = -2.0712852, and 'BB' = 4.9014763. The nature of the surface can be examined through the four eigenvalues produced by the eigenanalysis: 1.1991517, 0.5728527, -0.3722024, and -3.0375430. Because both positive and negative eigenvalues are present, the response surface exhibits a saddle, i.e. the stationary point is a saddle point rather than a maximum or a minimum. Associating each eigenvalue with its corresponding factor in the order 'H', 'HR', 'SO', 'BB', the surface further suggests a maximum with respect to the 'SO'/'BB' pair (two negative eigenvalues), a minimum with respect to the 'H'/'HR' pair (two positive eigenvalues), and saddle behavior with respect to the 'H'/'SO', 'H'/'BB', 'HR'/'SO', and 'HR'/'BB' pairs (one positive and one negative eigenvalue in each case).

Lastly, the p-value associated with the lack-of-fit term (8.391e-05 < 0.05) suggests that the model does not fit the data well, while the p-value for the two-way-interaction term (TWI, 1.514e-08 < 0.05) indicates that the second-order interaction terms contribute significantly to the model. The same holds for the first-order term (FO), but not for the pure-quadratic term (PQ). The shape of the response surface described above is illustrated by the contour plots and surface plots below:

#Generate contour plots of the model "losses.rsm", evaluated at the stationary point.
par(mfrow=c(2,3))
contour(losses.rsm, ~H + HR + SO + BB, at = summary(losses.rsm)$canonical$xs, image = TRUE)

par(mfrow=c(1,1))
contour(losses.rsm, ~H + HR, at = summary(losses.rsm)$canonical$xs, image = TRUE)

par(mfrow=c(1,1))
contour(losses.rsm, ~H + SO, at = summary(losses.rsm)$canonical$xs, image = TRUE)

par(mfrow=c(1,1))
contour(losses.rsm, ~H + BB, at = summary(losses.rsm)$canonical$xs, image = TRUE)

par(mfrow=c(1,1))
contour(losses.rsm, ~HR + SO, at = summary(losses.rsm)$canonical$xs, image = TRUE)

par(mfrow=c(1,1))
contour(losses.rsm, ~HR + BB, at = summary(losses.rsm)$canonical$xs, image = TRUE)

par(mfrow=c(1,1))
contour(losses.rsm, ~SO + BB, at = summary(losses.rsm)$canonical$xs, image = TRUE)

#Generate surface plots of the model "losses.rsm", evaluated at the stationary point.
library("rgl")
## Warning: package 'rgl' was built under R version 3.1.2
persp(losses.rsm, ~H + HR, at = summary(losses.rsm)$canonical$xs, theta = -60, zlab = "Regular Season Losses", contour = "colors")

persp(losses.rsm, ~H + SO, at = summary(losses.rsm)$canonical$xs, theta = 30, zlab = "Regular Season Losses", contour = "colors")

persp(losses.rsm, ~H + BB, at = summary(losses.rsm)$canonical$xs, theta = -60, zlab = "Regular Season Losses", contour = "colors")

persp(losses.rsm, ~HR + SO, at = summary(losses.rsm)$canonical$xs, theta = 50, zlab = "Regular Season Losses", contour = "colors")

persp(losses.rsm, ~HR + BB, at = summary(losses.rsm)$canonical$xs, theta = 30, zlab = "Regular Season Losses", contour = "colors")

persp(losses.rsm, ~SO + BB, at = summary(losses.rsm)$canonical$xs, theta = -60, zlab = "Regular Season Losses", contour = "colors")

Estimation (of Parameters)

To estimate the parameters of this experiment, summary statistics are computed on the relevant columns of "teams_thirty": the number of regular season losses earned by each team in each year (reporting both the mean and the standard deviation of 'L'), and the numbers of hits, homeruns, strikeouts, and walks earned by each team in each year (reporting, for each of 'H', 'HR', 'SO', and 'BB', the counts of team-seasons classified as "High", "Medium", and "Low", along with the standard deviation of those coded levels).

Tables of Parameter Values

#Display summary statistics of teams_thirty$L.
summary(teams_thirty$L)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    40.0    72.0    79.0    79.9    88.0   119.0
#Display standard deviation of teams_thirty$L.
sd(teams_thirty$L, na.rm = FALSE)
## [1] 11.83356
#Display summary statistics of teams_thirty$H.
summary(teams_thirty$H)
##   High    Low Medium 
##    216    223    441
#Display standard deviation of teams_thirty$H.
sd(teams_thirty$H, na.rm = FALSE)
## [1] 0.8258285
#Display summary statistics of teams_thirty$HR.
summary(teams_thirty$HR)
##   High    Low Medium 
##    217    222    441
#Display standard deviation of teams_thirty$HR.
sd(teams_thirty$HR, na.rm = FALSE)
## [1] 0.8268681
#Display summary statistics of teams_thirty$SO.
summary(teams_thirty$SO)
##   High    Low Medium 
##    219    221    440
#Display standard deviation of teams_thirty$SO.
sd(teams_thirty$SO, na.rm = FALSE)
## [1] 0.8285978
#Display summary statistics of teams_thirty$BB.
summary(teams_thirty$BB)
##   High    Low Medium 
##    220    220    440
#Display standard deviation of teams_thirty$BB.
sd(teams_thirty$BB, na.rm = FALSE)
## [1] 0.8296277
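
The same parameter summaries can be gathered in a single pass; the sketch below assumes only that 'H', 'HR', 'SO', and 'BB' are still stored as factors in "teams_thirty" and that 'L' is numeric.

#Sketch: tabulate the coded factor levels and summarize the response together.
sapply(teams_thirty[, c("H", "HR", "SO", "BB")], table)
c(mean_L = mean(teams_thirty$L), sd_L = sd(teams_thirty$L))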

Diagnostics/Model Adequacy Checking

To verify the results of this experiment, it is important to confirm that the dataset meets the assumptions of the design and modeling approach that was carried out; in particular, we want the data to be approximately normal. Until normality is reasonably established, we cannot say with confidence that our results are significant and representative of a properly executed modeling approach. To check for normality, we can construct Normal Quantile-Quantile (Q-Q) plots for the data and for the residuals of the fitted models (including the response surface model "losses.rsm"), and we can perform a Shapiro-Wilk test of normality on the data.

#Create a Normal Q-Q Plot for the numbers of regular season losses earned by a given team in a given year.
qqnorm(teams_thirty[,"L"], main = "Normal Q-Q Plot of Regular Season Losses")
qqline(teams_thirty[,"L"])

#Create a Normal Q-Q Plot of the residuals for "model_losses".
qqnorm(residuals(model_losses), main = "Normal Q-Q Plot of Residuals of 'model_losses'")
qqline(residuals(model_losses))

#Create a Normal Q-Q Plot of the residuals for "losses.rsm".
qqnorm(residuals(losses.rsm), main = "Normal Q-Q Plot of Residuals of 'losses.rsm'")
qqline(residuals(losses.rsm))

Shapiro-Wilk Test of Normality Results

#Perform a Shapiro-Wilk test of normality on the numbers of regular season losses earned by a given team in a given year (normality is assumed if p > 0.1).
shapiro.test(teams_thirty[,"L"])
## 
##  Shapiro-Wilk normality test
## 
## data:  teams_thirty[, "L"]
## W = 0.9958, p-value = 0.01724

After constructing Normal Q-Q plots (for the raw losses and for the residuals of "model_losses" and "losses.rsm") and performing the Shapiro-Wilk test of normality, it is reasonable to treat the data as approximately normal. Although the resulting Shapiro-Wilk p-value for teams_thirty[,"L"] was below 0.1, all of the Normal Q-Q plots show points that track the reference line closely. Additionally, because Lahman's Baseball Database is a comprehensive and accurate record of baseball statistics from 1871-2013 (containing, in principle, every observation that could have been collected over those years), the subset extracted from it covers essentially the full population of team-seasons in the 1983-2013 range, which gives us additional confidence in relying on the normal-theory results of this analysis.
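
As a complementary check (a sketch, not part of the original output), the same test can be applied directly to the residuals of the two fitted models, since it is the residuals whose normality underpins the inference; shapiro.test() accepts samples of this size (880 observations, well under its limit of 5000).

#Sketch: Shapiro-Wilk tests of normality on the model residuals (normality is assumed if p > 0.1, as above).
shapiro.test(residuals(model_losses))
shapiro.test(residuals(losses.rsm))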

To further back up our confidence in these results, we can check the quality of fit by plotting the residuals of each model against its fitted values, for both the ANOVA model developed in the original analysis ("model_losses") and the response surface model ("losses.rsm").

#Create a "Quality of Fit Model" that plots the residuals of "model_losses" against its fitted model.
par(mfrow=c(1,1))
plot(fitted(model_losses),residuals(model_losses), main = "Residuals of 'model_losses' Against Fitted Model 'model_losses'")

#Create a "Quality of Fit Model" that plots the residuals of "losses.rsm" against its fitted model.
par(mfrow=c(1,1))
plot(fitted(losses.rsm),residuals(losses.rsm), main = "Residuals of 'losses.rsm' Against Fitted Model 'losses.rsm'")

Because the resulting residuals-vs-fitted plots (for "model_losses" and "losses.rsm", respectively) both show residuals scattered fairly symmetrically around the zero line, the ANOVA and Response Surface Methods models that were developed appear to fit the data reasonably well. We can therefore rely on both the modeling approach that we carried out and the dataset that we analyzed when judging the significance of our results.
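
Because "losses.rsm" inherits from class "lm", base R's built-in diagnostic plot offers an equivalent residuals-vs-fitted view with a reference smoother added; this is a sketch of that alternative rather than part of the original analysis.

#Sketch: the first lm diagnostic plot is the residuals-vs-fitted view, with a smoothed trend line.
plot(losses.rsm, which = 1)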

4. Contingencies to the Experimental Design/Analysis if Modeling Assumptions Failed

If our modeling assumptions had failed in this analysis, we could still err on the side of caution by performing the nonparametric Kruskal-Wallis rank sum test to back up our original results; this test helps us decide whether the population distributions of the response are identical across factor levels without requiring the data to be normally distributed.

Kruskal-Wallis Rank Sum Test Results

#Perform a Kruskal-Wallis rank sum test on 'L' within the "teams_thirty" dataset across the levels of 'H' (identical populations are assumed if p > 0.05).
kruskal.test(teams_thirty[,"L"],teams_thirty$H)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  teams_thirty[, "L"] and teams_thirty$H
## Kruskal-Wallis chi-squared = 26.2239, df = 2, p-value = 2.021e-06
#Perform a Kruskal-Wallis rank sum test on 'L' within the "teams_thirty" dataset across the levels of 'HR' (identical populations are assumed if p > 0.05).
kruskal.test(teams_thirty[,"L"],teams_thirty$HR)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  teams_thirty[, "L"] and teams_thirty$HR
## Kruskal-Wallis chi-squared = 30.5457, df = 2, p-value = 2.329e-07
#Perform a Kruskal-Wallis rank sum test on 'L' within the "teams_thirty" dataset across the levels of 'SO' (identical populations are assumed if p > 0.05).
kruskal.test(teams_thirty[,"L"],teams_thirty$SO)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  teams_thirty[, "L"] and teams_thirty$SO
## Kruskal-Wallis chi-squared = 16.5727, df = 2, p-value = 0.0002519
#Perform a Kruskal-Wallis rank sum test on 'L' within the "teams_thirty" dataset across the levels of 'BB' (identical populations are assumed if p > 0.05).
kruskal.test(teams_thirty[,"L"],teams_thirty$BB)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  teams_thirty[, "L"] and teams_thirty$BB
## Kruskal-Wallis chi-squared = 71.4705, df = 2, p-value = 3.023e-16

Since the p-values from the Kruskal-Wallis rank sum tests comparing the response variable 'L' across the levels of 'H', 'HR', 'SO', and 'BB' individually are all less than 0.05, we conclude that the distributions of regular season losses are not identical across the "Low", "Medium", and "High" levels of each factor. This result again suggests rejecting the null hypothesis of our main experiment: the number of hits, homeruns, strikeouts, and walks earned by a given team in a given year likely does have a significant effect on the number of regular season losses that the team earns in that year. Furthermore, beyond falling back on a nonparametric analysis when normality cannot be assumed, a transformation such as the Box-Cox power transformation could have been applied to the data to bring it closer to normality. Such a transformation was not necessary here, however, because the nonparametric significance results from the Kruskal-Wallis rank sum tests were sufficient to give us confidence in the results of our analysis.
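
For reference, the Box-Cox power transformation mentioned above could be profiled with the MASS package; the sketch below assumes only that the integer-coded "teams_rsm" data are available and that the response 'L' is strictly positive (losses range from 40 to 119), and it explores the transformation parameter without refitting the full analysis.

#Sketch: profile the Box-Cox transformation parameter for a second-order model of 'L'.
library(MASS)
boxcox(lm(L ~ (H + HR + SO + BB)^2 + I(H^2) + I(HR^2) + I(SO^2) + I(BB^2), data = teams_rsm),
       lambda = seq(-2, 2, 0.1))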

5. References to the literature

[1] Lahman, S. (1996-2014). Lahman’s Baseball Database.

6. Appendices

A summary of, or pointer to, the raw data

Complete and documented R code

The updated version of the database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more (http://www.seanlahman.com/baseball-archive/statistics/). For more details on the latest release, please read the following documentation (http://seanlahman.com/files/database/readme2012.txt). The database can be used on any platform, but please be aware that this is not a standalone application. It is a database that requires Microsoft Access or some other relational database software to be useful.
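
As a rough, hypothetical sketch (not the document's exact preparation code), a comparable subset of the downloaded Teams.csv covering the 1983-2013 seasons could be produced as follows; the coding of 'H', 'HR', 'SO', and 'BB' into the "Low"/"Medium"/"High" levels used earlier is a separate step that is not reproduced here.

#Hypothetical sketch: restrict the Lahman Teams table to the 1983-2013 regular seasons analyzed above.
teams_thirty_raw <- subset(raw_teams, yearID >= 1983 & yearID <= 2013)
str(teams_thirty_raw$L)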