This is an R Markdown document. Markdown is a simple formatting syntax for authoring web pages (click the MD toolbar button for help on Markdown).
When you click the Knit HTML button a web page will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
as of August 28, 2014, superceding the version of August 24. Always use the most recent version.
In this study, a four-factor, multi-level experiment is performed to see if the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), or the number of walks (‘BB’) earned by a given team in a given season has a statistically significant effect on the number of losses (‘L’) that a given team earns in a given season (which is this analysis’ response variable).
##Load in the Teams Dataset
#Get dataset from Project Documents File
raw_teams <- read.csv("~/Academics (RPI)/09. Fall 2014/Design of Experiments/02. Wikibook Recipes/Recipe #09/Teams.csv", header=TRUE)
head(raw_teams)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1871 <NA> PH1 PNA 1 28 NA 21 7 Y
## 2 1871 <NA> CH1 CNA 2 28 NA 19 9 N
## 3 1871 <NA> BS1 BNA 3 31 NA 20 10 N
## 4 1871 <NA> WS3 OLY 4 32 NA 15 15 N
## 5 1871 <NA> NY2 NNA 5 33 NA 16 17 N
## 6 1871 <NA> TRO TRO 6 29 NA 13 15 N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 376 1281 410 66 27 9 46 23 56 NA NA NA 266 137 4.95 27 0 0
## 2 302 1196 323 52 21 10 60 22 69 NA NA NA 241 77 2.76 25 0 1
## 3 401 1372 426 70 37 3 60 19 73 NA NA NA 303 109 3.55 22 1 3
## 4 310 1353 375 54 26 6 48 13 48 NA NA NA 303 137 4.37 32 0 0
## 5 302 1404 403 43 21 1 33 15 46 NA NA NA 313 121 3.72 32 1 0
## 6 351 1248 384 51 34 6 49 19 62 NA NA NA 362 153 5.51 28 0 0
## IPouts HA HRA BBA SOA E DP FP name
## 1 747 329 3 53 16 194 NA 0.84 Philadelphia Athletics
## 2 753 308 6 28 22 218 NA 0.82 Chicago White Stockings
## 3 828 367 2 42 23 225 NA 0.83 Boston Red Stockings
## 4 846 371 4 45 13 217 NA 0.85 Washington Olympics
## 5 879 373 7 42 22 227 NA 0.83 New York Mutuals
## 6 750 431 4 75 12 198 NA 0.84 Troy Haymakers
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 Jefferson Street Grounds NA 102 98 ATH PH1
## 2 Union Base-Ball Grounds NA 104 102 CHI CH1
## 3 South End Grounds I NA 103 98 BOS BS1
## 4 Olympics Grounds NA 94 98 OLY WS3
## 5 Union Grounds (Brooklyn) NA 90 88 NYU NY2
## 6 Haymakers' Grounds NA 101 100 TRO TRO
## teamIDretro
## 1 PH1
## 2 CH1
## 3 BS1
## 4 WS3
## 5 NY2
## 6 TRO
tail(raw_teams)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin
## 2740 2013 NL MIA FLA E 5 162 81 62 100 N N
## 2741 2013 NL LAN LAD W 1 162 81 92 70 Y N
## 2742 2013 NL ARI ARI W 2 162 81 81 81 N N
## 2743 2013 NL SDN SDP W 3 162 81 76 86 N N
## 2744 2013 NL SFN SFG W 4 162 82 76 86 N N
## 2745 2013 NL COL COL W 5 162 81 74 88 N N
## LgWin WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER
## 2740 N N 513 5449 1257 219 31 95 432 1232 78 29 56 26 646 602
## 2741 N N 649 5491 1447 281 17 138 476 1146 78 28 57 48 582 524
## 2742 N N 685 5676 1468 302 31 130 519 1142 62 41 43 43 695 651
## 2743 N N 618 5517 1349 246 26 146 467 1309 118 34 52 34 700 643
## 2744 N N 629 5552 1446 280 35 107 469 1078 67 26 39 42 691 643
## 2745 N N 706 5599 1511 283 36 159 427 1204 112 32 26 35 760 708
## ERA CG SHO SV IPouts HA HRA BBA SOA E DP FP
## 2740 3.71 2 1 36 4380 1376 121 526 1177 88 144 0.986
## 2741 3.25 7 4 46 4351 1321 127 460 1292 109 160 0.982
## 2742 3.92 6 2 38 4485 1460 176 485 1218 75 134 0.988
## 2743 3.98 3 1 40 4365 1407 156 525 1171 83 140 0.986
## 2744 4.00 2 2 41 4342 1380 145 521 1256 107 126 0.982
## 2745 4.44 1 0 35 4308 1545 136 517 1064 90 162 0.986
## name park attendance BPF PPF teamIDBR
## 2740 Miami Marlins Marlins Park 1586322 102 103 MIA
## 2741 Los Angeles Dodgers Dodger Stadium 3743527 95 95 LAD
## 2742 Arizona Diamondbacks Chase Field 2134795 102 102 ARI
## 2743 San Diego Padres Petco Park 2166691 91 91 SDP
## 2744 San Francisco Giants AT&T Park 3326796 90 89 SFG
## 2745 Colorado Rockies Coors Field 2793828 117 118 COL
## teamIDlahman45 teamIDretro
## 2740 FLO MIA
## 2741 LAN LAN
## 2742 ARI ARI
## 2743 SDN SDN
## 2744 SFN SFN
## 2745 COL COL
This analysis considers four factors (which each have multiple levels), which are the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), and the number of walks (‘BB’) earned by a given team in a given season. In the original dataset “raw_teams”, the factors ‘H’, ‘HR’, ‘SO’, and ‘BB’ are denoted as being integer variables with no specific categorical levels. However, in carrying out this analysis, these factors will be transformed into categorical variables with manually-defined levels. These factor was selected intuitively, since this analysis aims to determine whether or not the the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), or the number of walks (‘BB’) earned by a given team in a given season has a significant effect on the number of regular season losses that a given team earns in a given season.
#Display the summary statistics of "raw_teams".
summary(raw_teams)
## yearID lgID teamID franchID divID
## Min. :1871 AA : 85 CHN : 138 ATL : 138 :1517
## 1st Qu.:1918 AL :1175 PHI : 131 CHC : 138 C: 215
## Median :1961 FL : 16 PIT : 127 CIN : 132 E: 518
## Mean :1954 NL :1399 CIN : 124 PIT : 132 W: 495
## 3rd Qu.:1990 PL : 8 SLN : 122 STL : 132
## Max. :2013 UA : 12 BOS : 113 PHI : 131
## NA's: 50 (Other):1990 (Other):1942
## Rank G Ghome W
## Min. : 1.000 Min. : 6.0 Min. :44.0 Min. : 0.00
## 1st Qu.: 2.000 1st Qu.:153.0 1st Qu.:77.0 1st Qu.: 66.00
## Median : 4.000 Median :157.0 Median :80.0 Median : 77.00
## Mean : 4.132 Mean :150.1 Mean :78.4 Mean : 74.61
## 3rd Qu.: 6.000 3rd Qu.:162.0 3rd Qu.:81.0 3rd Qu.: 87.00
## Max. :13.000 Max. :165.0 Max. :84.0 Max. :116.00
## NA's :399
## L DivWin WCWin LgWin WSWin R
## Min. : 4.00 :1545 :2181 : 28 : 357 Min. : 24.0
## 1st Qu.: 65.00 N: 982 N: 522 N:2449 N:2274 1st Qu.: 612.0
## Median : 76.00 Y: 218 Y: 42 Y: 268 Y: 114 Median : 690.0
## Mean : 74.61 Mean : 682.1
## 3rd Qu.: 86.00 3rd Qu.: 765.0
## Max. :134.00 Max. :1220.0
##
## AB H X2B X3B
## Min. : 211 Min. : 33 Min. : 3.0 Min. : 0.00
## 1st Qu.:5117 1st Qu.:1297 1st Qu.:192.0 1st Qu.: 31.00
## Median :5379 Median :1393 Median :229.0 Median : 42.00
## Mean :5134 Mean :1345 Mean :226.6 Mean : 47.48
## 3rd Qu.:5514 3rd Qu.:1468 3rd Qu.:269.0 3rd Qu.: 61.00
## Max. :5781 Max. :1783 Max. :376.0 Max. :150.00
##
## HR BB SO SB
## Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 40 1st Qu.:425.0 1st Qu.: 501.0 1st Qu.: 64.0
## Median :105 Median :494.0 Median : 746.0 Median : 96.0
## Mean :100 Mean :473.8 Mean : 726.3 Mean :112.8
## 3rd Qu.:148 3rd Qu.:555.0 3rd Qu.: 955.0 3rd Qu.:143.0
## Max. :264 Max. :835.0 Max. :1535.0 Max. :581.0
## NA's :120 NA's :144
## CS HBP SF RA
## Min. : 0.00 Min. : 26.00 Min. :24.00 Min. : 34.0
## 1st Qu.: 35.00 1st Qu.: 47.00 1st Qu.:39.00 1st Qu.: 608.0
## Median : 46.00 Median : 54.50 Median :44.00 Median : 688.0
## Mean : 49.21 Mean : 56.35 Mean :45.09 Mean : 682.1
## 3rd Qu.: 59.00 3rd Qu.: 64.00 3rd Qu.:50.25 3rd Qu.: 765.0
## Max. :191.00 Max. :103.00 Max. :75.00 Max. :1252.0
## NA's :859 NA's :2325 NA's :2325
## ER ERA CG SHO
## Min. : 25.0 Min. :1.220 Min. : 0.0 Min. : 0.000
## 1st Qu.: 498.0 1st Qu.:3.330 1st Qu.: 16.0 1st Qu.: 6.000
## Median : 590.0 Median :3.820 Median : 46.0 Median : 9.000
## Mean : 569.8 Mean :3.815 Mean : 51.5 Mean : 9.435
## 3rd Qu.: 667.0 3rd Qu.:4.310 3rd Qu.: 78.0 3rd Qu.:12.000
## Max. :1023.0 Max. :8.000 Max. :148.0 Max. :32.000
##
## SV IPouts HA HRA
## Min. : 0.00 Min. : 162 Min. : 49 Min. : 0
## 1st Qu.: 9.00 1st Qu.:4071 1st Qu.:1287 1st Qu.: 43
## Median :23.00 Median :4224 Median :1392 Median :107
## Mean :23.25 Mean :4015 Mean :1345 Mean :100
## 3rd Qu.:37.00 3rd Qu.:4339 3rd Qu.:1471 3rd Qu.:147
## Max. :68.00 Max. :4518 Max. :1993 Max. :241
##
## BBA SOA E DP
## Min. : 0.0 Min. : 0.0 Min. : 47.0 Min. : 18.0
## 1st Qu.:426.0 1st Qu.: 499.0 1st Qu.:119.0 1st Qu.:126.0
## Median :496.0 Median : 721.0 Median :147.0 Median :144.0
## Mean :474.1 Mean : 719.9 Mean :188.6 Mean :140.1
## 3rd Qu.:556.0 3rd Qu.: 951.0 3rd Qu.:219.0 3rd Qu.:160.0
## Max. :827.0 Max. :1428.0 Max. :639.0 Max. :217.0
## NA's :317
## FP name park
## Min. :0.7600 Cincinnati Reds : 123 Wrigley Field : 100
## 1st Qu.:0.9600 Pittsburgh Pirates : 123 Sportsman's Park IV: 90
## Median :0.9700 Philadelphia Phillies: 122 Comiskey Park : 80
## Mean :0.9605 St. Louis Cardinals : 114 Fenway Park II : 80
## 3rd Qu.:0.9800 Chicago White Sox : 113 Forbes Field : 60
## Max. :0.9910 Detroit Tigers : 113 Crosley Field : 58
## (Other) :2037 (Other) :2277
## attendance BPF PPF teamIDBR
## Min. : 6088 Min. : 60.0 Min. : 60.0 CHC : 138
## 1st Qu.: 518051 1st Qu.: 97.0 1st Qu.: 97.0 CIN : 137
## Median :1107122 Median :100.0 Median :100.0 STL : 135
## Mean :1317241 Mean :100.2 Mean :100.2 PHI : 134
## 3rd Qu.:1950099 3rd Qu.:103.0 3rd Qu.:103.0 PIT : 132
## Max. :4483350 Max. :129.0 Max. :141.0 BOS : 121
## NA's :279 (Other):1948
## teamIDlahman45 teamIDretro
## CHN : 138 CHN : 138
## PHI : 131 PHI : 131
## PIT : 127 PIT : 127
## CIN : 124 CIN : 124
## SLN : 122 SLN : 122
## BOS : 113 BOS : 113
## (Other):1990 (Other):1990
#Display the names found in "raw_teams".
names(raw_teams)
## [1] "yearID" "lgID" "teamID" "franchID"
## [5] "divID" "Rank" "G" "Ghome"
## [9] "W" "L" "DivWin" "WCWin"
## [13] "LgWin" "WSWin" "R" "AB"
## [17] "H" "X2B" "X3B" "HR"
## [21] "BB" "SO" "SB" "CS"
## [25] "HBP" "SF" "RA" "ER"
## [29] "ERA" "CG" "SHO" "SV"
## [33] "IPouts" "HA" "HRA" "BBA"
## [37] "SOA" "E" "DP" "FP"
## [41] "name" "park" "attendance" "BPF"
## [45] "PPF" "teamIDBR" "teamIDlahman45" "teamIDretro"
#Display the structure of "raw_teams".
str(raw_teams)
## 'data.frame': 2745 obs. of 48 variables:
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
## $ lgID : Factor w/ 6 levels "AA","AL","FL",..: NA NA NA NA NA NA NA NA NA NA ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 97 31 24 142 90 136 56 39 111 24 ...
## $ franchID : Factor w/ 120 levels "ALT","ANA","ARI",..: 85 36 13 77 70 109 56 25 91 13 ...
## $ divID : Factor w/ 4 levels "","C","E","W": 1 1 1 1 1 1 1 1 1 1 ...
## $ Rank : int 1 2 3 4 5 6 7 8 9 1 ...
## $ G : int 28 28 31 32 33 29 19 29 25 48 ...
## $ Ghome : int NA NA NA NA NA NA NA NA NA NA ...
## $ W : int 21 19 20 15 16 13 7 10 4 39 ...
## $ L : int 7 9 10 15 17 15 12 19 21 8 ...
## $ DivWin : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ WCWin : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ LgWin : Factor w/ 3 levels "","N","Y": 3 2 2 2 2 2 2 2 2 3 ...
## $ WSWin : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ R : int 376 302 401 310 302 351 137 249 231 521 ...
## $ AB : int 1281 1196 1372 1353 1404 1248 746 1186 1036 2137 ...
## $ H : int 410 323 426 375 403 384 178 328 274 677 ...
## $ X2B : int 66 52 70 54 43 51 19 35 44 114 ...
## $ X3B : int 27 21 37 26 21 34 8 40 25 31 ...
## $ HR : int 9 10 3 6 1 6 2 7 3 7 ...
## $ BB : int 46 60 60 48 33 49 33 26 38 28 ...
## $ SO : int 23 22 19 13 15 19 9 25 30 26 ...
## $ SB : int 56 69 73 48 46 62 16 18 53 47 ...
## $ CS : int NA NA NA NA NA NA NA NA NA 14 ...
## $ HBP : int NA NA NA NA NA NA NA NA NA NA ...
## $ SF : int NA NA NA NA NA NA NA NA NA NA ...
## $ RA : int 266 241 303 303 313 362 243 341 287 236 ...
## $ ER : int 137 77 109 137 121 153 97 116 108 95 ...
## $ ERA : num 4.95 2.76 3.55 4.37 3.72 5.51 5.17 4.11 4.3 1.99 ...
## $ CG : int 27 25 22 32 32 28 19 23 23 41 ...
## $ SHO : int 0 0 1 0 1 0 1 0 1 3 ...
## $ SV : int 0 1 3 0 0 0 0 0 0 1 ...
## $ IPouts : int 747 753 828 846 879 750 507 762 678 1290 ...
## $ HA : int 329 308 367 371 373 431 261 346 315 438 ...
## $ HRA : int 3 6 2 4 7 4 5 13 3 0 ...
## $ BBA : int 53 28 42 45 42 75 21 53 34 27 ...
## $ SOA : int 16 22 23 13 22 12 17 34 16 0 ...
## $ E : int 194 218 225 217 227 198 163 223 220 263 ...
## $ DP : int NA NA NA NA NA NA NA NA NA NA ...
## $ FP : num 0.84 0.82 0.83 0.85 0.83 0.84 0.8 0.81 0.82 0.87 ...
## $ name : Factor w/ 139 levels "Altoona Mountain City",..: 97 42 17 135 93 131 63 51 111 17 ...
## $ park : Factor w/ 213 levels "","23rd Street Grounds",..: 87 197 170 130 199 80 77 116 4 170 ...
## $ attendance : int NA NA NA NA NA NA NA NA NA NA ...
## $ BPF : int 102 104 103 94 90 101 101 96 97 105 ...
## $ PPF : int 98 102 98 98 88 100 107 100 99 100 ...
## $ teamIDBR : Factor w/ 101 levels "ALT","ANA","ARI",..: 4 21 10 65 62 93 42 25 77 10 ...
## $ teamIDlahman45: Factor w/ 148 levels "ALT","ANA","ARI",..: 96 31 24 140 89 135 56 39 110 24 ...
## $ teamIDretro : Factor w/ 149 levels "ALT","ANA","ARI",..: 96 31 24 141 89 135 56 39 110 24 ...
In this dataset, there are a few variables that can be considered to be continuous variables; these variables are the ones which are categorized as being numeric variables. By this standard, the continuous variables that exist in this dataset include ‘ERA’ (which refers to a team’s overall earned run average) and ‘FP’ (which refers to a team’s overall fielding percentage). The main factors that our analysis considers, ‘H’, ‘HR’, ‘SO’, and ‘BB’, currently exist as integer variables. However, upon manually defining these given integer values into specific “integer-range” levels, these factors will become defined as being a categorical variable.
This analysis will consider one response variable, ‘L’, which denotes the number of regular season losses that a given team earned in a given year.
As a whole, Lahman’s Baseball Database contains information pertaining to pitching, hitting, and fielding statistics for Major League Baseball from the years 1871 through 2013. It includes data from the two current leagues (American and National), the four other “major” leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875. It contains 2,745 observations of 48 variables, which are defined below (Lahman, 1996-2014) [1]:
yearID [Year]
lgID [League]
teamID [Team]
franchID [Franchise (links to TeamsFranchise table)]
divID [Team’s division]
Rank [Position in final standings]
G [Games played]
GHome [Games played at home]
W [Wins]
L [Losses]
DivWin [Division Winner (Y or N)]
WCWin [Wild Card Winner (Y or N)]
LgWin [League Champion (Y or N)]
WSWin [World Series Winner (Y or N)]
R [Runs scored]
AB [At bats]
H [Hits by batters]
2B [Doubles]
3B [Triples]
HR [Homeruns by batters]
BB [Walks by batters]
SO [Strikeouts by batters]
SB [Stolen bases]
CS [Caught stealing]
HBP [Batters hit by pitch]
SF [Sacrifice flies]
RA [Opponents runs scored]
ER [Earned runs allowed]
ERA [Earned run average]
CG [Complete games]
SHO [Shutouts]
SV [Saves]
IPOuts [Outs Pitched (innings pitched x 3)]
HA [Hits allowed]
HRA [Homeruns allowed]
BBA [Walks allowed]
SOA [Strikeouts by pitchers]
E [Errors]
DP [Double Plays]
FP [Fielding percentage]
name [Team’s full name]
park [Name of team’s home ballpark]
attendance [Home attendance total]
BPF [Three-year park factor for batters]
PPF [Three-year park factor for pitchers]
teamIDBR [Team ID used by Baseball Reference website]
teamIDlahman45 [Team ID used in Lahman database version 4.5]
teamIDretro [Team ID used by Retrosheet]
Because this dataset is a completely comprehensive and accurate statistical resource for baseball statistics gathered from 1871-2013 (that, in theory, contains every possible observation that could be collected during those years), we should definitely be able to make the assumption that this dataset exhibits randomization without any bias.
In this experiment, we are trying to determine whether or not the variation that is observed in the response variable (which corresponds to ‘L’ in this analysis) can be explained by the variation existent in the four treatments being considered in the experiment (which correspond to ‘H’, ‘HR’, ‘SO’, and ‘BB’). Therefore, the null hypothesis that is being tested states that the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), and the number of walks (‘BB’) earned by a given team in a given season do not have a significant effect on the number of regular season losses that a given team earns in a given year. Alternately, the alternative hypothesis that is being tested states that the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), and the number of walks (‘BB’) earned by a given team in a given season do have a significant effect on the number of regular season losses that a given team earns in a given year. In carrying out this analysis, we perform an analysis of variance (ANOVA) for the number of regular season losses (‘L’) and use Response Surface Methods to see if there is a significant difference in the means for this response variable when analyzing the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), and the number of walks (‘BB’) earned by a given team in a given season.
The rationale for this design lies primarily in the fact that we’re trying to determine if the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), or the number of walks (‘BB’) earned by a given team in a given season has any effect on the number of regular season losses (‘L’) that a given team earns in a given year. So, since the number of regular season losses is a useful and relevant metric to consider when determining the most important performance-related factors that result in earned losses for a given team, this design of a four-factor, multi-level experiment (as it corresponds to an analysis of variance and the use of response surface methods) was crafted to see if the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), or the number of walks (‘BB’) earned by a given team in a given season has a significant effect on the number of regular season losses that a given team earns in a given year. The reason for using Response Surface Methods in this experiment has to do with the fact that the response variable of interest in this scenario (‘L’) is likely influenced by the four factors that are being considered here (‘H’, ‘HR’, ‘SO’, and ‘BB’). Since this nature of influence exists in this scenario, it is likely true that the response variable ‘L’ can be optimized by a given set of these input variables over a specified region of interest (especially because the factors ‘H’, ‘HR’, ‘SO’, and ‘BB’ are known and controllable), allowing one to determine the factors’ main effects and interaction effects. By performing the analysis in this manner, we can hope to receive some insight regarding the different factors that come into play that typically result in regular season losses for baseball teams.
Since the original assumption claimed that the entirety of Lahman’s Baseball Database exhibits randomization, we did not need to worry about randomizing our data any further to ensure that a completely randomized design is created. However, in carrying out this analysis in a reasonable and logical way, a chronologically-subsetted dataset was extracted from the entirety of Lahman’s Baseball Database (all data ranging from 1983-2013, for a grand total of 30 years of data, was used in this analysis). In determining which variables to include in this experiment’s dataframe (which have been noted previously), a pairs plot was generated to give us some insight into the different effects that various variables in the dataset have on the response variable, ‘L’. To properly generate this pairs plot, the most relevant/interesting variables were included in a subset of data extracted from the larger “raw_teams” dataset, including ‘yearID’, ‘lgID’, ‘teamID’, ‘W’, ‘L’, ‘R’, ‘AB’, ‘H’, ‘HR’, ‘BB’, ‘SO’, ‘ERA’, and ‘E’. Since the other 35 variables are not being considered further in this analysis, it did not make sense to include them anymore. In selecting four specific factors to include in the ANOVA/RSM-specific analysis, the factors ‘H’, ‘HR’, ‘SO’, and ‘BB’ were chosen intuitively.
#Create a pairs plot for 'L' against 'H'.
par(mfrow=c(1,1))
pairs(L~H, raw_teams)
#Create a pairs plot for 'L' against 'HR'.
par(mfrow=c(1,1))
pairs(L~HR, raw_teams)
#Create a pairs plot for 'L' against 'SO'.
par(mfrow=c(1,1))
pairs(L~SO, raw_teams)
#Create a pairs plot for 'L' against 'BB'.
par(mfrow=c(1,1))
pairs(L~BB, raw_teams)
#Create subset of the dataset that only includes team data from the years 1983-2013.
teams_thirty <- subset(raw_teams, raw_teams$yearID >= 1983, select=c(yearID:teamID,W,L,R:H,HR:SO,ERA,E))
#Display the summary statistics of "teams_thirty".
summary(teams_thirty)
## yearID lgID teamID W L
## Min. :1983 AA: 0 ATL : 31 Min. : 43.0 Min. : 40.0
## 1st Qu.:1991 AL:435 BAL : 31 1st Qu.: 72.0 1st Qu.: 72.0
## Median :1999 FL: 0 BOS : 31 Median : 80.0 Median : 79.0
## Mean :1999 NL:445 CHA : 31 Mean : 79.9 Mean : 79.9
## 3rd Qu.:2006 PL: 0 CHN : 31 3rd Qu.: 89.0 3rd Qu.: 88.0
## Max. :2013 UA: 0 CIN : 31 Max. :116.0 Max. :119.0
## (Other):694
## R AB H HR
## Min. : 466.0 Min. :3856 Min. : 963 Min. : 58.0
## 1st Qu.: 670.8 1st Qu.:5469 1st Qu.:1385 1st Qu.:127.0
## Median : 731.0 Median :5524 Median :1441 Median :154.5
## Mean : 731.9 Mean :5468 Mean :1434 Mean :155.6
## 3rd Qu.: 791.0 3rd Qu.:5588 3rd Qu.:1499 3rd Qu.:180.0
## Max. :1009.0 Max. :5781 Max. :1684 Max. :264.0
##
## BB SO ERA E
## Min. :319.0 Min. : 568 Min. :2.910 Min. : 54.0
## 1st Qu.:479.8 1st Qu.: 910 1st Qu.:3.800 1st Qu.: 99.0
## Median :527.0 Median :1014 Median :4.180 Median :112.0
## Mean :530.2 Mean :1011 Mean :4.211 Mean :112.8
## 3rd Qu.:578.5 3rd Qu.:1107 3rd Qu.:4.580 3rd Qu.:126.0
## Max. :775.0 Max. :1535 Max. :6.380 Max. :179.0
##
#Display the names found in "teams_thirty".
names(teams_thirty)
## [1] "yearID" "lgID" "teamID" "W" "L" "R" "AB"
## [8] "H" "HR" "BB" "SO" "ERA" "E"
#Display the structure of "teams_thirty".
str(teams_thirty)
## 'data.frame': 880 obs. of 13 variables:
## $ yearID: int 1983 1983 1983 1983 1983 1983 1983 1983 1983 1983 ...
## $ lgID : Factor w/ 6 levels "AA","AL","FL",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ teamID: Factor w/ 149 levels "ALT","ANA","ARI",..: 5 52 93 134 83 16 45 33 66 131 ...
## $ W : int 98 92 91 89 87 78 70 99 79 77 ...
## $ L : int 64 70 71 73 75 84 92 63 83 85 ...
## $ R : int 799 789 770 795 764 724 704 800 696 639 ...
## $ AB : int 5546 5592 5631 5581 5620 5590 5476 5484 5598 5610 ...
## $ H : int 1492 1530 1535 1546 1556 1512 1451 1439 1515 1429 ...
## $ HR : int 168 156 153 167 132 142 86 157 109 106 ...
## $ BB : int 601 508 533 510 475 536 605 527 397 442 ...
## $ SO : int 800 831 686 810 665 758 691 888 722 767 ...
## $ ERA : num 3.63 3.8 3.86 4.12 4.02 4.34 4.43 3.67 4.25 3.31 ...
## $ E : int 121 124 139 115 113 130 122 120 164 113 ...
#Create a pairs plot for "teams_thirty".
par(mfrow=c(1,1))
pairs(teams_thirty)
Upon generating these pairs plots, it appears that the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), and the number of walks (‘BB’) earned by a given team in a given season correlates positively with team losses (‘L’). Therefore, ‘H’, ‘HR’, ‘SO’, and ‘BB’ seem to be the most interesting/relevant factors to include in this analysis.
In this experiment, there are no replicates or repeated measures present.
In order to transform our integer factor variables into categorical factor variables, blocking was used in this design. In transforming the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), and the number of walks (‘BB’) into a categorical variable, three different levels were defined for each respective factor, designating a low number of hits/homeruns/strikeouts/walks, a medium number of hits/homeruns/strikeouts/walks, and a high number of hits/homeruns/strikeouts/walks earned by a given team in a given season. These different levels were determined for each respective factor by calculating the first and third quartiles of each factor’s respective values in the subsetted dataset “teams_thirty” (see numeric levels in R code below). Therefore, in this newly subsetted dataset, ‘H’, ‘HR’, ‘SO’, and ‘BB’ each have three distinct levels (“Low”, “Medium”, and “High”).
#Determine quartile values for each factor.
summary(teams_thirty$H)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 963 1385 1441 1434 1499 1684
summary(teams_thirty$HR)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 58.0 127.0 154.5 155.6 180.0 264.0
summary(teams_thirty$SO)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 568 910 1014 1011 1107 1535
summary(teams_thirty$BB)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 319.0 479.8 527.0 530.2 578.5 775.0
#Transform 'H' into categorical variables (Low, Medium, and High).
teams_thirty$H[teams_thirty$H > 0 & teams_thirty$H <= 1385] = "Low"
teams_thirty$H[teams_thirty$H > 1385 & teams_thirty$H <= 1499] = "Medium"
teams_thirty$H[teams_thirty$H > 1499 & teams_thirty$H <= 1684] = "High"
#Categorize 'H' as a factor and display its resulting levels.
teams_thirty$H = as.factor(teams_thirty$H)
levels(teams_thirty$H)
## [1] "High" "Low" "Medium"
#Transform 'HR' into categorical variables (Low, Medium, and High).
teams_thirty$HR[teams_thirty$HR > 0 & teams_thirty$HR <= 127] = "Low"
teams_thirty$HR[teams_thirty$HR > 127 & teams_thirty$HR <= 180] = "Medium"
teams_thirty$HR[teams_thirty$HR > 180 & teams_thirty$HR <= 264] = "High"
#Categorize 'HR' as a factor and display its resulting levels.
teams_thirty$HR = as.factor(teams_thirty$HR)
levels(teams_thirty$HR)
## [1] "High" "Low" "Medium"
#Transform 'SO' into categorical variables (Low, Medium, and High).
teams_thirty$SO[teams_thirty$SO > 0 & teams_thirty$SO <= 910] = "Low"
teams_thirty$SO[teams_thirty$SO > 1107 & teams_thirty$SO <= 1535] = "High"
union(which(teams_thirty$SO == "High"),which(teams_thirty$SO == "Low"))
## [1] 92 94 122 211 264 354 357 361 366 370 371 372 376 380 382 383 385
## [18] 386 388 389 390 392 397 400 408 414 415 416 417 425 430 442 445 446
## [35] 447 448 450 455 459 471 475 477 478 480 483 485 490 497 500 505 508
## [52] 509 512 514 519 526 528 530 536 537 538 539 540 544 565 569 570 573
## [69] 579 584 587 598 600 601 602 604 609 623 627 629 642 644 657 658 659
## [86] 662 663 664 665 670 671 674 680 683 684 686 689 690 691 693 694 695
## [103] 696 697 698 703 706 712 713 715 716 719 721 723 727 728 730 732 734
## [120] 737 738 742 746 747 748 750 751 752 755 757 758 759 760 763 764 766
## [137] 767 768 769 774 775 777 779 780 782 783 785 787 788 789 790 791 792
## [154] 796 797 798 799 800 804 807 808 809 810 812 813 815 816 817 819 820
## [171] 822 826 827 828 829 830 831 833 834 835 836 837 838 839 840 841 842
## [188] 844 845 847 848 849 850 852 854 855 856 857 858 859 860 861 863 864
## [205] 865 866 867 868 869 870 871 872 873 874 875 876 877 878 880 1 2
## [222] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## [239] 22 23 24 28 29 30 31 32 33 34 36 37 38 39 40 45 46
## [256] 47 48 49 50 53 54 56 57 58 59 60 61 62 63 64 66 67
## [273] 68 69 72 73 74 75 76 77 79 81 82 85 86 95 104 109 112
## [290] 115 129 131 132 136 137 139 141 142 144 145 148 149 153 155 156 159
## [307] 160 161 163 164 165 168 169 170 173 179 180 183 186 188 191 194 195
## [324] 196 198 200 202 207 210 212 213 215 216 217 220 223 224 226 227 229
## [341] 236 237 238 239 241 242 243 244 246 247 248 249 252 256 257 260 261
## [358] 262 265 266 268 271 273 276 277 281 289 290 291 292 293 294 295 296
## [375] 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313
## [392] 314 315 316 317 318 319 320 323 324 326 327 328 329 338 340 343 345
## [409] 350 374 409 432 439 460 464 469 489 495 532 559 561 563 607 608 619
## [426] 621 622 638 641 647 649 668 673 681 682 687 699 700 714 765
ind <- setdiff(seq(1:880),union(which(teams_thirty$SO == "High"),which(teams_thirty$SO == "Low")))
teams_thirty$SO[ind] = "Medium"
#Categorize 'SO' as a factor and display its resulting levels.
teams_thirty$SO = as.factor(teams_thirty$SO)
levels(teams_thirty$SO)
## [1] "High" "Low" "Medium"
#Transform 'BB' into categorical variables (Low, Medium, and High).
teams_thirty$BB[teams_thirty$BB > 0 & teams_thirty$BB <= 479.8] = "Low"
teams_thirty$BB[teams_thirty$BB > 479.8 & teams_thirty$BB <= 578.5] = "Medium"
teams_thirty$BB[teams_thirty$BB > 578.5 & teams_thirty$BB <= 775] = "High"
#Categorize 'BB' as a factor and display its resulting levels.
teams_thirty$BB = as.factor(teams_thirty$BB)
levels(teams_thirty$BB)
## [1] "High" "Low" "Medium"
In beginning to display this data graphically, summary statistics were gathered for the newly created dataset, “teams_thirty”. Additionally, histograms and boxplots were created to represent the different observations of regular season losses existent within this subsetted dataset that contains data from 1983-2013.
#Display the summary statistics of "teams_thirty".
summary(teams_thirty)
## yearID lgID teamID W L
## Min. :1983 AA: 0 ATL : 31 Min. : 43.0 Min. : 40.0
## 1st Qu.:1991 AL:435 BAL : 31 1st Qu.: 72.0 1st Qu.: 72.0
## Median :1999 FL: 0 BOS : 31 Median : 80.0 Median : 79.0
## Mean :1999 NL:445 CHA : 31 Mean : 79.9 Mean : 79.9
## 3rd Qu.:2006 PL: 0 CHN : 31 3rd Qu.: 89.0 3rd Qu.: 88.0
## Max. :2013 UA: 0 CIN : 31 Max. :116.0 Max. :119.0
## (Other):694
## R AB H HR BB
## Min. : 466.0 Min. :3856 High :216 High :217 High :220
## 1st Qu.: 670.8 1st Qu.:5469 Low :223 Low :222 Low :220
## Median : 731.0 Median :5524 Medium:441 Medium:441 Medium:440
## Mean : 731.9 Mean :5468
## 3rd Qu.: 791.0 3rd Qu.:5588
## Max. :1009.0 Max. :5781
##
## SO ERA E
## High :219 Min. :2.910 Min. : 54.0
## Low :221 1st Qu.:3.800 1st Qu.: 99.0
## Medium:440 Median :4.180 Median :112.0
## Mean :4.211 Mean :112.8
## 3rd Qu.:4.580 3rd Qu.:126.0
## Max. :6.380 Max. :179.0
##
#Display the names found in "teams_thirty".
names(teams_thirty)
## [1] "yearID" "lgID" "teamID" "W" "L" "R" "AB"
## [8] "H" "HR" "BB" "SO" "ERA" "E"
#Display the structure of "teams_thirty".
str(teams_thirty)
## 'data.frame': 880 obs. of 13 variables:
## $ yearID: int 1983 1983 1983 1983 1983 1983 1983 1983 1983 1983 ...
## $ lgID : Factor w/ 6 levels "AA","AL","FL",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ teamID: Factor w/ 149 levels "ALT","ANA","ARI",..: 5 52 93 134 83 16 45 33 66 131 ...
## $ W : int 98 92 91 89 87 78 70 99 79 77 ...
## $ L : int 64 70 71 73 75 84 92 63 83 85 ...
## $ R : int 799 789 770 795 764 724 704 800 696 639 ...
## $ AB : int 5546 5592 5631 5581 5620 5590 5476 5484 5598 5610 ...
## $ H : Factor w/ 3 levels "High","Low","Medium": 3 1 1 1 1 1 3 3 1 3 ...
## $ HR : Factor w/ 3 levels "High","Low","Medium": 3 3 3 3 3 3 2 3 2 2 ...
## $ BB : Factor w/ 3 levels "High","Low","Medium": 1 3 3 3 2 3 1 3 2 2 ...
## $ SO : Factor w/ 3 levels "High","Low","Medium": 2 2 2 2 2 2 2 2 2 2 ...
## $ ERA : num 3.63 3.8 3.86 4.12 4.02 4.34 4.43 3.67 4.25 3.31 ...
## $ E : int 121 124 139 115 113 130 122 120 164 113 ...
#Display the head and tail of "teams_thirty".
head(teams_thirty)
## yearID lgID teamID W L R AB H HR BB SO ERA E
## 1866 1983 AL BAL 98 64 799 5546 Medium Medium High Low 3.63 121
## 1867 1983 AL DET 92 70 789 5592 High Medium Medium Low 3.80 124
## 1868 1983 AL NYA 91 71 770 5631 High Medium Medium Low 3.86 139
## 1869 1983 AL TOR 89 73 795 5581 High Medium Medium Low 4.12 115
## 1870 1983 AL ML4 87 75 764 5620 High Medium Low Low 4.02 113
## 1871 1983 AL BOS 78 84 724 5590 High Medium Medium Low 4.34 130
tail(teams_thirty)
## yearID lgID teamID W L R AB H HR BB SO ERA
## 2740 2013 NL MIA 62 100 513 5449 Low Low Low High 3.71
## 2741 2013 NL LAN 92 70 649 5491 Medium Medium Low High 3.25
## 2742 2013 NL ARI 81 81 685 5676 Medium Medium Medium High 3.92
## 2743 2013 NL SDN 76 86 618 5517 Low Medium Low High 3.98
## 2744 2013 NL SFN 76 86 629 5552 Medium Low Low Medium 4.00
## 2745 2013 NL COL 74 88 706 5599 High Medium Low High 4.44
## E
## 2740 88
## 2741 109
## 2742 75
## 2743 83
## 2744 107
## 2745 90
#Display the levels of 'H' within "teams_thirty".
levels(teams_thirty$H)
## [1] "High" "Low" "Medium"
#Display the levels of 'HR' within "teams_thirty".
levels(teams_thirty$HR)
## [1] "High" "Low" "Medium"
#Display the levels of 'SO' within "teams_thirty".
levels(teams_thirty$SO)
## [1] "High" "Low" "Medium"
#Display the levels of 'BB' within "teams_thirty".
levels(teams_thirty$BB)
## [1] "High" "Low" "Medium"
par(mfrow=c(1,1))
#Create a histogram of Regular Season Losses ('L') for teams from 1983-2013.
hist(teams_thirty$L, xlim=c(35,120), xlab = "Number of Regular Season Losses", main = "Histogram of MLB Regular Season Losses from 1983-2013")
par(mfrow=c(1,1))
#Create a boxplot of Regular Season Losses ('L') [for all teams from 1983-2013].
boxplot(teams_thirty$L~teams_thirty$teamID, main = "Boxplot of MLB Regular Season Losses from 1983-2013", ylim = c(35,120), xlab = "MLB Teams", ylab = "Regular Season Losses")
In order to determine if the variation that is observed in the response variable (which corresponds to the number of regular season losses in this analysis) can be explained by the variation existent in the treatment (which corresponds to the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), and the number of walks (‘BB’)), an analysis of variance (ANOVA) is performed as a means for analyzing the differences in regular season losses for each of the different numbers of hits earned by a given team in a given year (ranging from 1983-2013) contained within the dataset.
For the ANOVA model that is designed in this experiment, the null hypothesis that is being tested (which we will either reject or fail to reject by the end of our analysis) states that the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), and the number of walks (‘BB’) earned by a given team in a given year do not have a significant effect on the number of regular season losses that a given team earns in a given year, implying that the differences in mean values of the number of regular season losses earned by a given team in a given year were solely the result of randomization in this experiment. In other words, if we reject the null hypothesis, we would infer that the differences in mean values of the numbers of regular season losses earned by a given team in a given year for each of the corresponding levels of the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), and the number of walks (‘BB’) in this dataset is caused by something other than randomization, leading us to believe that the variation that is observed in the mean values of the numbers of regular season losses earned by a given team in a given year can be explained by the variation existent in the different numbers of hits (‘H’), numbers of homeruns (‘HR’), numbers of strikeouts (‘SO’), and numbers of walks (‘BB’) for a given team in a given year being considered in this analysis. Alternately, if we fail to reject the null hypothesis, we would infer that the variation that is observed in the mean values of the numbers of regular season losses earned by a given team in a given year cannot be explained by the variation existent in the different numbers of hits (‘H’), numbers of homeruns (‘HR’), numbers of strikeouts (‘SO’), and numbers of walks (‘BB’) for a given team in a given year being considered in this analysis and, as such, is likely caused by randomization.
#Perform an analysis of variance (ANOVA) for the different mean values observed in the number of regular season losses earned in a given year by a given team, given the factors 'H', 'HR', 'SO', and 'BB'.
model_losses <- aov(L~H*HR*SO*BB,teams_thirty)
anova(model_losses)
## Analysis of Variance Table
##
## Response: L
## Df Sum Sq Mean Sq F value Pr(>F)
## H 2 3023 1511.6 13.6591 1.465e-06 ***
## HR 2 2083 1041.4 9.4103 9.118e-05 ***
## SO 2 5537 2768.6 25.0182 2.865e-11 ***
## BB 2 6330 3164.8 28.5983 9.972e-13 ***
## H:HR 4 702 175.6 1.5866 0.175868
## H:SO 4 3793 948.2 8.5678 8.969e-07 ***
## HR:SO 4 315 78.8 0.7125 0.583516
## H:BB 4 3232 808.0 7.3009 8.891e-06 ***
## HR:BB 4 1759 439.7 3.9730 0.003362 **
## SO:BB 4 978 244.5 2.2092 0.066299 .
## H:HR:SO 7 631 90.1 0.8141 0.575620
## H:HR:BB 8 1104 138.0 1.2468 0.268570
## H:SO:BB 8 2229 278.6 2.5172 0.010471 *
## HR:SO:BB 7 1346 192.3 1.7381 0.096887 .
## H:HR:SO:BB 7 389 55.6 0.5026 0.832966
## Residuals 810 89639 110.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the analysis of variance (ANOVA) that is performed where ‘H’ is analyzed against the response variable ‘L’, a p-value = 1.465e-06 is returned, indicating that there is roughly a probability of 1.465e-06 that the resulting associated F-value (13.6591) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the numbers of regular season losses earned by a given team in a given year can be explained by the variation existent in the different numbers of earned hits for a given team in a given year being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)
For the analysis of variance (ANOVA) that is performed where ‘HR’ is analyzed against the response variable ‘L’, a p-value = 9.118e-05 is returned, indicating that there is roughly a probability of 9.118e-05 that the resulting associated F-value (9.4103) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the numbers of regular season losses earned by a given team in a given year can be explained by the variation existent in the different numbers of earned homeruns for a given team in a given year being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)
For the analysis of variance (ANOVA) that is performed where ‘SO’ is analyzed against the response variable ‘L’, a p-value = 2.865e-11 is returned, indicating that there is roughly a probability of 2.865e-11 that the resulting associated F-value (25.0182) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the numbers of regular season losses earned by a given team in a given year can be explained by the variation existent in the different numbers of earned strikeouts for a given team in a given year being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)
For the analysis of variance (ANOVA) that is performed where ‘BB’ is analyzed against the response variable ‘L’, a p-value = 9.972e-13 is returned, indicating that there is roughly a probability of 9.972e-13 that the resulting associated F-value (28.5983) is the result of solely randomization. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation that is observed in the mean values of the numbers of regular season losses earned by a given team in a given year can be explained by the variation existent in the different numbers of earned walks for a given team in a given year being considered in this analysis and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)
For the analysis of variance (ANOVA) that is performed where the different interaction combinations of ‘H’, ‘HR’, ‘SO’, and ‘BB’ are analyzed against the response variable ‘W’, a p-value < 0.05 is returned for the interactions ‘H:SO’, ‘H:BB’, ‘HR:BB’, and ‘H:SO:BB’ (see resulting p-values in table). Therefore, this indicates that there is roughly a probability < 0.05 that the resulting associated F-values for these specific interactions (see resulting F-values in table) are the result of solely randomization. Therefore, based on this result, we would fail to reject the null hypothesis for these specific interactions, leading us to infer that the variation that is observed in the mean values of the numbers of regular season wins earned by a given team in a given year cannot be explained by the variation existent in the factor-interactions ‘H:SO’, ‘H:BB’, ‘HR:BB’, and ‘H:SO:BB’ being considered in this analysis and, as such, is likely not caused solely by randomization. (Note: The resulting associated F-values for all of the other factor-interactions are likely the result of solely randomization.)
In further carrying out this analysis, we can compute Tukey Honest Significant Differences (via “TukeyHSD()”) as a means for determining the specific levels of the factors ‘H’, ‘HR’, ‘SO’, and ‘BB’ existent in this analysis that are truly independent from each other and that significantly affect the response variable, ‘L’.
#Perform a TukeyHSD Test for "model_losses" when considering the factor 'H'.
Tukey_losses_H = TukeyHSD(model_losses, which = "H", ordered = FALSE, conf.level = 0.95)
Tukey_losses_H
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = L ~ H * HR * SO * BB, data = teams_thirty)
##
## $H
## diff lwr upr p adj
## Low-High 4.878239 2.520145 7.236332 0.0000043
## Medium-High 3.846655 1.795283 5.898028 0.0000360
## Medium-Low -1.031583 -3.061230 0.998063 0.4574313
par(mfrow=c(1,1))
plot(Tukey_losses_H)
#Perform a TukeyHSD Test for "model_losses" when considering the factor 'HR'.
Tukey_losses_HR = TukeyHSD(model_losses, which = "HR", ordered = FALSE, conf.level = 0.95)
Tukey_losses_HR
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = L ~ H * HR * SO * BB, data = teams_thirty)
##
## $HR
## diff lwr upr p adj
## Low-High 3.4898172 1.131871 5.847764 0.0015579
## Medium-High 3.3010811 1.252884 5.349279 0.0004853
## Medium-Low -0.1887361 -2.221416 1.843944 0.9741386
par(mfrow=c(1,1))
plot(Tukey_losses_HR)
#Perform a TukeyHSD Test for "model_losses" when considering the factor 'SO'.
Tukey_losses_SO = TukeyHSD(model_losses, which = "SO", ordered = FALSE, conf.level = 0.95)
Tukey_losses_SO
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = L ~ H * HR * SO * BB, data = teams_thirty)
##
## $SO
## diff lwr upr p adj
## Low-High -6.455233 -8.810370 -4.1000954 0.0000000
## Medium-High -2.889600 -4.932290 -0.8469109 0.0026886
## Medium-Low 3.565632 1.529123 5.6021410 0.0001284
par(mfrow=c(1,1))
plot(Tukey_losses_SO)
#Perform a TukeyHSD Test for "model_losses" when considering the factor 'BB'.
Tukey_losses_BB = TukeyHSD(model_losses, which = "BB", ordered = FALSE, conf.level = 0.95)
Tukey_losses_BB
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = L ~ H * HR * SO * BB, data = teams_thirty)
##
## $BB
## diff lwr upr p adj
## Low-High 6.774948 4.419836 9.13006128 0.0000000
## Medium-High 4.830871 2.791284 6.87045874 0.0000001
## Medium-Low -1.944077 -3.983665 0.09551026 0.0655365
par(mfrow=c(1,1))
plot(Tukey_losses_BB)
After observing the results of these Tukey Honest Significant Differences for the ANOVA model that would consider ‘L’ with the factors ‘H’, ‘HR’, ‘SO’, and ‘BB’, respectively, it’s seemingly clear that each of the different level-comparisons (for each individual factor being considered) within the model (excluding the level-interaction “Medium-Low” for ‘H’, the level-interaction “Medium-Low” for ‘HR’, and the level-interaction “Medium-Low” for ‘BB’), when considered individually, suggests a significant effect on ‘L’ that is not due solely to randomization (since the p-value for every other level-interaction is less than 0.05, leading us to reject the null hypothesis that the number of hits, the number of homeruns, the number of strikeouts, and the number of walks earned by a given team in a given year likely does not have a significant effect on the number of regular season losses that a given team earns in a given year).
In order to determine the extent to which the variation that is observed in the response variable (which corresponds to the number of regular season losses in this analysis) can be explained by the variation existent in the treatment (which corresponds to the number of hits (‘H’), the number of homeruns (‘HR’), the number of strikeouts (‘SO’), and the number of walks (‘BB’)), response surface methods can be used in this experiment that will allow us to determine the most optimal “operating conditions” for the response variable ‘L’.
#Create a new dataset to be used for the Response Surface Methods analysis.
teams_rsm <- teams_thirty[,]
#Convert the factor 'H' into an integer variable designated by its levels "Low", "Medium" and "High".
teams_rsm$H <- as.character(teams_rsm$H)
teams_rsm$H[teams_rsm$H == "Low"] <- 1
teams_rsm$H[teams_rsm$H == "Medium"] <- 2
teams_rsm$H[teams_rsm$H == "High"] <- 3
#Categorize 'H' as an integer variable and display its resulting structure.
teams_rsm$H = as.integer(teams_rsm$H)
str(teams_rsm$H)
## int [1:880] 2 3 3 3 3 3 2 2 3 2 ...
#Convert the factor 'HR' into an integer variable designated by its levels "Low", "Medium" and "High".
teams_rsm$HR <- as.character(teams_rsm$HR)
teams_rsm$HR[teams_rsm$HR == "Low"] <- 1
teams_rsm$HR[teams_rsm$HR == "Medium"] <- 2
teams_rsm$HR[teams_rsm$HR == "High"] <- 3
#Categorize 'HR' as an integer variable and display its resulting structure.
teams_rsm$HR = as.integer(teams_rsm$HR)
str(teams_rsm$HR)
## int [1:880] 2 2 2 2 2 2 1 2 1 1 ...
#Convert the factor 'SO' into an integer variable designated by its levels "Low", "Medium" and "High".
teams_rsm$SO <- as.character(teams_rsm$SO)
teams_rsm$SO[teams_rsm$SO == "Low"] <- 1
teams_rsm$SO[teams_rsm$SO == "Medium"] <- 2
teams_rsm$SO[teams_rsm$SO == "High"] <- 3
#Categorize 'SO' as an integer variable and display its resulting structure.
teams_rsm$SO = as.integer(teams_rsm$SO)
str(teams_rsm$SO)
## int [1:880] 1 1 1 1 1 1 1 1 1 1 ...
#Convert the factor 'BB' into an integer variable designated by its levels "Low", "Medium" and "High".
teams_rsm$BB <- as.character(teams_rsm$BB)
teams_rsm$BB[teams_rsm$BB == "Low"] <- 1
teams_rsm$BB[teams_rsm$BB == "Medium"] <- 2
teams_rsm$BB[teams_rsm$BB == "High"] <- 3
#Categorize 'BB' as an integer variable and display its resulting structure.
teams_rsm$BB = as.integer(teams_rsm$BB)
str(teams_rsm$BB)
## int [1:880] 3 2 2 2 1 2 3 2 1 1 ...
#Load in Response Surface Methods package, "rsm".
library("rsm")
## Warning: package 'rsm' was built under R version 3.1.2
#Create a second-order Response Surface Methods model.
losses.rsm <- rsm(L~SO(H, HR, SO, BB), data = teams_rsm)
summary(losses.rsm)
##
## Call:
## rsm(formula = L ~ SO(H, HR, SO, BB), data = teams_rsm)
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.7320819 5.8718818 8.9804 < 2.2e-16 ***
## H 13.6730442 3.4672541 3.9435 8.68e-05 ***
## HR -2.1817751 3.4586450 -0.6308 0.5283262
## SO 13.0853321 3.4696870 3.7713 0.0001734 ***
## BB 7.0622695 3.3517047 2.1071 0.0353991 *
## H:HR 0.5724484 1.0810272 0.5295 0.5965660
## H:SO -2.7744589 0.8948929 -3.1003 0.0019957 **
## H:BB -3.1245882 0.8606482 -3.6305 0.0002995 ***
## HR:SO -0.2002805 1.0570814 -0.1895 0.8497724
## HR:BB 0.0038567 1.0286509 0.0037 0.9970094
## SO:BB -1.5605071 0.8516473 -1.8323 0.0672445 .
## H^2 -0.8614669 0.8419113 -1.0232 0.3064863
## HR^2 -0.3348097 0.9719714 -0.3445 0.7305807
## SO^2 -0.0661058 0.8229784 -0.0803 0.9359973
## BB^2 -0.3753587 0.8011825 -0.4685 0.6395409
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Multiple R-squared: 0.1758, Adjusted R-squared: 0.1625
## F-statistic: 13.18 on 14 and 865 DF, p-value: < 2.2e-16
##
## Analysis of Variance Table
##
## Response: L
## Df Sum Sq Mean Sq F value Pr(>F)
## FO(H, HR, SO, BB) 4 15779 3944.6 33.6340 < 2.2e-16
## TWI(H, HR, SO, BB) 6 5708 951.4 8.1121 1.514e-08
## PQ(H, HR, SO, BB) 4 154 38.6 0.3288 0.8587
## Residuals 865 101448 117.3
## Lack of fit 55 11809 214.7 1.9402 8.391e-05
## Pure error 810 89639 110.7
##
## Stationary point of response surface:
## H HR SO BB
## 2.1160580 -0.8014994 -2.0712852 4.9014763
##
## Eigenanalysis:
## $values
## [1] 1.1991517 0.5728527 -0.3722024 -3.0375430
##
## $vectors
## [,1] [,2] [,3] [,4]
## H -0.7106911 -0.07638990 0.07534390 0.69527406
## HR -0.1656596 0.05499774 -0.98301059 -0.05676561
## SO 0.5149352 -0.70326528 -0.15301547 0.46566718
## BB 0.4497964 0.70466856 -0.06782137 0.54454175
Upon observing the resulting p-values for the model “losses.rsm”, it’s likely that we would reject this experiment’s original null hypothesis (see ANOVA model in above section) for the factors ‘H’, ‘SO’, and ‘BB’ (and for the factor-interactions ‘H:SO’ and ‘H:BB’) with regard to their effect’s significance on the response variable ‘L’ (Note: the factor-interaction ‘SO:BB’ has a p-value of 0.0672445, bringing it close to the “significance threshold” of 0.05). Furthermore, the coordinates of the stationary point of the response surface (as generated by the model “losses.rsm”) are listed as follows: ‘H’ = 2.1160580, ‘HR’ = -0.8014994, ‘SO’ = -2.0712852, and ‘BB’ = 4.9014763. In further examining the nature of this response surface, the resulting eigenvalues generated from this experiment’s eigenanalysis can be analyzed (‘H.value’ = 1.1991517, ‘HR.value’ = 0.5728527, ‘SO.value’ = -0.3722024, and ‘BB.value’ = -3.0375430). Since positive and negative eigenvalues exist for these factors (when considered together), the resulting response surface likely exhibits a saddle (ie. the stationary point is a saddle point). Additionally, it can be further determined that a maximum point exists at the interaction ‘SO:BB’ (ie. two negative eigenvalues), a minimum point exists at the interaction of ‘H:HR’ (ie. two positive eigenvalues), and saddle points exist at the interactions of ‘H:SO’, ‘H:BB’, ‘HR:SO’, and ‘HR:BB’ (ie. one positive and one negative eigenvalue). Lastly, upon observing the generated results, the p-value associated with the “lack of fit” term suggests that the model does not exhibit good fit (since its p-value of 8.391e-05 is less than 0.05) and the p-value associated with the “two-way interaction” term (TWI) suggests that the second-order “TWI” term does contribute significantly to the model (since its p-value of 1.514e-08 is less than 0.05). This significance to the model holds true in a similar way for the “first-order” term (FO) in the model, but it does not hold true for the “pure quadratic” term (PQ) in the model. The response surface itself, as described by the different response surface modeling details above, can be further exhibited by the contour plots and the surface plots below:
#Generate contour plots of the model "losses.rsm".
par(mfrow=c(2,3))
contour(losses.rsm, ~H + HR + SO + BB, at=summary(losses.rsm$canoncial$xs), image = TRUE)
par(mfrow=c(1,1))
contour(losses.rsm, ~H + HR, at=summary(losses.rsm$canoncial$xs), image = TRUE)
par(mfrow=c(1,1))
contour(losses.rsm, ~H + SO, at=summary(losses.rsm$canoncial$xs), image = TRUE)
par(mfrow=c(1,1))
contour(losses.rsm, ~H + BB, at=summary(losses.rsm$canoncial$xs), image = TRUE)
par(mfrow=c(1,1))
contour(losses.rsm, ~HR + SO, at=summary(losses.rsm$canoncial$xs), image = TRUE)
par(mfrow=c(1,1))
contour(losses.rsm, ~HR + BB, at=summary(losses.rsm$canoncial$xs), image = TRUE)
par(mfrow=c(1,1))
contour(losses.rsm, ~SO + BB, at=summary(losses.rsm$canoncial$xs), image = TRUE)
#Generate surface plots of the model "losses.rsm".
library("rgl")
## Warning: package 'rgl' was built under R version 3.1.2
persp(losses.rsm, ~ H + HR, at = c(summary(losses.rsm)$canonical$xs, Block="B2"),theta=-60,zlab="Regular Season Losses",contour = "colors")
persp(losses.rsm, ~ H + SO, at = c(summary(losses.rsm)$canonical$xs, Block="B2"),theta=30,zlab="Regular Season Losses",contour="colors")
persp(losses.rsm, ~ H + BB, at = c(summary(losses.rsm)$canonical$xs, Block="B2"),theta=-60,zlab="Regular Season Losses",contour="colors")
persp(losses.rsm, ~ HR + SO, at = c(summary(losses.rsm)$canonical$xs, Block="B2"),theta=50,zlab="Regular Season Losses",contour="colors")
persp(losses.rsm, ~ HR + BB, at = c(summary(losses.rsm)$canonical$xs, Block="B2"),theta=30,zlab="Regular Season Losses",contour="colors")
persp(losses.rsm, ~ SO + BB, at = c(summary(losses.rsm)$canonical$xs, Block="B2"),theta=-60,zlab="Regular Season Losses",contour="colors")
In estimating the different parameters of the experiment, summary statistics are performed on relevant data in the dataset pertaining to the numbers of regular season losses earned by a given team in a given year for the years contained in “teams_thirty” (which includes both the average number of regular season losses earned by all of the teams individually contained within the dataset and the standard deviation of those regular season losses) and the numbers of hits/homeruns/strikeouts/walks earned by a given team in a given year contained in “teams_thirty” (which includes both the quantities of earned hits/homeruns/strikeouts/walks classified as being “High”, “Medium”, and “Low”, respectively, and the standard deviation of those distributed quantities).
#Display summary statistics of teams_thirty$L.
summary(teams_thirty$L)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 40.0 72.0 79.0 79.9 88.0 119.0
#Display standard deviation of teams_thirty$L.
sd(teams_thirty$L, na.rm = FALSE)
## [1] 11.83356
#Display summary statistics of teams_thirty$H.
summary(teams_thirty$H)
## High Low Medium
## 216 223 441
#Display standard deviation of teams_thirty$H.
sd(teams_thirty$H, na.rm = FALSE)
## [1] 0.8258285
#Display summary statistics of teams_thirty$HR.
summary(teams_thirty$HR)
## High Low Medium
## 217 222 441
#Display standard deviation of teams_thirty$HR.
sd(teams_thirty$HR, na.rm = FALSE)
## [1] 0.8268681
#Display summary statistics of teams_thirty$SO.
summary(teams_thirty$SO)
## High Low Medium
## 219 221 440
#Display standard deviation of teams_thirty$SO.
sd(teams_thirty$SO, na.rm = FALSE)
## [1] 0.8285978
#Display summary statistics of teams_thirty$BB.
summary(teams_thirty$BB)
## High Low Medium
## 220 220 440
#Display standard deviation of teams_thirty$BB.
sd(teams_thirty$BB, na.rm = FALSE)
## [1] 0.8296277
In verifying the results of this experiment, it’s important to ensure that the dataset itself meets all of the assumptions that correlate with the design approach that was carried out. In this way, we want to make sure that our dataset exhibits normality. Until we know that our dataset does, in fact, exhibit normality, we cannot yet say with confidence that our results are significant and representative of a properly carried-out modeling approach. In verifying our dataset for normality, we can both create a Normal Quantile-Quantile (QQ) Plot for the data (and for the response surface model “losses.rsm”) and perform a Shapiro-Wilk Test of Normality on the data.
#Create a Normal Q-Q Plot for the numbers of regular season losses earned by a given team in a given year.
qqnorm(teams_thirty[,"L"], main = "Normal Q-Q Plot of Regular Season Losses")
qqline(teams_thirty[,"L"])
#Create a Normal Q-Q Plot of the residuals for "model_losses".
qqnorm(residuals(model_losses), main = "Normal Q-Q Plot of Residuals of 'model_losses'")
qqline(residuals(model_losses))
#Create a Normal Q-Q Plot of the residuals for "losses.rsm".
qqnorm(residuals(losses.rsm), main = "Normal Q-Q Plot of Residuals of 'losses.rsm'")
qqline(residuals(losses.rsm))
#Perform Shapiro-Wilk Test of Normality on the numbers of regular season losses earned by a given team in a given year (normality is assummed if p > 0.1).
shapiro.test(teams_thirty[,"L"])
##
## Shapiro-Wilk normality test
##
## data: teams_thirty[, "L"]
## W = 0.9958, p-value = 0.01724
Upon both constructing Normal Q-Q Plots (for “model_losses” and “losses.rsm”) and performing Shapiro-Wilk Tests of Normality on the data in this analysis, it’s likely that we can readily assume that our data exhibits normality. Despite the fact that the resulting p-value of the Shapiro-Wilk Tests of Normality for “teams_thirty[,“L”]” were < 0.1, all of the constructed Normal Q-Q Plots did seem to display a trend of data that aligned closely with the Normal Q-Q Line. Additionally, since Lahman’s Baseball Database is a completely comprehensive and accurate statistical resource for baseball statistics gathered from 1871-2013 (that, in theory, contains every possible observation that could be collected during those years), we should definitely be safe in making the assumption that this database (and the subsetted dataset that was extracted from it) exhibits normality.
In further backing up the confidence that we have with our results, we can generate a “quality of fit” model that plots residual error against the fitted model that was developed in the original analysis of covariance (ANCOVA).
#Create a "Quality of Fit Model" that plots the residuals of "model_losses" against its fitted model.
par(mfrow=c(1,1))
plot(fitted(model_losses),residuals(model_losses), main = "Residuals of 'model_losses' Against Fitted Model 'model_losses'")
#Create a "Quality of Fit Model" that plots the residuals of "losses.rsm" against its fitted model.
par(mfrow=c(1,1))
plot(fitted(losses.rsm),residuals(losses.rsm), main = "Residuals of 'losses.rsm' Against Fitted Model 'losses.rsm'")
Because the resulting “residual vs. fitted” plots (for both “model_losses” and “losses.rsm”, respectively) appears to exhibit a fairly symmetric distribution of residuals clumped around the zero line, the ANOVA and the Response Surface Methods models developed suggest a goodness of fit. Thus, we can confidently rely on both the modeling approach that we carried out and the dataset that we analyzed in justifying the significance of our results.
If our modeling assumptions failed in our analysis, we can still err on the side of caution by performing the nonparametric Kruskal-Wallis rank sum test to back up our original results (which will help us to decide whether the population distributions are identical without necessarily exhibiting a normal distribution)
#Perform Kruskal-Wallis Rank Sum Test on 'L' within the "teams_thirty" dataset for 'H' (identical populations is assummed if p > 0.05).
kruskal.test(teams_thirty[,"L"],teams_thirty$H)
##
## Kruskal-Wallis rank sum test
##
## data: teams_thirty[, "L"] and teams_thirty$H
## Kruskal-Wallis chi-squared = 26.2239, df = 2, p-value = 2.021e-06
#Perform Kruskal-Wallis Rank Sum Test on 'L' within the "teams_thirty" dataset for 'HR' (identical populations is assummed if p > 0.05).
kruskal.test(teams_thirty[,"L"],teams_thirty$HR)
##
## Kruskal-Wallis rank sum test
##
## data: teams_thirty[, "L"] and teams_thirty$HR
## Kruskal-Wallis chi-squared = 30.5457, df = 2, p-value = 2.329e-07
#Perform Kruskal-Wallis Rank Sum Test on 'L' within the "teams_thirty" dataset for 'SO' (identical populations is assummed if p > 0.05).
kruskal.test(teams_thirty[,"L"],teams_thirty$SO)
##
## Kruskal-Wallis rank sum test
##
## data: teams_thirty[, "L"] and teams_thirty$SO
## Kruskal-Wallis chi-squared = 16.5727, df = 2, p-value = 0.0002519
#Perform Kruskal-Wallis Rank Sum Test on 'L' within the "teams_thirty" dataset for 'BB' (identical populations is assummed if p > 0.05).
kruskal.test(teams_thirty[,"L"],teams_thirty$BB)
##
## Kruskal-Wallis rank sum test
##
## data: teams_thirty[, "L"] and teams_thirty$BB
## Kruskal-Wallis chi-squared = 71.4705, df = 2, p-value = 3.023e-16
Since the p-values for the resulting Kruskal-Wallis rank sum tests that consider the factors ‘H’, ‘HR’, ‘SO’, and ‘BB’ individually against the response variable ‘L’ are all less than 0.05, we can assume that the mean values of the number of regular season losses that a given team earns in a given year compared to the number of hits/homeruns/strikeouts/walks (considered individually) earned by a given team in a given year are comparatively nonidentical populations. Therefore, this result suggests that we would reject the null hypothesis of our main experiment, leading us to believe that the number of hits, the number of homeruns, the number of strikeouts, and the number of walks earned by a given team in a given year likely does have a significant effect on the number of regular season losses that a given team earns in a given year in our analysis. Furthermore, in addition to treating our data in such a way that uses a nonparametric analysis upon any realization that normality cannot be assumed, transformations such as the “Box-Cox Power Transformation” could certainly have been performed on the data to make it approximate normality. However, these transformations would not be necessary for this analysis, since the nonparametric significance results that we generated by using the Kruskal-Wallis rank sums test were suitable in giving us confidence in the results of our analysis.
[1] Lahman, S. (1996-2014). Lahman’s Baseball Database.
The updated version of the database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more (http://www.seanlahman.com/baseball-archive/statistics/). For more details on the latest release, please read the following documentation (http://seanlahman.com/files/database/readme2012.txt). The database can be used on any platform, but please be aware that this is not a standalone application. It is a database that requires Microsoft Access or some other relational database software to be useful.