As of August 28, 2014, superseding the version of August 24. Always use the most recent version.
In this study, a single-factor, multi-level experiment is performed to determine whether the number of hits a team earns in a season has a statistically significant effect on the number of losses that team records in the same season. In addition to the single factor, two explanatory variables are included to help account for the variation in losses: the total number of strikeouts a team earned in a season (‘SO’) and the team’s earned run average in a season (‘ERA’). In the dataset, the factor ‘H’ refers to the number of hits that a given team earned in a given year, and the response variable ‘L’ denotes the number of regular season losses that a given team recorded in a given year.
##Load in the Teams Dataset
#Get dataset from Project Documents File
raw_teams <- read.csv("~/Academics (RPI)/09. Fall 2014/Design of Experiments/02. Wikibook Recipes/Recipe #06/Teams.csv", header=TRUE)
head(raw_teams)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1871 <NA> PH1 PNA 1 28 NA 21 7 Y
## 2 1871 <NA> CH1 CNA 2 28 NA 19 9 N
## 3 1871 <NA> BS1 BNA 3 31 NA 20 10 N
## 4 1871 <NA> WS3 OLY 4 32 NA 15 15 N
## 5 1871 <NA> NY2 NNA 5 33 NA 16 17 N
## 6 1871 <NA> TRO TRO 6 29 NA 13 15 N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 376 1281 410 66 27 9 46 23 56 NA NA NA 266 137 4.95 27 0 0
## 2 302 1196 323 52 21 10 60 22 69 NA NA NA 241 77 2.76 25 0 1
## 3 401 1372 426 70 37 3 60 19 73 NA NA NA 303 109 3.55 22 1 3
## 4 310 1353 375 54 26 6 48 13 48 NA NA NA 303 137 4.37 32 0 0
## 5 302 1404 403 43 21 1 33 15 46 NA NA NA 313 121 3.72 32 1 0
## 6 351 1248 384 51 34 6 49 19 62 NA NA NA 362 153 5.51 28 0 0
## IPouts HA HRA BBA SOA E DP FP name
## 1 747 329 3 53 16 194 NA 0.84 Philadelphia Athletics
## 2 753 308 6 28 22 218 NA 0.82 Chicago White Stockings
## 3 828 367 2 42 23 225 NA 0.83 Boston Red Stockings
## 4 846 371 4 45 13 217 NA 0.85 Washington Olympics
## 5 879 373 7 42 22 227 NA 0.83 New York Mutuals
## 6 750 431 4 75 12 198 NA 0.84 Troy Haymakers
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 Jefferson Street Grounds NA 102 98 ATH PH1
## 2 Union Base-Ball Grounds NA 104 102 CHI CH1
## 3 South End Grounds I NA 103 98 BOS BS1
## 4 Olympics Grounds NA 94 98 OLY WS3
## 5 Union Grounds (Brooklyn) NA 90 88 NYU NY2
## 6 Haymakers' Grounds NA 101 100 TRO TRO
## teamIDretro
## 1 PH1
## 2 CH1
## 3 BS1
## 4 WS3
## 5 NY2
## 6 TRO
tail(raw_teams)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin
## 2740 2013 NL MIA FLA E 5 162 81 62 100 N N
## 2741 2013 NL LAN LAD W 1 162 81 92 70 Y N
## 2742 2013 NL ARI ARI W 2 162 81 81 81 N N
## 2743 2013 NL SDN SDP W 3 162 81 76 86 N N
## 2744 2013 NL SFN SFG W 4 162 82 76 86 N N
## 2745 2013 NL COL COL W 5 162 81 74 88 N N
## LgWin WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER
## 2740 N N 513 5449 1257 219 31 95 432 1232 78 29 56 26 646 602
## 2741 N N 649 5491 1447 281 17 138 476 1146 78 28 57 48 582 524
## 2742 N N 685 5676 1468 302 31 130 519 1142 62 41 43 43 695 651
## 2743 N N 618 5517 1349 246 26 146 467 1309 118 34 52 34 700 643
## 2744 N N 629 5552 1446 280 35 107 469 1078 67 26 39 42 691 643
## 2745 N N 706 5599 1511 283 36 159 427 1204 112 32 26 35 760 708
## ERA CG SHO SV IPouts HA HRA BBA SOA E DP FP
## 2740 3.71 2 1 36 4380 1376 121 526 1177 88 144 0.986
## 2741 3.25 7 4 46 4351 1321 127 460 1292 109 160 0.982
## 2742 3.92 6 2 38 4485 1460 176 485 1218 75 134 0.988
## 2743 3.98 3 1 40 4365 1407 156 525 1171 83 140 0.986
## 2744 4.00 2 2 41 4342 1380 145 521 1256 107 126 0.982
## 2745 4.44 1 0 35 4308 1545 136 517 1064 90 162 0.986
## name park attendance BPF PPF teamIDBR
## 2740 Miami Marlins Marlins Park 1586322 102 103 MIA
## 2741 Los Angeles Dodgers Dodger Stadium 3743527 95 95 LAD
## 2742 Arizona Diamondbacks Chase Field 2134795 102 102 ARI
## 2743 San Diego Padres Petco Park 2166691 91 91 SDP
## 2744 San Francisco Giants AT&T Park 3326796 90 89 SFG
## 2745 Colorado Rockies Coors Field 2793828 117 118 COL
## teamIDlahman45 teamIDretro
## 2740 FLO MIA
## 2741 LAN LAN
## 2742 ARI ARI
## 2743 SDN SDN
## 2744 SFN SFN
## 2745 COL COL
This analysis considers a single factor with multiple levels: the number of hits that a team earns in a given season, ‘H’. In the original dataset “raw_teams”, ‘H’ is stored as an integer variable with no categorical levels. In carrying out this analysis, this factor will be transformed into a categorical variable with manually defined levels. The factor was selected intuitively, since the analysis aims to determine whether the number of hits a team earns in a season has a significant effect on the number of regular season losses that team records in that season.
#Display the summary statistics of "raw_teams".
summary(raw_teams)
## yearID lgID teamID franchID divID
## Min. :1871 AA : 85 CHN : 138 ATL : 138 :1517
## 1st Qu.:1918 AL :1175 PHI : 131 CHC : 138 C: 215
## Median :1961 FL : 16 PIT : 127 CIN : 132 E: 518
## Mean :1954 NL :1399 CIN : 124 PIT : 132 W: 495
## 3rd Qu.:1990 PL : 8 SLN : 122 STL : 132
## Max. :2013 UA : 12 BOS : 113 PHI : 131
## NA's: 50 (Other):1990 (Other):1942
## Rank G Ghome W
## Min. : 1.000 Min. : 6.0 Min. :44.0 Min. : 0.00
## 1st Qu.: 2.000 1st Qu.:153.0 1st Qu.:77.0 1st Qu.: 66.00
## Median : 4.000 Median :157.0 Median :80.0 Median : 77.00
## Mean : 4.132 Mean :150.1 Mean :78.4 Mean : 74.61
## 3rd Qu.: 6.000 3rd Qu.:162.0 3rd Qu.:81.0 3rd Qu.: 87.00
## Max. :13.000 Max. :165.0 Max. :84.0 Max. :116.00
## NA's :399
## L DivWin WCWin LgWin WSWin R
## Min. : 4.00 :1545 :2181 : 28 : 357 Min. : 24.0
## 1st Qu.: 65.00 N: 982 N: 522 N:2449 N:2274 1st Qu.: 612.0
## Median : 76.00 Y: 218 Y: 42 Y: 268 Y: 114 Median : 690.0
## Mean : 74.61 Mean : 682.1
## 3rd Qu.: 86.00 3rd Qu.: 765.0
## Max. :134.00 Max. :1220.0
##
## AB H X2B X3B
## Min. : 211 Min. : 33 Min. : 3.0 Min. : 0.00
## 1st Qu.:5117 1st Qu.:1297 1st Qu.:192.0 1st Qu.: 31.00
## Median :5379 Median :1393 Median :229.0 Median : 42.00
## Mean :5134 Mean :1345 Mean :226.6 Mean : 47.48
## 3rd Qu.:5514 3rd Qu.:1468 3rd Qu.:269.0 3rd Qu.: 61.00
## Max. :5781 Max. :1783 Max. :376.0 Max. :150.00
##
## HR BB SO SB
## Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 40 1st Qu.:425.0 1st Qu.: 501.0 1st Qu.: 64.0
## Median :105 Median :494.0 Median : 746.0 Median : 96.0
## Mean :100 Mean :473.8 Mean : 726.3 Mean :112.8
## 3rd Qu.:148 3rd Qu.:555.0 3rd Qu.: 955.0 3rd Qu.:143.0
## Max. :264 Max. :835.0 Max. :1535.0 Max. :581.0
## NA's :120 NA's :144
## CS HBP SF RA
## Min. : 0.00 Min. : 26.00 Min. :24.00 Min. : 34.0
## 1st Qu.: 35.00 1st Qu.: 47.00 1st Qu.:39.00 1st Qu.: 608.0
## Median : 46.00 Median : 54.50 Median :44.00 Median : 688.0
## Mean : 49.21 Mean : 56.35 Mean :45.09 Mean : 682.1
## 3rd Qu.: 59.00 3rd Qu.: 64.00 3rd Qu.:50.25 3rd Qu.: 765.0
## Max. :191.00 Max. :103.00 Max. :75.00 Max. :1252.0
## NA's :859 NA's :2325 NA's :2325
## ER ERA CG SHO
## Min. : 25.0 Min. :1.220 Min. : 0.0 Min. : 0.000
## 1st Qu.: 498.0 1st Qu.:3.330 1st Qu.: 16.0 1st Qu.: 6.000
## Median : 590.0 Median :3.820 Median : 46.0 Median : 9.000
## Mean : 569.8 Mean :3.815 Mean : 51.5 Mean : 9.435
## 3rd Qu.: 667.0 3rd Qu.:4.310 3rd Qu.: 78.0 3rd Qu.:12.000
## Max. :1023.0 Max. :8.000 Max. :148.0 Max. :32.000
##
## SV IPouts HA HRA
## Min. : 0.00 Min. : 162 Min. : 49 Min. : 0
## 1st Qu.: 9.00 1st Qu.:4071 1st Qu.:1287 1st Qu.: 43
## Median :23.00 Median :4224 Median :1392 Median :107
## Mean :23.25 Mean :4015 Mean :1345 Mean :100
## 3rd Qu.:37.00 3rd Qu.:4339 3rd Qu.:1471 3rd Qu.:147
## Max. :68.00 Max. :4518 Max. :1993 Max. :241
##
## BBA SOA E DP
## Min. : 0.0 Min. : 0.0 Min. : 47.0 Min. : 18.0
## 1st Qu.:426.0 1st Qu.: 499.0 1st Qu.:119.0 1st Qu.:126.0
## Median :496.0 Median : 721.0 Median :147.0 Median :144.0
## Mean :474.1 Mean : 719.9 Mean :188.6 Mean :140.1
## 3rd Qu.:556.0 3rd Qu.: 951.0 3rd Qu.:219.0 3rd Qu.:160.0
## Max. :827.0 Max. :1428.0 Max. :639.0 Max. :217.0
## NA's :317
## FP name park
## Min. :0.7600 Cincinnati Reds : 123 Wrigley Field : 100
## 1st Qu.:0.9600 Pittsburgh Pirates : 123 Sportsman's Park IV: 90
## Median :0.9700 Philadelphia Phillies: 122 Comiskey Park : 80
## Mean :0.9605 St. Louis Cardinals : 114 Fenway Park II : 80
## 3rd Qu.:0.9800 Chicago White Sox : 113 Forbes Field : 60
## Max. :0.9910 Detroit Tigers : 113 Crosley Field : 58
## (Other) :2037 (Other) :2277
## attendance BPF PPF teamIDBR
## Min. : 6088 Min. : 60.0 Min. : 60.0 CHC : 138
## 1st Qu.: 518051 1st Qu.: 97.0 1st Qu.: 97.0 CIN : 137
## Median :1107122 Median :100.0 Median :100.0 STL : 135
## Mean :1317241 Mean :100.2 Mean :100.2 PHI : 134
## 3rd Qu.:1950099 3rd Qu.:103.0 3rd Qu.:103.0 PIT : 132
## Max. :4483350 Max. :129.0 Max. :141.0 BOS : 121
## NA's :279 (Other):1948
## teamIDlahman45 teamIDretro
## CHN : 138 CHN : 138
## PHI : 131 PHI : 131
## PIT : 127 PIT : 127
## CIN : 124 CIN : 124
## SLN : 122 SLN : 122
## BOS : 113 BOS : 113
## (Other):1990 (Other):1990
#Display the names found in "raw_teams".
names(raw_teams)
## [1] "yearID" "lgID" "teamID" "franchID"
## [5] "divID" "Rank" "G" "Ghome"
## [9] "W" "L" "DivWin" "WCWin"
## [13] "LgWin" "WSWin" "R" "AB"
## [17] "H" "X2B" "X3B" "HR"
## [21] "BB" "SO" "SB" "CS"
## [25] "HBP" "SF" "RA" "ER"
## [29] "ERA" "CG" "SHO" "SV"
## [33] "IPouts" "HA" "HRA" "BBA"
## [37] "SOA" "E" "DP" "FP"
## [41] "name" "park" "attendance" "BPF"
## [45] "PPF" "teamIDBR" "teamIDlahman45" "teamIDretro"
#Display the structure of "raw_teams".
str(raw_teams)
## 'data.frame': 2745 obs. of 48 variables:
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
## $ lgID : Factor w/ 6 levels "AA","AL","FL",..: NA NA NA NA NA NA NA NA NA NA ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 97 31 24 142 90 136 56 39 111 24 ...
## $ franchID : Factor w/ 120 levels "ALT","ANA","ARI",..: 85 36 13 77 70 109 56 25 91 13 ...
## $ divID : Factor w/ 4 levels "","C","E","W": 1 1 1 1 1 1 1 1 1 1 ...
## $ Rank : int 1 2 3 4 5 6 7 8 9 1 ...
## $ G : int 28 28 31 32 33 29 19 29 25 48 ...
## $ Ghome : int NA NA NA NA NA NA NA NA NA NA ...
## $ W : int 21 19 20 15 16 13 7 10 4 39 ...
## $ L : int 7 9 10 15 17 15 12 19 21 8 ...
## $ DivWin : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ WCWin : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ LgWin : Factor w/ 3 levels "","N","Y": 3 2 2 2 2 2 2 2 2 3 ...
## $ WSWin : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ R : int 376 302 401 310 302 351 137 249 231 521 ...
## $ AB : int 1281 1196 1372 1353 1404 1248 746 1186 1036 2137 ...
## $ H : int 410 323 426 375 403 384 178 328 274 677 ...
## $ X2B : int 66 52 70 54 43 51 19 35 44 114 ...
## $ X3B : int 27 21 37 26 21 34 8 40 25 31 ...
## $ HR : int 9 10 3 6 1 6 2 7 3 7 ...
## $ BB : int 46 60 60 48 33 49 33 26 38 28 ...
## $ SO : int 23 22 19 13 15 19 9 25 30 26 ...
## $ SB : int 56 69 73 48 46 62 16 18 53 47 ...
## $ CS : int NA NA NA NA NA NA NA NA NA 14 ...
## $ HBP : int NA NA NA NA NA NA NA NA NA NA ...
## $ SF : int NA NA NA NA NA NA NA NA NA NA ...
## $ RA : int 266 241 303 303 313 362 243 341 287 236 ...
## $ ER : int 137 77 109 137 121 153 97 116 108 95 ...
## $ ERA : num 4.95 2.76 3.55 4.37 3.72 5.51 5.17 4.11 4.3 1.99 ...
## $ CG : int 27 25 22 32 32 28 19 23 23 41 ...
## $ SHO : int 0 0 1 0 1 0 1 0 1 3 ...
## $ SV : int 0 1 3 0 0 0 0 0 0 1 ...
## $ IPouts : int 747 753 828 846 879 750 507 762 678 1290 ...
## $ HA : int 329 308 367 371 373 431 261 346 315 438 ...
## $ HRA : int 3 6 2 4 7 4 5 13 3 0 ...
## $ BBA : int 53 28 42 45 42 75 21 53 34 27 ...
## $ SOA : int 16 22 23 13 22 12 17 34 16 0 ...
## $ E : int 194 218 225 217 227 198 163 223 220 263 ...
## $ DP : int NA NA NA NA NA NA NA NA NA NA ...
## $ FP : num 0.84 0.82 0.83 0.85 0.83 0.84 0.8 0.81 0.82 0.87 ...
## $ name : Factor w/ 139 levels "Altoona Mountain City",..: 97 42 17 135 93 131 63 51 111 17 ...
## $ park : Factor w/ 213 levels "","23rd Street Grounds",..: 87 197 170 130 199 80 77 116 4 170 ...
## $ attendance : int NA NA NA NA NA NA NA NA NA NA ...
## $ BPF : int 102 104 103 94 90 101 101 96 97 105 ...
## $ PPF : int 98 102 98 98 88 100 107 100 99 100 ...
## $ teamIDBR : Factor w/ 101 levels "ALT","ANA","ARI",..: 4 21 10 65 62 93 42 25 77 10 ...
## $ teamIDlahman45: Factor w/ 148 levels "ALT","ANA","ARI",..: 96 31 24 140 89 135 56 39 110 24 ...
## $ teamIDretro : Factor w/ 149 levels "ALT","ANA","ARI",..: 96 31 24 141 89 135 56 39 110 24 ...
In this dataset, a few variables can be considered continuous; these are the ones that R stores as numeric (rather than integer) variables. By this standard, the continuous variables in this dataset are ‘ERA’ (a team’s overall earned run average) and ‘FP’ (a team’s overall fielding percentage). The main factor that our analysis considers, ‘H’, is currently stored as an integer variable. Once its integer values are grouped into specific “integer-range” levels, however, this factor becomes a categorical variable.
This analysis will consider one response variable, ‘L’, which denotes the number of regular season losses that a given team earned in a given year.
As a whole, Lahman’s Baseball Database contains information pertaining to pitching, hitting, and fielding statistics for Major League Baseball from the years 1871 through 2013. It includes data from the two current leagues (American and National), the four other “major” leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875. It contains 2,745 observations of 48 variables, which are defined below (Lahman, 1996-2014) [1]:
yearID [Year]
lgID [League]
teamID [Team]
franchID [Franchise (links to TeamsFranchise table)]
divID [Team’s division]
Rank [Position in final standings]
G [Games played]
GHome [Games played at home]
W [Wins]
L [Losses]
DivWin [Division Winner (Y or N)]
WCWin [Wild Card Winner (Y or N)]
LgWin [League Champion (Y or N)]
WSWin [World Series Winner (Y or N)]
R [Runs scored]
AB [At bats]
H [Hits by batters]
2B [Doubles]
3B [Triples]
HR [Homeruns by batters]
BB [Walks by batters]
SO [Strikeouts by batters]
SB [Stolen bases]
CS [Caught stealing]
HBP [Batters hit by pitch]
SF [Sacrifice flies]
RA [Opponents runs scored]
ER [Earned runs allowed]
ERA [Earned run average]
CG [Complete games]
SHO [Shutouts]
SV [Saves]
IPOuts [Outs Pitched (innings pitched x 3)]
HA [Hits allowed]
HRA [Homeruns allowed]
BBA [Walks allowed]
SOA [Strikeouts by pitchers]
E [Errors]
DP [Double Plays]
FP [Fielding percentage]
name [Team’s full name]
park [Name of team’s home ballpark]
attendance [Home attendance total]
BPF [Three-year park factor for batters]
PPF [Three-year park factor for pitchers]
teamIDBR [Team ID used by Baseball Reference website]
teamIDlahman45 [Team ID used in Lahman database version 4.5]
teamIDretro [Team ID used by Retrosheet]
Because this dataset is a comprehensive statistical resource for baseball statistics gathered from 1871-2013 (one that, in theory, contains every observation that could have been collected during those years), we make the assumption that it exhibits randomization without bias.
In this experiment, we are trying to determine whether the variation observed in the response variable (‘L’) can be explained by the variation in the single treatment being considered (‘H’), analyzed together with the explanatory variables that the model also includes (‘SO’ and ‘ERA’, which were chosen with the help of a pairs plot [see below]). The null hypothesis states that the number of hits earned by a given team in a given year does not have a significant effect on the number of regular season losses that team records in that year. The alternative hypothesis states that the number of hits does have a significant effect on the number of regular season losses. In carrying out this analysis, we perform an analysis of covariance (ANCOVA) on the number of regular season losses (‘L’) to see whether the mean of this response differs significantly across the levels of hits (‘H’), after accounting for the explanatory variables ‘SO’ and ‘ERA’, which are also contained in this dataset.
The rationale for this design is that we want to determine whether the number of hits earned by a given team in a given season has any effect on the number of regular season losses that team records. Since regular season losses are a useful and relevant metric when identifying the performance-related factors behind a team’s losses, a single-factor, multi-level experiment (analyzed as an analysis of covariance with two explanatory variables) was designed to test whether hits have a significant effect on losses. An ANCOVA is used because of the nature of the variables being considered. Although the factor ‘H’ is known and controllable, the explanatory variables ‘SO’ and ‘ERA’ are known and uncontrollable. Because they are uncontrollable, we cannot treat them as categorical factors and must leave them as continuous, quantitative variables (known here as covariates). An ANCOVA model, which combines aspects of regression analysis and ANOVA, still allows us to include these covariates alongside our treatment variable, ‘H’, which reduces the variation left in the error term and gives a more accurate measurement of the treatment effect. In this way, ANCOVA generally allows us to control for the effects of the covariates while the factors’ main effects and interaction effects are being determined. By performing the analysis in this manner, we hope to gain some insight into the factors that typically result in regular season losses for baseball teams.
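The variance-reduction argument above can be illustrated with a minimal sketch on simulated data. Everything in it (the variable names `group`, `covariate`, `response`, the effect sizes, and the sample size) is a hypothetical assumption chosen for illustration and is not drawn from Lahman’s database. It fits the same response once without and once with a covariate and compares the residual mean squares.

```r
set.seed(42)
n <- 300

# Simulated data (hypothetical): a three-level factor and one covariate.
group <- factor(sample(c("Low", "Medium", "High"), n, replace = TRUE))
covariate <- rnorm(n, mean = 4, sd = 0.5)
group_shift <- unname(c(High = -3, Low = 2, Medium = 0)[as.character(group)])
response <- 80 + group_shift + 10 * (covariate - 4) + rnorm(n, sd = 5)

# One-way ANOVA: the covariate's influence stays in the residual term.
fit_anova <- lm(response ~ group)

# ANCOVA: the covariate is entered first, then the factor.
fit_ancova <- lm(response ~ covariate + group)

# Residual mean square (estimate of the error variance) under each model.
ms_anova <- sum(residuals(fit_anova)^2) / df.residual(fit_anova)
ms_ancova <- sum(residuals(fit_ancova)^2) / df.residual(fit_ancova)
print(c(anova = ms_anova, ancova = ms_ancova))

# Sequential ANOVA table for the ANCOVA fit.
print(anova(fit_ancova))
```

In this simulated setting the covariate has a real effect on the response, so the ANCOVA fit leaves a smaller residual mean square than the one-way fit; how much it helps on real data depends on how strongly the covariates relate to the response.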
Since the original assumption claimed that the entirety of Lahman’s Baseball Database exhibits randomization, we did not need to randomize our data any further to obtain a completely randomized design. However, to keep the analysis reasonable, a chronological subset was extracted from Lahman’s Baseball Database: all data from 1983 through 2013, which covers 31 seasons. To decide which explanatory variables to include in this experiment (noted previously), a pairs plot was generated to give some insight into the effects that various variables in the dataset have on the response variable, ‘L’. To generate this pairs plot, the most relevant variables were kept in a subset of the larger “raw_teams” dataset: ‘yearID’, ‘lgID’, ‘teamID’, ‘W’, ‘L’, ‘R’, ‘AB’, ‘H’, ‘HR’, ‘BB’, ‘SO’, ‘ERA’, and ‘E’. The other 35 variables are not considered further in this analysis, so they were dropped.
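As a quick check on the size of the chronological subset, the following sketch counts the seasons that a `yearID >= 1983` filter retains. The data frame `toy` is a hypothetical stand-in for the `yearID` column, not the real data.

```r
# Toy stand-in for the yearID column (hypothetical values, not the real data).
toy <- data.frame(yearID = rep(1980:2013, each = 2))

# Keep seasons from 1983 onward, mirroring the filter used in this analysis.
toy_recent <- subset(toy, yearID >= 1983)

# Both endpoints are included, so the range spans 2013 - 1983 + 1 seasons.
n_seasons <- length(unique(toy_recent$yearID))
print(n_seasons)
```

The same `length(unique(...))` call can be applied to the real subset to confirm how many seasons it covers.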
#Create a pairs plot for 'L' against 'H'.
par(mfrow=c(1,1))
pairs(L~H, raw_teams)
#Create subset of the dataset that only includes team data from the years 1983-2013.
teams_thirty <- subset(raw_teams, raw_teams$yearID >= 1983, select=c(yearID:teamID,W,L,R:H,HR:SO,ERA,E))
#Create a pairs plot for "teams_thirty".
par(mfrow=c(1,1))
pairs(teams_thirty)
#Create a pairs plot for 'L' against 'SO'.
par(mfrow=c(1,1))
pairs(L~SO, teams_thirty)
#Create a pairs plot for 'L' against 'ERA'.
par(mfrow=c(1,1))
pairs(L~ERA, teams_thirty)
Upon generating these pairs plots, it appears that the number of strikeouts (‘SO’) and the earned run average (‘ERA’) earned by a given team in a given year both correlate positively with team losses (‘L’). Therefore, ‘SO’ and ‘ERA’ seem to be the most relevant explanatory variables to include in the ANCOVA model along with the main factor ‘H’.
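A pairs plot gives a visual impression only; `cor()` puts a number on each pairwise relationship. The sketch below runs it on just the twelve team-seasons that appear in the `head()` and `tail()` output of “teams_thirty” printed later in this document. That is far too few rows to support any conclusion, so it only illustrates the call; on the full data, `sample_rows` (a name introduced here for illustration) would be replaced by the corresponding columns of “teams_thirty”.

```r
# Twelve team-seasons copied from the head() and tail() output of "teams_thirty".
sample_rows <- data.frame(
  L   = c(64, 70, 71, 73, 75, 84, 100, 70, 81, 86, 86, 88),
  SO  = c(800, 831, 686, 810, 665, 758, 1232, 1146, 1142, 1309, 1078, 1204),
  ERA = c(3.63, 3.80, 3.86, 4.12, 4.02, 4.34, 3.71, 3.25, 3.92, 3.98, 4.00, 4.44)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(sample_rows)
print(round(cor_matrix, 2))
```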
In this experiment, there are no replicates or repeated measures present.
In order to transform our integer variable ‘H’ into a categorical factor, its values were grouped into ranges (binning, rather than blocking in the strict experimental-design sense). Three levels were defined, designating a low, a medium, and a high number of hits by a given team in a given year. The cut points between levels were set at the first and third quartiles of the hits data in the subsetted dataset “teams_thirty” (see the numeric cut points in the R code below). Therefore, in this subsetted dataset, the factor ‘H’ now has three distinct levels (“Low”, “Medium”, and “High”).
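#Transform 'H' into categorical labels (Low, Medium, and High). cut() bins the numeric values in one step; assigning labels into 'H' one range at a time would turn the column into character after the first assignment, so the later comparisons would be made on strings.
h_bins <- cut(teams_thirty$H, breaks = c(0, 1385, 1499, 1684), labels = c("Low", "Medium", "High"))
#Store the labels as character so that as.factor() below orders the levels alphabetically.
teams_thirty$H <- as.character(h_bins)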
#Categorize 'H' as a factor and display its resulting levels.
teams_thirty$H = as.factor(teams_thirty$H)
levels(teams_thirty$H)
## [1] "High" "Low" "Medium"
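The quartile-based binning described above can be sketched as follows. The vector `hits` is simulated (a hypothetical stand-in for the hits column, with an arbitrary mean and spread), so the cut points it produces are not the 1385 and 1499 used in this analysis.

```r
set.seed(1)

# Simulated season hit totals (hypothetical; stands in for the real 'H' column).
hits <- round(rnorm(200, mean = 1440, sd = 80))

# The first and third quartiles become the two interior cut points.
q <- unname(quantile(hits, probs = c(0.25, 0.75)))

# cut() assigns each value to exactly one interval; intervals are (lo, hi] by default.
hits_level <- cut(hits,
                  breaks = c(-Inf, q[1], q[2], Inf),
                  labels = c("Low", "Medium", "High"))

print(table(hits_level))
print(levels(hits_level))
```

Note that `cut()` keeps the levels in the order the labels are given, whereas `as.factor()` applied to a character vector sorts them alphabetically, which is why the levels in this analysis print as "High", "Low", "Medium".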
In beginning to display this data graphically, summary statistics were gathered for the newly created dataset, “teams_thirty”. Additionally, histograms and boxplots were created to represent the different observations of regular season losses existent within this subsetted dataset that contains data from 1983-2013.
#Display the summary statistics of "teams_thirty".
summary(teams_thirty)
## yearID lgID teamID W L
## Min. :1983 AA: 0 ATL : 31 Min. : 43.0 Min. : 40.0
## 1st Qu.:1991 AL:435 BAL : 31 1st Qu.: 72.0 1st Qu.: 72.0
## Median :1999 FL: 0 BOS : 31 Median : 80.0 Median : 79.0
## Mean :1999 NL:445 CHA : 31 Mean : 79.9 Mean : 79.9
## 3rd Qu.:2006 PL: 0 CHN : 31 3rd Qu.: 89.0 3rd Qu.: 88.0
## Max. :2013 UA: 0 CIN : 31 Max. :116.0 Max. :119.0
## (Other):694
## R AB H HR
## Min. : 466.0 Min. :3856 High :216 Min. : 58.0
## 1st Qu.: 670.8 1st Qu.:5469 Low :223 1st Qu.:127.0
## Median : 731.0 Median :5524 Medium:441 Median :154.5
## Mean : 731.9 Mean :5468 Mean :155.6
## 3rd Qu.: 791.0 3rd Qu.:5588 3rd Qu.:180.0
## Max. :1009.0 Max. :5781 Max. :264.0
##
## BB SO ERA E
## Min. :319.0 Min. : 568 Min. :2.910 Min. : 54.0
## 1st Qu.:479.8 1st Qu.: 910 1st Qu.:3.800 1st Qu.: 99.0
## Median :527.0 Median :1014 Median :4.180 Median :112.0
## Mean :530.2 Mean :1011 Mean :4.211 Mean :112.8
## 3rd Qu.:578.5 3rd Qu.:1107 3rd Qu.:4.580 3rd Qu.:126.0
## Max. :775.0 Max. :1535 Max. :6.380 Max. :179.0
##
#Display the names found in "teams_thirty".
names(teams_thirty)
## [1] "yearID" "lgID" "teamID" "W" "L" "R" "AB"
## [8] "H" "HR" "BB" "SO" "ERA" "E"
#Display the structure of "teams_thirty".
str(teams_thirty)
## 'data.frame': 880 obs. of 13 variables:
## $ yearID: int 1983 1983 1983 1983 1983 1983 1983 1983 1983 1983 ...
## $ lgID : Factor w/ 6 levels "AA","AL","FL",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ teamID: Factor w/ 149 levels "ALT","ANA","ARI",..: 5 52 93 134 83 16 45 33 66 131 ...
## $ W : int 98 92 91 89 87 78 70 99 79 77 ...
## $ L : int 64 70 71 73 75 84 92 63 83 85 ...
## $ R : int 799 789 770 795 764 724 704 800 696 639 ...
## $ AB : int 5546 5592 5631 5581 5620 5590 5476 5484 5598 5610 ...
## $ H : Factor w/ 3 levels "High","Low","Medium": 3 1 1 1 1 1 3 3 1 3 ...
## $ HR : int 168 156 153 167 132 142 86 157 109 106 ...
## $ BB : int 601 508 533 510 475 536 605 527 397 442 ...
## $ SO : int 800 831 686 810 665 758 691 888 722 767 ...
## $ ERA : num 3.63 3.8 3.86 4.12 4.02 4.34 4.43 3.67 4.25 3.31 ...
## $ E : int 121 124 139 115 113 130 122 120 164 113 ...
#Display the head and tail of "teams_thirty".
head(teams_thirty)
## yearID lgID teamID W L R AB H HR BB SO ERA E
## 1866 1983 AL BAL 98 64 799 5546 Medium 168 601 800 3.63 121
## 1867 1983 AL DET 92 70 789 5592 High 156 508 831 3.80 124
## 1868 1983 AL NYA 91 71 770 5631 High 153 533 686 3.86 139
## 1869 1983 AL TOR 89 73 795 5581 High 167 510 810 4.12 115
## 1870 1983 AL ML4 87 75 764 5620 High 132 475 665 4.02 113
## 1871 1983 AL BOS 78 84 724 5590 High 142 536 758 4.34 130
tail(teams_thirty)
## yearID lgID teamID W L R AB H HR BB SO ERA E
## 2740 2013 NL MIA 62 100 513 5449 Low 95 432 1232 3.71 88
## 2741 2013 NL LAN 92 70 649 5491 Medium 138 476 1146 3.25 109
## 2742 2013 NL ARI 81 81 685 5676 Medium 130 519 1142 3.92 75
## 2743 2013 NL SDN 76 86 618 5517 Low 146 467 1309 3.98 83
## 2744 2013 NL SFN 76 86 629 5552 Medium 107 469 1078 4.00 107
## 2745 2013 NL COL 74 88 706 5599 High 159 427 1204 4.44 90
#Display the levels of 'H' within "teams_thirty".
levels(teams_thirty$H)
## [1] "High" "Low" "Medium"
par(mfrow=c(1,1))
#Create a histogram of Regular Season Losses ('L') for teams from 1983-2013.
hist(teams_thirty$L, xlim=c(35,120), xlab = "Number of Regular Season Losses", main = "Histogram of MLB Regular Season Losses from 1983-2013")
par(mfrow=c(1,1))
#Create a boxplot of Regular Season Losses ('L') [for all teams from 1983-2013].
boxplot(teams_thirty$L~teams_thirty$teamID, main = "Boxplot of MLB Regular Season Losses from 1983-2013", ylim = c(35,120), xlab = "MLB Teams", ylab = "Regular Season Losses")
To determine whether the variation observed in the response variable (the number of regular season losses, ‘L’) can be explained by the variation in the treatment (the number of hits, ‘H’), an analysis of covariance (ANCOVA) is performed. The model includes our two explanatory variables, the number of strikeouts (‘SO’) and the earned run average (‘ERA’) earned by a given team in a given year, and it is used to analyze the differences in regular season losses across the levels of hits for the team-seasons from 1983-2013 contained within the dataset.
For the ANCOVA model in this experiment, the null hypothesis (which we will either reject or fail to reject by the end of our analysis) states that the number of hits earned by a given team in a given year does not have a significant effect on the number of regular season losses that team records in that year. Under the null hypothesis, the differences in mean losses across the levels of hits are solely the result of randomization. If we reject the null hypothesis, we infer that these differences in mean losses are caused by something other than randomization, leading us to believe that the variation in mean losses can be explained by the variation in the number of hits. If we fail to reject the null hypothesis, we infer that the variation in mean losses cannot be explained by the variation in the number of hits and, as such, is likely caused by randomization.
In testing the null hypothesis, the explanatory variables for the number of strikeouts and the ERA earned by a given team in a given year are included in the ANCOVA model, so that the effects of the covariates (‘SO’ and ‘ERA’) are controlled while the main effect of the factor (‘H’) is determined.
#Perform an analysis of covariance (ANCOVA) for the different mean values observed in the number of regular season losses earned in a given year by a given team, given the factor 'H' and the explanatory variables 'SO' and 'ERA'.
model_losses <- lm(L~SO+ERA+H,teams_thirty)
anova(model_losses)
## Analysis of Variance Table
##
## Response: L
## Df Sum Sq Mean Sq F value Pr(>F)
## SO 1 7022 7021.7 74.355 < 2.2e-16 ***
## ERA 1 23471 23470.7 248.538 < 2.2e-16 ***
## H 2 9966 4983.2 52.769 < 2.2e-16 ***
## Residuals 875 82630 94.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For the analysis of covariance (ANCOVA) in which ‘H’ is analyzed against the response variable ‘L’ (after accounting for the variables ‘SO’ and ‘ERA’), the F-value for ‘H’ is 52.769 and the p-value is < 2.2e-16. In other words, if the null hypothesis were true, the probability of observing an F-value at least this large would be less than 2.2e-16. Based on this result, we reject the null hypothesis, leading us to believe that the variation observed in the mean number of regular season losses can be explained by the variation in the number of hits earned by a given team in a given year and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)
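The F-value and p-value for ‘H’ can be recomputed by hand from the ANOVA table above, which makes the link between sums of squares, mean squares, and the F distribution explicit. The sketch below uses the sums of squares as printed (rounded to whole numbers), so the result agrees with the table only up to rounding; the variable names are introduced here for illustration.

```r
# Values copied from the ANOVA table above (rounded as printed).
ss_h <- 9966
df_h <- 2
ss_resid <- 82630
df_resid <- 875

# Mean square = sum of squares / degrees of freedom.
ms_h <- ss_h / df_h
ms_resid <- ss_resid / df_resid

# The F statistic is the ratio of the factor's mean square to the residual mean square.
f_h <- ms_h / ms_resid

# Upper-tail probability of the F distribution with (2, 875) degrees of freedom.
p_h <- pf(f_h, df_h, df_resid, lower.tail = FALSE)
print(c(F = f_h, p = p_h))
```

R prints `< 2.2e-16` in the table because that is the machine precision below which it does not report p-values; the value returned by `pf()` is smaller still.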
To take this analysis further, we can compute Tukey Honest Significant Differences (via “TukeyHSD()”) to determine which specific pairs of levels of the factor ‘H’ differ significantly from each other in their mean value of the response variable, ‘L’.
#Perform a TukeyHSD Test for the ANOVA model that considers 'L' and 'H' without any explanatory variables. [Since TukeyHSD tests cannot be performed on an object of class "lm", they would need to be performed on an ANOVA model in this situation. However, since the explanatory variables that are being considered in this analysis are not categorical variables, they should not be included in the ANOVA model that is created for the TukeyHSD test.]
Tukey_losses = TukeyHSD(aov(L~H, teams_thirty), ordered = FALSE, conf.level = 0.95)
Tukey_losses
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = L ~ H, data = teams_thirty)
##
## $H
## diff lwr upr p adj
## Low-High 4.878239 2.255804 7.500673 0.0000419
## Medium-High 3.846655 1.565325 6.127986 0.0002403
## Medium-Low -1.031583 -3.288752 1.225585 0.5311406
par(mfrow=c(1,1))
plot(Tukey_losses)
After observing the results of these Tukey Honest Significant Differences for the ANOVA model of ‘L’ on ‘H’, two of the three pairwise comparisons (“Low-High” and “Medium-High”) show a significant difference in mean losses that is not due solely to randomization: their adjusted p-values are below 0.05 and their confidence intervals exclude zero. The comparison “Medium-Low” does not show a significant difference (adjusted p-value of 0.531). Teams in the “High” level therefore average roughly 4 to 5 fewer losses than teams in the other two levels, which supports rejecting the null hypothesis that the number of hits earned by a given team in a given year has no significant effect on the number of regular season losses that team records.
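The reading of the Tukey output can be checked mechanically: a pairwise comparison is significant at the 5% family-wise level exactly when its 95% interval excludes zero. The sketch below applies that rule to the numbers copied from the output above; the data frame `tukey` and its column names are introduced here for illustration.

```r
# Differences, interval bounds, and adjusted p-values copied from the TukeyHSD() output above.
tukey <- data.frame(
  comparison = c("Low-High", "Medium-High", "Medium-Low"),
  diff  = c(4.878239, 3.846655, -1.031583),
  lwr   = c(2.255804, 1.565325, -3.288752),
  upr   = c(7.500673, 6.127986, 1.225585),
  p_adj = c(0.0000419, 0.0002403, 0.5311406)
)

# An interval that lies entirely above or entirely below zero excludes zero.
tukey$excludes_zero <- tukey$lwr > 0 | tukey$upr < 0

# The adjusted p-value gives the same verdict at the 0.05 level.
tukey$significant <- tukey$p_adj < 0.05
print(tukey)
```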
In estimating the different parameters of the experiment, summary statistics are performed on relevant data in the dataset pertaining to the numbers of regular season losses earned by a given team in a given year for the years contained in “teams_thirty” (which includes both the average number of regular season losses earned by all of the teams individually contained within the dataset and the standard deviation of those regular season losses) and the numbers of hits earned by a given team in a given year contained in “teams_thirty” (which includes both the quantities of earned hits classified as being “High”, “Medium”, and “Low”, respectively, and the standard deviation of those distributed quantities). Additionally, summary statistics are performed on the data pertaining to the two explanatory variables in the experiment, which include the number of strikeous earned by a given team in a given year for the years contained in “teams_thirty” (which includes both the average number of strikeouts earned by all of the teams individually contained within the dataset and the standard deviation of those strikeouts) and the earned run average earned by a given team in a given year for the years contained in “teams_thirty” (which includes both the average earned run average earned by all of the teams individually contained within the dataset and the standard deviation of those earned run averages).
#Display summary statistics of teams_thirty$L.
summary(teams_thirty$L)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 40.0 72.0 79.0 79.9 88.0 119.0
#Display standard deviation of teams_thirty$L.
sd(teams_thirty$L, na.rm = FALSE)
## [1] 11.83356
#Display summary statistics of teams_thirty$H.
summary(teams_thirty$H)
## High Low Medium
## 216 223 441
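#Display standard deviation of the integer level codes of teams_thirty$H (High = 1, Low = 2, Medium = 3). as.integer() makes the conversion explicit, since recent versions of R warn about or reject sd() on a factor.
sd(as.integer(teams_thirty$H), na.rm = FALSE)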
## [1] 0.8258285
#Display summary statistics of teams_thirty$SO.
summary(teams_thirty$SO)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 568 910 1014 1011 1107 1535
#Display standard deviation of teams_thirty$SO.
sd(teams_thirty$SO, na.rm = FALSE)
## [1] 148.4102
#Display summary statistics of teams_thirty$ERA.
summary(teams_thirty$ERA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.910 3.800 4.180 4.211 4.580 6.380
#Display standard deviation of teams_thirty$ERA.
sd(teams_thirty$ERA, na.rm = FALSE)
## [1] 0.5573624
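The summaries above describe each variable over the whole dataset. A per-level breakdown can be more informative when comparing the levels of ‘H’. The following is a minimal sketch using simulated data; the data frame `demo` and its values are hypothetical and only mirror the column names used in this analysis.

```r
# Simulated stand-in for teams_thirty (hypothetical values).
set.seed(2)
demo <- data.frame(
  H   = factor(sample(c("High", "Low", "Medium"), 120, replace = TRUE)),
  L   = round(rnorm(120, mean = 80, sd = 12)),
  SO  = round(rnorm(120, mean = 1010, sd = 150)),
  ERA = round(rnorm(120, mean = 4.2, sd = 0.55), 2)
)

# Mean and standard deviation of each numeric variable within each level of H.
group_means <- aggregate(cbind(L, SO, ERA) ~ H, data = demo, FUN = mean)
group_sds   <- aggregate(cbind(L, SO, ERA) ~ H, data = demo, FUN = sd)
print(group_means)
print(group_sds)

# Number of observations per level, the same counts summary() reports for a factor.
print(table(demo$H))
```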
In verifying the results of this experiment, it’s important to ensure that the data meet the assumptions associated with the modeling approach that was carried out. In particular, an ANCOVA assumes that the model residuals are approximately normally distributed. Until we have checked this assumption, we cannot yet say with confidence that our results are significant and representative of a properly carried-out modeling approach. To check for normality, we can both create Normal Quantile-Quantile (Q-Q) Plots (of the response variable and of the model residuals) and perform a Shapiro-Wilk Test of Normality on the response variable.
#Create a Normal Q-Q Plot for the numbers of regular season losses earned by a given team in a given year.
qqnorm(teams_thirty[,"L"], main = "Normal Q-Q Plot of Regular Season Losses")
qqline(teams_thirty[,"L"])
#Create a Normal Q-Q Plot of the residuals for "model_losses".
qqnorm(residuals(model_losses), main = "Normal Q-Q Plot of Residuals of 'model_losses'")
qqline(residuals(model_losses))
#Perform Shapiro-Wilk Test of Normality on the numbers of regular season losses earned by a given team in a given year (normality is assumed if p > 0.1).
shapiro.test(teams_thirty[,"L"])
##
## Shapiro-Wilk normality test
##
## data: teams_thirty[, "L"]
## W = 0.9958, p-value = 0.01724
Upon both constructing Normal Q-Q Plots and performing a Shapiro-Wilk Test of Normality on the data in this analysis, the evidence for normality is mixed. The p-value of the Shapiro-Wilk Test of Normality for “teams_thirty[,“L”]” (0.01724) is below the 0.1 threshold, so the test itself rejects exact normality. However, with 880 observations the Shapiro-Wilk test is sensitive to even small departures from normality, the W statistic (0.9958) is very close to 1, and the constructed Normal Q-Q Plots did seem to display a trend of data that aligned closely with the Normal Q-Q Line. On that basis, we treat the data as approximately normal for the purposes of this analysis. Note that the comprehensiveness of Lahman’s Baseball Database, which covers baseball statistics gathered from 1871-2013, speaks to the completeness and accuracy of the observations rather than to the shape of their distribution, so it does not by itself justify an assumption of normality.
In further backing up the confidence that we have with our results, we can plot the residuals against the fitted values of the model that was developed in the original analysis of covariance (ANCOVA) as a check on the quality of fit.
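The Shapiro-Wilk test above is applied to the raw response variable. Since the normality assumption concerns the model residuals, the same test can also be applied to them. The sketch below shows the idea on simulated data; the data frame `demo`, the coefficients used to generate it, and the formula `L ~ H + SO + ERA` are assumptions made for illustration and are not taken from “model_losses”.

```r
# Simulated stand-in for teams_thirty (hypothetical values and coefficients).
set.seed(3)
n <- 200
demo <- data.frame(
  H   = factor(sample(c("High", "Low", "Medium"), n, replace = TRUE)),
  SO  = rnorm(n, mean = 1010, sd = 150),
  ERA = rnorm(n, mean = 4.2, sd = 0.55)
)
demo$L <- 30 + 12 * demo$ERA - 0.002 * demo$SO + rnorm(n, sd = 8)

# Assumed model structure: factor H plus the two continuous covariates.
model_demo <- lm(L ~ H + SO + ERA, data = demo)

# Test the residuals, not the raw response, for normality.
res_test <- shapiro.test(residuals(model_demo))
print(res_test)
```

For the real analysis, the equivalent call would presumably be `shapiro.test(residuals(model_losses))`.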
#Create a "Quality of Fit Model" that plots the residuals of "model_losses" against its fitted model.
par(mfrow=c(1,1))
plot(fitted(model_losses),residuals(model_losses), main = "Residuals of 'model_losses' Against Fitted Model 'model_losses'")
Because the resulting plot appears to exhibit a fairly symmetric distribution of residuals clumped around the zero line (aside from the outlier that exists to the right of the plot), the ANCOVA model appears to fit the data well. Thus, we can be reasonably confident in both the modeling approach that we carried out and the dataset that we analyzed in justifying the significance of our results.
In case our modeling assumptions do not hold, we can err on the side of caution by performing the nonparametric Kruskal-Wallis rank sum test to back up our original results. This test helps us decide whether the population distributions are identical without requiring them to be normally distributed.
#Perform Kruskal-Wallis Rank Sum Test on 'L' within the "teams_thirty" dataset for 'H' (identical populations are assumed if p > 0.05).
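Symmetry around the zero line is easier to judge when the plot includes a horizontal reference line and a smoothed trend of the residuals. The following is one possible sketch on simulated data; the variables `x`, `g`, and `y` and the model `model_demo` are hypothetical and stand in for “model_losses”.

```r
# Simulated data (hypothetical): one factor, one covariate, normal errors.
set.seed(4)
n <- 200
x <- rnorm(n)
g <- factor(sample(c("High", "Low", "Medium"), n, replace = TRUE))
y <- 80 + 5 * x + rnorm(n, sd = 8)
model_demo <- lm(y ~ g + x)

fit <- fitted(model_demo)
res <- residuals(model_demo)

# Residuals against fitted values, with a zero reference line and a lowess trend.
plot(fit, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. fitted (simulated data)")
abline(h = 0, lty = 2)
lines(lowess(fit, res), col = "red")
```

A lowess curve that stays close to the dashed line, with roughly constant vertical spread, is consistent with a well-specified model.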
kruskal.test(teams_thirty[,"L"],teams_thirty$H)
##
## Kruskal-Wallis rank sum test
##
## data: teams_thirty[, "L"] and teams_thirty$H
## Kruskal-Wallis chi-squared = 26.2239, df = 2, p-value = 2.021e-06
Since the p-value for the resulting Kruskal-Wallis rank sum test of the response variable ‘L’ across the levels of the factor ‘H’ (2.021e-06) is less than 0.05, we conclude that the distributions of the number of regular season losses are not identical across the “High”, “Medium”, and “Low” groups of hits. This result agrees with the rejection of the null hypothesis in our main experiment, leading us to believe that the number of hits earned by a given team in a given year likely does have a significant effect on the number of regular season losses that a given team earns in a given year. Note that, unlike the ANCOVA, this test does not adjust for ‘SO’ and ‘ERA’. As an alternative to a nonparametric analysis when normality cannot be assumed, transformations such as the “Box-Cox Power Transformation” could have been performed on the data to make it approximate normality. However, these transformations were not necessary for this analysis, since the results of the Kruskal-Wallis rank sum test were sufficient to give us confidence in the results of our analysis.
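Although a Box-Cox transformation was not needed here, the sketch below shows roughly what one would look like using `boxcox()` from the MASS package. The data are simulated and deliberately right-skewed; the data frame `demo`, the lambda grid, and the column `L_bc` are hypothetical choices for illustration.

```r
library(MASS)  # provides boxcox()

# Simulated, strictly positive and right-skewed response (hypothetical values).
set.seed(5)
n <- 200
demo <- data.frame(
  H = factor(sample(c("High", "Low", "Medium"), n, replace = TRUE))
)
demo$L <- exp(rnorm(n, mean = 4.3, sd = 0.2))

# Profile the log-likelihood over a grid of lambda values without plotting.
bc <- boxcox(L ~ H, data = demo, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)

# Apply the Box-Cox power transformation; lambda = 0 corresponds to log().
if (abs(lambda_hat) < 1e-8) {
  demo$L_bc <- log(demo$L)
} else {
  demo$L_bc <- (demo$L^lambda_hat - 1) / lambda_hat
}
```

The transformed response `L_bc` would then replace ‘L’ in the model, and the normality checks would be repeated on the new residuals.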
[1] Lahman, S. (1996-2014). Lahman’s Baseball Database.
The updated version of the database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more (http://www.seanlahman.com/baseball-archive/statistics/). For more details on the latest release, please read the following documentation (http://seanlahman.com/files/database/readme2012.txt). The database can be used on any platform, but please be aware that this is not a standalone application. It is a database that requires Microsoft Access or some other relational database software to be useful.