As of August 28, 2014, superseding the version of August 24. Always use the most recent version.
In this study, a two-factor, multi-level experiment will be performed to see if either the number of hits earned by a given team in a given season or the number of homeruns earned by a given team in a given season (or, both via interaction) has a statistically significant effect on the number of wins that a given team earns in a given season. In the dataset, the factor ‘H’ refers to the number of hits that a given team earned in a given year and the factor ‘HR’ refers to the number of homeruns that a given team earned in a given year. Additionally, this analysis’ response variable is referred to in the dataset as ‘W’, which denotes the number of regular season wins that a given team earned in a given year.
##Load in the Teams Dataset
#Get dataset from Project Documents File
teams_raw <- read.csv("~/Academics (RPI)/09. Fall 2014/Design of Experiments/02. Wikibook Recipes/Recipe #03/Teams.csv", header=TRUE)
head(teams_raw)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1871 <NA> PH1 PNA 1 28 NA 21 7 Y
## 2 1871 <NA> CH1 CNA 2 28 NA 19 9 N
## 3 1871 <NA> BS1 BNA 3 31 NA 20 10 N
## 4 1871 <NA> WS3 OLY 4 32 NA 15 15 N
## 5 1871 <NA> NY2 NNA 5 33 NA 16 17 N
## 6 1871 <NA> TRO TRO 6 29 NA 13 15 N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 376 1281 410 66 27 9 46 23 56 NA NA NA 266 137 4.95 27 0 0
## 2 302 1196 323 52 21 10 60 22 69 NA NA NA 241 77 2.76 25 0 1
## 3 401 1372 426 70 37 3 60 19 73 NA NA NA 303 109 3.55 22 1 3
## 4 310 1353 375 54 26 6 48 13 48 NA NA NA 303 137 4.37 32 0 0
## 5 302 1404 403 43 21 1 33 15 46 NA NA NA 313 121 3.72 32 1 0
## 6 351 1248 384 51 34 6 49 19 62 NA NA NA 362 153 5.51 28 0 0
## IPouts HA HRA BBA SOA E DP FP name
## 1 747 329 3 53 16 194 NA 0.84 Philadelphia Athletics
## 2 753 308 6 28 22 218 NA 0.82 Chicago White Stockings
## 3 828 367 2 42 23 225 NA 0.83 Boston Red Stockings
## 4 846 371 4 45 13 217 NA 0.85 Washington Olympics
## 5 879 373 7 42 22 227 NA 0.83 New York Mutuals
## 6 750 431 4 75 12 198 NA 0.84 Troy Haymakers
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 Jefferson Street Grounds NA 102 98 ATH PH1
## 2 Union Base-Ball Grounds NA 104 102 CHI CH1
## 3 South End Grounds I NA 103 98 BOS BS1
## 4 Olympics Grounds NA 94 98 OLY WS3
## 5 Union Grounds (Brooklyn) NA 90 88 NYU NY2
## 6 Haymakers' Grounds NA 101 100 TRO TRO
## teamIDretro
## 1 PH1
## 2 CH1
## 3 BS1
## 4 WS3
## 5 NY2
## 6 TRO
tail(teams_raw)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin
## 2740 2013 NL MIA FLA E 5 162 81 62 100 N N
## 2741 2013 NL LAN LAD W 1 162 81 92 70 Y N
## 2742 2013 NL ARI ARI W 2 162 81 81 81 N N
## 2743 2013 NL SDN SDP W 3 162 81 76 86 N N
## 2744 2013 NL SFN SFG W 4 162 82 76 86 N N
## 2745 2013 NL COL COL W 5 162 81 74 88 N N
## LgWin WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER
## 2740 N N 513 5449 1257 219 31 95 432 1232 78 29 56 26 646 602
## 2741 N N 649 5491 1447 281 17 138 476 1146 78 28 57 48 582 524
## 2742 N N 685 5676 1468 302 31 130 519 1142 62 41 43 43 695 651
## 2743 N N 618 5517 1349 246 26 146 467 1309 118 34 52 34 700 643
## 2744 N N 629 5552 1446 280 35 107 469 1078 67 26 39 42 691 643
## 2745 N N 706 5599 1511 283 36 159 427 1204 112 32 26 35 760 708
## ERA CG SHO SV IPouts HA HRA BBA SOA E DP FP
## 2740 3.71 2 1 36 4380 1376 121 526 1177 88 144 0.986
## 2741 3.25 7 4 46 4351 1321 127 460 1292 109 160 0.982
## 2742 3.92 6 2 38 4485 1460 176 485 1218 75 134 0.988
## 2743 3.98 3 1 40 4365 1407 156 525 1171 83 140 0.986
## 2744 4.00 2 2 41 4342 1380 145 521 1256 107 126 0.982
## 2745 4.44 1 0 35 4308 1545 136 517 1064 90 162 0.986
## name park attendance BPF PPF teamIDBR
## 2740 Miami Marlins Marlins Park 1586322 102 103 MIA
## 2741 Los Angeles Dodgers Dodger Stadium 3743527 95 95 LAD
## 2742 Arizona Diamondbacks Chase Field 2134795 102 102 ARI
## 2743 San Diego Padres Petco Park 2166691 91 91 SDP
## 2744 San Francisco Giants AT&T Park 3326796 90 89 SFG
## 2745 Colorado Rockies Coors Field 2793828 117 118 COL
## teamIDlahman45 teamIDretro
## 2740 FLO MIA
## 2741 LAN LAN
## 2742 ARI ARI
## 2743 SDN SDN
## 2744 SFN SFN
## 2745 COL COL
This analysis considers two factors, ‘H’ and ‘HR’, each with multiple levels. In the original dataset “teams_raw”, both ‘H’ and ‘HR’ are stored as integer variables with no categorical levels. In carrying out this analysis, however, these factors will be transformed into categorical variables with manually defined levels. These factors were selected intuitively, since this analysis aims to determine whether the number of hits and/or the number of homeruns that a given team earns in a given season has a significant effect on the number of regular season wins that the team earns in that season.
#Display the summary statistics of "teams_raw".
summary(teams_raw)
## yearID lgID teamID franchID divID
## Min. :1871 AA : 85 CHN : 138 ATL : 138 :1517
## 1st Qu.:1918 AL :1175 PHI : 131 CHC : 138 C: 215
## Median :1961 FL : 16 PIT : 127 CIN : 132 E: 518
## Mean :1954 NL :1399 CIN : 124 PIT : 132 W: 495
## 3rd Qu.:1990 PL : 8 SLN : 122 STL : 132
## Max. :2013 UA : 12 BOS : 113 PHI : 131
## NA's: 50 (Other):1990 (Other):1942
## Rank G Ghome W
## Min. : 1.000 Min. : 6.0 Min. :44.0 Min. : 0.00
## 1st Qu.: 2.000 1st Qu.:153.0 1st Qu.:77.0 1st Qu.: 66.00
## Median : 4.000 Median :157.0 Median :80.0 Median : 77.00
## Mean : 4.132 Mean :150.1 Mean :78.4 Mean : 74.61
## 3rd Qu.: 6.000 3rd Qu.:162.0 3rd Qu.:81.0 3rd Qu.: 87.00
## Max. :13.000 Max. :165.0 Max. :84.0 Max. :116.00
## NA's :399
## L DivWin WCWin LgWin WSWin R
## Min. : 4.00 :1545 :2181 : 28 : 357 Min. : 24.0
## 1st Qu.: 65.00 N: 982 N: 522 N:2449 N:2274 1st Qu.: 612.0
## Median : 76.00 Y: 218 Y: 42 Y: 268 Y: 114 Median : 690.0
## Mean : 74.61 Mean : 682.1
## 3rd Qu.: 86.00 3rd Qu.: 765.0
## Max. :134.00 Max. :1220.0
##
## AB H X2B X3B
## Min. : 211 Min. : 33 Min. : 3.0 Min. : 0.00
## 1st Qu.:5117 1st Qu.:1297 1st Qu.:192.0 1st Qu.: 31.00
## Median :5379 Median :1393 Median :229.0 Median : 42.00
## Mean :5134 Mean :1345 Mean :226.6 Mean : 47.48
## 3rd Qu.:5514 3rd Qu.:1468 3rd Qu.:269.0 3rd Qu.: 61.00
## Max. :5781 Max. :1783 Max. :376.0 Max. :150.00
##
## HR BB SO SB
## Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 40 1st Qu.:425.0 1st Qu.: 501.0 1st Qu.: 64.0
## Median :105 Median :494.0 Median : 746.0 Median : 96.0
## Mean :100 Mean :473.8 Mean : 726.3 Mean :112.8
## 3rd Qu.:148 3rd Qu.:555.0 3rd Qu.: 955.0 3rd Qu.:143.0
## Max. :264 Max. :835.0 Max. :1535.0 Max. :581.0
## NA's :120 NA's :144
## CS HBP SF RA
## Min. : 0.00 Min. : 26.00 Min. :24.00 Min. : 34.0
## 1st Qu.: 35.00 1st Qu.: 47.00 1st Qu.:39.00 1st Qu.: 608.0
## Median : 46.00 Median : 54.50 Median :44.00 Median : 688.0
## Mean : 49.21 Mean : 56.35 Mean :45.09 Mean : 682.1
## 3rd Qu.: 59.00 3rd Qu.: 64.00 3rd Qu.:50.25 3rd Qu.: 765.0
## Max. :191.00 Max. :103.00 Max. :75.00 Max. :1252.0
## NA's :859 NA's :2325 NA's :2325
## ER ERA CG SHO
## Min. : 25.0 Min. :1.220 Min. : 0.0 Min. : 0.000
## 1st Qu.: 498.0 1st Qu.:3.330 1st Qu.: 16.0 1st Qu.: 6.000
## Median : 590.0 Median :3.820 Median : 46.0 Median : 9.000
## Mean : 569.8 Mean :3.815 Mean : 51.5 Mean : 9.435
## 3rd Qu.: 667.0 3rd Qu.:4.310 3rd Qu.: 78.0 3rd Qu.:12.000
## Max. :1023.0 Max. :8.000 Max. :148.0 Max. :32.000
##
## SV IPouts HA HRA
## Min. : 0.00 Min. : 162 Min. : 49 Min. : 0
## 1st Qu.: 9.00 1st Qu.:4071 1st Qu.:1287 1st Qu.: 43
## Median :23.00 Median :4224 Median :1392 Median :107
## Mean :23.25 Mean :4015 Mean :1345 Mean :100
## 3rd Qu.:37.00 3rd Qu.:4339 3rd Qu.:1471 3rd Qu.:147
## Max. :68.00 Max. :4518 Max. :1993 Max. :241
##
## BBA SOA E DP
## Min. : 0.0 Min. : 0.0 Min. : 47.0 Min. : 18.0
## 1st Qu.:426.0 1st Qu.: 499.0 1st Qu.:119.0 1st Qu.:126.0
## Median :496.0 Median : 721.0 Median :147.0 Median :144.0
## Mean :474.1 Mean : 719.9 Mean :188.6 Mean :140.1
## 3rd Qu.:556.0 3rd Qu.: 951.0 3rd Qu.:219.0 3rd Qu.:160.0
## Max. :827.0 Max. :1428.0 Max. :639.0 Max. :217.0
## NA's :317
## FP name park
## Min. :0.7600 Cincinnati Reds : 123 Wrigley Field : 100
## 1st Qu.:0.9600 Pittsburgh Pirates : 123 Sportsman's Park IV: 90
## Median :0.9700 Philadelphia Phillies: 122 Comiskey Park : 80
## Mean :0.9605 St. Louis Cardinals : 114 Fenway Park II : 80
## 3rd Qu.:0.9800 Chicago White Sox : 113 Forbes Field : 60
## Max. :0.9910 Detroit Tigers : 113 Crosley Field : 58
## (Other) :2037 (Other) :2277
## attendance BPF PPF teamIDBR
## Min. : 6088 Min. : 60.0 Min. : 60.0 CHC : 138
## 1st Qu.: 518051 1st Qu.: 97.0 1st Qu.: 97.0 CIN : 137
## Median :1107122 Median :100.0 Median :100.0 STL : 135
## Mean :1317241 Mean :100.2 Mean :100.2 PHI : 134
## 3rd Qu.:1950099 3rd Qu.:103.0 3rd Qu.:103.0 PIT : 132
## Max. :4483350 Max. :129.0 Max. :141.0 BOS : 121
## NA's :279 (Other):1948
## teamIDlahman45 teamIDretro
## CHN : 138 CHN : 138
## PHI : 131 PHI : 131
## PIT : 127 PIT : 127
## CIN : 124 CIN : 124
## SLN : 122 SLN : 122
## BOS : 113 BOS : 113
## (Other):1990 (Other):1990
#Display the names found in "teams_raw".
names(teams_raw)
## [1] "yearID" "lgID" "teamID" "franchID"
## [5] "divID" "Rank" "G" "Ghome"
## [9] "W" "L" "DivWin" "WCWin"
## [13] "LgWin" "WSWin" "R" "AB"
## [17] "H" "X2B" "X3B" "HR"
## [21] "BB" "SO" "SB" "CS"
## [25] "HBP" "SF" "RA" "ER"
## [29] "ERA" "CG" "SHO" "SV"
## [33] "IPouts" "HA" "HRA" "BBA"
## [37] "SOA" "E" "DP" "FP"
## [41] "name" "park" "attendance" "BPF"
## [45] "PPF" "teamIDBR" "teamIDlahman45" "teamIDretro"
#Display the structure of "teams_raw".
str(teams_raw)
## 'data.frame': 2745 obs. of 48 variables:
## $ yearID : int 1871 1871 1871 1871 1871 1871 1871 1871 1871 1872 ...
## $ lgID : Factor w/ 6 levels "AA","AL","FL",..: NA NA NA NA NA NA NA NA NA NA ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 97 31 24 142 90 136 56 39 111 24 ...
## $ franchID : Factor w/ 120 levels "ALT","ANA","ARI",..: 85 36 13 77 70 109 56 25 91 13 ...
## $ divID : Factor w/ 4 levels "","C","E","W": 1 1 1 1 1 1 1 1 1 1 ...
## $ Rank : int 1 2 3 4 5 6 7 8 9 1 ...
## $ G : int 28 28 31 32 33 29 19 29 25 48 ...
## $ Ghome : int NA NA NA NA NA NA NA NA NA NA ...
## $ W : int 21 19 20 15 16 13 7 10 4 39 ...
## $ L : int 7 9 10 15 17 15 12 19 21 8 ...
## $ DivWin : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ WCWin : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ LgWin : Factor w/ 3 levels "","N","Y": 3 2 2 2 2 2 2 2 2 3 ...
## $ WSWin : Factor w/ 3 levels "","N","Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ R : int 376 302 401 310 302 351 137 249 231 521 ...
## $ AB : int 1281 1196 1372 1353 1404 1248 746 1186 1036 2137 ...
## $ H : int 410 323 426 375 403 384 178 328 274 677 ...
## $ X2B : int 66 52 70 54 43 51 19 35 44 114 ...
## $ X3B : int 27 21 37 26 21 34 8 40 25 31 ...
## $ HR : int 9 10 3 6 1 6 2 7 3 7 ...
## $ BB : int 46 60 60 48 33 49 33 26 38 28 ...
## $ SO : int 23 22 19 13 15 19 9 25 30 26 ...
## $ SB : int 56 69 73 48 46 62 16 18 53 47 ...
## $ CS : int NA NA NA NA NA NA NA NA NA 14 ...
## $ HBP : int NA NA NA NA NA NA NA NA NA NA ...
## $ SF : int NA NA NA NA NA NA NA NA NA NA ...
## $ RA : int 266 241 303 303 313 362 243 341 287 236 ...
## $ ER : int 137 77 109 137 121 153 97 116 108 95 ...
## $ ERA : num 4.95 2.76 3.55 4.37 3.72 5.51 5.17 4.11 4.3 1.99 ...
## $ CG : int 27 25 22 32 32 28 19 23 23 41 ...
## $ SHO : int 0 0 1 0 1 0 1 0 1 3 ...
## $ SV : int 0 1 3 0 0 0 0 0 0 1 ...
## $ IPouts : int 747 753 828 846 879 750 507 762 678 1290 ...
## $ HA : int 329 308 367 371 373 431 261 346 315 438 ...
## $ HRA : int 3 6 2 4 7 4 5 13 3 0 ...
## $ BBA : int 53 28 42 45 42 75 21 53 34 27 ...
## $ SOA : int 16 22 23 13 22 12 17 34 16 0 ...
## $ E : int 194 218 225 217 227 198 163 223 220 263 ...
## $ DP : int NA NA NA NA NA NA NA NA NA NA ...
## $ FP : num 0.84 0.82 0.83 0.85 0.83 0.84 0.8 0.81 0.82 0.87 ...
## $ name : Factor w/ 139 levels "Altoona Mountain City",..: 97 42 17 135 93 131 63 51 111 17 ...
## $ park : Factor w/ 213 levels "","23rd Street Grounds",..: 87 197 170 130 199 80 77 116 4 170 ...
## $ attendance : int NA NA NA NA NA NA NA NA NA NA ...
## $ BPF : int 102 104 103 94 90 101 101 96 97 105 ...
## $ PPF : int 98 102 98 98 88 100 107 100 99 100 ...
## $ teamIDBR : Factor w/ 101 levels "ALT","ANA","ARI",..: 4 21 10 65 62 93 42 25 77 10 ...
## $ teamIDlahman45: Factor w/ 148 levels "ALT","ANA","ARI",..: 96 31 24 140 89 135 56 39 110 24 ...
## $ teamIDretro : Factor w/ 149 levels "ALT","ANA","ARI",..: 96 31 24 141 89 135 56 39 110 24 ...
In this dataset, a few variables can be considered continuous; these are the ones stored as numeric variables. By this standard, the continuous variables in this dataset are ‘ERA’ (a team’s overall earned run average) and ‘FP’ (a team’s overall fielding percentage). The two main factors that our analysis considers, ‘H’ and ‘HR’, currently exist as integer variables. However, once those integer values are grouped into specific “integer-range” levels, these factors will be treated as categorical variables.
This analysis will consider one response variable, ‘W’, which denotes the number of regular season wins that a given team earned in a given year.
As a whole, Lahman’s Baseball Database contains information pertaining to pitching, hitting, and fielding statistics for Major League Baseball from the years 1871 through 2013. It includes data from the two current leagues (American and National), the four other “major” leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875. It contains 2,745 observations of 48 variables, which are defined below (Lahman, 1996-2014) [1]:
yearID [Year]
lgID [League]
teamID [Team]
franchID [Franchise (links to TeamsFranchise table)]
divID [Team’s division]
Rank [Position in final standings]
G [Games played]
Ghome [Games played at home]
W [Wins]
L [Losses]
DivWin [Division Winner (Y or N)]
WCWin [Wild Card Winner (Y or N)]
LgWin [League Champion (Y or N)]
WSWin [World Series Winner (Y or N)]
R [Runs scored]
AB [At bats]
H [Hits by batters]
X2B [Doubles]
X3B [Triples]
HR [Homeruns by batters]
BB [Walks by batters]
SO [Strikeouts by batters]
SB [Stolen bases]
CS [Caught stealing]
HBP [Batters hit by pitch]
SF [Sacrifice flies]
RA [Opponents runs scored]
ER [Earned runs allowed]
ERA [Earned run average]
CG [Complete games]
SHO [Shutouts]
SV [Saves]
IPouts [Outs Pitched (innings pitched x 3)]
HA [Hits allowed]
HRA [Homeruns allowed]
BBA [Walks allowed]
SOA [Strikeouts by pitchers]
E [Errors]
DP [Double Plays]
FP [Fielding percentage]
name [Team’s full name]
park [Name of team’s home ballpark]
attendance [Home attendance total]
BPF [Three-year park factor for batters]
PPF [Three-year park factor for pitchers]
teamIDBR [Team ID used by Baseball Reference website]
teamIDlahman45 [Team ID used in Lahman database version 4.5]
teamIDretro [Team ID used by Retrosheet]
Because this dataset is a comprehensive and accurate statistical resource for baseball statistics gathered from 1871-2013 (one that, in theory, contains every possible observation that could be collected during those years), we make the assumption that this dataset exhibits randomization without any bias.
In this experiment, we are trying to determine whether or not the variation that is observed in the response variable (which corresponds to ‘W’ in this analysis) can be explained by the variation existent in the two different treatments of the experiment (which correspond to ‘H’ and ‘HR’). Therefore, the null hypothesis that is being tested states that the number of hits and homeruns earned by a given team in a given year do not have a significant effect on the number of regular season wins that a given team earns in a given year. In carrying out this analysis, we perform an analysis of variance (ANOVA) for the number of regular season wins (‘W’) to see if there is a significant difference in the means for this response variable when considering both the number of hits (‘H’) and the number of homeruns (‘HR’) that are earned by a given team in a given year, which are contained in this dataset.
The rationale for this design lies primarily in the fact that we’re trying to determine if the number of hits and the number of homeruns earned by a given team in a given season have any effect on the number of regular season wins that a given team earns in a given year. So, since the number of regular season wins is a useful and relevant metric to consider when determining the most important performance-related factors that result in earned wins for a given team, this design of a two-factor, multi-level experiment (as it corresponds to an analysis of variance) was crafted to see if the number of hits and homeruns earned by a given team in a given year has a significant effect on the number of regular season wins that a given team earns in a given year. Therefore, by performing this analysis, we can hope to receive some insight regarding the different factors that come into play that typically result in regular season wins for baseball teams.
Since our original assumption was that the entirety of Lahman’s Baseball Database exhibits randomization, we did not need to randomize our data any further to obtain a completely randomized design. However, to carry out this analysis in a reasonable and logical way, a chronological subset was extracted from Lahman’s Baseball Database: all data from 1973 through 2013, for a total of 41 seasons, was used in this analysis. It’s important to note that this subset contains only 6 variables: ‘yearID’, ‘lgID’, ‘teamID’, ‘W’, ‘H’, and ‘HR’. Since the other 42 variables are not considered in this analysis, they were not included.
#Create subset of the dataset that only includes team data from the years 1973-2013.
teams_forty <- subset(teams_raw, teams_raw$yearID >= 1973, select=c(yearID:teamID,W,H,HR))
In this experiment, there are no replicates or repeated measures present.
In order to transform our integer factor variables into categorical factor variables, blocking was used in this design. In transforming the number of hits, ‘H’, into a categorical variable, three different levels were defined, designating a low number of hits, a medium number of hits, and a high number of hits by a given team in a given year. These levels were determined by calculating the first and third quartiles of the hitting data in the subsetted dataset “teams_forty” (see numeric levels in R code below). In transforming the number of homeruns, ‘HR’, into a categorical variable, two different levels were defined, designating an above average number of homeruns and a below average number of homeruns earned by a given team in a given year. These levels were determined by calculating the mean of all of the homerun data in the subsetted dataset “teams_forty” (see numeric levels in R code below). Therefore, in this newly subsetted dataset, factor ‘H’ now has three distinct levels (“Low”, “Medium”, and “High”) and factor ‘HR’ now has two distinct levels (“Above Average” and “Below Average”).
#Transform 'H' into categorical variables (Low, Medium, and High).
teams_forty$H[teams_forty$H > 0 & teams_forty$H <= 1376] = "Low"
teams_forty$H[teams_forty$H > 1376 & teams_forty$H <= 1494] = "Medium"
teams_forty$H[teams_forty$H > 1494 & teams_forty$H <= 1684] = "High"
#Categorize 'H' as a factor and display its resulting levels.
teams_forty$H = as.factor(teams_forty$H)
levels(teams_forty$H)
## [1] "High" "Low" "Medium"
#Transform 'HR' into categorical variables ("Above Average" and "Below Average").
teams_forty$HR[teams_forty$HR >= mean(teams_forty$HR)] = "Above Average"
teams_forty$HR[teams_forty$HR != "Above Average"] = "Below Average"
#Categorize 'HR' as a factor and display its resulting levels.
teams_forty$HR = as.factor(teams_forty$HR)
levels(teams_forty$HR)
## [1] "Above Average" "Below Average"
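The chained assignments above work because the first assignment converts the column to character, after which the remaining comparisons are made on strings. A minimal sketch of an alternative, assuming the same cut points (1376 and 1494) and using `cut()` and `ifelse()` from base R, is shown below. The data frame `toy` and the column names `H_level` and `HR_level` are hypothetical and only illustrate the idea; the upper bound is left open (`Inf`) rather than fixed at 1684.

```r
# Hypothetical season totals; these values are illustrative, not taken from Teams.csv.
toy <- data.frame(
  H  = c(1300, 1376, 1400, 1494, 1500, 1650),
  HR = c(90, 120, 150, 180, 200, 110)
)

# Quartile-based break points could be computed with quantile(), e.g.
# quantile(toy$H, c(0.25, 0.75)); here the cut points quoted in the text are reused.
toy$H_level <- cut(
  toy$H,
  breaks = c(0, 1376, 1494, Inf),   # intervals (0,1376], (1376,1494], (1494,Inf]
  labels = c("Low", "Medium", "High")
)

# Two-level split at the mean of HR; values at or above the mean are "Above Average".
toy$HR_level <- factor(
  ifelse(toy$HR >= mean(toy$HR), "Above Average", "Below Average"),
  levels = c("Below Average", "Above Average")
)

# Cross-tabulate the two derived factors.
table(toy$H_level, toy$HR_level)
```

With this approach the levels are ordered Low, Medium, High rather than alphabetically, so `levels()` would print them in a different order from the output shown above.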
In beginning to display this data graphically, summary statistics were gathered for the newly created dataset, “teams_forty”. Additionally, histograms and boxplots were created to represent the different observations of regular season wins existent within this subsetted dataset that contains data from 1973-2013.
#Display the summary statistics of "teams_forty".
summary(teams_forty)
## yearID lgID teamID W H
## Min. :1973 AA: 0 ATL : 41 Min. : 37.00 High :283
## 1st Qu.:1984 AL:567 BAL : 41 1st Qu.: 71.00 Low :286
## Median :1994 FL: 0 BOS : 41 Median : 80.00 Medium:563
## Mean :1994 NL:565 CHA : 41 Mean : 79.47
## 3rd Qu.:2004 PL: 0 CHN : 41 3rd Qu.: 89.00
## Max. :2013 UA: 0 CIN : 41 Max. :116.00
## (Other):886
## HR
## Above Average:559
## Below Average:573
##
##
##
##
##
#Display the names found in "teams_forty".
names(teams_forty)
## [1] "yearID" "lgID" "teamID" "W" "H" "HR"
#Display the structure of "teams_forty".
str(teams_forty)
## 'data.frame': 1132 obs. of 6 variables:
## $ yearID: int 1973 1973 1973 1973 1973 1973 1973 1973 1973 1973 ...
## $ lgID : Factor w/ 6 levels "AA","AL","FL",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ teamID: Factor w/ 149 levels "ALT","ANA","ARI",..: 5 16 52 93 83 45 96 66 79 30 ...
## $ W : int 97 89 85 80 74 71 94 88 81 79 ...
## $ H : Factor w/ 3 levels "High","Low","Medium": 3 3 3 3 3 3 3 3 1 3 ...
## $ HR : Factor w/ 2 levels "Above Average",..: 2 1 1 2 2 1 1 2 2 2 ...
#Display the head and tail of "teams_forty".
head(teams_forty)
## yearID lgID teamID W H HR
## 1614 1973 AL BAL 97 Medium Below Average
## 1615 1973 AL BOS 89 Medium Above Average
## 1616 1973 AL DET 85 Medium Above Average
## 1617 1973 AL NYA 80 Medium Below Average
## 1618 1973 AL ML4 74 Medium Below Average
## 1619 1973 AL CLE 71 Medium Above Average
tail(teams_forty)
## yearID lgID teamID W H HR
## 2740 2013 NL MIA 62 Low Below Average
## 2741 2013 NL LAN 92 Medium Below Average
## 2742 2013 NL ARI 81 Medium Below Average
## 2743 2013 NL SDN 76 Low Below Average
## 2744 2013 NL SFN 76 Medium Below Average
## 2745 2013 NL COL 74 High Above Average
#Display the levels of 'H' and 'HR' within "teams_forty".
levels(teams_forty$H)
## [1] "High" "Low" "Medium"
levels(teams_forty$HR)
## [1] "Above Average" "Below Average"
par(mfrow=c(1,1))
#Create a histogram of Regular Season Wins ('W') for teams from 1973-2013.
hist(teams_forty$W, xlim=c(37,116), xlab = "Regular Season Wins")
par(mfrow=c(1,1))
#Create a boxplot of Regular Season Wins ('W') [for all teams from 1973-2013].
boxplot(teams_forty$W~teams_forty$teamID, main = "Regular Season Wins", ylim = c(37,116), xlab = "Teams", ylab = "Wins")
In order to determine if the variation that is observed in the response variable (which corresponds to the number of regular season wins in this analysis) can be explained by the variation existent in the treatments of the experiment (which correspond to both the number of hits and the number of homeruns earned by a given team in a given year), an analysis of variance (ANOVA) is performed as a means for analyzing the differences in regular season wins for each of the different numbers of hits and homeruns earned by a given team in a given year (ranging from 1973-2013) contained within the dataset.
For each of the three ANOVA models that are designed in this experiment, the null hypothesis that is being tested (which we will either reject or fail to reject by the end of our analysis) states that the number of hits and homeruns earned by a given team in a given year do not have a significant effect on the number of regular season wins that a given team earns in a given year, implying that the differences in mean values of the number of regular season wins earned by a given team in a given year were solely the result of randomization in this experiment. In other words, if we reject the null hypothesis, we would infer that the differences in mean values of the numbers of regular season wins earned by a given team in a given year for each of the corresponding numbers of earned hits and homeruns in this dataset is caused by something other than randomization, leading us to believe that the variation that is observed in the mean values of the numbers of regular season wins earned by a given team in a given year can be explained by the variation existent in the different numbers of earned hits and homeruns for a given team in a given year being considered in this analysis. Alternately, if we fail to reject the null hypothesis, we would infer that the variation that is observed in the mean values of the numbers of regular season wins earned by a given team in a given year cannot be explained by the variation existent in the different numbers of earned hits and homeruns for a given team in a given year being considered in this analysis and, as such, is likely caused by randomization.
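The hypotheses described above can be written compactly under the usual two-factor fixed-effects ANOVA model. The notation below is an assumption introduced for illustration and does not appear in the original analysis: $\alpha_i$ denotes the effect of the $i$-th level of ‘H’, $\beta_j$ the effect of the $j$-th level of ‘HR’, $(\alpha\beta)_{ij}$ their interaction, and $k$ indexes the team-seasons within each combination of levels.

```latex
W_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk},
\qquad i = 1, 2, 3, \quad j = 1, 2

H_0^{(H)}: \ \alpha_1 = \alpha_2 = \alpha_3 = 0
\qquad
H_0^{(HR)}: \ \beta_1 = \beta_2 = 0
\qquad
H_0^{(H \times HR)}: \ (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
```

Read this way, “model_hits” tests only the first hypothesis, “model_homeruns” only the second, and “model_interaction” tests all three in a single table.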
#Perform an analysis of variance (ANOVA) for the different mean values observed in the number of regular season wins earned in a given year by a given team, given the factor 'H'.
model_hits <- aov(W~H,teams_forty)
anova(model_hits)
## Analysis of Variance Table
##
## Response: W
## Df Sum Sq Mean Sq F value Pr(>F)
## H 2 36045 18022.3 147.26 < 2.2e-16 ***
## Residuals 1129 138172 122.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Perform an analysis of variance (ANOVA) for the different mean values observed in the number of regular season wins earned in a given year by a given team, given the factor 'HR'.
model_homeruns <- aov(W~HR,teams_forty)
anova(model_homeruns)
## Analysis of Variance Table
##
## Response: W
## Df Sum Sq Mean Sq F value Pr(>F)
## HR 1 18016 18016.0 130.33 < 2.2e-16 ***
## Residuals 1130 156200 138.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Perform an analysis of variance (ANOVA) for the different mean values observed in the number of regular season wins earned in a given year by a given team, given the interaction of 'H' and 'HR'.
model_interaction <- aov(W~H*HR,teams_forty)
anova(model_interaction)
## Analysis of Variance Table
##
## Response: W
## Df Sum Sq Mean Sq F value Pr(>F)
## H 2 36045 18022.3 153.7169 < 2.2e-16 ***
## HR 1 5648 5648.2 48.1745 6.57e-12 ***
## H:HR 2 507 253.5 2.1625 0.1155
## Residuals 1126 132016 117.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow=c(1,1))
#Create an interaction plot that plots the mean values of 'W' against the interaction of both 'H' and 'HR'.
interaction.plot(teams_forty$H,teams_forty$HR,teams_forty$W)
For the analysis of variance (ANOVA) in which ‘H’ is analyzed against the response variable ‘W’, a p-value < 2.2e-16 is returned, indicating that, if the null hypothesis were true, the probability of obtaining an F-value at least as large as the one observed (147.26) through randomization alone would be less than 2.2e-16. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation observed in the mean number of regular season wins earned by a given team in a given year can be explained by the variation in the number of hits earned by a given team in a given year and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)
For the analysis of variance (ANOVA) in which ‘HR’ is analyzed against the response variable ‘W’, a p-value < 2.2e-16 is returned, indicating that, if the null hypothesis were true, the probability of obtaining an F-value at least as large as the one observed (130.33) through randomization alone would be less than 2.2e-16. Therefore, based on this result, we would reject the null hypothesis, leading us to believe that the variation observed in the mean number of regular season wins earned by a given team in a given year can be explained by the variation in the number of homeruns earned by a given team in a given year and, as such, is likely not caused solely by randomization. (See above results for p-value and F-value.)
For the analysis of variance (ANOVA) in which the interaction of ‘H’ and ‘HR’ is analyzed against the response variable ‘W’, a p-value of approximately 0.12 is returned, indicating that, if the null hypothesis were true, the probability of obtaining an F-value at least as large as the one observed (2.1625) through randomization alone would be roughly 0.1155. Additionally, upon generating an interaction plot that plots the mean number of regular season wins earned by a given team in a given year (‘W’) against the interaction of ‘H’ and ‘HR’, the plot suggests that the interaction of these two factors does not have a significant effect on the response variable (since the lines displayed on the plot do not cross over each other). Therefore, based on this result, we would fail to reject the null hypothesis for the interaction term, leading us to infer that the variation observed in the mean number of regular season wins earned by a given team in a given year cannot be explained by the interaction of the numbers of hits and homeruns earned by a given team in a given year and, as such, is likely caused by the two factors acting individually rather than together. (Note: The statistical significance of the effect that ‘H’ and ‘HR’ each individually have on the response variable ‘W’ did not change upon developing an interaction model [see Analysis of Variance Table for “model_interaction”].)
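To make the link between the ANOVA table and its p-value concrete, the short sketch below recomputes the F-value and p-value for the ‘H:HR’ row from the mean squares and degrees of freedom printed in the “model_interaction” table. The variable names are hypothetical; because the inputs are the rounded values from the table, the results should agree with the printed 2.1625 and 0.1155 only approximately.

```r
# Values copied from the "model_interaction" ANOVA table (rounded as printed).
ms_interaction <- 253.5    # Mean Sq for H:HR
ms_residual    <- 117.2    # Mean Sq for Residuals
df_interaction <- 2        # Df for H:HR
df_residual    <- 1126     # Df for Residuals

# F statistic: ratio of the effect mean square to the residual mean square.
f_value <- ms_interaction / ms_residual

# p-value: upper-tail area of the F distribution beyond the observed F,
# i.e. the probability of an F at least this large if the null hypothesis holds.
p_value <- pf(f_value, df_interaction, df_residual, lower.tail = FALSE)

round(f_value, 2)
round(p_value, 3)
```

The same calculation applies to the ‘H’ and ‘HR’ rows; only the mean square and the numerator degrees of freedom change.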
In further carrying out this analysis, we can compute Tukey Honest Significant Differences (via “TukeyHSD()”) as a means for determining the specific levels of each factor in this analysis that differ significantly from each other in their effect on the response variable, ‘W’.
#Perform a TukeyHSD Test for "model_hits".
Tukey_hits = TukeyHSD(model_hits, ordered = FALSE, conf.level = 0.95)
par(mfrow=c(1,1))
plot(Tukey_hits)
#Perform a TukeyHSD Test for "model_homeruns".
Tukey_homeruns = TukeyHSD(model_homeruns, ordered = FALSE, conf.level = 0.95)
par(mfrow=c(1,1))
plot(Tukey_homeruns)
#Perform a TukeyHSD Test for "model_interaction".
Tukey_interaction = TukeyHSD(model_interaction, ordered = FALSE, conf.level = 0.95)
par(mfrow=c(1,1))
plot(Tukey_interaction)
After observing the results of these Tukey Honest Significant Differences for both “model_hits” and “model_homeruns”, it seems clear that each of the level comparisons within those two models, considered individually, suggests a significant effect on ‘W’ that is not due solely to randomization. The adjusted p-value for each level comparison is reported as zero (after rounding), leading us to reject the null hypothesis that the number of hits and the number of homeruns (considered separately) earned by a given team in a given year do not have a significant effect on the number of regular season wins that the team earns in that year.
After observing the results of these Tukey Honest Significant Differences for “model_interaction”, we begin to see a picture that doesn’t obviously align with our interaction model’s ANOVA table or interaction plot. Based on the generated p-values and the confidence level of 0.95 that was set in our TukeyHSD test, it appears that some of the comparisons between level combinations in our interaction model do show a statistically significant difference in ‘W’ that is not due solely to randomization. These comparisons include “Low:Above Average-High:Above Average”, “Medium:Above Average-High:Above Average”, “Low:Below Average-High:Above Average”, “Medium:Below Average-High:Above Average”, “Medium:Above Average-Low:Above Average”, “High:Below Average-Low:Above Average”, “Low:Below Average-Low:Above Average”, “Low:Below Average-Medium:Above Average”, “Medium:Below Average-Medium:Above Average”, “Low:Below Average-High:Below Average”, “Medium:Below Average-High:Below Average”, and “Medium:Below Average-Low:Below Average”, since all of their respective p-values are < 0.05. Therefore, the results of this test suggest that most combinations of ‘H’ and ‘HR’ levels differ significantly in the response variable, ‘W’. Because these pairwise comparisons also reflect the main effects of ‘H’ and ‘HR’, they should be read alongside the non-significant ‘H:HR’ term in the ANOVA table rather than as a separate test of the interaction.
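The significant comparisons listed above can be pulled out of a `TukeyHSD()` result programmatically rather than read off the plot. The sketch below illustrates one way to do this, assuming the built-in `warpbreaks` dataset as a stand-in for “teams_forty” (it also has a two-level and a three-level factor); the object names `fit`, `tukey`, `cells`, and `significant` are hypothetical.

```r
# warpbreaks ships with base R: 'breaks' is the response, 'wool' (2 levels) and
# 'tension' (3 levels) are the factors. It stands in for teams_forty here.
fit <- aov(breaks ~ wool * tension, data = warpbreaks)
tukey <- TukeyHSD(fit, conf.level = 0.95)

# Each element of the result is a matrix with columns diff, lwr, upr, and "p adj".
cells <- tukey[["wool:tension"]]

# Keep only the cell comparisons whose adjusted p-value is below 0.05.
significant <- cells[cells[, "p adj"] < 0.05, , drop = FALSE]

nrow(cells)            # number of pairwise cell comparisons
rownames(significant)  # names of the comparisons flagged as significant
```

Applied to “model_interaction”, the same indexing with `"H:HR"` in place of `"wool:tension"` would return the comparisons between combinations of ‘H’ and ‘HR’ levels.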
To estimate the parameters of the experiment, I computed summary statistics for the relevant variables in “teams_forty”: the number of regular season wins earned by a given team in a given year (its quartiles, mean, and standard deviation); the number of hits earned by a given team in a given year (the count of observations classified as “High”, “Medium”, and “Low”, respectively, and the standard deviation of that classification); and the number of homeruns earned by a given team in a given year (the count of observations classified as either “Above Average” or “Below Average”, respectively, and the standard deviation of that classification). Because ‘H’ and ‘HR’ are stored as factors, their standard deviations are computed on the integer level codes rather than on the raw numbers of hits or homeruns.
#Display summary statistics of teams_forty$W.
summary(teams_forty$W)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.00 71.00 80.00 79.47 89.00 116.00
#Display standard deviation of teams_forty$W.
sd(teams_forty$W, na.rm = FALSE)
## [1] 12.41118
#Display summary statistics of teams_forty$H.
summary(teams_forty$H)
## High Low Medium
## 283 286 563
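#Display standard deviation of teams_forty$H (H is a factor, so this is computed on its integer level codes: 1 = High, 2 = Low, 3 = Medium).
sd(as.integer(teams_forty$H), na.rm = FALSE)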
## [1] 0.8287186
#Display summary statistics of teams_forty$HR.
summary(teams_forty$HR)
## Above Average Below Average
## 559 573
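#Display standard deviation of teams_forty$HR (HR is a factor, so this is computed on its integer level codes: 1 = Above Average, 2 = Below Average).
sd(as.integer(teams_forty$HR), na.rm = FALSE)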
## [1] 0.5001827
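Since the standard deviation of a factor's level codes says little about the response, a more informative summary is the mean and standard deviation of ‘W’ within each factor level. The following is a minimal sketch on simulated data; the data frame `level_demo` and its values are made up for illustration and do not come from “teams_forty”.

```r
# Simulated stand-in for teams_forty (assumption: values are invented).
set.seed(1)
level_demo <- data.frame(
  W = round(rnorm(120, mean = 80, sd = 12)),
  H = factor(sample(c("High", "Low", "Medium"), 120, replace = TRUE))
)

# Mean and standard deviation of the numeric response within each level of H.
w_mean <- tapply(level_demo$W, level_demo$H, mean)
w_sd   <- tapply(level_demo$W, level_demo$H, sd)
print(cbind(mean = w_mean, sd = w_sd))
```

The same two `tapply` calls, with `teams_forty$W` and `teams_forty$H` (or `teams_forty$HR`) in place of the simulated columns, would give the per-level summaries for the actual data.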
In verifying the results of this experiment, it’s important to ensure that the data meet the assumptions underlying the design approach that was carried out. For ANOVA, the assumption checked here is normality, which strictly applies to the residuals of each model. Until we know that normality is at least approximately satisfied, we cannot yet say with confidence that our results are significant and representative of a properly carried-out modeling approach. To check for normality, we can both create Normal Quantile-Quantile (Q-Q) Plots of the response and of each model’s residuals and perform a Shapiro-Wilk Test of Normality on the response.
#Create a Normal Q-Q Plot for the numbers of regular season wins earned by a given team in a given year.
qqnorm(teams_forty[,"W"], main = "Normal Q-Q Plot of Regular Season Wins")
qqline(teams_forty[,"W"])
#Create a Normal Q-Q Plot of the residuals for "model_hits".
qqnorm(residuals(model_hits), main = "Normal Q-Q Plot of Residuals of 'model_hits'")
qqline(residuals(model_hits))
#Create a Normal Q-Q Plot of the residuals for "model_homeruns".
qqnorm(residuals(model_homeruns), main = "Normal Q-Q Plot of Residuals of 'model_homeruns'")
qqline(residuals(model_homeruns))
#Create a Normal Q-Q Plot of the residuals for "model_interaction".
qqnorm(residuals(model_interaction), main = "Normal Q-Q Plot of Residuals of 'model_interaction'")
qqline(residuals(model_interaction))
#Perform Shapiro-Wilk Test of Normality on the numbers of regular season wins earned by a given team in a given year (normality is assumed if p > 0.1).
shapiro.test(teams_forty[,"W"])
##
## Shapiro-Wilk normality test
##
## data: teams_forty[, "W"]
## W = 0.9906, p-value = 1.198e-06
Upon both constructing Normal Q-Q Plots and performing a Shapiro-Wilk Test of Normality on the data in this analysis, the evidence for normality is mixed. The p-value of the Shapiro-Wilk Test for “teams_forty[,"W"]” (1.198e-06) is well below 0.1, so the test formally rejects normality. However, with 1,132 observations the test is sensitive to even small departures from normality, and all of the constructed Normal Q-Q Plots did seem to display a trend of data that aligned closely with the Normal Q-Q Line. On that basis, we proceed under the assumption that the data are approximately normal. Note that although Lahman’s Baseball Database is a comprehensive statistical resource for baseball statistics gathered from 1871-2013, its completeness does not by itself imply that the data are normally distributed, so this assumption rests on the Q-Q Plots alone.
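Because the ANOVA normality assumption concerns the residuals, the Shapiro-Wilk test can also be applied to a model's residuals rather than to the raw response. The following is a minimal sketch on simulated data; `norm_demo`, `fit_norm`, and the group means are hypothetical and were not part of the original analysis.

```r
# Simulated stand-in data (assumption: group means 88, 72, 80 are invented).
set.seed(42)
norm_demo <- data.frame(
  H = factor(rep(c("High", "Low", "Medium"), each = 50))
)
norm_demo$W <- c(88, 72, 80)[as.integer(norm_demo$H)] + rnorm(150, sd = 10)

# Fit a one-factor ANOVA and test its residuals for normality.
fit_norm <- aov(W ~ H, data = norm_demo)
sw_resid <- shapiro.test(residuals(fit_norm))
print(sw_resid)
```

In the analysis above, the equivalent calls would be `shapiro.test(residuals(model_hits))` and likewise for “model_homeruns” and “model_interaction”.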
To further support the confidence that we have in our results, we can generate a “quality of fit” plot for each of the models that were developed in our original analysis of variance (ANOVA), plotting the residuals of each model against its fitted values.
#Create a "Quality of Fit Model" that plots the residuals of "model_hits" against its fitted model.
plot(fitted(model_hits),residuals(model_hits))
#Create a "Quality of Fit Model" that plots the residuals of "model_homeruns" against its fitted model.
plot(fitted(model_homeruns),residuals(model_homeruns))
#Create a "Quality of Fit Model" that plots the residuals of "model_interaction" against its fitted model.
plot(fitted(model_interaction),residuals(model_interaction))
Because the residuals in each of the resulting plots appear to be scattered and clumped around zero, each of the three ANOVA models developed suggests good fit. Thus, we can reasonably rely on both the modeling approach that we carried out and the dataset that we analyzed in justifying the significance of our results.
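A residuals-versus-fitted plot is a visual check, and it can be complemented by a numerical test of whether the spread of ‘W’ is similar across factor levels. The following sketch uses the Fligner-Killeen test (`fligner.test` from base R's stats package), which was not part of the original analysis and is shown here only as one possible complement; the data frame `var_demo` is simulated.

```r
# Simulated stand-in data (assumption: values are invented for illustration).
set.seed(3)
var_demo <- data.frame(
  H = factor(rep(c("High", "Low", "Medium"), each = 40))
)
var_demo$W <- c(88, 72, 80)[as.integer(var_demo$H)] + rnorm(120, sd = 10)

# Fligner-Killeen test of homogeneity of variances across the levels of H.
# A large p-value gives no evidence that the group variances differ.
var_test <- fligner.test(W ~ H, data = var_demo)
print(var_test)
```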
If the modeling assumptions in our analysis do not hold, we can still err on the side of caution by performing the nonparametric Kruskal-Wallis rank sum test to back up our original results (which will help us to decide whether the population distributions are identical without requiring that they be normally distributed).
#Perform Kruskal-Wallis Rank Sum Test on 'W' within the "teams_forty" dataset for both 'H' and 'HR' (identical populations are assumed if p > 0.05).
kruskal.test(teams_forty[,"W"],teams_forty$H)
##
## Kruskal-Wallis rank sum test
##
## data: teams_forty[, "W"] and teams_forty$H
## Kruskal-Wallis chi-squared = 209.2336, df = 2, p-value < 2.2e-16
kruskal.test(teams_forty[,"W"],teams_forty$HR)
##
## Kruskal-Wallis rank sum test
##
## data: teams_forty[, "W"] and teams_forty$HR
## Kruskal-Wallis chi-squared = 111.1244, df = 1, p-value < 2.2e-16
Since the p-values for both of the resulting Kruskal-Wallis rank sum tests that consider the factors ‘H’ and ‘HR’ against the response variable ‘W’ are less than 0.05, we can conclude that the distribution of the number of regular season wins that a given team earns in a given year is not identical across the levels of hits, nor across the levels of homeruns (considered separately). Therefore, this result suggests that we would reject the null hypothesis of our main experiment, leading us to believe that the number of hits and the number of homeruns (considered separately) earned by a given team in a given year likely do have a significant effect on the number of regular season wins that the team earns in that year. Furthermore, as an alternative to a nonparametric analysis when normality cannot be assumed, a transformation such as the “Box-Cox Power Transformation” could have been applied to the data to make it approximately normal. However, such a transformation was not necessary for this analysis, since the nonparametric results that we generated by using the Kruskal-Wallis rank sum test agree with the ANOVA results and give us confidence in the conclusions of our analysis.
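For reference, a Box-Cox power transformation of the kind mentioned above could be sketched as follows. This example relies on `boxcox` from the MASS package (a recommended package that ships with standard R installations) and on simulated, strictly positive data; `bc_demo`, `fit_bc`, and `lambda_hat` are hypothetical names, and none of this was run on “teams_forty”.

```r
library(MASS)  # provides boxcox()

# Simulated positive response (assumption: gamma-distributed with mean 80).
set.seed(7)
bc_demo <- data.frame(
  H = factor(rep(c("High", "Low", "Medium"), each = 40))
)
bc_demo$W <- rgamma(120, shape = 40, rate = 0.5)

# Profile the log-likelihood over a grid of lambda values (no plot).
fit_bc <- lm(W ~ H, data = bc_demo, y = TRUE, qr = TRUE)
bc <- boxcox(fit_bc, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)

# Pick the lambda with the highest log-likelihood and transform the response.
lambda_hat <- bc$x[which.max(bc$y)]
bc_demo$W_bc <- if (abs(lambda_hat) < 1e-8) {
  log(bc_demo$W)
} else {
  (bc_demo$W^lambda_hat - 1) / lambda_hat
}
print(lambda_hat)
```

The transformed column `W_bc` would then replace ‘W’ as the response when refitting the ANOVA models and rechecking the Q-Q plots.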
[1] Lahman, S. (1996-2014). Lahman’s Baseball Database.
The updated version of the database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more (http://www.seanlahman.com/baseball-archive/statistics/). For more details on the latest release, please read the following documentation (http://seanlahman.com/files/database/readme2012.txt). The database can be used on any platform, but please be aware that this is not a standalone application. It is a database that requires Microsoft Access or some other relational database software to be useful.