We will be analyzing a data frame containing 5 variables (“gameno”, “month”, “homeruns”, “playerstatus”, “player”) and 314 observations (the count that str() reports below). The variables are described as follows:
gameno
an integer variable denoting the game number
month
a factor variable with levels “March” through “September” denoting the month of the game
homeruns
an integer vector denoting the number of home runs the player hit in that game
playerstatus
an integer vector equal to “0” if the player played in the game, and “1” if they did not.
player
a factor variable with levels “McGwire” and “Sosa” identifying the player
Exploring the Data:
library(Zelig)
data("homerun")
library(survival)
str(homerun)
'data.frame': 314 obs. of 5 variables:
$ gameno : int 1 2 3 4 5 6 7 8 9 10 ...
$ month : Factor w/ 7 levels "April","August",..: 5 1 1 1 1 1 1 1 1 1 ...
$ homeruns : int 1 1 1 1 0 0 0 0 0 0 ...
$ playerstatus: int 0 0 0 0 0 0 0 0 0 0 ...
$ player : Factor w/ 2 levels "McGwire","Sosa": 1 1 1 1 1 1 1 1 1 1 ...
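The str() output lists the month levels starting with "April" and "August" because R sorts factor levels alphabetically by default. A minimal sketch of putting them in calendar order follows; the month vector here is a made-up stand-in rather than the real column, and the names calendar and month_ordered are introduced only for illustration.

```r
# Toy stand-in for the month column (made-up values, not the real data)
month <- factor(c("April", "June", "March", "September", "May", "July", "August"))

# By default the levels are sorted alphabetically
levels(month)

# Re-declare the levels in calendar order; the values themselves do not change
calendar <- c("March", "April", "May", "June", "July", "August", "September")
month_ordered <- factor(month, levels = calendar)
levels(month_ordered)
```

The first level of a factor is the baseline that lm() compares every other level against, so the level order affects how regression coefficients are read.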
Descriptive Analysis:
Bar Graphs and Histograms
library(ggplot2)
ggplot(homerun, aes(x=player)) + geom_bar(fill = "red")

The bar graph above counts the rows recorded for each player. It shows that Sammy Sosa appears in more games than Mark McGwire, out of the 162 games in a season.
ggplot(homerun, aes(x=player, y=homeruns)) + geom_bar(stat = "identity", fill="red")

The bar graph above stacks the per-game values, so the height of each bar is the total number of home runs that player hit.
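The numbers behind the two bar graphs above can also be computed directly. The sketch below uses a tiny made-up data frame named toy with the same column names as homerun; its values are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical mini data set with the same column names as homerun
toy <- data.frame(
  player   = factor(c("McGwire", "McGwire", "McGwire", "Sosa", "Sosa")),
  homeruns = c(1L, 0L, 2L, 0L, 1L)
)

# Number of rows per player (what geom_bar() counts by default)
games <- table(toy$player)

# Total home runs per player (what the stat = "identity" bars add up to)
totals <- tapply(toy$homeruns, toy$player, sum)

games
totals
```

Running the same two calls on homerun instead of toy would give the exact heights of the bars.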
ggplot(homerun, aes(x=gameno, y=homeruns)) + geom_col(fill = "blue")

The plot shows the number of home runs hit at each game number (out of 162 games), with both players' home runs stacked together.
The relationship between the player and the amount of homeruns hit in that game
m1<- lm(homeruns ~ player, data=homerun)
summary(m1)
Call:
lm(formula = homeruns ~ player, data = homerun)
Residuals:
Min 1Q Median 3Q Max
-0.4516 -0.4516 -0.4151 0.5484 2.5849
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.45161 0.05204 8.677 2.28e-16 ***
playerSosa -0.03652 0.07314 -0.499 0.618
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.648 on 312 degrees of freedom
Multiple R-squared: 0.0007984, Adjusted R-squared: -0.002404
F-statistic: 0.2493 on 1 and 312 DF, p-value: 0.6179
According to the output above, McGwire averaged about 0.45 home runs per game (the intercept), and Sosa averaged about 0.04 fewer per game (the playerSosa coefficient, -0.037). This difference is not statistically significant (p = 0.618).
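With a single two-level factor as the predictor, the regression simply reproduces the two group means: the intercept is the mean of the first level and the coefficient is the difference between the means. The sketch below illustrates this on a made-up data frame called toy; the values are hypothetical and are not taken from homerun.

```r
# Hypothetical data: four games per player
toy <- data.frame(
  player   = factor(rep(c("McGwire", "Sosa"), each = 4)),
  homeruns = c(1, 0, 2, 1, 0, 1, 0, 1)
)

fit <- lm(homeruns ~ player, data = toy)

# Mean home runs per game for each player
group_means <- tapply(toy$homeruns, toy$player, mean)

# (Intercept) equals McGwire's mean; playerSosa equals Sosa's mean minus McGwire's
coef(fit)
group_means
```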
The relationship between homeruns hit and the month of each game
m2<- lm(homeruns ~ month, data=homerun)
summary(m2)
Call:
lm(formula = homeruns ~ month, data = homerun)
Residuals:
Min 1Q Median 3Q Max
-0.5882 -0.4510 -0.3200 0.4800 2.6800
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.320000 0.091332 3.504 0.000527 ***
monthAugust 0.090714 0.125655 0.722 0.470888
monthJuly -0.005185 0.126748 -0.041 0.967395
monthJune 0.268235 0.128528 2.087 0.037715 *
monthMarch 0.180000 0.465703 0.387 0.699385
monthMay 0.130980 0.128528 1.019 0.308967
monthSeptember 0.200000 0.129163 1.548 0.122548
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6458 on 307 degrees of freedom
Multiple R-squared: 0.02329, Adjusted R-squared: 0.004203
F-statistic: 1.22 on 6 and 307 DF, p-value: 0.2955
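Now we are looking at home runs hit by month. Each coefficient is the difference in average home runs per game between that month and April, the reference month. The only coefficient that is significant at the 0.05 level is June (about 0.27 more home runs per game than April, p = 0.038). The overall F-test (p = 0.2955) does not show that month matters as a whole, so we don’t really see much significance here beyond month-to-month variation. I wonder if other factors like weather, place of game and time of game influence the number of home runs in each month.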
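Because every month coefficient is measured against the baseline level, changing the baseline changes what each number means without changing the fitted group means. Here is a small sketch using relevel() on a made-up data frame; toy, month_may and the values in it are hypothetical.

```r
# Hypothetical data: two games in each of three months
toy <- data.frame(
  month    = factor(c("April", "April", "June", "June", "May", "May")),
  homeruns = c(0, 1, 1, 2, 0, 0)
)

# April is the baseline because it is first alphabetically
fit_april <- lm(homeruns ~ month, data = toy)

# Make May the baseline instead
toy$month_may <- relevel(toy$month, ref = "May")
fit_may <- lm(homeruns ~ month_may, data = toy)

coef(fit_april)  # differences from the April mean
coef(fit_may)    # differences from the May mean
```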
Multiple Regression: The Relationship between Homeruns for each player and the month of the games.
m3 <- lm(homeruns ~ player + month, data=homerun)
summary(m3)
Call:
lm(formula = homeruns ~ player + month, data = homerun)
Residuals:
Min 1Q Median 3Q Max
-0.6083 -0.4316 -0.3338 0.4990 2.6603
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.339737 0.099009 3.431 0.000683 ***
playerSosa -0.037955 0.073010 -0.520 0.603533
monthAugust 0.089955 0.125813 0.715 0.475161
monthJuly -0.005944 0.126908 -0.047 0.962672
monthJune 0.268593 0.128683 2.087 0.037693 *
monthMarch 0.179241 0.466260 0.384 0.700932
monthMay 0.129849 0.128699 1.009 0.313804
monthSeptember 0.199241 0.129325 1.541 0.124442
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6466 on 306 degrees of freedom
Multiple R-squared: 0.02415, Adjusted R-squared: 0.00183
F-statistic: 1.082 on 7 and 306 DF, p-value: 0.3747
Holding month constant, Sosa averaged about 0.04 fewer home runs per game than McGwire, but this difference is not significant (p = 0.60). The month coefficients are almost unchanged from the previous model, and the only one that is significant at the 0.05 level is again June (p = 0.038), not July.
A Simpler Table
library(texreg)
screenreg(list(m1,m2,m3))
==================================================
                Model 1     Model 2     Model 3
--------------------------------------------------
(Intercept)       0.45 ***    0.32 ***    0.34 ***
                 (0.05)      (0.09)      (0.10)
playerSosa       -0.04                   -0.04
                 (0.07)                  (0.07)
monthAugust                   0.09        0.09
                             (0.13)      (0.13)
monthJuly                    -0.01       -0.01
                             (0.13)      (0.13)
monthJune                     0.27 *      0.27 *
                             (0.13)      (0.13)
monthMarch                    0.18        0.18
                             (0.47)      (0.47)
monthMay                      0.13        0.13
                             (0.13)      (0.13)
monthSeptember                0.20        0.20
                             (0.13)      (0.13)
--------------------------------------------------
R^2               0.00        0.02        0.02
Adj. R^2         -0.00        0.00        0.00
Num. obs.       314         314         314
RMSE              0.65        0.65        0.65
==================================================
*** p < 0.001, ** p < 0.01, * p < 0.05
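The table rounds the adjusted R-squared values to two decimals, which hides the small differences between the models. As a rough check, they can be recomputed from the R-squared values in the summary() outputs with the usual formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of predictors excluding the intercept. The helper adj_r2 below is written for this sketch and is not part of texreg.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 314  # observations used by all three models

adj_r2(0.0007984, n, p = 1)  # Model 1: player only
adj_r2(0.02329,   n, p = 6)  # Model 2: six month dummies
adj_r2(0.02415,   n, p = 7)  # Model 3: player plus six month dummies
```

The results agree with the Adjusted R-squared lines in the three summaries up to rounding, and they show that adding player to the month model lowers the adjusted value.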
Possible Interaction Effects
m4<-lm(homeruns ~ player*gameno, data = homerun)
summary(m4)
Call:
lm(formula = homeruns ~ player * gameno, data = homerun)
Residuals:
Min 1Q Median 3Q Max
-0.5353 -0.4517 -0.3905 0.5482 2.6065
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.508e-01 1.053e-01 4.282 2.47e-05 ***
playerSosa -1.589e-01 1.473e-01 -1.079 0.282
gameno 9.515e-06 1.102e-03 0.009 0.993
playerSosa:gameno 1.484e-03 1.545e-03 0.960 0.338
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6481 on 310 degrees of freedom
Multiple R-squared: 0.006894, Adjusted R-squared: -0.002717
F-statistic: 0.7173 on 3 and 310 DF, p-value: 0.5423
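I wanted to see whether the game number had any effect on the number of home runs each player hit, and whether that effect differed between the players. Neither the gameno coefficient (p = 0.993) nor the player-by-gameno interaction (p = 0.338) has a p-value below .05, so there is no evidence of a significant effect. The playerSosa coefficient is negative (-0.159), meaning Sosa's fitted value at game number 0 is lower than McGwire's, but it is not significant either (p = 0.282).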
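In an interaction model the coefficients combine into a separate line for each player. The sketch below plugs in the estimates from the summary(m4) output to get each player's implied slope and fitted values; predict_hr is a helper defined only for this illustration, and the fitted values are point estimates that ignore the large standard errors.

```r
# Coefficients copied from the summary(m4) output
b0      <- 4.508e-01   # (Intercept): McGwire at gameno = 0
b_sosa  <- -1.589e-01  # playerSosa: shift in intercept for Sosa
b_game  <- 9.515e-06   # gameno: slope for McGwire
b_inter <- 1.484e-03   # playerSosa:gameno: extra slope for Sosa

# Implied change in home runs per game for each additional game
slope_mcgwire <- b_game
slope_sosa    <- b_game + b_inter

# Fitted home runs per game at a given game number (sosa = 0 or 1)
predict_hr <- function(gameno, sosa) {
  b0 + b_sosa * sosa + (b_game + b_inter * sosa) * gameno
}

predict_hr(c(1, 162), sosa = 0)  # McGwire at games 1 and 162
predict_hr(c(1, 162), sosa = 1)  # Sosa at games 1 and 162
```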
Overall, I wanted to see what affected how many home runs the players hit. I would have liked to look at more data, such as age, weight, years played, and the number of injuries or of seasons or games missed. McGwire ended up beating Sosa by 4 home runs: McGwire had 70 at the end of the season and Sosa had 66.
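The season totals can be tied back to the first regression with a little arithmetic. The sketch below assumes that each row is one game for one player and that the home runs in the data add up to the season totals of 70 and 66; the row counts n_mcgwire and n_sosa are inferred under that assumption rather than read from the data.

```r
# Values taken from the text and from the summary(m1) output
total_mcgwire <- 70
total_sosa    <- 66
n_rows        <- 314       # observations reported by str()
intercept     <- 0.45161   # McGwire's mean home runs per game
coef_sosa     <- -0.03652  # Sosa's mean minus McGwire's mean

# Inferred number of rows per player (an assumption, not read from the data)
n_mcgwire <- round(total_mcgwire / intercept)
n_sosa    <- n_rows - n_mcgwire

# Sosa's mean from the totals versus the mean implied by the model
total_sosa / n_sosa
intercept + coef_sosa
```

The two final values agree to about four decimal places, which is consistent with the per-game averages from the regression being the season totals spread over each player's games.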