This homework explores, analyzes, and models a data set containing approximately 2000 records. Each record represents a professional baseball team in a given year from 1871 to 2006 and gives the team's performance for that year, with all statistics adjusted to match the performance of a 162-game season.
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
INDEX | 1 | 2276 | 1268.46353 | 736.34904 | 1270.5 | 1268.56970 | 952.5705 | 1 | 2535 | 2534 | 0.0042149 | -1.2167564 | 15.4346788 |
TARGET_WINS | 2 | 2276 | 80.79086 | 15.75215 | 82.0 | 81.31229 | 14.8260 | 0 | 146 | 146 | -0.3987232 | 1.0274757 | 0.3301823 |
TEAM_BATTING_H | 3 | 2276 | 1469.26977 | 144.59120 | 1454.0 | 1459.04116 | 114.1602 | 891 | 2554 | 1663 | 1.5713335 | 7.2785261 | 3.0307891 |
TEAM_BATTING_2B | 4 | 2276 | 241.24692 | 46.80141 | 238.0 | 240.39627 | 47.4432 | 69 | 458 | 389 | 0.2151018 | 0.0061609 | 0.9810087 |
TEAM_BATTING_3B | 5 | 2276 | 55.25000 | 27.93856 | 47.0 | 52.17563 | 23.7216 | 0 | 223 | 223 | 1.1094652 | 1.5032418 | 0.5856226 |
TEAM_BATTING_HR | 6 | 2276 | 99.61204 | 60.54687 | 102.0 | 97.38529 | 78.5778 | 0 | 264 | 264 | 0.1860421 | -0.9631189 | 1.2691285 |
TEAM_BATTING_BB | 7 | 2276 | 501.55888 | 122.67086 | 512.0 | 512.18331 | 94.8864 | 0 | 878 | 878 | -1.0257599 | 2.1828544 | 2.5713150 |
TEAM_BATTING_SO | 8 | 2174 | 735.60534 | 248.52642 | 750.0 | 742.31322 | 284.6592 | 0 | 1399 | 1399 | -0.2978001 | -0.3207992 | 5.3301912 |
TEAM_BASERUN_SB | 9 | 2145 | 124.76177 | 87.79117 | 101.0 | 110.81188 | 60.7866 | 0 | 697 | 697 | 1.9724140 | 5.4896754 | 1.8955584 |
TEAM_BASERUN_CS | 10 | 1504 | 52.80386 | 22.95634 | 49.0 | 50.35963 | 17.7912 | 0 | 201 | 201 | 1.9762180 | 7.6203818 | 0.5919414 |
TEAM_BATTING_HBP | 11 | 191 | 59.35602 | 12.96712 | 58.0 | 58.86275 | 11.8608 | 29 | 95 | 66 | 0.3185754 | -0.1119828 | 0.9382681 |
TEAM_PITCHING_H | 12 | 2276 | 1779.21046 | 1406.84293 | 1518.0 | 1555.89517 | 174.9468 | 1137 | 30132 | 28995 | 10.3295111 | 141.8396985 | 29.4889618 |
TEAM_PITCHING_HR | 13 | 2276 | 105.69859 | 61.29875 | 107.0 | 103.15697 | 74.1300 | 0 | 343 | 343 | 0.2877877 | -0.6046311 | 1.2848886 |
TEAM_PITCHING_BB | 14 | 2276 | 553.00791 | 166.35736 | 536.5 | 542.62459 | 98.5929 | 0 | 3645 | 3645 | 6.7438995 | 96.9676398 | 3.4870317 |
TEAM_PITCHING_SO | 15 | 2174 | 817.73045 | 553.08503 | 813.5 | 796.93391 | 257.2311 | 0 | 19278 | 19278 | 22.1745535 | 671.1891292 | 11.8621151 |
TEAM_FIELDING_E | 16 | 2276 | 246.48067 | 227.77097 | 159.0 | 193.43798 | 62.2692 | 65 | 1898 | 1833 | 2.9904656 | 10.9702717 | 4.7743279 |
TEAM_FIELDING_DP | 17 | 1990 | 146.38794 | 26.22639 | 149.0 | 147.57789 | 23.7216 | 52 | 228 | 176 | -0.3889390 | 0.1817397 | 0.5879114 |
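Two of the derived columns in the table can be checked by hand: `se` is consistent with the standard error of the mean, sd / sqrt(n), and `range` with max - min. The sketch below does this in base R for the TARGET_WINS row; the variable names are illustrative and the inputs are copied from the table.

```r
# Values copied from the TARGET_WINS row of the summary table.
n        <- 2276
sd_wins  <- 15.75215
min_wins <- 0
max_wins <- 146

# Standard error of the mean: sd / sqrt(n).
se_wins <- sd_wins / sqrt(n)

# Range as reported in the table: max - min.
range_wins <- max_wins - min_wins

print(se_wins)
print(range_wins)
```

The computed standard error agrees with the tabulated 0.3301823 up to the rounding of `sd`.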
The plot above shows that there are missing values for the variables TEAM_PITCHING_SO, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_FIELDING_DP, TEAM_BASERUN_CS, and TEAM_BATTING_HBP.
The variable TEAM_BATTING_HBP has the most missing values at 92% missing or 2085 out of 2276 observations.
The variable TEAM_BASERUN_CS has the second-most missing values at 34% missing or 772 out of 2276 observations.
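The missing counts and percentages quoted above follow directly from the `n` column of the summary table. A minimal base R sketch, with the non-missing counts copied from that table and illustrative variable names:

```r
# Non-missing counts (the n column of the summary table).
n_total <- 2276
n_present <- c(
  TEAM_BATTING_SO  = 2174,
  TEAM_BASERUN_SB  = 2145,
  TEAM_BASERUN_CS  = 1504,
  TEAM_BATTING_HBP = 191,
  TEAM_PITCHING_SO = 2174,
  TEAM_FIELDING_DP = 1990
)

n_missing   <- n_total - n_present
pct_missing <- round(100 * n_missing / n_total)

# Sort from most to least missing.
missing_summary <- data.frame(
  n_missing   = n_missing,
  pct_missing = pct_missing
)
missing_summary <- missing_summary[order(-missing_summary$n_missing), ]
print(missing_summary)
```

Given the raw data frame (named `mbstats` in the R output elsewhere in this report), `colSums(is.na(mbstats))` should give the same counts without retyping them.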
VARIABLE | CORRELATION WITH WINNING |
---|---|
TEAM_BATTING_H | 0.3887675 |
TEAM_BATTING_2B | 0.2891036 |
TEAM_BATTING_3B | 0.1426084 |
TEAM_BATTING_HR | 0.1761532 |
TEAM_BATTING_BB | 0.2325599 |
TEAM_BATTING_SO | -0.0317507 |
TEAM_BASERUN_SB | 0.1351389 |
TEAM_BASERUN_CS | 0.0224041 |
TEAM_BATTING_HBP | 0.0735042 |
TEAM_PITCHING_H | -0.1099371 |
TEAM_PITCHING_HR | 0.1890137 |
TEAM_PITCHING_BB | 0.1241745 |
TEAM_PITCHING_SO | -0.0784361 |
TEAM_FIELDING_E | -0.1764848 |
TEAM_FIELDING_DP | -0.0348506 |
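A table of correlations against the target, like the one above, can be sketched in base R as follows. The report does not say how missing values were handled, so the pairwise-complete setting is an assumption. The helper `cor_with_target` and the toy data frame are hypothetical and are not the baseball records.

```r
# Toy data for illustration only; these are NOT the baseball records.
set.seed(1)
toy <- data.frame(
  TARGET_WINS = rnorm(50, mean = 80, sd = 15),
  stat_a      = rnorm(50),
  stat_b      = rnorm(50)
)
toy$stat_a <- toy$stat_a + 0.05 * toy$TARGET_WINS  # build in a positive link
toy$stat_b[c(3, 17)] <- NA                         # simulate missing values

# Hypothetical helper: correlation of each predictor with the target,
# using only the rows where both values are present (an assumption).
cor_with_target <- function(df, target) {
  predictors <- setdiff(names(df), target)
  sapply(predictors, function(v) {
    cor(df[[v]], df[[target]], use = "pairwise.complete.obs")
  })
}

cors <- cor_with_target(toy, "TARGET_WINS")
print(round(cors, 3))
```

Pairwise deletion keeps as many rows as possible per variable, which matters here because some variables have far fewer observations than others.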
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
X1 | 1 | 2276 | 553.0079 | 166.3574 | 536.5 | 542.6246 | 98.5929 | 0 | 3645 | 3645 | 6.7439 | 96.96764 | 3.487032 |
Shapiro-Wilk normality test
Normality
The results of the Shapiro-Wilk normality test indicate that the walks-allowed statistic is not normally distributed. This is also evident from the skew and kurtosis values of 6.7 and 97, respectively: the large positive skew indicates a long right tail, and the high positive kurtosis indicates a fat-tailed distribution.
##
## Shapiro-Wilk normality test
##
## data: mbstats$TEAM_BATTING_BB
## W = 0.93784, p-value < 2.2e-16
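To make the skew and kurtosis figures from the Normality paragraph more concrete, here is a minimal sketch of moment-based sample skewness and excess kurtosis in base R. This is one common convention and may not match the scaling used to produce the summary table. The function names and toy vectors are illustrative.

```r
# Moment-based sample skewness and excess kurtosis (one common convention;
# other software may scale these slightly differently).
sample_skew <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  sum((x - m)^3) / (length(x) * sd(x)^3)
}

sample_excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  sum((x - m)^4) / (length(x) * sd(x)^4) - 3
}

# Toy vectors, not the baseball data.
symmetric  <- c(1, 2, 3, 4, 5)
right_tail <- c(rep(0, 99), 100)  # one extreme value on the right

print(sample_skew(symmetric))              # symmetric data: skew near zero
print(sample_skew(right_tail))             # long right tail: positive skew
print(sample_excess_kurtosis(right_tail))  # fat tail: positive excess kurtosis
```

A single extreme value is enough to push both measures far from zero, which is the pattern seen in the walks-allowed summary.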
Outliers
## Selecting by .
## .
## 1 53
## 2 272
## 3 273
## 4 296
## 5 298
## 6 299
## 7 342
## 8 391
## 9 393
## 10 409
Extreme Outliers
## Selecting by .
. |
---|
1823 |
1824 |
1825 |
2015 |
2016 |
2137 |
2220 |
2232 |
2233 |
2239 |
## Selecting by .
. |
---|
53 |
272 |
273 |
296 |
298 |
299 |
342 |
391 |
393 |
409 |
. |
---|
53 |
272 |
273 |
296 |
298 |
299 |
342 |
391 |
393 |
409 |
415 |
417 |
860 |
861 |
862 |
982 |
996 |
997 |
998 |
999 |
1191 |
1210 |
1211 |
1345 |
1348 |
1349 |
1350 |
1397 |
1584 |
1812 |
1813 |
1823 |
1824 |
1825 |
2015 |
2016 |
2137 |
2220 |
2232 |
2233 |
2239 |
As shown above, shapiro.test tests the null hypothesis that the sample came from a normal distribution. If the p-value is less than or equal to 0.05, the null hypothesis is rejected. Here the reported p-value is below 2.2e-16, so normality is rejected.
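The decision rule can be sketched in base R as follows. The sample is simulated for illustration and is not the baseball data. The 0.05 threshold is the one stated above.

```r
# Simulated sample for illustration only; not the baseball data.
set.seed(42)
skewed_sample <- rexp(500, rate = 1)  # strongly right-skewed

# shapiro.test() accepts between 3 and 5000 non-missing values.
result <- shapiro.test(skewed_sample)
alpha  <- 0.05

print(result$statistic)  # W statistic
print(result$p.value)

# Reject the null hypothesis of normality when p <= alpha.
reject_normality <- result$p.value <= alpha
print(reject_normality)
```

With samples in the thousands, the test tends to flag even modest departures from normality, so it is usually worth reading the result alongside skew, kurtosis, and a histogram or Q-Q plot.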
cts <- corr.test(attitude[1:3],attitude[4:6])
mbstats$TEAM_PITCHING_H
https://www.r-bloggers.com/how-to-detect-heteroscedasticity-and-rectify-it/
http://www.ianruginski.com/regressionassumptionswithR_tutorial.html