Introduction

This homework explores, analyzes, and models a data set containing 2,276 records. Each record represents a professional baseball team in one season between 1871 and 2006 and holds that team's performance for the year, with all statistics adjusted to match a 162-game season.

Provided Variables

Derived Variables

Exploratory Data Analysis

variable vars n mean sd median trimmed mad min max range skew kurtosis se
INDEX 1 2276 1268.46353 736.34904 1270.5 1268.56970 952.5705 1 2535 2534 0.0042149 -1.2167564 15.4346788
TARGET_WINS 2 2276 80.79086 15.75215 82.0 81.31229 14.8260 0 146 146 -0.3987232 1.0274757 0.3301823
TEAM_BATTING_H 3 2276 1469.26977 144.59120 1454.0 1459.04116 114.1602 891 2554 1663 1.5713335 7.2785261 3.0307891
TEAM_BATTING_2B 4 2276 241.24692 46.80141 238.0 240.39627 47.4432 69 458 389 0.2151018 0.0061609 0.9810087
TEAM_BATTING_3B 5 2276 55.25000 27.93856 47.0 52.17563 23.7216 0 223 223 1.1094652 1.5032418 0.5856226
TEAM_BATTING_HR 6 2276 99.61204 60.54687 102.0 97.38529 78.5778 0 264 264 0.1860421 -0.9631189 1.2691285
TEAM_BATTING_BB 7 2276 501.55888 122.67086 512.0 512.18331 94.8864 0 878 878 -1.0257599 2.1828544 2.5713150
TEAM_BATTING_SO 8 2174 735.60534 248.52642 750.0 742.31322 284.6592 0 1399 1399 -0.2978001 -0.3207992 5.3301912
TEAM_BASERUN_SB 9 2145 124.76177 87.79117 101.0 110.81188 60.7866 0 697 697 1.9724140 5.4896754 1.8955584
TEAM_BASERUN_CS 10 1504 52.80386 22.95634 49.0 50.35963 17.7912 0 201 201 1.9762180 7.6203818 0.5919414
TEAM_BATTING_HBP 11 191 59.35602 12.96712 58.0 58.86275 11.8608 29 95 66 0.3185754 -0.1119828 0.9382681
TEAM_PITCHING_H 12 2276 1779.21046 1406.84293 1518.0 1555.89517 174.9468 1137 30132 28995 10.3295111 141.8396985 29.4889618
TEAM_PITCHING_HR 13 2276 105.69859 61.29875 107.0 103.15697 74.1300 0 343 343 0.2877877 -0.6046311 1.2848886
TEAM_PITCHING_BB 14 2276 553.00791 166.35736 536.5 542.62459 98.5929 0 3645 3645 6.7438995 96.9676398 3.4870317
TEAM_PITCHING_SO 15 2174 817.73045 553.08503 813.5 796.93391 257.2311 0 19278 19278 22.1745535 671.1891292 11.8621151
TEAM_FIELDING_E 16 2276 246.48067 227.77097 159.0 193.43798 62.2692 65 1898 1833 2.9904656 10.9702717 4.7743279
TEAM_FIELDING_DP 17 1990 146.38794 26.22639 149.0 147.57789 23.7216 52 228 176 -0.3889390 0.1817397 0.5879114
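
The columns in the table above match the output format of psych::describe, although the call that produced it is not shown. As a minimal sketch in base R, the helper below (describe_basic, a hypothetical name) computes a few of the same columns for a toy vector, then checks the relationship se = sd / sqrt(n) against the TARGET_WINS row.

```r
# Hypothetical helper: a few of the summary columns shown in the table.
describe_basic <- function(x) {
  x <- x[!is.na(x)]                 # n counts non-missing values only
  n <- length(x)
  c(n      = n,
    mean   = mean(x),
    sd     = sd(x),
    median = median(x),
    mad    = mad(x),                # scaled by 1.4826 by default
    min    = min(x),
    max    = max(x),
    range  = max(x) - min(x),
    se     = sd(x) / sqrt(n))       # standard error of the mean
}

toy <- c(70, 82, NA, 91, 65, 88)    # illustrative values, not from the data set
toy_stats <- describe_basic(toy)
print(round(toy_stats, 3))

# Cross-check with the TARGET_WINS row: sd = 15.75215, n = 2276
se_wins <- 15.75215 / sqrt(2276)
print(se_wins)                      # should be close to the reported 0.3301823
```

The lower n for some variables (for example 191 for TEAM_BATTING_HBP) reflects missing values, which is why the helper drops NA before counting.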

Missing Values

The plot above shows that there are missing values for the variables TEAM_PITCHING_SO, TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_FIELDING_DP, TEAM_BASERUN_CS, and TEAM_BATTING_HBP.

The variable TEAM_BATTING_HBP has the most missing values at 92% missing or 2085 out of 2276 observations.

The variable TEAM_BASERUN_CS has the second-most missing values at 34% missing or 772 out of 2276 observations.
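
The counts and percentages quoted here can be recomputed from the n column of the summary table. The sketch below does that arithmetic directly. The vector of non-missing counts is copied from the table, and the object names are illustrative.

```r
# Non-missing counts (column n of the summary table) for the variables with gaps.
n_total <- 2276
n_present <- c(TEAM_BATTING_SO  = 2174,
               TEAM_BASERUN_SB  = 2145,
               TEAM_BASERUN_CS  = 1504,
               TEAM_BATTING_HBP = 191,
               TEAM_PITCHING_SO = 2174,
               TEAM_FIELDING_DP = 1990)

n_missing   <- n_total - n_present
pct_missing <- round(100 * n_missing / n_total)

# Sort so the variables with the most missing values come first.
missing_tbl <- data.frame(n_missing, pct_missing)
missing_tbl <- missing_tbl[order(-missing_tbl$n_missing), ]
print(missing_tbl)

# With the raw data frame loaded (named mbstats in the R output of this report),
# the same counts would come from: colSums(is.na(mbstats))
```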

Correlation

VARIABLE CORRELATION WITH WINNING
TEAM_BATTING_H 0.3887675
TEAM_BATTING_2B 0.2891036
TEAM_BATTING_3B 0.1426084
TEAM_BATTING_HR 0.1761532
TEAM_BATTING_BB 0.2325599
TEAM_BATTING_SO -0.0317507
TEAM_BASERUN_SB 0.1351389
TEAM_BASERUN_CS 0.0224041
TEAM_BATTING_HBP 0.0735042
TEAM_PITCHING_H -0.1099371
TEAM_PITCHING_HR 0.1890137
TEAM_PITCHING_BB 0.1241745
TEAM_PITCHING_SO -0.0784361
TEAM_FIELDING_E -0.1764848
TEAM_FIELDING_DP -0.0348506
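
The report does not show how the correlations with TARGET_WINS were computed or how missing values were handled. One plausible approach, sketched here on simulated data, is to correlate each predictor with the target using pairwise complete observations. The toy column names HITS and ERRORS and the choice of use = "pairwise.complete.obs" are assumptions.

```r
set.seed(42)                         # reproducible toy data, not the baseball data
toy <- data.frame(
  TARGET_WINS = rnorm(50, mean = 80, sd = 15),
  HITS        = rnorm(50, mean = 1450, sd = 140),
  ERRORS      = rnorm(50, mean = 250, sd = 60)
)
toy$HITS[c(3, 17)] <- NA             # simulate missing values

predictors <- setdiff(names(toy), "TARGET_WINS")

# Correlate each predictor with the target, using only rows where both are present.
cor_with_wins <- sapply(predictors, function(v) {
  cor(toy[[v]], toy$TARGET_WINS, use = "pairwise.complete.obs")
})

print(round(cor_with_wins, 4))
```

Because the toy columns are generated independently, their correlations carry no meaning. Only the mechanics transfer to the real data.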

Among Predictor variables

Variable: TEAM_PITCHING_BB – Walks Allowed

variable vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 2276 553.0079 166.3574 536.5 542.6246 98.5929 0 3645 3645 6.7439 96.96764 3.487032

Shapiro-Wilk normality test

Normality

The results of the Shapiro-Wilk normality test indicate that the walks-allowed statistic is not normally distributed. This is also evident from the skew and kurtosis values of 6.7 and 97, respectively. The high positive kurtosis indicates a fat-tailed distribution.

## 
##  Shapiro-Wilk normality test
## 
## data:  mbstats$TEAM_BATTING_BB
## W = 0.93784, p-value < 2.2e-16
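
To make the skew and kurtosis figures concrete, the sketch below computes moment-based sample skewness and excess kurtosis in base R. This is one common definition and an assumption here. The values reported in the summary table may use a slightly different small-sample correction, and the function names are illustrative.

```r
# Moment-based sample skewness and excess kurtosis (one common definition).
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric    <- c(-2, -1, 0, 1, 2)
right_skewed <- c(1, 1, 1, 2, 2, 3, 50)  # one large value pulls the tail right

print(skewness(symmetric))               # symmetric data: skewness of 0
print(skewness(right_skewed))            # positive: long right tail
print(excess_kurtosis(right_skewed))     # positive: heavier tails than a normal
```

A normal distribution has skewness 0 and excess kurtosis 0, so values as large as 6.7 and 97 point to a long right tail driven by a few very large observations.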

Outliers

## Selecting by .
##      .
## 1   53
## 2  272
## 3  273
## 4  296
## 5  298
## 6  299
## 7  342
## 8  391
## 9  393
## 10 409

Extreme Outliers

## Selecting by .
.: 1823, 1824, 1825, 2015, 2016, 2137, 2220, 2232, 2233, 2239
## Selecting by .
.: 53, 272, 273, 296, 298, 299, 342, 391, 393, 409
.: 53, 272, 273, 296, 298, 299, 342, 391, 393, 409, 415, 417, 860, 861, 862, 982, 996, 997, 998, 999, 1191, 1210, 1211, 1345, 1348, 1349, 1350, 1397, 1584, 1812, 1813, 1823, 1824, 1825, 2015, 2016, 2137, 2220, 2232, 2233, 2239
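
The report does not state the rule used to flag outliers and extreme outliers. A common convention, assumed here rather than taken from the source, is Tukey's fences: values beyond 1.5 times the interquartile range from the quartiles are outliers, and values beyond 3 times are extreme outliers. The sketch below returns row positions for a toy vector.

```r
# Toy vector: 20 typical values plus one mild and one extreme high value.
x <- c(rep(10:14, each = 4), 20, 100)

q   <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Row positions outside the 1.5 * IQR fences (outliers) ...
outlier_rows <- which(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
# ... and outside the 3 * IQR fences (extreme outliers).
extreme_rows <- which(x < q[1] - 3 * iqr | x > q[2] + 3 * iqr)

print(outlier_rows)
print(extreme_rows)
```

Under this rule every extreme outlier is also an outlier, so the second set of positions is always a subset of the first.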

As shown above, shapiro.test tests the null hypothesis that the sample came from a normal distribution. If the p-value is <= 0.05, we reject that null hypothesis. Here the p-value is below 2.2e-16, so normality is rejected.
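
The decision rule can be sketched on simulated data as follows. The samples, the 0.05 threshold variable alpha, and the object names are illustrative and are not part of the original analysis.

```r
set.seed(1)                           # simulated data, not the baseball data
normal_like  <- rnorm(200, mean = 550, sd = 100)
right_skewed <- rexp(200, rate = 1 / 550)

alpha <- 0.05
res_normal <- shapiro.test(normal_like)
res_skewed <- shapiro.test(right_skewed)

# Reject the null hypothesis of normality when the p-value is at or below alpha.
reject_normal <- res_normal$p.value <= alpha
reject_skewed <- res_skewed$p.value <= alpha

print(res_skewed)
print(c(normal_like = reject_normal, right_skewed = reject_skewed))
```

Note that shapiro.test accepts between 3 and 5000 non-missing values. With samples as large as the 2,276 records here, even small departures from normality tend to produce very small p-values, so the test is best read alongside the skew and kurtosis figures.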

library(psych)  # provides corr.test()
cts <- corr.test(attitude[1:3], attitude[4:6])  # correlate attitude columns 1-3 with columns 4-6