Loading the Dataset

Reading the cleaned data from a local CSV file.

library(dplyr)  # provides %>%, arrange(), select(), group_by(), and slice_head() used below
mlb <- read.csv("MLB_cleaned.csv")

First Few Rows

head(mlb)
##   First.Name Last.Name Team      Position Height.inches. Weight.pounds.   Age
## 1       Jeff    Mathis  ANA       Catcher             72            180 23.92
## 2       Mike    Napoli  ANA       Catcher             72            205 25.33
## 3       Jose    Molina  ANA       Catcher             74            220 31.74
## 4      Howie  Kendrick  ANA First Baseman             70            180 23.64
## 5     Kendry   Morales  ANA First Baseman             73            220 23.70
## 6      Casey  Kotchman  ANA First Baseman             75            210 24.02

Data Summary

knitr::kable(summary(mlb))
| First.Name       | Last.Name        | Team             | Position         | Height.inches. | Weight.pounds. | Age           |
|:-----------------|:-----------------|:-----------------|:-----------------|:---------------|:---------------|:--------------|
| Length:1034      | Length:1034      | Length:1034      | Length:1034      | Min. :67.0     | Min. :150.0    | Min. :20.90   |
| Class :character | Class :character | Class :character | Class :character | 1st Qu.:72.0   | 1st Qu.:187.0  | 1st Qu.:25.44 |
| Mode :character  | Mode :character  | Mode :character  | Mode :character  | Median :74.0   | Median :200.0  | Median :27.93 |
| NA               | NA               | NA               | NA               | Mean :73.7     | Mean :201.7    | Mean :28.74   |
| NA               | NA               | NA               | NA               | 3rd Qu.:75.0   | 3rd Qu.:215.0  | 3rd Qu.:31.23 |
| NA               | NA               | NA               | NA               | Max. :83.0     | Max. :290.0    | Max. :48.52   |

Distribution of Height

plotly::plot_ly(mlb,  x = ~Height.inches., type = "histogram")

Most players are between about 72 and 76 inches tall, as shown by the histogram of player heights (in inches). The two shortest players, represented by the leftmost bar, are 67 in. tall, while the tallest player is 83 in. tall.
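The reading of the histogram can also be checked numerically. The sketch below uses a hypothetical helper, `height_summary()`, which is not part of the original analysis, and runs it on the six heights printed by `head(mlb)` only so that the block is self-contained. On the full data one would presumably pass `mlb$Height.inches.` instead.

```r
# Hypothetical helper (not in the original report): share of heights inside
# [lower, upper], the extremes, and how many players share the minimum.
height_summary <- function(heights, lower = 72, upper = 76) {
  list(
    share_in_range = mean(heights >= lower & heights <= upper),
    shortest       = min(heights),
    tallest        = max(heights),
    n_shortest     = sum(heights == min(heights))
  )
}

# Heights of the six players shown by head(mlb); a tiny illustrative sample,
# NOT the full 1034-row dataset.
sample_heights <- c(72, 72, 74, 70, 73, 75)

res <- height_summary(sample_heights)
print(res)
```

Applied to the full column, `n_shortest` would confirm how many players stand at the minimum height, and `share_in_range` would quantify "most".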

Analyzing Team TEX

Number of players:

playerCount <- mlb$Team %>% table %>% data.frame
colnames(playerCount) <- c("Team", "Number.Players")
playerCount
##    Team Number.Players
## 1   ANA             35
## 2   ARZ             28
## 3   ATL             37
## 4   BAL             35
## 5   BOS             36
## 6   CHC             36
## 7   CIN             36
## 8   CLE             35
## 9   COL             35
## 10  CWS             33
## 11  DET             37
## 12  FLA             32
## 13  HOU             34
## 14   KC             35
## 15   LA             33
## 16  MIN             33
## 17  MLW             35
## 18  NYM             38
## 19  NYY             32
## 20  OAK             37
## 21  PHI             36
## 22  PIT             35
## 23   SD             33
## 24  SEA             34
## 25   SF             34
## 26  STL             32
## 27   TB             33
## 28  TEX             35
## 29  TOR             34
## 30  WAS             36
plotly::plot_ly(playerCount, x = ~Team, y = ~Number.Players, type = "bar")

Texas has 35 players in total, which places it near the middle of the distribution of team sizes.
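One way to make "near the middle" concrete is to compare the TEX count with the league median. The sketch below re-enters the counts from the printed `playerCount` table by hand so that it runs on its own. The variable names are illustrative and not from the original report.

```r
# Roster sizes copied from the playerCount table above.
player_counts <- c(
  ANA = 35, ARZ = 28, ATL = 37, BAL = 35, BOS = 36, CHC = 36, CIN = 36,
  CLE = 35, COL = 35, CWS = 33, DET = 37, FLA = 32, HOU = 34, KC  = 35,
  LA  = 33, MIN = 33, MLW = 35, NYM = 38, NYY = 32, OAK = 37, PHI = 36,
  PIT = 35, SD  = 33, SEA = 34, SF  = 34, STL = 32, TB  = 33, TEX = 35,
  TOR = 34, WAS = 36
)

tex_count     <- player_counts[["TEX"]]
league_median <- median(player_counts)          # 35
teams_fewer   <- sum(player_counts < tex_count) # 13 teams have fewer players
teams_more    <- sum(player_counts > tex_count) # 9 teams have more players

cat("TEX:", tex_count, "| league median:", league_median,
    "| teams with fewer:", teams_fewer, "| teams with more:", teams_more, "\n")
```

Since the TEX count equals the league median, the description of Texas as a mid-sized roster holds.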

Youngest player:

TEX <- mlb %>% subset(Team == "TEX")
TEX <- TEX %>% arrange(Age)  # sort once, youngest first
knitr::kable(head(TEX, n = 5))
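| First.Name | Last.Name | Team | Position         | Height.inches. | Weight.pounds. |   Age |
|:-----------|:----------|:-----|:-----------------|---------------:|---------------:|------:|
| Joaquin    | Arias     | TEX  | Shortstop        |             73 |            160 | 22.44 |
| Brandon    | McCarthy  | TEX  | Starting Pitcher |             79 |            190 | 23.65 |
| Edinson    | Volquez   | TEX  | Starting Pitcher |             73 |            190 | 23.66 |
| Scott      | Feldman   | TEX  | Relief Pitcher   |             77 |            210 | 24.06 |
| Wes        | Littleton | TEX  | Relief Pitcher   |             74 |            210 | 24.49 |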
plotly::plot_ly(TEX,  x = ~Last.Name, y = ~Age, type = "bar")
TEX %>% slice_head(n = 1) %>% knitr::kable()
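| First.Name | Last.Name | Team | Position  | Height.inches. | Weight.pounds. |   Age |
|:-----------|:----------|:-----|:----------|---------------:|---------------:|------:|
| Joaquin    | Arias     | TEX  | Shortstop |             73 |            160 | 22.44 |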

The youngest player is Joaquin Arias, who is 22.44 years old.
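Sorting the whole roster is not strictly necessary to find the youngest player. As an alternative sketch, base R's `which.min()` returns the row index of the minimum age directly. The data frame below, `tex_sample`, is a hand-typed copy of the five rows printed above, used only to keep the example self-contained.

```r
# Five youngest TEX players as printed above (illustrative subset only).
tex_sample <- data.frame(
  First.Name = c("Joaquin", "Brandon", "Edinson", "Scott", "Wes"),
  Last.Name  = c("Arias", "McCarthy", "Volquez", "Feldman", "Littleton"),
  Age        = c(22.44, 23.65, 23.66, 24.06, 24.49),
  stringsAsFactors = FALSE
)

# which.min() gives the index of the first minimum, so no sort is needed.
youngest <- tex_sample[which.min(tex_sample$Age), ]
print(youngest)
```

Note that `which.min()` returns only the first match, so ties for the minimum age would need a filter such as `Age == min(Age)` instead.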

Players per position:

positions <- TEX$Position %>% table %>% data.frame
colnames(positions) <- c("Position", "Number.Players")
positions
##            Position Number.Players
## 1           Catcher              4
## 2 Designated Hitter              1
## 3     First Baseman              1
## 4        Outfielder              6
## 5    Relief Pitcher             11
## 6    Second Baseman              1
## 7         Shortstop              2
## 8  Starting Pitcher              8
## 9     Third Baseman              1
plotly::plot_ly(positions, x = ~Position, y = ~Number.Players, type = "bar")

Age distribution of players:

# Age frequency table and bar graph
ageFrequency <- TEX$Age %>% floor %>% table %>% data.frame
colnames(ageFrequency) <- c("Age", "Number.Players")
ageFrequency
##    Age Number.Players
## 1   22              1
## 2   23              2
## 3   24              6
## 4   25              3
## 5   26              6
## 6   27              3
## 7   29              4
## 8   30              2
## 9   31              1
## 10  32              3
## 11  35              2
## 12  38              1
## 13  39              1
plotly::plot_ly(ageFrequency, x = ~Age, y = ~Number.Players, type = "bar")
# Age box plot that displays min and max
ageDistribution <- floor(TEX$Age)
plotly::plot_ly(y = ageDistribution, type = "box")
# Age range
ageDistribution %>% min
## [1] 22
ageDistribution %>% max
## [1] 39

The age range is 22-39 years old.
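The minimum and maximum can also be obtained in a single call with base R's `range()`. The sketch below rebuilds the floored ages from the printed `ageFrequency` table so that it runs without the CSV. The names `age_counts` and `ages` are illustrative.

```r
# Floored ages and their frequencies, copied from the ageFrequency table above.
age_counts <- c("22" = 1, "23" = 2, "24" = 6, "25" = 3, "26" = 6, "27" = 3,
                "29" = 4, "30" = 2, "31" = 1, "32" = 3, "35" = 2, "38" = 1,
                "39" = 1)

# Expand the frequency table back into one floored age per player.
ages <- rep(as.integer(names(age_counts)), times = age_counts)

age_range  <- range(ages)   # c(min, max) in one call: 22 39
age_median <- median(ages)  # 26

cat("Players:", length(ages),
    "| range:", age_range[1], "-", age_range[2],
    "| median:", age_median, "\n")
```

The median of 26 against a maximum of 39 suggests the roster skews young, with a few older players stretching the upper end of the range.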

Height and weight analysis across different positions:

    library(ggplot2)
    
    # Average height and weight of each position for Team TEX
    physique <- 
      TEX %>%
      select(Position, Height.inches., Weight.pounds.) %>%
      group_by(Position) %>%
      summarize_all(mean)
    physique
## # A tibble: 9 x 3
##   Position          Height.inches. Weight.pounds.
##   <chr>                      <dbl>          <dbl>
## 1 Catcher                     74.2           204.
## 2 Designated Hitter           77             250 
## 3 First Baseman               75             220 
## 4 Outfielder                  72.3           201.
## 5 Relief Pitcher              74.5           204.
## 6 Second Baseman              72             175 
## 7 Shortstop                   73             175 
## 8 Starting Pitcher            74.9           204 
## 9 Third Baseman               73             200
    physiquePlot <- ggplot(physique, aes(x = Height.inches., y = Weight.pounds., color = Position)) + geom_point()
    physiquePlot

    # Average height and weight of each position across MLB
    physique2 <- 
      mlb %>%
      select(Position, Height.inches., Weight.pounds.) %>%
      group_by(Position) %>%
      summarize_all(mean)
    physique2
## # A tibble: 9 x 3
##   Position          Height.inches. Weight.pounds.
##   <chr>                      <dbl>          <dbl>
## 1 Catcher                     72.7           204.
## 2 Designated Hitter           74.2           221.
## 3 First Baseman               74             213.
## 4 Outfielder                  73.0           199.
## 5 Relief Pitcher              74.4           204.
## 6 Second Baseman              71.4           184.
## 7 Shortstop                   71.9           183.
## 8 Starting Pitcher            74.7           205.
## 9 Third Baseman               73.0           201.
    physiquePlot2 <- ggplot(physique2, aes(x = Height.inches., y = Weight.pounds., color = Position)) + geom_point()
    physiquePlot2

Analysis: The average height and weight were plotted for each position on Team TEX. Because several TEX positions are filled by a single player, which is too little data for a reliable comparison, the same averages were also computed for each position across the entire league and compared with those of TEX. The results varied, but both plots show that designated hitters are the heaviest on average, while second basemen and shortstops sit at the lower end of the distribution. Designated hitters are also the tallest position in TEX, although league-wide they average slightly less in height than relief and starting pitchers. In TEX, outfielders and third basemen had similar average weights of around 200 pounds, while catchers, relief pitchers, and starting pitchers all averaged around 204 pounds. Positions varied less in height: every TEX position averaged between 72 and 77 inches. All in all, however, the position averages for TEX alone are not enough to establish a general relationship between position and physique.
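The comparison between TEX and the league can be made explicit by joining the two summary tables and computing the per-position differences. The sketch below types in the averages as they are printed above, so the values are rounded to the displayed precision. The names `tex_avg`, `mlb_avg`, and `comparison` are illustrative. With the real objects one would presumably merge `physique` and `physique2` the same way.

```r
# Position averages copied from the two printed tibbles (rounded as displayed).
positions <- c("Catcher", "Designated Hitter", "First Baseman", "Outfielder",
               "Relief Pitcher", "Second Baseman", "Shortstop",
               "Starting Pitcher", "Third Baseman")

tex_avg <- data.frame(
  Position = positions,
  Height   = c(74.2, 77, 75, 72.3, 74.5, 72, 73, 74.9, 73),
  Weight   = c(204, 250, 220, 201, 204, 175, 175, 204, 200),
  stringsAsFactors = FALSE
)

mlb_avg <- data.frame(
  Position = positions,
  Height   = c(72.7, 74.2, 74, 73.0, 74.4, 71.4, 71.9, 74.7, 73.0),
  Weight   = c(204, 221, 213, 199, 204, 184, 183, 205, 201),
  stringsAsFactors = FALSE
)

# Join on Position; the suffixes mark which table each column came from.
comparison <- merge(tex_avg, mlb_avg, by = "Position",
                    suffixes = c(".TEX", ".MLB"))

# Positive differences mean TEX is above the league average.
comparison$Height.diff <- comparison$Height.TEX - comparison$Height.MLB
comparison$Weight.diff <- comparison$Weight.TEX - comparison$Weight.MLB

# Largest absolute weight gap first.
comparison <- comparison[order(-abs(comparison$Weight.diff)), ]
print(comparison, row.names = FALSE)
```

The largest gaps appear in positions that TEX fills with a single player, such as designated hitter, which is consistent with the caution that one team's averages are too noisy to generalize from.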