Final Assignment - Potential baseball players

I have received a list of 75 free agents whom we could sign. In this document, I identify the agents I believe we should add to our team.

Here are the first few rows of the free-agent data:

head(baseballData)

##   Code   Avg   OPS HR Age Length AAV
## 1    1 0.290 0.950 52  29      1  15
## 2    2 0.275 0.850 30  26      2  12
## 3    3 0.265 0.750 20  31      3  10
## 4    4 0.250 0.700 10  24      4   7
## 5    5 0.225 0.600  5  33      5   4
## 6    6 0.345 0.924 38  30      6  13

Check for missing values in the dataset:

sum(is.na(baseballData))

## [1] 0
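
A single total of zero is reassuring, but if it were ever non-zero, a per-column count would show where the gaps are. The following is a minimal sketch that uses only the six rows printed above as a stand-in for the full baseballData; the names `sample_rows` and `na_per_column` are introduced here for illustration.

```r
# Stand-in for baseballData: the six rows shown by head() above
sample_rows <- data.frame(
  Code   = 1:6,
  Avg    = c(0.290, 0.275, 0.265, 0.250, 0.225, 0.345),
  OPS    = c(0.950, 0.850, 0.750, 0.700, 0.600, 0.924),
  HR     = c(52, 30, 20, 10, 5, 38),
  Age    = c(29, 26, 31, 24, 33, 30),
  Length = 1:6,
  AAV    = c(15, 12, 10, 7, 4, 13)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(sample_rows))
na_per_column
```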

This is general information about the data:

library(dplyr) # provides glimpse()
glimpse(baseballData)

## Observations: 75
## Variables: 7
## $ Code   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1...
## $ Avg    <dbl> 0.290, 0.275, 0.265, 0.250, 0.225, 0.345, 0.265, 0.260, 0.18...
## $ OPS    <dbl> 0.950, 0.850, 0.750, 0.700, 0.600, 0.924, 0.896, 0.763, 0.61...
## $ HR     <int> 52, 30, 20, 10, 5, 38, 34, 18, 7, 3, 38, 35, 19, 7, 5, 39, 3...
## $ Age    <int> 29, 26, 31, 24, 33, 30, 27, 32, 25, 34, 34, 28, 33, 26, 35, ...
## $ Length <int> 1, 2, 3, 4, 5, 6, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2, 3, 4, 1, 2, ...
## $ AAV    <int> 15, 12, 10, 7, 4, 13, 8, 11, 9, 1, 15, 4, 10, 7, 1, 17, 2, 1...

Description of the variables:

Code: Unique identifier for each agent
Avg: Batting average
OPS: On-base plus slugging
HR: Home runs
Age: Agent’s age in years
Length: Contract length in years
AAV: Average annual value of the contract

Some basic summary statistics for the free agents:

summary(baseballData[,-1])

##       Avg              OPS               HR             Age       
##  Min.   :0.1780   Min.   :0.4500   Min.   : 0.00   Min.   :23.00  
##  1st Qu.:0.2390   1st Qu.:0.6465   1st Qu.:10.50   1st Qu.:26.00  
##  Median :0.2690   Median :0.8000   Median :19.00   Median :29.00  
##  Mean   :0.2671   Mean   :0.7864   Mean   :21.73   Mean   :29.12  
##  3rd Qu.:0.2950   3rd Qu.:0.9055   3rd Qu.:34.00   3rd Qu.:32.00  
##  Max.   :0.3900   Max.   :1.0530   Max.   :52.00   Max.   :35.00  
##      Length          AAV        
##  Min.   :1.00   Min.   : 1.000  
##  1st Qu.:2.00   1st Qu.: 4.000  
##  Median :3.00   Median : 9.000  
##  Mean   :3.04   Mean   : 9.387  
##  3rd Qu.:4.00   3rd Qu.:13.000  
##  Max.   :6.00   Max.   :27.000

Check whether there is any relationship between Avg, HR and OPS.

plot(baseballData[,c(2,3,4)])

Based on the plot above, there appears to be a positive linear correlation between each pair of the three variables.

Detailed correlations using the Pearson method.

Between Avg and OPS, there is a strong positive correlation (0.7945662).

cor.test(baseballData$Avg, baseballData$OPS)

## 
##  Pearson's product-moment correlation
## 
## data:  baseballData$Avg and baseballData$OPS
## t = 11.181, df = 73, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6924843 0.8654551
## sample estimates:
##       cor 
## 0.7945662

Between Avg and HR, there is a strong positive correlation (0.7846538).

cor.test(baseballData$Avg, baseballData$HR)

## 
##  Pearson's product-moment correlation
## 
## data:  baseballData$Avg and baseballData$HR
## t = 10.814, df = 73, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6785287 0.8586939
## sample estimates:
##       cor 
## 0.7846538

Between OPS and HR, there is a strong positive correlation (0.8446505).

cor.test(baseballData$OPS, baseballData$HR)

## 
##  Pearson's product-moment correlation
## 
## data:  baseballData$OPS and baseballData$HR
## t = 13.481, df = 73, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7641857 0.8992274
## sample estimates:
##       cor 
## 0.8446505

Based on the results above, there is a strong correlation among these three metrics. We can expect that a player who scores well on one metric is likely to score well on the other two.
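
The three pairwise coefficients can also be obtained in one call as a correlation matrix. This is a minimal sketch on the six rows shown by head() earlier, used as a stand-in for the full data, so the numbers it prints will not match the full-data values reported above. The names `sample_rows` and `cor_matrix` are assumptions made for this example.

```r
# Stand-in for baseballData: Avg, OPS and HR of the six rows shown by head()
sample_rows <- data.frame(
  Avg = c(0.290, 0.275, 0.265, 0.250, 0.225, 0.345),
  OPS = c(0.950, 0.850, 0.750, 0.700, 0.600, 0.924),
  HR  = c(52, 30, 20, 10, 5, 38)
)

# All pairwise Pearson correlations at once
cor_matrix <- cor(sample_rows, method = "pearson")
round(cor_matrix, 3)
```

cor() returns only the coefficients; cor.test() is still needed when the confidence interval and p-value matter.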

In addition, we should take age into account when choosing agents, since baseball players’ performance starts to decline once they reach a certain age. It would help to find the age at which a baseball career peaks.

par(mfrow=c(1,3))
plot(baseballData$Age, baseballData$Avg, xlab = "Age", ylab = "Avg")
plot(baseballData$Age, baseballData$OPS, xlab = "Age", ylab = "OPS")
plot(baseballData$Age, baseballData$HR, xlab = "Age", ylab = "HR")

Based on the plots, there is no apparent linear relationship between Age and the three baseball metrics.

Test the correlation using Kendall’s method.

cor.test(baseballData$Age, baseballData$Avg, method = "kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  baseballData$Age and baseballData$Avg
## z = -0.51944, p-value = 0.6035
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##         tau 
## -0.04256547

cor.test(baseballData$Age, baseballData$OPS, method = "kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  baseballData$Age and baseballData$OPS
## z = -0.75361, p-value = 0.4511
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##         tau 
## -0.06149632
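
The same test can be run for all three metrics, HR included, by looping over the column names. This is a minimal sketch on the six rows shown by head() earlier, used as a stand-in for the full data, so its estimates are illustrative only. The names `sample_rows`, `metrics` and `kendall` are assumptions made for this example.

```r
# Stand-in for baseballData: the six rows shown by head()
sample_rows <- data.frame(
  Avg = c(0.290, 0.275, 0.265, 0.250, 0.225, 0.345),
  OPS = c(0.950, 0.850, 0.750, 0.700, 0.600, 0.924),
  HR  = c(52, 30, 20, 10, 5, 38),
  Age = c(29, 26, 31, 24, 33, 30)
)

metrics <- c("Avg", "OPS", "HR")

# Kendall's tau and its p-value for Age against each metric
kendall <- sapply(metrics, function(m) {
  test <- cor.test(sample_rows$Age, sample_rows[[m]], method = "kendall")
  c(tau = unname(test$estimate), p.value = test$p.value)
})
round(kendall, 3)
```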

Age appears to be only weakly correlated with the metrics. Fitting a linear or non-linear function of age to predict the metrics would likely give results far from the actual values. Since this is the case, we can instead cluster the data to find groups based on age and the three metrics.

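# Keep Avg, OPS, HR and Age (drop Code, Length and AAV)
Data <- baseballData[,-c(1,6,7)]
# Standardize each column so that all variables are on the same scale
Data1 <- scale(Data)

set.seed(010)
# Determine number of clusters
# Compute the within-groups sum of squares for 1 to 20 clusters,
# using 1000 random starts for each
wss <- numeric(20)
for (i in 1:20) wss[i] <- sum(kmeans(Data1,
centers=i, nstart=1000)$withinss)
# Plot the within-groups sum of squares against the number of clusters
plot(1:20, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")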

To me, the within-groups sum of squares is already low at 8 clusters and does not change much after that.
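
Reading an elbow off a plot is a judgment call. One way to support it is to tabulate the percentage drop in the within-groups sum of squares for each added cluster. The sketch below is only illustrative: it uses R's built-in iris data as a stand-in because the full baseballData is not reproduced in this document, and it uses 25 random starts rather than 1000 to keep it quick. The names `X`, `max_k` and `pct_drop` are assumptions made for this example.

```r
# Stand-in data: the four numeric columns of the built-in iris dataset, standardized
X <- scale(iris[, 1:4])

set.seed(10)
max_k <- 10
wss <- numeric(max_k)
for (k in 1:max_k) {
  # tot.withinss is the sum of the per-cluster within sums of squares
  wss[k] <- kmeans(X, centers = k, nstart = 25)$tot.withinss
}

# Percentage drop in WSS gained by going from k - 1 to k clusters
pct_drop <- c(NA, -diff(wss) / wss[-max_k] * 100)
data.frame(k = 1:max_k, wss = round(wss, 1), pct_drop = round(pct_drop, 1))
```

A reasonable cut-off is the last k whose drop is still clearly larger than the drops that follow it.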

set.seed(010)
#K-Means Cluster Analysis
fit <- kmeans(Data1, 8, nstart=1000) # 8 cluster solution
# fit contains all the data about
# each cluster
# get cluster means
aggregate(Data1,by=list(fit$cluster),FUN=mean)

##   Group.1          Avg        OPS         HR        Age
## 1       1 -1.457274025 -1.2087432 -1.0914666  1.2175212
## 2       2  0.414404369  0.6686592  0.8098407 -0.7640150
## 3       3 -0.004633182  0.1007650 -0.2225373  0.7009232
## 4       4 -0.082818133 -0.9447334 -1.1513398  0.6966019
## 5       5 -0.264991136 -0.3934265 -0.7436317 -1.3578921
## 6       6  1.453552342  1.3155322  1.3879499 -0.1741505
## 7       7  0.965740992  1.0411472  1.0684043  0.9494009
## 8       8 -1.023345081 -1.2460264 -0.8220371 -1.1221470

# append cluster assignment to each data point in the data set
Data1 <- data.frame(Data1, fit$cluster)
head(Data1)

##          Avg        OPS         HR         Age fit.cluster
## 1  0.5383179  1.0628035  2.2148656 -0.03370654           6
## 2  0.1853997  0.4131151  0.6049413 -0.87637011           2
## 3 -0.0498791 -0.2365732 -0.1268425  0.52806917           3
## 4 -0.4027973 -0.5614173 -0.8586263 -1.43814582           5
## 5 -0.9909942 -1.2111057 -1.2245182  1.08984488           1
## 6  1.8323512  0.8938845  1.1903683  0.24718131           6

Data <- data.frame(Data, fit$cluster)
head(Data)

##     Avg   OPS HR Age fit.cluster
## 1 0.290 0.950 52  29           6
## 2 0.275 0.850 30  26           2
## 3 0.265 0.750 20  31           3
## 4 0.250 0.700 10  24           5
## 5 0.225 0.600  5  33           1
## 6 0.345 0.924 38  30           6

fit

## K-means clustering with 8 clusters of sizes 11, 15, 13, 5, 7, 10, 6, 8
## 
## Cluster means:
##            Avg        OPS         HR        Age
## 1 -1.457274025 -1.2087432 -1.0914666  1.2175212
## 2  0.414404369  0.6686592  0.8098407 -0.7640150
## 3 -0.004633182  0.1007650 -0.2225373  0.7009232
## 4 -0.082818133 -0.9447334 -1.1513398  0.6966019
## 5 -0.264991136 -0.3934265 -0.7436317 -1.3578921
## 6  1.453552342  1.3155322  1.3879499 -0.1741505
## 7  0.965740992  1.0411472  1.0684043  0.9494009
## 8 -1.023345081 -1.2460264 -0.8220371 -1.1221470
## 
## Clustering vector:
##  [1] 6 2 3 5 1 6 2 3 8 4 7 2 3 8 1 6 2 3 8 1 6 2 3 8 1 6 2 3 8 1 7 2 3 5 1 2 2 3
## [39] 8 1 6 2 3 5 1 6 2 3 5 1 7 2 3 8 1 6 2 3 5 1 6 7 2 3 4 6 7 5 4 4 2 7 5 4 8
## 
## Within cluster sum of squares by cluster:
## [1] 5.927851 7.001057 4.724856 3.773830 3.117132 8.341414 3.326036 5.365938
##  (between_SS / total_SS =  86.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

#Cluster center
ClusCen <- aggregate(Data1,by=list(fit$cluster),FUN=mean)
ClusCen <- ClusCen[,-1]

This method gives us 8 clusters. A 30-year-old baseball player is missing from our list due to injury. Using the clusters above, I try to predict his metrics based on his age.

Standardize his age:

# z-score of x given mean mu and standard deviation sig
zScore <- function(x, mu, sig) {
  (x - mu) / sig
}
agent1 <- zScore(30, mean(Data$Age), sd(Data$Age))

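Calculate the squared distance between his standardized age and the mean age of each cluster:

# Column 4 of ClusCen holds the standardized mean Age of each cluster
disAge <- (ClusCen[,4] - agent1)^2
disAge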

## [1] 0.9415594 1.0225179 0.2058817 0.2019788 2.5762608 0.1775205 0.4931124
## [8] 1.8750600

which.min(disAge)

## [1] 6
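
The distances printed above also show how narrow the margin is: clusters 6, 4 and 3 are nearly tied (about 0.178, 0.202 and 0.206), so an assignment based on age alone is fragile. The sketch below repeats the calculation from the printed values and sorts the clusters by distance. The cluster ages are copied from the cluster means above, and the standardized age of a 30-year-old is taken from row 6 of head(Data1), since that agent is 30. The names `cluster_age`, `age30_z`, `sq_dist`, `nearest` and `ranking` are assumptions made for this example.

```r
# Mean standardized Age of clusters 1 to 8, copied from the cluster means above
cluster_age <- c(1.2175212, -0.7640150, 0.7009232, 0.6966019,
                 -1.3578921, -0.1741505, 0.9494009, -1.1221470)

# Standardized age of a 30-year-old, copied from row 6 of head(Data1)
age30_z <- 0.24718131

sq_dist <- (cluster_age - age30_z)^2
nearest <- which.min(sq_dist)   # index of the closest cluster
ranking <- order(sq_dist)       # clusters from closest to farthest

nearest
round(sq_dist[ranking[1:3]], 4) # the three smallest distances
```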

The agent most likely belongs to cluster 6. These are the agents in cluster 6:

# All the datasets contain the same 75 observations in the same order, so the row indices below are also the agents' codes.
which(Data1$fit.cluster == 6)

##  [1]  1  6 16 21 26 41 46 56 61 66

# Cluster 6 (the summary leaves out Age, column 4)
Clus6 <- Data[Data$fit.cluster == 6, ]
summary(Clus6[,-4])

##       Avg              OPS               HR         fit.cluster
##  Min.   :0.2900   Min.   :0.9240   Min.   :31.00   Min.   :6   
##  1st Qu.:0.3120   1st Qu.:0.9507   1st Qu.:38.25   1st Qu.:6   
##  Median :0.3265   Median :0.9940   Median :40.00   Median :6   
##  Mean   :0.3289   Mean   :0.9889   Mean   :40.70   Mean   :6   
##  3rd Qu.:0.3432   3rd Qu.:1.0310   3rd Qu.:44.00   3rd Qu.:6   
##  Max.   :0.3900   Max.   :1.0530   Max.   :52.00   Max.   :6

Since we guess that the agent belongs to cluster 6, we can expect his metrics to fall within the ranges above. We may take the cluster means as estimates of his metrics.

#Avg
mean(Clus6$Avg)

## [1] 0.3289

#OPS
mean(Clus6$OPS)

## [1] 0.9889

#HR
mean(Clus6$HR)

## [1] 40.7

Another way to estimate that agent’s metrics is to take the average over all the 30-year-old agents we have.

Age30 <- subset(Data, Age == 30)
summary(Age30[,-4])

##       Avg              OPS               HR         fit.cluster 
##  Min.   :0.2590   Min.   :0.7000   Min.   :17.00   Min.   :3.0  
##  1st Qu.:0.2742   1st Qu.:0.7662   1st Qu.:19.75   1st Qu.:3.0  
##  Median :0.2950   Median :0.8860   Median :28.50   Median :4.5  
##  Mean   :0.3002   Mean   :0.8683   Mean   :28.33   Mean   :4.5  
##  3rd Qu.:0.3285   3rd Qu.:0.9728   3rd Qu.:37.25   3rd Qu.:6.0  
##  Max.   :0.3450   Max.   :1.0100   Max.   :39.00   Max.   :6.0

Guessing the metrics:

#Avg
mean(Age30$Avg)

## [1] 0.3001667

#OPS
mean(Age30$OPS)

## [1] 0.8683333

#HR
mean(Age30$HR)

## [1] 28.33333

Before moving further, we should look at which agents have strong values in each metric.

(https://en.wikipedia.org/wiki/Batting_average)

For Avg, anything at or above 0.300 is considered exceptional.

BestAvg <- subset(baseballData, Avg > 0.299)
BestAvg <- BestAvg[order(BestAvg$Avg, decreasing = TRUE),]
BestAvg

##    Code   Avg   OPS HR Age Length AAV
## 21   21 0.390 0.999 41  29      3  16
## 61   61 0.348 1.038 45  26      2  26
## 6     6 0.345 0.924 38  30      6  13
## 31   31 0.341 1.001 38  34      3  16
## 16   16 0.338 0.953 39  28      2  17
## 26   26 0.333 1.010 35  30      4  14
## 51   51 0.320 0.981 40  34      4  24
## 66   66 0.320 0.925 46  26      4  24
## 46   46 0.315 0.989 39  30      3  20
## 56   56 0.311 1.048 41  28      4  24
## 32   32 0.310 0.892 34  28      4   4
## 10   10 0.307 0.617  3  34      5   1
## 36   36 0.305 0.803 41  28      3  16
## 11   11 0.300 1.010 38  34      1  15
## 22   22 0.300 0.911 36  26      4   6
## 62   62 0.300 0.950 33  31      3   4

(https://en.wikipedia.org/wiki/On-base_plus_slugging)

For OPS, anything at or above 0.900 is considered great.

BestOPS <- subset(baseballData, OPS > 0.899)
BestOPS <- BestOPS[order(BestOPS$OPS, decreasing = TRUE),]
BestOPS

##    Code   Avg   OPS HR Age Length AAV
## 41   41 0.299 1.053 31  29      4  19
## 56   56 0.311 1.048 41  28      4  24
## 61   61 0.348 1.038 45  26      2  26
## 11   11 0.300 1.010 38  34      1  15
## 26   26 0.333 1.010 35  30      4  14
## 31   31 0.341 1.001 38  34      3  16
## 21   21 0.390 0.999 41  29      3  16
## 46   46 0.315 0.989 39  30      3  20
## 47   47 0.295 0.987 31  27      3   2
## 51   51 0.320 0.981 40  34      4  24
## 16   16 0.338 0.953 39  28      2  17
## 1     1 0.290 0.950 52  29      1  15
## 62   62 0.300 0.950 33  31      3   4
## 57   57 0.294 0.948 31  25      1   6
## 71   71 0.292 0.935 29  26      2  27
## 66   66 0.320 0.925 46  26      4  24
## 6     6 0.345 0.924 38  30      6  13
## 42   42 0.262 0.917 24  26      5   2
## 22   22 0.300 0.911 36  26      4   6
## 67   67 0.299 0.900 31  31      2   3

For HR, we will look at the agents at or above the third quartile (34 home runs), which is roughly the top 25% of the list.

BestHR <- subset(baseballData, HR >= 34)
BestHR <- BestHR[order(BestHR$HR, decreasing = TRUE),]
BestHR

##    Code   Avg   OPS HR Age Length AAV
## 1     1 0.290 0.950 52  29      1  15
## 66   66 0.320 0.925 46  26      4  24
## 61   61 0.348 1.038 45  26      2  26
## 27   27 0.280 0.893 43  27      5   3
## 21   21 0.390 0.999 41  29      3  16
## 36   36 0.305 0.803 41  28      3  16
## 56   56 0.311 1.048 41  28      4  24
## 51   51 0.320 0.981 40  34      4  24
## 16   16 0.338 0.953 39  28      2  17
## 46   46 0.315 0.989 39  30      3  20
## 6     6 0.345 0.924 38  30      6  13
## 11   11 0.300 1.010 38  34      1  15
## 31   31 0.341 1.001 38  34      3  16
## 72   72 0.289 0.838 38  31      3   4
## 22   22 0.300 0.911 36  26      4   6
## 12   12 0.269 0.877 35  28      2   4
## 26   26 0.333 1.010 35  30      4  14
## 7     7 0.265 0.896 34  27      2   8
## 32   32 0.310 0.892 34  28      4   4
## 37   37 0.280 0.819 34  25      5   6
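
The cutoff of 34 matches the third quartile of HR in the summary table. Rather than hard-coding it, the threshold can be derived from the data. This is a minimal sketch on the six rows shown by head() earlier, used as a stand-in for the full data, so the cutoff it computes differs from 34. The names `sample_rows`, `hr_cutoff` and `top_hr` are assumptions made for this example.

```r
# Stand-in for baseballData: Code and HR of the six rows shown by head()
sample_rows <- data.frame(Code = 1:6, HR = c(52, 30, 20, 10, 5, 38))

# Third quartile of HR, used as the cutoff
hr_cutoff <- quantile(sample_rows$HR, probs = 0.75, names = FALSE)

# Agents at or above the cutoff, best first
top_hr <- subset(sample_rows, HR >= hr_cutoff)
top_hr[order(top_hr$HR, decreasing = TRUE), ]
```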

Here are the three agents with the highest value in each metric.

#Avg
which.max(baseballData$Avg)

## [1] 21

#OPS
which.max(baseballData$OPS)

## [1] 41

#HR
which.max(baseballData$HR)

## [1] 1

#Avg
baseballData[21,]

##    Code  Avg   OPS HR Age Length AAV
## 21   21 0.39 0.999 41  29      3  16

#OPS
baseballData[41,]

##    Code   Avg   OPS HR Age Length AAV
## 41   41 0.299 1.053 31  29      4  19

#HR
baseballData[1,]

##   Code  Avg  OPS HR Age Length AAV
## 1    1 0.29 0.95 52  29      1  15

We can see that only the top agent in Avg (agent 21) appears in the top lists for all three metrics.

When we compare the three lists (Avg, OPS, HR), we find 12 agents that appear in all of them. They can be considered superstars since they perform well in all three metrics.

The 12 agents are: 21, 61, 6, 31, 16, 26, 51, 66, 46, 56, 11, 22.

Earlier, we found that the three metrics are correlated, so it is no surprise that these 12 agents are in all three lists. However, it suggests that the 12 agents are consistently strong players, rather than players who landed on one of the top lists by pure luck.

Among those 12 agents, agent 61 stands out the most. That agent placed 2nd in Avg, 3rd in OPS and 3rd in HR.
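
The overlap between the three lists and the positions of agent 61 can be checked directly from the printed tables. This is a minimal sketch in which the agent codes are copied from the three lists above in their printed order; the names `best_avg`, `best_ops`, `best_hr`, `in_all` and `pos61` are assumptions made for this example.

```r
# Agent codes copied from the three lists above, in printed (descending) order
best_avg <- c(21, 61, 6, 31, 16, 26, 51, 66, 46, 56, 32, 10, 36, 11, 22, 62)
best_ops <- c(41, 56, 61, 11, 26, 31, 21, 46, 47, 51,
              16, 1, 62, 57, 71, 66, 6, 42, 22, 67)
best_hr  <- c(1, 66, 61, 27, 21, 36, 56, 51, 16, 46,
              6, 11, 31, 72, 22, 12, 26, 7, 32, 37)

# Agents present in all three lists
in_all <- Reduce(intersect, list(best_avg, best_ops, best_hr))
in_all

# Position of agent 61 within each list
pos61 <- c(Avg = match(61, best_avg),
           OPS = match(61, best_ops),
           HR  = match(61, best_hr))
pos61
```

With the full data frames available, the same intersection could be taken on the Code columns of BestAvg, BestOPS and BestHR instead of on hand-copied vectors.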

We will want agent 61 on our team right away.

Before finally deciding which agents we want, we should consider all the factors in the data.

Cluster the data again, this time using all the variables.

# consider every variable (Code not included)
Bas <- baseballData[,-1]
Bas1 <- scale(Bas) # scale the data

set.seed(010)
# Determine number of clusters
# Compute the within-groups sum of squares for 1 to 20 clusters,
# using 1000 random starts for each
Wss <- numeric(20)
for (i in 1:20) Wss[i] <- sum(kmeans(Bas1,
                                     centers=i, nstart=1000)$withinss)
# Plot the within-groups sum of squares against the number of clusters
plot(1:20, Wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

After 8 clusters, the sum of squares seems to decrease more slowly.

set.seed(010)
# K-Means Cluster Analysis
# 8 cluster solution; Fit contains all the data about each cluster
Fit <- kmeans(Bas1, 8, nstart=1000)
# get cluster means
aggregate(Bas1,by=list(Fit$cluster),FUN=mean)

##   Group.1          Avg        OPS         HR         Age     Length         AAV
## 1       1  1.281126603  1.2461084  1.3105899  0.06661055 -0.1315958  1.59267182
## 2       2  0.005019281  0.2052149 -0.2244137  0.74653750 -0.7484512  0.25407158
## 3       3  0.632429378  0.6543119  0.8427710 -0.66570422  1.2306265 -0.53334033
## 4       4 -1.403801574 -1.1792119 -1.0781615  1.11538014 -0.1596347 -1.04873722
## 5       5 -0.624631567 -0.5484236 -0.8063561 -1.35789214  0.4852596 -0.03839570
## 6       6  0.394536383  0.7502312  0.8163455 -0.34580416 -0.5085630 -0.69082272
## 7       7 -0.091052890 -0.5467993 -0.6573858  0.87917899  1.4105426 -0.17900497
## 8       8 -0.672060782 -1.1215930 -0.8504954 -1.00120915 -1.0683021 -0.09588927

# append cluster assignment to each data point in the data set
Bas1 <- data.frame(Bas1, Fit$cluster)
Bas <- data.frame(Bas, Fit$cluster)
head(Bas)

##     Avg   OPS HR Age Length AAV Fit.cluster
## 1 0.290 0.950 52  29      1  15           1
## 2 0.275 0.850 30  26      2  12           6
## 3 0.265 0.750 20  31      3  10           2
## 4 0.250 0.700 10  24      4   7           5
## 5 0.225 0.600  5  33      5   4           7
## 6 0.345 0.924 38  30      6  13           3

# Information about cluster.
Fit

## K-means clustering with 8 clusters of sizes 14, 9, 8, 11, 7, 9, 8, 9
## 
## Cluster means:
##            Avg        OPS         HR         Age     Length         AAV
## 1  1.281126603  1.2461084  1.3105899  0.06661055 -0.1315958  1.59267182
## 2  0.005019281  0.2052149 -0.2244137  0.74653750 -0.7484512  0.25407158
## 3  0.632429378  0.6543119  0.8427710 -0.66570422  1.2306265 -0.53334033
## 4 -1.403801574 -1.1792119 -1.0781615  1.11538014 -0.1596347 -1.04873722
## 5 -0.624631567 -0.5484236 -0.8063561 -1.35789214  0.4852596 -0.03839570
## 6  0.394536383  0.7502312  0.8163455 -0.34580416 -0.5085630 -0.69082272
## 7 -0.091052890 -0.5467993 -0.6573858  0.87917899  1.4105426 -0.17900497
## 8 -0.672060782 -1.1215930 -0.8504954 -1.00120915 -1.0683021 -0.09588927
## 
## Clustering vector:
##  [1] 1 6 2 5 7 3 6 2 5 7 1 6 2 5 4 1 6 7 8 4 1 3 2 8 4 1 3 7 8 4 1 3 2 8 4 1 3 7
## [39] 8 4 1 3 2 5 4 1 6 2 8 4 1 3 2 8 4 1 6 2 5 4 1 6 3 7 4 1 6 5 7 8 1 6 5 7 8
## 
## Within cluster sum of squares by cluster:
## [1] 33.099439  7.405357 11.066061 14.877132  7.591738 10.185808 12.397532
## [8] 11.595790
##  (between_SS / total_SS =  75.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Since our main goal is winning games, the most important factors when evaluating agents should be the baseball metrics (Avg, OPS, HR). We found that agents at the top of the Avg and OPS lists are usually also in the HR list. However, the agents at the top of the HR list are not always in the Avg and OPS lists. So we may prefer to use Avg and OPS to identify the good players.

# Plot the data with clusters divided by color (OPS on x, Avg on y)
library(ggplot2)
ggplot(Bas, aes(OPS, Avg, color = as.factor(Fit.cluster))) +
  geom_point() +
  scale_colour_discrete(name = "Cluster") +
  geom_vline(xintercept = 0.9, color = "black") +
  geom_vline(xintercept = 0.7, color = "red") +
  geom_hline(yintercept = 0.3, color = "black") +
  labs(title = "Avg versus OPS", subtitle = "Values considered good for a baseball player")

Any player who falls in the top-right corner of the graph (i.e. the area above the black horizontal line and to the right of the black vertical line) is a good player.

Based on the graph, the majority of players in that area are in cluster 1.

Any player to the left of the red vertical line has an OPS below 0.700, which is considered below average. Those are the players we may want to avoid. Most of them seem to belong to clusters 4 and 8.

We can see that clusters 3 and 6 contain the players just below the top group (cluster 1).

In the future, we can use this graph and these clusters to separate good players from bad ones.
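
The regions marked by the reference lines can also be turned into labels, which makes the rule reusable without the plot. This is a minimal sketch on the six rows shown by head() earlier, used as a stand-in for the full data. The thresholds are the ones drawn on the graph; the column name `Group` and its labels "good", "avoid" and "middle" are assumptions made for this example.

```r
# Stand-in for baseballData: Code, Avg and OPS of the six rows shown by head()
sample_rows <- data.frame(
  Code = 1:6,
  Avg  = c(0.290, 0.275, 0.265, 0.250, 0.225, 0.345),
  OPS  = c(0.950, 0.850, 0.750, 0.700, 0.600, 0.924)
)

# "good":   above the black horizontal line and right of the black vertical line
# "avoid":  left of the red vertical line
# "middle": everything else
sample_rows$Group <- ifelse(
  sample_rows$OPS > 0.9 & sample_rows$Avg > 0.3, "good",
  ifelse(sample_rows$OPS < 0.7, "avoid", "middle")
)
sample_rows
```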

On the other hand, we should look at the AAV and Length for those players. We might have to spend a lot of money, and we would have to renegotiate with the agents once their contracts end. Those factors can change if other teams also want the same player.

# Plot the data with clusters divided by color (Length on x, AAV on y)
ggplot(Bas, aes(Length, AAV, color = as.factor(Fit.cluster))) +
  geom_point() +
  scale_colour_discrete(name = "Cluster") +
  labs(title = "AAV versus Length", subtitle = "Risk factors")

Cluster 1 - The top group. The players have high AAV and short contract lengths. It would be risky for us to sign all of the agents from that group (because of the capital needed to sign all of them and the potential of losing them after 1 or 2 seasons).

Cluster 3 - The group just below the top group. The players seem to have lower AAV and longer contract lengths. It might benefit us to sign agents in this group.

Based on what we have done, I think we should consider signing agents in cluster 1 or 3.

List of players who perform best in both Avg and OPS (Avg above 0.300 and OPS above 0.900):

Best <- subset(Bas, OPS > 0.9 & Avg > 0.3)
Best <- Best[order(Best$OPS, decreasing = TRUE),]
Best

##      Avg   OPS HR Age Length AAV Fit.cluster
## 56 0.311 1.048 41  28      4  24           1
## 61 0.348 1.038 45  26      2  26           1
## 26 0.333 1.010 35  30      4  14           1
## 31 0.341 1.001 38  34      3  16           1
## 21 0.390 0.999 41  29      3  16           1
## 46 0.315 0.989 39  30      3  20           1
## 51 0.320 0.981 40  34      4  24           1
## 16 0.338 0.953 39  28      2  17           1
## 66 0.320 0.925 46  26      4  24           1
## 6  0.345 0.924 38  30      6  13           3

List of players in cluster 3

group3 <- subset(Bas, Fit.cluster == 3)
group3 <- group3[order(group3$OPS, decreasing = TRUE),]
group3

##      Avg   OPS HR Age Length AAV Fit.cluster
## 6  0.345 0.924 38  30      6  13           3
## 42 0.262 0.917 24  26      5   2           3
## 22 0.300 0.911 36  26      4   6           3
## 27 0.280 0.893 43  27      5   3           3
## 32 0.310 0.892 34  28      4   4           3
## 52 0.295 0.891 33  28      5   3           3
## 63 0.280 0.850 24  24      4  11           3
## 37 0.280 0.819 34  25      5   6           3

We will choose the top 5 in the “Best” group and the top 5 in “group3”. We use OPS as the main ordering factor (the baseball metrics are the most important factor in choosing the agents, and Avg and OPS are correlated with each other). We pick agents from two groups to balance the risk of paying a large amount of money against the risk of having to renegotiate contracts after a short time.

# This ranking is based on OPS, with Avg as the tie-breaker.
GoodAgent <- baseballData[ c(56,61,26,31,21,6,42,22,27,32), ]
GoodAgent <- GoodAgent[order(GoodAgent$OPS, GoodAgent$Avg, decreasing = TRUE),]
GoodAgent$Rank <- 1:10
GoodAgent

##    Code   Avg   OPS HR Age Length AAV Rank
## 56   56 0.311 1.048 41  28      4  24    1
## 61   61 0.348 1.038 45  26      2  26    2
## 26   26 0.333 1.010 35  30      4  14    3
## 31   31 0.341 1.001 38  34      3  16    4
## 21   21 0.390 0.999 41  29      3  16    5
## 6     6 0.345 0.924 38  30      6  13    6
## 42   42 0.262 0.917 24  26      5   2    7
## 22   22 0.300 0.911 36  26      4   6    8
## 27   27 0.280 0.893 43  27      5   3    9
## 32   32 0.310 0.892 34  28      4   4   10
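
To put a number on the cost side of this trade-off, the annual cost and the total commitment (AAV times Length) can be summed for each half of the list. This is a minimal sketch in which the values are copied from the GoodAgent table above. The source does not state the units of AAV, so the totals are in whatever units AAV uses. The names `good_agent`, `Source`, `Total` and `cost_by_source` are assumptions made for this example.

```r
# The ten selected agents, copied from the GoodAgent table above
good_agent <- data.frame(
  Code   = c(56, 61, 26, 31, 21, 6, 42, 22, 27, 32),
  Length = c(4, 2, 4, 3, 3, 6, 5, 4, 5, 4),
  AAV    = c(24, 26, 14, 16, 16, 13, 2, 6, 3, 4),
  Source = rep(c("Best", "group3"), each = 5)
)

# Total contract value = annual value x number of years
good_agent$Total <- good_agent$AAV * good_agent$Length

# Combined annual cost and total commitment for each source list
cost_by_source <- aggregate(cbind(AAV, Total) ~ Source, data = good_agent, FUN = sum)
cost_by_source
```

On these figures the five “Best” picks account for 96 of the 124 units of combined annual value, which is the imbalance the group3 picks are meant to offset.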

We might debate whether agent 61 should be ranked 1st or 2nd in the list. Personally, if we have to pick either 56 or 61, I think we should choose agent 61 because that player performs well in all the metrics (the ranks across the metrics are balanced). However, if we want lower risk, agent 56 is the better choice, with a longer contract (4 years versus 2) and a slightly lower AAV (24 versus 26).

Overall, the list (GoodAgent) contains the players we should consider signing for our team. Besides, any other players in the lists called “Best” and “group3” are potential players we might consider signing.

Minh Le

4/14/2020