I have received a list of 75 free agents whom we could sign. In this document, I recommend the agents I believe we should add to our team.
Here are the first few rows of the free-agent data:
head(baseballData)
## Code Avg OPS HR Age Length AAV
## 1 1 0.290 0.950 52 29 1 15
## 2 2 0.275 0.850 30 26 2 12
## 3 3 0.265 0.750 20 31 3 10
## 4 4 0.250 0.700 10 24 4 7
## 5 5 0.225 0.600 5 33 5 4
## 6 6 0.345 0.924 38 30 6 13
Check for missing values in the dataset:
sum(is.na(baseballData))
## [1] 0
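A single total of 0 is enough here, but if the count were not zero it would help to know where the gaps are. A minimal sketch on a toy data frame (the `toy` values are made up for illustration and are not part of the free-agent data):

```r
# Toy data frame with one deliberately missing value (hypothetical data)
toy <- data.frame(Avg = c(0.290, NA, 0.265), HR = c(52L, 30L, 20L))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Row numbers that contain at least one missing value
rows_with_na <- which(!complete.cases(toy))
rows_with_na
```

The same two calls should work unchanged on `baseballData`.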
This is general information about the data:
glimpse(baseballData)
## Observations: 75
## Variables: 7
## $ Code <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1...
## $ Avg <dbl> 0.290, 0.275, 0.265, 0.250, 0.225, 0.345, 0.265, 0.260, 0.18...
## $ OPS <dbl> 0.950, 0.850, 0.750, 0.700, 0.600, 0.924, 0.896, 0.763, 0.61...
## $ HR <int> 52, 30, 20, 10, 5, 38, 34, 18, 7, 3, 38, 35, 19, 7, 5, 39, 3...
## $ Age <int> 29, 26, 31, 24, 33, 30, 27, 32, 25, 34, 34, 28, 33, 26, 35, ...
## $ Length <int> 1, 2, 3, 4, 5, 6, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2, 3, 4, 1, 2, ...
## $ AAV <int> 15, 12, 10, 7, 4, 13, 8, 11, 9, 1, 15, 4, 10, 7, 1, 17, 2, 1...
The variables are:
Code: Unique number identifying each agent
Avg: Batting average
OPS: On-base plus slugging
HR: Home runs
Age: Agent’s age
Length: Contract length in years
AAV: Average annual value of the contract
Some basic summary statistics for the free-agent data:
summary(baseballData[,-1])
## Avg OPS HR Age
## Min. :0.1780 Min. :0.4500 Min. : 0.00 Min. :23.00
## 1st Qu.:0.2390 1st Qu.:0.6465 1st Qu.:10.50 1st Qu.:26.00
## Median :0.2690 Median :0.8000 Median :19.00 Median :29.00
## Mean :0.2671 Mean :0.7864 Mean :21.73 Mean :29.12
## 3rd Qu.:0.2950 3rd Qu.:0.9055 3rd Qu.:34.00 3rd Qu.:32.00
## Max. :0.3900 Max. :1.0530 Max. :52.00 Max. :35.00
## Length AAV
## Min. :1.00 Min. : 1.000
## 1st Qu.:2.00 1st Qu.: 4.000
## Median :3.00 Median : 9.000
## Mean :3.04 Mean : 9.387
## 3rd Qu.:4.00 3rd Qu.:13.000
## Max. :6.00 Max. :27.000
Check whether there is any relationship between Avg, OPS and HR.
plot(baseballData[,c(2,3,4)])
Based on the plot above, there appears to be a positive linear correlation between each pair of the three variables.
Detailed correlations using the Pearson method:
Between Avg and OPS, there is a strong positive correlation (0.7945662).
cor.test(baseballData$Avg, baseballData$OPS)
##
## Pearson's product-moment correlation
##
## data: baseballData$Avg and baseballData$OPS
## t = 11.181, df = 73, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6924843 0.8654551
## sample estimates:
## cor
## 0.7945662
Between Avg and HR, there is a strong positive correlation (0.7846538).
cor.test(baseballData$Avg, baseballData$HR)
##
## Pearson's product-moment correlation
##
## data: baseballData$Avg and baseballData$HR
## t = 10.814, df = 73, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6785287 0.8586939
## sample estimates:
## cor
## 0.7846538
Between OPS and HR, there is a strong positive correlation (0.8446505).
cor.test(baseballData$OPS, baseballData$HR)
##
## Pearson's product-moment correlation
##
## data: baseballData$OPS and baseballData$HR
## t = 13.481, df = 73, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7641857 0.8992274
## sample estimates:
## cor
## 0.8446505
Based on the results above, all three metrics are strongly correlated with one another. We can expect that a player who is strong in one metric is likely to be strong in the other two.
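The three pairwise coefficients can also be obtained in one call with `cor()`, which returns the full correlation matrix. The sketch below uses only the six rows printed by `head(baseballData)` at the top of this report, so its numbers will not match the coefficients computed on all 75 agents; `first_six` is a name introduced here for illustration.

```r
# First six rows, copied from the head(baseballData) output above
first_six <- data.frame(
  Avg = c(0.290, 0.275, 0.265, 0.250, 0.225, 0.345),
  OPS = c(0.950, 0.850, 0.750, 0.700, 0.600, 0.924),
  HR  = c(52, 30, 20, 10, 5, 38)
)

# Pearson correlation matrix for all pairs of columns at once
cor_matrix <- cor(first_six, method = "pearson")
round(cor_matrix, 3)
```

On the full data the equivalent call would be `cor(baseballData[, c("Avg", "OPS", "HR")])`. Unlike `cor.test()`, it gives no p-values or confidence intervals.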
In addition, we should take age into account when choosing agents, since a baseball player’s performance starts to decline after a certain age. It would help to find the age at which a baseball career peaks.
par(mfrow=c(1,3))
plot(baseballData$Age, baseballData$Avg, xlab = "Age", ylab = "Avg")
plot(baseballData$Age, baseballData$OPS, xlab = "Age", ylab = "OPS")
plot(baseballData$Age, baseballData$HR, xlab = "Age", ylab = "HR")
Based on the plots, there is no clear linear relationship between Age and any of the three baseball metrics.
Test the correlation using Kendall’s method:
cor.test(baseballData$Age, baseballData$Avg, method = "kendall")
##
## Kendall's rank correlation tau
##
## data: baseballData$Age and baseballData$Avg
## z = -0.51944, p-value = 0.6035
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.04256547
cor.test(baseballData$Age, baseballData$OPS, method = "kendall")
##
## Kendall's rank correlation tau
##
## data: baseballData$Age and baseballData$OPS
## z = -0.75361, p-value = 0.4511
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.06149632
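The tests above cover Avg and OPS. HR, the third metric, can be tested against Age in the same way. The sketch below runs on the six rows printed by `head(baseballData)` only, so its result says nothing about the full list of 75 agents; it just shows the call and how to read the pieces of the result.

```r
# Age and HR for the first six agents, copied from the head() output above
age <- c(29, 26, 31, 24, 33, 30)
hr  <- c(52, 30, 20, 10, 5, 38)

# Kendall rank correlation between Age and HR
age_hr <- cor.test(age, hr, method = "kendall")

age_hr$estimate  # tau
age_hr$p.value   # p-value for H0: tau = 0
```

On the full data the call would be `cor.test(baseballData$Age, baseballData$HR, method = "kendall")`.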
The correlations between age and the metrics tested above are weak and not statistically significant (tau close to 0, p-values well above 0.05). Fitting a linear or non-linear function of age to predict the metrics would likely give results far from the actual values. Instead, we can cluster the data to find groups based on age and the three metrics.
Data<- baseballData[,-c(1,6,7)] # keep Avg, OPS, HR and Age
Data1 <- scale(Data) # standardize each column
set.seed(010)
# Determine number of clusters
# Total within-cluster sum of squares for 1 to 20 clusters, with 1000 random starts each
wss <- numeric(20)
for (i in 1:20) wss[i] <- sum(kmeans(Data1,
  centers=i, nstart=1000)$withinss)
# Plot the within-cluster sum of squares against the number of clusters
plot(1:20, wss, type="b", xlab="Number of Clusters",
  ylab="Within groups sum of squares")
In my judgment, the within-groups sum of squares is already low at 8 clusters and does not change much beyond that.
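Reading the elbow off a plot is a judgment call. One way to make it a little more concrete is to tabulate how much each extra cluster reduces the within-groups sum of squares. The sketch below does this on synthetic data (three made-up groups, not the free-agent data), with a smaller `nstart` to keep it quick; `toy`, `max_k` and `pct_drop` are names introduced here.

```r
set.seed(10)
# Synthetic data: 60 points in 2 columns, drawn around three different means
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)
toy_scaled <- scale(toy)

# Total within-cluster sum of squares for k = 1..max_k
max_k <- 8
wss <- numeric(max_k)
for (k in 1:max_k) {
  wss[k] <- kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
}

# Percentage of the previous value removed by each additional cluster
pct_drop <- round(100 * -diff(wss) / wss[-max_k], 1)
data.frame(k = 2:max_k, pct_drop = pct_drop)
```

A value of k after which `pct_drop` stays small is a candidate for the elbow. This is only a heuristic and does not replace looking at the plot.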
set.seed(010)
# K-means cluster analysis
fit <- kmeans(Data1, 8, nstart=1000) # 8-cluster solution
# fit contains all the information about each cluster
# Get the cluster means (in standardized units)
aggregate(Data1,by=list(fit$cluster),FUN=mean)
## Group.1 Avg OPS HR Age
## 1 1 -1.457274025 -1.2087432 -1.0914666 1.2175212
## 2 2 0.414404369 0.6686592 0.8098407 -0.7640150
## 3 3 -0.004633182 0.1007650 -0.2225373 0.7009232
## 4 4 -0.082818133 -0.9447334 -1.1513398 0.6966019
## 5 5 -0.264991136 -0.3934265 -0.7436317 -1.3578921
## 6 6 1.453552342 1.3155322 1.3879499 -0.1741505
## 7 7 0.965740992 1.0411472 1.0684043 0.9494009
## 8 8 -1.023345081 -1.2460264 -0.8220371 -1.1221470
# append cluster assignment to each data point in the data set
Data1 <- data.frame(Data1, fit$cluster)
head(Data1)
## Avg OPS HR Age fit.cluster
## 1 0.5383179 1.0628035 2.2148656 -0.03370654 6
## 2 0.1853997 0.4131151 0.6049413 -0.87637011 2
## 3 -0.0498791 -0.2365732 -0.1268425 0.52806917 3
## 4 -0.4027973 -0.5614173 -0.8586263 -1.43814582 5
## 5 -0.9909942 -1.2111057 -1.2245182 1.08984488 1
## 6 1.8323512 0.8938845 1.1903683 0.24718131 6
Data <- data.frame(Data, fit$cluster)
head(Data)
## Avg OPS HR Age fit.cluster
## 1 0.290 0.950 52 29 6
## 2 0.275 0.850 30 26 2
## 3 0.265 0.750 20 31 3
## 4 0.250 0.700 10 24 5
## 5 0.225 0.600 5 33 1
## 6 0.345 0.924 38 30 6
fit
## K-means clustering with 8 clusters of sizes 11, 15, 13, 5, 7, 10, 6, 8
##
## Cluster means:
## Avg OPS HR Age
## 1 -1.457274025 -1.2087432 -1.0914666 1.2175212
## 2 0.414404369 0.6686592 0.8098407 -0.7640150
## 3 -0.004633182 0.1007650 -0.2225373 0.7009232
## 4 -0.082818133 -0.9447334 -1.1513398 0.6966019
## 5 -0.264991136 -0.3934265 -0.7436317 -1.3578921
## 6 1.453552342 1.3155322 1.3879499 -0.1741505
## 7 0.965740992 1.0411472 1.0684043 0.9494009
## 8 -1.023345081 -1.2460264 -0.8220371 -1.1221470
##
## Clustering vector:
## [1] 6 2 3 5 1 6 2 3 8 4 7 2 3 8 1 6 2 3 8 1 6 2 3 8 1 6 2 3 8 1 7 2 3 5 1 2 2 3
## [39] 8 1 6 2 3 5 1 6 2 3 5 1 7 2 3 8 1 6 2 3 5 1 6 7 2 3 4 6 7 5 4 4 2 7 5 4 8
##
## Within cluster sum of squares by cluster:
## [1] 5.927851 7.001057 4.724856 3.773830 3.117132 8.341414 3.326036 5.365938
## (between_SS / total_SS = 86.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
#Cluster center
ClusCen <- aggregate(Data1,by=list(fit$cluster),FUN=mean)
ClusCen <- ClusCen[,-1]
We have 8 clusters from this method. A 30-year-old player is missing from our list due to injury. Using the clusters above, I try to predict his metrics based on his age.
Standardize his age:
zScorces <- function(x,mu,sig){
  # z-score of x given mean mu and standard deviation sig
  (x-mu)/sig
}
agent1 <- zScorces(30, mean(Data$Age),sd(Data$Age))
Calculate the squared distance between his standardized age and the Age coordinate of each cluster center:
disAge <- numeric(8)
for (k in 1:8) {
  disAge[k] = (ClusCen[k,4]-agent1)^2
}
disAge
## [1] 0.9415594 1.0225179 0.2058817 0.2019788 2.5762608 0.1775205 0.4931124
## [8] 1.8750600
which.min(disAge)
## [1] 6
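The loop above can also be written as one vectorized expression, which makes it easy to see the runners-up as well as the winner. The sketch below copies the Age column of the cluster centers from the output above and takes the standardized age of a 30-year-old from agent 6's row in `head(Data1)`; `center_age`, `age30_z` and `sq_dist` are names introduced here.

```r
# Age coordinate of the 8 standardized cluster centers (copied from the output above)
center_age <- c(1.2175212, -0.7640150, 0.7009232, 0.6966019,
                -1.3578921, -0.1741505, 0.9494009, -1.1221470)

# Standardized age of a 30-year-old (agent 6's Age in head(Data1))
age30_z <- 0.24718131

# Squared distance to every center in one step
sq_dist <- (center_age - age30_z)^2
round(sq_dist, 4)

nearest <- which.min(sq_dist)          # closest cluster
closest_three <- order(sq_dist)[1:3]   # three closest clusters, nearest first
nearest
closest_three
```

This gives cluster 6 as the nearest, followed by clusters 4 and 3. Note that the match uses age alone, since age is the only thing known about the missing player.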
By age alone, the agent is closest to cluster 6, although clusters 4 and 3 are nearly as close (0.202 and 0.206 versus 0.178). These are the agents in cluster 6:
# All data sets contain the 75 observations in the same order, so the row indices below are also the agents' codes.
which(Data1$fit.cluster == 6)
## [1] 1 6 16 21 26 41 46 56 61 66
# Cluster 6
Clus6 <- Data[Data$fit.cluster == 6, ]
summary(Clus6[,-4])
## Avg OPS HR fit.cluster
## Min. :0.2900 Min. :0.9240 Min. :31.00 Min. :6
## 1st Qu.:0.3120 1st Qu.:0.9507 1st Qu.:38.25 1st Qu.:6
## Median :0.3265 Median :0.9940 Median :40.00 Median :6
## Mean :0.3289 Mean :0.9889 Mean :40.70 Mean :6
## 3rd Qu.:0.3432 3rd Qu.:1.0310 3rd Qu.:44.00 3rd Qu.:6
## Max. :0.3900 Max. :1.0530 Max. :52.00 Max. :6
Since we guess that the agent belongs to cluster 6, his metrics may fall within the ranges above. We can take the cluster means as estimates of his metrics.
#Avg
mean(Clus6$Avg)
## [1] 0.3289
#OPS
mean(Clus6$OPS)
## [1] 0.9889
#HR
mean(Clus6$HR)
## [1] 40.7
Another way to estimate that agent’s metrics is to average all the 30-year-old agents we have.
Age30 <- subset(Data, Age == 30)
summary(Age30[,-4])
## Avg OPS HR fit.cluster
## Min. :0.2590 Min. :0.7000 Min. :17.00 Min. :3.0
## 1st Qu.:0.2742 1st Qu.:0.7662 1st Qu.:19.75 1st Qu.:3.0
## Median :0.2950 Median :0.8860 Median :28.50 Median :4.5
## Mean :0.3002 Mean :0.8683 Mean :28.33 Mean :4.5
## 3rd Qu.:0.3285 3rd Qu.:0.9728 3rd Qu.:37.25 3rd Qu.:6.0
## Max. :0.3450 Max. :1.0100 Max. :39.00 Max. :6.0
The estimated metrics:
#Avg
mean(Age30$Avg)
## [1] 0.3001667
#OPS
mean(Age30$OPS)
## [1] 0.8683333
#HR
mean(Age30$HR)
## [1] 28.33333
Before going further, we should look at which agents have strong values in each metric.
(https://en.wikipedia.org/wiki/Batting_average)
For Avg, anything at or above 0.300 is considered exceptional.
BestAvg <- subset(baseballData, Avg > 0.299)
BestAvg <- BestAvg[order(BestAvg$Avg, decreasing = TRUE),]
BestAvg
## Code Avg OPS HR Age Length AAV
## 21 21 0.390 0.999 41 29 3 16
## 61 61 0.348 1.038 45 26 2 26
## 6 6 0.345 0.924 38 30 6 13
## 31 31 0.341 1.001 38 34 3 16
## 16 16 0.338 0.953 39 28 2 17
## 26 26 0.333 1.010 35 30 4 14
## 51 51 0.320 0.981 40 34 4 24
## 66 66 0.320 0.925 46 26 4 24
## 46 46 0.315 0.989 39 30 3 20
## 56 56 0.311 1.048 41 28 4 24
## 32 32 0.310 0.892 34 28 4 4
## 10 10 0.307 0.617 3 34 5 1
## 36 36 0.305 0.803 41 28 3 16
## 11 11 0.300 1.010 38 34 1 15
## 22 22 0.300 0.911 36 26 4 6
## 62 62 0.300 0.950 33 31 3 4
(https://en.wikipedia.org/wiki/On-base_plus_slugging)
For OPS, anything at or above 0.900 is considered great.
BestOPS <- subset(baseballData, OPS > 0.899)
BestOPS <- BestOPS[order(BestOPS$OPS, decreasing = TRUE),]
BestOPS
## Code Avg OPS HR Age Length AAV
## 41 41 0.299 1.053 31 29 4 19
## 56 56 0.311 1.048 41 28 4 24
## 61 61 0.348 1.038 45 26 2 26
## 11 11 0.300 1.010 38 34 1 15
## 26 26 0.333 1.010 35 30 4 14
## 31 31 0.341 1.001 38 34 3 16
## 21 21 0.390 0.999 41 29 3 16
## 46 46 0.315 0.989 39 30 3 20
## 47 47 0.295 0.987 31 27 3 2
## 51 51 0.320 0.981 40 34 4 24
## 16 16 0.338 0.953 39 28 2 17
## 1 1 0.290 0.950 52 29 1 15
## 62 62 0.300 0.950 33 31 3 4
## 57 57 0.294 0.948 31 25 1 6
## 71 71 0.292 0.935 29 26 2 27
## 66 66 0.320 0.925 46 26 4 24
## 6 6 0.345 0.924 38 30 6 13
## 42 42 0.262 0.917 24 26 5 2
## 22 22 0.300 0.911 36 26 4 6
## 67 67 0.299 0.900 31 31 2 3
For HR, we look at the top 25% by value, i.e. players at or above the third quartile (HR >= 34):
BestHR <- subset(baseballData, HR >= 34)
BestHR <- BestHR[order(BestHR$HR, decreasing = TRUE),]
BestHR
## Code Avg OPS HR Age Length AAV
## 1 1 0.290 0.950 52 29 1 15
## 66 66 0.320 0.925 46 26 4 24
## 61 61 0.348 1.038 45 26 2 26
## 27 27 0.280 0.893 43 27 5 3
## 21 21 0.390 0.999 41 29 3 16
## 36 36 0.305 0.803 41 28 3 16
## 56 56 0.311 1.048 41 28 4 24
## 51 51 0.320 0.981 40 34 4 24
## 16 16 0.338 0.953 39 28 2 17
## 46 46 0.315 0.989 39 30 3 20
## 6 6 0.345 0.924 38 30 6 13
## 11 11 0.300 1.010 38 34 1 15
## 31 31 0.341 1.001 38 34 3 16
## 72 72 0.289 0.838 38 31 3 4
## 22 22 0.300 0.911 36 26 4 6
## 12 12 0.269 0.877 35 28 2 4
## 26 26 0.333 1.010 35 30 4 14
## 7 7 0.265 0.896 34 27 2 8
## 32 32 0.310 0.892 34 28 4 4
## 37 37 0.280 0.819 34 25 5 6
Here are the three agents with the highest value in each metric:
#Avg
which.max(baseballData$Avg)
## [1] 21
#OPS
which.max(baseballData$OPS)
## [1] 41
#HR
which.max(baseballData$HR)
## [1] 1
#Avg
baseballData[21,]
## Code Avg OPS HR Age Length AAV
## 21 21 0.39 0.999 41 29 3 16
#OPS
baseballData[41,]
## Code Avg OPS HR Age Length AAV
## 41 41 0.299 1.053 31 29 4 19
#HR
baseballData[1,]
## Code Avg OPS HR Age Length AAV
## 1 1 0.29 0.95 52 29 1 15
We can see that, of these three leaders, only the Avg leader (agent 21) appears in the top lists for all three metrics.
When we compare the three lists (Avg, OPS, HR), we find 12 agents who appear in all of them. They can be considered superstars, since they perform well in all three metrics.
The 12 agents are: 21, 61, 6, 31, 16, 26, 51, 66, 46, 56, 11, 22.
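The overlap of the three lists can be computed rather than read off by eye. A minimal sketch, with the agent codes copied from the BestAvg, BestOPS and BestHR outputs above (the `*_codes` vectors and `superstars` are names introduced here):

```r
# Agent codes in each top list, in the order printed above
best_avg_codes <- c(21, 61, 6, 31, 16, 26, 51, 66, 46, 56, 32, 10, 36, 11, 22, 62)
best_ops_codes <- c(41, 56, 61, 11, 26, 31, 21, 46, 47, 51,
                    16, 1, 62, 57, 71, 66, 6, 42, 22, 67)
best_hr_codes  <- c(1, 66, 61, 27, 21, 36, 56, 51, 16, 46,
                    6, 11, 31, 72, 22, 12, 26, 7, 32, 37)

# Codes present in all three lists, kept in the order of the Avg list
superstars <- Reduce(intersect, list(best_avg_codes, best_ops_codes, best_hr_codes))
superstars
length(superstars)
```

With the data frames still in memory, the same result should come from `Reduce(intersect, list(BestAvg$Code, BestOPS$Code, BestHR$Code))`.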
Earlier, we found that the three metrics are correlated, so it is no surprise that these 12 agents appear in all three lists. However, it does suggest that they are strong all-round players rather than players who were lucky enough to reach just one of the top lists.
Among those 12 agents, agent 61 stands out the most: that agent ranks 2nd in Avg, 3rd in OPS and 3rd in HR.
We will want agent 61 on our team right away.
Before making a final decision on which agents we want, we should consider all of the factors in the data, including contract length and AAV.
Cluster the data.
# Consider every variable (Code excluded)
Bas<- baseballData[,-1]
Bas1 <- scale(Bas) # standardize the data
set.seed(010)
# Determine number of clusters
# Total within-cluster sum of squares for 1 to 20 clusters, with 1000 random starts each
Wss <- numeric(20)
for (i in 1:20) Wss[i] <- sum(kmeans(Bas1,
  centers=i, nstart=1000)$withinss)
# Plot the within-cluster sum of squares against the number of clusters
plot(1:20, Wss, type="b", xlab="Number of Clusters",
  ylab="Within groups sum of squares")
After 8 clusters, the sum of squares decreases more slowly.
set.seed(010)
# K-means cluster analysis
Fit <- kmeans(Bas1, 8, nstart=1000) # 8-cluster solution
# Fit contains all the information about each cluster
# Get the cluster means (in standardized units)
aggregate(Bas1,by=list(Fit$cluster),FUN=mean)
## Group.1 Avg OPS HR Age Length AAV
## 1 1 1.281126603 1.2461084 1.3105899 0.06661055 -0.1315958 1.59267182
## 2 2 0.005019281 0.2052149 -0.2244137 0.74653750 -0.7484512 0.25407158
## 3 3 0.632429378 0.6543119 0.8427710 -0.66570422 1.2306265 -0.53334033
## 4 4 -1.403801574 -1.1792119 -1.0781615 1.11538014 -0.1596347 -1.04873722
## 5 5 -0.624631567 -0.5484236 -0.8063561 -1.35789214 0.4852596 -0.03839570
## 6 6 0.394536383 0.7502312 0.8163455 -0.34580416 -0.5085630 -0.69082272
## 7 7 -0.091052890 -0.5467993 -0.6573858 0.87917899 1.4105426 -0.17900497
## 8 8 -0.672060782 -1.1215930 -0.8504954 -1.00120915 -1.0683021 -0.09588927
# append cluster assignment to each data point in the data set
Bas1 <- data.frame(Bas1, Fit$cluster)
Bas <- data.frame(Bas, Fit$cluster)
head(Bas)
## Avg OPS HR Age Length AAV Fit.cluster
## 1 0.290 0.950 52 29 1 15 1
## 2 0.275 0.850 30 26 2 12 6
## 3 0.265 0.750 20 31 3 10 2
## 4 0.250 0.700 10 24 4 7 5
## 5 0.225 0.600 5 33 5 4 7
## 6 0.345 0.924 38 30 6 13 3
# Information about cluster.
Fit
## K-means clustering with 8 clusters of sizes 14, 9, 8, 11, 7, 9, 8, 9
##
## Cluster means:
## Avg OPS HR Age Length AAV
## 1 1.281126603 1.2461084 1.3105899 0.06661055 -0.1315958 1.59267182
## 2 0.005019281 0.2052149 -0.2244137 0.74653750 -0.7484512 0.25407158
## 3 0.632429378 0.6543119 0.8427710 -0.66570422 1.2306265 -0.53334033
## 4 -1.403801574 -1.1792119 -1.0781615 1.11538014 -0.1596347 -1.04873722
## 5 -0.624631567 -0.5484236 -0.8063561 -1.35789214 0.4852596 -0.03839570
## 6 0.394536383 0.7502312 0.8163455 -0.34580416 -0.5085630 -0.69082272
## 7 -0.091052890 -0.5467993 -0.6573858 0.87917899 1.4105426 -0.17900497
## 8 -0.672060782 -1.1215930 -0.8504954 -1.00120915 -1.0683021 -0.09588927
##
## Clustering vector:
## [1] 1 6 2 5 7 3 6 2 5 7 1 6 2 5 4 1 6 7 8 4 1 3 2 8 4 1 3 7 8 4 1 3 2 8 4 1 3 7
## [39] 8 4 1 3 2 5 4 1 6 2 8 4 1 3 2 8 4 1 6 2 5 4 1 6 3 7 4 1 6 5 7 8 1 6 5 7 8
##
## Within cluster sum of squares by cluster:
## [1] 33.099439 7.405357 11.066061 14.877132 7.591738 10.185808 12.397532
## [8] 11.595790
## (between_SS / total_SS = 75.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Since our main goal is to win games, the most important factors when evaluating agents should be the baseball metrics (Avg, OPS, HR). We found that agents at the top of the Avg and OPS lists are usually also in the HR list, whereas agents at the top of the HR list are less often in the Avg and OPS lists. We should therefore use Avg and OPS to identify the good players.
# Plot the data with clusters distinguished by color
# Avg (y-axis) against OPS (x-axis)
library(ggplot2)
ggplot(Bas, aes(OPS, Avg, color = as.factor(Fit.cluster))) + geom_point() + scale_colour_discrete(name = "Cluster") +
  geom_vline(xintercept = 0.9, color = "black") + geom_vline(xintercept = 0.7, color = "red") + geom_hline(yintercept = 0.3, color = "black") +
  labs(title = "Avg versus OPS", subtitle = "Thresholds for a good baseball player")
Players in the top right corner of the graph (i.e. the area above the black horizontal line and to the right of the black vertical line) are good players.
Based on the graph, the majority of players in that area are in cluster 1.
Players to the left of the red vertical line have an OPS below 0.700, which we treat as below average. Those are the players we may want to avoid. Most of them belong to clusters 4 and 8.
Clusters 3 and 6 contain the players just below the top group (cluster 1).
In the future, we can use this graph and these clusters to separate good players from bad ones.
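The regions of the graph can be turned into a small rule for screening future players. The sketch below uses the same cut-offs as the reference lines (Avg 0.300, OPS 0.900 and OPS 0.700) and applies them to the six rows printed by `head(baseballData)`. The function name `classify_player` and the labels "good", "middle" and "avoid" are introduced here for illustration; treating a value exactly on a line as inside the region is also an assumption.

```r
# Label a player from the plot thresholds (labels are illustrative)
classify_player <- function(avg, ops) {
  ifelse(avg >= 0.300 & ops >= 0.900, "good",
         ifelse(ops < 0.700, "avoid", "middle"))
}

# First six agents, copied from the head(baseballData) output
first_six <- data.frame(
  Code = 1:6,
  Avg  = c(0.290, 0.275, 0.265, 0.250, 0.225, 0.345),
  OPS  = c(0.950, 0.850, 0.750, 0.700, 0.600, 0.924)
)

first_six$Zone <- classify_player(first_six$Avg, first_six$OPS)
first_six
```

Of these six, only agent 6 lands in the top right corner and only agent 5 falls left of the red line. Agent 1 has a high OPS but an Avg just under 0.300, so it ends up in the middle group.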
On the other hand, we should look at the AAV and Length of those players. We might have to spend a lot of money, and we would have to renegotiate with the agents once their contracts end. Both factors can change if other teams also want the same player.
# Plot the data with clusters distinguished by color
# AAV (y-axis) against Length (x-axis)
ggplot(Bas, aes(Length, AAV, color = as.factor(Fit.cluster))) + geom_point() +
  scale_colour_discrete(name = "Cluster") + labs(title = "AAV versus Length", subtitle = "Risk factors")
Cluster 1 - The top group. These players have a high AAV and shorter contracts than cluster 3. Hiring all of the agents from this group would be risky for us, both because of the capital needed to hire them all and because we could lose some of them after 1 or 2 seasons.
Cluster 3 - The group just below the top group. These players tend to have a lower AAV and longer contract lengths. It might benefit us to hire agents from this group.
Based on the analysis so far, I think we should consider signing agents from cluster 1 or cluster 3.
List of players who perform best in both Avg and OPS (strictly above 0.300 and 0.900, so agents 11, 22 and 62, whose Avg is exactly 0.300, are excluded):
Best <- subset(Bas, OPS > 0.9 & Avg > 0.3)
Best <- Best[order(Best$OPS, decreasing = TRUE),]
Best
## Avg OPS HR Age Length AAV Fit.cluster
## 56 0.311 1.048 41 28 4 24 1
## 61 0.348 1.038 45 26 2 26 1
## 26 0.333 1.010 35 30 4 14 1
## 31 0.341 1.001 38 34 3 16 1
## 21 0.390 0.999 41 29 3 16 1
## 46 0.315 0.989 39 30 3 20 1
## 51 0.320 0.981 40 34 4 24 1
## 16 0.338 0.953 39 28 2 17 1
## 66 0.320 0.925 46 26 4 24 1
## 6 0.345 0.924 38 30 6 13 3
List of players in cluster 3
group3 <- subset(Bas, Fit.cluster == 3)
group3 <- group3[order(group3$OPS, decreasing = TRUE),]
group3
## Avg OPS HR Age Length AAV Fit.cluster
## 6 0.345 0.924 38 30 6 13 3
## 42 0.262 0.917 24 26 5 2 3
## 22 0.300 0.911 36 26 4 6 3
## 27 0.280 0.893 43 27 5 3 3
## 32 0.310 0.892 34 28 4 4 3
## 52 0.295 0.891 33 28 5 3 3
## 63 0.280 0.850 24 24 4 11 3
## 37 0.280 0.819 34 25 5 6 3
We will choose the top 5 in the “Best” group and the top 5 in “group3”, using OPS as the main ordering factor (the baseball metrics are the most important factor in choosing agents, and Avg and OPS are correlated with each other). We pick agents from both groups to balance the risk of paying large salaries against the risk of having to renegotiate contracts after a short time.
# This ranking is based on OPS, with Avg as the tie-breaker.
GoodAgent <- baseballData[ c(56,61,26,31,21,6,42,22,27,32), ]
GoodAgent <- GoodAgent[order(GoodAgent$OPS, GoodAgent$Avg, decreasing = TRUE),]
GoodAgent$Rank <- 1:10
GoodAgent
## Code Avg OPS HR Age Length AAV Rank
## 56 56 0.311 1.048 41 28 4 24 1
## 61 61 0.348 1.038 45 26 2 26 2
## 26 26 0.333 1.010 35 30 4 14 3
## 31 31 0.341 1.001 38 34 3 16 4
## 21 21 0.390 0.999 41 29 3 16 5
## 6 6 0.345 0.924 38 30 6 13 6
## 42 42 0.262 0.917 24 26 5 2 7
## 22 22 0.300 0.911 36 26 4 6 8
## 27 27 0.280 0.893 43 27 5 3 9
## 32 32 0.310 0.892 34 28 4 4 10
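To put a number on the money side of this list, the sketch below multiplies AAV by Length for each of the ten agents. Treating AAV × Length as the total commitment is a simplifying assumption that uses only the two contract columns in the data; the units are whatever units AAV is recorded in. The names `good_agents`, `Total`, `annual_bill` and `total_bill` are introduced here.

```r
# Contract terms of the ten selected agents, copied from the GoodAgent output above
good_agents <- data.frame(
  Code   = c(56, 61, 26, 31, 21, 6, 42, 22, 27, 32),
  Length = c(4, 2, 4, 3, 3, 6, 5, 4, 5, 4),
  AAV    = c(24, 26, 14, 16, 16, 13, 2, 6, 3, 4)
)

# Assumed total commitment per agent: annual value times contract years
good_agents$Total <- good_agents$AAV * good_agents$Length

annual_bill <- sum(good_agents$AAV)    # cost of one season with all ten signed
total_bill  <- sum(good_agents$Total)  # commitment over the full contracts
annual_bill
total_bill
```

This gives 124 per season and 443 over the full contracts. The five agents taken from “Best” account for 96 of the 124 per season, which is why mixing in the cheaper “group3” agents lowers the financial risk.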
One could argue over whether agent 61 belongs in position 1 or 2 of the list. Personally, if we have to pick between 56 and 61, I think we should choose agent 61, because that player performs well across all the metrics (the ranks in each metric are balanced). However, if we want lower risk, agent 56 is the better choice, since that contract is longer (4 years versus 2) and its AAV is slightly lower (24 versus 26).
Overall, GoodAgent is the list of players we should consider signing for our team. In addition, any other players in “Best” and “group3” are potential signings.