Classification & Clustering Assignment

Author

Blessing David

Introduction

This report presents an analysis focused on two tasks: classifying teams as either the home or the away side, and clustering baseball players.


Classification Tree Method:

Code
# Hide code chunks, and display a small button to show them if needed

################################################################################
###########################Classification Tree Method###########################
################################################################################

################################################################################
# Load Packages
################################################################################

library(tidyverse)
library(rpart)
library(rattle)
library(TTR) #Contains the runMean function.


################################################################################
# 1. Load and Prepare Data
################################################################################

pl_training <- read_csv("pl_training.csv")

pl_testing <- read_csv("pl_testing.csv")


################################################################################
# 2. Create the Classification tree model
################################################################################

pl_model_tree <- rpart(home_or_away ~ ftg_diff + 
                     htg_diff + 
                     s_diff + 
                     st_diff + 
                     f_diff +
                     c_diff +
                     y_diff + 
                     r_diff +
                     wdl_ft +
                     wdl_ht, data = pl_training, method = 'class')

fancyRpartPlot(pl_model_tree)

Classification Tree Interpretation

Code
# Hide code chunks, and display a small button to show them if needed but don't show any tibbles

################################################################################
# 3. Interpret the tree model
################################################################################

# Extract the variable importance from the rpart object we have called pl_model_tree
pl_model_tree$variable.importance

#Extract the variable importance as a percentage of all improvements to the model
summary(pl_model_tree)

Predicting If A Team Is The Home Team

Looking at the classification tree above, Path 1 shows that when the feature ftg_diff (goals scored by team – goals scored by opponent) is < 0.6, the tree predicts that the team is the home team. The label “Home” gives the predicted outcome for any team that satisfies the rules on Path 1. The values 0.38 and 0.62 tell us that, of all the teams in the training dataset that fall into this leaf, 38% were actually away teams and 62% were actually home teams. These proportions indicate how pure the leaf is and also give the predicted probability of a team being the home team: a team that follows Path 1 and ends up in this leaf is predicted to be the home team with a probability of 0.62, or 62%.

Predicting If A Team Is The Away Team

Looking at the classification tree above, Path 2 shows that when ftg_diff (goals scored by team – goals scored by opponent) is < 0.6, c_diff (corners awarded to team – corners awarded to opponent) is < -3.6 and s_diff (shots made by team – shots made by opponent) is < 6.6, the team is predicted to be away. In other words, under these conditions the model predicts that the team is playing away from its home stadium. The label “Away” gives the predicted outcome for any team that satisfies the rules on Path 2. The values 0.77 and 0.23 tell us that, of all the teams in the training dataset that fall into this leaf, 77% were actually away teams and 23% were actually home teams. These proportions indicate how pure the leaf is and also give the predicted probability of a team being the away team: a team that follows Path 2 and ends up in this leaf is predicted to be the away team with a probability of 0.77, or 77%.
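The path rules and leaf probabilities described above can also be read directly from the fitted rpart object rather than from the plot; a minimal sketch using functions from the rpart package already loaded:

Code
# Print the split rules leading to every node of the fitted tree
path.rpart(pl_model_tree, nodes = as.numeric(rownames(pl_model_tree$frame)))

# Class probabilities for the first few training matches; each row shows the
# Away/Home proportions of the leaf that the match falls into
head(predict(pl_model_tree, newdata = pl_training, type = 'prob'))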

Important Variables For Predicting If A Team Is Home Or Away

Variable Importance Table

From the classification tree, ftg_diff (goals scored by team – goals scored by opponent) is clearly important because it is the first predictor variable the tree splits on. In contrast, r_diff (red cards shown to team – red cards shown to opponent) does not appear in the tree at all, which suggests it is either not useful for predicting whether the team is home or away, or strongly correlated with one of the other variables. Within the variable importance table, c_diff (corners awarded to team – corners awarded to opponent) is the most important variable for predicting home or away status, with an importance score of 24, and ftg_diff is the second most important, with an importance score of 21.
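To express these importance scores as a share of the total improvement contributed by all variables (rather than scanning the long summary() output), the raw importance vector can be rescaled; a short sketch:

Code
# Variable importance rescaled to percentages of the total improvement
round(100 * pl_model_tree$variable.importance /
        sum(pl_model_tree$variable.importance), 1)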

Accuracy Of The Classification Tree

Code
# Hide code chunks, display a small button to show them if needed and don't show any tibbles

################################################################################
# 4. Check The Model Accuracy
################################################################################

# Accuracy on Training Data

pl_training_tree_prob <- predict(pl_model_tree, newdata = pl_training, type = 'prob')
pl_training_tree_prediction <- predict(pl_model_tree, newdata = pl_training, type = 'class')
pl_training_tree_final <- cbind(pl_training, pl_training_tree_prob, Prediction = pl_training_tree_prediction)
head(pl_training_tree_final)

pl_training_tree_tab <- table(pl_training_tree_final$home_or_away, pl_training_tree_final$Prediction, dnn = c('Actual', 'Predicted'))
pl_training_tree_tab

pl_training_tree_acc <- sum(diag(pl_training_tree_tab))/sum(pl_training_tree_tab)
pl_training_tree_acc

# Accuracy on Testing data

pl_testing_tree_prob <- predict(pl_model_tree, newdata = pl_testing, type = 'prob')
pl_testing_tree_prediction <- predict(pl_model_tree, newdata = pl_testing, type = 'class')
pl_testing_tree_final <- cbind(pl_testing, pl_testing_tree_prob, Prediction = pl_testing_tree_prediction)
head(pl_testing_tree_final)

pl_testing_tree_tab <- table(pl_testing_tree_final$home_or_away, pl_testing_tree_final$Prediction, dnn = c('Actual', 'Predicted'))
pl_testing_tree_tab

pl_testing_tree_acc <- sum(diag(pl_testing_tree_tab))/sum(pl_testing_tree_tab)
pl_testing_tree_acc

Training Model Accuracy | Testing Model Accuracy

From the above confusion matrix on the training data, we know the following: the overall model accuracy is (183+207)/600 = 0.65 or 65%. Of all the matches where the model predicted the team was away, 183/274 = 0.67 or 67% were correct. Of all the matches where the model predicted the team was home, 207/326 = 0.64 or 64% were correct. Of all matches where the team was actually away, the model correctly identified 183/302 = 0.61 or 61%. Of all matches where the team was actually home, the model correctly identified 207/298 = 0.69 or 69%.

On their own, these training results only tell us how well the model predicts data it has already seen; to judge whether it generalises to new data, we need to compare them with the results on the testing data below.

From the above confusion matrix on the testing data, we know the following: the overall model accuracy is (43+58)/160 = 0.63 or 63%. Of all the matches where the model predicted the team was away, 43/67 = 0.64 or 64% were correct. Of all the matches where the model predicted the team was home, 58/93 = 0.62 or 62% were correct. Of all matches where the team was actually away, the model correctly identified 43/78 = 0.55 or 55%. Of all matches where the team was actually home, the model correctly identified 58/82 = 0.71 or 71%. The tree model predicts the training dataset well, with an overall accuracy of 65%, while on the testing data it is 63% accurate. This gap of roughly two percentage points is small, which suggests the model is not overfitting and generalises reasonably well to new, unseen data: there is no large gap where the model performs well on the training data but poorly on the testing data.
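The percentages quoted above can be computed directly from the confusion matrix objects rather than by hand; a minimal sketch using the testing table, where rows are the actual classes and columns the predicted classes (as set by dnn):

Code
# Overall accuracy: correct predictions (diagonal) over all predictions
sum(diag(pl_testing_tree_tab)) / sum(pl_testing_tree_tab)

# Precision per class: of the matches predicted Away/Home, how many were correct
diag(pl_testing_tree_tab) / colSums(pl_testing_tree_tab)

# Recall per class: of the matches that were actually Away/Home, how many were found
diag(pl_testing_tree_tab) / rowSums(pl_testing_tree_tab)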


Binary Logistic Regression Model

Code
# Hide code chunks, display a small button to show them if needed and don't show any tibbles

################################################################################
#######################Binary Logistic Regression Model#########################
################################################################################


################################################################################
# 1. Load and Prepare Data
################################################################################

pl_training <- read_csv("pl_training.csv")

pl_testing <- read_csv("pl_testing.csv")


################################################################################
# 2. Set Up levels of response variable
################################################################################

pl_training$home_or_away <- factor(pl_training$home_or_away, levels = c("Home", "Away"))
pl_testing$home_or_away <- factor(pl_testing$home_or_away, levels = c("Home", "Away"))

levels(pl_training$home_or_away)
levels(pl_testing$home_or_away)


################################################################################
# 3. Create the binary logistic regression model
################################################################################

pl_model_lr <- glm(home_or_away ~ ftg_diff  + htg_diff + s_diff + st_diff + c_diff + y_diff + r_diff + wdl_ft + wdl_ht, data = pl_training, family = binomial(link = 'logit'))
summary(pl_model_lr)

Binary Coefficients

Regression Equation

y = ln(π/(1-π)) = 0.143 + 0.153·ftg_diff - 0.097·htg_diff - 0.046·s_diff + 0.020·st_diff - 0.022·c_diff + 0.024·y_diff - 0.225·r_diff + 0.440·wdl_ftLose - 0.584·wdl_ftWin - 0.086·wdl_htLose - 0.175·wdl_htWin
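The coefficients in this equation come directly from the fitted model, and the log-odds can be converted back to a probability with the inverse logit; a short sketch, using the first training match purely as an illustrative example:

Code
# Coefficients on the log-odds scale (these match the equation above)
coef(pl_model_lr)

# Linear predictor (log-odds) and fitted probability for one example match
log_odds <- predict(pl_model_lr, newdata = pl_training[1, ], type = 'link')
plogis(log_odds)  # equivalent to predict(..., type = 'response')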

Important Predictor Variables

A p-value of 0.0113 means that the predictor variable s_diff (shots made by team – shots made by opponent) is the only one that is statistically significant at the 0.05 level when deciding whether a team is home or away. In other words, the difference in shots between the two teams carries the most useful information for predicting home or away status. It is also worth noting that wdl_ftWin (whether the team won, lost, or drew at full time) does not reach statistical significance at the standard 0.05 level, as indicated by the dot in the summary output.
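The p-values discussed here can be extracted from the model summary and ranked, which makes it easy to confirm that s_diff is the only term below the 0.05 threshold; a minimal sketch:

Code
# Coefficient table: estimates, standard errors, z values and p-values
coef_tab <- summary(pl_model_lr)$coefficients

# Order the terms by p-value to see which predictors are closest to significance
coef_tab[order(coef_tab[, "Pr(>|z|)"]), ]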

Impact Of The Significant Predictor Variable On The Odds Of A Team Being Classified As The Home Team

exp(-0.046) = 0.955042 = 0.955. This tells us how much the odds of a team being classified as the home team change for each one-unit change in the s_diff predictor variable. An increase of one shot for the team over its opponent is therefore associated with a (0.955 - 1) x 100% = -4.5% change, i.e. a 4.5% reduction, in the odds of the team being classified as the home team. Equivalently, the odds of the team being classified as the away team increase by a factor of 1/0.955 ≈ 1.047.
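The odds ratio calculated above, exp(-0.046) ≈ 0.955, can be reproduced (along with an approximate 95% confidence interval) directly from the fitted model; a short sketch:

Code
# Odds ratio for s_diff: multiplicative change in the odds per additional shot
exp(coef(pl_model_lr)["s_diff"])

# Approximate 95% profile-likelihood confidence interval for that odds ratio
exp(confint(pl_model_lr, parm = "s_diff"))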

Binary Logistic Regression Model Accuracy

Code
# Hide code chunks, and display a small button to show them if needed but don't show any tibbles

################################################################################
# 4. Check the model accuracy
################################################################################

# First apply to training data set

pl_training_lr_pi <- predict(pl_model_lr, newdata = pl_training, type = 'response')
pl_training_lr_pi

pl_training_lr_final <- pl_training %>%
                          mutate(pi = pl_training_lr_pi) %>%
                          mutate(pl_training_lr_prediction = case_when(pi > 0.5 ~ 'Home', 
                                                                       pi <= 0.5 ~ 'Away'))
pl_training_lr_final

pl_training_lr_tab <- table(pl_training_lr_final$home_or_away, pl_training_lr_final$pl_training_lr_prediction, dnn=c('Actual', 'Predicted'))
pl_training_lr_tab

pl_training_lr_acc <- sum(diag(pl_training_lr_tab))/sum(pl_training_lr_tab)
pl_training_lr_acc


# Second apply to testing data set

pl_testing_lr_pi <- predict(pl_model_lr, newdata = pl_testing, type = 'response')
pl_testing_lr_pi

pl_testing_lr_final <- pl_testing %>%
                        mutate(pi = pl_testing_lr_pi) %>%
                        mutate(pl_testing_lr_prediction = case_when(pi > 0.5 ~ 'Home', 
                                                                    pi <= 0.5 ~ 'Away'))
pl_testing_lr_final

pl_testing_lr_tab <- table(pl_testing_lr_final$home_or_away, pl_testing_lr_final$pl_testing_lr_prediction, dnn = c("Actual", "Predicted"))
pl_testing_lr_tab

pl_testing_lr_acc <- sum(diag(pl_testing_lr_tab))/sum(pl_testing_lr_tab)
pl_testing_lr_acc

Training Model Accuracy | Testing Model Accuracy

From the above confusion matrix on the training data, we know the following: the overall model accuracy is (178+189)/600 = 0.61 or 61%. Of all the matches where the model predicted the team was away, 178/291 = 0.61 or 61% were correct. Of all the matches where the model predicted the team was home, 189/309 = 0.61 or 61% were correct. Of all matches where the team was actually away, the model correctly identified 178/298 = 0.60 or 60%. Of all matches where the team was actually home, the model correctly identified 189/302 = 0.63 or 63%.

Overall the model appears consistent, with similar accuracy for home and away predictions.

From the above confusion matrix on the testing data, we know the following: the overall model accuracy is (53+51)/160 = 0.65 or 65%. Of all the matches where the model predicted the team was away, 53/80 = 0.66 or 66% were correct. Of all the matches where the model predicted the team was home, 51/82 = 0.62 or 62% were correct. Of all matches where the team was actually away, the model correctly identified 53/82 = 0.65 or 65%. Of all matches where the team was actually home, the model correctly identified 51/78 = 0.65 or 65%.

Here again the model seems consistent, with similar accuracy for each type of prediction. It is not a particularly strong model, but the similar training and testing performance does suggest it generalises well to unseen data and is not overfitting.

Model Accuracy

Code
################################################################################
# 5: Compare the accuracy of all models
################################################################################

pl_training_tree_acc
pl_testing_tree_acc

pl_training_lr_acc
pl_testing_lr_acc

Classification Tree Model: Training accuracy: 65%, Testing accuracy: 63%

Binary Logistic Regression Model: Training accuracy: 61%, Testing accuracy: 65%

Of the two models, the binary logistic regression model is the better choice for predicting whether a team is the home or the away team: it has the higher accuracy on the testing dataset, which is the better indicator of how well a model generalises to new, unseen data.
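A convenient way to compare the two models side by side is to gather the four accuracy values into a single table; a minimal sketch, assuming the accuracy objects from the earlier chunks are still in the environment:

Code
# Collect training and testing accuracy for both models in one tibble
tibble(
  Model    = c("Classification tree", "Logistic regression"),
  Training = c(pl_training_tree_acc, pl_training_lr_acc),
  Testing  = c(pl_testing_tree_acc,  pl_testing_lr_acc)
)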

Comparison Of Important Predictor Variables Across Models

Classification Tree Variable Importance: c_diff is the most important variable for predicting whether a team is the home or the away team, with an importance score of 24; s_diff is the third most important, with an importance score of 15.

Binary Logistic Regression Variable Importance: s_diff, with a p-value of 0.0113, is statistically significant at the 0.05 level when deciding whether a team is home or away; c_diff, with a p-value of 0.3768, is not statistically significant at the 0.05 level.

The two models can prioritise variables differently because of how they use the data mathematically. The classification tree can capture non-linear splits and interactions between predictors, which may be why c_diff is ranked as more important there. Binary logistic regression assumes a linear relationship between each predictor and the log-odds of the outcome; the fact that s_diff is significant in this model suggests its effect is well described as linear, whereas c_diff may have a non-linear effect or may be overshadowed by the other predictors once they are all included together.


Cluster Analysis

The data need to be scaled before computing the distance matrix for hierarchical clustering and before being passed to the K-means algorithm, so that every variable contributes on a similar scale to the distance calculations; otherwise variables with large numeric ranges would dominate the analysis.
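The effect of scaling can be checked quickly: after scale(), every column has a mean of roughly 0 and a standard deviation of 1, so no single statistic dominates the Euclidean distance calculation. A short sketch, assuming the scaled matrix baseball_stats created in the code below:

Code
# Column means should be approximately 0 and standard deviations 1 after scaling
round(colMeans(baseball_stats), 3)
apply(baseball_stats, 2, sd)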

Hierarchical Clustering

Code
# Hide code chunks, and display a small button to show them if needed

################################################################################
###############################Cluster Analysis#################################
################################################################################


################################################################################
# Load Packages 
################################################################################

library(cluster)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(RColorBrewer) # Color Palette

################################################################################
# 1. Import the data and prepare for clustering
################################################################################

base_ball <- read_csv("baseball_hof.csv")
View(base_ball)

#Create sub data set containing only those variables to be included in cluster analysis

base_ball <- base_ball %>%
  select(playerID, hits, runs, home_runs, rbi, stolen_bases)
View(base_ball)


################################################################################
# 2. Select and scale numeric columns 
################################################################################

baseball_stats <- base_ball %>%
  select(-playerID) %>%  # Remove playerID 
  scale()                # Scale numeric data

view(baseball_stats)


################################################################################
# 3. Calculate distance matrix
################################################################################

b1 <- dist(baseball_stats)


################################################################################
# 4. Carry out the hierarchical clustering
################################################################################

h1 <- hclust(b1, method = 'ward.D')
plot(h1, hang = -1)

Code
heatmap(as.matrix(b1), Rowv = as.dendrogram(h1), Colv = 'Rowv')

Both the heatmap and the dendrogram show a clustering pattern in the data. The heatmap provides clear evidence of clustering, visible as distinct blocks of similar values, and this is confirmed by the structure of the dendrogram, which suggests the players can be separated into two or three meaningful clusters.

Creation Of A 4-Cluster Solution And Quality

Code
# Hide code chunks, and display a small button to show them if needed

################################################################################
# 5. Decide on number of clusters
################################################################################

clusters1 <- cutree(h1, k=4)


################################################################################
# 6. Assess the quality of the segmentation
################################################################################

sil1 <- silhouette(clusters1, b1)
summary(sil1)
Silhouette of 82 units in 4 clusters from silhouette.default(x = clusters1, dist = b1) :
 Cluster sizes and average silhouette widths:
       17        25        30        10 
0.3202221 0.2078566 0.2958375 0.4325219 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1771  0.2100  0.3257  0.2907  0.4224  0.5725 

The overall cluster analysis has a mean silhouette score of 0.2907, which suggests the structure uncovered is weak and could be artificial. The average silhouette scores for each individual cluster also tell us:

Clusters 1 and 3 have scores of 0.3202 and 0.2958 respectively; these similar values suggest some genuine cluster structure, but it is weak.

Cluster 2 has the lowest silhouette score, 0.2079, meaning this cluster shows little substantial structure.

Cluster 4 has the highest silhouette score, 0.4325, suggesting its members are the most clearly separated from the other clusters, although the structure is still fairly weak.
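Since the 4-cluster solution is only weakly supported, it is worth checking how the average silhouette width changes for other numbers of clusters; a minimal sketch using the existing distance matrix and dendrogram:

Code
# Average silhouette width for 2 to 6 hierarchical clusters
sapply(2:6, function(k) {
  clu <- cutree(h1, k = k)
  mean(silhouette(clu, b1)[, "sil_width"])
})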

Tables And Suitable Graphs To Describe The Properties That Seem Common To Each Cluster Of Players

Code
# Hide code chunks, display a small button to show them if needed and don't show any output

################################################################################
# 7. Profile the clusters
################################################################################

# Combine the baseball data set to a vector containing the clusters that each player belongs to

baseball_clus <- cbind(base_ball, clusters1)
baseball_clus <- mutate(baseball_clus, Cluster = case_when(clusters1 == 1 ~ 'C1',
  clusters1 == 2 ~ 'C2',
  clusters1 == 3 ~ 'C3',
  clusters1 == 4 ~ 'C4'))


# Create subset of data containing only the variables we want to profile the clusters on

baseball_clus <- baseball_clus %>%
  select(hits, runs, home_runs, rbi, stolen_bases, Cluster)


# Profile numerical variables. Calculate mean value of all variables for each cluster.

baseball_clus_means <- baseball_clus %>%
  group_by(Cluster) %>%
  summarise(hits = mean(hits),
            runs = mean(runs),
            home_runs = mean(home_runs),
            rbi = mean(rbi),
            stolen_bases = mean(stolen_bases))

baseball_clus_means


################################################################################
# 8. Graph Creation
################################################################################

# Convert the data set to be in "tidy" format to allow for creation of graph.

baseball_clus_tidy <- baseball_clus_means %>%
  pivot_longer(cols = c(hits, runs, home_runs, rbi, stolen_bases), 
               names_to = "Attributes", 
               values_to = "Average_Value")

baseball_clus_tidy


# Reorder attributes in the Attributes variable to allow for more sensible grouping of attributes and easier interpretation of line graph.

baseball_clus_tidy$Attributes <- factor(baseball_clus_tidy$Attributes, 
                                        levels = c("hits", "runs", "home_runs", "rbi", "stolen_bases"))


# Visualize the mean score of each cluster for each attribute (Line Graph).

ggplot(baseball_clus_tidy, mapping = aes(x = Attributes, y = Average_Value, group = Cluster, colour = Cluster)) +
  geom_line(linewidth = 1) +  # Updated to use 'linewidth' as I kept getting a warning to update it on quarto
  geom_point(size = 2) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
  ylab("Mean Attributes") + 
  scale_x_discrete(labels= c("hits", "runs", "home_runs", "rbi", "stolen_bases")) +
  ggtitle("Mean Score for each Cluster for each Attribute Measure")

Code
# Visualize the mean score of the Hits variable for each cluster (Bar Chart).

ggplot(baseball_clus_means, aes(x = Cluster, y = hits, fill = Cluster)) + # Using the baseball_clus_means data 
  geom_col(show.legend = FALSE) +  # Do not want to show the legend
  scale_fill_manual(values = c("C1" = "#83C5BE", "C2" = "#FFD166", "C3" = "#FF6D6A", "C4" = "#A390E4")) +
  scale_y_continuous(labels = scales::label_number()) +  # Use label_number() for non-percentage scales
  ylab("Average Hits") +
  xlab("Cluster") +
  ggtitle("Average Hits Distribution by Cluster")

Code
# Visualize the mean score of the Rbi variable for each cluster (Bar Chart).

ggplot(baseball_clus_means, aes(x = Cluster, y = rbi, fill = Cluster)) + # Using the baseball_clus_means data 
  geom_col(show.legend = FALSE) +  # Do not want to show the legend
  scale_fill_manual(values = c("C1" = "#40E0D0", "C2" = "#DB7093", "C3" = "#F4A460", "C4" = "#87CEEB")) +
  scale_y_continuous(labels = scales::label_number()) +  # Use label_number() for non-percentage scales
  ylab("Average RBI Value") +
  xlab("Cluster") +
  ggtitle("RBI Mean Value Distribution by Cluster")

Code
# The relationships between two variables "Runs" & "Home_runs" (Scatter Plot).

ggplot(baseball_clus_means, aes(x = runs, y = home_runs, color = Cluster)) + # Using the baseball_clus_means data 
  geom_point(alpha = 0.6) +
  labs(title = "Home Runs vs. Runs by Cluster", x = "Runs", y = "Home Runs")

Attribute Line Graph:

Based on the graph, it is clear that C1 is made up of players who perform strongly on home runs and possibly runs batted in, implying that it is full of powerful batters who are effective at driving in runs.

C2 players do well in both hits and runs, suggesting they regularly reach base and score, creating opportunities for their teammates.

C3 does well when it comes to home runs and runs batted in. This means the players in this group have a major impact on batting, which leads to a higher number of runs being scored.

C4 shows a group of players who do not belong to any of the other clusters. This suggests that there is some kind of specialisation among players within this cluster.

Hits variable Bar Chart:

Based on the graph, C1 and C2 show a greater mean value for hits, indicating that the players in these clusters are likely to be strong in this attribute.

The mean score of C3 is lower than that of C1 and C2, implying that players in this category may have less expertise in hits.

The C4 mean score is the lowest compared to the other three clusters, indicating that players in this group may possess unique strengths or may not prioritise hits as highly.

Rbi variable Bar Chart:

Based on the graph, it is shown that C1 has a very large mean for runs batted in, suggesting that these players often take part in scoring plays.

The mean score of C2 is close to that of C1, implying that these players make significant impacts on the team’s offensive chances, but somewhat less than C1.

C3 has a smaller mean in comparison to C1 and C2, suggesting that there is less contribution to chances for scoring or maybe a separate role in the team’s offensive tactics.

C4 implies that while players in C4 may not score as often as those in C1 and C2, they are more likely to be part of major scoring opportunities compared to those in C3.

To summarise, C1 and C2 appear to consist of players who are very likely to make a major difference to their team’s score, with C1 having a slight edge over C2. C3 seems to be made up of players with more specialised roles who are less involved in scoring opportunities. C4 players sit in an intermediate position, showing more involvement in scoring than C3 but less than C1 and C2.

The relationships between two variables “Runs” & “Home_runs” Scatter Plot:

Based on the graph, the players in C1 appear to stand out for their run-scoring ability regardless of their number of home runs, suggesting they are good at reaching base and scoring runs by means other than hitting home runs.

C2 might suggest players that make an important difference towards scoring and hitting home runs. They might be considered versatile players who make valuable contributions to the team’s offensive performance, but they may not necessarily be the top performers in each area.

Cluster C3 may include players who have lower scoring and home run statistics compared to those in other groups. This might suggest players with distinct positions on the squad or those who may need improvement in many areas of how they play.

C4 appears to contain players who are excellent at hitting home runs and also score a significant number of runs. These players are likely to be the strongest hitters, having an important impact on the game by scoring runs both for themselves and for their teammates with powerful hitting.

In general, the plot implies a variety of roles or skills across the clusters. C1 and C3 exhibit contrasting attributes: C1 leans towards scoring runs without relying on home runs, while C3 has room for improvement in both areas. C4 players stand out for their significant influence on both scoring and home runs, suggesting they are likely the primary offensive players in their lineup.

K-means Clustering

Code
# Hide code chunks, display a small button to show them if needed and don't show any output

##############################################################################
##########################Cluster Analysis - K-means##########################
##############################################################################

################################################################################
# Load Packages 
################################################################################

library(cluster)
library(tidyverse)
library(dplyr)
library(tibble)

################################################################################
# 1. Import the data and prepare for clustering
################################################################################

ball_2 <- read_csv("baseball_hof.csv")
View(ball_2)


#Create sub data set containing only those variables to be included in cluster analysis

ball_2 <- ball_2 %>%
  select(playerID, hits, runs, home_runs, rbi, stolen_bases)
View(ball_2)


################################################################################
# 2. Select and scale numeric columns 
################################################################################

ball_kstats <- ball_2 %>%
  select(-playerID) %>%  # Remove playerID 
  scale()                # Scale numeric data

view(ball_kstats)


################################################################################
# 3. Carry out the k-means clustering on the scaled dataset
################################################################################
set.seed(101)
kmeans1 <- kmeans(ball_kstats, centers = 4)

kmeans1$cluster #The assignment of each observation (i.e. player) to each cluster.
kmeans1$centers #The centroids for each cluster.
kmeans1$size    #The number of observations in each cluster.
kmeans1$iter    #The number of iterations it took to converge to the final clustering solution.

K-means Clustering Solution Quality

Code
# Hide code chunks, and display a small button to show them if needed

################################################################################
# 4. Assess the quality of the segmentation
################################################################################
b2 <- dist(ball_kstats)
ball_kmeans <- silhouette(kmeans1$cluster, b2)
summary(ball_kmeans)
Silhouette of 82 units in 4 clusters from silhouette.default(x = kmeans1$cluster, dist = b2) :
 Cluster sizes and average silhouette widths:
       28        22        19        13 
0.3322757 0.2178155 0.2687416 0.3421893 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.06303  0.18425  0.28984  0.28842  0.41869  0.54977 

The overall cluster analysis has a mean silhouette score of 0.28842, which suggests the structure uncovered is weak and could be artificial. The average silhouette scores for each individual cluster also tell us:

Clusters 1 (0.3323), 3 (0.2687) and 4 (0.3422) have silhouette scores that indicate weak and possibly artificial structure.

Cluster 2 has the lowest silhouette score, 0.2178, meaning this cluster shows little substantial structure.
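Because kmeans() starts from randomly chosen centroids, a single run (even with a seed set) can land in a poor local optimum. One common check is to rerun the algorithm with several random starts and keep the best solution; a short sketch using the same scaled data:

Code
# Rerun k-means with 25 random starts and keep the solution with the lowest
# total within-cluster sum of squares
set.seed(101)
kmeans_multi <- kmeans(ball_kstats, centers = 4, nstart = 25)

# Compare the total within-cluster sum of squares with the single-start solution
kmeans1$tot.withinss
kmeans_multi$tot.withinss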

Tables And Suitable Graphs To Describe The Properties That Seem Common To Each Cluster Of Players

Code
# Hide code chunks, display a small button to show them if needed and don't show any tibbles

################################################################################
# 5. Profile the clusters
################################################################################

# Need to convert kmeans1$centers to a tibble using as_tibble - this allows us to use other dplyr functions, add a cluster label, and then reshape the data to long format for easier analysis or visualization

ball2_clus_kmeans_tidy <- as_tibble(kmeans1$centers) %>%
  mutate(Cluster = c("C1", "C2", "C3", "C4")) %>% 
  pivot_longer(cols = c(hits, runs, home_runs, rbi, stolen_bases), 
               names_to = "Attributes", 
               values_to = "Average_Value")
 
ball2_clus_kmeans_tidy


#Combine original baseball dataset to the clusters, create tidier version of clusters variable and select variables we wish to profile segments on.

ball2_clus_kmeans <- ball_2 %>%
                    mutate(clusters = kmeans1$cluster) %>%
                    mutate(Cluster = case_when(clusters == 1 ~ 'C1',
                                               clusters == 2 ~ 'C2',
                                               clusters == 3 ~ 'C3',
                                               clusters == 4 ~ 'C4')) %>%
                    select(hits, runs, home_runs, rbi, stolen_bases, Cluster)


# Profile numerical variables. Calculate mean value of all variables for each cluster.

ball2_clus_means <- ball2_clus_kmeans %>%
  group_by(Cluster) %>%
  summarise(hits = mean(hits),
            runs = mean(runs),
            home_runs = mean(home_runs),
            rbi = mean(rbi),
            stolen_bases = mean(stolen_bases))

ball2_clus_means


################################################################################
# 6. Graph Creation
################################################################################

# Reorder attributes to allow for more sensible grouping of attributes and easier interpretation of line graph.
ball2_clus_kmeans_tidy$Attributes<- factor(ball2_clus_kmeans_tidy$Attributes, levels = c("hits", "runs", "home_runs", "rbi", "stolen_bases"))


# Visualize the mean score of each cluster for each attribute (Line Graph).
ggplot(ball2_clus_kmeans_tidy, mapping = aes(x = Attributes, y = Average_Value, group = Cluster, colour = Cluster)) +
  geom_line(linewidth = 1) +  # 'linewidth' replaces the deprecated 'size' argument for lines
  geom_point(size = 2) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
  ylab("Mean Attributes") + 
  scale_x_discrete(labels= c("hits", "runs", "home_runs", "rbi", "stolen_bases")) +
  ggtitle("Mean Score for each Cluster for each Attribute Measure")

Code
# Visualize the mean score of the Hits variable for each cluster (Bar Chart).

ggplot(ball2_clus_means, aes(x = Cluster, y = hits, fill = Cluster)) + # Using the ball2_clus_means data 
  geom_col(show.legend = FALSE) +  # Do not want to show the legend
  scale_fill_manual(values = c("C1" = "#83C5BE", "C2" = "#FFD166", "C3" = "#FF6D6A", "C4" = "#A390E4")) +
  scale_y_continuous(labels = scales::label_number()) +  # Use label_number() for non-percentage scales
  ylab("Average Hits") +
  xlab("Cluster") +
  ggtitle("Average Hits Distribution by Cluster")

Code
# Visualize the mean score of the Rbi variable for each cluster (Bar Chart).

ggplot(ball2_clus_means, aes(x = Cluster, y = rbi, fill = Cluster)) + # Using the ball2_clus_means data 
  geom_col(show.legend = FALSE) +  # Do not want to show the legend
  scale_fill_manual(values = c("C1" = "#40E0D0", "C2" = "#DB7093", "C3" = "#F4A460", "C4" = "#87CEEB")) +
  scale_y_continuous(labels = scales::label_number()) +  # Use label_number() for non-percentage scales
  ylab("Average Ribi's") +
  xlab("Cluster") +
  ggtitle("Average Rbi's Distribution by Cluster")

Code
# The relationships between two variables "Runs" & "Home_runs" (Scatter Plot).

ggplot(ball2_clus_means, aes(x = runs, y = home_runs, color = Cluster)) + # Using the ball2_clus_means data 
  geom_point(alpha = 0.6) +
  labs(title = "Home Runs vs. Runs by Cluster", x = "Runs", y = "Home Runs")

Attribute Line Graph:

The data imply that players in C1 perform at a consistent level across all attributes, suggesting they have a balanced skill set without any standout strengths or weaknesses in a single area.

C2 achieves notably high figures in stolen bases but does not lead in any other category, suggesting a specialisation in one attribute alongside weaker results elsewhere.

C3 sits mostly in the bottom half of the graph, suggesting these players have lower average scores across all the measured attributes; they may have room to improve, or they may fill roles that do not prioritise these particular measurements.

C4 has the greatest values in two attributes while dropping away on the other variables, so players in C4 may be individuals who perform well in specific parts of the game.

Hits variable Bar Chart:

The C1 cluster has a high average value for this measure, suggesting that players in C1 have a high level of proficiency in this attribute.

The average value of C2 exceeds that of C1, suggesting that C2 includes players who are very skilled at getting hits, possibly indicating a group of particularly talented batters.

The average value of C3 is lower than that of C1 and C2, indicating that while C3 players contribute in this area, their contribution is not as significant as that of the first two clusters.

C4 has a greater mean value than C3 but falls short of the level achieved by C2, suggesting that the players in C4, while capable, do not reach the same heights as those in C2.

Based on this distribution, C1 and C4 both show good ability in this variable, although C4 lags some way behind. C2 has the highest average, perhaps indicating a group of very proficient or consistently high-performing hitters. C3 has the smallest impact but still plays a role, albeit to a lesser degree than the other clusters.

Rbi variable Bar Chart:

Players in the C1 cluster have a high average runs batted in, indicating their constant participation in plays that result in their team scoring runs. This suggests that C1 may consist of those who consistently perform well or are important offensive players.

The C2 players have an average runs batted in that is somewhat lower than the C1 players, suggesting that they are similarly successful in generating scoring chances but may not be as dominating in this aspect as the C1 players.

C3 has a lower mean runs batted in comparison to C1 and C2. Players in this cluster may have varying responsibilities within the team’s offence or may not be as proficient at driving in runs.

C4 has the highest mean runs batted in, suggesting that C4 comprises the most significant offensive players, perhaps including power batters or key players who thrive in scoring opportunities.

The wide variation in mean runs batted in across these clusters may indicate differing batting skills or opportunities. C1 and C2 both appear to be important contributors, although C2 falls slightly behind C1. C3 may consist of players with more diverse skill sets or those who contribute in ways other than runs batted in. C4 is distinct, likely representing a group with a particularly strong ability to generate runs.

The relationships between two variables “Runs” & “Home_runs” Scatter Plot:

C1 may not be among the top performers in either home runs or runs scored. This implies they may have roles in the team that prioritise other aspects of the game rather than focusing solely on power hitting; alternatively, they could be less experienced players who are still developing these particular skills.

C2 represents players who make an important contribution in terms of home runs and runs. This might show players with well-rounded abilities who consistently contribute to the team’s offence but are not the top scorers or home run leaders.

The C3 cluster may consist of players who contribute less to scoring runs and hitting home runs than those in the other groups.

C2 suggests steady involvement, whilst C4 distinguishes itself with players who are likely to have a major impact on the team’s score through their home-run hitting.

Hierarchical Clustering and K-means Analysis

Highest Quality Clusters:

Hierarchical Mean Silhouette Score: 0.2907
K-means Mean Silhouette Score: 0.28842

Although the difference is small, and both clustering solutions are weak and could be artificial, hierarchical clustering produced the slightly higher-quality clusters.
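The two mean silhouette widths quoted above can be recomputed from the silhouette objects created earlier, which confirms the comparison; a short sketch:

Code
# Mean silhouette width for the hierarchical and k-means 4-cluster solutions
mean(sil1[, "sil_width"])         # hierarchical clustering
mean(ball_kmeans[, "sil_width"])  # k-means clustering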

Cluster Profile

Hierarchical Cluster Tibble

K-means Cluster Tibble

The tables above show the average attribute values for the hierarchical and k-means clustering solutions. In the hierarchical solution, C1 has the greatest means for most of the variables, whereas C2 stands out with notably high values for stolen bases; C3 has the lowest values across the board, while C4 displays moderate values with the fewest stolen bases. In the k-means solution, C1 has moderate means without any notable extremes, C2 has the largest values for several variables, which may suggest that players in this cluster specialise in those areas, C3 again has the lowest values for all attributes, and C4 stands out as having the largest mean for hits.

Both solutions identify C3 as the cluster with the smallest means, suggesting this group has comparable attributes under either clustering method. There are also differences: in the hierarchical solution C1 has the largest mean scores for most variables, whereas in the k-means solution C1’s values are spread more evenly across the attributes. C4 has a less distinctive profile in the hierarchical solution but shows a marked rise in hits in the k-means solution, which suggests the k-means clustering may be capturing structure that the hierarchical solution does not emphasise as much.