Assignment Two - Classification

Author

Sadhbh Murray Donoghue

# load relevant libraries
library(tidyverse)
library(rpart)
library(rattle)
library(TTR)
library(dplyr)
library(cluster)

Question One: Classification

1. Import the data

# Import and name training and testing datasets
pl_training <- read_csv("pl_training.csv")
pl_testing <- read_csv("pl_testing.csv")
view(pl_training)

2. Create and Assess the Classification Tree Model

a. Lago-Peñas and Lago-Ballesteros (2011) and Tucker et al. (2005) identified a number of statistically significant variables for discriminating between home and away teams: goals, total shots, shots on target, fouls, yellow cards and corners. These variables will be included in our decision tree model.

#|label: Decide on a Classification Tree to Classify Home or Away - Model One (both papers)

pl_model_tree1 <- rpart(home_or_away ~ ftg_diff + s_diff + st_diff + f_diff + y_diff + c_diff,
                       data = pl_training, method = 'class')
fancyRpartPlot(pl_model_tree1)

b. Interpret the classification tree

(i) Home team path one: If ftg_diff > 0.5 (i.e. the number of goals scored by a team minus the number of goals scored by their opponent is greater than 0.5), then that first/named team is predicted to be the home team. The labels 0.38 and 0.62 tell us that, of all the matches in the training dataset that fall into this leaf, 38% were away teams and 62% were home teams. If the first/named team follows this path and ends up in this leaf, they are predicted to be the home team with a probability of 0.62 or 62%.

(ii) Away team path two: If ftg_diff < 0.5 and c_diff < -3.5, then the first/named team is predicted to be the away team. In other words, if the first/named team draws or loses (has a goal difference of less than +0.5) and their opponent is awarded at least four more corners, then they are predicted to be the away team. The labels 0.77 and 0.23 tell us that, of all the matches in the training dataset that fall into this leaf, 77% were away teams and 23% were home teams. If the first/named team follows this path and ends up in this leaf, they are predicted to be the away team with a probability of 0.77 or 77%.
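As a quick check of these two paths, predict() can be applied to a couple of hypothetical matches; the difference values below are invented purely for illustration, with all other differences set to zero.

# Hypothetical match 1: the named team wins by one goal (path one above)
# Hypothetical match 2: the named team loses by one goal and has five fewer corners (path two above)
new_matches <- tibble(ftg_diff = c(1, -1), s_diff = 0, st_diff = 0,
                      f_diff = 0, y_diff = 0, c_diff = c(0, -5))
predict(pl_model_tree1, newdata = new_matches, type = 'prob')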

pl_model_tree1$variable.importance
   c_diff  ftg_diff    s_diff   st_diff    f_diff    y_diff 
13.569282 11.795314  8.591108  5.876283  3.097123  1.658069 

(iii) From viewing the decision tree we can see that ftg_diff and c_diff are important predictor variables, as they are the first and the most common variables in the tree respectively. However, we get a better understanding of variable importance from the above output. The most important variable is c_diff with a value of 13.57, accounting for 30% of the improvements in the model. The second most important variable is ftg_diff with a value of 11.80, accounting for 26% of the improvements. s_diff, st_diff, f_diff and y_diff account for 19%, 13%, 7% and 4% of the improvements in the model respectively.
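The percentage shares quoted above can be reproduced by expressing each importance value as a proportion of the total; a short sketch:

# each variable's importance as a percentage of the total importance
round(pl_model_tree1$variable.importance /
        sum(pl_model_tree1$variable.importance) * 100, 1)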

pl_training_tree1_prob <- predict(pl_model_tree1, newdata = pl_training, type = "prob")
pl_training_tree1_prediction <- predict(pl_model_tree1, newdata = pl_training, type = 'class')
pl_training_tree1_final <- cbind(pl_training, pl_training_tree1_prob, pl_training_tree1_prediction)
pl_training_tree1_tab <- table(pl_training_tree1_final$home_or_away, 
                              pl_training_tree1_final$pl_training_tree1_prediction, 
                              dnn = c('Actual', 'Prediction'))
pl_training_tree1_tab
      Prediction
Actual Away Home
  Away  183  119
  Home   91  207
pl_training_tree1_acc <- sum(diag(pl_training_tree1_tab))/sum(pl_training_tree1_tab)
pl_training_tree1_acc       # 65%
[1] 0.65
## measure accuracy on testing dataset
pl_testing_tree1_prob <- predict(pl_model_tree1, newdata = pl_testing, type = 'prob')
pl_testing_tree1_prediction <- predict(pl_model_tree1, newdata = pl_testing, type = 'class')
pl_testing_tree1_final <- cbind(pl_testing, pl_testing_tree1_prob, pl_testing_tree1_prediction)
pl_testing_tree1_tab <- table(pl_testing_tree1_final$home_or_away, 
                              pl_testing_tree1_final$pl_testing_tree1_prediction, 
                              dnn = c('Actual', 'Prediction'))
pl_testing_tree1_tab
      Prediction
Actual Away Home
  Away   43   35
  Home   24   58
pl_testing_tree1_acc <- sum(diag(pl_testing_tree1_tab))/sum(pl_testing_tree1_tab)
pl_testing_tree1_acc       # 63.125%
[1] 0.63125

c. The overall model accuracy for the training dataset is 65%

  • of all the matches where the model predicted the first/named team as the home team, it got 207/(119+207) = 0.635 or 63.5% correct.
  • of all the matches where the model predicted the first/named team as the away team, it got 183/(183+91) = 0.668 or 66.8% correct.
  • of all the matches where the first/named team is the home team, the model correctly identified 207/(91+207) = 0.695 or 69.5%.
  • of all the matches where the first/named team is the away team, the model correctly identified 183/(183+119) = 0.606 or 60.6%.

The overall model accuracy for the testing dataset is 63.125%

  • of all the matches where the model predicted the first/named team as the home team, it got 58/(35+58) = 0.624 or 62.4% correct.
  • of all the matches where the model predicted the first/named team as the away team, it got 43/(43+24) = 0.642 or 64.2% correct.
  • of all the matches where the first/named team is the home team, the model correctly identified 58/(24+58) = 0.707 or 70.7%.
  • of all the matches where the first/named team is the away team, the model correctly identified 43/(43+35) = 0.551 or 55.1%.

The tree model predicts the training dataset well, with an overall accuracy of 65%. Accuracy on the testing dataset drops only slightly, to 63.125%. This suggests the model is not overfitting the data and may generalise well to new data. Pruning could still be used to improve the model slightly; a possible approach is sketched below.
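One possible pruning step, shown only as a sketch: it assumes the common rpart approach of selecting the complexity parameter with the lowest cross-validated error from the model's cptable.

# choose the cp value with the lowest cross-validated error (xerror)
best_cp <- pl_model_tree1$cptable[which.min(pl_model_tree1$cptable[, "xerror"]), "CP"]
# prune the fitted tree back to that cp and re-plot it
pl_model_tree1_pruned <- prune(pl_model_tree1, cp = best_cp)
fancyRpartPlot(pl_model_tree1_pruned)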

3. Create and Assess the Binary Logistic Regression Model

a.

pl1_model_lr <- glm(home_or_away ~ ftg_diff + s_diff + st_diff + f_diff + c_diff + y_diff,
                    data = pl_training, family = binomial(link = 'logit'))
summary(pl1_model_lr)

Call:
glm(formula = home_or_away ~ ftg_diff + s_diff + st_diff + f_diff + 
    c_diff + y_diff, family = binomial(link = "logit"), data = pl_training)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -0.01047    0.08464  -0.124   0.9015  
ftg_diff     0.11191    0.06109   1.832   0.0670 .
s_diff       0.04193    0.01760   2.383   0.0172 *
st_diff     -0.01754    0.04023  -0.436   0.6629  
f_diff      -0.02355    0.01891  -1.245   0.2131  
c_diff       0.02006    0.02526   0.794   0.4272  
y_diff      -0.01615    0.05552  -0.291   0.7712  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 831.75  on 599  degrees of freedom
Residual deviance: 789.43  on 593  degrees of freedom
AIC: 803.43

Number of Fisher Scoring iterations: 4

(ii) \(\ln\Big(\frac{\pi}{1 - \pi}\Big) = -0.0105 + 0.112\,\text{ftg\_diff} + 0.042\,\text{s\_diff} - 0.018\,\text{st\_diff} - 0.024\,\text{f\_diff} + 0.020\,\text{c\_diff} - 0.016\,\text{y\_diff}\), where \(\pi\) is the probability that the first/named team is the home team.
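As a small sketch of how this equation produces a probability, the fitted log-odds for one hypothetical match (invented difference values, chosen only for illustration) can be converted with the inverse logit:

# predict() with type = 'link' returns the log-odds from the equation above;
# plogis() converts log-odds to a probability, i.e. exp(lp) / (1 + exp(lp))
hypothetical_match <- tibble(ftg_diff = 1, s_diff = 2, st_diff = 1,
                             f_diff = 0, c_diff = 1, y_diff = 0)
lp <- predict(pl1_model_lr, newdata = hypothetical_match, type = 'link')
plogis(lp)

The same value is returned directly by predict() with type = 'response', as used later in this section.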

(iii) Importance of the variables

  • \(H_{O}:\) the coefficient for the ftg_diff \(= 0\), i.e. ftg_diff is not important in predicting match outcome.
  • \(H_{a}:\) the coefficient for the ftg_diff \(\neq 0\), i.e. ftg_diff is important in predicting match outcome.

The p-value for ftg_diff is \(0.0670 < 0.1\), so it is statistically significant at the \(\alpha = 0.1\) level (but not at the \(\alpha = 0.05\) level).

  • \(H_{O}:\) the coefficient for the s_diff \(= 0\), i.e. s_diff is not important in predicting match outcome.
  • \(H_{a}:\) the coefficient for the s_diff \(\neq 0\), i.e. s_diff is important in predicting match outcome.

The p-value for s_diff is \(0.0172 < 0.05\), so it is statistically significant at the \(\alpha = 0.05\) level.

  • \(H_{O}:\) the coefficient for the st_diff \(= 0\), i.e. st_diff is not important in predicting match outcome.
  • \(H_{a}:\) the coefficient for the st_diff \(\neq 0\), i.e. st_diff is important in predicting match outcome.

The p-value for st_diff is 0.6629 which is not statistically significant.

  • \(H_{O}:\) the coefficient for the f_diff \(= 0\), i.e. f_diff is not important in predicting match outcome.
  • \(H_{a}:\) the coefficient for the f_diff \(\neq 0\), i.e. f_diff is important in predicting match outcome.

The p-value for f_diff is 0.2131, which is not statistically significant.

  • \(H_{O}:\) the coefficient for the c_diff \(= 0\), i.e. c_diff is not important in predicting match outcome.
  • \(H_{a}:\) the coefficient for the c_diff \(\neq 0\), i.e. c_diff is important in predicting match outcome.

The p-value for c_diff is 0.4272 which is not statistically significant.

  • \(H_{O}:\) the coefficient for the y_diff \(= 0\), i.e. y_diff is not important in predicting match outcome.
  • \(H_{a}:\) the coefficient for the y_diff \(\neq 0\), i.e. y_diff is important in predicting match outcome.

The p-value for y_diff is 0.7712 which is not statistically significant.

ftg_diff and s_diff are the only variables important to this model, statistically significant at the 0.1 and 0.05 levels respectively.

(iv) The coefficients of ftg_diff (0.112) and s_diff (0.042) are positive, meaning as the value of these variables increases so too does the chance of the first/named team being the home team.

The coefficient of ftg_diff \(= 0.11191\) gives us the following: \[e^{b_{i}} = e^{0.11191} = 1.118412 \approx 1.118\] This tells us how much the odds of the first/named team being the home team change for each 1-unit change in the value of ftg_diff.

  • An increase of one unit in the goal difference between the first/named team and their opponent multiplies the odds of the first/named team being the home team by 1.118.
  • An increase of one unit in the goal difference is associated with a \((1.118 - 1) \times 100\% = 11.8\%\) change in the odds of a success, i.e. an 11.8% increase in the odds of the named team being the home team.

The coefficient of s_diff \(= 0.04193\) gives us the following: \[e^{b_{i}} = e^{0.04193} = 1.042821 \approx 1.043\] This tells us how much the odds of the first/named team being the home team change for each 1-unit change in the value of s_diff.

  • An increase of one unit in the shots difference between the first/named team and their opponent multiplies the odds of the first/named team being the home team by 1.043.
  • An increase of one unit in the shots difference is associated with a \((1.043 - 1) \times 100\% = 4.3\%\) change in the odds of a success, i.e. a 4.3% increase in the odds of the named team being the home team.
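The same odds ratios can be obtained for all coefficients at once by exponentiating the model estimates; a quick sketch:

# odds ratios for each predictor: exponentiate the fitted log-odds coefficients
round(exp(coef(pl1_model_lr)), 3)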

pl1_training_lr_pi <- predict(pl1_model_lr, newdata = pl_training, type = 'response')
pl1_training_lr_final <- pl_training %>%
                          mutate(pi = pl1_training_lr_pi) %>%
                          mutate(pl1_training_lr_prediction = case_when(pi > 0.5 ~ 'Home',
                                                                        pi <= 0.5 ~ 'Away'))

pl1_training_lr_tab <- table(pl1_training_lr_final$home_or_away,
                             pl1_training_lr_final$pl1_training_lr_prediction,
                             dnn = c('Actual', 'Prediction'))
pl1_training_lr_tab
      Prediction
Actual Away Home
  Away  185  117
  Home  128  170
pl1_testing_lr_pi <- predict(pl1_model_lr, newdata = pl_testing, type = 'response')
pl1_testing_lr_final <- pl_testing %>%
                         mutate(pi = pl1_testing_lr_pi) %>%
                         mutate(pl1_testing_lr_prediction = case_when(pi > 0.5 ~ 'Home',
                                                                  pi <= 0.5 ~ 'Away'))
pl1_testing_lr_tab <- table(pl1_testing_lr_final$home_or_away,
                             pl1_testing_lr_final$pl1_testing_lr_prediction,
                             dnn = c('Actual', 'Prediction'))
pl1_testing_lr_tab
      Prediction
Actual Away Home
  Away   50   28
  Home   29   53
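The overall accuracies discussed below can be calculated from these confusion matrices in the same way as for the tree model; a short sketch:

# overall accuracy = correct predictions / all predictions
sum(diag(pl1_training_lr_tab)) / sum(pl1_training_lr_tab)   # approx. 0.592
sum(diag(pl1_testing_lr_tab)) / sum(pl1_testing_lr_tab)     # approx. 0.644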

b. The overall model accuracy for the training dataset is 59.2%

  • of all the matches where the model predicted the first/named team as the home team, it got 170/(117+170) = 0.592 or 59.2% correct.
  • of all the matches where the model predicted the first/named team as the away team, it got 185/(185+128) = 0.591 or 59.1% correct.
  • of all the matches where the first/named team is the home team, the model correctly identified 170/(128+170) = 0.570 or 57.0%.
  • of all the matches where the first/named team is the away team, the model correctly identified 185/(185+117) = 0.613 or 61.3%.

The overall model accuracy for the testing dataset is 64.4%

  • of all the matches where the model predicted the first/named team as the home team, it got 53/(53+28) = 0.654 or 65.4% correct.
  • of all the matches where the model predicted the first/named team as the away team, it got 50/(50+29) = 0.633 or 63.3% correct.
  • of all the matches where the first/named team is the home team, the model correctly identified 53/(53+29) = 0.646 or 64.6%.
  • of all the matches where the first/named team is the away team, the model correctly identified 50/(50+28) = 0.641 or 64.1%.

The logistic regression model predicts the training dataset reasonably well, with an overall accuracy of 59.2%. The model's accuracy increases to 64.4% when used on the testing dataset, which is unusual. This suggests the model may generalise well to predicting new data.

4. Comparison of the Models

a. The classification tree and binary logistic regression models produce broadly similar accuracy levels in predicting whether the named team was home or away for both the training (65.0% v 59.2%) and testing datasets (63.125% v 64.4%). However, the logistic regression model's accuracy increases from the training to the testing dataset, and it has the higher accuracy in predicting the testing dataset. I believe it is the slightly more accurate model.

b. Under the classification tree model, c_diff and ftg_diff were found to be the two most important variables in predicting whether the named team was home or away. The logistic regression model found ftg_diff (\(p = 0.0670\)) and s_diff (\(p = 0.0172\)) to be the only two statistically significant variables in predicting whether the named team was the home or away team. So ftg_diff appears to be important in both models.

Question Two: Clustering

1. Import the data

# Import and name the baseball Hall of Fame dataset
baseball_hof <- read_csv("baseball_hof.csv")
view(baseball_hof)

2. Scaling

All data needs to be on a comparable scale for both hierarchical and k-means clustering. The range of the variables differs for this dataset. Scaling will ensure one variable doesn’t dominate the clustering model.
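A quick sketch of how the differing ranges could be checked before scaling:

# compare the raw ranges of the five performance variables
baseball_hof %>%
  select(hits:stolen_bases) %>%
  summary()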

3. Hierarchical Clustering

baseball <- select(baseball_hof, hits:stolen_bases)
baseball_scale <- scale(baseball)
d1 <- dist(baseball_scale)
h1 <- hclust(d1, method = 'ward.D')
plot(h1, hang = -1, cex = 0.45)

heatmap(as.matrix(d1), Rowv = as.dendrogram(h1), Colv = 'Rowv')

c. (i) If there are clusters present in the data, they will present as light-coloured blocks around the diagonal from top left to bottom right. There are a number of clusters visible in this heatmap, the most obvious being in the top left corner. There are two smaller blocks in the middle third of the map, and a larger, less obvious block in the bottom right of the map (which may split further).

clusters1 <- cutree(h1, k = 4)

sill <- silhouette(clusters1, d1)
summary(sill)
Silhouette of 82 units in 4 clusters from silhouette.default(x = clusters1, dist = d1) :
 Cluster sizes and average silhouette widths:
       17        25        30        10 
0.3202221 0.2078566 0.2958375 0.4325219 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1771  0.2100  0.3257  0.2907  0.4224  0.5725 

d. The mean silhouette score of the 4-cluster model is 0.2907 - this is a measure of how similar an object is to the cluster it lies within. This score lies in the range \(0.25 - 0.5\), meaning the structure is weak and may be artificial. The silhouette score of each cluster may also be looked at individually:

  • Cluster 2 has a score of \(0.21 (< 0.25)\), meaning that no substantial structure was found.
  • Clusters 1 and 3 have scores of \(0.32\) and \(0.30\) respectively \((0.25 < x < 0.5)\), indicating what may be an artificial or weak structure.
  • Cluster 4, with a score of \(0.43\), has the strongest structure of the four, but it still falls in the weak range \((0.25 < x < 0.5)\), just short of the \(0.51\) threshold for a reasonable clustering structure.
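These per-cluster widths can also be viewed as a silhouette plot; a small sketch using the silhouette object computed above:

# silhouette plot for the hierarchical 4-cluster solution
plot(sill, main = "Silhouette widths - hierarchical clustering, k = 4")
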
baseball_clus <- cbind(baseball_hof, baseball_scale, clusters1)
colnames(baseball_clus) <- c("playerID", "hits", "runs", "home_runs", "rbi", "stolen_bases",
                             "hits_s", "runs_s", "home_runs_s", "rbi_s", "stolen_bases_s", "clusters1")
baseball_clus <- mutate(baseball_clus, Cluster = case_when(clusters1 == 1 ~ 'C1',
                                                           clusters1 == 2 ~ 'C2',
                                                           clusters1 == 3 ~ 'C3',
                                                           clusters1 == 4 ~ 'C4'))
baseball_clus_means <- baseball_clus %>%
                        group_by(Cluster) %>%
                        summarise(hits = mean(hits_s), 
                                  runs = mean(runs_s),
                                  home_runs = mean(home_runs_s),
                                  rbi = mean(rbi_s), 
                                  stolen_bases = mean(stolen_bases_s))

baseball_clus_tidy <- baseball_clus_means %>%
                        pivot_longer(cols = c(hits, runs, home_runs, rbi, stolen_bases), names_to = 'Event', 
                                     values_to = "Average_Value")
ggplot(baseball_clus_tidy, mapping = aes(x = Event, y = Average_Value, group = Cluster, colour = Cluster)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  theme(axis.text.x = element_text(vjust = 0.7)) +
  ylab("Mean Performance") + 
  scale_x_discrete(labels = c("hits", "runs", "home_runs", "rbi", "stolen_bases")) +
  ggtitle("Mean Performance for each Cluster for each Event
          ")

e.

  • Cluster 1 has the highest mean performance in hits, runs, home runs and rbi, with a performance similar to clusters 3 and 4 for stolen bases.
  • Cluster 2 performs very well in hits and rbi and the best in stolen bases, with poorer relative performance in runs and home runs.
  • Cluster 3 has a middling performance in hits, home runs and rbi, while performing well in runs and averagely in stolen bases.
  • Cluster 4 performs the worst in all variables except runs, where it only slightly outperforms cluster 2.

Clusters 3 and 4 follow a similar trend in performance in all variables but stolen bases.

4. K-means Clustering

set.seed(101)
kmeans2 <- kmeans(baseball_scale, centers = 4)
d2 <- dist(baseball_scale)
sil_kmeans2 <- silhouette(kmeans2$cluster, d2)
summary(sil_kmeans2)
Silhouette of 82 units in 4 clusters from silhouette.default(x = kmeans2$cluster, dist = d2) :
 Cluster sizes and average silhouette widths:
       28        22        19        13 
0.3322757 0.2178155 0.2687416 0.3421893 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.06303  0.18425  0.28984  0.28842  0.41869  0.54977 

b. The mean silhouette score of the 4-cluster model is 0.28842 - this is a measure of how similar an object is to the cluster it lies within. This score lies in the range \(0.25 - 0.5\), meaning the structure is weak and may be artificial. The silhouette score of each cluster may also be looked at individually:

  • Clusters 1, 3 and 4 have scores of \(0.33\), \(0.27\) and \(0.34\) respectively \((0.25 < x < 0.5)\), indicating what may be an artificial or weak structure.
  • Cluster 2 has a score of \(0.22 (< 0.25)\), meaning that no substantial structure was found.
baseball_clus_kmeans_tidy <- as_tibble(kmeans2$centers) %>%
                              mutate(Cluster = c("C1", "C2", "C3", "C4")) %>%
                              pivot_longer(cols = c("hits", "runs", "home_runs", "rbi", "stolen_bases"), 
                                           names_to = "Event", values_to = "Average_Value")

ggplot(baseball_clus_kmeans_tidy, mapping = aes(x = Event, y = Average_Value, group = Cluster, colour = Cluster)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
  ylab("Mean Performance") + 
  scale_x_discrete(labels= c("hits", "runs", "home_runs", "rbi", "stolen_bases")) +
  ggtitle("Mean Performance for each Cluster for each Event")

c.

  • Cluster 1 has a middling performance in hits, home runs and rbi, while performing well in runs and averagely in stolen bases.
  • Cluster 2 performs very well in hits and rbi and the best in stolen bases, with poorer relative performance in runs and home runs.
  • Cluster 3 performs the worst in all variables except runs, where it only slightly outperforms cluster 2.
  • Cluster 4 has the highest mean performance in hits, runs, home runs and rbi, with a performance similar to clusters 1 and 3 for stolen bases.

Clusters 1 and 3 follow a similar trend in performance in all variables but stolen bases.

5. Comparison of the Models

a. The overall silhouette scores of both models, hierarchical (0.2907) and k-means (0.28842), were poor and showed no strong evidence of structure. However, the hierarchical clustering method produced individual clusters with slightly higher quality and structure than those of the k-means model. The hierarchical model produced an individual cluster with a silhouette score of 0.43 (the strongest of any cluster, approaching a reasonable structure), whereas the highest cluster silhouette score of the k-means model was 0.34 (weak/artificial structure). The hierarchical model did, however, also have the lowest individual cluster silhouette score (0.21, versus 0.22 for the k-means model). Overall, the hierarchical model may be of slightly better quality.

b. The clusters of each model show significant similarities and trends - the clusters can nearly be paired off as follows (a cross-tabulation sketch after the list offers one way to check this):

  • hierarchical cluster 1 - kmeans cluster 4
  • hierarchical cluster 2 - kmeans cluster 2
  • hierarchical cluster 3 - kmeans cluster 1
  • hierarchical cluster 4 - kmeans cluster 3
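One way to check this pairing is to cross-tabulate the two sets of cluster labels; a small sketch:

# rows: hierarchical cluster labels; columns: k-means cluster labels
# counts concentrated in one cell per row would support the pairing above
table(Hierarchical = clusters1, KMeans = kmeans2$cluster)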

References

Lago-Peñas, C., & Lago-Ballesteros, J. (2011). Game location and team quality effects on performance profiles in professional soccer. Journal of Sports Science and Medicine, 10(3), 465–471.

Tucker, W. C., Mellalieu, D. S., James, N., & Taylor, B. J. (2005). Game location effects in professional soccer: A case study. International Journal of Performance Analysis in Sport, 5(2), 23–35.