library(tidyverse)
library(rpart)
library(rattle)
library(TTR) #Contains the runMean function.
pl_1 <- read_csv("pl_testing.csv")
pl_testing <- read_csv ("pl_training.csv")Classification and Clustering Assignment
Question 1
The first thing I needed to do was to set my working directory to the place where all of my datasets were for this assignment.
- Import the pl_training.csv and pl_testing.csv datasets into R.
2. Classification Tree Method:
a. Create and visualise a classification tree model that will allow you to classify a team as either the home or the away team.
pl_model_tree <- rpart(home_or_away ~ ftg_diff + htg_diff + s_diff + st_diff + f_diff + c_diff + y_diff + r_diff + wdl_ft + wdl_ht, data = pl_1, method = 'class')
fancyRpartPlot(pl_model_tree) pl_model_tree$variable.importance ftg_diff s_diff st_diff htg_diff c_diff wdl_ht f_diff
11.4901855 9.1842740 9.1117851 8.7712959 8.2354829 4.9026088 4.3887718
wdl_ft y_diff r_diff
1.0000000 0.4112758 0.2871094
b. Interpret the classification tree:
So just by glancing at the classification tree, the first predictor variable that the tree splits on is ftg_difference, so obviously the difference of the goals scored is an important factor in determining whether a team is home or away. The tree is showing that if the full time goal difference is greater than -0.5 the team is the home team, and the leaf is showing that of all the games that fall into that category, 66% of the time that team is the home team. The nodes seemed to be a bit confusing with so many variables, so after using the $variable.importance code, we found that the most important predictors of home or away teams were from ftg_diff (which we had already established), s_diff, st_diff, htg_diff, and c_diff . From obtaining this information, a new classification tree was then computed to see the results:
pl_model_tree_new <- rpart(home_or_away ~ ftg_diff + htg_diff + s_diff + st_diff + f_diff + c_diff , data = pl_1, method = 'class')
fancyRpartPlot(pl_model_tree_new)This tree produced literally the exact same classification tree.
i. Clearly state one rule for predicting if a team is the home team. Your answer should also address how pure the node is.
A clear rule for a home team is that ftg_diff is not greater than -0.5 goals AND c_diff is not greater than 3.5, the team is HOME, and of the games where this is true, the team is true 88% of the time, making this the purest node on the tree. So in other words, teams that lose by less than 0.5 goals and have a differnce in corners of less than 3.5 %88 the home team.
###ii. Clearly state one rule for predicting if a team is the away team. Your answer should also address how pure the node is.
A clear rule for predicting if a team is the away team is IF the ftg_diff is greater than -0.5 AND f_diff is greater than 0.5 fouls, that team is most likely the AWAY team. If teams are on this leaf, there is an 88% chance that it is the away team and only a 12% chance of being the home team, making this node very pure.
iii. Which variables are the most important for predicting if a team is home or away? Refer to the variable importance values in your answer.
summary(pl_model_tree_new)Call:
rpart(formula = home_or_away ~ ftg_diff + htg_diff + s_diff +
st_diff + f_diff + c_diff, data = pl_1, method = "class")
n= 160
CP nsplit rel error xerror xstd
1 0.33333333 0 1.0000000 1.1538462 0.08044808
2 0.03418803 1 0.6666667 0.8461538 0.07983279
3 0.02564103 4 0.5641026 0.8974359 0.08044808
4 0.01282051 6 0.5128205 0.9102564 0.08056930
5 0.01000000 8 0.4871795 0.9102564 0.08056930
Variable importance
ftg_diff s_diff st_diff htg_diff c_diff f_diff
22 17 17 17 16 11
Node number 1: 160 observations, complexity param=0.3333333
predicted class=Home expected loss=0.4875 P(node) =1
class counts: 78 82
probabilities: 0.487 0.513
left son=2 (64 obs) right son=3 (96 obs)
Primary splits:
ftg_diff < -0.5 to the left, improve=9.918750, (0 missing)
c_diff < -2.5 to the left, improve=9.118091, (0 missing)
s_diff < 10.5 to the left, improve=7.601282, (0 missing)
st_diff < 0.5 to the left, improve=6.120426, (0 missing)
f_diff < 1.5 to the right, improve=5.464164, (0 missing)
Surrogate splits:
htg_diff < -0.5 to the left, agree=0.794, adj=0.484, (0 split)
st_diff < -0.5 to the left, agree=0.794, adj=0.484, (0 split)
s_diff < -3.5 to the left, agree=0.750, adj=0.375, (0 split)
c_diff < -0.5 to the left, agree=0.669, adj=0.172, (0 split)
f_diff < 5.5 to the right, agree=0.631, adj=0.078, (0 split)
Node number 2: 64 observations, complexity param=0.01282051
predicted class=Away expected loss=0.296875 P(node) =0.4
class counts: 45 19
probabilities: 0.703 0.297
left son=4 (42 obs) right son=5 (22 obs)
Primary splits:
f_diff < -0.5 to the right, improve=2.766369, (0 missing)
c_diff < -2.5 to the left, improve=1.914828, (0 missing)
ftg_diff < -3.5 to the left, improve=1.611607, (0 missing)
htg_diff < -2.5 to the left, improve=1.385417, (0 missing)
s_diff < 1.5 to the right, improve=1.126983, (0 missing)
Surrogate splits:
c_diff < 5 to the left, agree=0.688, adj=0.091, (0 split)
Node number 3: 96 observations, complexity param=0.03418803
predicted class=Home expected loss=0.34375 P(node) =0.6
class counts: 33 63
probabilities: 0.344 0.656
left son=6 (64 obs) right son=7 (32 obs)
Primary splits:
c_diff < 3.5 to the left, improve=4.593750, (0 missing)
s_diff < 10.5 to the left, improve=3.520426, (0 missing)
f_diff < 2.5 to the right, improve=2.501383, (0 missing)
htg_diff < 0.5 to the right, improve=1.463804, (0 missing)
st_diff < 0.5 to the left, improve=1.425221, (0 missing)
Surrogate splits:
s_diff < 7.5 to the left, agree=0.792, adj=0.375, (0 split)
st_diff < 4.5 to the left, agree=0.708, adj=0.125, (0 split)
Node number 4: 42 observations
predicted class=Away expected loss=0.1904762 P(node) =0.2625
class counts: 34 8
probabilities: 0.810 0.190
Node number 5: 22 observations, complexity param=0.01282051
predicted class=Away expected loss=0.5 P(node) =0.1375
class counts: 11 11
probabilities: 0.500 0.500
left son=10 (14 obs) right son=11 (8 obs)
Primary splits:
s_diff < -6 to the right, improve=0.3928571, (0 missing)
st_diff < -3.5 to the right, improve=0.3928571, (0 missing)
f_diff < -3.5 to the left, improve=0.3666667, (0 missing)
ftg_diff < -1.5 to the left, improve=0.1047619, (0 missing)
c_diff < 0.5 to the left, improve=0.1047619, (0 missing)
Surrogate splits:
st_diff < -5 to the right, agree=0.864, adj=0.625, (0 split)
htg_diff < -0.5 to the left, agree=0.727, adj=0.250, (0 split)
Node number 6: 64 observations, complexity param=0.03418803
predicted class=Home expected loss=0.453125 P(node) =0.4
class counts: 29 35
probabilities: 0.453 0.547
left son=12 (24 obs) right son=13 (40 obs)
Primary splits:
f_diff < 2.5 to the right, improve=1.3020830, (0 missing)
c_diff < -2.5 to the left, improve=1.2728630, (0 missing)
s_diff < 1.5 to the left, improve=1.1615650, (0 missing)
htg_diff < 0.5 to the right, improve=1.0911840, (0 missing)
st_diff < 0.5 to the left, improve=0.3669551, (0 missing)
Surrogate splits:
s_diff < 4.5 to the right, agree=0.656, adj=0.083, (0 split)
st_diff < 4.5 to the right, agree=0.656, adj=0.083, (0 split)
Node number 7: 32 observations
predicted class=Home expected loss=0.125 P(node) =0.2
class counts: 4 28
probabilities: 0.125 0.875
Node number 10: 14 observations
predicted class=Away expected loss=0.4285714 P(node) =0.0875
class counts: 8 6
probabilities: 0.571 0.429
Node number 11: 8 observations
predicted class=Home expected loss=0.375 P(node) =0.05
class counts: 3 5
probabilities: 0.375 0.625
Node number 12: 24 observations, complexity param=0.03418803
predicted class=Away expected loss=0.4166667 P(node) =0.15
class counts: 14 10
probabilities: 0.583 0.417
left son=24 (12 obs) right son=25 (12 obs)
Primary splits:
htg_diff < 0.5 to the right, improve=3.0000000, (0 missing)
ftg_diff < 1.5 to the right, improve=2.0416670, (0 missing)
f_diff < 3.5 to the left, improve=2.0416670, (0 missing)
st_diff < 2.5 to the right, improve=0.8414918, (0 missing)
c_diff < -1.5 to the left, improve=0.6666667, (0 missing)
Surrogate splits:
ftg_diff < 0.5 to the right, agree=0.667, adj=0.333, (0 split)
c_diff < -1.5 to the left, agree=0.667, adj=0.333, (0 split)
s_diff < -6 to the left, agree=0.625, adj=0.250, (0 split)
st_diff < -0.5 to the right, agree=0.625, adj=0.250, (0 split)
f_diff < 3.5 to the left, agree=0.583, adj=0.167, (0 split)
Node number 13: 40 observations, complexity param=0.02564103
predicted class=Home expected loss=0.375 P(node) =0.25
class counts: 15 25
probabilities: 0.375 0.625
left son=26 (23 obs) right son=27 (17 obs)
Primary splits:
s_diff < 1.5 to the left, improve=2.330563, (0 missing)
st_diff < 0.5 to the left, improve=1.657268, (0 missing)
htg_diff < 1.5 to the left, improve=1.250000, (0 missing)
ftg_diff < 0.5 to the left, improve=1.203333, (0 missing)
c_diff < -2.5 to the left, improve=1.203333, (0 missing)
Surrogate splits:
st_diff < 2.5 to the left, agree=0.850, adj=0.647, (0 split)
c_diff < 0.5 to the left, agree=0.700, adj=0.294, (0 split)
htg_diff < 1.5 to the left, agree=0.675, adj=0.235, (0 split)
ftg_diff < 0.5 to the left, agree=0.650, adj=0.176, (0 split)
f_diff < -6.5 to the right, agree=0.600, adj=0.059, (0 split)
Node number 24: 12 observations
predicted class=Away expected loss=0.1666667 P(node) =0.075
class counts: 10 2
probabilities: 0.833 0.167
Node number 25: 12 observations
predicted class=Home expected loss=0.3333333 P(node) =0.075
class counts: 4 8
probabilities: 0.333 0.667
Node number 26: 23 observations, complexity param=0.02564103
predicted class=Away expected loss=0.4782609 P(node) =0.14375
class counts: 12 11
probabilities: 0.522 0.478
left son=52 (16 obs) right son=53 (7 obs)
Primary splits:
st_diff < 0.5 to the left, improve=1.1211180, (0 missing)
c_diff < -2.5 to the left, improve=1.0540180, (0 missing)
s_diff < -7.5 to the right, improve=0.5282609, (0 missing)
f_diff < -3.5 to the left, improve=0.2615942, (0 missing)
ftg_diff < 0.5 to the left, improve=0.1903821, (0 missing)
Surrogate splits:
htg_diff < 1.5 to the left, agree=0.783, adj=0.286, (0 split)
f_diff < -5.5 to the right, agree=0.783, adj=0.286, (0 split)
ftg_diff < 2.5 to the left, agree=0.739, adj=0.143, (0 split)
s_diff < -0.5 to the left, agree=0.739, adj=0.143, (0 split)
Node number 27: 17 observations
predicted class=Home expected loss=0.1764706 P(node) =0.10625
class counts: 3 14
probabilities: 0.176 0.824
Node number 52: 16 observations
predicted class=Away expected loss=0.375 P(node) =0.1
class counts: 10 6
probabilities: 0.625 0.375
Node number 53: 7 observations
predicted class=Home expected loss=0.2857143 P(node) =0.04375
class counts: 2 5
probabilities: 0.286 0.714
Using the summary() code, it tells us that the most important variables in being able to predict home or away teams are, in order:
- ftg_diff (26%)
- s_diff (17%)
- st_diff (17%)
- htg_diff (17%)
- c_diff (16%)
- f_diff (11%)
The percentages here represent the percentage improvements to the model that can be attributed to that particular variable.
c. Assess the accuracy of the classification tree using both the training and the testing datasets.
In order to assess the accuracy of the classification tree, we used the predict function in r studio to eventually create a cross-tabulation of what the model predicts and what actually happened.
First we tested the training dataset (pl_1)
pl_model_tree_prob <- predict(pl_model_tree, newdata = pl_1, type = "prob")
pl_model_tree_prediction <- predict(pl_model_tree, newdata =pl_1,
type = 'class')
pl_model_tree_final <- cbind(pl_1, pl_model_tree_prob,
pl_model_tree_prediction)
pl_model_tree_tab <- table(pl_model_tree_final$home_or_away,
pl_model_tree_final$pl_model_tree_prediction,
dnn = c('Actual', 'Predicted'))
pl_model_tree_tab Predicted
Actual Away Home
Away 62 16
Home 22 60
From the above confusion matrix, we know the following: • The overall model accuracy is (62 + 60)/160 = 0.76 or 76%. • Of all matches the model predicted the team was away, they got 62/84 = 0.74 or 74% correct. • Of all matches the model predicted the team be home, they got 60/76 = .79 or 79% correct. • Of all matches where the predicted away team was actually away, the model correctly identified 62/86 = .72 or 72%. • Of all matches where the predicted home team actually was home, the model correctly identified 60/82 = .73 or 73%.
Next we had to test the testing dataset (pl_testing)
pl_model_tree_prob_testing <- predict (pl_model_tree, newdata = pl_testing , type = "prob")
pl_model_tree_prediction_testing <- predict(pl_model_tree, newdata = pl_testing,
type = 'class')
pl_model_tree_final_testing <- cbind(pl_testing, pl_model_tree_prob_testing,
pl_model_tree_prediction_testing)
pl_model_tree_tab_testing <- table(pl_model_tree_final_testing$home_or_away,
pl_model_tree_final_testing$pl_model_tree_prediction_testing,
dnn = c('Actual', 'Predicted'))
pl_model_tree_tab_testing Predicted
Actual Away Home
Away 175 127
Home 127 171
From the above confusion matrix, we know the following: • The overall model accuracy is (175 + 171)/600 = 0.58 or 58%. • Of all matches the model predicted the team was away, they got 175/302 = 0.58 or 58% correct. • Of all matches the model predicted the team be home, they got 171/298 = .57 or 57% correct. • Of all matches where the predicted away team was actually away, the model correctly identified 175/86 = .58 or 58%. • Of all matches where the predicted home team actually was home, the model correctly identified 171/298 = .57 or 57%.
i. Based on your findings, do you think the classification tree is overfitting the training dataset? Explain your answer. Note that if you detect overfitting, then for the purpose of this assignment you do not need to create a pruned version of the tree.
Our model predicts that 76% of our overall training data correctly, but only 58% of the overall data for our testing dataset, which means that our classification tree is a pretty good model for the training data but does not predict new data too well. Our findings would suggest overfitting with the training data, and if we were to continue with this particular classification tree, pruning may be the next step.
3. Binary Logistic Regression Method:
a. Create a binary logistic regression model that will allow you to classify a team as either the home or away team.
So, just like with the classification tree, we are going to use all of the variables to see if they have any affect on knowing whether a team is home or away. We will be using our same training dataset (pl_1) and our test dataset will be pl_testing
pl_1$home_or_away <- factor(pl_1$home_or_away, levels = c("Away", "Home"))
pl_testing$home_or_away <- factor(pl_testing$home_or_away, levels = c("Away", "Home"))
levels(pl_1$home_or_away)[1] "Away" "Home"
levels(pl_testing$home_or_away) #To check the levels are sorted correctly[1] "Away" "Home"
i. Make sure to correctly prepare the response/target variable, i.e. convert to a factor with levels ordered appropriately.
So now our target variable (home or away) is set and ordered as the failure (away) as the first option and success (home) being the second option.
#Binary Logistic Regression Model
pl_model <- glm(home_or_away ~ ftg_diff + htg_diff + s_diff + st_diff + f_diff + c_diff + y_diff + r_diff + wdl_ft + wdl_ht, data = pl_1, family = binomial(link = "logit"))
summary(pl_model)
Call:
glm(formula = home_or_away ~ ftg_diff + htg_diff + s_diff + st_diff +
f_diff + c_diff + y_diff + r_diff + wdl_ft + wdl_ht, family = binomial(link = "logit"),
data = pl_1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.73664 0.43997 1.674 0.09407 .
ftg_diff -0.14028 0.23626 -0.594 0.55269
htg_diff 1.07108 0.49514 2.163 0.03053 *
s_diff 0.05090 0.04373 1.164 0.24441
st_diff -0.05168 0.09383 -0.551 0.58174
f_diff -0.08778 0.04638 -1.893 0.05842 .
c_diff 0.04308 0.05356 0.804 0.42122
y_diff 0.04629 0.12855 0.360 0.71879
r_diff 0.92960 0.61651 1.508 0.13159
wdl_ftLose -1.15189 0.64644 -1.782 0.07477 .
wdl_ftWin 0.47384 0.64091 0.739 0.45971
wdl_htLose 0.99180 0.76045 1.304 0.19216
wdl_htWin -2.09852 0.73795 -2.844 0.00446 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 221.71 on 159 degrees of freedom
Residual deviance: 178.30 on 147 degrees of freedom
AIC: 204.3
Number of Fisher Scoring iterations: 4
So the problem we encountered is that an error kept coming up each time we tried to run the binary regression model, and the error was that one of the variables was coming up as not being able to be read. The reason for this issue (after LONG deliberations) was the realisation that values for the variable home_or_away were written as Home and Away rather than home and away, which meant that our code was completely wrong. Finally we corrected the problem and were able to get to a regression equation!
ii. Clearly state the regression equation.
y=ln(π/(1-π))= 0.73664 -0.14028* ftg_diff +1.07108* htg_diff +0.05090* s_diff-0.05168* st_diff-0.08778* f_diff+0.04308* c_diff+0.04629* y_diff+0.92960* r_diff-1.15189* wdl_ftLose +0.47384* wdl_ftWin +0.99180* wdl_htLose -2.09852* wdl_htWin
###iii. Which predictor variables are important in classifying a team as being the home or away team? Include p-values and refer to level of significance in your answer.
The predictor variables that are important in classifying a team as being home or away are:
htg_diff, which had a p value of 0.03053 and wdl_htWin, which had a p value of 0.00446, which means that we would reject H0 for each of these variables (which would state that both of them ARE NOT important indicators in predicting whether a team is home or away).
So, in other words, half time goal difference (htg_diff) is sigificant to the .05 significance level and whether a team was winning, losing, or drawing at half time (wdl_htwin) has a .01 significance level of predicting whether a team is home or away.
###iv. For each significant predictor variable identified in the previous question, calculate and clearly state the impact that variable has on the odds of a team being classifed as the home team.
So for htg_diff, which has a coefficient of 1.07108.
Which means that e^(b_i ) = e^ 1.07108 = 2.919, meaning that an increase of 1 goal difference between two teams at halftime multiplies the odds of the the team with the positive goal difference being the home team almost three times (2.919).
Equally for wdl_htwin, there is a negative coefficient of -2.09852, which mean that e^(b_i ) = e^-2.09852 = 0.123
So I’m not entirely sure how to interpret this considering that wdl_htwin is a categorical variable, but I’m assuming that if team increases its chances to be winning a game at halftime then it is associated with a 0.123 – 1 X 100% = -.877 X 100% = -87.7% change in the odds of a success, which is a 88% reduction in the odds of the that team being the home team (I am not entirely sure of this explanation with a categorical variable). In simple terms, I think it means that the more the categorical variable changes (from winning to draw or to losing), the less chance there is for the team to be the home team.
b. Fully assess the accuracy of the logistic regression model using both the training and the testing datasets.
#Training dataset
pl_model_pi <- predict(pl_model, newdata = pl_1, type = 'response')
pl_model_pi 1 2 3 4 5 6 7
0.22792730 0.50609215 0.33998406 0.34817463 0.57463749 0.70927573 0.29282301
8 9 10 11 12 13 14
0.34566808 0.40044453 0.18073676 0.06481144 0.42552349 0.49374639 0.51589769
15 16 17 18 19 20 21
0.61764525 0.32100911 0.36834663 0.76183759 0.79941221 0.51513127 0.65703755
22 23 24 25 26 27 28
0.58709373 0.75500071 0.79143377 0.92241004 0.54045221 0.84525716 0.54330179
29 30 31 32 33 34 35
0.80793468 0.55379132 0.32372405 0.71869321 0.09951637 0.60518481 0.91819281
36 37 38 39 40 41 42
0.68724594 0.68970138 0.24920305 0.41530812 0.45020364 0.61345633 0.38822793
43 44 45 46 47 48 49
0.77092715 0.79611653 0.56716165 0.86700040 0.87090023 0.82932191 0.76495446
50 51 52 53 54 55 56
0.73260941 0.71374359 0.79858149 0.19149135 0.89663213 0.74517103 0.78911180
57 58 59 60 61 62 63
0.58977055 0.47767381 0.60436328 0.96547865 0.86632998 0.59674534 0.79861587
64 65 66 67 68 69 70
0.64133098 0.62580656 0.77753465 0.92996688 0.58034164 0.83490145 0.79836040
71 72 73 74 75 76 77
0.70741001 0.53203601 0.82729040 0.74036647 0.71945866 0.86413460 0.95672792
78 79 80 81 82 83 84
0.81189247 0.79908438 0.84832836 0.83678635 0.84667690 0.24106758 0.85429740
85 86 87 88 89 90 91
0.34531989 0.35153756 0.60782369 0.73799398 0.49716232 0.47230855 0.31715717
92 93 94 95 96 97 98
0.17383768 0.60769853 0.86080909 0.33252157 0.77293889 0.40804976 0.68127396
99 100 101 102 103 104 105
0.42958161 0.15705333 0.55092248 0.70097084 0.38104100 0.40355738 0.40717370
106 107 108 109 110 111 112
0.39593812 0.36517765 0.68812699 0.50764301 0.47253496 0.21991663 0.54259179
113 114 115 116 117 118 119
0.60313282 0.23570793 0.26369853 0.53575816 0.06048600 0.62041673 0.27351594
120 121 122 123 124 125 126
0.52511013 0.09550808 0.72217085 0.52933412 0.25360825 0.36641385 0.23892874
127 128 129 130 131 132 133
0.28685182 0.40496887 0.74542183 0.62330329 0.27380849 0.07785593 0.29350723
134 135 136 137 138 139 140
0.73181300 0.22566740 0.45762103 0.67189165 0.15007791 0.64304512 0.07808630
141 142 143 144 145 146 147
0.24371706 0.20302604 0.31069930 0.35762761 0.15588689 0.19496836 0.55525549
148 149 150 151 152 153 154
0.03226201 0.18222104 0.01803438 0.54011515 0.39178630 0.13261484 0.17581713
155 156 157 158 159 160
0.15996976 0.33914188 0.29809164 0.12499032 0.11709195 0.11231466
pl_model_regression_final <- pl_1 %>%
mutate(pi = pl_model_pi) %>%
mutate(pl_model_regression_prediction = case_when(pi > 0.5 ~ 'Home',
pi <= 0.5 ~ 'Away'))
pl_model_regression_final# A tibble: 160 × 14
team ftg_diff htg_diff s_diff st_diff f_diff c_diff y_diff r_diff wdl_ft
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Burnley -2 -2 -4 -6 1 0 1 0 Lose
2 Leicester 0 0 -5 -3 4 -5 -1 0 Draw
3 Burnley -1 0 -8 -2 -1 -4 0 0 Lose
4 Sunderla… -3 -1 -18 -6 -4 -8 1 0 Lose
5 West Brom -2 -1 0 -4 -4 -1 0 0 Lose
6 Middlesb… 0 0 -1 -3 1 0 3 0 Draw
7 Stoke 0 1 -19 -7 -1 -6 4 0 Draw
8 Leicester -3 -2 -4 -8 -1 7 -2 0 Lose
9 Stoke -1 0 -15 -6 5 -6 2 1 Lose
10 Swansea -2 1 -14 -7 -1 -2 0 0 Lose
# ℹ 150 more rows
# ℹ 4 more variables: wdl_ht <chr>, home_or_away <fct>, pi <dbl>,
# pl_model_regression_prediction <chr>
pl_model_regression_tab <- table(pl_model_regression_final$home_or_away, pl_model_regression_final$pl_model_regression_prediction, dnn=c('Actual', 'Predicted'))
pl_model_regression_tab Predicted
Actual Away Home
Away 53 25
Home 20 62
pl_model_regression_acc <- sum(diag(pl_model_regression_tab))/sum(pl_model_regression_tab)
pl_model_regression_acc[1] 0.71875
From the above confusion matrix, we know:
**The overall model accuracy is (53+ 62)/160 = 0.72 or 72%.**
Of all matches the model predicted the superior team would lose, they got 53/73 = 0.73 or 73% correct.
Of all matches the model predicted the superior team would win, they got 62/87 or 71% correct.
Of all matches where the superior team lost, the model correctly identified 53/78 or 68%.
Of all matches where the superior team won, the model correctly identified 62/82 or 76%.
So 72% is not incredibly high, but remember, our classification tree overall model accurracy was 76%. Now we have to assess our regression equation with unseen data (pl_testing)
#Testing dataset
pl_model_testing_pi <- predict(pl_model, newdata = pl_testing, type = 'response')
pl_model_testing_pi 1 2 3 4 5 6 7
0.83430022 0.54759585 0.85685970 0.61794364 0.69748402 0.68105293 0.72313293
8 9 10 11 12 13 14
0.11103605 0.58131785 0.14216648 0.34197717 0.29977092 0.46280943 0.79973801
15 16 17 18 19 20 21
0.56136039 0.11942824 0.33033781 0.36046833 0.67547674 0.15832353 0.59292954
22 23 24 25 26 27 28
0.83437781 0.53615890 0.38674038 0.58832859 0.28031186 0.44020388 0.34142836
29 30 31 32 33 34 35
0.20548973 0.17152690 0.62928029 0.18136046 0.71193101 0.80715924 0.48314900
36 37 38 39 40 41 42
0.66674929 0.18676370 0.42286764 0.59533495 0.77658353 0.70764191 0.75402463
43 44 45 46 47 48 49
0.26999656 0.69520038 0.67066470 0.53594389 0.36331923 0.40902821 0.84139526
50 51 52 53 54 55 56
0.45001487 0.34240311 0.72899540 0.04310200 0.74933105 0.38788327 0.64248887
57 58 59 60 61 62 63
0.47272585 0.37173379 0.61191698 0.23226328 0.37877244 0.73086705 0.40950600
64 65 66 67 68 69 70
0.69786431 0.19551913 0.62509434 0.77681122 0.55118713 0.54839123 0.24015353
71 72 73 74 75 76 77
0.60983900 0.38987884 0.10588165 0.23256547 0.39952235 0.34269822 0.86108718
78 79 80 81 82 83 84
0.77323763 0.42416351 0.59515376 0.65317935 0.72708453 0.37613167 0.52525523
85 86 87 88 89 90 91
0.84543085 0.59366010 0.17704892 0.71213082 0.24017231 0.50889928 0.90014179
92 93 94 95 96 97 98
0.39766832 0.20634694 0.47610661 0.49389655 0.69119914 0.65052780 0.65753937
99 100 101 102 103 104 105
0.34480329 0.55402439 0.46632478 0.71108588 0.87016472 0.76600660 0.67748186
106 107 108 109 110 111 112
0.83263961 0.85744536 0.53808811 0.52770380 0.03466494 0.46464427 0.68900951
113 114 115 116 117 118 119
0.37131189 0.50746487 0.69346035 0.32653044 0.55065961 0.46311501 0.39569555
120 121 122 123 124 125 126
0.87859606 0.56007621 0.65484326 0.75538041 0.89782129 0.67052111 0.60944321
127 128 129 130 131 132 133
0.76340343 0.53857792 0.32343206 0.15580991 0.67136895 0.77287355 0.85318533
134 135 136 137 138 139 140
0.55859708 0.83972761 0.32713548 0.60721251 0.44979004 0.19346947 0.72205186
141 142 143 144 145 146 147
0.82856157 0.26336374 0.52707403 0.60715725 0.38171602 0.74168588 0.70367643
148 149 150 151 152 153 154
0.59913651 0.37748867 0.25261510 0.07113325 0.67158075 0.46662105 0.39027319
155 156 157 158 159 160 161
0.69101255 0.75629854 0.86641707 0.45001297 0.91919511 0.21830828 0.81142756
162 163 164 165 166 167 168
0.72749850 0.70710549 0.46689420 0.74334093 0.66046492 0.66701352 0.43851645
169 170 171 172 173 174 175
0.59389747 0.87398614 0.22380350 0.62668696 0.64344464 0.74363623 0.87008580
176 177 178 179 180 181 182
0.56461638 0.67526969 0.41795680 0.39437439 0.70355638 0.76057574 0.81104676
183 184 185 186 187 188 189
0.45666666 0.55876285 0.73177852 0.49942627 0.87585988 0.64548093 0.41786150
190 191 192 193 194 195 196
0.56860656 0.20007395 0.82654382 0.46579576 0.85453455 0.41436078 0.11010509
197 198 199 200 201 202 203
0.87218334 0.50905172 0.59896331 0.77621775 0.78567396 0.85056385 0.80548890
204 205 206 207 208 209 210
0.66124795 0.78019282 0.28541920 0.58067060 0.84206108 0.62497829 0.34586195
211 212 213 214 215 216 217
0.71533733 0.70034885 0.55402842 0.46466741 0.84800192 0.82506127 0.68060033
218 219 220 221 222 223 224
0.35399649 0.37270186 0.80573008 0.28903089 0.89633476 0.50483344 0.62751852
225 226 227 228 229 230 231
0.41320273 0.70887628 0.24918702 0.51586463 0.76309451 0.67752559 0.77998655
232 233 234 235 236 237 238
0.27755008 0.54390826 0.53445963 0.51063228 0.69443110 0.80037747 0.89685156
239 240 241 242 243 244 245
0.58031387 0.47162086 0.61901326 0.68596895 0.52057691 0.56812027 0.62427295
246 247 248 249 250 251 252
0.67377533 0.47241243 0.64756366 0.59033910 0.23846181 0.55108216 0.89598717
253 254 255 256 257 258 259
0.75844483 0.75148618 0.53847879 0.32680674 0.67106972 0.83948260 0.64582635
260 261 262 263 264 265 266
0.40317352 0.94482345 0.58580650 0.86305912 0.56219511 0.65839607 0.80817772
267 268 269 270 271 272 273
0.17855678 0.68172653 0.70548047 0.70700915 0.73891332 0.92450379 0.80265560
274 275 276 277 278 279 280
0.90170368 0.95646092 0.90859549 0.73700802 0.91760107 0.70568952 0.07765133
281 282 283 284 285 286 287
0.97553605 0.65349265 0.85881198 0.77442325 0.66612976 0.92083062 0.65055781
288 289 290 291 292 293 294
0.43282443 0.41328485 0.63295434 0.62879268 0.87790785 0.88911591 0.90208625
295 296 297 298 299 300 301
0.93046318 0.46221048 0.89522608 0.85268576 0.30551276 0.78284505 0.27008026
302 303 304 305 306 307 308
0.71270583 0.80983146 0.31167095 0.81131826 0.67143232 0.62556327 0.57824733
309 310 311 312 313 314 315
0.81546530 0.58492047 0.83802724 0.45947599 0.35676697 0.64139320 0.77700101
316 317 318 319 320 321 322
0.77322298 0.84374564 0.58094304 0.59752610 0.56508908 0.67704478 0.88465978
323 324 325 326 327 328 329
0.60327845 0.46413871 0.76831956 0.76850236 0.38784531 0.77838304 0.85045176
330 331 332 333 334 335 336
0.73565326 0.73901204 0.91451694 0.56613594 0.90907485 0.63841796 0.34605564
337 338 339 340 341 342 343
0.82356860 0.26795775 0.86268113 0.75142857 0.33235467 0.91355034 0.55661088
344 345 346 347 348 349 350
0.37346111 0.58736708 0.66444141 0.65672640 0.41468497 0.38805059 0.79514466
351 352 353 354 355 356 357
0.51411981 0.29454465 0.80966531 0.45157771 0.94205872 0.42560290 0.53611913
358 359 360 361 362 363 364
0.42886727 0.67516095 0.28952895 0.57826816 0.44960052 0.78918495 0.70766811
365 366 367 368 369 370 371
0.54569017 0.61639134 0.51362613 0.38447740 0.90112365 0.30518700 0.64331125
372 373 374 375 376 377 378
0.37621087 0.69853852 0.58627755 0.71215597 0.77609534 0.70731756 0.76900283
379 380 381 382 383 384 385
0.57700843 0.58414270 0.26325558 0.29731386 0.49855371 0.69851602 0.45396892
386 387 388 389 390 391 392
0.54847481 0.39828961 0.20872252 0.35723365 0.33389809 0.63819298 0.69851685
393 394 395 396 397 398 399
0.19725061 0.52590181 0.60903784 0.41818468 0.73799966 0.44624766 0.36857009
400 401 402 403 404 405 406
0.59651783 0.66095521 0.27611046 0.58187410 0.28850770 0.77839555 0.45596816
407 408 409 410 411 412 413
0.50915886 0.36956069 0.39433392 0.37110224 0.30805718 0.42044501 0.65533764
414 415 416 417 418 419 420
0.60473164 0.95325862 0.71847159 0.24843385 0.78948541 0.36090764 0.68251887
421 422 423 424 425 426 427
0.49472183 0.60167033 0.86888318 0.37406287 0.83494591 0.52795660 0.23433941
428 429 430 431 432 433 434
0.43196574 0.32330905 0.19169915 0.20132812 0.16481786 0.52116000 0.31941484
435 436 437 438 439 440 441
0.50198959 0.40704454 0.38553763 0.66252383 0.60505047 0.79870996 0.26388498
442 443 444 445 446 447 448
0.39427557 0.19889039 0.77518344 0.12263693 0.60101108 0.73840117 0.75327056
449 450 451 452 453 454 455
0.22989484 0.67196109 0.79655218 0.73844592 0.73009043 0.59709248 0.54704329
456 457 458 459 460 461 462
0.68421777 0.31575398 0.90533155 0.71686184 0.53361773 0.49759406 0.31736010
463 464 465 466 467 468 469
0.10145752 0.39691876 0.47231046 0.88803111 0.50349303 0.26980047 0.64380507
470 471 472 473 474 475 476
0.71663904 0.43335576 0.48392935 0.60231790 0.71751074 0.35852543 0.28867327
477 478 479 480 481 482 483
0.60068753 0.17724059 0.63072178 0.26045461 0.50491902 0.23581165 0.18734901
484 485 486 487 488 489 490
0.24718017 0.31311516 0.14575187 0.83849256 0.61529390 0.42331124 0.75524636
491 492 493 494 495 496 497
0.35717237 0.21091944 0.23240760 0.36654353 0.50861839 0.85547176 0.09692285
498 499 500 501 502 503 504
0.41394010 0.32901635 0.38971125 0.16651457 0.11399983 0.15027445 0.15591503
505 506 507 508 509 510 511
0.75562824 0.53155128 0.28900354 0.64708426 0.34592121 0.46401951 0.43099624
512 513 514 515 516 517 518
0.37183647 0.48657171 0.37087832 0.28418802 0.23425122 0.57200299 0.55209853
519 520 521 522 523 524 525
0.41804008 0.30299511 0.50981153 0.23122225 0.75217913 0.61204727 0.81298361
526 527 528 529 530 531 532
0.67518999 0.40745722 0.25847432 0.55173845 0.85218980 0.74069927 0.02551766
533 534 535 536 537 538 539
0.54747535 0.38946841 0.10152612 0.67976628 0.26461984 0.34625273 0.45069792
540 541 542 543 544 545 546
0.25108567 0.57057749 0.33105616 0.30593039 0.67873726 0.82973463 0.28499111
547 548 549 550 551 552 553
0.33696309 0.70049188 0.37366308 0.07835562 0.58154355 0.65498251 0.60136924
554 555 556 557 558 559 560
0.26414845 0.29056624 0.29751699 0.28654306 0.52018139 0.04101451 0.34115617
561 562 563 564 565 566 567
0.30454270 0.40910931 0.36318531 0.14295663 0.27535023 0.14808471 0.77112355
568 569 570 571 572 573 574
0.50837819 0.48043493 0.47859590 0.60657957 0.75934818 0.15317135 0.35257450
575 576 577 578 579 580 581
0.19449454 0.46319522 0.20718708 0.16590157 0.23396972 0.89689628 0.15609593
582 583 584 585 586 587 588
0.47811284 0.10746050 0.26850478 0.28232014 0.43717365 0.22213497 0.10325649
589 590 591 592 593 594 595
0.48971316 0.50972690 0.16482925 0.15550362 0.30184431 0.23549623 0.08369024
596 597 598 599 600
0.07363737 0.14202248 0.46007443 0.20586417 0.28367264
pl_model_regression_testing_final <- pl_testing %>%
mutate(pi = pl_model_testing_pi) %>%
mutate(pl_model_regression_testing_prediction = case_when(pi > 0.5 ~ 'Home',
pi <= 0.5 ~ 'Away'))
pl_model_regression_testing_final# A tibble: 600 × 14
team ftg_diff htg_diff s_diff st_diff f_diff c_diff y_diff r_diff wdl_ft
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Watford 1 0 14 -2 5 5 -1 0 Win
2 Southamp… 0 0 -12 -2 -3 -7 0 0 Draw
3 Crystal … 1 0 1 -2 -7 1 -2 0 Win
4 Bournemo… 1 1 -1 -1 -4 2 0 0 Win
5 West Ham -3 -1 -5 -8 -5 4 3 0 Lose
6 Leicester 0 0 -2 -4 0 -3 1 0 Draw
7 Bournemo… 0 0 -8 -3 -5 3 -2 0 Draw
8 West Brom -4 -2 -12 -7 7 -6 1 0 Lose
9 Southamp… -2 -1 -4 -6 -3 4 0 0 Lose
10 Swansea -2 -3 -3 -2 -7 2 -2 0 Lose
# ℹ 590 more rows
# ℹ 4 more variables: wdl_ht <chr>, home_or_away <fct>, pi <dbl>,
# pl_model_regression_testing_prediction <chr>
pl_model_regression_testing_tab <- table(pl_model_regression_testing_final$home_or_away, pl_model_regression_testing_final$pl_model_regression_testing_prediction, dnn=c('Actual', 'Predicted'))
pl_model_regression_testing_tab Predicted
Actual Away Home
Away 158 144
Home 99 199
pl_model_regression_testing_acc <- sum(diag(pl_model_regression_testing_tab))/sum(pl_model_regression_testing_tab)
pl_model_regression_testing_acc[1] 0.595
Predicted Actual Away Home Away 53 25 Home 20 62
From the above confusion matrix, we know:
**The overall model accuracy is (158+ 199)/160 = 0.60 or 60%.**
Of all matches the model predicted the superior team would lose, they got 158/257 = 0.61 or 61% correct.
Of all matches the model predicted the superior team would win, they got 199/343 or 58% correct.
Of all matches where the superior team lost, the model correctly identified 158/302 or 52%.
Of all matches where the superior team won, the model correctly identified 199/298 or 67%.
4 . Compare and contrast the Classification Tree model and the Binary Logistic Regression model. For example:
a. Which model is more accurate?
Model Training Testing
Binary Logistic Regression 72% 60%
Classification Tree 76% 58%
So for the training dataset, the classification tree was a little more accurate (76% to 72%), but for the testing dataset, the dataset that we would probably base which model we would use, the Binary Logistic Regression model is a little more accurate (60% to 58%). If we were to continue this study, we would probably prune our classification tree to see if we could get a little more accurate in our predictions.
b. How do the models compare in terms of variables that are considered important predictors of the target variable?
The Classification tree considers more variables as important predictors (1. ftg_diff (26%), 2. s_diff (17%) , 3. st_diff (17%) , 4. htg_diff (17%) , 5. c_diff (16%) , 6.f_diff (11%)), while the Binary Logisitic Regression Model considered only htg_diff and wdl_htWin.
Question 2
1. Import the dataset
Working directory is already set
baseball_hof <- read_csv("baseball_hof.csv")2. Does the data need to be scaled before computing the distance matrix for hierarchical clustering or before being entered into the K-means clustering algorithm? Explain your answer
Yes, we do need to scale data because although some of the variables may be relatively on the same scale (hits, runs, and maybe even RBIs), home runs and stolen bases are on a much lower scale, so the higher numbered variables (aforementioned) would dominate the clustering. So we will scale the data first (note: we had to remove names first from dataset hence the new dataset baseball_no_names was created).
3. Hierchical Clustering:
a. Create a suitable distance matrix containing the Euclidean distance between all pairs of players.
baseball_no_names <- select(baseball_hof, hits:stolen_bases)
baseball_hof_scale <- scale(baseball_no_names) #Scaling Data
distance_1 <- dist(baseball_hof_scale) #Computing Distanceb. Carry out a hierarchical clustering using the hclust function. You must use method = ‘ward.D’.
h1 <- hclust(distance_1, method = 'ward.D')
plot(h1, hang= -1 , cex = 0.6 , ylim = c(0, 10))
abline(h = 8, col = "red") # Adjust height threshold (e.g., h = 10)# Set plot size
options(repr.plot.width = 2, repr.plot.height = 6)c. Visualise the results of the hierarchical clustering using a dendrogram and a heatmap. Note that the heatmap will take several seconds to appear because there is a large number of players in the dataset. i. Does the heatmap provide evidence of any clustering structure within the dataset? Explain your answer.
The dendrogram is above and the heatmap is here below:
heatmap(as.matrix(distance_1), Rowv = as.dendrogram(h1), Colv = 'Rowv')The heatmap shows some pretty big clustering amongst the data. The bottom righthand corner of the heatmap is very light orange showing a pretty big cluster there, and then as you go up the diagonal line of symmetry on the heatmap, there are a couple of smaller light orange boxes again. There is another fairly big light square in the upper left hand corner of the map as well.
d. Create a 4-cluster solution and assess the quality of this solution.
library(cluster) # Load the cluster package
clusters1 <- cutree(h1, k = 4)
sil1 <- silhouette(clusters1, distance_1)summary(sil1)Silhouette of 82 units in 4 clusters from silhouette.default(x = clusters1, dist = distance_1) :
Cluster sizes and average silhouette widths:
17 25 30 10
0.3202221 0.2078566 0.2958375 0.4325219
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1771 0.2100 0.3257 0.2907 0.4224 0.5725
So of our 4 clusters, the sizes of those clusters (as seen above) were
Cluster 1 17 Cluster 2 25 Cluster 3 30 Cluster 4 10
which has all 82 players in a cluster. The overall mean of the clusters is .29, which is weak. The silhouette scores (.32, .21, .30, .43) show that there are no strong clusters here, and that Clusters 1, 3, and 4 are weak and could be artificial, and that Cluster 2 is has no real substantial structure. If we were reporting this study, we would have to report that there are no substantial clustering structure here.
e. Use tables and suitable graphs to help you describe the properties that seem common to each cluster of players.
library(dplyr)
baseball_hof_clus <- cbind(baseball_hof, baseball_hof_scale, clusters1)
#This puts player names, their scaled version of the variables and the clusters that put the players together
#Updating Column Names so we're not confused
colnames(baseball_hof_clus) <- c("playerID", "hits", "runs", "home_runs","rbi",
"stolen_bases", "hits_scaled", "runs_scaled", "home_runs_scaled", "rbi_scaled",
"stolen_bases_scaled", "clusters1" )
# Update values in the "clusters1" column
baseball_hof_clus$clusters1 <- case_when(
baseball_hof_clus$clusters1 == 1 ~ 'C1',
baseball_hof_clus$clusters1 == 2 ~ 'C2',
baseball_hof_clus$clusters1 == 3 ~ 'C3',
baseball_hof_clus$clusters1 == 4 ~ 'C4'
)
#Calculating the mean of the scaled variables
#Calculate mean value of all scaled variables for each cluster. Why not calculate the mean value of the original variables?
baseball_hof_clus_means <- baseball_hof_clus %>%
group_by(clusters1) %>%
summarise(mean_hits = mean(hits_scaled),
mean_runs = mean(runs_scaled),
mean_home_runs = mean(home_runs_scaled),
mean_rbi = mean(rbi_scaled),
mean_stolen_bases = mean(stolen_bases_scaled))
baseball_hof_clus_means# A tibble: 4 × 6
clusters1 mean_hits mean_runs mean_home_runs mean_rbi mean_stolen_bases
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 C1 0.804 0.973 0.984 1.26 -0.329
2 C2 0.617 0.430 -0.963 -0.662 1.12
3 C3 -0.427 -0.396 0.456 0.222 -0.532
4 C4 -1.63 -1.54 -0.634 -1.15 -0.634
#To get rid of negative means
# Function to convert negative values to positive
convert_to_positive <- function(x) {
ifelse(x < 0, abs(x), x)
}
# Apply the function to all columns using mutate_if
positive_mean_tibble_baseball_hof_clus <- baseball_hof_clus_means %>%
mutate_if(is.numeric, convert_to_positive)
#Convert tibble to dataset so that we can graph
baseball_hof_final <- positive_mean_tibble_baseball_hof_clus %>%
pivot_longer(cols = c( mean_hits, mean_runs, mean_home_runs, mean_rbi, mean_stolen_bases), names_to = "Event", values_to = "Average_Value")
baseball_hof_final# A tibble: 20 × 3
clusters1 Event Average_Value
<chr> <chr> <dbl>
1 C1 mean_hits 0.804
2 C1 mean_runs 0.973
3 C1 mean_home_runs 0.984
4 C1 mean_rbi 1.26
5 C1 mean_stolen_bases 0.329
6 C2 mean_hits 0.617
7 C2 mean_runs 0.430
8 C2 mean_home_runs 0.963
9 C2 mean_rbi 0.662
10 C2 mean_stolen_bases 1.12
11 C3 mean_hits 0.427
12 C3 mean_runs 0.396
13 C3 mean_home_runs 0.456
14 C3 mean_rbi 0.222
15 C3 mean_stolen_bases 0.532
16 C4 mean_hits 1.63
17 C4 mean_runs 1.54
18 C4 mean_home_runs 0.634
19 C4 mean_rbi 1.15
20 C4 mean_stolen_bases 0.634
#Now we can visualise the data
ggplot(baseball_hof_final, mapping = aes(x = Event, y = Average_Value, group = clusters1, colour = clusters1)) +
geom_line(size = 1) +
geom_point(size = 2) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
ylab("Mean KPIs") +
scale_x_discrete(labels= c("Hits", "Runs", "Home Runs", "RBIs", "Stolen Bases" )) +
ggtitle("Mean KPI for each Cluster for each Event")So from the line graph above we can see that the players in Cluster 3 (C3) are statistically the worst of the four clusters, so if this is a dataset that is trying to figure out who should be inducted into the Baseball Hall of Fame, then this cluster of players should probably be the last group to be viewed to get in. Cluster 2 contain players that get on base the most, and not necessarily through hits, so these would be speedy players who either get walked a lot (baseball term when a pitcher throws four bad pitches and the batter is awarded being sent to first base) or through bunts or they may even be a designated runner, players who don’t even hit but are subbed in when other hitters get on base. We can confirm that this cluster contains the most speedy players as well from the fact that they have the most stolen bases by a fairly large margin. Cluster 1 are heavy hitters that swing for the fences. They probably strike out the most because they are swinging with power trying to hit the ball out of the park. The are probably a bit heavier, as we see they don’t steal many bases. It’s surprising they don’t have more RBIs (as I’d assume they would hit 4th in the batting order which usually means other people are on base already), but guys like Babe Ruth and Hank Aaron are probably in this group (and yes, upon checking the original dataset, both of those players are in Cluster 1!). Cluster 4 then are the efficiency guys, guys who constantly make contact with the ball (the most hits), get on base the most, drive in the most runs (RBIs), can still hit home runs at a high rate, and still steal bases at a pretty good rate as well–we’d call them the jacks-of-all-trades.
4. K-means Clustering:
a. Carry out a K-means clustering that will produce 4 clusters. Remember to use set.seed(101) to ensure that your results are reproducible.
Our dataset (baseball_hof) is already loaded and the working directory already set. We have also already created a dataset without the athlete names (baseball_no_names), and we have also scaled the data already (baseball_hof_scale), so we can get straight into carrying out the K-means clustering.
set.seed(101)
kmeans1 <- kmeans(baseball_hof_scale, centers = 4)
kmeans1$cluster #The assignment of each observation (i.e. athlete) to each cluster. [1] 4 2 2 1 1 3 1 2 2 3 1 2 3 2 3 2 3 2 3 1 3 1 1 4 2 3 4 1 2 3 1 2 1 1 1 1 2 1
[39] 3 2 3 1 3 1 4 1 3 2 2 4 4 4 1 3 3 2 1 4 1 4 3 1 4 3 1 1 2 2 1 2 1 3 1 1 3 2
[77] 2 1 4 4 4 2
kmeans1$centers #The centroids for each cluster. hits runs home_runs rbi stolen_bases
1 -0.2044605 -0.1624618 0.5660315 0.3798416 -0.4849247
2 0.7521677 0.5537557 -0.9804896 -0.5809428 1.2098245
3 -1.1508405 -1.2010894 -0.5701789 -0.8728960 -0.4647733
4 0.8494748 1.1682310 1.2734836 1.4407847 -0.3236583
kmeans1$size #The number of observations in each cluster.[1] 28 22 19 13
kmeans1$iter #The number of iterations it took to converge to the final clustering solution.[1] 3
b. Assess the quality of the clustering solution by calculating silhouette scores.
# distance_1 has already been created which is the distance between the groups
sil_kmeans1 <- silhouette(kmeans1$cluster, distance_1)
summary(sil_kmeans1)Silhouette of 82 units in 4 clusters from silhouette.default(x = kmeans1$cluster, dist = distance_1) :
Cluster sizes and average silhouette widths:
28 22 19 13
0.3322757 0.2178155 0.2687416 0.3421893
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.06303 0.18425 0.28984 0.28842 0.41869 0.54977
Score Interpretation 0.71-1 A strong clustering structure has been found. 0.51-0.7 A reasonable clustering structure has been found. 0.25-0.5 The structure is weak and could be artificial. < 0.25 No substantial structure has been found.
Assessing quality of clustering So from this score interpretation, we can see that the overall clustering mean is .29, which means the structure is weak and could be artificial.
From the individiual clusters, the first (.33), third (.27), and fourth (.34) clusters all have weak and possibly artificial structures as well, and the second cluster (.22) has no substantial structure, which is not too unlike what we found with the hierarchical clustering (in which we also had three clusters with weak structures and one that was not substantial).
library(tidyr)
baseball_clus_kmeans_tidy <- as_tibble(kmeans1$centers) %>%
mutate(Cluster = c("C1", "C2", "C3", "C4")) %>%
pivot_longer(cols = c(hits, runs, home_runs, rbi, stolen_bases),
names_to = "KPI", values_to = "Average_Value")
# Define a function to convert negative values to positive
convert_to_positive <- function(x) {
ifelse(x < 0, abs(x), x)
}
# Apply the function to numeric columns using mutate_if
baseball_clus_kmeans_tidy <- baseball_clus_kmeans_tidy %>%
mutate_if(is.numeric, convert_to_positive)Graphing and interpreting clusters
library(ggplot2)
ggplot(baseball_clus_kmeans_tidy, mapping = aes(x = KPI, y = Average_Value, group = Cluster, colour = Cluster)) +
geom_line(size = 1) +
geom_point(size = 2) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
ylab("Mean KPI") +
scale_x_discrete(labels= c("Hits", "Runs", "Home Runs", "RBIs", "Stolen Bases")) +
ggtitle("Mean for each Cluster for each KPI")So these clusters are noticeably different than the Hierarchical Clustering results. We can still see that Cluster 4 is the power group–the group that has the power hitters that are probably bigger in size, hit a lot of home runs and don’t steal bases much. This cluster, though, does have higher RBIs and also more runs than the similar cluster in hierarchical. Cluster 1, similar to the last clustering exercise, is the worst players statistically in every category. Cluster 2 has our faster, probably slimmer players (think Ricky Henderson) who steal a lot of bases but don’t have a lot of power and do not hit too many home runs. And then Cluster3 are the jacks-of-all-trades again. They are the most efficient hitters that drive in a lot of runs, are pretty decent home run hitters, and in this group do not steal too many bases.
5. Compare and contrast the clusters produced by Hierarchical Clustering and K-means. For example:
a. Which algorithm produced the highest quality clusters?
So neither algorithm produced high quality clusters statistically. I'd be interested to see if we ran the K-means again with three clusters would the results be different. The overall mean from the Hierarchical was .29 and from K-means was also .29, both weak and possibly artificial.
b. Did both algorithms produce clusters with a similar profile? Or are they any noticeable differences?
Although the graphs do look different at first glance, the profiles of the clusters are fairly similar for both algorithms. These two examples below are the “Power hitter” clusters that we noticed from the graphs. I think the K-means algorithm, in this case, did a better job collecting more power hitters, as there is a big difference of the mean of home runs of that cluster compared to C1 in the Hierarchical compared to the other clusters in that algorithm. It would be interesting to see what the data of both algorithms looked like if we took out all of the weaker player clusters from both and then ran the algorithms with the remaining players. The K-means is much easier to run and interpret, while the Hierarchical takes a lot more data manipulation to get the desired results, but both were interesting and made it fun looking through statistics of baseball players that I grew up watching and looking up to.
C1-Hierarchical
C1 mean_hits 0.8041640
C1 mean_runs 0.9734091
C1 mean_home_runs 0.9842003
C1 mean_rbi 1.2601753
C1 mean_stolen_bases 0.3293725
C4-K-means 4 0.8494748
1.1682310
1.2734836
1.4407847
0.3236583