Classification and Clustering Assignment

Author

Lawrence Summers

Question 1

The first thing I needed to do was to set my working directory to the place where all of my datasets were for this assignment.

Import the pl_training.csv and pl_testing.csv datasets into R.

library(tidyverse)
library(rpart)
library(rattle)
library(TTR) #Contains the runMean function.

pl_1 <- read_csv("pl_testing.csv")

pl_testing <- read_csv ("pl_training.csv")

2. Classification Tree Method:

a.  Create and visualise a classification tree model that will allow you to classify a team as either the home or the away team.

pl_model_tree <- rpart(home_or_away ~ ftg_diff + htg_diff + s_diff + st_diff + f_diff + c_diff + y_diff + r_diff + wdl_ft + wdl_ht, data = pl_1, method = 'class')

fancyRpartPlot(pl_model_tree)

 pl_model_tree$variable.importance

  ftg_diff     s_diff    st_diff   htg_diff     c_diff     wdl_ht     f_diff 
11.4901855  9.1842740  9.1117851  8.7712959  8.2354829  4.9026088  4.3887718 
    wdl_ft     y_diff     r_diff 
 1.0000000  0.4112758  0.2871094

b. Interpret the classification tree:

So just by glancing at the classification tree, the first predictor variable that the tree splits on is ftg_difference, so obviously the difference of the goals scored is an important factor in determining whether a team is home or away. The tree is showing that if the full time goal difference is greater than -0.5 the team is the home team, and the leaf is showing that of all the games that fall into that category, 66% of the time that team is the home team. The nodes seemed to be a bit confusing with so many variables, so after using the $variable.importance code, we found that the most important predictors of home or away teams were from ftg_diff (which we had already established), s_diff, st_diff, htg_diff, and c_diff . From obtaining this information, a new classification tree was then computed to see the results:

pl_model_tree_new <- rpart(home_or_away ~ ftg_diff + htg_diff + s_diff + st_diff + f_diff + c_diff , data = pl_1, method = 'class')

fancyRpartPlot(pl_model_tree_new)

This tree produced literally the exact same classification tree.

i. Clearly state one rule for predicting if a team is the home team. Your answer should also address how pure the node is.

A clear rule for a home team is that ftg_diff is not greater than -0.5 goals AND c_diff is not greater than 3.5, the team is HOME, and of the games where this is true, the team is true 88% of the time, making this the purest node on the tree. So in other words, teams that lose by less than 0.5 goals and have a differnce in corners of less than 3.5 %88 the home team.

###ii. Clearly state one rule for predicting if a team is the away team. Your answer should also address how pure the node is.

A clear rule for predicting if a team is the away team is IF the ftg_diff is greater than -0.5 AND f_diff is greater than 0.5 fouls, that team is most likely the AWAY team. If teams are on this leaf, there is an 88% chance that it is the away team and only a 12% chance of being the home team, making this node very pure.

iii. Which variables are the most important for predicting if a team is home or away? Refer to the variable importance values in your answer.

summary(pl_model_tree_new)

Call:
rpart(formula = home_or_away ~ ftg_diff + htg_diff + s_diff + 
    st_diff + f_diff + c_diff, data = pl_1, method = "class")
  n= 160 

          CP nsplit rel error    xerror       xstd
1 0.33333333      0 1.0000000 1.1538462 0.08044808
2 0.03418803      1 0.6666667 0.8461538 0.07983279
3 0.02564103      4 0.5641026 0.8974359 0.08044808
4 0.01282051      6 0.5128205 0.9102564 0.08056930
5 0.01000000      8 0.4871795 0.9102564 0.08056930

Variable importance
ftg_diff   s_diff  st_diff htg_diff   c_diff   f_diff 
      22       17       17       17       16       11 

Node number 1: 160 observations,    complexity param=0.3333333
  predicted class=Home  expected loss=0.4875  P(node) =1
    class counts:    78    82
   probabilities: 0.487 0.513 
  left son=2 (64 obs) right son=3 (96 obs)
  Primary splits:
      ftg_diff < -0.5 to the left,  improve=9.918750, (0 missing)
      c_diff   < -2.5 to the left,  improve=9.118091, (0 missing)
      s_diff   < 10.5 to the left,  improve=7.601282, (0 missing)
      st_diff  < 0.5  to the left,  improve=6.120426, (0 missing)
      f_diff   < 1.5  to the right, improve=5.464164, (0 missing)
  Surrogate splits:
      htg_diff < -0.5 to the left,  agree=0.794, adj=0.484, (0 split)
      st_diff  < -0.5 to the left,  agree=0.794, adj=0.484, (0 split)
      s_diff   < -3.5 to the left,  agree=0.750, adj=0.375, (0 split)
      c_diff   < -0.5 to the left,  agree=0.669, adj=0.172, (0 split)
      f_diff   < 5.5  to the right, agree=0.631, adj=0.078, (0 split)

Node number 2: 64 observations,    complexity param=0.01282051
  predicted class=Away  expected loss=0.296875  P(node) =0.4
    class counts:    45    19
   probabilities: 0.703 0.297 
  left son=4 (42 obs) right son=5 (22 obs)
  Primary splits:
      f_diff   < -0.5 to the right, improve=2.766369, (0 missing)
      c_diff   < -2.5 to the left,  improve=1.914828, (0 missing)
      ftg_diff < -3.5 to the left,  improve=1.611607, (0 missing)
      htg_diff < -2.5 to the left,  improve=1.385417, (0 missing)
      s_diff   < 1.5  to the right, improve=1.126983, (0 missing)
  Surrogate splits:
      c_diff < 5    to the left,  agree=0.688, adj=0.091, (0 split)

Node number 3: 96 observations,    complexity param=0.03418803
  predicted class=Home  expected loss=0.34375  P(node) =0.6
    class counts:    33    63
   probabilities: 0.344 0.656 
  left son=6 (64 obs) right son=7 (32 obs)
  Primary splits:
      c_diff   < 3.5  to the left,  improve=4.593750, (0 missing)
      s_diff   < 10.5 to the left,  improve=3.520426, (0 missing)
      f_diff   < 2.5  to the right, improve=2.501383, (0 missing)
      htg_diff < 0.5  to the right, improve=1.463804, (0 missing)
      st_diff  < 0.5  to the left,  improve=1.425221, (0 missing)
  Surrogate splits:
      s_diff  < 7.5  to the left,  agree=0.792, adj=0.375, (0 split)
      st_diff < 4.5  to the left,  agree=0.708, adj=0.125, (0 split)

Node number 4: 42 observations
  predicted class=Away  expected loss=0.1904762  P(node) =0.2625
    class counts:    34     8
   probabilities: 0.810 0.190 

Node number 5: 22 observations,    complexity param=0.01282051
  predicted class=Away  expected loss=0.5  P(node) =0.1375
    class counts:    11    11
   probabilities: 0.500 0.500 
  left son=10 (14 obs) right son=11 (8 obs)
  Primary splits:
      s_diff   < -6   to the right, improve=0.3928571, (0 missing)
      st_diff  < -3.5 to the right, improve=0.3928571, (0 missing)
      f_diff   < -3.5 to the left,  improve=0.3666667, (0 missing)
      ftg_diff < -1.5 to the left,  improve=0.1047619, (0 missing)
      c_diff   < 0.5  to the left,  improve=0.1047619, (0 missing)
  Surrogate splits:
      st_diff  < -5   to the right, agree=0.864, adj=0.625, (0 split)
      htg_diff < -0.5 to the left,  agree=0.727, adj=0.250, (0 split)

Node number 6: 64 observations,    complexity param=0.03418803
  predicted class=Home  expected loss=0.453125  P(node) =0.4
    class counts:    29    35
   probabilities: 0.453 0.547 
  left son=12 (24 obs) right son=13 (40 obs)
  Primary splits:
      f_diff   < 2.5  to the right, improve=1.3020830, (0 missing)
      c_diff   < -2.5 to the left,  improve=1.2728630, (0 missing)
      s_diff   < 1.5  to the left,  improve=1.1615650, (0 missing)
      htg_diff < 0.5  to the right, improve=1.0911840, (0 missing)
      st_diff  < 0.5  to the left,  improve=0.3669551, (0 missing)
  Surrogate splits:
      s_diff  < 4.5  to the right, agree=0.656, adj=0.083, (0 split)
      st_diff < 4.5  to the right, agree=0.656, adj=0.083, (0 split)

Node number 7: 32 observations
  predicted class=Home  expected loss=0.125  P(node) =0.2
    class counts:     4    28
   probabilities: 0.125 0.875 

Node number 10: 14 observations
  predicted class=Away  expected loss=0.4285714  P(node) =0.0875
    class counts:     8     6
   probabilities: 0.571 0.429 

Node number 11: 8 observations
  predicted class=Home  expected loss=0.375  P(node) =0.05
    class counts:     3     5
   probabilities: 0.375 0.625 

Node number 12: 24 observations,    complexity param=0.03418803
  predicted class=Away  expected loss=0.4166667  P(node) =0.15
    class counts:    14    10
   probabilities: 0.583 0.417 
  left son=24 (12 obs) right son=25 (12 obs)
  Primary splits:
      htg_diff < 0.5  to the right, improve=3.0000000, (0 missing)
      ftg_diff < 1.5  to the right, improve=2.0416670, (0 missing)
      f_diff   < 3.5  to the left,  improve=2.0416670, (0 missing)
      st_diff  < 2.5  to the right, improve=0.8414918, (0 missing)
      c_diff   < -1.5 to the left,  improve=0.6666667, (0 missing)
  Surrogate splits:
      ftg_diff < 0.5  to the right, agree=0.667, adj=0.333, (0 split)
      c_diff   < -1.5 to the left,  agree=0.667, adj=0.333, (0 split)
      s_diff   < -6   to the left,  agree=0.625, adj=0.250, (0 split)
      st_diff  < -0.5 to the right, agree=0.625, adj=0.250, (0 split)
      f_diff   < 3.5  to the left,  agree=0.583, adj=0.167, (0 split)

Node number 13: 40 observations,    complexity param=0.02564103
  predicted class=Home  expected loss=0.375  P(node) =0.25
    class counts:    15    25
   probabilities: 0.375 0.625 
  left son=26 (23 obs) right son=27 (17 obs)
  Primary splits:
      s_diff   < 1.5  to the left,  improve=2.330563, (0 missing)
      st_diff  < 0.5  to the left,  improve=1.657268, (0 missing)
      htg_diff < 1.5  to the left,  improve=1.250000, (0 missing)
      ftg_diff < 0.5  to the left,  improve=1.203333, (0 missing)
      c_diff   < -2.5 to the left,  improve=1.203333, (0 missing)
  Surrogate splits:
      st_diff  < 2.5  to the left,  agree=0.850, adj=0.647, (0 split)
      c_diff   < 0.5  to the left,  agree=0.700, adj=0.294, (0 split)
      htg_diff < 1.5  to the left,  agree=0.675, adj=0.235, (0 split)
      ftg_diff < 0.5  to the left,  agree=0.650, adj=0.176, (0 split)
      f_diff   < -6.5 to the right, agree=0.600, adj=0.059, (0 split)

Node number 24: 12 observations
  predicted class=Away  expected loss=0.1666667  P(node) =0.075
    class counts:    10     2
   probabilities: 0.833 0.167 

Node number 25: 12 observations
  predicted class=Home  expected loss=0.3333333  P(node) =0.075
    class counts:     4     8
   probabilities: 0.333 0.667 

Node number 26: 23 observations,    complexity param=0.02564103
  predicted class=Away  expected loss=0.4782609  P(node) =0.14375
    class counts:    12    11
   probabilities: 0.522 0.478 
  left son=52 (16 obs) right son=53 (7 obs)
  Primary splits:
      st_diff  < 0.5  to the left,  improve=1.1211180, (0 missing)
      c_diff   < -2.5 to the left,  improve=1.0540180, (0 missing)
      s_diff   < -7.5 to the right, improve=0.5282609, (0 missing)
      f_diff   < -3.5 to the left,  improve=0.2615942, (0 missing)
      ftg_diff < 0.5  to the left,  improve=0.1903821, (0 missing)
  Surrogate splits:
      htg_diff < 1.5  to the left,  agree=0.783, adj=0.286, (0 split)
      f_diff   < -5.5 to the right, agree=0.783, adj=0.286, (0 split)
      ftg_diff < 2.5  to the left,  agree=0.739, adj=0.143, (0 split)
      s_diff   < -0.5 to the left,  agree=0.739, adj=0.143, (0 split)

Node number 27: 17 observations
  predicted class=Home  expected loss=0.1764706  P(node) =0.10625
    class counts:     3    14
   probabilities: 0.176 0.824 

Node number 52: 16 observations
  predicted class=Away  expected loss=0.375  P(node) =0.1
    class counts:    10     6
   probabilities: 0.625 0.375 

Node number 53: 7 observations
  predicted class=Home  expected loss=0.2857143  P(node) =0.04375
    class counts:     2     5
   probabilities: 0.286 0.714

Using the summary() code, it tells us that the most important variables in being able to predict home or away teams are, in order:

ftg_diff (26%)
s_diff (17%)
st_diff (17%)
htg_diff (17%)
c_diff (16%)
f_diff (11%)

The percentages here represent the percentage improvements to the model that can be attributed to that particular variable.

c. Assess the accuracy of the classification tree using both the training and the testing datasets.

In order to assess the accuracy of the classification tree, we used the predict function in r studio to eventually create a cross-tabulation of what the model predicts and what actually happened.

First we tested the training dataset (pl_1)

 pl_model_tree_prob <- predict(pl_model_tree, newdata = pl_1, type = "prob")

 pl_model_tree_prediction <- predict(pl_model_tree, newdata =pl_1, 
type = 'class')


pl_model_tree_final <- cbind(pl_1, pl_model_tree_prob, 
            pl_model_tree_prediction)



pl_model_tree_tab <- table(pl_model_tree_final$home_or_away, 
pl_model_tree_final$pl_model_tree_prediction, 
dnn = c('Actual', 'Predicted'))

pl_model_tree_tab

      Predicted
Actual Away Home
  Away   62   16
  Home   22   60

From the above confusion matrix, we know the following: • The overall model accuracy is (62 + 60)/160 = 0.76 or 76%. • Of all matches the model predicted the team was away, they got 62/84 = 0.74 or 74% correct. • Of all matches the model predicted the team be home, they got 60/76 = .79 or 79% correct. • Of all matches where the predicted away team was actually away, the model correctly identified 62/86 = .72 or 72%. • Of all matches where the predicted home team actually was home, the model correctly identified 60/82 = .73 or 73%.

Next we had to test the testing dataset (pl_testing)

pl_model_tree_prob_testing <- predict (pl_model_tree, newdata = pl_testing , type = "prob")

 pl_model_tree_prediction_testing <- predict(pl_model_tree, newdata = pl_testing, 
type = 'class')


pl_model_tree_final_testing <- cbind(pl_testing, pl_model_tree_prob_testing, 
            pl_model_tree_prediction_testing)



pl_model_tree_tab_testing <- table(pl_model_tree_final_testing$home_or_away, 
pl_model_tree_final_testing$pl_model_tree_prediction_testing, 
dnn = c('Actual', 'Predicted'))

pl_model_tree_tab_testing

      Predicted
Actual Away Home
  Away  175  127
  Home  127  171

From the above confusion matrix, we know the following: • The overall model accuracy is (175 + 171)/600 = 0.58 or 58%. • Of all matches the model predicted the team was away, they got 175/302 = 0.58 or 58% correct. • Of all matches the model predicted the team be home, they got 171/298 = .57 or 57% correct. • Of all matches where the predicted away team was actually away, the model correctly identified 175/86 = .58 or 58%. • Of all matches where the predicted home team actually was home, the model correctly identified 171/298 = .57 or 57%.

i. Based on your findings, do you think the classification tree is overfitting the training dataset? Explain your answer. Note that if you detect overfitting, then for the purpose of this assignment you do not need to create a pruned version of the tree.

Our model predicts that 76% of our overall training data correctly, but only 58% of the overall data for our testing dataset, which means that our classification tree is a pretty good model for the training data but does not predict new data too well. Our findings would suggest overfitting with the training data, and if we were to continue with this particular classification tree, pruning may be the next step.

3. Binary Logistic Regression Method:

a. Create a binary logistic regression model that will allow you to classify a team as either the home or away team.

So, just like with the classification tree, we are going to use all of the variables to see if they have any affect on knowing whether a team is home or away. We will be using our same training dataset (pl_1) and our test dataset will be pl_testing

pl_1$home_or_away <- factor(pl_1$home_or_away, levels = c("Away", "Home"))
 pl_testing$home_or_away <- factor(pl_testing$home_or_away, levels = c("Away", "Home"))
 
 levels(pl_1$home_or_away)

[1] "Away" "Home"

levels(pl_testing$home_or_away)   #To check the levels are sorted correctly

[1] "Away" "Home"

i. Make sure to correctly prepare the response/target variable, i.e. convert to a factor with levels ordered appropriately.

So now our target variable (home or away) is set and ordered as the failure (away) as the first option and success (home) being the second option.

#Binary Logistic Regression Model

pl_model <- glm(home_or_away ~ ftg_diff + htg_diff + s_diff + st_diff + f_diff + c_diff + y_diff + r_diff + wdl_ft + wdl_ht, data = pl_1, family = binomial(link = "logit"))

summary(pl_model)


Call:
glm(formula = home_or_away ~ ftg_diff + htg_diff + s_diff + st_diff + 
    f_diff + c_diff + y_diff + r_diff + wdl_ft + wdl_ht, family = binomial(link = "logit"), 
    data = pl_1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  0.73664    0.43997   1.674  0.09407 . 
ftg_diff    -0.14028    0.23626  -0.594  0.55269   
htg_diff     1.07108    0.49514   2.163  0.03053 * 
s_diff       0.05090    0.04373   1.164  0.24441   
st_diff     -0.05168    0.09383  -0.551  0.58174   
f_diff      -0.08778    0.04638  -1.893  0.05842 . 
c_diff       0.04308    0.05356   0.804  0.42122   
y_diff       0.04629    0.12855   0.360  0.71879   
r_diff       0.92960    0.61651   1.508  0.13159   
wdl_ftLose  -1.15189    0.64644  -1.782  0.07477 . 
wdl_ftWin    0.47384    0.64091   0.739  0.45971   
wdl_htLose   0.99180    0.76045   1.304  0.19216   
wdl_htWin   -2.09852    0.73795  -2.844  0.00446 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 221.71  on 159  degrees of freedom
Residual deviance: 178.30  on 147  degrees of freedom
AIC: 204.3

Number of Fisher Scoring iterations: 4

So the problem we encountered is that an error kept coming up each time we tried to run the binary regression model, and the error was that one of the variables was coming up as not being able to be read. The reason for this issue (after LONG deliberations) was the realisation that values for the variable home_or_away were written as Home and Away rather than home and away, which meant that our code was completely wrong. Finally we corrected the problem and were able to get to a regression equation!

ii. Clearly state the regression equation.

y=ln(π/(1-π))= 0.73664 -0.14028* ftg_diff +1.07108* htg_diff +0.05090* s_diff-0.05168* st_diff-0.08778* f_diff+0.04308* c_diff+0.04629* y_diff+0.92960* r_diff-1.15189* wdl_ftLose +0.47384* wdl_ftWin +0.99180* wdl_htLose -2.09852* wdl_htWin

###iii. Which predictor variables are important in classifying a team as being the home or away team? Include p-values and refer to level of significance in your answer.

The predictor variables that are important in classifying a team as being home or away are:

htg_diff, which had a p value of 0.03053 and wdl_htWin, which had a p value of 0.00446, which means that we would reject H0 for each of these variables (which would state that both of them ARE NOT important indicators in predicting whether a team is home or away).

So, in other words, half time goal difference (htg_diff) is sigificant to the .05 significance level and whether a team was winning, losing, or drawing at half time (wdl_htwin) has a .01 significance level of predicting whether a team is home or away.

###iv. For each significant predictor variable identified in the previous question, calculate and clearly state the impact that variable has on the odds of a team being classifed as the home team.

So for htg_diff, which has a coefficient of 1.07108.

Which means that e^(b_i ) = e^ 1.07108 = 2.919, meaning that an increase of 1 goal difference between two teams at halftime multiplies the odds of the the team with the positive goal difference being the home team almost three times (2.919).

Equally for wdl_htwin, there is a negative coefficient of -2.09852, which mean that e^(b_i ) = e^-2.09852 = 0.123

So I’m not entirely sure how to interpret this considering that wdl_htwin is a categorical variable, but I’m assuming that if team increases its chances to be winning a game at halftime then it is associated with a 0.123 – 1 X 100% = -.877 X 100% = -87.7% change in the odds of a success, which is a 88% reduction in the odds of the that team being the home team (I am not entirely sure of this explanation with a categorical variable). In simple terms, I think it means that the more the categorical variable changes (from winning to draw or to losing), the less chance there is for the team to be the home team.

b. Fully assess the accuracy of the logistic regression model using both the training and the testing datasets.

#Training dataset

pl_model_pi <- predict(pl_model, newdata = pl_1, type = 'response')
pl_model_pi

         1          2          3          4          5          6          7 
0.22792730 0.50609215 0.33998406 0.34817463 0.57463749 0.70927573 0.29282301 
         8          9         10         11         12         13         14 
0.34566808 0.40044453 0.18073676 0.06481144 0.42552349 0.49374639 0.51589769 
        15         16         17         18         19         20         21 
0.61764525 0.32100911 0.36834663 0.76183759 0.79941221 0.51513127 0.65703755 
        22         23         24         25         26         27         28 
0.58709373 0.75500071 0.79143377 0.92241004 0.54045221 0.84525716 0.54330179 
        29         30         31         32         33         34         35 
0.80793468 0.55379132 0.32372405 0.71869321 0.09951637 0.60518481 0.91819281 
        36         37         38         39         40         41         42 
0.68724594 0.68970138 0.24920305 0.41530812 0.45020364 0.61345633 0.38822793 
        43         44         45         46         47         48         49 
0.77092715 0.79611653 0.56716165 0.86700040 0.87090023 0.82932191 0.76495446 
        50         51         52         53         54         55         56 
0.73260941 0.71374359 0.79858149 0.19149135 0.89663213 0.74517103 0.78911180 
        57         58         59         60         61         62         63 
0.58977055 0.47767381 0.60436328 0.96547865 0.86632998 0.59674534 0.79861587 
        64         65         66         67         68         69         70 
0.64133098 0.62580656 0.77753465 0.92996688 0.58034164 0.83490145 0.79836040 
        71         72         73         74         75         76         77 
0.70741001 0.53203601 0.82729040 0.74036647 0.71945866 0.86413460 0.95672792 
        78         79         80         81         82         83         84 
0.81189247 0.79908438 0.84832836 0.83678635 0.84667690 0.24106758 0.85429740 
        85         86         87         88         89         90         91 
0.34531989 0.35153756 0.60782369 0.73799398 0.49716232 0.47230855 0.31715717 
        92         93         94         95         96         97         98 
0.17383768 0.60769853 0.86080909 0.33252157 0.77293889 0.40804976 0.68127396 
        99        100        101        102        103        104        105 
0.42958161 0.15705333 0.55092248 0.70097084 0.38104100 0.40355738 0.40717370 
       106        107        108        109        110        111        112 
0.39593812 0.36517765 0.68812699 0.50764301 0.47253496 0.21991663 0.54259179 
       113        114        115        116        117        118        119 
0.60313282 0.23570793 0.26369853 0.53575816 0.06048600 0.62041673 0.27351594 
       120        121        122        123        124        125        126 
0.52511013 0.09550808 0.72217085 0.52933412 0.25360825 0.36641385 0.23892874 
       127        128        129        130        131        132        133 
0.28685182 0.40496887 0.74542183 0.62330329 0.27380849 0.07785593 0.29350723 
       134        135        136        137        138        139        140 
0.73181300 0.22566740 0.45762103 0.67189165 0.15007791 0.64304512 0.07808630 
       141        142        143        144        145        146        147 
0.24371706 0.20302604 0.31069930 0.35762761 0.15588689 0.19496836 0.55525549 
       148        149        150        151        152        153        154 
0.03226201 0.18222104 0.01803438 0.54011515 0.39178630 0.13261484 0.17581713 
       155        156        157        158        159        160 
0.15996976 0.33914188 0.29809164 0.12499032 0.11709195 0.11231466

pl_model_regression_final <- pl_1 %>%
                          mutate(pi = pl_model_pi) %>%
                          mutate(pl_model_regression_prediction = case_when(pi > 0.5 ~ 'Home', 
                                                                       pi <= 0.5 ~ 'Away'))
pl_model_regression_final

# A tibble: 160 × 14
   team      ftg_diff htg_diff s_diff st_diff f_diff c_diff y_diff r_diff wdl_ft
   <chr>        <dbl>    <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <chr> 
 1 Burnley         -2       -2     -4      -6      1      0      1      0 Lose  
 2 Leicester        0        0     -5      -3      4     -5     -1      0 Draw  
 3 Burnley         -1        0     -8      -2     -1     -4      0      0 Lose  
 4 Sunderla…       -3       -1    -18      -6     -4     -8      1      0 Lose  
 5 West Brom       -2       -1      0      -4     -4     -1      0      0 Lose  
 6 Middlesb…        0        0     -1      -3      1      0      3      0 Draw  
 7 Stoke            0        1    -19      -7     -1     -6      4      0 Draw  
 8 Leicester       -3       -2     -4      -8     -1      7     -2      0 Lose  
 9 Stoke           -1        0    -15      -6      5     -6      2      1 Lose  
10 Swansea         -2        1    -14      -7     -1     -2      0      0 Lose  
# ℹ 150 more rows
# ℹ 4 more variables: wdl_ht <chr>, home_or_away <fct>, pi <dbl>,
#   pl_model_regression_prediction <chr>

pl_model_regression_tab <- table(pl_model_regression_final$home_or_away, pl_model_regression_final$pl_model_regression_prediction, dnn=c('Actual', 'Predicted'))
pl_model_regression_tab

      Predicted
Actual Away Home
  Away   53   25
  Home   20   62

pl_model_regression_acc <- sum(diag(pl_model_regression_tab))/sum(pl_model_regression_tab)
pl_model_regression_acc

[1] 0.71875

From the above confusion matrix, we know:

**The overall model accuracy is (53+ 62)/160 = 0.72 or 72%.**

Of all matches the model predicted the superior team would lose, they got 53/73 = 0.73 or 73% correct.

Of all matches the model predicted the superior team would win, they got 62/87 or 71% correct.

Of all matches where the superior team lost, the model correctly identified 53/78 or 68%.

Of all matches where the superior team won, the model correctly identified 62/82 or 76%.

So 72% is not incredibly high, but remember, our classification tree overall model accurracy was 76%. Now we have to assess our regression equation with unseen data (pl_testing)

#Testing dataset

pl_model_testing_pi <- predict(pl_model, newdata = pl_testing, type = 'response')
pl_model_testing_pi

         1          2          3          4          5          6          7 
0.83430022 0.54759585 0.85685970 0.61794364 0.69748402 0.68105293 0.72313293 
         8          9         10         11         12         13         14 
0.11103605 0.58131785 0.14216648 0.34197717 0.29977092 0.46280943 0.79973801 
        15         16         17         18         19         20         21 
0.56136039 0.11942824 0.33033781 0.36046833 0.67547674 0.15832353 0.59292954 
        22         23         24         25         26         27         28 
0.83437781 0.53615890 0.38674038 0.58832859 0.28031186 0.44020388 0.34142836 
        29         30         31         32         33         34         35 
0.20548973 0.17152690 0.62928029 0.18136046 0.71193101 0.80715924 0.48314900 
        36         37         38         39         40         41         42 
0.66674929 0.18676370 0.42286764 0.59533495 0.77658353 0.70764191 0.75402463 
        43         44         45         46         47         48         49 
0.26999656 0.69520038 0.67066470 0.53594389 0.36331923 0.40902821 0.84139526 
        50         51         52         53         54         55         56 
0.45001487 0.34240311 0.72899540 0.04310200 0.74933105 0.38788327 0.64248887 
        57         58         59         60         61         62         63 
0.47272585 0.37173379 0.61191698 0.23226328 0.37877244 0.73086705 0.40950600 
        64         65         66         67         68         69         70 
0.69786431 0.19551913 0.62509434 0.77681122 0.55118713 0.54839123 0.24015353 
        71         72         73         74         75         76         77 
0.60983900 0.38987884 0.10588165 0.23256547 0.39952235 0.34269822 0.86108718 
        78         79         80         81         82         83         84 
0.77323763 0.42416351 0.59515376 0.65317935 0.72708453 0.37613167 0.52525523 
        85         86         87         88         89         90         91 
0.84543085 0.59366010 0.17704892 0.71213082 0.24017231 0.50889928 0.90014179 
        92         93         94         95         96         97         98 
0.39766832 0.20634694 0.47610661 0.49389655 0.69119914 0.65052780 0.65753937 
        99        100        101        102        103        104        105 
0.34480329 0.55402439 0.46632478 0.71108588 0.87016472 0.76600660 0.67748186 
       106        107        108        109        110        111        112 
0.83263961 0.85744536 0.53808811 0.52770380 0.03466494 0.46464427 0.68900951 
       113        114        115        116        117        118        119 
0.37131189 0.50746487 0.69346035 0.32653044 0.55065961 0.46311501 0.39569555 
       120        121        122        123        124        125        126 
0.87859606 0.56007621 0.65484326 0.75538041 0.89782129 0.67052111 0.60944321 
       127        128        129        130        131        132        133 
0.76340343 0.53857792 0.32343206 0.15580991 0.67136895 0.77287355 0.85318533 
       134        135        136        137        138        139        140 
0.55859708 0.83972761 0.32713548 0.60721251 0.44979004 0.19346947 0.72205186 
       141        142        143        144        145        146        147 
0.82856157 0.26336374 0.52707403 0.60715725 0.38171602 0.74168588 0.70367643 
       148        149        150        151        152        153        154 
0.59913651 0.37748867 0.25261510 0.07113325 0.67158075 0.46662105 0.39027319 
       155        156        157        158        159        160        161 
0.69101255 0.75629854 0.86641707 0.45001297 0.91919511 0.21830828 0.81142756 
       162        163        164        165        166        167        168 
0.72749850 0.70710549 0.46689420 0.74334093 0.66046492 0.66701352 0.43851645 
       169        170        171        172        173        174        175 
0.59389747 0.87398614 0.22380350 0.62668696 0.64344464 0.74363623 0.87008580 
       176        177        178        179        180        181        182 
0.56461638 0.67526969 0.41795680 0.39437439 0.70355638 0.76057574 0.81104676 
       183        184        185        186        187        188        189 
0.45666666 0.55876285 0.73177852 0.49942627 0.87585988 0.64548093 0.41786150 
       190        191        192        193        194        195        196 
0.56860656 0.20007395 0.82654382 0.46579576 0.85453455 0.41436078 0.11010509 
       197        198        199        200        201        202        203 
0.87218334 0.50905172 0.59896331 0.77621775 0.78567396 0.85056385 0.80548890 
       204        205        206        207        208        209        210 
0.66124795 0.78019282 0.28541920 0.58067060 0.84206108 0.62497829 0.34586195 
       211        212        213        214        215        216        217 
0.71533733 0.70034885 0.55402842 0.46466741 0.84800192 0.82506127 0.68060033 
       218        219        220        221        222        223        224 
0.35399649 0.37270186 0.80573008 0.28903089 0.89633476 0.50483344 0.62751852 
       225        226        227        228        229        230        231 
0.41320273 0.70887628 0.24918702 0.51586463 0.76309451 0.67752559 0.77998655 
       232        233        234        235        236        237        238 
0.27755008 0.54390826 0.53445963 0.51063228 0.69443110 0.80037747 0.89685156 
       239        240        241        242        243        244        245 
0.58031387 0.47162086 0.61901326 0.68596895 0.52057691 0.56812027 0.62427295 
       246        247        248        249        250        251        252 
0.67377533 0.47241243 0.64756366 0.59033910 0.23846181 0.55108216 0.89598717 
       253        254        255        256        257        258        259 
0.75844483 0.75148618 0.53847879 0.32680674 0.67106972 0.83948260 0.64582635 
       260        261        262        263        264        265        266 
0.40317352 0.94482345 0.58580650 0.86305912 0.56219511 0.65839607 0.80817772 
       267        268        269        270        271        272        273 
0.17855678 0.68172653 0.70548047 0.70700915 0.73891332 0.92450379 0.80265560 
       274        275        276        277        278        279        280 
0.90170368 0.95646092 0.90859549 0.73700802 0.91760107 0.70568952 0.07765133 
       281        282        283        284        285        286        287 
0.97553605 0.65349265 0.85881198 0.77442325 0.66612976 0.92083062 0.65055781 
       288        289        290        291        292        293        294 
0.43282443 0.41328485 0.63295434 0.62879268 0.87790785 0.88911591 0.90208625 
       295        296        297        298        299        300        301 
0.93046318 0.46221048 0.89522608 0.85268576 0.30551276 0.78284505 0.27008026 
       302        303        304        305        306        307        308 
0.71270583 0.80983146 0.31167095 0.81131826 0.67143232 0.62556327 0.57824733 
       309        310        311        312        313        314        315 
0.81546530 0.58492047 0.83802724 0.45947599 0.35676697 0.64139320 0.77700101 
       316        317        318        319        320        321        322 
0.77322298 0.84374564 0.58094304 0.59752610 0.56508908 0.67704478 0.88465978 
       323        324        325        326        327        328        329 
0.60327845 0.46413871 0.76831956 0.76850236 0.38784531 0.77838304 0.85045176 
       330        331        332        333        334        335        336 
0.73565326 0.73901204 0.91451694 0.56613594 0.90907485 0.63841796 0.34605564 
       337        338        339        340        341        342        343 
0.82356860 0.26795775 0.86268113 0.75142857 0.33235467 0.91355034 0.55661088 
       344        345        346        347        348        349        350 
0.37346111 0.58736708 0.66444141 0.65672640 0.41468497 0.38805059 0.79514466 
       351        352        353        354        355        356        357 
0.51411981 0.29454465 0.80966531 0.45157771 0.94205872 0.42560290 0.53611913 
       358        359        360        361        362        363        364 
0.42886727 0.67516095 0.28952895 0.57826816 0.44960052 0.78918495 0.70766811 
       365        366        367        368        369        370        371 
0.54569017 0.61639134 0.51362613 0.38447740 0.90112365 0.30518700 0.64331125 
       372        373        374        375        376        377        378 
0.37621087 0.69853852 0.58627755 0.71215597 0.77609534 0.70731756 0.76900283 
       379        380        381        382        383        384        385 
0.57700843 0.58414270 0.26325558 0.29731386 0.49855371 0.69851602 0.45396892 
       386        387        388        389        390        391        392 
0.54847481 0.39828961 0.20872252 0.35723365 0.33389809 0.63819298 0.69851685 
       393        394        395        396        397        398        399 
0.19725061 0.52590181 0.60903784 0.41818468 0.73799966 0.44624766 0.36857009 
       400        401        402        403        404        405        406 
0.59651783 0.66095521 0.27611046 0.58187410 0.28850770 0.77839555 0.45596816 
       407        408        409        410        411        412        413 
0.50915886 0.36956069 0.39433392 0.37110224 0.30805718 0.42044501 0.65533764 
       414        415        416        417        418        419        420 
0.60473164 0.95325862 0.71847159 0.24843385 0.78948541 0.36090764 0.68251887 
       421        422        423        424        425        426        427 
0.49472183 0.60167033 0.86888318 0.37406287 0.83494591 0.52795660 0.23433941 
       428        429        430        431        432        433        434 
0.43196574 0.32330905 0.19169915 0.20132812 0.16481786 0.52116000 0.31941484 
       435        436        437        438        439        440        441 
0.50198959 0.40704454 0.38553763 0.66252383 0.60505047 0.79870996 0.26388498 
       442        443        444        445        446        447        448 
0.39427557 0.19889039 0.77518344 0.12263693 0.60101108 0.73840117 0.75327056 
       449        450        451        452        453        454        455 
0.22989484 0.67196109 0.79655218 0.73844592 0.73009043 0.59709248 0.54704329 
       456        457        458        459        460        461        462 
0.68421777 0.31575398 0.90533155 0.71686184 0.53361773 0.49759406 0.31736010 
       463        464        465        466        467        468        469 
0.10145752 0.39691876 0.47231046 0.88803111 0.50349303 0.26980047 0.64380507 
       470        471        472        473        474        475        476 
0.71663904 0.43335576 0.48392935 0.60231790 0.71751074 0.35852543 0.28867327 
       477        478        479        480        481        482        483 
0.60068753 0.17724059 0.63072178 0.26045461 0.50491902 0.23581165 0.18734901 
       484        485        486        487        488        489        490 
0.24718017 0.31311516 0.14575187 0.83849256 0.61529390 0.42331124 0.75524636 
       491        492        493        494        495        496        497 
0.35717237 0.21091944 0.23240760 0.36654353 0.50861839 0.85547176 0.09692285 
       498        499        500        501        502        503        504 
0.41394010 0.32901635 0.38971125 0.16651457 0.11399983 0.15027445 0.15591503 
       505        506        507        508        509        510        511 
0.75562824 0.53155128 0.28900354 0.64708426 0.34592121 0.46401951 0.43099624 
       512        513        514        515        516        517        518 
0.37183647 0.48657171 0.37087832 0.28418802 0.23425122 0.57200299 0.55209853 
       519        520        521        522        523        524        525 
0.41804008 0.30299511 0.50981153 0.23122225 0.75217913 0.61204727 0.81298361 
       526        527        528        529        530        531        532 
0.67518999 0.40745722 0.25847432 0.55173845 0.85218980 0.74069927 0.02551766 
       533        534        535        536        537        538        539 
0.54747535 0.38946841 0.10152612 0.67976628 0.26461984 0.34625273 0.45069792 
       540        541        542        543        544        545        546 
0.25108567 0.57057749 0.33105616 0.30593039 0.67873726 0.82973463 0.28499111 
       547        548        549        550        551        552        553 
0.33696309 0.70049188 0.37366308 0.07835562 0.58154355 0.65498251 0.60136924 
       554        555        556        557        558        559        560 
0.26414845 0.29056624 0.29751699 0.28654306 0.52018139 0.04101451 0.34115617 
       561        562        563        564        565        566        567 
0.30454270 0.40910931 0.36318531 0.14295663 0.27535023 0.14808471 0.77112355 
       568        569        570        571        572        573        574 
0.50837819 0.48043493 0.47859590 0.60657957 0.75934818 0.15317135 0.35257450 
       575        576        577        578        579        580        581 
0.19449454 0.46319522 0.20718708 0.16590157 0.23396972 0.89689628 0.15609593 
       582        583        584        585        586        587        588 
0.47811284 0.10746050 0.26850478 0.28232014 0.43717365 0.22213497 0.10325649 
       589        590        591        592        593        594        595 
0.48971316 0.50972690 0.16482925 0.15550362 0.30184431 0.23549623 0.08369024 
       596        597        598        599        600 
0.07363737 0.14202248 0.46007443 0.20586417 0.28367264

pl_model_regression_testing_final <- pl_testing %>%
                          mutate(pi = pl_model_testing_pi) %>%
                          mutate(pl_model_regression_testing_prediction = case_when(pi > 0.5 ~ 'Home', 
                                                                       pi <= 0.5 ~ 'Away'))
pl_model_regression_testing_final

# A tibble: 600 × 14
   team      ftg_diff htg_diff s_diff st_diff f_diff c_diff y_diff r_diff wdl_ft
   <chr>        <dbl>    <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <chr> 
 1 Watford          1        0     14      -2      5      5     -1      0 Win   
 2 Southamp…        0        0    -12      -2     -3     -7      0      0 Draw  
 3 Crystal …        1        0      1      -2     -7      1     -2      0 Win   
 4 Bournemo…        1        1     -1      -1     -4      2      0      0 Win   
 5 West Ham        -3       -1     -5      -8     -5      4      3      0 Lose  
 6 Leicester        0        0     -2      -4      0     -3      1      0 Draw  
 7 Bournemo…        0        0     -8      -3     -5      3     -2      0 Draw  
 8 West Brom       -4       -2    -12      -7      7     -6      1      0 Lose  
 9 Southamp…       -2       -1     -4      -6     -3      4      0      0 Lose  
10 Swansea         -2       -3     -3      -2     -7      2     -2      0 Lose  
# ℹ 590 more rows
# ℹ 4 more variables: wdl_ht <chr>, home_or_away <fct>, pi <dbl>,
#   pl_model_regression_testing_prediction <chr>

pl_model_regression_testing_tab <- table(pl_model_regression_testing_final$home_or_away, pl_model_regression_testing_final$pl_model_regression_testing_prediction, dnn=c('Actual', 'Predicted'))
pl_model_regression_testing_tab

      Predicted
Actual Away Home
  Away  158  144
  Home   99  199

pl_model_regression_testing_acc <- sum(diag(pl_model_regression_testing_tab))/sum(pl_model_regression_testing_tab)
pl_model_regression_testing_acc

[1] 0.595

Predicted Actual Away Home Away 53 25 Home 20 62

From the above confusion matrix, we know:

**The overall model accuracy is (158+ 199)/160 = 0.60 or 60%.**

Of all matches the model predicted the superior team would lose, they got 158/257 = 0.61 or 61% correct.

Of all matches the model predicted the superior team would win, they got 199/343 or 58% correct.

Of all matches where the superior team lost, the model correctly identified 158/302 or 52%.

Of all matches where the superior team won, the model correctly identified 199/298 or 67%.

4 . Compare and contrast the Classification Tree model and the Binary Logistic Regression model. For example:

a. Which model is more accurate?

  Model                         Training        Testing
  Binary Logistic Regression      72%             60%
  Classification Tree             76%             58%

So for the training dataset, the classification tree was a little more accurate (76% to 72%), but for the testing dataset, the dataset that we would probably base which model we would use, the Binary Logistic Regression model is a little more accurate (60% to 58%). If we were to continue this study, we would probably prune our classification tree to see if we could get a little more accurate in our predictions.

b. How do the models compare in terms of variables that are considered important predictors of the target variable?

The Classification tree considers more variables as important predictors (1. ftg_diff (26%), 2. s_diff (17%) , 3. st_diff (17%) , 4. htg_diff (17%) , 5. c_diff (16%) , 6.f_diff (11%)), while the Binary Logisitic Regression Model considered only htg_diff and wdl_htWin.

Question 2

1. Import the dataset

Working directory is already set

baseball_hof <- read_csv("baseball_hof.csv")

2. Does the data need to be scaled before computing the distance matrix for hierarchical clustering or before being entered into the K-means clustering algorithm? Explain your answer

Yes, we do need to scale data because although some of the variables may be relatively on the same scale (hits, runs, and maybe even RBIs), home runs and stolen bases are on a much lower scale, so the higher numbered variables (aforementioned) would dominate the clustering. So we will scale the data first (note: we had to remove names first from dataset hence the new dataset baseball_no_names was created).

3. Hierchical Clustering:

a. Create a suitable distance matrix containing the Euclidean distance between all pairs of players.

baseball_no_names <- select(baseball_hof, hits:stolen_bases)

baseball_hof_scale <- scale(baseball_no_names) #Scaling Data
distance_1 <- dist(baseball_hof_scale) #Computing Distance

b. Carry out a hierarchical clustering using the hclust function. You must use method = ‘ward.D’.

h1 <- hclust(distance_1, method = 'ward.D')

plot(h1, hang= -1 , cex = 0.6 ,  ylim = c(0, 10))
abline(h = 8, col = "red")  # Adjust height threshold (e.g., h = 10)

# Set plot size
options(repr.plot.width = 2, repr.plot.height = 6)

c. Visualise the results of the hierarchical clustering using a dendrogram and a heatmap. Note that the heatmap will take several seconds to appear because there is a large number of players in the dataset. i. Does the heatmap provide evidence of any clustering structure within the dataset? Explain your answer.

The dendrogram is above and the heatmap is here below:

heatmap(as.matrix(distance_1), Rowv = as.dendrogram(h1), Colv = 'Rowv')

The heatmap shows some pretty big clustering amongst the data. The bottom righthand corner of the heatmap is very light orange showing a pretty big cluster there, and then as you go up the diagonal line of symmetry on the heatmap, there are a couple of smaller light orange boxes again. There is another fairly big light square in the upper left hand corner of the map as well.

d. Create a 4-cluster solution and assess the quality of this solution.

library(cluster)  # Load the cluster package


clusters1 <- cutree(h1, k = 4)

sil1 <- silhouette(clusters1, distance_1)

summary(sil1)

Silhouette of 82 units in 4 clusters from silhouette.default(x = clusters1, dist = distance_1) :
 Cluster sizes and average silhouette widths:
       17        25        30        10 
0.3202221 0.2078566 0.2958375 0.4325219 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1771  0.2100  0.3257  0.2907  0.4224  0.5725

So of our 4 clusters, the sizes of those clusters (as seen above) were

Cluster 1 17 Cluster 2 25 Cluster 3 30 Cluster 4 10

which has all 82 players in a cluster. The overall mean of the clusters is .29, which is weak. The silhouette scores (.32, .21, .30, .43) show that there are no strong clusters here, and that Clusters 1, 3, and 4 are weak and could be artificial, and that Cluster 2 is has no real substantial structure. If we were reporting this study, we would have to report that there are no substantial clustering structure here.

e. Use tables and suitable graphs to help you describe the properties that seem common to each cluster of players.

library(dplyr)

baseball_hof_clus <- cbind(baseball_hof, baseball_hof_scale, clusters1)

#This puts player names, their scaled version of the variables and the clusters that put the players together



#Updating Column Names so we're not confused

colnames(baseball_hof_clus) <- c("playerID", "hits", "runs", "home_runs","rbi", 
   "stolen_bases", "hits_scaled", "runs_scaled", "home_runs_scaled", "rbi_scaled", 
   "stolen_bases_scaled", "clusters1" )



# Update values in the "clusters1" column
baseball_hof_clus$clusters1 <- case_when(
  baseball_hof_clus$clusters1 == 1 ~ 'C1', 
  baseball_hof_clus$clusters1 == 2 ~ 'C2', 
  baseball_hof_clus$clusters1 == 3 ~ 'C3', 
  baseball_hof_clus$clusters1 == 4 ~ 'C4'
)

#Calculating the mean of the scaled variables

#Calculate mean value of all scaled variables for each cluster. Why not calculate the mean value of the original variables?
baseball_hof_clus_means <- baseball_hof_clus %>%
                      group_by(clusters1) %>%
                      summarise(mean_hits = mean(hits_scaled),
                                mean_runs = mean(runs_scaled),
                                mean_home_runs = mean(home_runs_scaled),
                               mean_rbi = mean(rbi_scaled),
                                mean_stolen_bases = mean(stolen_bases_scaled))

baseball_hof_clus_means

# A tibble: 4 × 6
  clusters1 mean_hits mean_runs mean_home_runs mean_rbi mean_stolen_bases
  <chr>         <dbl>     <dbl>          <dbl>    <dbl>             <dbl>
1 C1            0.804     0.973          0.984    1.26             -0.329
2 C2            0.617     0.430         -0.963   -0.662             1.12 
3 C3           -0.427    -0.396          0.456    0.222            -0.532
4 C4           -1.63     -1.54          -0.634   -1.15             -0.634

#To get rid of negative means

# Function to convert negative values to positive
convert_to_positive <- function(x) {
  ifelse(x < 0, abs(x), x)
}

# Apply the function to all columns using mutate_if
positive_mean_tibble_baseball_hof_clus <- baseball_hof_clus_means %>%
  mutate_if(is.numeric, convert_to_positive)


#Convert tibble to dataset so that we can graph 


baseball_hof_final <- positive_mean_tibble_baseball_hof_clus %>%
  pivot_longer(cols = c( mean_hits, mean_runs, mean_home_runs, mean_rbi, mean_stolen_bases), names_to = "Event", values_to = "Average_Value") 

baseball_hof_final

# A tibble: 20 × 3
   clusters1 Event             Average_Value
   <chr>     <chr>                     <dbl>
 1 C1        mean_hits                 0.804
 2 C1        mean_runs                 0.973
 3 C1        mean_home_runs            0.984
 4 C1        mean_rbi                  1.26 
 5 C1        mean_stolen_bases         0.329
 6 C2        mean_hits                 0.617
 7 C2        mean_runs                 0.430
 8 C2        mean_home_runs            0.963
 9 C2        mean_rbi                  0.662
10 C2        mean_stolen_bases         1.12 
11 C3        mean_hits                 0.427
12 C3        mean_runs                 0.396
13 C3        mean_home_runs            0.456
14 C3        mean_rbi                  0.222
15 C3        mean_stolen_bases         0.532
16 C4        mean_hits                 1.63 
17 C4        mean_runs                 1.54 
18 C4        mean_home_runs            0.634
19 C4        mean_rbi                  1.15 
20 C4        mean_stolen_bases         0.634

#Now we can visualise the data

ggplot(baseball_hof_final, mapping = aes(x = Event, y = Average_Value, group = clusters1, colour = clusters1)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
  ylab("Mean KPIs") + 
  scale_x_discrete(labels= c("Hits", "Runs", "Home Runs", "RBIs", "Stolen Bases" )) +
  ggtitle("Mean KPI for each Cluster for each Event")

So from the line graph above we can see that the players in Cluster 3 (C3) are statistically the worst of the four clusters, so if this is a dataset that is trying to figure out who should be inducted into the Baseball Hall of Fame, then this cluster of players should probably be the last group to be viewed to get in. Cluster 2 contain players that get on base the most, and not necessarily through hits, so these would be speedy players who either get walked a lot (baseball term when a pitcher throws four bad pitches and the batter is awarded being sent to first base) or through bunts or they may even be a designated runner, players who don’t even hit but are subbed in when other hitters get on base. We can confirm that this cluster contains the most speedy players as well from the fact that they have the most stolen bases by a fairly large margin. Cluster 1 are heavy hitters that swing for the fences. They probably strike out the most because they are swinging with power trying to hit the ball out of the park. The are probably a bit heavier, as we see they don’t steal many bases. It’s surprising they don’t have more RBIs (as I’d assume they would hit 4th in the batting order which usually means other people are on base already), but guys like Babe Ruth and Hank Aaron are probably in this group (and yes, upon checking the original dataset, both of those players are in Cluster 1!). Cluster 4 then are the efficiency guys, guys who constantly make contact with the ball (the most hits), get on base the most, drive in the most runs (RBIs), can still hit home runs at a high rate, and still steal bases at a pretty good rate as well–we’d call them the jacks-of-all-trades.

4. K-means Clustering:

a. Carry out a K-means clustering that will produce 4 clusters. Remember to use set.seed(101) to ensure that your results are reproducible.

Our dataset (baseball_hof) is already loaded and the working directory already set. We have also already created a dataset without the athlete names (baseball_no_names), and we have also scaled the data already (baseball_hof_scale), so we can get straight into carrying out the K-means clustering.

set.seed(101)
kmeans1 <- kmeans(baseball_hof_scale, centers = 4)

kmeans1$cluster #The assignment of each observation (i.e. athlete) to each cluster.

 [1] 4 2 2 1 1 3 1 2 2 3 1 2 3 2 3 2 3 2 3 1 3 1 1 4 2 3 4 1 2 3 1 2 1 1 1 1 2 1
[39] 3 2 3 1 3 1 4 1 3 2 2 4 4 4 1 3 3 2 1 4 1 4 3 1 4 3 1 1 2 2 1 2 1 3 1 1 3 2
[77] 2 1 4 4 4 2

kmeans1$centers #The centroids for each cluster.

        hits       runs  home_runs        rbi stolen_bases
1 -0.2044605 -0.1624618  0.5660315  0.3798416   -0.4849247
2  0.7521677  0.5537557 -0.9804896 -0.5809428    1.2098245
3 -1.1508405 -1.2010894 -0.5701789 -0.8728960   -0.4647733
4  0.8494748  1.1682310  1.2734836  1.4407847   -0.3236583

kmeans1$size    #The number of observations in each cluster.

[1] 28 22 19 13

kmeans1$iter    #The number of iterations it took to converge to the final clustering solution.

[1] 3

b. Assess the quality of the clustering solution by calculating silhouette scores.

 # distance_1 has already been created which is the distance between the groups 


 sil_kmeans1 <- silhouette(kmeans1$cluster, distance_1)
summary(sil_kmeans1)

Silhouette of 82 units in 4 clusters from silhouette.default(x = kmeans1$cluster, dist = distance_1) :
 Cluster sizes and average silhouette widths:
       28        22        19        13 
0.3322757 0.2178155 0.2687416 0.3421893 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.06303  0.18425  0.28984  0.28842  0.41869  0.54977

Score Interpretation 0.71-1 A strong clustering structure has been found. 0.51-0.7 A reasonable clustering structure has been found. 0.25-0.5 The structure is weak and could be artificial. < 0.25 No substantial structure has been found.

Assessing quality of clustering So from this score interpretation, we can see that the overall clustering mean is .29, which means the structure is weak and could be artificial.

From the individiual clusters, the first (.33), third (.27), and fourth (.34) clusters all have weak and possibly artificial structures as well, and the second cluster (.22) has no substantial structure, which is not too unlike what we found with the hierarchical clustering (in which we also had three clusters with weak structures and one that was not substantial).

library(tidyr)

baseball_clus_kmeans_tidy <- as_tibble(kmeans1$centers) %>%
         mutate(Cluster = c("C1", "C2", "C3", "C4")) %>% 
         pivot_longer(cols = c(hits, runs, home_runs, rbi, stolen_bases), 
    names_to = "KPI", values_to = "Average_Value") 

# Define a function to convert negative values to positive
convert_to_positive <- function(x) {
  ifelse(x < 0, abs(x), x)
}

# Apply the function to numeric columns using mutate_if
baseball_clus_kmeans_tidy <- baseball_clus_kmeans_tidy %>%
  mutate_if(is.numeric, convert_to_positive)

Graphing and interpreting clusters

library(ggplot2)

ggplot(baseball_clus_kmeans_tidy, mapping = aes(x = KPI, y = Average_Value, group = Cluster, colour = Cluster)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
  ylab("Mean KPI") + 
  scale_x_discrete(labels= c("Hits", "Runs", "Home Runs", "RBIs", "Stolen Bases")) +
  ggtitle("Mean for each Cluster for each KPI")

So these clusters are noticeably different than the Hierarchical Clustering results. We can still see that Cluster 4 is the power group–the group that has the power hitters that are probably bigger in size, hit a lot of home runs and don’t steal bases much. This cluster, though, does have higher RBIs and also more runs than the similar cluster in hierarchical. Cluster 1, similar to the last clustering exercise, is the worst players statistically in every category. Cluster 2 has our faster, probably slimmer players (think Ricky Henderson) who steal a lot of bases but don’t have a lot of power and do not hit too many home runs. And then Cluster3 are the jacks-of-all-trades again. They are the most efficient hitters that drive in a lot of runs, are pretty decent home run hitters, and in this group do not steal too many bases.

5. Compare and contrast the clusters produced by Hierarchical Clustering and K-means. For example:

a. Which algorithm produced the highest quality clusters?

So neither algorithm produced high quality clusters statistically. I'd be interested to see if we ran the K-means again with three clusters would the results be different. The overall mean from the Hierarchical was .29 and from K-means was also .29, both weak and possibly artificial.

b. Did both algorithms produce clusters with a similar profile? Or are they any noticeable differences?

Although the graphs do look different at first glance, the profiles of the clusters are fairly similar for both algorithms. These two examples below are the “Power hitter” clusters that we noticed from the graphs. I think the K-means algorithm, in this case, did a better job collecting more power hitters, as there is a big difference of the mean of home runs of that cluster compared to C1 in the Hierarchical compared to the other clusters in that algorithm. It would be interesting to see what the data of both algorithms looked like if we took out all of the weaker player clusters from both and then ran the algorithms with the remaining players. The K-means is much easier to run and interpret, while the Hierarchical takes a lot more data manipulation to get the desired results, but both were interesting and made it fun looking through statistics of baseball players that I grew up watching and looking up to.

C1-Hierarchical
C1 mean_hits 0.8041640
C1 mean_runs 0.9734091
C1 mean_home_runs 0.9842003
C1 mean_rbi 1.2601753
C1 mean_stolen_bases 0.3293725

C4-K-means 4 0.8494748
1.1682310
1.2734836
1.4407847
0.3236583