Sports Analytics and Insights Assignment 2

Author

Pauric O’Shea – c00248935

Question 1 – Classify Teams as Home or Away

1. Import the pl_training.csv and pl_testing.csv datasets into R.

# Loading Packages and Importing Datasets
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rpart)
library(rattle)
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
library(dplyr) ## for data manipulation (already attached via the tidyverse)
library(readr)
pl_training <- read_csv("pl_training.csv")
Rows: 600 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): team, wdl_ft, wdl_ht, home_or_away
dbl (8): ftg_diff, htg_diff, s_diff, st_diff, f_diff, c_diff, y_diff, r_diff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(pl_training)
pl_testing <- read_csv("pl_testing.csv")
Rows: 160 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): team, wdl_ft, wdl_ht, home_or_away
dbl (8): ftg_diff, htg_diff, s_diff, st_diff, f_diff, c_diff, y_diff, r_diff

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(pl_testing)

2. Classification Tree Method:

  1. Create and visualise a classification tree model that will allow you to classify a team as either the home or the away team.
#create classification tree model
pl_model_tree <- rpart(home_or_away ~ + c_diff + f_diff + r_diff, data = pl_training, method = 'class')
fancyRpartPlot(pl_model_tree)

  1. Interpret the classification tree:
  1. If a team has a c_diff of 0.5 or more, it is predicted to be the home team. This leaf's purity is 58%: teams whose corner difference follows this path end up being the home team 58% of the time.
  2. If a team has a c_diff of less than -3.5, it is predicted to be the away team. This leaf's purity is about 64%: teams whose corner difference follows this path end up being the away team roughly 64% of the time.
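A quick way to check these rules is to pass a couple of hypothetical teams through the fitted tree; the values below are made up purely for illustration.
# Sketch: two hypothetical teams, one on each side of the c_diff = 0.5 split
example_teams <- data.frame(c_diff = c(3, -5), f_diff = c(0, 0), r_diff = c(0, 0))
predict(pl_model_tree, newdata = example_teams, type = 'class')
predict(pl_model_tree, newdata = example_teams, type = 'prob')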
# iii)
pl_model_tree$variable.importance 
  c_diff   f_diff 
8.643357 5.273495 
summary(pl_model_tree) 
Call:
rpart(formula = home_or_away ~ +c_diff + f_diff + r_diff, data = pl_training, 
    method = "class")
  n= 600 

          CP nsplit rel error    xerror       xstd
1 0.14429530      0 1.0000000 1.1006711 0.04091942
2 0.01677852      1 0.8557047 0.9496644 0.04103278
3 0.01000000      4 0.8053691 0.9295302 0.04097786

Variable importance
c_diff f_diff 
    62     38 

Node number 1: 600 observations,    complexity param=0.1442953
  predicted class=Away  expected loss=0.4966667  P(node) =1
    class counts:   302   298
   probabilities: 0.503 0.497 
  left son=2 (325 obs) right son=3 (275 obs)
  Primary splits:
      c_diff < 0.5  to the left,  improve=6.74694600, (0 missing)
      f_diff < -2.5 to the right, improve=4.55806800, (0 missing)
      r_diff < 0.5  to the right, improve=0.04902376, (0 missing)
  Surrogate splits:
      f_diff < -1.5 to the right, agree=0.593, adj=0.113, (0 split)

Node number 2: 325 observations,    complexity param=0.01677852
  predicted class=Away  expected loss=0.4276923  P(node) =0.5416667
    class counts:   186   139
   probabilities: 0.572 0.428 
  left son=4 (130 obs) right son=5 (195 obs)
  Primary splits:
      c_diff < -3.5 to the left,  improve=1.8964100, (0 missing)
      f_diff < -2.5 to the right, improve=1.3829240, (0 missing)
      r_diff < 0.5  to the left,  improve=0.6732439, (0 missing)
  Surrogate splits:
      f_diff < 7.5  to the right, agree=0.612, adj=0.031, (0 split)

Node number 3: 275 observations
  predicted class=Home  expected loss=0.4218182  P(node) =0.4583333
    class counts:   116   159
   probabilities: 0.422 0.578 

Node number 4: 130 observations
  predicted class=Away  expected loss=0.3615385  P(node) =0.2166667
    class counts:    83    47
   probabilities: 0.638 0.362 

Node number 5: 195 observations,    complexity param=0.01677852
  predicted class=Away  expected loss=0.4717949  P(node) =0.325
    class counts:   103    92
   probabilities: 0.528 0.472 
  left son=10 (84 obs) right son=11 (111 obs)
  Primary splits:
      f_diff < 2.5  to the right, improve=1.8390360, (0 missing)
      r_diff < 0.5  to the left,  improve=1.2341880, (0 missing)
      c_diff < -2.5 to the right, improve=0.4391316, (0 missing)

Node number 10: 84 observations
  predicted class=Away  expected loss=0.3928571  P(node) =0.14
    class counts:    51    33
   probabilities: 0.607 0.393 

Node number 11: 111 observations,    complexity param=0.01677852
  predicted class=Home  expected loss=0.4684685  P(node) =0.185
    class counts:    52    59
   probabilities: 0.468 0.532 
  left son=22 (20 obs) right son=23 (91 obs)
  Primary splits:
      f_diff < -5.5 to the left,  improve=2.615543, (0 missing)
      r_diff < 0.5  to the left,  improve=2.006552, (0 missing)
      c_diff < -1.5 to the right, improve=0.638068, (0 missing)

Node number 22: 20 observations
  predicted class=Away  expected loss=0.3  P(node) =0.03333333
    class counts:    14     6
   probabilities: 0.700 0.300 

Node number 23: 91 observations
  predicted class=Home  expected loss=0.4175824  P(node) =0.1516667
    class counts:    38    53
   probabilities: 0.418 0.582 
  1. c_diff is the more important predictor, with a variable importance of 62 compared with 38 for f_diff. r_diff, on the other hand, does not appear in the variable importance output at all.
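The importance values can also be plotted directly from the fitted model; a minimal sketch:
# Bar chart of the rpart variable importance values
barplot(pl_model_tree$variable.importance,
        ylab = "Importance", main = "Classification tree variable importance")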

  2. Assess the accuracy of the classification tree using both the training and the testing datasets.

#Accuracy on Training data
pl_training_tree_prob <- predict(pl_model_tree, newdata = pl_training, type = 'prob')
pl_training_tree_prediction <- predict(pl_model_tree, newdata = pl_training, type = 'class')
pl_training_tree_final <- cbind(pl_training, pl_training_tree_prob, pl_training_tree_prediction)
head(pl_training_tree_final)
            team ftg_diff htg_diff s_diff st_diff f_diff c_diff y_diff r_diff
1        Watford        1        0     14      -2      5      5     -1      0
2    Southampton        0        0    -12      -2     -3     -7      0      0
3 Crystal Palace        1        0      1      -2     -7      1     -2      0
4    Bournemouth        1        1     -1      -1     -4      2      0      0
5       West Ham       -3       -1     -5      -8     -5      4      3      0
6      Leicester        0        0     -2      -4      0     -3      1      0
  wdl_ft wdl_ht home_or_away      Away      Home pl_training_tree_prediction
1    Win   Draw         Home 0.4218182 0.5781818                        Home
2   Draw   Draw         Home 0.6384615 0.3615385                        Away
3    Win   Draw         Home 0.4218182 0.5781818                        Home
4    Win    Win         Home 0.4218182 0.5781818                        Home
5   Lose   Lose         Home 0.4218182 0.5781818                        Home
6   Draw   Draw         Home 0.4175824 0.5824176                        Home
pl_training_tree_tab <- table(pl_training_tree_final$home_or_away, pl_training_tree_final$pl_training_tree_prediction, dnn = c('Actual', 'Predicted'))
pl_training_tree_tab
      Predicted
Actual Away Home
  Away  148  154
  Home   86  212
pl_training_tree_acc <- sum(diag(pl_training_tree_tab))/sum(pl_training_tree_tab)
pl_training_tree_acc
[1] 0.6
#Accuracy on Testing data
pl_testing_tree_prob <- predict(pl_model_tree, newdata = pl_testing, type = 'prob')
pl_testing_tree_prediction <- predict(pl_model_tree, newdata = pl_testing, type = 'class')
pl_testing_tree_final <- cbind(pl_testing, pl_testing_tree_prob, pl_testing_tree_prediction)
head(pl_testing_tree_final)
           team ftg_diff htg_diff s_diff st_diff f_diff c_diff y_diff r_diff
1       Burnley       -2       -2     -4      -6      1      0      1      0
2     Leicester        0        0     -5      -3      4     -5     -1      0
3       Burnley       -1        0     -8      -2     -1     -4      0      0
4    Sunderland       -3       -1    -18      -6     -4     -8      1      0
5     West Brom       -2       -1      0      -4     -4     -1      0      0
6 Middlesbrough        0        0     -1      -3      1      0      3      0
  wdl_ft wdl_ht home_or_away      Away      Home pl_testing_tree_prediction
1   Lose   Lose         Home 0.4175824 0.5824176                       Home
2   Draw   Draw         Home 0.6384615 0.3615385                       Away
3   Lose   Draw         Home 0.6384615 0.3615385                       Away
4   Lose   Lose         Home 0.6384615 0.3615385                       Away
5   Lose   Lose         Home 0.4175824 0.5824176                       Home
6   Draw   Draw         Home 0.4175824 0.5824176                       Home
pl_testing_tree_tab <- table(pl_testing_tree_final$home_or_away, pl_testing_tree_final$pl_testing_tree_prediction, dnn = c('Actual', 'Predicted'))
pl_testing_tree_tab
      Predicted
Actual Away Home
  Away   41   37
  Home   26   56
pl_testing_tree_acc <- sum(diag(pl_testing_tree_tab))/sum(pl_testing_tree_tab)
pl_testing_tree_acc
[1] 0.60625
  1. Training Data:

    • The overall model accuracy is (148 + 212)/600 or 60%.

    • Of all teams the model predicted to be the away team, 148/234 or 63% were correct.

    • Of all teams the model predicted to be the home team, 212/366 or 58% were correct.

    • Of all teams that were actually the away team, the model correctly identified 148/302 or 49%.

    • Of all teams that were actually the home team, the model correctly identified 212/298 or 71%.

    Testing Data:

    • The overall model accuracy is (41 + 56)/160 or 61%.

    • Of all teams the model predicted to be the away team, 41/67 or 61% were correct.

    • Of all teams the model predicted to be the home team, 56/93 or 60% were correct.

    • Of all teams that were actually the away team, the model correctly identified 41/78 or 53%.

    • Of all teams that were actually the home team, the model correctly identified 56/82 or 68%.
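The rates quoted above can be computed directly from the confusion matrices rather than by hand. The helper below is only an illustrative sketch, not part of the assignment code:
# Accuracy, precision (per predicted class) and recall (per actual class) from a 2x2 table
class_rates <- function(tab) {
  list(accuracy  = sum(diag(tab)) / sum(tab),
       precision = diag(tab) / colSums(tab),  # correct / total predicted for each class
       recall    = diag(tab) / rowSums(tab))  # correct / total actual for each class
}
class_rates(pl_training_tree_tab)
class_rates(pl_testing_tree_tab)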

I do not think the classification tree is overfitting, because the difference between the training and testing accuracy is only about 1 percentage point (60% vs 61%).
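A further check for overfitting is the cross-validated error that rpart reports for each candidate tree size; a minimal sketch:
# Cross-validated error (xerror) and relative error for each tree size
printcp(pl_model_tree)
plotcp(pl_model_tree)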

3. Binary Logistic Regression Method:

  1. Create a binary logistic regression model that will allow you to classify a team as either the home or away team.
# Setting up levels of response variable
pl_training$home_or_away <- factor(pl_training$home_or_away, levels = c("Home", "Away"))
pl_testing$home_or_away <- factor(pl_testing$home_or_away, levels = c("Home", "Away"))

levels(pl_training$home_or_away)
[1] "Home" "Away"
levels(pl_testing$home_or_away)
[1] "Home" "Away"
# Create the binary logistic regression model
pl_model_lr <- glm(home_or_away ~ c_diff + f_diff + r_diff, data = pl_training, family = binomial(link = 'logit'))
summary(pl_model_lr)

Call:
glm(formula = home_or_away ~ c_diff + f_diff + r_diff, family = binomial(link = "logit"), 
    data = pl_training)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.01845    0.08314   0.222 0.824399    
c_diff      -0.07041    0.01877  -3.751 0.000176 ***
f_diff       0.03091    0.01743   1.773 0.076279 .  
r_diff      -0.06610    0.24761  -0.267 0.789516    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 831.75  on 599  degrees of freedom
Residual deviance: 810.63  on 596  degrees of freedom
AIC: 818.63

Number of Fisher Scoring iterations: 4
  1. y = ln(π/(1 - π)) = 0.018 - 0.070·c_diff + 0.031·f_diff - 0.066·r_diff, where π is the modelled probability that the team is the away team (the second level of the home_or_away factor).

  2. The p-value for r_diff is 0.790, so we fail to reject H0 and conclude that r_diff is not a significant predictor of whether a team is the home or the away team. Similarly, f_diff (p = 0.076) is not significant at the 5% level. The only significant predictor is c_diff (p = 0.0002).

  3. c_diff: coefficient -0.07041, odds ratio = e^-0.070 ≈ 0.932. Because the model estimates the probability of being the away team, for every one-unit increase in c_diff the odds of a team being the away team decrease by approximately 6.8%; equivalently, teams with a higher corner difference are more likely to be the home team. f_diff: coefficient 0.03091, odds ratio = e^0.031 ≈ 1.031, so for every one-unit increase in f_diff the odds of being the away team increase by approximately 3.1%. r_diff: coefficient -0.06610, odds ratio = e^-0.066 ≈ 0.936, so for every one-unit increase in r_diff the odds of being the away team decrease by approximately 6.4%.
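The odds ratios above can also be obtained straight from the fitted model; a minimal sketch (the intervals use the Wald approximation):
# Odds ratios (exponentiated coefficients) and approximate 95% confidence intervals
exp(coef(pl_model_lr))
exp(confint.default(pl_model_lr))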

  1. Fully assess the accuracy of the logistic regression model using both the training and the testing datasets.
#First apply to training dataset

pl_training_lr_pi <- predict(pl_model_lr, newdata = pl_training, type = 'response')
pl_training_lr_pi
        1         2         3         4         5         6         7         8 
0.4553542 0.6031440 0.4333259 0.4388118 0.3970660 0.5571664 0.4140378 0.6586373 
        9        10        11        12        13        14        15        16 
0.4119537 0.4161250 0.5452686 0.6210729 0.5397318 0.3720803 0.6454042 0.5166307 
       17        18        19        20        21        22        23        24 
0.4194603 0.5516554 0.4301590 0.5863701 0.4737313 0.3619026 0.4737313 0.5990214 
       25        26        27        28        29        30        31        32 
0.4715885 0.6434344 0.5341852 0.4827190 0.7448009 0.6636351 0.6074438 0.6137729 
       33        34        35        36        37        38        39        40 
0.4947370 0.4324712 0.6477421 0.4312162 0.7311895 0.5101880 0.5001129 0.4737313 
       41        42        43        44        45        46        47        48 
0.4409299 0.5299046 0.6002494 0.4771510 0.3730821 0.6434344 0.5221994 0.5649817 
       49        50        51        52        53        54        55        56 
0.4388118 0.4464362 0.5757113 0.5590883 0.5948847 0.4398737 0.5990214 0.4161250 
       57        58        59        60        61        62        63        64 
0.4694468 0.5969548 0.4825183 0.5507942 0.5702531 0.4737313 0.5550443 0.5882588 
       65        66        67        68        69        70        71        72 
0.5948847 0.6023104 0.4366959 0.4215551 0.4464362 0.4968860 0.3950098 0.5243439 
       73        74        75        76        77        78        79        80 
0.6394801 0.4958082 0.5778098 0.6158088 0.4119537 0.3909085 0.5418667 0.4934584 
       81        82        83        84        85        86        87        88 
0.5452686 0.6178407 0.5691962 0.4792960 0.3762947 0.5503899 0.6674622 0.6188581 
       89        90        91        92        93        94        95        96 
0.5431362 0.5341852 0.4880906 0.5144836 0.4904396 0.5179080 0.5702531 0.4617573 
       97        98        99       100       101       102       103       104 
0.4904396 0.4540859 0.5234755 0.5410023 0.4249022 0.5384608 0.3183981 0.4857356 
      105       106       107       108       109       110       111       112 
0.4419865 0.5331187 0.3896913 0.4519557 0.5011843 0.5375955 0.5341852 0.4291089 
      113       114       115       116       117       118       119       120 
0.4981648 0.4737313 0.4660332 0.4161250 0.4562177 0.5874154 0.5200541 0.4161250 
      121       122       123       124       125       126       127       128 
0.6000565 0.5516554 0.5461317 0.3638902 0.6884593 0.5840882 0.4270043 0.4968860 
      129       130       131       132       133       134       135       136 
0.5277625 0.6315150 0.4681730 0.4737313 0.4161250 0.5123360 0.3823666 0.5410023 
      137       138       139       140       141       142       143       144 
0.4981648 0.5957235 0.5264875 0.4594217 0.4065603 0.6263087 0.4293059 0.5388668 
      145       146       147       148       149       150       151       152 
0.5605469 0.5187772 0.3750949 0.6043678 0.5333188 0.4882914 0.5907345 0.4011894 
      153       154       155       156       157       158       159       160 
0.4562177 0.4574870 0.4762825 0.4379546 0.4485616 0.5011843 0.3528312 0.6355069 
      161       162       163       164       165       166       167       168 
0.3658824 0.4098728 0.4228029 0.4801650 0.5689991 0.3888698 0.4673063 0.6064214 
      169       170       171       172       173       174       175       176 
0.4065603 0.3547966 0.5256195 0.5571664 0.4485616 0.4694468 0.3446505 0.4519557 
      177       178       179       180       181       182       183       184 
0.5799055 0.4848659 0.4119537 0.4182151 0.3547966 0.5089096 0.5928113 0.4421912 
      185       186       187       188       189       190       191       192 
0.4638946 0.5011843 0.2932505 0.3281583 0.5179080 0.3369257 0.5277625 0.3742793 
      193       194       195       196       197       198       199       200 
0.4913097 0.4003533 0.5702531 0.5101880 0.3771121 0.4077951 0.6335132 0.4379546 
      201       202       203       204       205       206       207       208 
0.4825183 0.4065603 0.4464362 0.5277625 0.3909085 0.5757113 0.5155540 0.4913097 
      209       210       211       212       213       214       215       216 
0.4673063 0.5723585 0.6624923 0.4651669 0.4228029 0.4498272 0.4011894 0.5254191 
      217       218       219       220       221       222       223       224 
0.4303625 0.6023104 0.5473993 0.4270043 0.5757113 0.4694468 0.4345824 0.4870134 
      225       226       227       228       229       230       231       232 
0.5647777 0.3202666 0.6251107 0.4673063 0.4562177 0.3638902 0.4119537 0.6230940 
      233       234       235       236       237       238       239       240 
0.5605469 0.5639217 0.3291038 0.5495283 0.3427114 0.3876486 0.4215551 0.5957235 
      241       242       243       244       245       246       247       248 
0.4215551 0.3950098 0.4979573 0.4194603 0.4498272 0.5101880 0.5736101 0.4638946 
      249       250       251       252       253       254       255       256 
0.5884535 0.6293187 0.6465740 0.3497065 0.2905824 0.3803384 0.6567018 0.4562177 
      257       258       259       260       261       262       263       264 
0.4194603 0.5299046 0.3970660 0.5529202 0.2939726 0.4098728 0.2914720 0.5307717 
      265       266       267       268       269       270       271       272 
0.3896913 0.3876486 0.5884535 0.5144836 0.5144836 0.4630288 0.4400718 0.2886484 
      273       274       275       276       277       278       279       280 
0.4057207 0.2480773 0.3864351 0.3465948 0.4430501 0.3438646 0.4814419 0.4173683 
      281       282       283       284       285       286       287       288 
0.4291089 0.5431362 0.4846652 0.4409299 0.6477421 0.3407775 0.3710736 0.5431362 
      289       290       291       292       293       294       295       296 
0.5786589 0.4194603 0.3856100 0.2968265 0.4649669 0.3340481 0.3015038 0.4303625 
      297       298       299       300       301       302       303       304 
0.3073105 0.4827190 0.5537807 0.4057207 0.5757113 0.4968860 0.3876486 0.5702531 
      305       306       307       308       309       310       311       312 
0.4421912 0.4519557 0.5948847 0.3962329 0.5928113 0.4638946 0.3876486 0.4694468 
      313       314       315       316       317       318       319       320 
0.6364989 0.4968860 0.4077951 0.3630847 0.4925882 0.6323248 0.5894972 0.4574870 
      321       322       323       324       325       326       327       328 
0.5788612 0.4226068 0.5354578 0.6465740 0.3793228 0.4771510 0.5354578 0.4098728 
      329       330       331       332       333       334       335       336 
0.3650751 0.5264875 0.2622755 0.3446505 0.4013888 0.3950098 0.5144836 0.5765615 
      337       338       339       340       341       342       343       344 
0.3607222 0.5778098 0.2761244 0.4990351 0.5091105 0.3319601 0.5354578 0.5681451 
      345       346       347       348       349       350       351       352 
0.4792960 0.4086361 0.5320455 0.6355069 0.3650751 0.4870134 0.4441078 0.5702531 
      353       354       355       356       357       358       359       360 
0.4333259 0.4500262 0.4140378 0.5691962 0.4098728 0.4771510 0.5605469 0.5928113 
      361       362       363       364       365       366       367       368 
0.5563073 0.5397318 0.4119537 0.4583512 0.4388118 0.5354578 0.4540859 0.4207064 
      369       370       371       372       373       374       375       376 
0.4140378 0.4065603 0.5874154 0.5626634 0.5123360 0.6137729 0.4835884 0.4848659 
      377       378       379       380       381       382       383       384 
0.5134132 0.4312162 0.5853303 0.3929573 0.5969548 0.6178407 0.4673063 0.4638946 
      385       386       387       388       389       390       391       392 
0.3909085 0.4398737 0.5299046 0.6323248 0.5243439 0.4587568 0.3898825 0.4660332 
      393       394       395       396       397       398       399       400 
0.5211236 0.4947370 0.4173683 0.4715885 0.5187772 0.4913097 0.5354578 0.4388118 
      401       402       403       404       405       406       407       408 
0.5473993 0.5550443 0.4857356 0.6997568 0.4681730 0.5840882 0.5765615 0.4707211 
      409       410       411       412       413       414       415       416 
0.6895554 0.5003139 0.4760755 0.6190477 0.5571664 0.4956073 0.4715885 0.4750067 
      417       418       419       420       421       422       423       424 
0.5799055 0.5110579 0.5507942 0.5354578 0.5431362 0.5928113 0.4464362 0.5529202 
      425       426       427       428       429       430       431       432 
0.4215551 0.4891613 0.5928113 0.4574870 0.4324712 0.4630288 0.6446071 0.6074438 
      433       434       435       436       437       438       439       440 
0.3195092 0.4249022 0.3762947 0.5819983 0.5123360 0.5277625 0.4814419 0.3771121 
      441       442       443       444       445       446       447       448 
0.5410023 0.5354578 0.5928113 0.4968860 0.6263087 0.4681730 0.5110579 0.4827190 
      449       450       451       452       453       454       455       456 
0.6023104 0.3823666 0.5797097 0.4703140 0.4540859 0.4044879 0.4758750 0.5209230 
      457       458       459       460       461       462       463       464 
0.4421912 0.4182151 0.5529202 0.5516554 0.5329122 0.5711060 0.5605469 0.6064214 
      465       466       467       468       469       470       471       472 
0.5080395 0.3730821 0.6426353 0.6877121 0.5861751 0.5290372 0.4400718 0.4024189 
      473       474       475       476       477       478       479       480 
0.6023104 0.4835884 0.4617573 0.5605469 0.5397318 0.6636351 0.5571664 0.4291089 
      481       482       483       484       485       486       487       488 
0.5243439 0.5907345 0.6536040 0.6465740 0.6158088 0.5003139 0.4161250 0.5452686 
      489       490       491       492       493       494       495       496 
0.5080395 0.4913097 0.6712673 0.6374958 0.6343212 0.5884598 0.4388118 0.4990351 
      497       498       499       500       501       502       503       504 
0.6315150 0.6010845 0.3750949 0.5711060 0.5266879 0.6023104 0.5626634 0.5465366 
      505       506       507       508       509       510       511       512 
0.4278563 0.4814419 0.6178407 0.4333259 0.4936658 0.5418667 0.4814419 0.5080395 
      513       514       515       516       517       518       519       520 
0.5440000 0.5861751 0.6076416 0.4837892 0.4065603 0.4617573 0.5744614 0.5221994 
      521       522       523       524       525       526       527       528 
0.4443127 0.6877121 0.5254191 0.5723585 0.3835753 0.5418667 0.5529202 0.6446071 
      529       530       531       532       533       534       535       536 
0.5969548 0.3856100 0.4506890 0.5626634 0.4485616 0.4451724 0.5529202 0.6789911 
      537       538       539       540       541       542       543       544 
0.6655514 0.5874154 0.4131934 0.6137729 0.5112653 0.5765615 0.5592865 0.4990351 
      545       546       547       548       549       550       551       552 
0.4354380 0.5452686 0.4205105 0.3793290 0.3619026 0.6586373 0.7169647 0.3516640 
      553       554       555       556       557       558       559       560 
0.5529202 0.5894972 0.5234755 0.4792960 0.6117331 0.4562177 0.7136270 0.5990214 
      561       562       563       564       565       566       567       568 
0.5461317 0.7160890 0.4784272 0.6168222 0.6190477 0.6210729 0.4205105 0.4947370 
      569       570       571       572       573       574       575       576 
0.4947370 0.5461317 0.5689991 0.4140378 0.7188684 0.6031440 0.7587413 0.6543918 
      577       578       579       580       581       582       583       584 
0.5660346 0.6644118 0.5277625 0.5915759 0.6251107 0.4485616 0.5245444 0.3607222 
      585       586       587       588       589       590       591       592 
0.6374958 0.5440000 0.5822002 0.6303239 0.4660332 0.4303625 0.7788807 0.6485360 
      593       594       595       596       597       598       599       600 
0.6230940 0.7108167 0.5441994 0.6741094 0.7062095 0.5786589 0.7004877 0.5668896 
pl_training_lr_final <- pl_training %>%
                          mutate(pi = pl_training_lr_pi) %>%
                          mutate(pl_training_lr_prediction = case_when(pi > 0.5 ~ 'Home', 
                                                                       pi <= 0.5 ~ 'Away'))
pl_training_lr_final
# A tibble: 600 × 14
   team      ftg_diff htg_diff s_diff st_diff f_diff c_diff y_diff r_diff wdl_ft
   <chr>        <dbl>    <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <chr> 
 1 Watford          1        0     14      -2      5      5     -1      0 Win   
 2 Southamp…        0        0    -12      -2     -3     -7      0      0 Draw  
 3 Crystal …        1        0      1      -2     -7      1     -2      0 Win   
 4 Bournemo…        1        1     -1      -1     -4      2      0      0 Win   
 5 West Ham        -3       -1     -5      -8     -5      4      3      0 Lose  
 6 Leicester        0        0     -2      -4      0     -3      1      0 Draw  
 7 Bournemo…        0        0     -8      -3     -5      3     -2      0 Draw  
 8 West Brom       -4       -2    -12      -7      7     -6      1      0 Lose  
 9 Southamp…       -2       -1     -4      -6     -3      4      0      0 Lose  
10 Swansea         -2       -3     -3      -2     -7      2     -2      0 Lose  
# ℹ 590 more rows
# ℹ 4 more variables: wdl_ht <chr>, home_or_away <fct>, pi <dbl>,
#   pl_training_lr_prediction <chr>
pl_training_lr_tab <- table(pl_training_lr_final$home_or_away, pl_training_lr_final$pl_training_lr_prediction, dnn=c('Actual', 'Predicted'))
pl_training_lr_tab
      Predicted
Actual Away Home
  Home  171  127
  Away  131  171
pl_training_lr_acc <- sum(diag(pl_training_lr_tab))/sum(pl_training_lr_tab)
pl_training_lr_acc
[1] 0.57
#Second apply to testing dataset

pl_testing_lr_pi <- predict(pl_model_lr, newdata = pl_testing, type = 'response')
pl_testing_lr_pi
        1         2         3         4         5         6         7         8 
0.5123360 0.6210729 0.5668896 0.6125596 0.4913097 0.5123360 0.6010845 0.3762947 
        9        10        11        12        13        14        15        16 
0.6293248 0.5320455 0.6761711 0.5689991 0.5320455 0.4485616 0.4528181 0.5431362 
       17        18        19        20        21        22        23        24 
0.5256195 0.4236528 0.4848659 0.4291089 0.4421912 0.5915759 0.5375955 0.4737313 
       25        26        27        28        29        30        31        32 
0.3388489 0.5431362 0.3080521 0.5584282 0.4324712 0.5089096 0.5136140 0.4583512 
       33        34        35        36        37        38        39        40 
0.5626634 0.5765615 0.4013888 0.6323248 0.4814419 0.7132930 0.5234755 0.5550443 
       41        42        43        44        45        46        47        48 
0.5668896 0.3619026 0.4024189 0.3202666 0.5473993 0.3300564 0.3619026 0.3929573 
       49        50        51        52        53        54        55        56 
0.5668896 0.3710736 0.4205041 0.4626227 0.5811512 0.4870134 0.5277625 0.5011843 
       57        58        59        60        61        62        63        64 
0.4837892 0.4366959 0.5584282 0.4464362 0.4562177 0.4324712 0.4024189 0.4857356 
       65        66        67        68        69        70        71        72 
0.4630288 0.4345824 0.3919355 0.5948847 0.3540001 0.3835753 0.5605469 0.3960344 
       73        74        75        76        77        78        79        80 
0.4119537 0.4651669 0.4268012 0.3783144 0.2275394 0.5179080 0.3599198 0.4421912 
       81        82        83        84        85        86        87        88 
0.2914720 0.3730821 0.6117331 0.3497065 0.5969548 0.5179080 0.5375955 0.4750067 
       89        90        91        92        93        94        95        96 
0.4400718 0.5626634 0.5266879 0.5723585 0.4660332 0.3690695 0.5157613 0.3407775 
       97        98        99       100       101       102       103       104 
0.5799055 0.4750067 0.5668896 0.6693675 0.4660332 0.5187772 0.4506890 0.5234755 
      105       106       107       108       109       110       111       112 
0.5670933 0.5080395 0.4088304 0.2943118 0.4857356 0.4131934 0.5497272 0.4485616 
      113       114       115       116       117       118       119       120 
0.4904396 0.6335132 0.6076416 0.6465740 0.6555477 0.5990214 0.6198622 0.5418667 
      121       122       123       124       125       126       127       128 
0.6536040 0.4519557 0.5969548 0.6780506 0.5668896 0.7143380 0.6799244 0.4421912 
      129       130       131       132       133       134       135       136 
0.4814419 0.5179080 0.6084713 0.5221994 0.5179080 0.4366959 0.3458067 0.5592865 
      137       138       139       140       141       142       143       144 
0.5786589 0.5819983 0.4333259 0.5397318 0.4596214 0.6210729 0.5874154 0.5894972 
      145       146       147       148       149       150       151       152 
0.6064214 0.6283185 0.5744614 0.6222760 0.6617134 0.5799055 0.4660332 0.6127566 
      153       154       155       156       157       158       159       160 
0.5969548 0.5681451 0.6674622 0.4913097 0.5894972 0.7160890 0.6355069 0.5264875 
pl_testing_lr_final <- pl_testing %>%
                        mutate(pi = pl_testing_lr_pi) %>%
                        mutate(pl_testing_lr_prediction = case_when(pi > 0.5 ~ 'Home', 
                                                                    pi <= 0.5 ~ 'Away'))
pl_testing_lr_final
# A tibble: 160 × 14
   team      ftg_diff htg_diff s_diff st_diff f_diff c_diff y_diff r_diff wdl_ft
   <chr>        <dbl>    <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <chr> 
 1 Burnley         -2       -2     -4      -6      1      0      1      0 Lose  
 2 Leicester        0        0     -5      -3      4     -5     -1      0 Draw  
 3 Burnley         -1        0     -8      -2     -1     -4      0      0 Lose  
 4 Sunderla…       -3       -1    -18      -6     -4     -8      1      0 Lose  
 5 West Brom       -2       -1      0      -4     -4     -1      0      0 Lose  
 6 Middlesb…        0        0     -1      -3      1      0      3      0 Draw  
 7 Stoke            0        1    -19      -7     -1     -6      4      0 Draw  
 8 Leicester       -3       -2     -4      -8     -1      7     -2      0 Lose  
 9 Stoke           -1        0    -15      -6      5     -6      2      1 Lose  
10 Swansea         -2        1    -14      -7     -1     -2      0      0 Lose  
# ℹ 150 more rows
# ℹ 4 more variables: wdl_ht <chr>, home_or_away <fct>, pi <dbl>,
#   pl_testing_lr_prediction <chr>
pl_testing_lr_tab <- table(pl_testing_lr_final$home_or_away, pl_testing_lr_final$pl_testing_lr_prediction, dnn = c("Actual", "Predicted"))
pl_testing_lr_tab
      Predicted
Actual Away Home
  Home   47   35
  Away   24   54
pl_testing_lr_acc <- sum(diag(pl_testing_lr_tab))/sum(pl_testing_lr_tab)
pl_testing_lr_acc
[1] 0.63125
  1. Compare and contrast the Classification Tree model and the Binary Logistic Regression model.
# Compare the accuracy of all models
pl_training_lr_acc
[1] 0.57
pl_testing_lr_acc
[1] 0.63125
pl_training_tree_acc
[1] 0.6
pl_testing_tree_acc 
[1] 0.60625

i. The tree method is less accurate than the logistic regression model on the testing data (61% vs 63%), but more accurate on the training data (60% vs 57%). Because the gap between its training and testing accuracy is only about 1 percentage point, the tree model shows little sign of overfitting and its performance should carry over consistently to new data.

  1. The tree model considers c_diff to be the most important predictor, with a variable importance of 62 compared with 38 for f_diff; likewise, c_diff is the only statistically significant predictor in the logistic regression model.
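The four accuracy figures can also be gathered into a single comparison table; a minimal sketch:
# Side-by-side accuracy comparison of the two models
model_comparison <- tibble(Model = c("Classification Tree", "Logistic Regression"),
                           Training_Accuracy = c(pl_training_tree_acc, pl_training_lr_acc),
                           Testing_Accuracy  = c(pl_testing_tree_acc, pl_testing_lr_acc))
model_comparison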

Question 2 – Clustering Baseball Players

1. Import the baseball_hof.csv file into R.

# Loading Package and Importing Datasets
library(cluster)
baseball_hof <- read_csv("baseball_hof.csv")
Rows: 82 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): playerID
dbl (5): hits, runs, home_runs, rbi, stolen_bases

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(baseball_hof)

2. Does the data need to be scaled before computing the distance matrix for hierarchical clustering or before being entered into the K-means clustering algorithm? Explain your answer.

Scaling is needed for both hierarchical clustering and K-means because both rely on distance calculations. Without scaling, variables with larger magnitudes, such as career hits or RBIs, would dominate the distance matrix and disproportionately influence the clustering, producing biased groupings. Scaling puts every variable on a comparable scale so that each contributes roughly equally, allowing the clusters to reflect players' overall performance profiles rather than a single high-magnitude statistic.
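The effect of scaling can be checked directly on the baseball data; a minimal sketch using the numeric columns of the imported dataset:
# Standard deviations on the raw scale typically vary widely between the variables
sapply(select(baseball_hof, hits:stolen_bases), sd)
# After scale(), every variable has standard deviation 1 (and mean 0)
apply(scale(select(baseball_hof, hits:stolen_bases)), 2, sd)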

3. Hierarchical Clustering:

# A. #### Creating a subset without the Player ID variable because it can't be used in the cluster analysis. However, we will need it later.
bhof <- select(baseball_hof, hits:stolen_bases)
View(baseball_hof)

##### Compute distances between each pair of players
bhof_scale <- scale(bhof)
bhof1 <- dist(bhof_scale)
#B. ##### Hierarchical clustering using Ward.
h1 <- hclust(bhof1, method = 'ward.D')
#C. ##### Creating a Dendrogram and a heatmap. 
plot(h1, hang = -1)

heatmap(as.matrix(bhof1), Rowv = as.dendrogram(h1), Colv = 'Rowv', labRow = F, labCol = F)

  1. Yes, the heatmap provides some evidence of clustering structure within the dataset: light-coloured blocks are visible along the diagonal (roughly six of them), which indicates some degree of grouping among the players.
#D. ##### 4 Clusters
clusters1 <- cutree(h1, k=4)

##### Assess the quality of the segmentation
sil1 <- silhouette(clusters1, bhof1) 
summary(sil1)
Silhouette of 82 units in 4 clusters from silhouette.default(x = clusters1, dist = bhof1) :
 Cluster sizes and average silhouette widths:
       17        25        30        10 
0.3202221 0.2078566 0.2958375 0.4325219 
Individual silhouette widths:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.1771  0.2100  0.3257  0.2907  0.4224  0.5725 
##### Profile the clusters. 

#Combine original dataset 
bhof_clus <- cbind(baseball_hof, bhof_scale, clusters1)
colnames(bhof_clus) <- c("PlayerID", "hits", "runs", "home_runs","rbi", "stolen_bases",
                         "hits_s", "runs_s", "home_runs_s","rbi_s", "stolen_bases_s", "clusters1")
bhof_clus <- mutate(bhof_clus, Cluster = case_when(clusters1 == 1 ~ 'C1',
                                                       clusters1 == 2 ~ 'C2',
                                                       clusters1 == 3 ~ 'C3',
                                                       clusters1 == 4 ~ 'C4'))
View(bhof_clus)

#Calculating mean value of all scaled variables for each cluster. 
bhof_clus_means <- bhof_clus %>%
                      group_by(Cluster) %>%
                      summarise(hits = mean(hits_s),
                                runs = mean(runs_s),
                                home_runs = mean(home_runs_s),
                                rbi = mean(rbi_s),
                                stolen_bases = mean(stolen_bases_s))

bhof_clus_means
# A tibble: 4 × 6
  Cluster   hits   runs home_runs    rbi stolen_bases
  <chr>    <dbl>  <dbl>     <dbl>  <dbl>        <dbl>
1 C1       0.804  0.973     0.984  1.26        -0.329
2 C2       0.617  0.430    -0.963 -0.662        1.12 
3 C3      -0.427 -0.396     0.456  0.222       -0.532
4 C4      -1.63  -1.54     -0.634 -1.15        -0.634
#E. ####Convert the dataset to be in "tidy" format to allow for creation of line graph.
bhof_clus_tidy <- bhof_clus_means %>%
  pivot_longer(cols = c(hits, runs, home_runs, rbi, stolen_bases), names_to = "KPIS", values_to = "Average_Value") 

bhof_clus_tidy
# A tibble: 20 × 3
   Cluster KPIS         Average_Value
   <chr>   <chr>                <dbl>
 1 C1      hits                 0.804
 2 C1      runs                 0.973
 3 C1      home_runs            0.984
 4 C1      rbi                  1.26 
 5 C1      stolen_bases        -0.329
 6 C2      hits                 0.617
 7 C2      runs                 0.430
 8 C2      home_runs           -0.963
 9 C2      rbi                 -0.662
10 C2      stolen_bases         1.12 
11 C3      hits                -0.427
12 C3      runs                -0.396
13 C3      home_runs            0.456
14 C3      rbi                  0.222
15 C3      stolen_bases        -0.532
16 C4      hits                -1.63 
17 C4      runs                -1.54 
18 C4      home_runs           -0.634
19 C4      rbi                 -1.15 
20 C4      stolen_bases        -0.634
#### Line Graph
ggplot(bhof_clus_tidy, aes(x = KPIS, y = Average_Value, group = Cluster, colour = Cluster)) +
  geom_line(linewidth = 1) + 
  geom_point(size = 2) +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
  ylab("Mean Performance") + 
  scale_x_discrete(labels= c("Hits", "Runs", "Home Runs","RBI", "Stolen Bases")) +
  ggtitle("Mean Performance for each Cluster for each KPI")

#### Barchart
# Define colors for each category
category_colors <- c("hits" = "red", 
                     "runs" = "blue", 
                     "home_runs" = "green", 
                     "rbi" = "purple", 
                     "stolen_bases" = "orange")

# Calculate percentages within each cluster
bhof1_clus_tidy <- bhof_clus_tidy %>%
  group_by(Cluster) %>%
  mutate(Percentage = (Average_Value / sum(Average_Value)) * 100)

# Create bar chart with different colors for each category
ggplot(bhof1_clus_tidy, aes(x = KPIS, y = Percentage, fill = KPIS)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7),
          panel.background = element_blank()) +
  ylab("Percentage of Mean Performance") + 
  ggtitle("The Mean Percentage for each Cluster for each KPI") +
  facet_wrap(~ Cluster, scales = "free_y") +
  scale_fill_manual(values = category_colors)

4. K-means Clustering:

#A. 
set.seed(101)
kmeans1 <- kmeans(bhof_scale, centers = 4)

kmeans1$cluster 
 [1] 4 2 2 1 1 3 1 2 2 3 1 2 3 2 3 2 3 2 3 1 3 1 1 4 2 3 4 1 2 3 1 2 1 1 1 1 2 1
[39] 3 2 3 1 3 1 4 1 3 2 2 4 4 4 1 3 3 2 1 4 1 4 3 1 4 3 1 1 2 2 1 2 1 3 1 1 3 2
[77] 2 1 4 4 4 2
kmeans1$centers 
        hits       runs  home_runs        rbi stolen_bases
1 -0.2044605 -0.1624618  0.5660315  0.3798416   -0.4849247
2  0.7521677  0.5537557 -0.9804896 -0.5809428    1.2098245
3 -1.1508405 -1.2010894 -0.5701789 -0.8728960   -0.4647733
4  0.8494748  1.1682310  1.2734836  1.4407847   -0.3236583
kmeans1$size    
[1] 28 22 19 13
kmeans1$iter    
[1] 3
#B.
bhof1 <- dist(bhof_scale)

sil_kmeans1 <- silhouette(kmeans1$cluster, bhof1)
summary(sil_kmeans1)
Silhouette of 82 units in 4 clusters from silhouette.default(x = kmeans1$cluster, dist = bhof1) :
 Cluster sizes and average silhouette widths:
       28        22        19        13 
0.3322757 0.2178155 0.2687416 0.3421893 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.06303  0.18425  0.28984  0.28842  0.41869  0.54977 
#C.

### Profile clusters first

bhof_clus_kmeans_tidy <- as_tibble(kmeans1$centers) %>%
                              mutate(Cluster = c("C1", "C2", "C3", "C4")) %>% 
                              pivot_longer(cols = c(hits, runs, home_runs, rbi, stolen_bases), names_to = "KPIS", values_to = "Average_Value") 
 
bhof_clus_kmeans_tidy
# A tibble: 20 × 3
   Cluster KPIS         Average_Value
   <chr>   <chr>                <dbl>
 1 C1      hits                -0.204
 2 C1      runs                -0.162
 3 C1      home_runs            0.566
 4 C1      rbi                  0.380
 5 C1      stolen_bases        -0.485
 6 C2      hits                 0.752
 7 C2      runs                 0.554
 8 C2      home_runs           -0.980
 9 C2      rbi                 -0.581
10 C2      stolen_bases         1.21 
11 C3      hits                -1.15 
12 C3      runs                -1.20 
13 C3      home_runs           -0.570
14 C3      rbi                 -0.873
15 C3      stolen_bases        -0.465
16 C4      hits                 0.849
17 C4      runs                 1.17 
18 C4      home_runs            1.27 
19 C4      rbi                  1.44 
20 C4      stolen_bases        -0.324
### Line Graph
ggplot(bhof_clus_kmeans_tidy, mapping = aes(x = KPIS, y = Average_Value, group = Cluster, colour = Cluster)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
  ylab("Mean Performance") + 
  scale_x_discrete(labels= c("Hits", "Runs", "Home Runs","RBI", "Stolen Bases")) +
  ggtitle("Mean Performance for each Cluster for each KPI")

#### Barchart
# Define colors for each category
category_colors <- c("hits" = "red", 
                     "runs" = "blue", 
                     "home_runs" = "green", 
                     "rbi" = "purple", 
                     "stolen_bases" = "orange")

# Calculate percentages within each cluster
bhof1_clus_kmeans_tidy <- bhof_clus_kmeans_tidy %>%
  group_by(Cluster) %>%
  mutate(Percentage = (Average_Value / sum(Average_Value)) * 100)

# Create bar chart with different colors for each category
ggplot(bhof1_clus_kmeans_tidy, aes(x = KPIS, y = Percentage, fill = KPIS)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7),
          panel.background = element_blank()) +
  ylab("Percentage of Mean Performance") + 
  ggtitle("The Mean Percentage for each Cluster for each KPI") +
  facet_wrap(~ Cluster, scales = "free_y") +
  scale_fill_manual(values = category_colors)

5. Compare and contrast the clusters produced by Hierarchical Clustering and K-means.

  1. The Hierarchical Clustering algorithm produced a slightly higher average silhouette width of 0.2907, compared with 0.28842 for the K-means method.

  2. Both algorithms produced similar scores, and both fall into the category of "the structure is weak and could be artificial", i.e. an average silhouette width between 0.25 and 0.5.
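For a direct side-by-side check, the overall average silhouette widths can be pulled from the two silhouette objects; a minimal sketch:
# Overall mean silhouette width for each clustering solution
mean(sil1[, "sil_width"])
mean(sil_kmeans1[, "sil_width"])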