1. Import the pl_training.csv and pl_testing.csv datasets into R.
# Loading Packages and Importing Datasets
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rpart)
library(rattle)
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
library(dplyr) ## for visualising graphs
library(readr)
pl_training <- read_csv("pl_training.csv")
Rows: 600 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): team, wdl_ft, wdl_ht, home_or_away
dbl (8): ftg_diff, htg_diff, s_diff, st_diff, f_diff, c_diff, y_diff, r_diff
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pl_testing <- read_csv("pl_testing.csv")
Rows: 160 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): team, wdl_ft, wdl_ht, home_or_away
dbl (8): ftg_diff, htg_diff, s_diff, st_diff, f_diff, c_diff, y_diff, r_diff
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(pl_testing)
2. Classification Tree Method:
Create and visualise a classification tree model that will allow you to classify a team as either the home or the away team.
# Create classification tree model
pl_model_tree <- rpart(home_or_away ~ c_diff + f_diff + r_diff, data = pl_training, method = 'class')
fancyRpartPlot(pl_model_tree)
Interpret the classification tree:
If a team has a c_diff greater than or equal to 0.5, it is predicted to be the home team. This node's purity is 58%: teams whose corner difference follows this path end up in this leaf, and 58% of the observations in it are home teams.
If a team has a c_diff below -3.5, it is predicted to be the away team. That leaf's purity is 64%: 64% of the observations in it are away teams.
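To make the leaf interpretation concrete, a hypothetical observation can be pushed through the tree (the values below are illustrative, not taken from the dataset):
# A made-up team with a corner difference of 3 and neutral fouls / red cards
new_team <- data.frame(c_diff = 3, f_diff = 0, r_diff = 0)
predict(pl_model_tree, newdata = new_team, type = 'prob')
# c_diff >= 0.5 sends this observation to the first leaf described above (~58% Home)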
# iii)
pl_model_tree$variable.importance
c_diff f_diff
8.643357 5.273495
summary(pl_model_tree)
Call:
rpart(formula = home_or_away ~ +c_diff + f_diff + r_diff, data = pl_training,
method = "class")
n= 600
CP nsplit rel error xerror xstd
1 0.14429530 0 1.0000000 1.1006711 0.04091942
2 0.01677852 1 0.8557047 0.9496644 0.04103278
3 0.01000000 4 0.8053691 0.9295302 0.04097786
Variable importance
c_diff f_diff
62 38
Node number 1: 600 observations, complexity param=0.1442953
predicted class=Away expected loss=0.4966667 P(node) =1
class counts: 302 298
probabilities: 0.503 0.497
left son=2 (325 obs) right son=3 (275 obs)
Primary splits:
c_diff < 0.5 to the left, improve=6.74694600, (0 missing)
f_diff < -2.5 to the right, improve=4.55806800, (0 missing)
r_diff < 0.5 to the right, improve=0.04902376, (0 missing)
Surrogate splits:
f_diff < -1.5 to the right, agree=0.593, adj=0.113, (0 split)
Node number 2: 325 observations, complexity param=0.01677852
predicted class=Away expected loss=0.4276923 P(node) =0.5416667
class counts: 186 139
probabilities: 0.572 0.428
left son=4 (130 obs) right son=5 (195 obs)
Primary splits:
c_diff < -3.5 to the left, improve=1.8964100, (0 missing)
f_diff < -2.5 to the right, improve=1.3829240, (0 missing)
r_diff < 0.5 to the left, improve=0.6732439, (0 missing)
Surrogate splits:
f_diff < 7.5 to the right, agree=0.612, adj=0.031, (0 split)
Node number 3: 275 observations
predicted class=Home expected loss=0.4218182 P(node) =0.4583333
class counts: 116 159
probabilities: 0.422 0.578
Node number 4: 130 observations
predicted class=Away expected loss=0.3615385 P(node) =0.2166667
class counts: 83 47
probabilities: 0.638 0.362
Node number 5: 195 observations, complexity param=0.01677852
predicted class=Away expected loss=0.4717949 P(node) =0.325
class counts: 103 92
probabilities: 0.528 0.472
left son=10 (84 obs) right son=11 (111 obs)
Primary splits:
f_diff < 2.5 to the right, improve=1.8390360, (0 missing)
r_diff < 0.5 to the left, improve=1.2341880, (0 missing)
c_diff < -2.5 to the right, improve=0.4391316, (0 missing)
Node number 10: 84 observations
predicted class=Away expected loss=0.3928571 P(node) =0.14
class counts: 51 33
probabilities: 0.607 0.393
Node number 11: 111 observations, complexity param=0.01677852
predicted class=Home expected loss=0.4684685 P(node) =0.185
class counts: 52 59
probabilities: 0.468 0.532
left son=22 (20 obs) right son=23 (91 obs)
Primary splits:
f_diff < -5.5 to the left, improve=2.615543, (0 missing)
r_diff < 0.5 to the left, improve=2.006552, (0 missing)
c_diff < -1.5 to the right, improve=0.638068, (0 missing)
Node number 22: 20 observations
predicted class=Away expected loss=0.3 P(node) =0.03333333
class counts: 14 6
probabilities: 0.700 0.300
Node number 23: 91 observations
predicted class=Home expected loss=0.4175824 P(node) =0.1516667
class counts: 38 53
probabilities: 0.418 0.582
c_diff is the most important predictor, with a relative importance of 62 compared with 38 for f_diff. r_diff does not appear in the variable importance table at all, so it contributes nothing to the splits the tree actually uses.
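The 62/38 split shown by summary() above is simply the raw variable.importance scores rescaled to sum to 100; a one-line check:
# Rescale the raw importance scores to percentages (matches the 62 / 38 in summary())
round(100 * pl_model_tree$variable.importance / sum(pl_model_tree$variable.importance))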
Assess the accuracy of the classification tree using both the training and the testing datasets.
# Accuracy on Training data
pl_training_tree_prob <- predict(pl_model_tree, newdata = pl_training, type = 'prob')
pl_training_tree_prediction <- predict(pl_model_tree, newdata = pl_training, type = 'class')
pl_training_tree_final <- cbind(pl_training, pl_training_tree_prob, pl_training_tree_prediction)
head(pl_training_tree_final)
team ftg_diff htg_diff s_diff st_diff f_diff c_diff y_diff r_diff
1 Watford 1 0 14 -2 5 5 -1 0
2 Southampton 0 0 -12 -2 -3 -7 0 0
3 Crystal Palace 1 0 1 -2 -7 1 -2 0
4 Bournemouth 1 1 -1 -1 -4 2 0 0
5 West Ham -3 -1 -5 -8 -5 4 3 0
6 Leicester 0 0 -2 -4 0 -3 1 0
wdl_ft wdl_ht home_or_away Away Home pl_training_tree_prediction
1 Win Draw Home 0.4218182 0.5781818 Home
2 Draw Draw Home 0.6384615 0.3615385 Away
3 Win Draw Home 0.4218182 0.5781818 Home
4 Win Win Home 0.4218182 0.5781818 Home
5 Lose Lose Home 0.4218182 0.5781818 Home
6 Draw Draw Home 0.4175824 0.5824176 Home
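The chunk that fits the binary logistic regression model is not included in this extract; a minimal sketch of how pl_model_lr (referenced below) could be fitted, assuming the same three predictors as the tree model and treating 'Home' as the modelled event (the alphabetical factor ordering makes Away the reference level):
# Fit a binary logistic regression model (sketch; same predictors as the tree)
pl_model_lr <- glm(factor(home_or_away) ~ c_diff + f_diff + r_diff,
                   data = pl_training, family = binomial)
summary(pl_model_lr)     # coefficients and p-values interpreted in the next paragraph
exp(coef(pl_model_lr))   # odds ratios interpreted below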
The p-value for r_diff is 0.7895, so we fail to reject H0 and conclude that r_diff is not a significant predictor of whether a team is the home or the away team. Similarly, f_diff, with a p-value of 0.0767, is not significant at the 5% level. The only significant predictor is c_diff, with a p-value of 0.0002.
c_diff: coefficient -0.07041, odds ratio = exp(-0.070) = 0.932. So, for every one-unit increase in c_diff, the odds of a team being classified as the home team decrease by approximately 6.8%.
f_diff: coefficient 0.03091, odds ratio = exp(0.031) = 1.031. So, for every one-unit increase in f_diff, the odds of a team being classified as the home team increase by approximately 3.1%.
r_diff: coefficient -0.06610, odds ratio = exp(-0.066) = 0.936. So, for every one-unit increase in r_diff, the odds of a team being classified as the home team decrease by approximately 6.4%.
Fully assess the accuracy of the logistic regression model using both the training and the testing datasets.
# First apply to training dataset
pl_training_lr_pi <- predict(pl_model_lr, newdata = pl_training, type = 'response')
pl_training_lr_pi
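The remaining steps that produce the four accuracy values printed in the comparison below are not shown in this extract; a minimal sketch, assuming a 0.5 cut-off for the logistic regression probabilities and a direct comparison of predicted and recorded home_or_away labels:
# Logistic regression accuracy, training and testing (0.5 cut-off is an assumption)
pl_training_lr_acc <- mean(ifelse(pl_training_lr_pi > 0.5, "Home", "Away") == pl_training$home_or_away)
pl_testing_lr_pi   <- predict(pl_model_lr, newdata = pl_testing, type = 'response')
pl_testing_lr_acc  <- mean(ifelse(pl_testing_lr_pi > 0.5, "Home", "Away") == pl_testing$home_or_away)

# Classification tree accuracy, training and testing
pl_training_tree_acc <- mean(pl_training_tree_prediction == pl_training$home_or_away)
pl_testing_tree_acc  <- mean(predict(pl_model_tree, newdata = pl_testing, type = 'class') == pl_testing$home_or_away)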
Compare and contrast the Classification Tree model and the Binary Logistic Regression model.
# Compare the accuracy of all models
pl_training_lr_acc
[1] 0.57
pl_testing_lr_acc
[1] 0.63125
pl_training_tree_acc
[1] 0.6
pl_testing_tree_acc
[1] 0.60625
i) The tree method is less accurate than the logistic regression model on the testing data (61% vs 63%). However, the tree method is more accurate on the training data (60% vs 57%). Because there is a smaller gap between the training and testing accuracy for the tree model, it is the better-fitted model and should predict new data more reliably.
The tree model considers c_diff to be the most significant predictor, with a relative importance of 62 compared with 38 for fouls (f_diff).
Question 2 – Clustering Baseball Players
1. Import the baseball_hof.csv file into R.
# Loading Package and Importing Datasets
library(cluster)
baseball_hof <- read_csv("baseball_hof.csv")
Rows: 82 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): playerID
dbl (5): hits, runs, home_runs, rbi, stolen_bases
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(baseball_hof)
2. Does the data need to be scaled before computing the distance matrix for hierarchical clustering or before being entered into the K-means clustering algorithm? Explain your answer.
Scaling is essential for both hierarchical and K-means clustering because both rely on distance calculations. Without scaling, variables with greater magnitudes, such as career hits or RBIs, would dominate the distance matrix and disproportionately influence the clustering, producing biased groupings. Scaling gives every variable equal weight, so the clusters reflect players' overall performance profiles rather than whichever statistic happens to have the largest values.
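As a quick illustration (using the column names from the import above), the spread of the raw and scaled variables can be compared:
# Standard deviations of the raw counting stats, in their original units
sapply(select(baseball_hof, hits:stolen_bases), sd)
# After scaling, every variable has mean 0 and standard deviation 1
apply(scale(select(baseball_hof, hits:stolen_bases)), 2, sd)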
3. Hierarchical Clustering:
# A.
#### Creating a subset without the Player ID variable because it can't be used in the cluster analysis. However, we will need it later.
bhof <- select(baseball_hof, hits:stolen_bases)
View(baseball_hof)
##### Compute distances between each pair of players
bhof_scale <- scale(bhof)
bhof1 <- dist(bhof_scale)
# B.
##### Hierarchical clustering using Ward.
h1 <- hclust(bhof1, method = 'ward.D')
# C.
##### Creating a Dendrogram and a heatmap.
plot(h1, hang = -1)
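The chunk that draws the heatmap is not shown in this extract; a minimal sketch of one way to produce a distance-matrix heatmap ordered by the Ward dendrogram (the original may have used a different function):
# Heatmap of pairwise distances, rows and columns ordered by the Ward dendrogram
heatmap(as.matrix(bhof1),
        Rowv = as.dendrogram(h1), Colv = "Rowv",
        symm = TRUE, scale = "none",
        labRow = baseball_hof$playerID, labCol = baseball_hof$playerID)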
Yes, the heatmap provides some evidence of clustering structure within the dataset. Light-coloured blocks are visible around the diagonal of the heatmap; roughly six such blocks can be seen, which indicates some level of clustering.
# D.
##### 4 Clusters
clusters1 <- cutree(h1, k = 4)
##### Assess the quality of the segmentation
sil1 <- silhouette(clusters1, bhof1)
summary(sil1)
Silhouette of 82 units in 4 clusters from silhouette.default(x = clusters1, dist = bhof1) :
Cluster sizes and average silhouette widths:
17 25 30 10
0.3202221 0.2078566 0.2958375 0.4325219
Individual silhouette widths:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1771 0.2100 0.3257 0.2907 0.4224 0.5725
##### Profile the clusters.
# Combine original dataset
bhof_clus <- cbind(baseball_hof, bhof_scale, clusters1)
colnames(bhof_clus) <- c("PlayerID", "hits", "runs", "home_runs", "rbi", "stolen_bases",
                         "hits_s", "runs_s", "home_runs_s", "rbi_s", "stolen_bases_s", "clusters1")
bhof_clus <- mutate(bhof_clus, Cluster = case_when(clusters1 == 1 ~ 'C1',
                                                   clusters1 == 2 ~ 'C2',
                                                   clusters1 == 3 ~ 'C3',
                                                   clusters1 == 4 ~ 'C4'))
View(bhof_clus)
# Calculating mean value of all scaled variables for each cluster.
bhof_clus_means <- bhof_clus %>%
  group_by(Cluster) %>%
  summarise(hits = mean(hits_s),
            runs = mean(runs_s),
            home_runs = mean(home_runs_s),
            rbi = mean(rbi_s),
            stolen_bases = mean(stolen_bases_s))
bhof_clus_means
# E.
#### Convert the dataset to be in "tidy" format to allow for creation of line graph.
bhof_clus_tidy <- bhof_clus_means %>%
  pivot_longer(cols = c(hits, runs, home_runs, rbi, stolen_bases),
               names_to = "KPIS", values_to = "Average_Value")
bhof_clus_tidy
#### Line Graph
ggplot(bhof_clus_tidy, aes(x = KPIS, y = Average_Value, group = Cluster, colour = Cluster)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
  ylab("Mean Performance") +
  scale_x_discrete(labels = c("Hits", "Runs", "Home Runs", "RBI", "Stolen Bases")) +
  ggtitle("Mean Performance for each Cluster for each KPI")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
#### Barchart
# Define colors for each category
category_colors <- c("hits" = "red", "runs" = "blue", "home_runs" = "green", "rbi" = "purple", "stolen_bases" = "orange")
# Calculate percentages within each cluster
bhof1_clus_tidy <- bhof_clus_tidy %>%
  group_by(Cluster) %>%
  mutate(Percentage = (Average_Value / sum(Average_Value)) * 100)
# Create bar chart with different colors for each category
ggplot(bhof1_clus_tidy, aes(x = KPIS, y = Percentage, fill = KPIS)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7),
        panel.background = element_blank()) +
  ylab("Percentage of Mean Performance") +
  ggtitle("The Mean Percentage for each Cluster for each KPI") +
  facet_wrap(~ Cluster, scales = "free_y") +
  scale_fill_manual(values = category_colors)
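4. K-means Clustering:
The chunks that fit the K-means model and build bhof_clus_kmeans_tidy are not included in this extract; a minimal sketch of how they might look, assuming four clusters on the same scaled data (the seed value is illustrative, not from the original analysis):
# Fit K-means on the scaled data used for hierarchical clustering
set.seed(123)                                   # illustrative seed
bhof_kmeans <- kmeans(bhof_scale, centers = 4, nstart = 25)
# Assess the quality of the K-means segmentation
sil_kmeans <- silhouette(bhof_kmeans$cluster, bhof1)
summary(sil_kmeans)
# Profile the K-means clusters and pivot to "tidy" format for the charts below
bhof_clus_kmeans_tidy <- bhof_clus %>%
  mutate(Cluster = paste0("C", bhof_kmeans$cluster)) %>%
  group_by(Cluster) %>%
  summarise(hits = mean(hits_s), runs = mean(runs_s), home_runs = mean(home_runs_s),
            rbi = mean(rbi_s), stolen_bases = mean(stolen_bases_s)) %>%
  pivot_longer(cols = c(hits, runs, home_runs, rbi, stolen_bases),
               names_to = "KPIS", values_to = "Average_Value")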
### Line Graph
ggplot(bhof_clus_kmeans_tidy, mapping = aes(x = KPIS, y = Average_Value, group = Cluster, colour = Cluster)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7)) +
  ylab("Mean Performance") +
  scale_x_discrete(labels = c("Hits", "Runs", "Home Runs", "RBI", "Stolen Bases")) +
  ggtitle("Mean Performance for each Cluster for each KPI")
#### Barchart
# Define colors for each category
category_colors <- c("hits" = "red", "runs" = "blue", "home_runs" = "green", "rbi" = "purple", "stolen_bases" = "orange")
# Calculate percentages within each cluster
bhof1_clus_kmeans_tidy <- bhof_clus_kmeans_tidy %>%
  group_by(Cluster) %>%
  mutate(Percentage = (Average_Value / sum(Average_Value)) * 100)
# Create bar chart with different colors for each category
ggplot(bhof1_clus_kmeans_tidy, aes(x = KPIS, y = Percentage, fill = KPIS)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 30, vjust = 0.7),
        panel.background = element_blank()) +
  ylab("Percentage of Mean Performance") +
  ggtitle("The Mean Percentage for each Cluster for each KPI") +
  facet_wrap(~ Cluster, scales = "free_y") +
  scale_fill_manual(values = category_colors)
5. Compare and contrast the clusters produced by Hierarchical Clustering and K-means.
The Hierarchical Clustering algorithm produced a slightly higher average silhouette width of 0.2907, compared with 0.28842 for the K-means method.
Both algorithms produced similar scores, and both fall into the category of "the structure is weak and could be artificial", which applies to average silhouette widths between 0.25 and 0.5.
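For reference, both figures are the mean silhouette widths of the respective silhouette objects (sil_kmeans here is the hypothetical object from the K-means sketch above):
mean(sil1[, "sil_width"])        # hierarchical (Ward), reported as 0.2907
mean(sil_kmeans[, "sil_width"])  # K-means, reported as 0.28842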