In this part of the project, I have decided to do some classification with my previous dataset from Project 1. For this project, I have pulled the data from the English Premier League results.
library(readxl)
Results <- read_excel("Results.xlsx")
str(Results)
## tibble [380 × 6] (S3: tbl_df/tbl/data.frame)
## $ Home_team: chr [1:380] "Arsenal" "Watford" "Chelsea" "Crystal Palace" ...
## $ Away_team: chr [1:380] "Leicester City" "Liverpool" "Burnley" "Huddersfield Town" ...
## $ Home_goal: num [1:380] 4 3 2 0 1 0 1 0 0 4 ...
## $ Away_goal: num [1:380] 3 3 3 3 0 0 0 2 2 0 ...
## $ Result : chr [1:380] "H" "D" "A" "A" ...
## $ Season : chr [1:380] "2017-2018" "2017-2018" "2017-2018" "2017-2018" ...
summary(Results$Home_goal)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.532 2.000 7.000
Total number of observations and variables:
dim(Results)
## [1] 380 6
I am taking roughly one third of the observations (125 of 380) from my data set as a random sample.
#drawing 125 random row indices out of 380
s <- sample(380, 125)
The head() and tail() functions in R are often used to view the first and last n rows of a dataset.
head(Results)
## # A tibble: 6 × 6
## Home_team Away_team Home_goal Away_goal Result Season
## <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 Arsenal Leicester City 4 3 H 2017-2018
## 2 Watford Liverpool 3 3 D 2017-2018
## 3 Chelsea Burnley 2 3 A 2017-2018
## 4 Crystal Palace Huddersfield Town 0 3 A 2017-2018
## 5 Everton Stoke City 1 0 H 2017-2018
## 6 Southampton Swansea City 0 0 D 2017-2018
tail(Results)
## # A tibble: 6 × 6
## Home_team Away_team Home_goal Away_goal Result Season
## <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 Manchester United Watford 1 0 H 2017-2018
## 2 Newcastle United Chelsea 3 0 H 2017-2018
## 3 Southampton Manchester City 0 1 A 2017-2018
## 4 Swansea City Stoke City 1 2 A 2017-2018
## 5 Tottenham Hotspur Leicester City 5 4 H 2017-2018
## 6 West Ham United Everton 3 1 H 2017-2018
From the head and tail output, we can notice that the data are not shuffled; the rows appear in their original order. If we split unshuffled data sequentially into a train set and a test set, the algorithm would never see observations from the held-out part of the table during training, which can lead to poor predictions.
So, let's shuffle the data:
#shuffling the data and previewing the first few shuffled indices
shuffle_index <- sample(1:nrow(Results))
head(shuffle_index)
## [1] 298 89 291 73 258 371
In the above code, sample(1:nrow(Results)) generates a random permutation of the indices from 1 to 380 (i.e. the total number of rows).
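As a small illustrative sketch (not part of the original script), the shuffled index could be applied to reorder the rows; note that the train/test split below uses the separate random index s drawn earlier instead.
#applying the shuffled index to reorder the rows (illustration only)
Results_shuffled <- Results[shuffle_index, ]
head(Results_shuffled)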
Let's separate the Results data into two parts, using the random index s drawn earlier: a training set and a testing set.
library(rpart)
#split data into training data
Results_train <- Results[s,]
dim(Results_train)
## [1] 125 6
#split data into testing data
Results_test <- Results[-s,]
dim(Results_test)
## [1] 255 6
I have used the function prop.table() combined with table() to verify that the random split preserves the class proportions.
prop.table(table(Results_train$Result))
##
## A D H
## 0.312 0.272 0.416
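As an additional check (not in the original analysis), the same proportions can be computed for the test set and the full data set to confirm that the class balance is similar across the splits:
#class proportions in the test set and in the full data, for comparison
prop.table(table(Results_test$Result))
prop.table(table(Results$Result))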
library(rpart)
library(rpart.plot)
fit <- rpart(Result~., data = Results, method = 'class')
rpart.plot(fit, extra = 101)
Here, in the above decision tree:
A = Away_team wins (the case when the away team is the winner)
H = Home_team wins (the case when the home team is the winner)
D = Draw (the case when the away team and the home team scored the same number of goals in the match)
To make a prediction, we can use the predict() function. For an rpart decision tree, the basic syntax is predict(fitted_model, new_data, type = 'class'):
#making and testing the prediction
predict_unseen <-predict(fit, Results_test, type = 'class')
table_mat <- table(Results_test$Result, predict_unseen)
table_mat
## predict_unseen
## A D H
## A 69 0 0
## D 2 63 0
## H 1 0 120
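As a quick addition (not in the original script), the overall accuracy of the tree on the test set can be computed directly from this confusion matrix by dividing the correctly classified counts (the diagonal) by the total:
#overall accuracy of the decision tree on the test set
sum(diag(table_mat)) / sum(table_mat)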
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(gmodels)
set.seed(380)
#simulated binary factors: X (treated as the predictions) has roughly 75% ones, Y (the reference) roughly 80%
X<-factor(ceiling(runif(380)-0.25))
Y<-factor(ceiling(runif(380)-0.20))
confusionMatrix(X,Y, positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 14 87
## 1 70 209
##
## Accuracy : 0.5868
## 95% CI : (0.5355, 0.6368)
## No Information Rate : 0.7789
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : -0.1187
##
## Mcnemar's Test P-Value : 0.2016
##
## Sensitivity : 0.7061
## Specificity : 0.1667
## Pos Pred Value : 0.7491
## Neg Pred Value : 0.1386
## Prevalence : 0.7789
## Detection Rate : 0.5500
## Detection Prevalence : 0.7342
## Balanced Accuracy : 0.4364
##
## 'Positive' Class : 1
##
CrossTable(X,Y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 380
##
##
## | Y
## X | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 14 | 87 | 101 |
## | 3.105 | 0.881 | |
## | 0.139 | 0.861 | 0.266 |
## | 0.167 | 0.294 | |
## | 0.037 | 0.229 | |
## -------------|-----------|-----------|-----------|
## 1 | 70 | 209 | 279 |
## | 1.124 | 0.319 | |
## | 0.251 | 0.749 | 0.734 |
## | 0.833 | 0.706 | |
## | 0.184 | 0.550 | |
## -------------|-----------|-----------|-----------|
## Column Total | 84 | 296 | 380 |
## | 0.221 | 0.779 | |
## -------------|-----------|-----------|-----------|
##
##
Here, we have an accuracy of about 59%. But accuracy alone might give the wrong idea about the result. Let's think about that. That is where we come across the dual concepts of precision and recall.
Precision tells us how many of the cases predicted as positive actually turned out to be positive. Recall tells us how many of the actual positive cases we were able to predict correctly with our model.
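As a quick sanity check of these definitions (a minimal sketch, assuming the X and Y factors created above are still in the workspace), precision and recall can be computed by hand from the 2 x 2 table; the results should match the caret helpers used below.
#hand-computed precision and recall from table(X, Y); rows = predictions, columns = reference
tab <- table(X, Y)
TP <- tab["1", "1"]   # predicted 1, actually 1
FP <- tab["1", "0"]   # predicted 1, actually 0
FN <- tab["0", "1"]   # predicted 0, actually 1
TP / (TP + FP)        # precision
TP / (TP + FN)        # recall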
library(caret)
#calculating precision and recall
precision <- posPredValue(X,Y, positive="1")
recall <- sensitivity(X,Y, positive="1")
f1 <- (2*precision*recall)/(precision+recall)
precision
## [1] 0.7491039
recall
## [1] 0.7060811
f1
## [1] 0.7269565
data.frame(precision, recall, f1)
## precision recall f1
## 1 0.7491039 0.7060811 0.7269565
Here, precision and recall are around 75% and 71%, respectively. That means that 75% of the cases the model predicts as positive are actually positive, and the model correctly identifies 71% of the actual positive cases.
The prop.table() function can also calculate these proportions directly from the contingency table in R:
library(vcd)
## Loading required package: grid
library(grid)
#calculating proportions from the table
prop.table(table(X,Y),1)
## Y
## X 0 1
## 0 0.1386139 0.8613861
## 1 0.2508961 0.7491039
#calculating Pearson's Chi-squared test
chisq.test(table(X,Y),correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: table(X, Y)
## X-squared = 5.4295, df = 1, p-value = 0.0198
plot(table(X,Y), shade = TRUE)
assocstats(table(X,Y))
## X^2 df P(> X^2)
## Likelihood Ratio 5.8363 1 0.015699
## Pearson 5.4295 1 0.019799
##
## Phi-Coefficient : 0.12
## Contingency Coeff.: 0.119
## Cramer's V : 0.12
Note that the title of the output, 'Pearson's Chi-squared test', indicates that these results are for the uncorrected (not Yates-adjusted) chi-square test.
Cross-validation is implemented to check whether the designed model is working well by testing it against data points that were not present during training. These data points serve as unseen data for the model, which makes it easy to evaluate the model's accuracy. Hence, it is a very effective technique for evaluating a machine learning model.
library(caret)
# control parameters
trctrl <- trainControl(method = "cv", n = 4, classProbs = TRUE)
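To make the resampling scheme concrete, here is a rough sketch (using caret::createFolds, which is not part of the original script) of how 4-fold cross-validation partitions the rows: each fold is held out once while the model is trained on the remaining three.
#illustrating the 4-fold partition used by trainControl(method = "cv", n = 4)
set.seed(125)
folds <- createFolds(Results_test$Result, k = 4)
sapply(folds, length)   # four roughly equal, non-overlapping folds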
Here, I have used the validation-set approach, so I am splitting the data into two sets: one set is used to train the model and the remaining set is used to test the model.
set.seed(125)
## fitting a decision tree model on the test split (a regression tree, since Home_goal is numeric)
## note: this overwrites the Results tibble with the caret model object
Results <- train(Home_goal ~ Result,
                 data = Results_test,
                 method = "rpart",
                 parms = list(split = "gini"),
                 trControl = trctrl)
## Warning in train.default(x, y, weights = w, ...): cannnot compute class
## probabilities for regression
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
# model summary
Results
## CART
##
## 255 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 192, 191, 191, 191
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.00000000 1.013131 0.4389224 0.8160779
## 0.01208846 1.021347 0.4290878 0.8288823
## 0.42335769 1.212305 0.3489997 0.9985237
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.
set.seed(125)
## fitting a decision tree model on the training split (again a regression tree)
Results <- train(Home_goal ~ Result,
                 data = Results_train,
                 method = "rpart",
                 parms = list(split = "gini"),
                 trControl = trctrl)
## Warning in train.default(x, y, weights = w, ...): cannnot compute class
## probabilities for regression
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
# model summary
Results
## CART
##
## 125 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 94, 94, 94, 93
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.000000000 1.071604 0.3868952 0.8614298
## 0.004525148 1.075179 0.3816804 0.8682860
## 0.377384788 1.244428 0.3203533 1.0024786
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.
plot(varImp(Results), main="Variable Importance with Simple Regression")
Overfitting is the difference in prediction performance between the training data and the testing data:
Overfitting = Performance_training − Performance_test
Equivalently, overfitting is the difference in prediction error between the testing data and the training data:
Overfitting = Error_test − Error_training
My model suffers from overfitting because the prediction performance on the test data is somewhat lower than the prediction performance on the training data set. Also, the validation loss is slightly greater than the training loss.
library(rpart)
library(rpart.plot)
set.seed(100)
#building a decision tree from the training dataset
fit <- rpart(Result~., data = Results_train, method = 'class')
rpart.plot(fit)
set.seed(50)
#calculating a confusion matrix
actual = c('A','D','H')[runif(100, 1,5)] # actual labels (non-integer indices are truncated; an index of 4 yields NA, which table() later drops)
predicted = actual # predicted labels, initially identical to the actual labels
predicted[runif(30,1,100)] = actual[runif(30,1,100)] # overwrite ~30 random positions with labels from other random positions to introduce incorrect predictions
cm = as.matrix(table(Actual = actual, Predicted = predicted)) # create the confusion matrix
cm
## Predicted
## Actual A D H
## A 19 2 2
## D 3 18 3
## H 4 1 29
n = sum(cm) # number of instances
nc = nrow(cm) # number of classes
diag = diag(cm) # number of correctly classified instances per class
rowsums = apply(cm, 1, sum) # number of instances per class
colsums = apply(cm, 2, sum) # number of predictions per class
p = rowsums / n # distribution of instances over the actual classes
q = colsums / n # distribution of instances over the predicted classes
#calculating accuracy
accuracy = sum(diag) / n
accuracy
## [1] 0.8148148
#calculating precision, recall, f1_measure
precision = diag / colsums
recall = diag / rowsums
f1 = 2 * precision * recall / (precision + recall)
data.frame(precision, recall, f1)
## precision recall f1
## A 0.7307692 0.8260870 0.7755102
## D 0.8571429 0.7500000 0.8000000
## H 0.8529412 0.8529412 0.8529412
One-vs-All Metrics
I am calculating one-vs-all matrices because it is useful to look at the performance of the classifier with respect to one class at a time before averaging the metrics, especially when the instances are not uniformly distributed over the classes. In the following script, I compute the one-vs-all confusion matrix for each class (3 matrices in this case). We can think of the problem as 3 binary classification tasks, where one class is treated as the positive class while the combination of all the other classes makes up the negative class.
#calculating oneVsAll
oneVsAll = lapply(1 : nc,
function(i){
v = c(cm[i,i],
rowsums[i] - cm[i,i],
colsums[i] - cm[i,i],
n-rowsums[i] - colsums[i] + cm[i,i]);
return(matrix(v, nrow = 2, byrow = T))})
oneVsAll
## [[1]]
## [,1] [,2]
## [1,] 19 4
## [2,] 7 51
##
## [[2]]
## [,1] [,2]
## [1,] 18 6
## [2,] 3 54
##
## [[3]]
## [,1] [,2]
## [1,] 29 5
## [2,] 5 42
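As a follow-up sketch (assuming oneVsAll, nc and n from above), the one-vs-all matrices can be pooled to obtain the average per-class accuracy and the micro-averaged precision and recall; for single-label multi-class data, the micro-averaged precision and recall both equal the overall accuracy.
#pooling the one-vs-all matrices into a single 2 x 2 matrix
pooled = matrix(0, nrow = 2, ncol = 2)
for(i in 1:nc){pooled = pooled + oneVsAll[[i]]}
avgAccuracy = sum(diag(pooled)) / sum(pooled)     # average per-class accuracy
microPrecision = pooled[1,1] / sum(pooled[,1])    # pooled TP / (TP + FP)
microRecall = pooled[1,1] / sum(pooled[1,])       # pooled TP / (TP + FN)
data.frame(avgAccuracy, microPrecision, microRecall)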
Majority-class Metrics
Here, I am using the majority-class index to determine which class dominates my overall dataset. Always predicting this class correctly classifies every instance of it, which yields a baseline accuracy equal to that class's share of the data. If high accuracy were the sole objective, such a naive majority-class model could beat a learned model in many cases. Below, I calculate the expected results of a majority-class classifier applied to the same sample data set. Recall on the majority class is equal to 1 (all majority-class instances will be predicted correctly), and it is 0 for every other class.
mcIndex = which(rowsums==max(rowsums))[1] # majority-class index
mcAccuracy = as.numeric(p[mcIndex])
mcRecall = 0*p; mcRecall[mcIndex] = 1
mcPrecision = 0*p; mcPrecision[mcIndex] = p[mcIndex]
mcF1 = 0*p; mcF1[mcIndex] = 2 * mcPrecision[mcIndex] / (mcPrecision[mcIndex] + 1)
mcIndex
## H
## 3
mcAccuracy
## [1] 0.4197531
data.frame(mcRecall, mcPrecision, mcF1)
## mcRecall mcPrecision mcF1
## A 0 0.0000000 0.0000000
## D 0 0.0000000 0.0000000
## H 1 0.4197531 0.5913043
Random-guess Metrics
This calculation is useful for comparing my model against another baseline, for the same reasons discussed above. If I make a random guess and predict any of the possible labels with equal probability, the expected overall accuracy and the expected recall for every class are the same as the probability of picking a certain class (1/3 here). The expected precision for a class is the probability that a chosen label is actually correct, which is equal to the proportion of instances that belong to that class.
#expected confusion matrix (rows = actual, columns = predicted) under uniform random guessing
(n / nc) * matrix(rep(p, nc), nc, nc, byrow=F)
## [,1] [,2] [,3]
## [1,] 7.666667 7.666667 7.666667
## [2,] 8.000000 8.000000 8.000000
## [3,] 11.333333 11.333333 11.333333
rgAccuracy = 1 / nc
rgPrecision = p
rgRecall = 0*p + 1 / nc
rgF1 = 2 * p / (nc * p + 1)
rgAccuracy
## [1] 0.3333333
data.frame(rgPrecision, rgRecall, rgF1)
## rgPrecision rgRecall rgF1
## A 0.2839506 0.3333333 0.3066667
## D 0.2962963 0.3333333 0.3137255
## H 0.4197531 0.3333333 0.3715847
prop.table(table(actual, predicted),1)
## predicted
## actual A D H
## A 0.82608696 0.08695652 0.08695652
## D 0.12500000 0.75000000 0.12500000
## H 0.11764706 0.02941176 0.85294118
chisq.test(table(actual, predicted),correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: table(actual, predicted)
## X-squared = 83.624, df = 4, p-value < 2.2e-16
plot(table(actual, predicted), shade = TRUE)
assocstats(table(actual, predicted))
## X^2 df P(> X^2)
## Likelihood Ratio 79.310 4 2.2204e-16
## Pearson 83.624 4 0.0000e+00
##
## Phi-Coefficient : NA
## Contingency Coeff.: 0.713
## Cramer's V : 0.718
The chisq.test() function accepts either two vectors or a contingency table as input to perform the chi-square test in R; I set correct=FALSE to turn off Yates' continuity correction. From the decision tree of my original data set, I got a high chi-squared value but a p-value of 0.5644, which is higher than the standard 0.05 significance level. So, I concluded that the tested variables do not have a significant relationship. Also, the Phi coefficient of 0.12 points to only a weak positive relationship between the variables, which supports the conclusion that there is no meaningful association between them.
From the decision tree of my random sample data set, I got a high chi-squared value and a p-value of 1.029e-15, which is less than the standard 0.05 significance level. So, I concluded that the tested variables do have a significant relationship between them.
I think that the test results from the decision tree of my original data set are quite acceptable given the reality of the data. The English Premier League results data set does not show a significant connection between the variables; there are several external factors that could influence the outcomes behind the statistics and probabilities. However, by doing this classification, I was able to explore the connections between the dependent and independent variables within the data set.