DATA MINING WITH R


  In this part of the project, I have decided to do some classification with my dataset from Project 1. For this project, I have pulled the data from English Premier League results.


LOADING MY EXCEL DATA INTO THE R ENVIRONMENT

library(readxl)
Results <- read_excel("Results.xlsx")
str(Results)
## tibble [380 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Home_team: chr [1:380] "Arsenal" "Watford" "Chelsea" "Crystal Palace" ...
##  $ Away_team: chr [1:380] "Leicester City" "Liverpool" "Burnley" "Huddersfield Town" ...
##  $ Home_goal: num [1:380] 4 3 2 0 1 0 1 0 0 4 ...
##  $ Away_goal: num [1:380] 3 3 3 3 0 0 0 2 2 0 ...
##  $ Result   : chr [1:380] "H" "D" "A" "A" ...
##  $ Season   : chr [1:380] "2017-2018" "2017-2018" "2017-2018" "2017-2018" ...
summary(Results$Home_goal)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.532   2.000   7.000

Total no. of Observations and Variables

dim(Results)
## [1] 380   6

TAKING A SAMPLE OF THE DATA

  I am taking roughly one third of the observations from my dataset as a sample.

s <- sample(380, 125)
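  Note that sample() draws a different random subset on every run. For a reproducible split, the random seed can be fixed first; a minimal sketch (the seed value and the name s_reproducible are arbitrary illustrations, not part of the original analysis):

#optional sketch: fixing a seed before sampling makes the 125-row draw
#reproducible across runs (the seed value below is arbitrary)
set.seed(2023)
s_reproducible <- sample(380, 125)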

The head() and tail() functions in R

  The head() and tail() functions in R are often used to view the first and last n rows of a dataset.

head(Results)
## # A tibble: 6 × 6
##   Home_team      Away_team         Home_goal Away_goal Result Season   
##   <chr>          <chr>                 <dbl>     <dbl> <chr>  <chr>    
## 1 Arsenal        Leicester City            4         3 H      2017-2018
## 2 Watford        Liverpool                 3         3 D      2017-2018
## 3 Chelsea        Burnley                   2         3 A      2017-2018
## 4 Crystal Palace Huddersfield Town         0         3 A      2017-2018
## 5 Everton        Stoke City                1         0 H      2017-2018
## 6 Southampton    Swansea City              0         0 D      2017-2018
tail(Results)
## # A tibble: 6 × 6
##   Home_team         Away_team       Home_goal Away_goal Result Season   
##   <chr>             <chr>               <dbl>     <dbl> <chr>  <chr>    
## 1 Manchester United Watford                 1         0 H      2017-2018
## 2 Newcastle United  Chelsea                 3         0 H      2017-2018
## 3 Southampton       Manchester City         0         1 A      2017-2018
## 4 Swansea City      Stoke City              1         2 A      2017-2018
## 5 Tottenham Hotspur Leicester City          5         4 H      2017-2018
## 6 West Ham United   Everton                 3         1 H      2017-2018

  From the head and tail output, we can see that the data is not shuffled; the matches appear in their original order. If we split such ordered data into a training set and a test set by simply taking consecutive rows, the training set would never contain observations from the middle or later portion of the data, and this can lead to poor predictions.

  So, let's shuffle the data:

#shuffling the data: generate a random permutation of the row indices
shuffle_index <- sample(1:nrow(Results))
head(shuffle_index)
## [1] 298  89 291  73 258 371

  In the above code, sample(1:nrow(Results)) generates a random permutation of the indices from 1 to 380 (i.e. the total number of rows).
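  To actually put the rows in this shuffled order, the index can be used to subset the data. A minimal sketch (Results_shuffled is an illustrative name; the train/test split below still uses the random sample index s created earlier):

#reorder the whole dataset using the shuffled row indices
Results_shuffled <- Results[shuffle_index, ]
head(Results_shuffled$Home_team)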


SPLITTING THE DATA

  Let's separate my Results data into two parts: a training set and a testing set.

library(rpart)
#split data into training data
Results_train <- Results[s,]
dim(Results_train)
## [1] 125   6
#split data into testing data
Results_test <- Results[-s,]
dim(Results_test)
## [1] 255   6

  I have used the function prop.table() combined with table() to verify that the randomization process produced a representative split of the Result classes.

prop.table(table(Results_train$Result))
## 
##     A     D     H 
## 0.312 0.272 0.416
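  As a quick comparison (a sketch; its output is not reproduced here), the same proportions can be checked on the test set:

#class proportions in the test set, for comparison with the training sample
prop.table(table(Results_test$Result))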

BUILDING A DECISION TREE IN R


library(rpart)
library(rpart.plot)
#fitting a classification tree for Result on the full Results data
fit <- rpart(Result~., data = Results, method = 'class')
rpart.plot(fit, extra = 101)

  Here, in the above decision tree:

   A = Away win (the case when Away_team wins the match)
      
   H = Home win (the case when Home_team wins the match)
      
   D = Draw (the case when Away_team and Home_team score the same number of goals in the match)

  These labels come directly from the Result column, which is itself determined by the two goal columns, as verified below.
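  A minimal sketch of that check (assuming the standard H/D/A coding; not part of the original output):

#Result should be fully determined by the goal columns under the H/D/A coding
derived <- with(Results, ifelse(Home_goal > Away_goal, "H",
                         ifelse(Home_goal < Away_goal, "A", "D")))
table(derived == Results$Result)  #expected to be TRUE for every match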

MAKE A PREDICTION

  To make a prediction, we can use the predict() function. The basic syntax of predict() for an R decision tree is:
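#predict(fitted_model, newdata, type = 'class')   (a sketch of the general call)
#  fitted_model   - a classification tree returned by rpart()
#  newdata        - a data frame of observations to classify (here, the test set)
#  type = 'class' - return predicted class labels rather than class probabilities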

#making and testing the prediction
predict_unseen <-predict(fit, Results_test, type = 'class')
table_mat <- table(Results_test$Result, predict_unseen)
table_mat
##    predict_unseen
##       A   D   H
##   A  69   0   0
##   D   2  63   0
##   H   1   0 120

CREATING A CONFUSION MATRIX IN R


library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(gmodels)
set.seed(380)
X <- factor(ceiling(runif(380)-0.25))  #simulated binary "prediction" labels (about 75% ones)
Y <- factor(ceiling(runif(380)-0.20))  #simulated binary "reference" labels (about 80% ones)

confusionMatrix(X,Y, positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  14  87
##          1  70 209
##                                           
##                Accuracy : 0.5868          
##                  95% CI : (0.5355, 0.6368)
##     No Information Rate : 0.7789          
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : -0.1187         
##                                           
##  Mcnemar's Test P-Value : 0.2016          
##                                           
##             Sensitivity : 0.7061          
##             Specificity : 0.1667          
##          Pos Pred Value : 0.7491          
##          Neg Pred Value : 0.1386          
##              Prevalence : 0.7789          
##          Detection Rate : 0.5500          
##    Detection Prevalence : 0.7342          
##       Balanced Accuracy : 0.4364          
##                                           
##        'Positive' Class : 1               
## 
CrossTable(X,Y)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  380 
## 
##  
##              | Y 
##            X |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |        14 |        87 |       101 | 
##              |     3.105 |     0.881 |           | 
##              |     0.139 |     0.861 |     0.266 | 
##              |     0.167 |     0.294 |           | 
##              |     0.037 |     0.229 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        70 |       209 |       279 | 
##              |     1.124 |     0.319 |           | 
##              |     0.251 |     0.749 |     0.734 | 
##              |     0.833 |     0.706 |           | 
##              |     0.184 |     0.550 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        84 |       296 |       380 | 
##              |     0.221 |     0.779 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

  Here we have an accuracy of around 58%, but accuracy alone can give a misleading picture of the result. Let's think about that; this is where the dual concepts of Precision and Recall come in.


CALCULATING PRECISION AND RECALL IN R

  Precision tells us what proportion of the cases predicted as positive actually turned out to be positive. Recall tells us what proportion of the actual positive cases we were able to predict correctly with our model.
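  As a concrete check (a small sketch, not part of the original output), both values can be recovered by hand from the 2x2 confusion matrix printed above, with class "1" as the positive class:

#counts read off the confusionMatrix(X, Y) output above (positive class "1")
TP <- 209  #predicted 1, reference 1
FP <- 70   #predicted 1, reference 0
FN <- 87   #predicted 0, reference 1
TP / (TP + FP)  #precision, roughly 0.749
TP / (TP + FN)  #recall, roughly 0.706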

library(caret)
#calculating precision and recall
precision <- posPredValue(X,Y, positive="1")
recall <- sensitivity(X,Y, positive="1")
f1 <- (2*precision*recall)/(precision+recall)
precision
## [1] 0.7491039
recall
## [1] 0.7060811
f1
## [1] 0.7269565
data.frame(precision, recall, f1) 
##   precision    recall        f1
## 1 0.7491039 0.7060811 0.7269565

  Here, precision and recall are around 75% and 71% respectively. That means that 75% of the cases the model labels as positive are actually positive, and the model correctly identifies 71% of the actual positive cases.


CHI-SQUARE STATISTICS IN R


  The prop.table() function calculates the row-wise proportions of a contingency table in R:

library(vcd)
## Loading required package: grid
library(grid)
#calculating proportions from the table
prop.table(table(X,Y),1)
##    Y
## X           0         1
##   0 0.1386139 0.8613861
##   1 0.2508961 0.7491039

#calculating Pearson's Chi-squared test
chisq.test(table(X,Y),correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  table(X, Y)
## X-squared = 5.4295, df = 1, p-value = 0.0198
plot(table(X,Y), shade = TRUE)

assocstats(table(X,Y))
##                     X^2 df P(> X^2)
## Likelihood Ratio 5.8363  1 0.015699
## Pearson          5.4295  1 0.019799
## 
## Phi-Coefficient   : 0.12 
## Contingency Coeff.: 0.119 
## Cramer's V        : 0.12

  Note that the title of the output, ‘Pearson’s Chi-squared test’, indicates that these results are for the uncorrected (not Yates-adjusted) chi-square test.


CROSS-VALIDATION IN R


  Cross-validation is implemented to check whether the designed model works well by testing it against data points that were not present during training. These data points serve as unseen data for the model, which makes it easy to evaluate the model’s accuracy. Hence, it is a very effective technique for evaluating machine learning models.
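  As an illustration of how k-fold cross-validation partitions the data (a sketch using caret's createFolds(); these folds are only for illustration and are not the ones train() builds internally):

library(caret)
#split the 380 match results into 4 roughly equal folds; each fold is held out
#once while a model is fit on the remaining three
folds <- createFolds(Results$Result, k = 4)
sapply(folds, length)  #each fold should contain roughly 380 / 4 = 95 rows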


SETTING A CONTROL PARAMETER

library(caret)
# control parameters
trctrl <- trainControl(method = "cv", number = 4, classProbs = TRUE)

RUNNING THE CROSS-VALIDATION TECHNIQUE


THE VALIDATION SET APPROACH


  Here, I have used the validation set approach, so I split the data into two sets: one set is used to train the model and the other is used to test it.

set.seed(125)
## fitting a decision tree model (a regression tree here, since Home_goal is numeric)
Results <- train(Home_goal ~ Result,
                         data = Results_test, 
                         method = "rpart",
                         parms  = list(split = "gini"), 
                         trControl = trctrl)
## Warning in train.default(x, y, weights = w, ...): cannnot compute class
## probabilities for regression
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
# model summary
Results
## CART 
## 
## 255 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold) 
## Summary of sample sizes: 192, 191, 191, 191 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared   MAE      
##   0.00000000  1.013131  0.4389224  0.8160779
##   0.01208846  1.021347  0.4290878  0.8288823
##   0.42335769  1.212305  0.3489997  0.9985237
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.

set.seed(125)
## fitting a decision tree model (a regression tree here, since Home_goal is numeric)
Results <- train(Home_goal ~ Result,
                         data = Results_train, 
                         method = "rpart",
                         parms  = list(split = "gini"), 
                         trControl = trctrl)
## Warning in train.default(x, y, weights = w, ...): cannnot compute class
## probabilities for regression
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
# model summary
Results
## CART 
## 
## 125 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold) 
## Summary of sample sizes: 94, 94, 94, 93 
## Resampling results across tuning parameters:
## 
##   cp           RMSE      Rsquared   MAE      
##   0.000000000  1.071604  0.3868952  0.8614298
##   0.004525148  1.075179  0.3816804  0.8682860
##   0.377384788  1.244428  0.3203533  1.0024786
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.

VARIABLE IMPORTANCE WITH SIMPLE REGRESSION


plot(varImp(Results), main="Variable Importance with Simple Regression")


  Overfitting is the difference in prediction performance between the training data and the testing data:

  Overfitting = Performance_Training − Performance_Test

  Equivalently, overfitting is the difference in prediction error between the testing data and the training data:

  Overfitting = Error_Test − Error_Training

  My model suffers from some overfitting because the prediction performance on the test data is somewhat lower than the prediction performance on the training data set; the validation loss is slightly greater than the training loss.


LET'S WORK ON A SAMPLE DATASET (n ~ 100) AND MAKE SOME PREDICTIONS


BUILDING A DECISION TREE FROM RANDOM SAMPLE

library(rpart)
library(rpart.plot)
set.seed(100)
#building a decision tree from the training dataset
fit <- rpart(Result~., data = Results_train, method = 'class')
rpart.plot(fit)


COMPUTING A CONFUSION MATRIX FOR THIS MODEL

set.seed(50)
#calculating a confusion matrix from simulated labels (the fitted tree is not used here)
actual = c('A','D','H')[runif(100, 1,5)] # simulated actual labels (indices above 3 give NA, which table() drops)
predicted = actual # predicted labels start as a copy of the actual labels
predicted[runif(30,1,100)] = actual[runif(30,1,100)]  # introduce some incorrect predictions
cm = as.matrix(table(Actual = actual, Predicted = predicted)) # create the confusion matrix
cm
##       Predicted
## Actual  A  D  H
##      A 19  2  2
##      D  3 18  3
##      H  4  1 29
n = sum(cm) # number of instances
nc = nrow(cm) # number of classes
diag = diag(cm) # number of correctly classified instances per class 
rowsums = apply(cm, 1, sum) # number of instances per class
colsums = apply(cm, 2, sum) # number of predictions per class
p = rowsums / n # distribution of instances over the actual classes
q = colsums / n # distribution of instances over the predicted classes

#calculating accuracy
accuracy = sum(diag) / n 
accuracy  
## [1] 0.8148148
#calculating precision, recall, f1_measure
precision = diag / colsums 
recall = diag / rowsums 
f1 = 2 * precision * recall / (precision + recall) 

data.frame(precision, recall, f1) 
##   precision    recall        f1
## A 0.7307692 0.8260870 0.7755102
## D 0.8571429 0.7500000 0.8000000
## H 0.8529412 0.8529412 0.8529412
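  A common next step (a sketch, not part of the original output) is to combine these per-class values into macro-averaged metrics, i.e. their unweighted means:

#macro-averaged metrics: the unweighted mean of the per-class values above
macroPrecision <- mean(precision)
macroRecall <- mean(recall)
macroF1 <- mean(f1)
data.frame(macroPrecision, macroRecall, macroF1)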

One-vs-All Metrics

  I am calculating one-vs-all confusion matrices because it is useful to look at the performance of the classifier with respect to one class at a time before averaging the metrics, especially when the instances are not uniformly distributed over the classes. In the following script, I compute the one-vs-all confusion matrix for each class (3 matrices in this case). We can think of the problem as 3 binary classification tasks, where one class is considered the positive class while the combination of all the other classes makes up the negative class.

#calculating  oneVsAll

oneVsAll = lapply(1 : nc,
                      function(i){
                        v = c(cm[i,i],
                              rowsums[i] - cm[i,i],
                              colsums[i] - cm[i,i],
                              n-rowsums[i] - colsums[i] + cm[i,i]);
                        return(matrix(v, nrow = 2, byrow = T))})
oneVsAll
## [[1]]
##      [,1] [,2]
## [1,]   19    4
## [2,]    7   51
## 
## [[2]]
##      [,1] [,2]
## [1,]   18    6
## [2,]    3   54
## 
## [[3]]
##      [,1] [,2]
## [1,]   29    5
## [2,]    5   42
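  These binary matrices can also be pooled (a sketch following the same idea, not part of the original output): summing them and taking the proportion on the diagonal gives the average per-class accuracy.

#sum the one-vs-all matrices; the diagonal of the pooled matrix counts the
#TP + TN of all three binary tasks
cmPooled <- Reduce("+", oneVsAll)
avgAccuracy <- sum(diag(cmPooled)) / sum(cmPooled)
avgAccuracy  #roughly 0.88 for the matrices shown above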

Majority-class Metrics

  Here, I am using the majority-class index to identify the class that dominates the overall dataset. Always predicting the dominant class guarantees a relatively high overall accuracy, since most of the labels will be predicted correctly. If high accuracy is the sole objective, a naive majority-class model can beat a learned model in many cases. Below, I have calculated the expected results of a majority-class classifier applied to the same sample data set. Recall on the majority class is equal to 1 (all majority-class instances are predicted correctly), while recall and precision on the other classes are 0.

mcIndex = which(rowsums==max(rowsums))[1] # majority-class index
mcAccuracy = as.numeric(p[mcIndex]) 
mcRecall = 0*p;  mcRecall[mcIndex] = 1
mcPrecision = 0*p; mcPrecision[mcIndex] = p[mcIndex]
mcF1 = 0*p; mcF1[mcIndex] = 2 * mcPrecision[mcIndex] / (mcPrecision[mcIndex] + 1)
mcIndex
## H 
## 3
mcAccuracy
## [1] 0.4197531
data.frame(mcRecall, mcPrecision, mcF1)
##   mcRecall mcPrecision      mcF1
## A        0   0.0000000 0.0000000
## D        0   0.0000000 0.0000000
## H        1   0.4197531 0.5913043

Random-guess Metrics

  This calculation is useful for comparing my model, for the same reasons discussed above. If I make a uniformly random guess among the possible labels, the expected overall accuracy and the expected recall for every class equal the probability of picking any one class (1/3 here). The expected precision for a class equals the probability that a randomly guessed label is actually correct, which is the proportion of instances that belong to that class.

#expected confusion-matrix counts under uniform random guessing
(n / nc) * matrix(rep(p, nc), nc, nc, byrow=F)
##           [,1]      [,2]      [,3]
## [1,]  7.666667  7.666667  7.666667
## [2,]  8.000000  8.000000  8.000000
## [3,] 11.333333 11.333333 11.333333
rgAccuracy = 1 / nc
rgPrecision = p
rgRecall = 0*p + 1 / nc
rgF1 = 2 * p / (nc * p + 1)
rgAccuracy
## [1] 0.3333333
data.frame(rgPrecision, rgRecall, rgF1)
##   rgPrecision  rgRecall      rgF1
## A   0.2839506 0.3333333 0.3066667
## D   0.2962963 0.3333333 0.3137255
## H   0.4197531 0.3333333 0.3715847

CHI-SQUARE STATISTICS IN R FOR RANDOM SAMPLE


prop.table(table(actual, predicted),1)
##       predicted
## actual          A          D          H
##      A 0.82608696 0.08695652 0.08695652
##      D 0.12500000 0.75000000 0.12500000
##      H 0.11764706 0.02941176 0.85294118
chisq.test(table(actual, predicted),correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  table(actual, predicted)
## X-squared = 83.624, df = 4, p-value < 2.2e-16
plot(table(actual, predicted), shade = TRUE)

assocstats(table(actual, predicted))
##                     X^2 df   P(> X^2)
## Likelihood Ratio 79.310  4 2.2204e-16
## Pearson          83.624  4 0.0000e+00
## 
## Phi-Coefficient   : NA 
## Contingency Coeff.: 0.713 
## Cramer's V        : 0.718

DECISION TREE ANALYSIS


  The chisq.test() function takes a contingency table built from two vectors as input to perform the chi-square test in R. I set correct=FALSE to turn off Yates’ continuity correction. From the chi-square test on my original data set, I got a fairly high chi-squared value but a p-value of 0.5644, which is higher than the standard significance level of 0.05. So, I concluded that the tested variables do not have a significant relationship. Also, the phi coefficient of 0.12 points to only a weak positive relationship between the variables, which supports the conclusion that there is no statistically significant association between them.

  From the chi-square test on my random sample data set, I got a high chi-squared value and a p-value of 1.029e-15, which is less than the standard significance level of 0.05. So, I concluded that the tested variables have a significant relationship between them.


CONCLUSION


  I think the test results from the decision tree analysis of my original data set are quite acceptable given the reality of the data set. The English Premier League results data set does not show a significant connection between the variables, and there are several external factors that could influence the outcomes behind these statistics and probabilities. However, by doing this classification, I was able to explore the connections between the dependent and independent variables within the data set.