Initial Proposal Plan

Predicting the winner of a future match is always exciting. Our initial proposal was to predict the probability of a team winning a match based on its previous performances.
With this, we intend to address the problem faced by investors and organizations that invest large sums of money in teams.

Key Peer Comments Summary

Most of the comments on our proposal concerned the match data: whether we would consider only particular seasons or years. After taking these suggestions into consideration, we decided to use the complete match data.
Some comments asked why we chose H2O rather than Keras. We chose H2O because it is simpler to use. Keras is most efficient for large neural networks, which we did not plan to build, and our dataset is not large.
Many comments asked which AI models we planned to use. We implemented Naive Bayes and deep learning models.

Data Summary

Initially, we collected the data through a football API and stored it in three CSV files, which we then combined into a single dataset.

Dataset Description:

The dataset consists of football match data. It includes 21,353 rows and 10 variables: Odds.homeWin, Odds.draw, Odds.awayWin, halfTime_score_Home, halfTime_score_Away, fullTime_score_Home, fullTime_score_Away, homeTeam, awayTeam and winner, where “winner” is the y variable to be predicted.

Odds.homeWin	Odds.draw	Odds.awayWin	halfTime_score_Home	halfTime_score_Away	fullTime_score_Home	fullTime_score_Away	homeTeam	awayTeam	winner
2.01	3.24	3.76	0	2	0	2	Fortaleza EC	CA Paranaense	AWAY_TEAM
3.77	3.06	2.10	0	1	0	1	Coritiba FBC	SC Internacional	AWAY_TEAM
2.90	2.93	2.60	3	2	3	2	SC Recife	Ceará SC	HOME_TEAM
2.14	3.24	3.39	0	1	1	1	Santos FC	RB Bragantino	DRAW
1.46	4.46	6.29	0	1	0	1	CR Flamengo	CA Mineiro	AWAY_TEAM
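
In the sample rows, the “winner” label follows directly from the two full-time score columns. The Python sketch below (not the R code used for this report) checks that relationship on the five rows shown above. The function name `winner_from_score` is ours and purely illustrative.

```python
# Sample rows copied from the table above:
# (fullTime_score_Home, fullTime_score_Away, winner).
SAMPLE = [
    (0, 2, "AWAY_TEAM"),
    (0, 1, "AWAY_TEAM"),
    (3, 2, "HOME_TEAM"),
    (1, 1, "DRAW"),
    (0, 1, "AWAY_TEAM"),
]


def winner_from_score(home_goals: int, away_goals: int) -> str:
    """Return the label implied by the full-time score (hypothetical helper)."""
    if home_goals > away_goals:
        return "HOME_TEAM"
    if home_goals < away_goals:
        return "AWAY_TEAM"
    return "DRAW"


# Collect every row whose recorded label disagrees with its score.
mismatches = [row for row in SAMPLE if winner_from_score(row[0], row[1]) != row[2]]
print(mismatches)  # [] for the five sample rows
```

A check of this kind is a cheap way to validate the label column after several source files have been merged.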

Data Exploration

Plot 1 - Matches and Wins with respect to Home team and Away team

The bar chart shows the number of matches won by the home team, won by the away team, and drawn.
Of the 21,260 matches in total, the home team won 9,115, the away team won 6,531, and 5,614 were draws.
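
The three counts can also be read as shares of all matches. A short Python sketch (illustrative only, not the R code behind the plot) that converts the counts quoted above into percentages:

```python
# Outcome counts as quoted in the text for Plot 1.
counts = {"HOME_TEAM": 9115, "AWAY_TEAM": 6531, "DRAW": 5614}

total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}

print(total)  # 21260
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
# HOME_TEAM: 42.9%
# AWAY_TEAM: 30.7%
# DRAW: 26.4%
```

The home-win share is the figure to beat for any model: always predicting HOME_TEAM would already be right about 42.9% of the time on data with this distribution.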

Plot 2 - Top Home Winners

The bar chart lists the teams with the most home wins.
CA Mineiro stands out as the top home winner with 310 wins, followed by CR Flamengo with 240 and Fluminense FC with 220. Nine more teams follow in the chart.
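
Rankings like this come from tallying wins per team. The sketch below shows one way to do that in Python, using only the five sample rows from the dataset description. It is an illustration, not the R code that produced the plot, and `tally_wins` is a hypothetical helper name.

```python
from collections import Counter

# (homeTeam, awayTeam, winner) for the five sample rows of the dataset.
MATCHES = [
    ("Fortaleza EC", "CA Paranaense", "AWAY_TEAM"),
    ("Coritiba FBC", "SC Internacional", "AWAY_TEAM"),
    ("SC Recife", "Ceará SC", "HOME_TEAM"),
    ("Santos FC", "RB Bragantino", "DRAW"),
    ("CR Flamengo", "CA Mineiro", "AWAY_TEAM"),
]


def tally_wins(matches):
    """Count home wins and away wins per team (hypothetical helper)."""
    home_wins, away_wins = Counter(), Counter()
    for home, away, winner in matches:
        if winner == "HOME_TEAM":
            home_wins[home] += 1
        elif winner == "AWAY_TEAM":
            away_wins[away] += 1
    return home_wins, away_wins


home_wins, away_wins = tally_wins(MATCHES)
print(home_wins.most_common(3))  # top home winners in the sample
print(away_wins.most_common(3))  # top away winners in the sample
```

On the full dataset, `most_common(12)` on the home counter would give the same kind of list as the chart.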

Plot 3 - Top Away Winners

The bar chart lists the teams with the most away wins.
CR Flamengo stands out as the top away winner with 180 wins, followed by AC Milan with 154 and SE Palmeiras with 150. Ten more teams follow in the chart.

Plot 4 - Top Three teams with respect to Home wins, Away wins and Draw

The bar chart shows the top three teams and their match results: matches won at home, matches drawn, and matches lost away.
CA Mineiro won about 310 matches at home, drew about 50 and lost about 20 away. CR Flamengo won about 240 matches at home, drew about 50 and lost about 90 away. Fluminense FC won about 220 matches at home, drew about 100 and lost about 50 away.

Complete Data View

After data cleaning, wrangling and updating, the final dataset had 21,260 records, which we split and use to implement our AI/ML models.
A sample of the data is shown below.

Odds.homeWin	Odds.draw	Odds.awayWin	halfTime_score_Home	halfTime_score_Away	fullTime_score_Home	fullTime_score_Away	homeTeam	awayTeam	winner
2.01	3.24	3.76	0	2	0	2	Fortaleza EC	CA Paranaense	AWAY_TEAM
3.77	3.06	2.10	0	1	0	1	Coritiba FBC	SC Internacional	AWAY_TEAM
2.90	2.93	2.60	3	2	3	2	SC Recife	Ceará SC	HOME_TEAM
2.14	3.24	3.39	0	1	1	1	Santos FC	RB Bragantino	DRAW
1.46	4.46	6.29	0	1	0	1	CR Flamengo	CA Mineiro	AWAY_TEAM

Partition and pre-processing of Data

The dataset is divided into training data and testing data. The intended ratio was 80%:20%, but the row counts reported below (14,883 training and 6,377 testing rows out of 21,260) correspond to a split of about 70%:30%.
After splitting the dataset, we performed data transformation and pre-processing using the “recipes” package, which transformed our data for ML modelling. We then prepared and baked the recipe to transform the outcome variable (y).

Train Data

## [1] 14883

Test Data

## [1] 6377
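
The two row counts can be checked against the size of the cleaned dataset. A minimal Python sketch of that arithmetic (illustrative, not the R code used for the split):

```python
# Row counts reported in this section and in "Complete Data View".
TOTAL_ROWS = 21260
TRAIN_ROWS = 14883
TEST_ROWS = 6377

# The two partitions account for every record in the cleaned dataset.
assert TRAIN_ROWS + TEST_ROWS == TOTAL_ROWS

train_share = TRAIN_ROWS / TOTAL_ROWS
test_share = TEST_ROWS / TOTAL_ROWS
print(f"{train_share:.1%} / {test_share:.1%}")  # 70.0% / 30.0%

# For comparison: the training size an exact 80% split would give.
print(round(0.8 * TOTAL_ROWS))  # 17008
```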

AI/ML Models and Results

Naive Bayes model

For prediction, we used Naive Bayes, which is a supervised machine learning algorithm, and two deep learning models (models 1 and 2).
We used the “e1071” package, which provides the Naive Bayes training function. After fitting the model on the training data, we tested it by making predictions on the testing data and achieved an accuracy of 71.48%.
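
To show the idea behind the classifier, here is a minimal Gaussian Naive Bayes sketch in Python for numeric features. It is not the e1071 implementation and not the code used for this report: the function names, the variance floor and the toy odds values are all assumptions made for illustration, and the sketch uses population variance for simplicity.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels, var_floor=1e-3):
    """Estimate a log prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # Population variance, floored so that no feature has zero variance.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, var_floor)
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(rows)), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one row."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)


# Made-up odds triples (homeWin, draw, awayWin); NOT taken from the dataset.
toy_rows = [
    (1.4, 4.5, 7.0), (1.6, 4.0, 6.0), (1.5, 4.2, 6.5),
    (6.0, 4.2, 1.5), (7.0, 4.4, 1.4), (5.5, 4.0, 1.6),
]
toy_labels = ["HOME_TEAM"] * 3 + ["AWAY_TEAM"] * 3

model = fit_gaussian_nb(toy_rows, toy_labels)
print(predict(model, (1.5, 4.3, 6.4)))  # HOME_TEAM
print(predict(model, (6.2, 4.1, 1.5)))  # AWAY_TEAM
```

The "naive" part is the sum over features in `log_posterior`: each feature contributes independently once the class is fixed.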

## Confusion Matrix and Statistics
## 
##            y_pred
##             AWAY_TEAM DRAW HOME_TEAM
##   AWAY_TEAM      1392  526        41
##   DRAW            195 1318       171
##   HOME_TEAM        47  839      1848
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7148          
##                  95% CI : (0.7035, 0.7258)
##     No Information Rate : 0.4207          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5753          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: AWAY_TEAM Class: DRAW Class: HOME_TEAM
## Sensitivity                    0.8519      0.4912           0.8971
## Specificity                    0.8805      0.9009           0.7948
## Pos Pred Value                 0.7106      0.7827           0.6759
## Neg Pred Value                 0.9452      0.7091           0.9418
## Prevalence                     0.2562      0.4207           0.3230
## Detection Rate                 0.2183      0.2067           0.2898
## Detection Prevalence           0.3072      0.2641           0.4287
## Balanced Accuracy              0.8662      0.6961           0.8459
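
The headline figures can be recomputed from the nine cells of the matrix. The Python sketch below (illustrative, not the R code that produced the output) derives the accuracy and Cohen's kappa, then computes the per-class ratios both along the rows and along the columns, because the output does not state which axis holds the true labels.

```python
LABELS = ["AWAY_TEAM", "DRAW", "HOME_TEAM"]

# Cells copied from the confusion matrix above; columns are labelled y_pred.
MATRIX = [
    [1392, 526, 41],
    [195, 1318, 171],
    [47, 839, 1848],
]

n = sum(sum(row) for row in MATRIX)
diagonal = [MATRIX[i][i] for i in range(len(LABELS))]
row_totals = [sum(row) for row in MATRIX]
col_totals = [sum(col) for col in zip(*MATRIX)]

# Overall accuracy: correctly classified matches over all test matches.
accuracy = sum(diagonal) / n

# Cohen's kappa: agreement beyond what the marginal totals give by chance.
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
kappa = (accuracy - expected) / (1 - expected)

# Per-class ratios along each axis of the table.
row_ratios = [d / t for d, t in zip(diagonal, row_totals)]
col_ratios = [d / t for d, t in zip(diagonal, col_totals)]

print(n, round(accuracy, 4), round(kappa, 4))
for label, by_row, by_col in zip(LABELS, row_ratios, col_ratios):
    print(f"{label}: row-wise {by_row:.4f}, column-wise {by_col:.4f}")
```

The column-wise ratios reproduce the "Sensitivity" row of the output and the row-wise ratios reproduce "Pos Pred Value". If the rows hold the actual outcomes, as the `y_pred` column label suggests, the two names would be the other way round: the DRAW row makes up about 26.4% of the test set (1,684 of 6,377), which matches the share of draws in the full dataset rather than the reported prevalence of 0.4207.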

Deep learning model using H2O

We used the “H2O” package, which provides deep learning models. To let H2O tune the models, we split the training data into three subsets, one each for training, validation and testing, in a 70:15:15 ratio.
We built two deep learning models, model 1 and model 2, and tuned model 2 by increasing the number of hidden layers and limiting the data size.
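
A 70:15:15 split can be sketched as follows in Python. This is not H2O's own splitting routine and not the code used for this report. The helper name `three_way_split`, the seed, and the assumption that the 14,883 training rows are the ones being split are ours.

```python
import random


def three_way_split(n_rows, fractions=(0.70, 0.15, 0.15), seed=42):
    """Shuffle row indices and cut them into train/validation/test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(fractions[0] * n_rows)
    n_valid = int(fractions[1] * n_rows)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # the remainder goes to the test part
    return train, valid, test


# Assumption: the 14,883 rows of the training partition are split further.
train, valid, test = three_way_split(14883)
print(len(train), len(valid), len(test))  # 10418 2232 2233
```

Shuffling before cutting keeps each subset representative when the source rows are ordered, for example by season or league.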

Summary

After splitting the dataset into training and testing data (nominally 80%:20%), we applied the Naive Bayes model to the test data and obtained an accuracy of 71.48% (about 72%).
We then fitted the deep learning models on training and validation subsets drawn from the training portion of the original data. As the number of epochs increased, accuracy on the test data appeared to improve relative to the training and validation data.
For deep learning model 1, we ran 100, 500 and 1000 epochs and obtained RMSE values of 41%, 39% and 39.2%, with r2 values of 59%, 61% and 60.8%, respectively.
For deep learning model 2, we ran 20, 50 and 100 epochs and obtained RMSE values of 44%, 42% and 36%, with r2 values of 56%, 58% and 64%, respectively.
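
In every run above, the reported r2 value equals 100% minus the reported RMSE (for example 41% and 59%). The standard definitions of the two metrics do not imply that relationship, so it may be worth checking how the values were produced. A minimal Python sketch of the standard definitions, on made-up numbers that are not taken from our models:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Illustrative values only.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

error = rmse(actual, predicted)
fit = r_squared(actual, predicted)
print(round(error, 3), round(fit, 3))  # 0.158 0.98
print(round(1 - error, 3))             # 0.842, which differs from R squared
```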

Football - Match Win Prediction
