Ashok Kumar Rayapati | Naveen Kumar Kalluri | Varun Sai Rachulapally
04-27-2022
Predicting a future winner is always exciting. Our initial proposal was to predict the probability of a team winning a match based on its previous performances.
With this, we intend to address the problems faced by investors and organizations that invest large sums of money in teams.
Most of the comments we received on our project proposal concerned the match data: would we consider only particular seasons or years? After taking the suggestions into consideration, we decided to use the complete match data.
Some comments asked why we chose H2O over Keras. We chose H2O because it is simpler to use than Keras: we did not plan to use neural networks, for which Keras works efficiently, and our dataset is not large.
Many comments asked which AI models we planned to use; we implemented Naive Bayes and deep learning models.
| Odds.homeWin | Odds.draw | Odds.awayWin | halfTime_score_Home | halfTime_score_Away | fullTime_score_Home | fullTime_score_Away | homeTeam | awayTeam | winner |
|---|---|---|---|---|---|---|---|---|---|
| 2.01 | 3.24 | 3.76 | 0 | 2 | 0 | 2 | Fortaleza EC | CA Paranaense | AWAY_TEAM |
| 3.77 | 3.06 | 2.10 | 0 | 1 | 0 | 1 | Coritiba FBC | SC Internacional | AWAY_TEAM |
| 2.90 | 2.93 | 2.60 | 3 | 2 | 3 | 2 | SC Recife | Ceará SC | HOME_TEAM |
| 2.14 | 3.24 | 3.39 | 0 | 1 | 1 | 1 | Santos FC | RB Bragantino | DRAW |
| 1.46 | 4.46 | 6.29 | 0 | 1 | 0 | 1 | CR Flamengo | CA Mineiro | AWAY_TEAM |
The bar chart shows the number of matches won by the home team, won by the away team, and drawn.
From the bar chart we can infer that, of the 21,260 matches played in total, the home team won about 9,115, the away team won about 6,531, and about 5,614 were draws.
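As a quick check on the figures above, the three outcome counts can be converted into shares of all matches. This is a minimal Python sketch for illustration only; the label names are borrowed from the `winner` column, and the variable names are assumptions rather than anything from the project's own code.

```python
# Match outcome counts as quoted in the text above.
counts = {"HOME_TEAM": 9115, "AWAY_TEAM": 6531, "DRAW": 5614}

# The three outcomes should add up to the full dataset.
total = sum(counts.values())

# Share of each outcome as a percentage of all matches.
shares = {label: 100 * n / total for label, n in counts.items()}

print(f"total matches: {total}")
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```

The counts add up to 21,260, and home wins account for roughly 43% of matches, so the three classes are not evenly balanced.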
The bar chart shows the top home winners by number of matches won.
From the bar chart we can conclude that CA Mineiro stands out as the top home winner with 310 wins, followed by CR Flamengo with 240 wins and Fluminense FC with 220 wins, and then by 9 other teams ranked by their individual win totals.
The bar chart shows the top away winners by number of matches won.
From the bar chart we can conclude that CR Flamengo stands out as the top away winner with 180 wins, followed by AC Milan with 154 wins and SE Palmeiras with 150 wins, and then by 10 other teams ranked by their individual win totals.
The bar chart shows the top three teams and their match results when they played at home and away, along with the number of matches drawn.
From the bar chart we can infer that the top team, CA Mineiro, won about 310 matches at home, drew about 50, and lost about 20 when playing away. CR Flamengo follows, with about 240 home wins, about 50 draws, and about 90 losses when playing away. Fluminense FC is third, with about 220 home wins, about 100 draws, and about 50 losses when playing away.
After data cleaning, wrangling, and updating, the final dataset had 21,260 records, which we split and used to build our AI/ML models.
The dataset is divided into training and testing data in a stated ratio of 80%:20%; note that the row counts printed below (14,883 and 6,377) correspond to roughly 70%:30% of the 21,260 records.
After splitting the dataset, we performed data transformation and pre-processing using the “recipes” package, which transformed our data for ML modelling; we then prepared and baked the recipe to transform the outcome variable (y).
## [1] 14883
## [1] 6377
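The two counts printed above can be checked against the 21,260-record total. The sketch below is plain Python arithmetic, assuming the first number is the training row count and the second the testing row count (the confusion matrix later in the report also totals 6,377); the variable names are illustrative.

```python
total_records = 21260

# Assumption: first printed count = training rows, second = testing rows.
train_rows, test_rows = 14883, 6377

# The two parts should account for every record.
covers_all = (train_rows + test_rows == total_records)

train_share = train_rows / total_records
test_share = test_rows / total_records

print(f"covers all records: {covers_all}")
print(f"train share: {train_share:.3f}, test share: {test_share:.3f}")
```

On these counts the shares come out at about 0.70 and 0.30, which is closer to a 70:30 split than to the 80:20 ratio stated in the text.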
For prediction, we used Naive Bayes, a supervised machine learning algorithm, and two deep learning models (models 1 and 2).
We used the “e1071” package, which provides the Naive Bayes training function. After training the model on the training data, we tested it by making predictions against the testing data and achieved an accuracy of 71.48%.
Confusion Matrix and Statistics (columns labelled `y_pred`)

| | AWAY_TEAM | DRAW | HOME_TEAM |
|---|---|---|---|
| AWAY_TEAM | 1392 | 526 | 41 |
| DRAW | 195 | 1318 | 171 |
| HOME_TEAM | 47 | 839 | 1848 |

Overall Statistics

| Statistic | Value |
|---|---|
| Accuracy | 0.7148 |
| 95% CI | (0.7035, 0.7258) |
| No Information Rate | 0.4207 |
| P-Value [Acc > NIR] | < 2.2e-16 |
| Kappa | 0.5753 |
| Mcnemar's Test P-Value | < 2.2e-16 |

Statistics by Class

| Statistic | Class: AWAY_TEAM | Class: DRAW | Class: HOME_TEAM |
|---|---|---|---|
| Sensitivity | 0.8519 | 0.4912 | 0.8971 |
| Specificity | 0.8805 | 0.9009 | 0.7948 |
| Pos Pred Value | 0.7106 | 0.7827 | 0.6759 |
| Neg Pred Value | 0.9452 | 0.7091 | 0.9418 |
| Prevalence | 0.2562 | 0.4207 | 0.3230 |
| Detection Rate | 0.2183 | 0.2067 | 0.2898 |
| Detection Prevalence | 0.3072 | 0.2641 | 0.4287 |
| Balanced Accuracy | 0.8662 | 0.6961 | 0.8459 |

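The headline accuracy and kappa values in the output above can be reproduced by hand from the nine cell counts. The following is a minimal Python sketch, not the project's R code; it assumes only that the diagonal cells are the correct predictions, and both metrics come out the same whichever axis holds the predicted labels.

```python
# Cell counts copied from the confusion matrix above.
# Accuracy and Cohen's kappa do not depend on which axis is "predicted".
labels = ["AWAY_TEAM", "DRAW", "HOME_TEAM"]
matrix = [
    [1392, 526, 41],
    [195, 1318, 171],
    [47, 839, 1848],
]

n = sum(sum(row) for row in matrix)                       # all test rows
correct = sum(matrix[i][i] for i in range(len(labels)))   # diagonal cells
accuracy = correct / n

# Agreement expected by chance, from the row and column totals.
row_totals = [sum(row) for row in matrix]
col_totals = [sum(col) for col in zip(*matrix)]
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2

# Cohen's kappa: observed agreement corrected for chance agreement.
kappa = (accuracy - expected) / (1 - expected)

print(f"n = {n}, accuracy = {accuracy:.4f}, kappa = {kappa:.4f}")
```

The total of 6,377 matches the size of the testing set, and the computed accuracy and kappa agree with the reported 0.7148 and 0.5753.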
We used the “h2o” package, which provides deep learning models. To fit the models, we split the training data into three subsets, one for training, one for validation, and one for testing, in a 70:15:15 ratio so that H2O could tune the model.
We used two deep learning models, deep learning model 1 and deep learning model 2. We tuned deep learning model 2 by increasing the hidden layers and limiting the data size.
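The 70:15:15 split described above can be illustrated with a small index-based sketch. This is Python rather than the project's R/H2O code, which is not shown; the function name `split_indices`, the seed, and the use of 14,883 as the training row count are all assumptions made for the example.

```python
import random


def split_indices(n_rows, fractions=(0.70, 0.15, 0.15), seed=42):
    """Shuffle row indices and cut them into consecutive parts.

    Every part except the last gets floor(n_rows * fraction) rows;
    the last part receives whatever remains after rounding.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    parts, start = [], 0
    for fraction in fractions[:-1]:
        size = int(n_rows * fraction)
        parts.append(indices[start:start + size])
        start += size
    parts.append(indices[start:])
    return parts


# Hypothetical usage: 14,883 is assumed to be the training row count.
train_idx, valid_idx, test_idx = split_indices(14883)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Shuffling before slicing keeps the three subsets disjoint while making each one a random sample of the rows.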
After splitting the dataset into 80% training data and the rest as testing data, we applied the Naive Bayes model to the test data and obtained an accuracy of about 72% (71.48%).
We then fitted the deep learning models on the training and validation data obtained from the 80% training portion; as the number of epochs increases, accuracy on the test data appears to improve relative to the training and validation data.
We ran deep learning model 1 for 100, 500, and 1000 epochs and obtained RMSE values of 41%, 39%, and 39.2%, with R² values of 59%, 61%, and 60.8%, respectively.
We ran deep learning model 2 for 20, 50, and 100 epochs and obtained RMSE values of 44%, 42%, and 36%, with R² values of 56%, 58%, and 64%, respectively.
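For readers less familiar with the two metrics reported above, the sketch below shows their general textbook definitions in Python. How H2O derives RMSE and R² for a multi-class model is not described in the text, so this is only an illustration of the formulas; the function names and the toy values are hypothetical and are not taken from the project's data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical toy values, chosen only to exercise the formulas.
actual = [1.0, 0.0, 1.0, 0.0]
predicted = [0.8, 0.2, 0.6, 0.4]

toy_rmse = rmse(actual, predicted)
toy_r2 = r_squared(actual, predicted)
print(f"RMSE = {toy_rmse:.3f}, R^2 = {toy_r2:.3f}")
```

A lower RMSE and a higher R² both indicate predictions that sit closer to the observed values.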