We decided to use the FIFA World Cup Matches dataset to predict the winner of the 2018 World Cup.

Reading the Dataset

Iaquinta = read.csv("C:\\Users\\student\\Desktop\\MATH 421\\Math 421 Final Project\\WorldCupMatches.csv")
library(ggplot2)
library(caret)
## Loading required package: lattice
library(rpart)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(lattice)
summary(Iaquinta)
##       Year                      Datetime               Stage     
##  Min.   :1930                       :3720                 :3720  
##  1st Qu.:1970   27 May 1934 - 16:30 :   8   Round of 16   :  72  
##  Median :1990   08 Jun 1958 - 19:00 :   7   Quarter-finals:  66  
##  Mean   :1985   11 Jun 1958 - 19:00 :   7   Group 1       :  62  
##  3rd Qu.:2002   15 Jun 1958 - 19:00 :   7   Group A       :  60  
##  Max.   :2014   02 Jul 1950 - 15:00 :   4   Group B       :  60  
##  NA's   :3720   (Other)             : 819   (Other)       : 532  
##                       Stadium                  City         Home.Team.Name
##                           :3720                  :3720             :3720  
##  Estadio Azteca           :  19   Mexico City    :  23   Brazil    :  82  
##  Jalisco                  :  14   Montevideo     :  18   Italy     :  57  
##  Olympiastadion           :  14   Rio De Janeiro :  18   Argentina :  54  
##  Nou Camp - Estadio León  :  11   Guadalajara    :  17   Germany FR:  43  
##  Estadio Centenario       :  10   Johannesburg   :  15   England   :  35  
##  (Other)                  : 784   (Other)        : 761   (Other)   : 581  
##  Home.Team.Goals  Away.Team.Goals   Away.Team.Name
##  Min.   : 0.000   Min.   :0.000            :3720  
##  1st Qu.: 1.000   1st Qu.:0.000   Mexico   :  38  
##  Median : 2.000   Median :1.000   France   :  30  
##  Mean   : 1.811   Mean   :1.022   Spain    :  29  
##  3rd Qu.: 3.000   3rd Qu.:2.000   Argentina:  27  
##  Max.   :10.000   Max.   :7.000   England  :  27  
##  NA's   :3720     NA's   :3720    (Other)  : 701  
##                          Win.conditions   Attendance    
##                                 :3720   Min.   :  2000  
##                                 : 787   1st Qu.: 30000  
##  Italy win after extra time     :   5   Median : 41580  
##  Argentina win after extra time :   4   Mean   : 45165  
##  Germany win after extra time   :   4   3rd Qu.: 61375  
##  Belgium win after extra time   :   3   Max.   :173850  
##  (Other)                        :  49   NA's   :3722    
##  Half.time.Home.Goals Half.time.Away.Goals                   Referee    
##  Min.   :0.000        Min.   :0.000                              :3720  
##  1st Qu.:0.000        1st Qu.:0.000        Ravshan IRMATOV (UZB) :  10  
##  Median :0.000        Median :0.000        ARCHUNDIA Benito (MEX):   8  
##  Mean   :0.709        Mean   :0.428        LARRIONDA Jorge (URU) :   8  
##  3rd Qu.:1.000        3rd Qu.:1.000        QUINIOU Joel (FRA)    :   8  
##  Max.   :6.000        Max.   :5.000        RODRIGUEZ Marco (MEX) :   8  
##  NA's   :3720         NA's   :3720         (Other)               : 810  
##                            Assistant.1                     Assistant.2  
##                                  :3720                           :3720  
##  ACHIK Redouane (MAR)            :   7   KOCHKAROV Bakhadyr (KGZ):  10  
##  BERANEK Alois (AUT)             :   7   LISTKIEWICZ Michal (POL):   7  
##  GONZALEZ ARCHUNDIA Alfonso (MEX):   7   VERGARA Hector (CAN)    :   7  
##  HERMANS Peter (BEL)             :   7   VROMANS Walter (BEL)    :   7  
##  VERGARA Hector (CAN)            :   7   YUSTE Juan (ESP)        :   7  
##  (Other)                         : 817   (Other)                 : 814  
##     RoundID            MatchID          Home.Team.Initials
##  Min.   :     201   Min.   :       25          :3720      
##  1st Qu.:     262   1st Qu.:     1189   BRA    :  82      
##  Median :     337   Median :     2191   ITA    :  57      
##  Mean   :10661773   Mean   : 61346868   ARG    :  54      
##  3rd Qu.:  249722   3rd Qu.: 43950059   FRG    :  43      
##  Max.   :97410600   Max.   :300186515   ENG    :  35      
##  NA's   :3720       NA's   :3720        (Other): 581      
##  Away.Team.Initials
##         :3720      
##  MEX    :  38      
##  FRA    :  30      
##  ESP    :  29      
##  ARG    :  27      
##  ENG    :  27      
##  (Other): 701

Removing unneeded columns

Iaquinta[,"RoundID"] = NULL
Iaquinta[,"MatchID"] = NULL
Iaquinta[,"Referee"] = NULL
Iaquinta[,"Assistant.1"] = NULL
Iaquinta[,"Assistant.2"] = NULL
Iaquinta[,"Datetime"] = NULL
Iaquinta[,"Home.Team.Initials"] = NULL
Iaquinta[,"Away.Team.Initials"] = NULL

Checking for missing values

sum(is.na(Iaquinta))
## [1] 22322
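
The summary above explains this count: Year, Home.Team.Goals, Away.Team.Goals, Half.time.Home.Goals and Half.time.Away.Goals each have 3,720 missing values and Attendance has 3,722, giving 5 x 3,720 + 3,722 = 22,322. A per-column breakdown can be obtained as sketched below (not run in the original report):

# Sketch: number of missing values in each column
colSums(is.na(Iaquinta))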

DISCUSSION ON MISSING VALUES

Handling missing values (1) - A9Q4: missing values of numeric variables are replaced by the column mean, and missing values of categorical variables are replaced by the most frequent category in the variable

AL=function(x){
  for (i in 1:ncol(x)){
    if (is.numeric(x[,i])){
      # Numeric column: replace NAs with the mean of the non-missing values
      x[,i][is.na(x[,i])]=mean(x[,i], na.rm=TRUE)
    }else{
      # Categorical column: replace NAs with the most frequent level
      levels=unique(x[,i])
      x[,i][is.na(x[,i])]=levels[which.max(tabulate(match(x[,i], levels)))]
    }
  }
  return (x)
}
Iaquinta <- AL(Iaquinta)

Commenting on the result

sum(is.na(Iaquinta))
## [1] 0
#We had 22,322 missing values to start with
#This method brings the number of missing values to 0

Handling Missing Values (2) - A4Q4: take a data frame and return a data frame with missing numeric values replaced by the mean of the corresponding column.

AL2 <- function(x) {
  lee <- ncol(x)
  for (i in 1:lee) {
    if (is.numeric(x[[i]])) {
      # Replace missing numeric values with the column mean
      x[[i]][is.na(x[[i]])] <- mean(x[[i]], na.rm = TRUE)
    }
  }
  return(x)
}
sum(is.na(Iaquinta))
## [1] 0
#The count is unchanged (still 0): the missing numeric values were already imputed in step (1)

Handling Missing Values (3) - A17: missing values of numeric variables are replaced by the means of the non-missing values in the variables, and missing values of categorical variables by the most frequent level

Iaquinta22=function(x){
  for (i in 1:ncol(x)){
    if (is.numeric(x[,i])){
      x[,i][is.na(x[,i])]=mean(x[,i], na.rm=TRUE)
    }else{
      levels=unique(x[,i])
      x[,i][is.na(x[,i])]=levels[which.max(tabulate(match(x[,i], levels)))]
    }
  }
  return (x)
}
Iaquinta <- Iaquinta22(Iaquinta)
sum(is.na(Iaquinta))
## [1] 0
#The count is still 0: all missing values were already imputed in step (1)

Collapsing the levels of Stage

levels(Iaquinta$Stage)
##  [1] ""                         "Final"                   
##  [3] "First round"              "Group 1"                 
##  [5] "Group 2"                  "Group 3"                 
##  [7] "Group 4"                  "Group 5"                 
##  [9] "Group 6"                  "Group A"                 
## [11] "Group B"                  "Group C"                 
## [13] "Group D"                  "Group E"                 
## [15] "Group F"                  "Group G"                 
## [17] "Group H"                  "Match for third place"   
## [19] "Play-off for third place" "Preliminary round"       
## [21] "Quarter-finals"           "Round of 16"             
## [23] "Semi-finals"              "Third place"
levels(Iaquinta$Stage)=c("Prelim", "Final", "Prelim", "Prelim", "Prelim", "Prelim",
                         "Prelim", "Prelim", "Prelim", "Prelim", "Prelim", "Prelim",
                         "Prelim", "Prelim", "Prelim", "Prelim", "Prelim",
                         "Semi_Final", "Semi_Final", "Prelim", "Quarter_Final",
                         "Round_16", "Semi_Final", "Semi_Final")
levels(Iaquinta$Stage)
## [1] "Prelim"        "Final"         "Semi_Final"    "Quarter_Final"
## [5] "Round_16"
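
The recode above is positional, so it depends on the exact order returned by levels(). An equivalent name-based recode of the original Stage factor is sketched below (an alternative, not part of the original script; it assumes a recent version of the forcats package):

# Sketch: collapse Stage by level name instead of by position
library(forcats)
Iaquinta$Stage = fct_collapse(Iaquinta$Stage,
                              Final         = "Final",
                              Quarter_Final = "Quarter-finals",
                              Round_16      = "Round of 16",
                              Semi_Final    = c("Semi-finals", "Third place",
                                                "Match for third place",
                                                "Play-off for third place"),
                              other_level   = "Prelim")  # other_level needs a recent forcats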

Encoding/Recoding Categorical Variables

Recoding categorical variable using one hot encoding (dummy encoding)- Q5A11

dummies_model <- dummyVars(Year ~., data=Iaquinta)
trainData_mat <- predict(dummies_model, newdata =Iaquinta)

trainData <- data.frame(trainData_mat)
trainData$Year <- Iaquinta$Year

This helps the model assign each match to its corresponding World Cup year.

dummies_model <- dummyVars(Away.Team.Goals ~., data=Iaquinta)
trainData_mat <- predict(dummies_model, newdata =Iaquinta)

trainData <- data.frame(trainData_mat)
trainData$Away.Team.Goals <- Iaquinta$Away.Team.Goals

This helps the model distinguish whether the goals scored belong to the home team or the away team.
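
A quick way to check what dummyVars produced is to look at the size and the first few column names of the encoded frame (a sketch; the exact column count depends on the factor levels):

# Sketch: each factor level becomes its own 0/1 indicator column
dim(trainData)
head(colnames(trainData), 10)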

VISUALIZATION AND GRAPHS

library(ggplot2)
ggplot(data = Iaquinta) + geom_density(mapping = aes(x = Attendance, fill = Stage)) + facet_wrap(~Stage)

This graph shows that the attendance for the preliminary rounds is mostly below 50,000 people.

For the round of 16, attendance is more spread out, but it is concentrated between 25,000 and 75,000.

For the quarter-finals, the attendance is also widely spread, with no single value standing out.

For the semi-finals, the attendance is concentrated around 70,000.

For the final, the attendance peaks around 75,000.

Overall, the attendance varies with the teams that are playing and the capacity of the stadium.
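
These impressions can be checked numerically; a sketch of the average attendance per stage (not part of the original report):

# Sketch: mean attendance by stage
aggregate(Attendance ~ Stage, data = Iaquinta, FUN = mean)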

library(ggplot2)
ggplot(data = Iaquinta) + geom_density(mapping = aes(x = Home.Team.Goals, fill = Stage)) + facet_wrap(~Stage)

This graph shows the number of goals the Home Team scores during the match.

library(ggplot2)
ggplot(data = Iaquinta) + geom_density(mapping = aes(x = Away.Team.Goals, fill = Stage)) + facet_wrap(~Stage)

This graph shows the number of goals the Away team has scored during the match.

library(ggplot2)
ggplot(data = Iaquinta) + geom_density(mapping = aes(x = Attendance, fill = Year)) + facet_wrap(~Year)

This graph shows the overall attendance during each edition of the World Cup; we can see that from 1930 to 1938 the World Cup became more popular. Because of World War II there was no World Cup in 1942 or 1946; attendance restarted slowly in 1950 and the tournament regained popularity afterward.

library(ggplot2)
ggplot(data = Iaquinta) + geom_density(mapping = aes(x = Half.time.Home.Goals, fill = Stage)) + facet_wrap(~Stage)

This graph shows the number of goals the Home teams have scored after 45 minutes.

library(ggplot2)
ggplot(data = Iaquinta) + geom_density(mapping = aes(x = Half.time.Away.Goals, fill = Stage)) + facet_wrap(~Stage)

This graph shows the number of goals the away teams have scored after 45 minutes.

# Bar chart of column var1, filled by column var2 (columns passed by index)
Al22 <- function(Iaquinta,var1,var2) {
  rt = ggplot(data=Iaquinta) + geom_bar(mapping = aes(x = Iaquinta[,var1], fill = Iaquinta[,var2]), position = "dodge")
  return(rt)
}

Al22(Iaquinta, 2, 2)

This graph shows the number of observations we have by stage; as expected, there are more observations for the preliminary rounds than for any other stage, and fewer for the final than for any other stage.
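
A small variant of the helper that labels the axes with the actual column names is sketched below (Al22b is a hypothetical name, not part of the original code):

# Sketch: same bar chart, but with readable axis and legend labels
Al22b <- function(df, var1, var2) {
  ggplot(data = df) +
    geom_bar(mapping = aes(x = df[, var1], fill = df[, var2]), position = "dodge") +
    xlab(names(df)[var1]) +
    labs(fill = names(df)[var2])
}
# Al22b(Iaquinta, 2, 2)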

# Density of column var1, filled by column var2 (columns passed by index)
Al23 <- function(Iaquinta,var1,var2) {
  rt = ggplot(data=Iaquinta) + geom_density(mapping = aes(x = Iaquinta[,var1], fill = Iaquinta[,var2]), position = "dodge")
  return(rt)
}

Al23(Iaquinta, 2, 2)
## Warning: Width not defined. Set with `position_dodge(width = ?)`

This graph shows that, in terms of the density of observations, the prelims dominate every other category.

# Histogram of column var1, filled by column var2 (columns passed by index)
Al24 <- function(Iaquinta,var1,var2) {
  rt = ggplot(data=Iaquinta) + geom_histogram(mapping = aes(x = Iaquinta[,var1], fill = Iaquinta[,var2]), position = "dodge")
  return(rt)
}

Al24(Iaquinta, 11, 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From another angle, this graph again shows that, in terms of the number of observations, the prelims dominate every other category.

# Density plot of every numeric column in the data frame
Al25=function(x){
  for (i in 1:ncol(x)){
    if (is.numeric(x[,i])){
      print(ggplot(data=x)+geom_density(mapping=aes(x=x[,i]))+xlab(names(x)[i]))
    }
  }
}
Al25(Iaquinta)

These graphs show the densities of the numeric variables such as attendance, home team goals, and away team goals. It is interesting to see that, during a match, the home team most frequently scores 2 goals while the away team most frequently scores 1.

# Histogram of every numeric column in the data frame
Al26= function(x){
  for (i in 1:ncol(x)){
    if (is.numeric(x[,i])){
      print(ggplot(data=x)+geom_histogram(mapping=aes(x=x[,i]),fill="red")+xlab(names(x)[i]))
    }
  }
}
Al26(Iaquinta)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

These graphs show the number of observations for the numeric variables, such as the goals scored by the home and away teams at half-time and at full time.
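
The repeated stat_bin() messages above come from not specifying a bin width; a variant that fixes the number of bins, which silences them, is sketched below (Al26b is a hypothetical name, not part of the original code):

# Sketch: histograms with an explicit bin count
Al26b <- function(x) {
  for (i in 1:ncol(x)) {
    if (is.numeric(x[, i])) {
      print(ggplot(data = x) +
              geom_histogram(mapping = aes(x = x[, i]), bins = 20, fill = "red") +
              xlab(names(x)[i]))
    }
  }
}
# Al26b(Iaquinta)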

Model Training and Model Tuning

Random Forest

AL5 <- expand.grid(mtry = 3, splitrule = c("gini"),
                     min.node.size = 5)

AL6 <- train(Stage ~ ., data = Iaquinta, method = "ranger",
               trControl = trainControl(method ="cv", 
                                        number = 3, verboseIter = TRUE),
               tuneGrid = AL5)
## + Fold1: mtry=3, splitrule=gini, min.node.size=5 
## - Fold1: mtry=3, splitrule=gini, min.node.size=5 
## + Fold2: mtry=3, splitrule=gini, min.node.size=5 
## - Fold2: mtry=3, splitrule=gini, min.node.size=5 
## + Fold3: mtry=3, splitrule=gini, min.node.size=5 
## - Fold3: mtry=3, splitrule=gini, min.node.size=5 
## Aggregating results
## Fitting final model on full training set
confusionMatrix(AL6)
## Cross-Validated (3 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##                Reference
## Prediction      Prelim Final Semi_Final Quarter_Final Round_16
##   Prelim          95.3   0.4        1.2           1.4      1.6
##   Final            0.0   0.0        0.0           0.0      0.0
##   Semi_Final       0.0   0.0        0.0           0.0      0.0
##   Quarter_Final    0.0   0.0        0.0           0.0      0.0
##   Round_16         0.0   0.0        0.0           0.0      0.0
##                             
##  Accuracy (average) : 0.9534
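
The confusion matrix shows that the model predicts Prelim for every match, so the 0.9534 accuracy is just the share of Prelim matches in the data. That baseline can be checked directly (a sketch, not in the original report):

# Sketch: class proportions -- the Prelim share matches the reported accuracy
round(prop.table(table(Iaquinta$Stage)), 4)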

GLMNET

# Commented-out glmnet template from a binary-classification exercise; `target` and
# `train` are not defined in this script, so the block is kept for reference only.
#myGrid = expand.grid(alpha = 0.1,
#                     lambda = 0.1)

#myControl = trainControl(method = "cv", number = 5)

#model2 = train(target ~ ., train, method = "glmnet", 
#               trControl = myControl,
#               tuneGrid = myGrid)
#confusionMatrix(model2)

#Cross-Validated (5 fold) Confusion Matrix 

#(entries are percentual average cell counts across resamples)
 
#          Reference
#Prediction    0    1
#         0 91.2  8.0
#         1  0.0  0.8
                          
# Accuracy (average) : 0.92
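
The template above refers to a different data set (a binary target), so it cannot be run here as written. A minimal sketch of the same caret workflow adapted to this data is shown below (an assumption, not part of the original analysis; the alpha and lambda values simply mirror the template and are not tuned):

# Sketch: multinomial glmnet on Stage with 5-fold cross-validation
glmnetGrid = expand.grid(alpha = 0.1, lambda = 0.1)
glmnetControl = trainControl(method = "cv", number = 5)
# glmnet dummy-encodes the factor predictors internally, so the high-cardinality
# team, city and stadium columns make this a fairly heavy fit
glmnet_Iaquinta = train(Stage ~ ., data = Iaquinta, method = "glmnet",
                        trControl = glmnetControl, tuneGrid = glmnetGrid)
confusionMatrix(glmnet_Iaquinta)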

Random forest with 10-fold cross-validation

myGrid = expand.grid(mtry = c(1:2), splitrule = c("gini"),
                     min.node.size = c(1:2))

rf_Iaquinta10 <- train(Stage~.,data = Iaquinta, method = "ranger", 
               trControl = trainControl(method ="cv", number = 10, verboseIter = TRUE),
               tuneGrid = myGrid)
## + Fold01: mtry=1, splitrule=gini, min.node.size=1 
## - Fold01: mtry=1, splitrule=gini, min.node.size=1 
## + Fold01: mtry=2, splitrule=gini, min.node.size=1 
## - Fold01: mtry=2, splitrule=gini, min.node.size=1 
## + Fold01: mtry=1, splitrule=gini, min.node.size=2 
## - Fold01: mtry=1, splitrule=gini, min.node.size=2 
## + Fold01: mtry=2, splitrule=gini, min.node.size=2 
## - Fold01: mtry=2, splitrule=gini, min.node.size=2 
## + Fold02: mtry=1, splitrule=gini, min.node.size=1 
## - Fold02: mtry=1, splitrule=gini, min.node.size=1 
## + Fold02: mtry=2, splitrule=gini, min.node.size=1 
## - Fold02: mtry=2, splitrule=gini, min.node.size=1 
## + Fold02: mtry=1, splitrule=gini, min.node.size=2 
## - Fold02: mtry=1, splitrule=gini, min.node.size=2 
## + Fold02: mtry=2, splitrule=gini, min.node.size=2 
## - Fold02: mtry=2, splitrule=gini, min.node.size=2 
## + Fold03: mtry=1, splitrule=gini, min.node.size=1 
## - Fold03: mtry=1, splitrule=gini, min.node.size=1 
## + Fold03: mtry=2, splitrule=gini, min.node.size=1 
## - Fold03: mtry=2, splitrule=gini, min.node.size=1 
## + Fold03: mtry=1, splitrule=gini, min.node.size=2 
## - Fold03: mtry=1, splitrule=gini, min.node.size=2 
## + Fold03: mtry=2, splitrule=gini, min.node.size=2 
## - Fold03: mtry=2, splitrule=gini, min.node.size=2 
## + Fold04: mtry=1, splitrule=gini, min.node.size=1 
## - Fold04: mtry=1, splitrule=gini, min.node.size=1 
## + Fold04: mtry=2, splitrule=gini, min.node.size=1 
## - Fold04: mtry=2, splitrule=gini, min.node.size=1 
## + Fold04: mtry=1, splitrule=gini, min.node.size=2 
## - Fold04: mtry=1, splitrule=gini, min.node.size=2 
## + Fold04: mtry=2, splitrule=gini, min.node.size=2 
## - Fold04: mtry=2, splitrule=gini, min.node.size=2 
## + Fold05: mtry=1, splitrule=gini, min.node.size=1 
## - Fold05: mtry=1, splitrule=gini, min.node.size=1 
## + Fold05: mtry=2, splitrule=gini, min.node.size=1 
## - Fold05: mtry=2, splitrule=gini, min.node.size=1 
## + Fold05: mtry=1, splitrule=gini, min.node.size=2 
## - Fold05: mtry=1, splitrule=gini, min.node.size=2 
## + Fold05: mtry=2, splitrule=gini, min.node.size=2 
## - Fold05: mtry=2, splitrule=gini, min.node.size=2 
## + Fold06: mtry=1, splitrule=gini, min.node.size=1 
## - Fold06: mtry=1, splitrule=gini, min.node.size=1 
## + Fold06: mtry=2, splitrule=gini, min.node.size=1 
## - Fold06: mtry=2, splitrule=gini, min.node.size=1 
## + Fold06: mtry=1, splitrule=gini, min.node.size=2 
## - Fold06: mtry=1, splitrule=gini, min.node.size=2 
## + Fold06: mtry=2, splitrule=gini, min.node.size=2 
## - Fold06: mtry=2, splitrule=gini, min.node.size=2 
## + Fold07: mtry=1, splitrule=gini, min.node.size=1 
## - Fold07: mtry=1, splitrule=gini, min.node.size=1 
## + Fold07: mtry=2, splitrule=gini, min.node.size=1 
## - Fold07: mtry=2, splitrule=gini, min.node.size=1 
## + Fold07: mtry=1, splitrule=gini, min.node.size=2 
## - Fold07: mtry=1, splitrule=gini, min.node.size=2 
## + Fold07: mtry=2, splitrule=gini, min.node.size=2 
## - Fold07: mtry=2, splitrule=gini, min.node.size=2 
## + Fold08: mtry=1, splitrule=gini, min.node.size=1 
## - Fold08: mtry=1, splitrule=gini, min.node.size=1 
## + Fold08: mtry=2, splitrule=gini, min.node.size=1 
## - Fold08: mtry=2, splitrule=gini, min.node.size=1 
## + Fold08: mtry=1, splitrule=gini, min.node.size=2 
## - Fold08: mtry=1, splitrule=gini, min.node.size=2 
## + Fold08: mtry=2, splitrule=gini, min.node.size=2 
## - Fold08: mtry=2, splitrule=gini, min.node.size=2 
## + Fold09: mtry=1, splitrule=gini, min.node.size=1 
## - Fold09: mtry=1, splitrule=gini, min.node.size=1 
## + Fold09: mtry=2, splitrule=gini, min.node.size=1 
## - Fold09: mtry=2, splitrule=gini, min.node.size=1 
## + Fold09: mtry=1, splitrule=gini, min.node.size=2 
## - Fold09: mtry=1, splitrule=gini, min.node.size=2 
## + Fold09: mtry=2, splitrule=gini, min.node.size=2 
## - Fold09: mtry=2, splitrule=gini, min.node.size=2 
## + Fold10: mtry=1, splitrule=gini, min.node.size=1 
## - Fold10: mtry=1, splitrule=gini, min.node.size=1 
## + Fold10: mtry=2, splitrule=gini, min.node.size=1 
## - Fold10: mtry=2, splitrule=gini, min.node.size=1 
## + Fold10: mtry=1, splitrule=gini, min.node.size=2 
## - Fold10: mtry=1, splitrule=gini, min.node.size=2 
## + Fold10: mtry=2, splitrule=gini, min.node.size=2 
## - Fold10: mtry=2, splitrule=gini, min.node.size=2 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 1, splitrule = gini, min.node.size = 1 on full training set
rf_Iaquinta10
## Random Forest 
## 
## 4572 samples
##   11 predictor
##    5 classes: 'Prelim', 'Final', 'Semi_Final', 'Quarter_Final', 'Round_16' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 4113, 4115, 4114, 4115, 4115, 4114, ... 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  Accuracy  Kappa
##   1     1              0.953415  0    
##   1     2              0.953415  0    
##   2     1              0.953415  0    
##   2     2              0.953415  0    
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 1, splitrule = gini
##  and min.node.size = 1.

Random forest with 7-fold cross-validation

myGrid = expand.grid(mtry = c(1:2), splitrule = c("gini"),
                     min.node.size = c(1:2))

rf_Iaquinta7 <- train(Stage~.,data = Iaquinta, method = "ranger", 
               trControl = trainControl(method ="cv", number = 7, verboseIter = TRUE),
               tuneGrid = myGrid)
## + Fold1: mtry=1, splitrule=gini, min.node.size=1 
## - Fold1: mtry=1, splitrule=gini, min.node.size=1 
## + Fold1: mtry=2, splitrule=gini, min.node.size=1 
## - Fold1: mtry=2, splitrule=gini, min.node.size=1 
## + Fold1: mtry=1, splitrule=gini, min.node.size=2 
## - Fold1: mtry=1, splitrule=gini, min.node.size=2 
## + Fold1: mtry=2, splitrule=gini, min.node.size=2 
## - Fold1: mtry=2, splitrule=gini, min.node.size=2 
## + Fold2: mtry=1, splitrule=gini, min.node.size=1 
## - Fold2: mtry=1, splitrule=gini, min.node.size=1 
## + Fold2: mtry=2, splitrule=gini, min.node.size=1 
## - Fold2: mtry=2, splitrule=gini, min.node.size=1 
## + Fold2: mtry=1, splitrule=gini, min.node.size=2 
## - Fold2: mtry=1, splitrule=gini, min.node.size=2 
## + Fold2: mtry=2, splitrule=gini, min.node.size=2 
## - Fold2: mtry=2, splitrule=gini, min.node.size=2 
## + Fold3: mtry=1, splitrule=gini, min.node.size=1 
## - Fold3: mtry=1, splitrule=gini, min.node.size=1 
## + Fold3: mtry=2, splitrule=gini, min.node.size=1 
## - Fold3: mtry=2, splitrule=gini, min.node.size=1 
## + Fold3: mtry=1, splitrule=gini, min.node.size=2 
## - Fold3: mtry=1, splitrule=gini, min.node.size=2 
## + Fold3: mtry=2, splitrule=gini, min.node.size=2 
## - Fold3: mtry=2, splitrule=gini, min.node.size=2 
## + Fold4: mtry=1, splitrule=gini, min.node.size=1 
## - Fold4: mtry=1, splitrule=gini, min.node.size=1 
## + Fold4: mtry=2, splitrule=gini, min.node.size=1 
## - Fold4: mtry=2, splitrule=gini, min.node.size=1 
## + Fold4: mtry=1, splitrule=gini, min.node.size=2 
## - Fold4: mtry=1, splitrule=gini, min.node.size=2 
## + Fold4: mtry=2, splitrule=gini, min.node.size=2 
## - Fold4: mtry=2, splitrule=gini, min.node.size=2 
## + Fold5: mtry=1, splitrule=gini, min.node.size=1 
## - Fold5: mtry=1, splitrule=gini, min.node.size=1 
## + Fold5: mtry=2, splitrule=gini, min.node.size=1 
## - Fold5: mtry=2, splitrule=gini, min.node.size=1 
## + Fold5: mtry=1, splitrule=gini, min.node.size=2 
## - Fold5: mtry=1, splitrule=gini, min.node.size=2 
## + Fold5: mtry=2, splitrule=gini, min.node.size=2 
## - Fold5: mtry=2, splitrule=gini, min.node.size=2 
## + Fold6: mtry=1, splitrule=gini, min.node.size=1 
## - Fold6: mtry=1, splitrule=gini, min.node.size=1 
## + Fold6: mtry=2, splitrule=gini, min.node.size=1 
## - Fold6: mtry=2, splitrule=gini, min.node.size=1 
## + Fold6: mtry=1, splitrule=gini, min.node.size=2 
## - Fold6: mtry=1, splitrule=gini, min.node.size=2 
## + Fold6: mtry=2, splitrule=gini, min.node.size=2 
## - Fold6: mtry=2, splitrule=gini, min.node.size=2 
## + Fold7: mtry=1, splitrule=gini, min.node.size=1 
## - Fold7: mtry=1, splitrule=gini, min.node.size=1 
## + Fold7: mtry=2, splitrule=gini, min.node.size=1 
## - Fold7: mtry=2, splitrule=gini, min.node.size=1 
## + Fold7: mtry=1, splitrule=gini, min.node.size=2 
## - Fold7: mtry=1, splitrule=gini, min.node.size=2 
## + Fold7: mtry=2, splitrule=gini, min.node.size=2 
## - Fold7: mtry=2, splitrule=gini, min.node.size=2 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 1, splitrule = gini, min.node.size = 1 on full training set
rf_Iaquinta7
## Random Forest 
## 
## 4572 samples
##   11 predictor
##    5 classes: 'Prelim', 'Final', 'Semi_Final', 'Quarter_Final', 'Round_16' 
## 
## No pre-processing
## Resampling: Cross-Validated (7 fold) 
## Summary of sample sizes: 3918, 3919, 3919, 3918, 3918, 3921, ... 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  Accuracy   Kappa
##   1     1              0.9534142  0    
##   1     2              0.9534142  0    
##   2     1              0.9534142  0    
##   2     2              0.9534142  0    
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 1, splitrule = gini
##  and min.node.size = 1.
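
Both cross-validation schemes report the same 0.9534 accuracy with Kappa = 0, which again corresponds to always predicting the majority class Prelim. The two fitted models can be put side by side as sketched below (not part of the original report):

# Sketch: compare the cross-validated performance of the two random forests
rbind(cv10 = getTrainPerf(rf_Iaquinta10),
      cv7  = getTrainPerf(rf_Iaquinta7))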