Predicting basketball match winners with machine learning

The purpose of this machine learning project is to predict the winner of each match with an algorithm. To do this we have a set of JSON files with the data from 378 of the 380 matches played in the regular phase of the LNB 2018/19.

We will train an algorithm with 300 matches and try to predict the other 78. In this case the season is complete, but during a live season the match database would fill up as games are played, and the user could predict the upcoming games after each match day using the season's stats so far.

This code will also be used for a stats viewer which, I repeat, would change after each match in a normal season. Unfortunately I don't have the date of each match, which I would have used to create some extra filters.

1- Load the JSON files

I have 378 JSON files, named in sequence from "1.json" to "378.json":

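library(dplyr)      # %>%, joins, mutate, summarise
library(rpart)      # decision trees
library(rpart.plot) # tree plots
library(reshape2)   # melt()
library(ggplot2)    # bar charts

JSON_List <- list()
DF_home <- data.frame()
DF_away <- data.frame()

# Read every match file into one element of JSON_List
for (i in 1:378) {
  Id_JSON <- paste0(i, ".json")
  JSON_List[[i]] <- rjson::fromJSON(file = Id_JSON, simplify = FALSE)
}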
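During a live season not every file in the sequence will exist yet, so it can help to skip absent files instead of failing on the first one. The following is a minimal sketch under stated assumptions: `demo_dir`, `JSON_demo` and the two tiny files are made up for illustration (the real files hold many more fields), and only `rjson::fromJSON` comes from the code above.

```r
# Hypothetical demo folder with two tiny match files (illustrative content only)
demo_dir <- file.path(tempdir(), "lnb_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines('{"tm":{"1":{"name":"BOCA","score":80},"2":{"name":"OBRAS","score":75}}}',
           file.path(demo_dir, "1.json"))
writeLines('{"tm":{"1":{"name":"FERRO","score":70},"2":{"name":"QUIMSA","score":91}}}',
           file.path(demo_dir, "3.json"))   # "2.json" is deliberately missing

ids   <- 1:3
paths <- file.path(demo_dir, paste0(ids, ".json"))
found <- file.exists(paths)

# Read only the files that exist and keep the match id as the list name
JSON_demo <- lapply(paths[found], function(p) rjson::fromJSON(file = p, simplify = FALSE))
names(JSON_demo) <- ids[found]

names(JSON_demo)
JSON_demo[["3"]]$tm$"2"$name
```

Naming the list elements by match id keeps the id attached to the data even when some files are skipped.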

2- Load data from home and away

Inside each JSON there is a sublist called tm (total match); within it, sublist 1 holds the home stats and sublist 2 the away stats. I want to use some fields from these lists:

Name, Score, Two points, Three points, Free Throw, Field goals, Efficiency, Assists, Rebounds, Steals, Lost, Points in the paint, Fast Breaks, Turnovers, Bench points and Blocks.

We also set the condition to "home" for sublist 1 and "away" for sublist 2, and create a match id for each match.

At the end we delete all the temporary single-observation objects.

for (i in 1:378) {
name=as.data.frame(JSON_List[[i]]$tm$"1"$name, nrow = 1)
score=as.data.frame(JSON_List[[i]]$tm$"1"$score, nrow = 1)
tot_sFieldGoalsMade=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sFieldGoalsMade, nrow = 1)
tot_sFieldGoalsAttempted=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sFieldGoalsAttempted, nrow = 1)
tot_sTwoPointersMade=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sTwoPointersMade, nrow = 1)
tot_sTwoPointersAttempted=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sTwoPointersAttempted, nrow = 1)
tot_sThreePointersMade=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sThreePointersMade, nrow = 1)
tot_sThreePointersAttempted=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sThreePointersAttempted, nrow = 1)
tot_sFreeThrowsMade=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sFreeThrowsMade, nrow = 1)
tot_sFreeThrowsAttempted=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sFreeThrowsAttempted, nrow = 1)
tot_sReboundsDefensive=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sReboundsDefensive, nrow = 1)
tot_sReboundsOffensive=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sReboundsOffensive, nrow = 1)
tot_sReboundsTotal=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sReboundsTotal, nrow = 1)
tot_sAssists=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sAssists, nrow = 1)
tot_sBlocks=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sBlocks, nrow = 1)
tot_sTurnovers=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sTurnovers, nrow = 1)
tot_sFoulsPersonal=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sFoulsPersonal, nrow = 1)
tot_sPointsInThePaint=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sPointsInThePaint, nrow = 1)
tot_sPointsSecondChance=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sPointsSecondChance, nrow = 1)
tot_sPointsFromTurnovers=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sPointsFromTurnovers, nrow = 1)
tot_sBenchPoints=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sBenchPoints, nrow = 1)
tot_sPointsFastBreak=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sPointsFastBreak, nrow = 1)
tot_sSteals=as.data.frame(JSON_List[[i]]$tm$"1"$tot_sSteals, nrow = 1)
tot_eff_5=as.data.frame(JSON_List[[i]]$tm$"1"$tot_eff_5, nrow = 1)
match_id = i
condicion = "home"
Observacion = cbind(name,score,tot_sFieldGoalsMade,tot_sFieldGoalsAttempted,tot_sTwoPointersMade,tot_sTwoPointersAttempted,tot_sThreePointersMade,tot_sThreePointersAttempted,tot_sFreeThrowsMade,tot_sFreeThrowsAttempted,tot_sReboundsDefensive,tot_sReboundsOffensive,tot_sReboundsTotal,tot_sAssists,tot_sBlocks,tot_sTurnovers,tot_sFoulsPersonal,tot_sPointsInThePaint,tot_sPointsSecondChance,tot_sPointsFromTurnovers,tot_sBenchPoints,tot_sPointsFastBreak,tot_sSteals,tot_eff_5, match_id, condicion 
)
DF_home= rbind(DF_home, Observacion ) 
}
for (i in 1:378) {
name=as.data.frame(JSON_List[[i]]$tm$"2"$name, nrow = 1)
score=as.data.frame(JSON_List[[i]]$tm$"2"$score, nrow = 1)
tot_sFieldGoalsMade=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sFieldGoalsMade, nrow = 1)
tot_sFieldGoalsAttempted=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sFieldGoalsAttempted, nrow = 1)
tot_sTwoPointersMade=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sTwoPointersMade, nrow = 1)
tot_sTwoPointersAttempted=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sTwoPointersAttempted, nrow = 1)
tot_sThreePointersMade=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sThreePointersMade, nrow = 1)
tot_sThreePointersAttempted=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sThreePointersAttempted, nrow = 1)
tot_sFreeThrowsMade=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sFreeThrowsMade, nrow = 1)
tot_sFreeThrowsAttempted=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sFreeThrowsAttempted, nrow = 1)
tot_sReboundsDefensive=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sReboundsDefensive, nrow = 1)
tot_sReboundsOffensive=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sReboundsOffensive, nrow = 1)
tot_sReboundsTotal=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sReboundsTotal, nrow = 1)
tot_sAssists=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sAssists, nrow = 1)
tot_sBlocks=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sBlocks, nrow = 1)
tot_sTurnovers=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sTurnovers, nrow = 1)
tot_sFoulsPersonal=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sFoulsPersonal, nrow = 1)
tot_sPointsInThePaint=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sPointsInThePaint, nrow = 1)
tot_sPointsSecondChance=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sPointsSecondChance, nrow = 1)
tot_sPointsFromTurnovers=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sPointsFromTurnovers, nrow = 1)
tot_sBenchPoints=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sBenchPoints, nrow = 1)
tot_sPointsFastBreak=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sPointsFastBreak, nrow = 1)
tot_sSteals=as.data.frame(JSON_List[[i]]$tm$"2"$tot_sSteals, nrow = 1)
tot_eff_5=as.data.frame(JSON_List[[i]]$tm$"2"$tot_eff_5, nrow = 1)
match_id = i
condicion = "away"
Observacion = cbind(name,score,tot_sFieldGoalsMade,tot_sFieldGoalsAttempted,tot_sTwoPointersMade,tot_sTwoPointersAttempted,tot_sThreePointersMade,tot_sThreePointersAttempted,tot_sFreeThrowsMade,tot_sFreeThrowsAttempted,tot_sReboundsDefensive,tot_sReboundsOffensive,tot_sReboundsTotal,tot_sAssists,tot_sBlocks,tot_sTurnovers,tot_sFoulsPersonal,tot_sPointsInThePaint,tot_sPointsSecondChance,tot_sPointsFromTurnovers,tot_sBenchPoints,tot_sPointsFastBreak,tot_sSteals, tot_eff_5,match_id, condicion 
)
DF_away= rbind(DF_away, Observacion ) 
}
rm (Observacion, name,score,tot_sFieldGoalsMade,tot_sFieldGoalsAttempted,tot_sTwoPointersMade,tot_sTwoPointersAttempted,tot_sThreePointersMade,tot_sThreePointersAttempted,tot_sFreeThrowsMade,tot_sFreeThrowsAttempted,tot_sReboundsDefensive,tot_sReboundsOffensive,tot_sReboundsTotal,tot_sAssists,tot_sBlocks,tot_sTurnovers,tot_sFoulsPersonal,tot_sPointsInThePaint,tot_sPointsSecondChance,tot_sPointsFromTurnovers,tot_sBenchPoints,tot_sPointsFastBreak,tot_sSteals,match_id, condicion, tot_eff_5 )

head(DF_home, 5)
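The two loops above differ only in the sublist they read and the condition label, so the same idea can be sketched with one helper. This is an illustrative alternative, not the code used for the results: `match_demo`, `fields` and `extract_side` are hypothetical names, and the demo match carries only three of the fields.

```r
# One hypothetical parsed match, shaped like JSON_List[[i]] but with only three fields
match_demo <- list(tm = list(
  "1" = list(name = "BOCA",  score = 80, tot_sAssists = 17),
  "2" = list(name = "OBRAS", score = 75, tot_sAssists = 12)
))
fields <- c("name", "score", "tot_sAssists")

# Build a one-row data frame for one side ("1" = home, "2" = away) of a match
extract_side <- function(match, side, match_id, condition) {
  row <- as.data.frame(match$tm[[side]][fields], stringsAsFactors = FALSE)
  row$match_id  <- match_id
  row$condition <- condition
  row
}

demo_rows <- rbind(
  extract_side(match_demo, "1", 1, "home"),
  extract_side(match_demo, "2", 1, "away")
)
demo_rows
```

With the full field list in `fields`, the same helper could be called inside a single loop over the matches.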

3- Rename the columns

I rename the columns to shorter names that are easier for me to remember.

names(DF_home) <- c ("name",
"score",
"FG",
"FGA",
"P2",
"P2A",
"P3",
"P3A",
"FT",
"FTA",
"DRBS",
"ORBS",
"RBS",
"ASS",
"BLO",
"TO",
"PF",
"PP",
"OP2",
"PFT",
"BP",
"FB",
"STE",
"EFF",
"match_id",
"condition")

names(DF_away) <- c ("name",
"score",
"FG",
"FGA",
"P2",
"P2A",
"P3",
"P3A",
"FT",
"FTA",
"DRBS",
"ORBS",
"RBS",
"ASS",
"BLO",
"TO",
"PF",
"PP",
"OP2",
"PFT",
"BP",
"FB",
"STE",
"EFF",
"match_id",
"condition")

head (DF_away, 5)

4- Summarize the DF

I summarize both DFs and check for NAs. Some values are close to being outliers, but I didn't remove them because they are normal in basketball, especially in overtime matches.

anyNA(DF_away)
[1] FALSE
anyNA(DF_home)
[1] FALSE
summary(DF_away)
     name               score              FG             FGA              P2             P2A       
 Length:378         Min.   : 40.00   Min.   :14.00   Min.   :45.00   Min.   : 9.00   Min.   :18.00  
 Class :character   1st Qu.: 72.00   1st Qu.:26.00   1st Qu.:60.00   1st Qu.:17.00   1st Qu.:35.00  
 Mode  :character   Median : 78.50   Median :28.00   Median :63.00   Median :20.00   Median :39.00  
                    Mean   : 78.42   Mean   :28.47   Mean   :63.39   Mean   :20.46   Mean   :39.69  
                    3rd Qu.: 84.75   3rd Qu.:31.00   3rd Qu.:67.00   3rd Qu.:24.00   3rd Qu.:44.00  
                    Max.   :115.00   Max.   :42.00   Max.   :86.00   Max.   :31.00   Max.   :65.00  
       P3              P3A             FT             FTA             DRBS            ORBS            RBS      
 Min.   : 1.000   Min.   : 8.0   Min.   : 1.00   Min.   : 2.00   Min.   :13.00   Min.   : 1.00   Min.   :22.0  
 1st Qu.: 6.000   1st Qu.:20.0   1st Qu.:10.00   1st Qu.:14.00   1st Qu.:24.00   1st Qu.: 6.00   1st Qu.:32.0  
 Median : 8.000   Median :24.0   Median :13.00   Median :18.00   Median :27.00   Median : 8.00   Median :36.0  
 Mean   : 8.008   Mean   :23.7   Mean   :13.48   Mean   :18.67   Mean   :27.47   Mean   : 8.73   Mean   :36.2  
 3rd Qu.:10.000   3rd Qu.:27.0   3rd Qu.:17.00   3rd Qu.:23.00   3rd Qu.:30.00   3rd Qu.:10.00   3rd Qu.:40.0  
 Max.   :17.000   Max.   :38.0   Max.   :31.00   Max.   :40.00   Max.   :47.00   Max.   :20.00   Max.   :63.0  
      ASS          BLO             TO              PF              PP            OP2              PFT       
 Min.   : 3   Min.   :0.00   Min.   : 3.00   Min.   :11.00   Min.   :10.0   Min.   : 0.000   Min.   : 2.00  
 1st Qu.:12   1st Qu.:1.00   1st Qu.:10.00   1st Qu.:18.00   1st Qu.:28.0   1st Qu.: 6.000   1st Qu.: 9.00  
 Median :15   Median :2.00   Median :13.00   Median :21.00   Median :34.0   Median : 9.000   Median :12.00  
 Mean   :15   Mean   :2.05   Mean   :13.08   Mean   :20.87   Mean   :34.7   Mean   : 8.714   Mean   :12.65  
 3rd Qu.:18   3rd Qu.:3.00   3rd Qu.:15.00   3rd Qu.:23.00   3rd Qu.:40.0   3rd Qu.:12.000   3rd Qu.:16.00  
 Max.   :30   Max.   :8.00   Max.   :26.00   Max.   :43.00   Max.   :56.0   Max.   :20.000   Max.   :32.00  
       BP              FB             STE              EFF            match_id       condition        
 Min.   : 0.00   Min.   : 0.00   Min.   : 1.000   Min.   : 24.00   Min.   :  1.00   Length:378        
 1st Qu.:17.00   1st Qu.: 8.00   1st Qu.: 5.000   1st Qu.: 69.00   1st Qu.: 95.25   Class :character  
 Median :24.00   Median :12.00   Median : 6.000   Median : 81.00   Median :189.50   Mode  :character  
 Mean   :24.55   Mean   :12.53   Mean   : 6.574   Mean   : 80.65   Mean   :189.50                     
 3rd Qu.:31.00   3rd Qu.:17.00   3rd Qu.: 8.000   3rd Qu.: 93.00   3rd Qu.:283.75                     
 Max.   :58.00   Max.   :33.00   Max.   :15.000   Max.   :146.00   Max.   :378.00                     
summary(DF_home)
     name               score           FG             FGA             P2             P2A       
 Length:378         Min.   : 52   Min.   :18.00   Min.   :48.0   Min.   :11.00   Min.   :20.00  
 Class :character   1st Qu.: 76   1st Qu.:27.00   1st Qu.:61.0   1st Qu.:18.25   1st Qu.:36.00  
 Mode  :character   Median : 83   Median :29.00   Median :65.0   Median :21.00   Median :40.00  
                    Mean   : 83   Mean   :29.86   Mean   :64.9   Mean   :21.62   Mean   :40.59  
                    3rd Qu.: 90   3rd Qu.:33.00   3rd Qu.:69.0   3rd Qu.:25.00   3rd Qu.:45.00  
                    Max.   :114   Max.   :44.00   Max.   :96.0   Max.   :36.00   Max.   :62.00  
       P3              P3A              FT             FTA             DRBS            ORBS       
 Min.   : 2.000   Min.   : 8.00   Min.   : 4.00   Min.   : 7.00   Min.   :15.00   Min.   : 1.000  
 1st Qu.: 6.000   1st Qu.:20.00   1st Qu.:11.00   1st Qu.:16.00   1st Qu.:25.00   1st Qu.: 7.000  
 Median : 8.000   Median :24.00   Median :14.00   Median :20.00   Median :28.00   Median :10.000  
 Mean   : 8.238   Mean   :24.31   Mean   :15.04   Mean   :20.79   Mean   :28.07   Mean   : 9.772  
 3rd Qu.:10.000   3rd Qu.:28.00   3rd Qu.:18.00   3rd Qu.:25.00   3rd Qu.:31.00   3rd Qu.:12.000  
 Max.   :20.000   Max.   :50.00   Max.   :43.00   Max.   :59.00   Max.   :44.00   Max.   :21.000  
      RBS             ASS             BLO              TO              PF              PP       
 Min.   :23.00   Min.   : 5.00   Min.   :0.000   Min.   : 2.00   Min.   : 9.00   Min.   : 8.00  
 1st Qu.:34.00   1st Qu.:14.00   1st Qu.:1.000   1st Qu.: 9.00   1st Qu.:17.00   1st Qu.:30.50  
 Median :37.50   Median :17.00   Median :2.000   Median :12.00   Median :20.00   Median :36.00  
 Mean   :37.84   Mean   :17.05   Mean   :2.534   Mean   :11.83   Mean   :19.61   Mean   :37.08  
 3rd Qu.:42.00   3rd Qu.:20.00   3rd Qu.:4.000   3rd Qu.:14.00   3rd Qu.:22.00   3rd Qu.:44.00  
 Max.   :57.00   Max.   :34.00   Max.   :8.000   Max.   :25.00   Max.   :34.00   Max.   :64.00  
      OP2             PFT             BP             FB            STE              EFF        
 Min.   : 0.00   Min.   : 0.0   Min.   : 0.0   Min.   : 0.0   Min.   : 0.000   Min.   : 35.00  
 1st Qu.: 7.00   1st Qu.:11.0   1st Qu.:18.0   1st Qu.: 9.0   1st Qu.: 5.000   1st Qu.: 80.00  
 Median :10.00   Median :14.0   Median :25.0   Median :13.0   Median : 7.000   Median : 93.00  
 Mean   :10.27   Mean   :14.7   Mean   :26.3   Mean   :13.9   Mean   : 6.854   Mean   : 93.26  
 3rd Qu.:13.00   3rd Qu.:18.0   3rd Qu.:33.0   3rd Qu.:18.0   3rd Qu.: 9.000   3rd Qu.:106.75  
 Max.   :24.00   Max.   :37.0   Max.   :55.0   Max.   :47.0   Max.   :16.000   Max.   :155.00  
    match_id       condition        
 Min.   :  1.00   Length:378        
 1st Qu.: 95.25   Class :character  
 Median :189.50   Mode  :character  
 Mean   :189.50                     
 3rd Qu.:283.75                     
 Max.   :378.00                     
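To make "close to being outliers" concrete, one common convention is the 1.5 x IQR rule. The sketch below applies it to a made-up vector of scores (the values are illustrative, not taken from the data); it only flags values and does not remove them, in line with the decision above.

```r
# Illustrative scores, not real match data
scores <- c(78, 83, 72, 90, 85, 76, 115, 81, 79, 40)

q   <- unname(quantile(scores, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Flag values more than 1.5 * IQR outside the quartiles
flag <- scores < q[1] - 1.5 * iqr | scores > q[2] + 1.5 * iqr
scores[flag]
```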

5- Create DF for teams

Joining the home and away DFs by match ID creates a new DF with all the stats of each match and makes it possible to know who the winner was.

Binding both DFs, which have the same columns and the same number of observations, creates a DF with one row per team per match.

DF_Total <- data.frame()
DF_Total <-   
 DF_away%>%
inner_join(DF_home, ., by = "match_id") 

DF_Team <- data.frame()
DF_Team <- rbind(DF_home,DF_away)

tail (DF_Team, 10)
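The `.x` and `.y` suffixes used in the rest of the code come from this join: columns that exist in both DFs get `.x` for the first table (home) and `.y` for the second (away). A small sketch with made-up rows (`home_demo`, `away_demo` and the scores are illustrative):

```r
library(dplyr)

home_demo <- data.frame(match_id = 1:2, name = c("BOCA", "FERRO"),
                        score = c(80, 70), stringsAsFactors = FALSE)
away_demo <- data.frame(match_id = 1:2, name = c("OBRAS", "QUIMSA"),
                        score = c(75, 91), stringsAsFactors = FALSE)

# Wide table: one row per match, home columns end in .x and away columns in .y
total_demo <- inner_join(home_demo, away_demo, by = "match_id")
names(total_demo)

# Long table: one row per team per match
team_demo <- rbind(home_demo, away_demo)
nrow(team_demo)
```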

6- Determine winner

Determining who won each match creates two classes: class1 is 1 when the home team wins and 0 when the away team wins, and class2 is the inverse.

DF_Score <- data.frame()
DF_Score <-  DF_Total%>%
  mutate( winner= ifelse(DF_Total$score.x > DF_Total$score.y, DF_Total$name.x, DF_Total$name.y)) %>%
  select(match_id, winner)

DF_Total$class1 <- ifelse(DF_Total$score.x > DF_Total$score.y, 1, 0)
DF_Total$class2 <- ifelse(DF_Total$score.x > DF_Total$score.y, 0, 1)

DF_Team <- DF_Team %>%
left_join(DF_Score, ., by = "match_id")

head(DF_Score, 5)

7- Create the stats table

The first step in creating the stats table is to group by team, so that there is a single observation for each one, and count how many matches it played. Some teams have 37 matches because 2 JSON files are missing.

We also create a table with the wins at home, away and in total, plus a ratio that measures the upgrade when a team plays at home and the downgrade when it plays away.

teams_stats<- data.frame()
teams_stats <- DF_Team %>% 
  group_by(name) %>% 
  count(name) 


wins<- data.frame()
wins <- DF_Team %>% 
  group_by(winner) %>% 
  count(winner) %>% 
  mutate(n = n/2)

names(wins) <- c ("name", "wins")

teams_stats <- teams_stats %>%
inner_join(wins, ., by = "name")


teams_stats <- teams_stats %>% mutate(wins_ratio = wins/n)


home_wins <- data.frame()
home_wins <- DF_Total %>% 
  group_by(name.x, class1) %>% 
  filter(class1 == 1)%>%
  summarise(winshome= sum(class1)) %>%
  mutate(winshome = winshome/19)%>%
  select(name.x, winshome)
 
names(home_wins) = c("name", "home_ratio")

teams_stats <- teams_stats %>%
inner_join(home_wins, ., by = "name") 

teams_stats <- teams_stats %>% mutate(home_wins_upgrade = (home_ratio/wins_ratio))

away_wins <- data.frame()
away_wins <- DF_Total %>% 
  group_by(name.y, class2) %>% 
  filter(class2 == 1)%>%
  summarise(winsaway= sum(class2)) %>%
  mutate(winsaway = winsaway/19)%>%
  select(name.y, winsaway)
 
names(away_wins) = c("name", "away_ratio")


teams_stats <- teams_stats %>%
inner_join(away_wins, ., by = "name") 

teams_stats <- teams_stats %>% mutate(away_wins_downgrade = (away_ratio/wins_ratio))

print (head(teams_stats, 5))
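The code above divides home wins by a fixed 19 games and, because of the `filter(class1 == 1)`, would drop any team without a home win. As a hedged alternative (not what produced the results in this document), the home ratio can be taken as the mean of `class1` per home team, which divides by the games actually recorded. The rows in `total_demo` are made up:

```r
# Illustrative rows shaped like DF_Total: home team name and class1 (1 = home win)
total_demo <- data.frame(
  name.x = c("BOCA", "BOCA", "BOCA", "FERRO", "FERRO"),
  class1 = c(1, 0, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Mean of a 0/1 column = share of wins over the games actually played at home
home_ratio <- tapply(total_demo$class1, total_demo$name.x, mean)
home_ratio
```

With this variant a team with 18 recorded home games is divided by 18, and a team with no home wins stays in the table with a ratio of 0.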

8- Create a shot DF

The next step is a DF with all the shooting data for the season, plus fields with the percentage of shots made. This adds more information and is joined to the team stats.

shots <- data.frame()
shots <- DF_Team %>% 
  group_by(name) %>% 
  summarise(
   P3 = sum(P3) ,
  P3A = sum(P3A) ,
   P2 = sum(P2) ,
   P2A = sum(P2A) ,
    FT = sum(FT) ,
   FTA = sum(FTA) ,
   FG = sum(FG), 
   FGA = sum(FGA) ,
  ) 
`summarise()` ungrouping output (override with `.groups` argument)
 
  shots <- shots  %>%
    mutate( P3P = P3 / P3A ) %>%
    mutate(P2P = P2 / P2A ) %>%
    mutate(FTP = FT / FTA ) %>%
    mutate(FGP = FG / FGA )

teams_stats <- teams_stats %>%
inner_join(shots, ., by = "name") 

tail (shots, 5)
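Note that the percentages above are pooled: shots made over the whole season divided by shots attempted over the whole season. That is not the same as averaging the per-game percentages, as this small sketch with made-up numbers shows:

```r
# Two illustrative games: three-pointers made and attempted
p3  <- c(10, 2)
p3a <- c(20, 10)

pooled   <- sum(p3) / sum(p3a)   # season-level percentage, as in the shots DF
per_game <- mean(p3 / p3a)       # average of the game-level percentages

c(pooled = pooled, per_game = per_game)
```

The pooled figure weights each game by its number of attempts, which is usually what a season shooting percentage means.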

9- Create other stats DF

The next step is a DF with the remaining stats for the season, averaged per game. This adds more information and is also joined to the team stats.

other_stats <- data.frame()
other_stats <- DF_Team %>% 
  group_by(name) %>% 
  summarise(
    RBS = mean(RBS),
    ASS= mean(ASS),
    BLO = mean(BLO),
    DRI = mean(STE) - mean(TO),
    PP = mean (PP),
    OP2 = mean (OP2),
    BP = mean(BP),
    FB = mean(FB) + mean(PFT),
    points = mean(score),
    EFF = mean(EFF)
  )
`summarise()` ungrouping output (override with `.groups` argument)
    
teams_stats <- teams_stats %>%
inner_join(other_stats, ., by = "name")    

head(other_stats, 5)

10- Create a match DF

We now have all the info about the teams, so we create a DF with a preview of each match built from the teams' stats. It will be used to train the model, taking the info from DF_Total (match id, home, away and the classes) and bringing in all the stats from the corresponding DF.

DF_match <- data.frame()
DF_match <-  DF_Total %>%
  select(match_id, name.x, name.y, class1, class2 ) #I take the initial fields

colnames(DF_match)[2] <-  "name" #Rename to make the join

DF_match <- teams_stats %>%
inner_join(DF_match, ., by = "name")  #Do the join

colnames(DF_match)[2] <-  "home" #I put as home name
colnames(DF_match)[3] <-  "name" #Rename to make the join
DF_match <- teams_stats %>%
inner_join(DF_match, ., by = "name")  #Do the join

colnames(DF_match)[3] <-  "away"#I put as away name

tail(DF_match, 10)
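The rename-join-rename sequence above can also be expressed with the named form of `by`, which matches a column of the first table to a differently named column of the second. This is a sketch of an equivalent approach on made-up data (`stats_demo`, `matches_demo` and the values are illustrative), not the code used here:

```r
library(dplyr)

stats_demo   <- data.frame(name = c("BOCA", "OBRAS"), points = c(82, 78),
                           stringsAsFactors = FALSE)
matches_demo <- data.frame(match_id = 1, home = "BOCA", away = "OBRAS",
                           stringsAsFactors = FALSE)

match_stats <- matches_demo %>%
  inner_join(stats_demo, by = c("home" = "name")) %>%          # home team stats
  inner_join(stats_demo, by = c("away" = "name"),
             suffix = c(".x", ".y"))                           # away team stats

match_stats
```

The first join adds `points`; the second finds the same column name again and applies the suffixes, giving `points.x` for home and `points.y` for away.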

11- Create ratios

For the match DF I create ratios that normalize the data between the home team's observations (.x) and the away team's (.y). Each ratio is the relative difference between the two teams' values.

This normalization is a necessary step to avoid problems in the regression.

DF_match <-  DF_match%>%
  mutate(
    points_ratio.x = (points.x-points.y)/points.y ,
    wins_dif.x = (wins_ratio.x-wins_ratio.y) / wins_ratio.y,
    rbs_ratio.x = (RBS.x - RBS.y)/RBS.y ,
    SH_ratio.x = (FGP.x - FGP.y)/ FGP.y ,
    EFF_ratio.x = (EFF.x - EFF.y)/ EFF.y,
    ASS_ratio.x = (ASS.x - ASS.y) / ASS.y,
    BLO_ratio.x= (BLO.x - BLO.y ) /BLO.y,
    DRI_ratio.x = (DRI.x - DRI.y) / DRI.y,
    PP_ratio.x =   (PP.x -PP.y) / PP.y,
    P3P_ratio.x = (P3P.x - P3P.y) / P3P.y,
    P2P_ratio.x = (P2P.x - P2P.y) / P2P.y,
    FT_ratio.x = (FTP.x - FTP.y) / FTP.y ,
    OP2_ratio.x = (OP2.x - OP2.y) / OP2.y,
    home_condition.x = (home_wins_upgrade.x - away_wins_downgrade.y) / away_wins_downgrade.y, 
    BP_ratio.x = (BP.x - BP.y) / BP.y,
    FB_ratio.x = (FB.x - FB.y) / FB.y,
    points_ratio.y = (points.y-points.x)/points.x ,
    rbs_ratio.y = (RBS.y - RBS.x)/RBS.x ,
    SH_ratio.y = (FGP.y - FGP.x)/ FGP.x ,
    EFF_ratio.y = (EFF.y - EFF.x)/ EFF.x,
    ASS_ratio.y = (ASS.y - ASS.x) / ASS.x,
    BLO_ratio.y= (BLO.y - BLO.x ) /BLO.x,
    DRI_ratio.y = (DRI.y - DRI.x) / DRI.x,
    PP_ratio.y =   (PP.y -PP.x) / PP.x,
    OP2_ratio.y = (OP2.y - OP2.x) / OP2.x,
    BP_ratio.y = (BP.y - BP.x) / BP.x,
    away_condition.y = -(away_wins_downgrade.y - home_wins_upgrade.x) / away_wins_downgrade.y,
    wins_dif.y = (wins_ratio.y-wins_ratio.x) / wins_ratio.y ,
    FB_ratio.y = (FB.y - FB.x) / FB.x,
  ) %>%
  select(match_id, 
         home, 
         away, 
    points_ratio.x,
    home_condition.x,
    rbs_ratio.x, 
    SH_ratio.x ,
    wins_dif.x,
    EFF_ratio.x ,
    ASS_ratio.x ,
    BLO_ratio.x,
    DRI_ratio.x ,
    PP_ratio.x ,
    P3P_ratio.x ,
    P2P_ratio.x ,
    FT_ratio.x ,
    OP2_ratio.x ,
    BP_ratio.x ,
    FB_ratio.x ,
    home_wins_upgrade.x ,
    wins_ratio.x,
    points_ratio.y ,
    rbs_ratio.y ,
    SH_ratio.y , 
    EFF_ratio.y ,
    ASS_ratio.y ,
    BLO_ratio.y,
    DRI_ratio.y ,
    PP_ratio.y ,
    OP2_ratio.y ,
    BP_ratio.y ,
    FB_ratio.y ,
    away_wins_downgrade.y ,
    wins_dif.y,
    away_condition.y,
    wins_ratio.y ,
         class1,
         class2)


head (DF_match, 10)
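Every ratio above follows the same pattern, (a - b) / b. A small helper makes its behaviour easier to inspect; `rel_diff` is a hypothetical name and the inputs are made up. One thing worth checking with it is what happens when the denominator is negative, which can occur for DRI (steals minus turnovers) since teams usually lose the ball more often than they steal it.

```r
# Relative difference of a with respect to b
rel_diff <- function(a, b) (a - b) / b

rel_diff(82, 78)    # home scores more than away: positive
rel_diff(78, 82)    # the reverse comparison: negative, but not the exact mirror

# With a negative denominator the sign flips: -5 is the better value, yet the ratio is negative
rel_diff(-5, -6)
```

If that sign flip is not intended, dividing by `abs(b)` is one possible adjustment, at the cost of changing the fitted coefficients.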

12- Training and testing

I split the DF into training and testing sets: 300 matches, about 80%, are used for training and the other 78 matches are the test set.

set.seed(1234)
sample <- 300
trIndex <- sample(nrow(DF_match), sample, replace=F)
teIndex <- seq_len(nrow(DF_match))[!(seq_len(nrow(DF_match)) %in% trIndex)]

training <- DF_match[trIndex,]
test <- DF_match[teIndex,]

tail(test, 10)
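The same split can be written with `setdiff()`, which also makes it easy to verify that the two index sets do not overlap. A minimal sketch on row numbers only (`n`, `n_train`, `tr` and `te` are illustrative names):

```r
set.seed(1234)
n       <- 378
n_train <- 300

tr <- sample(n, n_train)            # training rows, sampled without replacement
te <- setdiff(seq_len(n), tr)       # every remaining row goes to the test set

length(te)
length(intersect(tr, te))
```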

13- Create the class regression

I create the regression for class1 (home wins) with all the information and stats of team 1. The most important predictors are the home condition and the win ratio (in the summary, home_condition.x is the only coefficient significant at the 5% level); the rest of the information contributes small details, and I decided not to leave it out.

Reg_1 <- glm(class1 ~ points_ratio.x + rbs_ratio.x +SH_ratio.x + EFF_ratio.x + ASS_ratio.x+ BLO_ratio.x + DRI_ratio.x +PP_ratio.x + OP2_ratio.x  +BP_ratio.x + FB_ratio.x + home_condition.x + wins_dif.x, data=training, family=binomial(link="logit"))

summary(Reg_1)

Call:
glm(formula = class1 ~ points_ratio.x + rbs_ratio.x + SH_ratio.x + 
    EFF_ratio.x + ASS_ratio.x + BLO_ratio.x + DRI_ratio.x + PP_ratio.x + 
    OP2_ratio.x + BP_ratio.x + FB_ratio.x + home_condition.x + 
    wins_dif.x, family = binomial(link = "logit"), data = training)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2567  -1.0866  0.5355  0.9454  1.6934 

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)   
(Intercept)      -0.37403    0.33465  -1.118   0.2637   
points_ratio.x    5.29290    9.05418   0.585   0.5588   
rbs_ratio.x       0.36648    7.21009   0.051   0.9595   
SH_ratio.x        6.82025   13.48520   0.506   0.6130   
EFF_ratio.x      -4.99611    6.22734  -0.802   0.4224   
ASS_ratio.x      -1.14994    2.06344  -0.557   0.5773   
BLO_ratio.x       1.23716    1.55440   0.796   0.4261   
DRI_ratio.x      -0.29932    1.47404  -0.203   0.8391   
PP_ratio.x       -1.28217    3.94090  -0.325   0.7449   
OP2_ratio.x       0.42855    2.70186   0.159   0.8740   
BP_ratio.x       -0.25085    0.77456  -0.324   0.7460   
FB_ratio.x       -0.03992    1.26968  -0.031   0.9749   
home_condition.x  0.94851    0.31376   3.023   0.0025 **
wins_dif.x        1.88558    1.44813   1.302   0.1929   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 393.19  on 299  degrees of freedom
Residual deviance: 339.52  on 286  degrees of freedom
AIC: 367.52

Number of Fisher Scoring iterations: 5
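The coefficients are on the log-odds scale, so they need the inverse logit to be read as probabilities. The sketch below plugs in two estimates copied from the summary (the intercept and home_condition.x) and sets every other ratio to zero; the value 0.5 for home_condition.x is an arbitrary illustration.

```r
b0     <- -0.37403   # (Intercept) from the summary
b_home <-  0.94851   # home_condition.x from the summary

# Probability of a home win when all ratios are zero (two evenly matched teams)
p_even <- plogis(b0)

# Same match, but with home_condition.x = 0.5 (illustrative value)
p_edge <- plogis(b0 + b_home * 0.5)

# Multiplicative change in the odds of a home win per unit of home_condition.x
odds_ratio <- exp(b_home)

c(p_even = p_even, p_edge = p_edge, odds_ratio = odds_ratio)
```

`plogis()` is base R's logistic function, the inverse of the logit link used by the model.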

14- Create the tree

The tree is similar to the regression: the important variables are the wins ratio and the home condition, and it also adds shooting efficiency to decide whether the home team wins or not.


fit_1 <- rpart(class1 ~ points_ratio.x + rbs_ratio.x +SH_ratio.x + EFF_ratio.x + ASS_ratio.x+ BLO_ratio.x + DRI_ratio.x +PP_ratio.x + OP2_ratio.x  +BP_ratio.x + FB_ratio.x + home_condition.x + wins_dif.x, data=training)
rpart.plot(fit_1, extra=0, type=2)

15- Predict the class

After the prediction we have a probability of class 1 from the tree and another from the regression; we then use the mean of both to determine the probability that the home team wins.

# class1 is numeric, so fit_1 is a regression tree and predict() returns the leaf mean of class1
Pred_class1_Tree <- predict(fit_1, test)
Pred_class1_Reg <- predict(Reg_1, test, type = 'response')

test$class1_Reg <- Pred_class1_Reg
test$class1_Tree <- Pred_class1_Tree

test$class1_ensamble <- rowMeans(test[,c("class1_Tree", "class1_Reg")])
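Because class1 is stored as a number, `rpart()` fits a regression tree by default, and its prediction is the mean of class1 in each leaf, which can be read as a probability. If a classification tree is preferred, the response can be turned into a factor and the class probabilities requested explicitly. The sketch below contrasts both on simulated data (`demo`, `fit_num` and `fit_cls` are illustrative and unrelated to the match data):

```r
library(rpart)

set.seed(1)
demo   <- data.frame(x = runif(200, -1, 1))
demo$y <- as.integer(demo$x + rnorm(200, sd = 0.5) > 0)

# Numeric 0/1 response: regression ("anova") tree, prediction = leaf mean
fit_num <- rpart(y ~ x, data = demo)
p_num   <- predict(fit_num, demo)

# Factor response: classification tree, prediction = class probabilities
fit_cls <- rpart(factor(y) ~ x, data = demo, method = "class")
p_cls   <- predict(fit_cls, demo, type = "prob")[, "1"]

c(fit_num$method, fit_cls$method)
range(p_num)
```

Both outputs lie between 0 and 1, so either can be averaged with the regression probability; they differ in the splitting criterion used to grow the tree.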

16- Repeat the same for class 2

The next step is to build the same regression and tree for class 2. The main fields are the same, replacing the home team's upgrade with the away team's downgrade.

Reg_2 <- glm(class2 ~  points_ratio.y + rbs_ratio.y + SH_ratio.y + EFF_ratio.y +  ASS_ratio.y + BLO_ratio.y + DRI_ratio.y + PP_ratio.y + OP2_ratio.y + BP_ratio.y + FB_ratio.y  + away_condition.y + wins_dif.y , data=training, family=binomial(link="logit"))

summary(Reg_2)

Call:
glm(formula = class2 ~ points_ratio.y + rbs_ratio.y + SH_ratio.y + 
    EFF_ratio.y + ASS_ratio.y + BLO_ratio.y + DRI_ratio.y + PP_ratio.y + 
    OP2_ratio.y + BP_ratio.y + FB_ratio.y + away_condition.y + 
    wins_dif.y, family = binomial(link = "logit"), data = training)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5721  -0.9445  -0.5251   1.0590   2.3288  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)   
(Intercept)       0.43216    0.33929   1.274   0.2028   
points_ratio.y    5.41093    8.73122   0.620   0.5354   
rbs_ratio.y       3.10256    7.16125   0.433   0.6648   
SH_ratio.y       -4.54893   13.39144  -0.340   0.7341   
EFF_ratio.y      -2.23853    5.95135  -0.376   0.7068   
ASS_ratio.y       0.07936    1.82499   0.043   0.9653   
BLO_ratio.y      -0.68663    1.11643  -0.615   0.5385   
DRI_ratio.y       0.26851    1.40206   0.192   0.8481   
PP_ratio.y        1.40929    3.90546   0.361   0.7182   
OP2_ratio.y      -2.45812    2.51657  -0.977   0.3287   
BP_ratio.y        0.40550    0.67684   0.599   0.5491   
FB_ratio.y       -0.57270    1.20680  -0.475   0.6351   
away_condition.y -0.92693    0.31932  -2.903   0.0037 **
wins_dif.y        2.96970    1.29190   2.299   0.0215 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 393.19  on 299  degrees of freedom
Residual deviance: 339.38  on 286  degrees of freedom
AIC: 367.38

Number of Fisher Scoring iterations: 5

fit_2 <- rpart(class2 ~  points_ratio.y + rbs_ratio.y + SH_ratio.y + EFF_ratio.y +  ASS_ratio.y + BLO_ratio.y + DRI_ratio.y + PP_ratio.y + OP2_ratio.y + BP_ratio.y + FB_ratio.y  + away_condition.y + wins_dif.y , data=training)

rpart.plot(fit_2, extra=0, type=2)

# class2 is numeric, so fit_2 is a regression tree and predict() returns the leaf mean of class2
Pred_class2_Tree <- predict(fit_2, test)
Pred_class2_Reg <- predict(Reg_2, test, type = 'response')

test$class2_Reg <- Pred_class2_Reg
test$class2_Tree <- Pred_class2_Tree

test$class2_ensamble <- rowMeans(test[,c("class2_Tree", "class2_Reg")])

17- Determine the winner

If class 1 is higher than class 2, the home team is the predicted winner; conversely, if it is lower, the away team is the predicted winner.

We create a DF with the results predicted on the test set and see that close to 80% of the matches were predicted correctly by this tool.

This level of accuracy is a good sign that data science can estimate results and probabilities before the match.

If we run the algorithm over all 378 matches (both the training and the test matches), the success rate rises above 85%.

test$winner <- ifelse(test$class1 == 1, "Home", "Away")
test$winner_predict <- ifelse(test$class1_ensamble  > test$class2_ensamble, "Home", "Away")

DF_Score <- test %>%
  select(match_id, home, away, winner, winner_predict)

DF_Score$success <- ifelse (test$winner==test$winner_predict, 1, 0)

print ("The success rate is:")
[1] "The success rate is:"
[1] 0.795641
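The success rate is the share of test matches where the predicted winner equals the actual one, that is, the mean of a 0/1 success column. A sketch on five made-up matches (the vectors are illustrative), with a confusion table to see which kind of error is more frequent:

```r
# Illustrative actual and predicted winners for five matches
winner         <- c("Home", "Home", "Away", "Home", "Away")
winner_predict <- c("Home", "Away", "Away", "Home", "Home")

success      <- ifelse(winner == winner_predict, 1, 0)
success_rate <- mean(success)
success_rate

# Rows = actual result, columns = prediction
table(actual = winner, predicted = winner_predict)
```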

18- Create the preview probability

In another DF we compute each team's probability of winning by dividing each class probability by the sum of both classes. This gives one probability for the home team and another for the away team, and the two add up to 100%.
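The code for this step is not shown, so the following is only a sketch of the normalization as described, assuming the two ensemble columns are the inputs. The probabilities in `preview_demo` are made up:

```r
# Illustrative ensemble outputs for two matches
preview_demo <- data.frame(
  class1_ensamble = c(0.70, 0.40),   # home-win score from the class 1 models
  class2_ensamble = c(0.35, 0.55)    # away-win score from the class 2 models
)

total <- preview_demo$class1_ensamble + preview_demo$class2_ensamble
preview_demo$prob_home <- preview_demo$class1_ensamble / total
preview_demo$prob_away <- preview_demo$class2_ensamble / total

preview_demo
```

The division is needed because the two scores come from separate models and do not have to sum to 1 on their own.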

The final report

Now, with all the information and probabilities, we can build a report panel with all the stats and predictions for teams and matches. Let's explore it.

The glossaries of teams and stats give the name of each field.

1- Individual stats

We create a function that shows the statistics of a team in a given category, for example Boca in 3-point % or Quilmes in rebounds per game.

#Glossary teams

#"ARGENTINO"       "ATENAS (CBA)"    "BOCA"            "COMUNICACIONES"  "ESTUDIANTES (C)" "FERRO"          
#"GIMNASIA (CR)"   "HISPANO"         "INSTITUTO"       "LA UNION FSA"    "LIBERTAD"        "OBRAS"          
#"OLIMPICO"        "PEÑAROL"         "QUILMES"         "QUIMSA"          "REGATAS"         "SAN LORENZO"  

#Glossary stats
#"RBS" = Rebounds || "ASS" = Assists || "BLO" = Blocks || "DRI" = Steals - Turnovers || "PP" = Points in the paint
#"OP2" = Second-chance points || "BP" = Bench points || "FB" = Fast break || "points" = Points per game
#"EFF" = Efficiency per game || "P3P" = % of 3 points || "P2P" = % of 2 points || "FTP" = % of free throws
#"FGP" = % of field goals || "wins_ratio" = % of wins || "home_ratio" = % of wins at home || "away_ratio" = % of wins away

team_stats_function = function (x,z){
  teams_stats %>%  
 melt(id = "name") %>% 
    filter(name %in% (x)) %>% 
    filter(variable %in% (z)) %>% 
     ggplot()+
  aes(x=variable, y= value, fill= name) +
  geom_bar(stat='identity', position='dodge') + 
  labs(title = "Stats view", x = "Stat", y = "Total") +
    tema1
}


team_stats_function (c("BOCA"), c("P3P"))

team_stats_function (c("QUILMES"), c("RBS"))
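The function relies on `melt()` from reshape2 to turn the wide stats table into one row per team and stat, and on a theme object `tema1` whose definition is not shown in this text. The sketch below reproduces the reshaping and the plot on made-up values, assuming `tema1` is an ordinary ggplot2 theme; `theme_minimal()` is a stand-in, not the author's theme.

```r
library(reshape2)
library(ggplot2)

tema1 <- theme_minimal()   # assumption: placeholder for the undefined theme

# Illustrative values, not the real season stats
stats_demo <- data.frame(name = c("BOCA", "QUILMES"),
                         P3P  = c(0.36, 0.33),
                         RBS  = c(37.1, 35.4),
                         stringsAsFactors = FALSE)

# Long format: columns name, variable, value
long_demo <- melt(stats_demo, id = "name")
long_demo

p <- ggplot(subset(long_demo, variable == "RBS")) +
  aes(x = variable, y = value, fill = name) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Stats view", x = "Stat", y = "Total") +
  tema1
```

Printing `p` draws the chart; the long format is what lets a single function handle any combination of teams and stats.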

2- Multiple teams, the same stat

You can also see the performance of several teams in the same category, in this case Ferro and Obras in points, and San Lorenzo, Quimsa and Comunicaciones in bench points.

team_stats_function (c("FERRO", "OBRAS"), c("points"))

team_stats_function (c("SAN LORENZO", "QUIMSA", "COMUNICACIONES"), c("BP"))

3- Multiple stats, the same team

You can also see the performance of one team across several stats. In this case, for Instituto we will see the % of 2 points, 3 points, free throws and field goals, and for Hispano the wins ratio, the home wins ratio and the away wins ratio.

team_stats_function (c("INSTITUTO"), c("P2P", "P3P", "FTP", "FGP"))

team_stats_function (c("HISPANO"), c("wins_ratio", "home_ratio", "away_ratio"))

4- Multiple stats and multiple teams

You can also see several stats for several teams: in this case, for Libertad, Olimpico and Regatas we want to see the rebounds, assists, bench points and fast break points.

team_stats_function (c("LIBERTAD", "OLIMPICO", "REGATAS"), c("FB", "BP", "RBS", "ASS"))

5- Preview review

Beyond the individual stats, a match preview shows the ratios between the home and away teams for the most important stats.

In this case, the preview between Obras and Boca shows on the right the stats where Obras is better and on the left those where Boca is better.

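# Bar chart of the home vs away ratios for the match x (home) against z (away)
Preview_review = function (x, z) {
  # Named colors keep each label on the same color even if only one label appears
  # (same mapping ggplot2 gives the unnamed vector when both labels are present)
  best_colors = c("Away best" = "darkorange1", "Home best" = "orange")
  DF_match %>%
    filter(home == x & away == z) %>%
    select(home, wins_dif.x, points_ratio.x, rbs_ratio.x, ASS_ratio.x, PP_ratio.x, BP_ratio.x, OP2_ratio.x) %>%
    melt(id = "home") %>%  # long format: one row per stat
    mutate(best = ifelse(value > 0, "Home best", "Away best")) %>%
    ggplot(aes(x = variable, y = value, fill = best)) +
    geom_bar(stat = 'identity', position = 'dodge') +
    labs(title = paste0("Home: ", x, " Away: ", z), x = "Stat", y = "% of best") +
    coord_flip() +  # horizontal bars: home advantage to the right, away to the left
    scale_fill_manual(values = best_colors) +
    tema2
}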

Preview_review ("OBRAS", "BOCA")

6- Preview review in radar

For a more eye-catching display, although one that is harder to read, we can show the same preview in a radar (polar) chart.

# Same preview as Preview_review, drawn in polar coordinates
Preview_review_polar = function (x, z) {
  # Named colors keep each label on the same color even if only one label appears
  # (same mapping ggplot2 gives the unnamed vector when both labels are present)
  best_colors = c("Away best" = "green", "Home best" = "red")
  DF_match %>%
    filter(home == x & away == z) %>%
    select(home, wins_dif.x, points_ratio.x, rbs_ratio.x, ASS_ratio.x, PP_ratio.x, BP_ratio.x, OP2_ratio.x) %>%
    melt(id = "home") %>%  # long format: one row per stat
    mutate(best = ifelse(value > 0, "Home best", "Away best")) %>%
    ggplot(aes(x = variable, y = value, fill = best)) +
    geom_bar(stat = 'identity', position = 'dodge') +
    labs(title = paste0("Home: ", x, " Away: ", z), x = "Stat", y = "% of best") +
    coord_polar() +
    scale_fill_manual(values = best_colors) +
    tema2
}

Preview_review_polar ("OBRAS", "BOCA")

7- Preview probability

The most important output of the machine learning is the win probability for each team. Here we create a function that takes the names of the teams in the next match and shows the probability of winning for each one.

In this example, Hispano vs Regatas and Libertad vs Peñarol.

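# Bar chart of the predicted win probability of each team for the match x (home) against z (away)
Probability_by_user = function (x, z) {
  DF_match_preview %>%
    filter(home == x & away == z) %>%
    # Keeping both team names as ids leaves only the probability columns in "value",
    # so the values stay numeric instead of being mixed with the away team name
    melt(id = c("home", "away")) %>%
    ggplot(aes(x = variable, y = value)) +
    geom_bar(stat = 'identity', position = 'dodge', fill = "#CF5300") +
    labs(title = paste0("Home: ", x, " Away: ", z), x = "Team", y = "%Probability") +
    tema1
}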

Probability_by_user ("HISPANO", "REGATAS")

Probability_by_user ("LIBERTAD", "PEÑAROL")

8- After regular season

After the regular season come the playoffs. In the first round Ferro and Comunicaciones face each other head to head; we know each team's win probability, and with a binomial distribution we can see that the most probable result for the series is 3-2.

How did the series end? 3-2 for Ferro.

Probability_by_user ("FERRO", "COMUNICACIONES")

Probability_by_user ("COMUNICACIONES", "FERRO")

[1] "Table of probability"
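
The binomial reasoning behind the series result can be sketched as follows. This is only a minimal illustration, not the code used to build the table above: the helper series_score_probs is hypothetical, it assumes a best-of-five series in which team A wins every game independently with the same probability p, and the value 0.55 is made up for the example rather than taken from the model.

```r
# Hypothetical helper: probability of every final score in a series where the
# first team to reach `wins_needed` wins takes it, assuming team A wins each
# game independently with the same probability p.
series_score_probs <- function(p, wins_needed = 3) {
  stopifnot(p >= 0, p <= 1, wins_needed >= 1)
  losses <- 0:(wins_needed - 1)
  # The series winner always takes the last game; the loser's wins can fall
  # anywhere among the earlier games, hence the binomial coefficient.
  ways <- choose(wins_needed - 1 + losses, losses)
  a_wins <- ways * p^wins_needed * (1 - p)^losses
  b_wins <- ways * (1 - p)^wins_needed * p^losses
  names(a_wins) <- paste0("A ", wins_needed, "-", losses)
  names(b_wins) <- paste0("B ", wins_needed, "-", losses)
  c(a_wins, b_wins)
}

# p = 0.55 is an illustrative value, not an output of the model
probs <- series_score_probs(0.55)
print(round(probs, 3))

# Chance that the series goes the full five games, whoever wins it
print(sum(probs[c("A 3-2", "B 3-2")]))
```

A single p ignores the home advantage that the two Probability_by_user calls above expose. Using one probability per game would need the order of home and away games in the series, which is not stated here.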

Conclusion

Machine learning and data science help predict the winner of a match, and the analysis of the stats is a very helpful tool, not only for coaches.

This example serves to demonstrate the power of predictions. In a normal season, one should use the first 5 or 6 matchdays to gather the information; after that, one could start predicting what will happen.

The preview statistics show where one team's weaknesses and the other's strengths lie, and what sets them apart: how much one team suffers when traveling and how much the other is strengthened by playing at home.

Finally, all of this together gives a prediction of who should win and how strong a favorite one team is over the other, and it can even explain why an upset might happen.

Obviously, such a tool should be retrained every few months. It is also designed for the Argentine league; every league has its own intricacies, so using these algorithms in another country could lower their effectiveness.

Obviously, in sports there are surprises, injuries or bad days that can change the course of a game. However, believing that you cannot analyze a match in advance and predict the outcome is ridiculous.

