title: “Final Project (MATH217)”

author: “A. Diaz-Nova”

format: pdf

editor: visual


Introduction (Part One)

Bio-banding is a strategy currently used by soccer academies to combat large discrepancies in physical maturity among players. In recent years, top soccer clubs in England (e.g., Brighton and AFC Bournemouth) have found great value in this type of research and its implementation. The hope is to assist young players by applying specific conditioning regimes that reduce physical demand in order to improve on-ball actions and tactical-cognitive thinking. Most research findings corroborate this belief: bio-banding has a great capacity to make game states more technically and tactically challenging for the academy players taking part.

Methodology (No README file was provided)

The study, written by 13 authors, aimed to examine the effect of ‘bio-banding’ on technical and tactical markers of talent ‘ID’ in 11-14-year-old academy soccer players. In other words, at younger ages some players develop toward their adult physical size earlier or later than others; by grouping players by maturity, we can measure technical and tactical ability without extreme exposure to physical dominance.

However, the research leaves a lot to be desired. Since bio-banding is still in its infancy, further research should focus on determining the effectiveness and limitations of this approach. As such, the journal recommends considering the long-term effects, the optimal time frame for application, and so forth.

In progress (For Final Draft)

  • Define Variables (Will do this towards the end - not sure on the final subset)

Some questions that linger based on the facts/research are:

  • A hypothesis test to see which method of grouping was most effective in finding players with a better GTSC score

  • What is the relationship between the variables related to maturity (Stature, Mass, Seated Stature)?

Working with the data (Part Two)

Load in the libraries

library(tidyverse)

library(tidymodels)

library(infer)



# Setting the working directory

setwd("C:/Users/Angel/OneDrive/Documents/Datasets")

Using R to load CSV files in your ‘wd’

# Loading the data from a downloaded csv file

bio_banding <- read_csv("Biobanding Small Sided Games.csv")



# Seeing the structure of the dataset

str(bio_banding)
## spc_tbl_ [480 × 70] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Trial Number                                    : num [1:480] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Method of Grouping Teams                        : chr [1:480] "Chronological Age Groups" "Chronological Age Groups" "Chronological Age Groups" "Chronological Age Groups" ...
##  $ Player ID                                       : num [1:480] 18 18 18 18 18 14 14 14 14 14 ...
##  $ Test Team Number                                : num [1:480] 6 6 6 6 6 5 5 5 5 5 ...
##  $ Date of Birth                                   : chr [1:480] "05/30/2005" "05/30/2005" "05/30/2005" "05/30/2005" ...
##  $ Age                                             : num [1:480] 13.6 13.6 13.6 13.6 13.6 ...
##  $ Birth Month                                     : chr [1:480] "May" "May" "May" "May" ...
##  $ Quartile                                        : num [1:480] 3 3 3 3 3 3 3 3 3 3 ...
##  $ Position                                        : num [1:480] 4 4 4 4 4 3 3 3 3 3 ...
##  $ Stature (cm)                                    : num [1:480] 163 163 163 163 163 ...
##  $ Seated Stature (cm)                             : num [1:480] 84.6 84.6 84.6 84.6 84.6 ...
##  $ Body Mass (kg)                                  : num [1:480] 47.3 47.3 47.3 47.3 47.3 45.2 45.2 45.2 45.2 45.2 ...
##  $ Khamis EASA %                                   : num [1:480] 87.9 87.9 87.9 87.9 87.9 ...
##  $ Chronological Age Group                         : num [1:480] 14 14 14 14 14 14 14 14 14 14 ...
##  $ Khamis Banding Category                         : num [1:480] 2 2 2 2 2 3 3 3 3 3 ...
##  $ Fransen Years to PHV                            : num [1:480] -0.0825 -0.0825 -0.0825 -0.0825 -0.0825 ...
##  $ Fransen Banding Category                        : num [1:480] 2 2 2 2 2 3 3 3 3 3 ...
##  $ Period Name                                     : chr [1:480] "3B Game 1" "3B Game 2" "3B Game 3" "3B Game 4" ...
##  $ Duration (Mins)                                 : num [1:480] 5.02 5.05 4.93 4.98 4.98 ...
##  $ Total Distance (m)                              : num [1:480] 470 422 413 427 419 ...
##  $ Low Intensity Running (m) (< 13 kph)            : num [1:480] 231 245 225 246 229 ...
##  $ High Intensity Running (m) (13.1 to 16.1 kph)   : num [1:480] 234 169 170 168 184 ...
##  $ Very High Intensity Running (m) (16.1 to 19 kph): num [1:480] 3.59 8.23 14.82 5.13 4.43 ...
##  $ Sprinting Distance (m) (>19.1 kph)              : num [1:480] 0 0 2.12 7.89 0 2.88 5.84 0 0 0 ...
##  $ Very High Intensity Activities (m) (>16.1 kph)  : num [1:480] 3.59 8.23 16.94 13.02 4.43 ...
##  $ Player Load (AU)                                : num [1:480] 55.9 51.4 46.2 49 48.7 ...
##  $ Mean Heart Rate (bpm)...27                      : num [1:480] 174 163 162 165 165 ...
##  $ Maximum Velocity (m/s)                          : num [1:480] 4.44 4.79 5.25 5.64 4.61 ...
##  $ Accelerations >2m/s/s                           : num [1:480] 1 0 0 0 0 1 2 2 0 4 ...
##  $ Decelerations >2 m/s/s                          : num [1:480] 0 0 1 1 0 1 0 0 0 0 ...
##  $ Mean Heart Rate (bpm)...31                      : num [1:480] 174 163 162 165 165 ...
##  $ Team                                            : chr [1:480] "3B" "3B" "3B" "3B" ...
##  $ Players Team Maturation Status Group            : chr [1:480] "Mixed" "Mixed" "Mixed" "Mixed" ...
##  $ Opponent Team Maturation Status Group           : chr [1:480] "Mixed" "Mixed" "Mixed" "Mixed" ...
##  $ Players Team Maturation Status Group No         : num [1:480] 4 4 4 4 4 4 4 4 4 4 ...
##  $ Opponent Team Maturation Status Group No        : num [1:480] 4 4 4 4 4 4 4 4 4 4 ...
##  $ Cover                                           : num [1:480] 3 3 3 2 2 2 2 3 3 3 ...
##  $ Communication                                   : num [1:480] 3 2 3 2 2 1 1 2 3 2 ...
##  $ Decision Making                                 : num [1:480] 3 3 2 3 2 3 2 3 3 3 ...
##  $ Passing                                         : num [1:480] 3 2 3 3 3 3 2 3 3 3 ...
##  $ 1st Touch                                       : num [1:480] 3 3 3 3 3 3 3 3 2 3 ...
##  $ Control                                         : num [1:480] NA 3 3 NA NA NA NA 2 NA 2 ...
##  $ 1v1                                             : num [1:480] 3 3 3 2 2 2 2 3 3 3 ...
##  $ Shooting                                        : num [1:480] 3 2 3 NA NA NA 2 NA NA NA ...
##  $ Assist                                          : num [1:480] NA NA NA NA NA NA NA NA NA NA ...
##  $ Marking                                         : num [1:480] 3 3 2 2 2 2 NA NA 3 3 ...
##  $ Total GTSC Score                                : num [1:480] 24 24 25 17 16 16 14 19 20 22 ...
##  $ Opponent                                        : chr [1:480] "3A" "3A" "3A" "3A" ...
##  $ Result                                          : chr [1:480] "Draw" "Loss" "Loss" "Loss" ...
##  $ sRPE-Overall                                    : num [1:480] 220 175 225 230 280 220 190 185 180 245 ...
##  $ Differential sRPE-Breathing                     : num [1:480] 175 235 225 220 255 275 180 150 125 210 ...
##  $ Differential sRPE-Legs                          : num [1:480] 115 155 220 230 220 260 225 225 215 195 ...
##  $ Differential sRPE-Tec/Tac                       : num [1:480] 110 105 95 115 105 175 185 180 190 215 ...
##  $ Game Number                                     : num [1:480] 1 2 3 4 5 1 2 3 4 5 ...
##  $ Assists                                         : num [1:480] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Attempts                                        : num [1:480] 1 1 0 2 2 0 2 1 0 1 ...
##  $ Attempts Created                                : num [1:480] 0 0 0 1 0 0 1 1 0 1 ...
##  $ Blocks                                          : num [1:480] 0 0 0 0 0 0 0 0 1 0 ...
##  $ Chances Created                                 : num [1:480] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Successful Passes                               : num [1:480] 3 3 4 7 4 7 5 3 6 2 ...
##  $ Forward Passes                                  : num [1:480] 2 0 1 4 0 7 4 1 5 2 ...
##  $ Goals                                           : num [1:480] 0 0 0 0 0 0 1 1 0 0 ...
##  $ In Possession Duels Lost                        : num [1:480] 0 0 0 0 0 0 1 0 0 0 ...
##  $ In Possession Duels Won                         : num [1:480] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Entries to Opposition Half                      : num [1:480] 1 2 1 1 2 1 1 0 1 1 ...
##  $ Out of Possession Duels Lost                    : num [1:480] 0 0 0 0 0 0 0 0 0 1 ...
##  $ Out of Possession Duels Won                     : num [1:480] 0 0 0 0 0 0 0 0 0 1 ...
##  $ Total Passes                                    : num [1:480] 4 4 5 10 4 9 6 3 7 3 ...
##  $ Tackles & Interceptions                         : num [1:480] 2 0 1 0 1 4 6 2 1 0 ...
##  $ Turns                                           : num [1:480] 0 0 0 0 1 0 0 0 0 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Trial Number` = col_double(),
##   ..   `Method of Grouping Teams` = col_character(),
##   ..   `Player ID` = col_double(),
##   ..   `Test Team Number` = col_double(),
##   ..   `Date of Birth` = col_character(),
##   ..   Age = col_double(),
##   ..   `Birth Month` = col_character(),
##   ..   Quartile = col_double(),
##   ..   Position = col_double(),
##   ..   `Stature (cm)` = col_double(),
##   ..   `Seated Stature (cm)` = col_double(),
##   ..   `Body Mass (kg)` = col_double(),
##   ..   `Khamis EASA %` = col_double(),
##   ..   `Chronological Age Group` = col_double(),
##   ..   `Khamis Banding Category` = col_double(),
##   ..   `Fransen Years to PHV` = col_double(),
##   ..   `Fransen Banding Category` = col_double(),
##   ..   `Period Name` = col_character(),
##   ..   `Duration (Mins)` = col_double(),
##   ..   `Total Distance (m)` = col_double(),
##   ..   `Low Intensity Running (m) (< 13 kph)` = col_double(),
##   ..   `High Intensity Running (m) (13.1 to 16.1 kph)` = col_double(),
##   ..   `Very High Intensity Running (m) (16.1 to 19 kph)` = col_double(),
##   ..   `Sprinting Distance (m) (>19.1 kph)` = col_double(),
##   ..   `Very High Intensity Activities (m) (>16.1 kph)` = col_double(),
##   ..   `Player Load (AU)` = col_double(),
##   ..   `Mean Heart Rate (bpm)...27` = col_double(),
##   ..   `Maximum Velocity (m/s)` = col_double(),
##   ..   `Accelerations >2m/s/s` = col_double(),
##   ..   `Decelerations >2 m/s/s` = col_double(),
##   ..   `Mean Heart Rate (bpm)...31` = col_double(),
##   ..   Team = col_character(),
##   ..   `Players Team Maturation Status Group` = col_character(),
##   ..   `Opponent Team Maturation Status Group` = col_character(),
##   ..   `Players Team Maturation Status Group No` = col_double(),
##   ..   `Opponent Team Maturation Status Group No` = col_double(),
##   ..   Cover = col_double(),
##   ..   Communication = col_double(),
##   ..   `Decision Making` = col_double(),
##   ..   Passing = col_double(),
##   ..   `1st Touch` = col_double(),
##   ..   Control = col_double(),
##   ..   `1v1` = col_double(),
##   ..   Shooting = col_double(),
##   ..   Assist = col_double(),
##   ..   Marking = col_double(),
##   ..   `Total GTSC Score` = col_double(),
##   ..   Opponent = col_character(),
##   ..   Result = col_character(),
##   ..   `sRPE-Overall` = col_double(),
##   ..   `Differential sRPE-Breathing` = col_double(),
##   ..   `Differential sRPE-Legs` = col_double(),
##   ..   `Differential sRPE-Tec/Tac` = col_double(),
##   ..   `Game Number` = col_double(),
##   ..   Assists = col_double(),
##   ..   Attempts = col_double(),
##   ..   `Attempts Created` = col_double(),
##   ..   Blocks = col_double(),
##   ..   `Chances Created` = col_double(),
##   ..   `Successful Passes` = col_double(),
##   ..   `Forward Passes` = col_double(),
##   ..   Goals = col_double(),
##   ..   `In Possession Duels Lost` = col_double(),
##   ..   `In Possession Duels Won` = col_double(),
##   ..   `Entries to Opposition Half` = col_double(),
##   ..   `Out of Possession Duels Lost` = col_double(),
##   ..   `Out of Possession Duels Won` = col_double(),
##   ..   `Total Passes` = col_double(),
##   ..   `Tackles & Interceptions` = col_double(),
##   ..   Turns = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

There are 480 observations and 70 variables available, but the data is not yet in a form that is easy to work with. We will have to clean it before moving forward.

Cleaning the dataset

# A glimpse of the original data

head(bio_banding)
## # A tibble: 6 × 70
##   `Trial Number` `Method of Grouping Teams` `Player ID` `Test Team Number`
##            <dbl> <chr>                            <dbl>              <dbl>
## 1              1 Chronological Age Groups            18                  6
## 2              1 Chronological Age Groups            18                  6
## 3              1 Chronological Age Groups            18                  6
## 4              1 Chronological Age Groups            18                  6
## 5              1 Chronological Age Groups            18                  6
## 6              1 Chronological Age Groups            14                  5
## # ℹ 66 more variables: `Date of Birth` <chr>, Age <dbl>, `Birth Month` <chr>,
## #   Quartile <dbl>, Position <dbl>, `Stature (cm)` <dbl>,
## #   `Seated Stature (cm)` <dbl>, `Body Mass (kg)` <dbl>, `Khamis EASA %` <dbl>,
## #   `Chronological Age Group` <dbl>, `Khamis Banding Category` <dbl>,
## #   `Fransen Years to PHV` <dbl>, `Fransen Banding Category` <dbl>,
## #   `Period Name` <chr>, `Duration (Mins)` <dbl>, `Total Distance (m)` <dbl>,
## #   `Low Intensity Running (m) (< 13 kph)` <dbl>, …

Notes:

This is the raw dataset without any changes, meaning it is rough around the edges. The next step is to clean the dataset without discarding crucial elements or the richness of the data.

Making adjustments to the data (Data Wrangling)

# Replacing every space in the column names with a period (.) and then making them lowercase

names(bio_banding) <- gsub(" ", ".", names(bio_banding))

names(bio_banding) <- tolower(names(bio_banding))

head(bio_banding)
## # A tibble: 6 × 70
##   trial.number method.of.grouping.teams player.id test.team.number date.of.birth
##          <dbl> <chr>                        <dbl>            <dbl> <chr>        
## 1            1 Chronological Age Groups        18                6 05/30/2005   
## 2            1 Chronological Age Groups        18                6 05/30/2005   
## 3            1 Chronological Age Groups        18                6 05/30/2005   
## 4            1 Chronological Age Groups        18                6 05/30/2005   
## 5            1 Chronological Age Groups        18                6 05/30/2005   
## 6            1 Chronological Age Groups        14                5 05/19/2005   
## # ℹ 65 more variables: age <dbl>, birth.month <chr>, quartile <dbl>,
## #   position <dbl>, `stature.(cm)` <dbl>, `seated.stature.(cm)` <dbl>,
## #   `body.mass.(kg)` <dbl>, `khamis.easa.%` <dbl>,
## #   chronological.age.group <dbl>, khamis.banding.category <dbl>,
## #   fransen.years.to.phv <dbl>, fransen.banding.category <dbl>,
## #   period.name <chr>, `duration.(mins)` <dbl>, `total.distance.(m)` <dbl>,
## #   `low.intensity.running.(m).(<.13.kph)` <dbl>, …

Every column name is now lowercase, with a period separating the words. The key next step is to remove variables that are redundant or not easily understandable.
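As a small illustration of the renaming step, the sketch below applies the same `gsub()` and `tolower()` calls to two column names copied from the raw dataset. The second, stricter variant (the name `strict_names` and its regular expressions are my own assumption, not part of the project) also strips parentheses so the names no longer need backticks.

```r
# Two column names copied from the raw dataset
raw_names <- c("Trial Number", "Stature (cm)")

# Same approach as in the project: spaces become periods, then lowercase
clean_names <- tolower(gsub(" ", ".", raw_names))

# Hypothetical stricter variant: collapse every run of non-alphanumeric
# characters into one period, then trim leading/trailing periods
strict_names <- gsub("[^a-z0-9]+", ".", tolower(raw_names))
strict_names <- gsub("^\\.|\\.$", "", strict_names)

clean_names
strict_names
```

The project keeps the first form, which is why columns such as `stature.(cm)` are wrapped in backticks in later code.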

# There is no description for variables like quartile, position, player load, and so forth

# Other variables aren't of interest so we are removing them from the new subset (heart rate, period name, and so forth)

sorted_biobanding <- bio_banding |>
  select(
    -quartile, -position, -team, -opponent, -test.team.number, -period.name,
    -`mean.heart.rate.(bpm)...27`, -`mean.heart.rate.(bpm)...31`,
    -`player.load.(au)`, -`duration.(mins)`,
    -players.team.maturation.status.group.no, -opponent.team.maturation.status.group.no,
    -`differential.srpe-breathing`, -`differential.srpe-legs`, -`differential.srpe-tec/tac`,
    -`srpe-overall`, -game.number, -attempts, -trial.number
  )

# Viewing the structure of the new subset
str(sorted_biobanding)
## tibble [480 × 51] (S3: tbl_df/tbl/data.frame)
##  $ method.of.grouping.teams                        : chr [1:480] "Chronological Age Groups" "Chronological Age Groups" "Chronological Age Groups" "Chronological Age Groups" ...
##  $ player.id                                       : num [1:480] 18 18 18 18 18 14 14 14 14 14 ...
##  $ date.of.birth                                   : chr [1:480] "05/30/2005" "05/30/2005" "05/30/2005" "05/30/2005" ...
##  $ age                                             : num [1:480] 13.6 13.6 13.6 13.6 13.6 ...
##  $ birth.month                                     : chr [1:480] "May" "May" "May" "May" ...
##  $ stature.(cm)                                    : num [1:480] 163 163 163 163 163 ...
##  $ seated.stature.(cm)                             : num [1:480] 84.6 84.6 84.6 84.6 84.6 ...
##  $ body.mass.(kg)                                  : num [1:480] 47.3 47.3 47.3 47.3 47.3 45.2 45.2 45.2 45.2 45.2 ...
##  $ khamis.easa.%                                   : num [1:480] 87.9 87.9 87.9 87.9 87.9 ...
##  $ chronological.age.group                         : num [1:480] 14 14 14 14 14 14 14 14 14 14 ...
##  $ khamis.banding.category                         : num [1:480] 2 2 2 2 2 3 3 3 3 3 ...
##  $ fransen.years.to.phv                            : num [1:480] -0.0825 -0.0825 -0.0825 -0.0825 -0.0825 ...
##  $ fransen.banding.category                        : num [1:480] 2 2 2 2 2 3 3 3 3 3 ...
##  $ total.distance.(m)                              : num [1:480] 470 422 413 427 419 ...
##  $ low.intensity.running.(m).(<.13.kph)            : num [1:480] 231 245 225 246 229 ...
##  $ high.intensity.running.(m).(13.1.to.16.1.kph)   : num [1:480] 234 169 170 168 184 ...
##  $ very.high.intensity.running.(m).(16.1.to.19.kph): num [1:480] 3.59 8.23 14.82 5.13 4.43 ...
##  $ sprinting.distance.(m).(>19.1.kph)              : num [1:480] 0 0 2.12 7.89 0 2.88 5.84 0 0 0 ...
##  $ very.high.intensity.activities.(m).(>16.1.kph)  : num [1:480] 3.59 8.23 16.94 13.02 4.43 ...
##  $ maximum.velocity.(m/s)                          : num [1:480] 4.44 4.79 5.25 5.64 4.61 ...
##  $ accelerations.>2m/s/s                           : num [1:480] 1 0 0 0 0 1 2 2 0 4 ...
##  $ decelerations.>2.m/s/s                          : num [1:480] 0 0 1 1 0 1 0 0 0 0 ...
##  $ players.team.maturation.status.group            : chr [1:480] "Mixed" "Mixed" "Mixed" "Mixed" ...
##  $ opponent.team.maturation.status.group           : chr [1:480] "Mixed" "Mixed" "Mixed" "Mixed" ...
##  $ cover                                           : num [1:480] 3 3 3 2 2 2 2 3 3 3 ...
##  $ communication                                   : num [1:480] 3 2 3 2 2 1 1 2 3 2 ...
##  $ decision.making                                 : num [1:480] 3 3 2 3 2 3 2 3 3 3 ...
##  $ passing                                         : num [1:480] 3 2 3 3 3 3 2 3 3 3 ...
##  $ 1st.touch                                       : num [1:480] 3 3 3 3 3 3 3 3 2 3 ...
##  $ control                                         : num [1:480] NA 3 3 NA NA NA NA 2 NA 2 ...
##  $ 1v1                                             : num [1:480] 3 3 3 2 2 2 2 3 3 3 ...
##  $ shooting                                        : num [1:480] 3 2 3 NA NA NA 2 NA NA NA ...
##  $ assist                                          : num [1:480] NA NA NA NA NA NA NA NA NA NA ...
##  $ marking                                         : num [1:480] 3 3 2 2 2 2 NA NA 3 3 ...
##  $ total.gtsc.score                                : num [1:480] 24 24 25 17 16 16 14 19 20 22 ...
##  $ result                                          : chr [1:480] "Draw" "Loss" "Loss" "Loss" ...
##  $ assists                                         : num [1:480] 0 0 0 0 0 0 0 0 0 0 ...
##  $ attempts.created                                : num [1:480] 0 0 0 1 0 0 1 1 0 1 ...
##  $ blocks                                          : num [1:480] 0 0 0 0 0 0 0 0 1 0 ...
##  $ chances.created                                 : num [1:480] 0 0 0 0 0 0 0 0 0 0 ...
##  $ successful.passes                               : num [1:480] 3 3 4 7 4 7 5 3 6 2 ...
##  $ forward.passes                                  : num [1:480] 2 0 1 4 0 7 4 1 5 2 ...
##  $ goals                                           : num [1:480] 0 0 0 0 0 0 1 1 0 0 ...
##  $ in.possession.duels.lost                        : num [1:480] 0 0 0 0 0 0 1 0 0 0 ...
##  $ in.possession.duels.won                         : num [1:480] 0 0 0 0 0 0 0 0 0 0 ...
##  $ entries.to.opposition.half                      : num [1:480] 1 2 1 1 2 1 1 0 1 1 ...
##  $ out.of.possession.duels.lost                    : num [1:480] 0 0 0 0 0 0 0 0 0 1 ...
##  $ out.of.possession.duels.won                     : num [1:480] 0 0 0 0 0 0 0 0 0 1 ...
##  $ total.passes                                    : num [1:480] 4 4 5 10 4 9 6 3 7 3 ...
##  $ tackles.&.interceptions                         : num [1:480] 2 0 1 0 1 4 6 2 1 0 ...
##  $ turns                                           : num [1:480] 0 0 0 0 1 0 0 0 0 1 ...
# Taking a glimpse of the new subset

head(sorted_biobanding)
## # A tibble: 6 × 51
##   method.of.grouping.teams player.id date.of.birth   age birth.month
##   <chr>                        <dbl> <chr>         <dbl> <chr>      
## 1 Chronological Age Groups        18 05/30/2005     13.6 May        
## 2 Chronological Age Groups        18 05/30/2005     13.6 May        
## 3 Chronological Age Groups        18 05/30/2005     13.6 May        
## 4 Chronological Age Groups        18 05/30/2005     13.6 May        
## 5 Chronological Age Groups        18 05/30/2005     13.6 May        
## 6 Chronological Age Groups        14 05/19/2005     13.6 May        
## # ℹ 46 more variables: `stature.(cm)` <dbl>, `seated.stature.(cm)` <dbl>,
## #   `body.mass.(kg)` <dbl>, `khamis.easa.%` <dbl>,
## #   chronological.age.group <dbl>, khamis.banding.category <dbl>,
## #   fransen.years.to.phv <dbl>, fransen.banding.category <dbl>,
## #   `total.distance.(m)` <dbl>, `low.intensity.running.(m).(<.13.kph)` <dbl>,
## #   `high.intensity.running.(m).(13.1.to.16.1.kph)` <dbl>,
## #   `very.high.intensity.running.(m).(16.1.to.19.kph)` <dbl>, …

This is a better version to use for statistical analysis going forward, but one thing that stumped me was the NAs. If I removed every row containing an NA, I would be left with roughly 10-15 of the 480 observations, which is far too few to conduct anything meaningful.
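A minimal sketch of how the missing values could be counted before deciding what to drop. The data frame `toy` is made up for illustration (it only borrows three column names from the dataset); the same two calls could be pointed at `sorted_biobanding`.

```r
# Made-up data standing in for a few of the GTSC columns
toy <- data.frame(
  control  = c(NA, 3, 3, NA, 2),
  shooting = c(3, 2, NA, NA, 2),
  passing  = c(3, 2, 3, 3, 3)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Number of rows that would survive dropping every row with any NA
n_complete <- sum(complete.cases(toy))

na_per_column
n_complete
```

Looking at the per-column counts first shows whether a handful of sparse columns (such as `assist`) are responsible for most of the lost rows, in which case dropping those columns may be preferable to dropping rows.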

Preliminary plots

Exploring relationship between method of grouping teams and ages

sorted_biobanding |>

# 1) We will be using the ggplot2 package for plots and setting the aesthetics to method of grouping and ages
  ggplot(aes(x = method.of.grouping.teams, y = age)) +

# 2) This feature will make a boxplot, set the color of the boxes, and use a black-and-white background
  geom_boxplot(color = "#C8A2C8") +
  theme_bw() +

# 3) This feature will help you label your axis and title
  labs(x = "Method of Grouping", y = "Age of Academy Player", title = "Age Distribution by Bio-banding Group", caption = "Source: Bio-banding in soccer")

The median age is roughly similar across all four groups. However, the group named Random sits slightly further from the others, and that can make a big difference considering that players born only months apart can be at different stages of maturity. Next, the discussion moves on to the physical makeup of academy players.

What is the relationship between body mass and stature for academy players?

There will be two plots. The first explores how the method of grouping teams is associated with the two variables; the second explores how age is associated with them.

sorted_biobanding |>

# 1) We will be using the ggplot2 package for plots and setting the aesthetics to mass and stature
  ggplot(aes(x = `body.mass.(kg)`, y = `stature.(cm)`, color = method.of.grouping.teams)) +

# 2) This feature is for making a scatter plot and setting the background and shape for points
  geom_point(aes(shape = method.of.grouping.teams)) +
  theme_bw() +

# 3) This feature will apply the linear model to the scatterplot
  geom_smooth(method = "lm", se = FALSE) +

# 4) This feature will help you label your axis and title
  labs(x = "Body Mass (kg)", y = "Stature (cm)", title = "Scatterplot of Body Mass to Stature", subtitle = "For Academy Players Aged 11-14", caption = "Source: Bio-banding in soccer")

There is a positive association between body mass and stature, and the geom_smooth layer helps us visualize it. In other words, as body mass increases, stature tends to increase as well.

sorted_biobanding |>

# 1) We will be using the ggplot2 package for plots and setting the aesthetics to mass and stature
  ggplot(aes(x = `body.mass.(kg)`, y = `stature.(cm)`)) +

# 2) This feature will set a color gradient for age, add transparency to the points plotted, and set the background to black and white
#    (alpha is a fixed value, so it goes outside aes() and does not create its own legend)
  geom_point(aes(color = age), alpha = 0.5) +
  scale_color_gradient(low = "green", high = "purple") +
  theme_bw() +

# 3) This feature will apply the linear model to the scatter plot
  geom_smooth(method = "lm", se = FALSE) +

# 4) This feature will help you label your axis and title
  labs(x = "Body Mass (kg)", y = "Stature (cm)", title = "Scatterplot of Body Mass to Stature", subtitle = "For Academy Players Aged 11-14", caption = "Source: Bio-banding in soccer")

A linear model to showcase the relationship

# This fits a linear model using the format y ~ x (stature predicted by body mass)
fit1 <- lm(data = sorted_biobanding, `stature.(cm)` ~ `body.mass.(kg)`)

# This shows us the summary statistics of the fitted linear model
summary(fit1)
## 
## Call:
## lm(formula = `stature.(cm)` ~ `body.mass.(kg)`, data = sorted_biobanding)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2751  -2.1311   0.3935   2.6174   9.2043 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      119.13060    1.25530   94.90   <2e-16 ***
## `body.mass.(kg)`   0.87425    0.02632   33.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.69 on 478 degrees of freedom
## Multiple R-squared:  0.6978, Adjusted R-squared:  0.6971 
## F-statistic:  1104 on 1 and 478 DF,  p-value: < 2.2e-16

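There are two things to note: the fitted equation, stature = 119.13 + 0.874 × body mass (in cm and kg), and the adjusted R-squared value of 0.6971, which means body mass explains roughly 70% of the variability in stature.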
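To make the fitted equation concrete, here is a minimal sketch that plugs a body mass into the coefficients reported by `summary(fit1)`. The helper name `predict_stature` is hypothetical and only for illustration; with the fitted model in memory, `predict(fit1, newdata = ...)` would serve the same purpose.

```r
# Coefficients copied from the summary(fit1) output
intercept <- 119.13060
slope     <- 0.87425

# Hypothetical helper: predicted stature (cm) for a given body mass (kg)
predict_stature <- function(body_mass_kg) {
  intercept + slope * body_mass_kg
}

# Body mass of the first player listed in the dataset (47.3 kg)
predict_stature(47.3)
```

For this player the model predicts about 160.5 cm, compared with a recorded stature of roughly 163 cm, a gap that is well within the residual standard error of 4.69 cm.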

Making a residual plot to compare and contrast

# 1) augment() (from broom, attached by tidymodels) adds the .fitted and .resid columns; we set the aesthetics to fitted and residuals
augment(fit1) |>
  ggplot(aes(x = .fitted, y = .resid)) +

# 2) This feature will create a scatter plot and a horizontal line set at y = 0 with red coloring and higher line width  
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linewidth = 1.5) +

# 3) This feature will set the background to black and white, plus label your axis and title  
  theme_bw() +
  labs(x = "Fitted Values", y = "Residual Values", title = "Residuals vs. Fitted Values", subtitle = "For a Linear Regression Model")

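The residual plot does not show any clear pattern about the horizontal line at y = 0, which supports the linearity and constant-variance assumptions of the model. Whether the residuals are normally distributed cannot be judged from this plot alone; a histogram or Q-Q plot of the residuals would be needed for that.

Exploring different relationships and setting up a criterion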

unique(sorted_biobanding$method.of.grouping.teams)
## [1] "Chronological Age Groups" "Khamis-Roche"            
## [3] "Fransen"                  "Random"

The Random group is the vaguest of the four; the others are reasonable to include. I am not sure whether random means a “free for all” scenario with no grouping criterion. I will also conduct a test of the difference in proportions between Fransen and Khamis-Roche (the two bio-banding methods).

unique(sorted_biobanding$total.gtsc.score)
##  [1] 24 25 17 16 14 19 20 22 23  9 10 13 18 26 27 15 11 28 12  3 29 34 30 31 21
## [26] 36  6  8 33  7
summary(sorted_biobanding$total.gtsc.score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   12.00   16.00   16.44   20.00   36.00

There are no NA entries, which is good, so we can set a criterion for what counts as a ‘good’ score. Any score above the mean of 16.44 (in practice 17 to 36, since the scores are whole numbers) is considered passing the test for good technical and tactical assessment.
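A minimal sketch of the pass/fail rule, computing the cutoff from the data instead of typing it in. The vector `scores` holds only the first ten `total.gtsc.score` values shown in the `str()` output, so its mean differs from the full-data mean of 16.44; the names `cutoff` and `rating` are my own.

```r
# First ten total.gtsc.score values from the str() output
scores <- c(24, 24, 25, 17, 16, 16, 14, 19, 20, 22)

# Cutoff taken from the data rather than hard-coded
cutoff <- mean(scores)

# Scores strictly above the cutoff pass; everything else fails
rating <- ifelse(scores > cutoff, "pass", "fail")

cutoff
table(rating)
```

Deriving the cutoff this way keeps the rule in sync with the data if rows are later added or removed.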

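good <- sorted_biobanding |>

# Labelling scores above the mean (16.44) as "pass" and the rest as "fail"
  mutate(high_rating = ifelse(total.gtsc.score > 16.44, "pass", "fail")) |>

# Keeping only the two bio-banding methods
  filter(method.of.grouping.teams %in% c("Fransen", "Khamis-Roche"))

# Checking which birth months appear in the data
unique(sorted_biobanding$birth.month)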
##  [1] "May"       "February"  "July"      "November"  "September" "October"  
##  [7] "April"     "August"    "December"  "January"

There are no entries for March and June, which is perhaps worth a closer look.

Constructing a 95% confidence interval for the difference in the proportion of passing scores between the Fransen and Khamis-Roche groups

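# Bootstrap distribution of the difference in proportions (Fransen minus Khamis-Roche)
prop_teams <- good |>
  specify(high_rating ~ method.of.grouping.teams, success = "pass") |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "diff in props", order = c("Fransen", "Khamis-Roche"))

# Observed difference in proportions from the original sample
d_hat <- good |>
  specify(high_rating ~ method.of.grouping.teams, success = "pass") |>
  calculate(stat = "diff in props", order = c("Fransen", "Khamis-Roche")) |>
  pull(stat)

# Standard error: the standard deviation of the bootstrap differences
SE <- prop_teams |>
  summarise(se = sd(stat)) |>
  pull(se)

# Approximate 95% interval: estimate plus or minus two standard errors
c(d_hat - 2 * SE, d_hat + 2 * SE)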
## [1] 0.1664013 0.4169321

We are approximately 95% confident that the true difference in the proportions of Fransen and Khamis-Roche observations passing the assessment (Fransen minus Khamis-Roche) is between 16.64 and 41.69 percentage points. Because the interval does not contain zero, there is compelling evidence of a true difference between the two grouping methods in passing the technical and tactical assessment.
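The idea behind the interval can be sketched in base R without the infer package. Everything here is an assumption for illustration: the two groups and their pass counts are made up, the seed is arbitrary, and the sketch resamples within each group, which may differ from how infer draws its bootstrap samples.

```r
set.seed(217)  # arbitrary seed so the sketch is reproducible

# Made-up pass/fail outcomes for two hypothetical groups of 120 observations
group_a <- rep(c("pass", "fail"), times = c(70, 50))
group_b <- rep(c("pass", "fail"), times = c(40, 80))

# Observed difference in pass proportions (A minus B)
d_hat <- mean(group_a == "pass") - mean(group_b == "pass")

# Resample each group with replacement and recompute the difference
boot_diffs <- replicate(1000, {
  a_star <- sample(group_a, replace = TRUE)
  b_star <- sample(group_b, replace = TRUE)
  mean(a_star == "pass") - mean(b_star == "pass")
})

# Standard error and the "estimate plus or minus two SE" interval
se <- sd(boot_diffs)
ci <- c(lower = d_hat - 2 * se, upper = d_hat + 2 * se)
ci
```

Because the resampling is random, the endpoints change slightly from run to run unless a seed is set, which is worth keeping in mind when quoting the interval in the write-up.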

We will conduct a chi-square test for the same variables

obs <- good |>

  select(method.of.grouping.teams, high_rating) |>
  table()

obs |>
  tidy() |>
  uncount(n)
## # A tibble: 240 × 2
##    method.of.grouping.teams high_rating
##    <chr>                    <chr>      
##  1 Fransen                  fail       
##  2 Fransen                  fail       
##  3 Fransen                  fail       
##  4 Fransen                  fail       
##  5 Fransen                  fail       
##  6 Fransen                  fail       
##  7 Fransen                  fail       
##  8 Fransen                  fail       
##  9 Fransen                  fail       
## 10 Fransen                  fail       
## # ℹ 230 more rows
chisq.test(good$method.of.grouping.teams, good$high_rating)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  good$method.of.grouping.teams and good$high_rating
## X-squared = 19.279, df = 1, p-value = 1.13e-05

Earlier, the bootstrap confidence interval indicated a statistically significant difference. Let us see whether the chi-square test corroborates this. The p-value (1.13e-05) is less than 0.05, which gives compelling evidence of an association between the grouping method (Fransen or Khamis-Roche) and passing the assessment.

The conclusion (Part Three)

  • To recap, I have fitted a linear model (soon to be a multiple regression, but I need more time), constructed a bootstrap confidence interval, and run a chi-square test for a p-value

  • The results indicate statistical significance for predicting stature from body mass, and the bio-banding method used (Fransen or Khamis-Roche) is associated with academy players passing a technical and tactical assessment when the criterion is high ratings only.

  • I would say one more test would help these findings, as would making the linear model stronger with more predictor variables. Now I wonder where the ongoing research on bio-banding is heading and whether the practice is still in use.
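To show what the chi-square statistic measures, the sketch below computes it by hand for a made-up 2 x 2 table and compares it with `chisq.test()`. The counts are hypothetical, and `correct = FALSE` is used so the hand calculation matches; the test in the project used the default Yates' continuity correction, which gives a somewhat smaller statistic.

```r
# Made-up 2 x 2 table of counts: rows are groups, columns are outcomes
counts <- matrix(
  c(70, 40, 50, 80), nrow = 2,
  dimnames = list(group = c("A", "B"), rating = c("pass", "fail"))
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson statistic: sum of (observed - expected)^2 / expected
x_squared <- sum((counts - expected)^2 / expected)

# The same statistic from chisq.test() without the continuity correction
builtin <- chisq.test(counts, correct = FALSE)

x_squared
builtin$p.value
```

The larger the gap between the observed and expected counts, the larger the statistic and the smaller the p-value.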

  • Include the bibliography, which we already did earlier in the month

  • I’ll try to get started with the slides, but I’m not sure if I can get it done in time with other projects going on at the same time