title: “Final Project (MATH217)”
author: “A. Diaz-Nova”
format: pdf
editor: visual
Academies currently use a strategy called bio-banding to combat large discrepancies in physical maturity within youth soccer. In recent years, top soccer clubs in England (e.g., Brighton and AFC Bournemouth) have found great value in this research and its implementation. The hope is to assist young players by applying specific conditioning regimes that reduce physical demand so that on-ball actions and tactical-cognitive thinking can improve. Most research findings corroborate this belief: bio-banding appears to make game states more technically and tactically challenging for the academy players taking part.
Drawing on the contributions of 13 authors, the study aimed to examine the effect of ‘bio-banding’ on technical and tactical markers of talent ‘ID’ in 11-14 year old academy soccer players. In other words, at younger ages some players develop toward their adult physical size earlier or later than others, so grouping players by maturity lets us measure technical and tactical ability without the distortion of extreme physical dominance.
However, the research leaves a lot to be desired. Since bio-banding is still in its infancy, further research should focus on determining the effectiveness and limitations of this approach. As such, the journal article recommends considering the long-term effects, the optimal time frame for application, and so forth.
Some questions that linger based on the research are:
Which method of grouping was most effective in finding players with a better GTSC score? (To be answered with a hypothesis test.)
What is the relationship among the variables related to maturity (Stature, Mass, Seated Stature)?
library(tidyverse)
library(tidymodels)
library(infer)
# Setting the working directory
setwd("C:/Users/Angel/OneDrive/Documents/Datasets")
# Loading the data from a downloaded csv file
bio_banding <- read_csv("Biobanding Small Sided Games.csv")
# Seeing the structure of the dataset
str(bio_banding)
## spc_tbl_ [480 × 70] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Trial Number : num [1:480] 1 1 1 1 1 1 1 1 1 1 ...
## $ Method of Grouping Teams : chr [1:480] "Chronological Age Groups" "Chronological Age Groups" "Chronological Age Groups" "Chronological Age Groups" ...
## $ Player ID : num [1:480] 18 18 18 18 18 14 14 14 14 14 ...
## $ Test Team Number : num [1:480] 6 6 6 6 6 5 5 5 5 5 ...
## $ Date of Birth : chr [1:480] "05/30/2005" "05/30/2005" "05/30/2005" "05/30/2005" ...
## $ Age : num [1:480] 13.6 13.6 13.6 13.6 13.6 ...
## $ Birth Month : chr [1:480] "May" "May" "May" "May" ...
## $ Quartile : num [1:480] 3 3 3 3 3 3 3 3 3 3 ...
## $ Position : num [1:480] 4 4 4 4 4 3 3 3 3 3 ...
## $ Stature (cm) : num [1:480] 163 163 163 163 163 ...
## $ Seated Stature (cm) : num [1:480] 84.6 84.6 84.6 84.6 84.6 ...
## $ Body Mass (kg) : num [1:480] 47.3 47.3 47.3 47.3 47.3 45.2 45.2 45.2 45.2 45.2 ...
## $ Khamis EASA % : num [1:480] 87.9 87.9 87.9 87.9 87.9 ...
## $ Chronological Age Group : num [1:480] 14 14 14 14 14 14 14 14 14 14 ...
## $ Khamis Banding Category : num [1:480] 2 2 2 2 2 3 3 3 3 3 ...
## $ Fransen Years to PHV : num [1:480] -0.0825 -0.0825 -0.0825 -0.0825 -0.0825 ...
## $ Fransen Banding Category : num [1:480] 2 2 2 2 2 3 3 3 3 3 ...
## $ Period Name : chr [1:480] "3B Game 1" "3B Game 2" "3B Game 3" "3B Game 4" ...
## $ Duration (Mins) : num [1:480] 5.02 5.05 4.93 4.98 4.98 ...
## $ Total Distance (m) : num [1:480] 470 422 413 427 419 ...
## $ Low Intensity Running (m) (< 13 kph) : num [1:480] 231 245 225 246 229 ...
## $ High Intensity Running (m) (13.1 to 16.1 kph) : num [1:480] 234 169 170 168 184 ...
## $ Very High Intensity Running (m) (16.1 to 19 kph): num [1:480] 3.59 8.23 14.82 5.13 4.43 ...
## $ Sprinting Distance (m) (>19.1 kph) : num [1:480] 0 0 2.12 7.89 0 2.88 5.84 0 0 0 ...
## $ Very High Intensity Activities (m) (>16.1 kph) : num [1:480] 3.59 8.23 16.94 13.02 4.43 ...
## $ Player Load (AU) : num [1:480] 55.9 51.4 46.2 49 48.7 ...
## $ Mean Heart Rate (bpm)...27 : num [1:480] 174 163 162 165 165 ...
## $ Maximum Velocity (m/s) : num [1:480] 4.44 4.79 5.25 5.64 4.61 ...
## $ Accelerations >2m/s/s : num [1:480] 1 0 0 0 0 1 2 2 0 4 ...
## $ Decelerations >2 m/s/s : num [1:480] 0 0 1 1 0 1 0 0 0 0 ...
## $ Mean Heart Rate (bpm)...31 : num [1:480] 174 163 162 165 165 ...
## $ Team : chr [1:480] "3B" "3B" "3B" "3B" ...
## $ Players Team Maturation Status Group : chr [1:480] "Mixed" "Mixed" "Mixed" "Mixed" ...
## $ Opponent Team Maturation Status Group : chr [1:480] "Mixed" "Mixed" "Mixed" "Mixed" ...
## $ Players Team Maturation Status Group No : num [1:480] 4 4 4 4 4 4 4 4 4 4 ...
## $ Opponent Team Maturation Status Group No : num [1:480] 4 4 4 4 4 4 4 4 4 4 ...
## $ Cover : num [1:480] 3 3 3 2 2 2 2 3 3 3 ...
## $ Communication : num [1:480] 3 2 3 2 2 1 1 2 3 2 ...
## $ Decision Making : num [1:480] 3 3 2 3 2 3 2 3 3 3 ...
## $ Passing : num [1:480] 3 2 3 3 3 3 2 3 3 3 ...
## $ 1st Touch : num [1:480] 3 3 3 3 3 3 3 3 2 3 ...
## $ Control : num [1:480] NA 3 3 NA NA NA NA 2 NA 2 ...
## $ 1v1 : num [1:480] 3 3 3 2 2 2 2 3 3 3 ...
## $ Shooting : num [1:480] 3 2 3 NA NA NA 2 NA NA NA ...
## $ Assist : num [1:480] NA NA NA NA NA NA NA NA NA NA ...
## $ Marking : num [1:480] 3 3 2 2 2 2 NA NA 3 3 ...
## $ Total GTSC Score : num [1:480] 24 24 25 17 16 16 14 19 20 22 ...
## $ Opponent : chr [1:480] "3A" "3A" "3A" "3A" ...
## $ Result : chr [1:480] "Draw" "Loss" "Loss" "Loss" ...
## $ sRPE-Overall : num [1:480] 220 175 225 230 280 220 190 185 180 245 ...
## $ Differential sRPE-Breathing : num [1:480] 175 235 225 220 255 275 180 150 125 210 ...
## $ Differential sRPE-Legs : num [1:480] 115 155 220 230 220 260 225 225 215 195 ...
## $ Differential sRPE-Tec/Tac : num [1:480] 110 105 95 115 105 175 185 180 190 215 ...
## $ Game Number : num [1:480] 1 2 3 4 5 1 2 3 4 5 ...
## $ Assists : num [1:480] 0 0 0 0 0 0 0 0 0 0 ...
## $ Attempts : num [1:480] 1 1 0 2 2 0 2 1 0 1 ...
## $ Attempts Created : num [1:480] 0 0 0 1 0 0 1 1 0 1 ...
## $ Blocks : num [1:480] 0 0 0 0 0 0 0 0 1 0 ...
## $ Chances Created : num [1:480] 0 0 0 0 0 0 0 0 0 0 ...
## $ Successful Passes : num [1:480] 3 3 4 7 4 7 5 3 6 2 ...
## $ Forward Passes : num [1:480] 2 0 1 4 0 7 4 1 5 2 ...
## $ Goals : num [1:480] 0 0 0 0 0 0 1 1 0 0 ...
## $ In Possession Duels Lost : num [1:480] 0 0 0 0 0 0 1 0 0 0 ...
## $ In Possession Duels Won : num [1:480] 0 0 0 0 0 0 0 0 0 0 ...
## $ Entries to Opposition Half : num [1:480] 1 2 1 1 2 1 1 0 1 1 ...
## $ Out of Possession Duels Lost : num [1:480] 0 0 0 0 0 0 0 0 0 1 ...
## $ Out of Possession Duels Won : num [1:480] 0 0 0 0 0 0 0 0 0 1 ...
## $ Total Passes : num [1:480] 4 4 5 10 4 9 6 3 7 3 ...
## $ Tackles & Interceptions : num [1:480] 2 0 1 0 1 4 6 2 1 0 ...
## $ Turns : num [1:480] 0 0 0 0 1 0 0 0 0 1 ...
## - attr(*, "spec")=
## .. cols(
## .. `Trial Number` = col_double(),
## .. `Method of Grouping Teams` = col_character(),
## .. `Player ID` = col_double(),
## .. `Test Team Number` = col_double(),
## .. `Date of Birth` = col_character(),
## .. Age = col_double(),
## .. `Birth Month` = col_character(),
## .. Quartile = col_double(),
## .. Position = col_double(),
## .. `Stature (cm)` = col_double(),
## .. `Seated Stature (cm)` = col_double(),
## .. `Body Mass (kg)` = col_double(),
## .. `Khamis EASA %` = col_double(),
## .. `Chronological Age Group` = col_double(),
## .. `Khamis Banding Category` = col_double(),
## .. `Fransen Years to PHV` = col_double(),
## .. `Fransen Banding Category` = col_double(),
## .. `Period Name` = col_character(),
## .. `Duration (Mins)` = col_double(),
## .. `Total Distance (m)` = col_double(),
## .. `Low Intensity Running (m) (< 13 kph)` = col_double(),
## .. `High Intensity Running (m) (13.1 to 16.1 kph)` = col_double(),
## .. `Very High Intensity Running (m) (16.1 to 19 kph)` = col_double(),
## .. `Sprinting Distance (m) (>19.1 kph)` = col_double(),
## .. `Very High Intensity Activities (m) (>16.1 kph)` = col_double(),
## .. `Player Load (AU)` = col_double(),
## .. `Mean Heart Rate (bpm)...27` = col_double(),
## .. `Maximum Velocity (m/s)` = col_double(),
## .. `Accelerations >2m/s/s` = col_double(),
## .. `Decelerations >2 m/s/s` = col_double(),
## .. `Mean Heart Rate (bpm)...31` = col_double(),
## .. Team = col_character(),
## .. `Players Team Maturation Status Group` = col_character(),
## .. `Opponent Team Maturation Status Group` = col_character(),
## .. `Players Team Maturation Status Group No` = col_double(),
## .. `Opponent Team Maturation Status Group No` = col_double(),
## .. Cover = col_double(),
## .. Communication = col_double(),
## .. `Decision Making` = col_double(),
## .. Passing = col_double(),
## .. `1st Touch` = col_double(),
## .. Control = col_double(),
## .. `1v1` = col_double(),
## .. Shooting = col_double(),
## .. Assist = col_double(),
## .. Marking = col_double(),
## .. `Total GTSC Score` = col_double(),
## .. Opponent = col_character(),
## .. Result = col_character(),
## .. `sRPE-Overall` = col_double(),
## .. `Differential sRPE-Breathing` = col_double(),
## .. `Differential sRPE-Legs` = col_double(),
## .. `Differential sRPE-Tec/Tac` = col_double(),
## .. `Game Number` = col_double(),
## .. Assists = col_double(),
## .. Attempts = col_double(),
## .. `Attempts Created` = col_double(),
## .. Blocks = col_double(),
## .. `Chances Created` = col_double(),
## .. `Successful Passes` = col_double(),
## .. `Forward Passes` = col_double(),
## .. Goals = col_double(),
## .. `In Possession Duels Lost` = col_double(),
## .. `In Possession Duels Won` = col_double(),
## .. `Entries to Opposition Half` = col_double(),
## .. `Out of Possession Duels Lost` = col_double(),
## .. `Out of Possession Duels Won` = col_double(),
## .. `Total Passes` = col_double(),
## .. `Tackles & Interceptions` = col_double(),
## .. Turns = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
There are 480 observations and 70 variables available, but the data is not in a form that is easy to use. We will have to clean the data before moving forward.
# A glimpse of the original data
head(bio_banding)
## # A tibble: 6 × 70
## `Trial Number` `Method of Grouping Teams` `Player ID` `Test Team Number`
## <dbl> <chr> <dbl> <dbl>
## 1 1 Chronological Age Groups 18 6
## 2 1 Chronological Age Groups 18 6
## 3 1 Chronological Age Groups 18 6
## 4 1 Chronological Age Groups 18 6
## 5 1 Chronological Age Groups 18 6
## 6 1 Chronological Age Groups 14 5
## # ℹ 66 more variables: `Date of Birth` <chr>, Age <dbl>, `Birth Month` <chr>,
## # Quartile <dbl>, Position <dbl>, `Stature (cm)` <dbl>,
## # `Seated Stature (cm)` <dbl>, `Body Mass (kg)` <dbl>, `Khamis EASA %` <dbl>,
## # `Chronological Age Group` <dbl>, `Khamis Banding Category` <dbl>,
## # `Fransen Years to PHV` <dbl>, `Fransen Banding Category` <dbl>,
## # `Period Name` <chr>, `Duration (Mins)` <dbl>, `Total Distance (m)` <dbl>,
## # `Low Intensity Running (m) (< 13 kph)` <dbl>, …
Notes:
The first six observations in “Date of Birth” are from 2005
The physical variables are “Stature (cm)” and “Body Mass (kg)”
What is Khamis EASA %? Or Khamis Banding Category? Or Fransen Years to PHV?
How is Player Load quantified?
The technical variables are Cover, Communication, Decision Making, Passing, 1st Touch, Control, 1v1, and so on
This is the raw dataset without any changes, meaning it is rough around the edges. The next step is to clean the dataset without discarding crucial elements or the richness of the data.
# Replacing every space in the column names with a period (.) and making the names lowercase
names(bio_banding) <- gsub(" ", ".", names(bio_banding))
names(bio_banding) <- tolower(names(bio_banding))
head(bio_banding)
## # A tibble: 6 × 70
## trial.number method.of.grouping.teams player.id test.team.number date.of.birth
## <dbl> <chr> <dbl> <dbl> <chr>
## 1 1 Chronological Age Groups 18 6 05/30/2005
## 2 1 Chronological Age Groups 18 6 05/30/2005
## 3 1 Chronological Age Groups 18 6 05/30/2005
## 4 1 Chronological Age Groups 18 6 05/30/2005
## 5 1 Chronological Age Groups 18 6 05/30/2005
## 6 1 Chronological Age Groups 14 5 05/19/2005
## # ℹ 65 more variables: age <dbl>, birth.month <chr>, quartile <dbl>,
## # position <dbl>, `stature.(cm)` <dbl>, `seated.stature.(cm)` <dbl>,
## # `body.mass.(kg)` <dbl>, `khamis.easa.%` <dbl>,
## # chronological.age.group <dbl>, khamis.banding.category <dbl>,
## # fransen.years.to.phv <dbl>, fransen.banding.category <dbl>,
## # period.name <chr>, `duration.(mins)` <dbl>, `total.distance.(m)` <dbl>,
## # `low.intensity.running.(m).(<.13.kph)` <dbl>, …
Every column name is now lowercase, with a period separating the words. Names that contain parentheses or symbols, such as `stature.(cm)`, still need backticks when used in code. The key thing now is to remove variables that are redundant or not easily understandable.
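The renaming step keeps parentheses and symbols in the names, which is why several columns need backticks later on. As a minimal sketch, assuming fully syntactic names are wanted, a regular expression can collapse every run of non-alphanumeric characters into a single period. The helper `clean_column_names()` and the toy name vector are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: lowercase the names, replace each run of
# non-alphanumeric characters with one period, then trim stray periods
clean_column_names <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", ".", x)
  gsub("^\\.+|\\.+$", "", x)
}

# Toy examples modeled on the raw column names
raw_names <- c("Stature (cm)", "Khamis EASA %", "Accelerations >2m/s/s")
clean_names <- clean_column_names(raw_names)
print(clean_names)
```

Names that begin with a digit, such as `1st.touch`, would still need backticks after this step, so a prefix would be needed if that matters.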
sorted_biobanding <- bio_banding |>
# There is no description for variables like quartile, position, player load, and so forth
# Other variables aren't of interest, so we remove them from the new subset (heart rate, period name, sRPE, and so forth)
select(-c(quartile, position, `mean.heart.rate.(bpm)...27`, `mean.heart.rate.(bpm)...31`, team, opponent, `player.load.(au)`, test.team.number, period.name, `duration.(mins)`, players.team.maturation.status.group.no, opponent.team.maturation.status.group.no, `differential.srpe-breathing`, `differential.srpe-legs`, `differential.srpe-tec/tac`, `srpe-overall`, game.number, attempts, trial.number))
# Viewing the structure of the new subset
str(sorted_biobanding)
## tibble [480 × 51] (S3: tbl_df/tbl/data.frame)
## $ method.of.grouping.teams : chr [1:480] "Chronological Age Groups" "Chronological Age Groups" "Chronological Age Groups" "Chronological Age Groups" ...
## $ player.id : num [1:480] 18 18 18 18 18 14 14 14 14 14 ...
## $ date.of.birth : chr [1:480] "05/30/2005" "05/30/2005" "05/30/2005" "05/30/2005" ...
## $ age : num [1:480] 13.6 13.6 13.6 13.6 13.6 ...
## $ birth.month : chr [1:480] "May" "May" "May" "May" ...
## $ stature.(cm) : num [1:480] 163 163 163 163 163 ...
## $ seated.stature.(cm) : num [1:480] 84.6 84.6 84.6 84.6 84.6 ...
## $ body.mass.(kg) : num [1:480] 47.3 47.3 47.3 47.3 47.3 45.2 45.2 45.2 45.2 45.2 ...
## $ khamis.easa.% : num [1:480] 87.9 87.9 87.9 87.9 87.9 ...
## $ chronological.age.group : num [1:480] 14 14 14 14 14 14 14 14 14 14 ...
## $ khamis.banding.category : num [1:480] 2 2 2 2 2 3 3 3 3 3 ...
## $ fransen.years.to.phv : num [1:480] -0.0825 -0.0825 -0.0825 -0.0825 -0.0825 ...
## $ fransen.banding.category : num [1:480] 2 2 2 2 2 3 3 3 3 3 ...
## $ total.distance.(m) : num [1:480] 470 422 413 427 419 ...
## $ low.intensity.running.(m).(<.13.kph) : num [1:480] 231 245 225 246 229 ...
## $ high.intensity.running.(m).(13.1.to.16.1.kph) : num [1:480] 234 169 170 168 184 ...
## $ very.high.intensity.running.(m).(16.1.to.19.kph): num [1:480] 3.59 8.23 14.82 5.13 4.43 ...
## $ sprinting.distance.(m).(>19.1.kph) : num [1:480] 0 0 2.12 7.89 0 2.88 5.84 0 0 0 ...
## $ very.high.intensity.activities.(m).(>16.1.kph) : num [1:480] 3.59 8.23 16.94 13.02 4.43 ...
## $ maximum.velocity.(m/s) : num [1:480] 4.44 4.79 5.25 5.64 4.61 ...
## $ accelerations.>2m/s/s : num [1:480] 1 0 0 0 0 1 2 2 0 4 ...
## $ decelerations.>2.m/s/s : num [1:480] 0 0 1 1 0 1 0 0 0 0 ...
## $ players.team.maturation.status.group : chr [1:480] "Mixed" "Mixed" "Mixed" "Mixed" ...
## $ opponent.team.maturation.status.group : chr [1:480] "Mixed" "Mixed" "Mixed" "Mixed" ...
## $ cover : num [1:480] 3 3 3 2 2 2 2 3 3 3 ...
## $ communication : num [1:480] 3 2 3 2 2 1 1 2 3 2 ...
## $ decision.making : num [1:480] 3 3 2 3 2 3 2 3 3 3 ...
## $ passing : num [1:480] 3 2 3 3 3 3 2 3 3 3 ...
## $ 1st.touch : num [1:480] 3 3 3 3 3 3 3 3 2 3 ...
## $ control : num [1:480] NA 3 3 NA NA NA NA 2 NA 2 ...
## $ 1v1 : num [1:480] 3 3 3 2 2 2 2 3 3 3 ...
## $ shooting : num [1:480] 3 2 3 NA NA NA 2 NA NA NA ...
## $ assist : num [1:480] NA NA NA NA NA NA NA NA NA NA ...
## $ marking : num [1:480] 3 3 2 2 2 2 NA NA 3 3 ...
## $ total.gtsc.score : num [1:480] 24 24 25 17 16 16 14 19 20 22 ...
## $ result : chr [1:480] "Draw" "Loss" "Loss" "Loss" ...
## $ assists : num [1:480] 0 0 0 0 0 0 0 0 0 0 ...
## $ attempts.created : num [1:480] 0 0 0 1 0 0 1 1 0 1 ...
## $ blocks : num [1:480] 0 0 0 0 0 0 0 0 1 0 ...
## $ chances.created : num [1:480] 0 0 0 0 0 0 0 0 0 0 ...
## $ successful.passes : num [1:480] 3 3 4 7 4 7 5 3 6 2 ...
## $ forward.passes : num [1:480] 2 0 1 4 0 7 4 1 5 2 ...
## $ goals : num [1:480] 0 0 0 0 0 0 1 1 0 0 ...
## $ in.possession.duels.lost : num [1:480] 0 0 0 0 0 0 1 0 0 0 ...
## $ in.possession.duels.won : num [1:480] 0 0 0 0 0 0 0 0 0 0 ...
## $ entries.to.opposition.half : num [1:480] 1 2 1 1 2 1 1 0 1 1 ...
## $ out.of.possession.duels.lost : num [1:480] 0 0 0 0 0 0 0 0 0 1 ...
## $ out.of.possession.duels.won : num [1:480] 0 0 0 0 0 0 0 0 0 1 ...
## $ total.passes : num [1:480] 4 4 5 10 4 9 6 3 7 3 ...
## $ tackles.&.interceptions : num [1:480] 2 0 1 0 1 4 6 2 1 0 ...
## $ turns : num [1:480] 0 0 0 0 1 0 0 0 0 1 ...
# Taking a glimpse of the new subset
head(sorted_biobanding)
## # A tibble: 6 × 51
## method.of.grouping.teams player.id date.of.birth age birth.month
## <chr> <dbl> <chr> <dbl> <chr>
## 1 Chronological Age Groups 18 05/30/2005 13.6 May
## 2 Chronological Age Groups 18 05/30/2005 13.6 May
## 3 Chronological Age Groups 18 05/30/2005 13.6 May
## 4 Chronological Age Groups 18 05/30/2005 13.6 May
## 5 Chronological Age Groups 18 05/30/2005 13.6 May
## 6 Chronological Age Groups 14 05/19/2005 13.6 May
## # ℹ 46 more variables: `stature.(cm)` <dbl>, `seated.stature.(cm)` <dbl>,
## # `body.mass.(kg)` <dbl>, `khamis.easa.%` <dbl>,
## # chronological.age.group <dbl>, khamis.banding.category <dbl>,
## # fransen.years.to.phv <dbl>, fransen.banding.category <dbl>,
## # `total.distance.(m)` <dbl>, `low.intensity.running.(m).(<.13.kph)` <dbl>,
## # `high.intensity.running.(m).(13.1.to.16.1.kph)` <dbl>,
## # `very.high.intensity.running.(m).(16.1.to.19.kph)` <dbl>, …
This version is better suited for statistical analysis, but one thing that stumped me was the number of NAs. If I removed every row containing an NA, I would be left with roughly 10-15 of the 480 observations, which is too few to conduct anything meaningful.
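One way to see why dropping NAs is so costly is to count them per column and compare that with the number of fully complete rows. The sketch below uses a small made-up data frame (the values are hypothetical; only the column names echo the dataset) and base R.

```r
# Hypothetical scores; NA means the action was not recorded in that game
scores <- data.frame(
  passing  = c(3, 2, 3, 3, 3),
  control  = c(NA, 3, 3, NA, NA),
  shooting = c(3, 2, 3, NA, NA),
  assist   = c(NA, NA, NA, NA, 2)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(scores))

# Rows that survive if every row with any NA is removed
n_complete_all <- sum(complete.cases(scores))

# Rows that survive if only the columns needed for one analysis are checked
n_complete_subset <- sum(complete.cases(scores[, c("passing", "shooting")]))

print(na_per_column)
print(n_complete_all)
print(n_complete_subset)
```

Checking completeness only on the columns an analysis actually uses keeps far more rows than a blanket removal, which is one possible way around the problem described above.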
sorted_biobanding |>
# 1) We will be using the ggplot package for plots and setting the aesthetics to method of grouping and ages
ggplot(aes(x = method.of.grouping.teams, y = age))+
# 2) This feature will make a boxplot and set the color for background and boxes
geom_boxplot(color = "#C8A2C8") +
theme_bw() +
# 3) This feature will help you label your axis and title
labs(x = "Method of Grouping", y = "Age of Academy Player", title = "Age Distribution by Bio-banding Group", caption = "Source: Bio-banding in soccer")
Judging by the medians, the ages are distributed roughly similarly across all four groups. However, the median of the Random group sits slightly further away, and that can make a big difference, considering that players born only months apart can be at different stages of maturity. Next, the discussion moves on to the physical makeup of academy players.
There will be two plots of body mass against stature. The first explores how the method of grouping teams is associated with the two variables; the second explores how age is associated with them.
sorted_biobanding |>
# 1) We will be using the ggplot package for plots and setting the aesthetics to mass and stature
ggplot(aes(x = `body.mass.(kg)`, y = `stature.(cm)`, color = method.of.grouping.teams)) +
# 2) This feature is for making a scatter plot and setting the background and color for points
geom_point(aes(shape = method.of.grouping.teams)) +
theme_bw() +
# 3) This feature will apply the linear model to the scatterplot
geom_smooth(method = "lm", se = FALSE) +
# 4) This feature will help you label your axis and title
labs(x = "Body Mass (kg)", y = "Stature (cm)", title = "Scatterplot of Body Mass to Stature", subtitle = "For Academy Players Aged 11-14", caption = "Source: Bio-banding in soccer")
There is a positive association between body mass and stature, and the geom_smooth layer helps us visualize it. In other words, players with greater body mass tend to have greater stature.
sorted_biobanding |>
# 1) We will be using the ggplot package for plots and setting the aesthetics to mass and stature
ggplot(aes(x = `body.mass.(kg)`, y = `stature.(cm)`)) +
# 2) These layers set a color gradient for age, add transparency to the plotted points, and set a black-and-white theme
scale_color_gradient(low = "green", high = "purple") +
# alpha is a fixed value, so it goes outside aes() to avoid a spurious legend
geom_point(aes(color = age), alpha = 0.5) +
theme_bw() +
# 3) This feature will apply the linear model to the scatter plot
geom_smooth(method = "lm", se = FALSE) +
# 4) This feature will help you label your axis and title
labs(x = "Body Mass (kg)", y = "Stature (cm)", title = "Scatterplot of Body Mass to Stature", subtitle = "For Academy Players Aged 11-14", caption = "Source: Bio-banding in soccer")
# Fitting a simple linear model in the form y ~ x (stature explained by body mass)
fit1 <- lm(`stature.(cm)` ~ `body.mass.(kg)`, data = sorted_biobanding)
# Showing the summary statistics of the fitted linear model
summary(fit1)
##
## Call:
## lm(formula = `stature.(cm)` ~ `body.mass.(kg)`, data = sorted_biobanding)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2751 -2.1311 0.3935 2.6174 9.2043
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.13060 1.25530 94.90 <2e-16 ***
## `body.mass.(kg)` 0.87425 0.02632 33.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.69 on 478 degrees of freedom
## Multiple R-squared: 0.6978, Adjusted R-squared: 0.6971
## F-statistic: 1104 on 1 and 478 DF, p-value: < 2.2e-16
There are three things to consider: the linear model, the R-squared value, and the p-value.
The fitted linear model can be written as $\widehat{\text{stature (cm)}} = 119.131 + 0.874 \times \text{body mass (kg)}$. In other words, we can interpret the slope and intercept. The intercept says that a body mass of zero kilograms corresponds to a predicted stature of 119.13 cm, which is an extrapolation with no physical meaning; the slope says that each additional kilogram of body mass is associated with an increase of 0.874 cm in predicted stature.
Roughly 70% of the variation in stature can be explained by this model (adjusted R-squared = 0.6971)
The p-value is below 0.05 (< 2.2e-16), which is strong evidence of a statistically significant linear relationship between body mass and stature
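To make the interpretation concrete, the coefficients from the summary output can be plugged in by hand. This sketch uses the body mass of the first player listed in the data (47.3 kg); the variable names are illustrative only.

```r
# Coefficients copied from the summary(fit1) output
intercept <- 119.13060
slope <- 0.87425

# Predicted stature (cm) for a body mass of 47.3 kg
body_mass <- 47.3
predicted_stature <- intercept + slope * body_mass

# In simple regression the correlation is the square root of R-squared,
# taking the sign of the slope (positive here)
r <- sqrt(0.6978)

print(round(predicted_stature, 2))
print(round(r, 3))
```

The prediction comes out at about 160.48 cm, reasonably close to the roughly 163 cm recorded for that player, and the implied correlation between body mass and stature is about 0.835.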
augment(fit1) |>
# 1) augment() from broom (attached by tidymodels) adds .fitted and .resid columns; we then set the aesthetics to fitted values and residuals
ggplot(aes(x = .fitted, y = .resid))+
# 2) This feature will create a scatter plot and a horizontal line set at y = 0 with red coloring and higher line width
geom_point() +
geom_hline(yintercept = 0, color = "red", linewidth = 1.5) +
# 3) This feature will set the background to black and white, plus label your axis and title
theme_bw() +
labs(x = "Fitted Values", y = "Residual Values", title = "Residuals vs. Fitted Values", subtitle = "For a Linear Regression Model")
The residual plot does not show any pattern around the horizontal line at y = 0, which suggests that the linearity and constant-variance assumptions are reasonable. Normality of the residuals cannot be judged from this plot alone.
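A residuals-versus-fitted plot speaks to linearity and constant variance, while normality is usually checked with a normal Q-Q comparison. The sketch below runs on simulated data (not the bio-banding dataset) and summarizes the Q-Q comparison as a correlation between theoretical and sample quantiles, which is an informal shortcut rather than a formal test.

```r
set.seed(217)

# Simulated data standing in for body mass (x) and stature (y)
x <- runif(200, min = 30, max = 70)
y <- 119 + 0.87 * x + rnorm(200, sd = 4.7)
sim_fit <- lm(y ~ x)

# Theoretical vs. sample quantiles of the residuals, without drawing the plot
qq <- qqnorm(resid(sim_fit), plot.it = FALSE)

# A correlation near 1 means the points would hug the Q-Q reference line
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

On the real model, `qqnorm(resid(fit1))` followed by `qqline(resid(fit1))` would draw the plot itself.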
unique(sorted_biobanding$method.of.grouping.teams)
## [1] "Chronological Age Groups" "Khamis-Roche"
## [3] "Fransen" "Random"
The Random group is the vaguest of the four; the others are reasonable to include. I am not sure whether random means a “free for all” scenario with no grouping criteria. Next, I will test the difference in proportions between Fransen and Khamis-Roche (the two bio-banding methods).
unique(sorted_biobanding$total.gtsc.score)
## [1] 24 25 17 16 14 19 20 22 23 9 10 13 18 26 27 15 11 28 12 3 29 34 30 31 21
## [26] 36 6 8 33 7
summary(sorted_biobanding$total.gtsc.score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 12.00 16.00 16.44 20.00 36.00
There are no NA entries, which is good, so we can set a criterion for what counts as a ‘good’ score. Any score above the mean of 16.44, up to the maximum of 36, is considered a pass on the technical and tactical assessment.
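With this pass criterion, the comparison between the Fransen and Khamis-Roche groups can be written formally. This is a sketch of one reasonable formulation, where $p_F$ and $p_{KR}$ are assumed symbols for the true proportions of passing observations under each method:

```latex
H_0: p_F - p_{KR} = 0 \qquad \text{versus} \qquad H_A: p_F - p_{KR} \neq 0
```

A confidence interval for $p_F - p_{KR}$ that excludes zero counts as evidence against $H_0$.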
# Flag scores above the mean (16.44) as "pass" and keep only the two bio-banding methods
good <- sorted_biobanding |>
mutate(high_rating = ifelse(total.gtsc.score > 16.44, "pass", "fail")) |>
filter(method.of.grouping.teams %in% c("Fransen", "Khamis-Roche"))
unique(sorted_biobanding$birth.month)
## [1] "May" "February" "July" "November" "September" "October"
## [7] "April" "August" "December" "January"
There are no entries for March and June, which is perhaps worth a closer look.
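# Observed difference in pass proportions (Fransen minus Khamis-Roche); assumed definition of d_hat
d_hat <- good |>
specify(high_rating ~ method.of.grouping.teams, success = "pass") |>
calculate(stat = "diff in props", order = c("Fransen", "Khamis-Roche")) |>
pull()
# Bootstrap distribution of the difference in proportions
prop_teams <- good |>
specify(high_rating ~ method.of.grouping.teams, success = "pass") |>
generate(reps = 1000, type = "bootstrap") |>
calculate(stat = "diff in props", order = c("Fransen", "Khamis-Roche"))
SE <- prop_teams |>
summarise(se = sd(stat)) |>
pull()
# Approximate 95% confidence interval: estimate plus or minus 2 standard errors
c(d_hat - 2 * SE, d_hat + 2 * SE)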
## [1] 0.1664013 0.4169321
We are approximately 95% confident that the true difference in the proportions of the Fransen and Khamis-Roche groups passing the assessment is between 16.64 and 41.69 percentage points. Because the interval does not contain zero, there is compelling evidence of a true difference between the two grouping methods in the proportion passing the technical and tactical assessment.
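The interval is the observed difference plus or minus two bootstrap standard errors. As a rough sketch of what that procedure does, here is the same idea in base R on made-up pass/fail data. The group sizes and pass rates are hypothetical, and the resampling is done within each group, which is a simplification relative to the infer pipeline.

```r
set.seed(217)

# Hypothetical pass (1) / fail (0) outcomes for two groups, not the study data
group_a <- rbinom(120, size = 1, prob = 0.60)
group_b <- rbinom(120, size = 1, prob = 0.30)

# Observed difference in pass proportions
d_hat_toy <- mean(group_a) - mean(group_b)

# Resample each group with replacement and recompute the difference 1000 times
boot_diffs <- replicate(1000, {
  mean(sample(group_a, replace = TRUE)) - mean(sample(group_b, replace = TRUE))
})

# Bootstrap standard error and the "estimate +/- 2 SE" interval
se_toy <- sd(boot_diffs)
ci_toy <- c(d_hat_toy - 2 * se_toy, d_hat_toy + 2 * se_toy)
print(ci_toy)
```

Because the resamples are random, the endpoints change slightly from run to run unless a seed is set, which is worth keeping in mind when quoting interval endpoints in the text.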
obs <- good |>
select(method.of.grouping.teams, high_rating) |>
table()
obs |>
tidy() |>
uncount(n)
## # A tibble: 240 × 2
## method.of.grouping.teams high_rating
## <chr> <chr>
## 1 Fransen fail
## 2 Fransen fail
## 3 Fransen fail
## 4 Fransen fail
## 5 Fransen fail
## 6 Fransen fail
## 7 Fransen fail
## 8 Fransen fail
## 9 Fransen fail
## 10 Fransen fail
## # ℹ 230 more rows
chisq.test(good$method.of.grouping.teams, good$high_rating)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: good$method.of.grouping.teams and good$high_rating
## X-squared = 19.279, df = 1, p-value = 1.13e-05
The bootstrap confidence interval above already pointed to a statistically significant difference, and the chi-square test corroborates it. The p-value (1.13e-05) is less than 0.05, which is compelling evidence of an association between the grouping method (Fransen or Khamis-Roche) and passing the assessment.
To recap, I have fitted a linear model (soon to be a multiple regression, given more time), built a bootstrap confidence interval, and run a chi-square test.
The results indicate that body mass is a statistically significant predictor of stature, and that the bio-banding method is associated with whether academy players pass the technical and tactical assessment when only high ratings count as a pass.
One more test would probably strengthen these findings, as would adding more predictor variables to the linear model. I also wonder where the ongoing research on bio-banding is heading, if the practice is still in use.
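For intuition about what the chi-square statistic measures, it can be rebuilt by hand from the expected counts under independence. The 2 x 2 table below is hypothetical (it is not the study's table), and the sketch switches off the Yates correction so that the manual and built-in values agree.

```r
# Hypothetical 2 x 2 table of counts (not the study data)
obs_toy <- matrix(
  c(50, 70,
    85, 35),
  nrow = 2, byrow = TRUE,
  dimnames = list(method = c("Fransen", "Khamis-Roche"),
                  rating = c("fail", "pass"))
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs_toy), colSums(obs_toy)) / sum(obs_toy)

# Pearson chi-square statistic computed by hand (no continuity correction)
chi_sq_manual <- sum((obs_toy - expected)^2 / expected)

# Same statistic from chisq.test() with the Yates correction switched off
chi_sq_builtin <- chisq.test(obs_toy, correct = FALSE)$statistic

print(chi_sq_manual)
print(unname(chi_sq_builtin))
```

The test reported above keeps R's default Yates continuity correction for 2 x 2 tables, so its statistic is somewhat smaller than the uncorrected Pearson value would be for the same counts.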
Include the bibliography, which we already did earlier in the month
I’ll try to get started with the slides, but I’m not sure if I can get it done in time with other projects going on at the same time