Project 1

library (tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(ggfortify)

Introduction

These are datasets of about ninety-three thousand football (soccer) players and their careers, displaying multiple decades of professional football. The data includes player profiles such as names, clubs, and dates of birth, but most importantly, it shows each player’s injury history. It shows the type of injury, the dates the injury occurred and the number of days with games missed out of the team. Together, these records give people an understanding into the physical toll of professional football, making it possible to uncover patterns in how, when, and how severely players get hurt across different eras of the sport. This project will explore which injuries are most common, which are most severe, and which positions carry an injury burden.

Source: Transfermarkt (www.transfermarkt.com)

# Load in the two datasets
player_profiles <- read_csv ('/Users/eabban/College Stuff/R Studio/archive (1)/player_profiles/player_profiles.csv')
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 92671 Columns: 34
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (19): player_slug, player_name, player_image_url, name_in_home_country,...
dbl   (5): player_id, height, current_club_id, player_agent_id, on_loan_from...
lgl   (5): is_eu, third_club_url, third_club_name, fourth_club_url, fourth_c...
date  (5): date_of_birth, joined, contract_expires, date_of_last_contract_ex...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
player_injuries <- read_csv('/Users/eabban/College Stuff/R Studio/archive (1)/player_injuries/player_injuries.csv')
Rows: 143195 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): season_name, injury_reason
dbl  (3): player_id, days_missed, games_missed
date (2): from_date, end_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Joining

This chunk loads joins the two datasets to make them into one single data called “injured_players” using player_id as the shared key between both datasets

injured_players <- left_join(player_injuries, player_profiles, by ="player_id")

Removing all N/A and not needed columns

Cleaning all the unnecessary columns with all the N/A in the data to make the it less crowded and much less to analyze.

clean_players <- injured_players |>
  
  filter(!is.na(games_missed) & !is.na(days_missed) & !is.na(position) & !is.na(foot) & !is.na(end_date) & !is.na(player_name) & !is.na(height) & height>0) |>
  
  select (- player_slug, - player_image_url, -player_agent_id, -date_of_death, -fourth_club_name, -fourth_club_url, -third_club_name, -third_club_url, -second_club_url, -second_club_name, -contract_expires, -contract_option, -contract_there_expires, -on_loan_from_club_id, -on_loan_from_club_name, -date_of_last_contract_extension, -player_agent_name, -social_media_url, -outfitter, -name_in_home_country)

Categorize by injury

In this chunk, I grouped the injuries by how many times they appear and summarised them into three group : number of player, average days, and average games missed.

by_injury <- clean_players |>
  group_by(injury_reason) |>
  summarise(num_players = n_distinct(player_id), avg_days = mean(days_missed), avg_games_missed = mean(games_missed))

Selecting the top 30 and top days

This chunk creates two filtered subsets: the top 30 injuries by number of players affected, and the top 30 injuries by average days missed.

## Sort by the injury players had the most
top_30 <- by_injury |>
  arrange(desc(num_players))|>
  head(30)

## The most dangerous/injuries that took longer to come back from
top_days <- by_injury |>
  arrange(desc(avg_days)) |>
  head(30)

Top 30 graph without unknown injury since it doesn’t tell us anything specific

top_30a <- top_30 |>
  filter(!injury_reason == "unknown injury")

What injury happens the most?

ggplot(top_30a, aes(x = reorder(injury_reason, num_players), y = num_players)) +
  geom_col(fill = "#85a18b") +
  coord_flip() +
  labs(
    title = "Top 30 Injuries Among Football Players",
    x = "Injury Type",
    y = "Number of Players",
    caption = "Source: Transfermarkt"
  ) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 8))

Most serious injuries

This graph shows the most serious injuries/the injuries the took the most to recover from whiles also pointing out that not all these injuries stem from playing football.

top_days |>
  ggplot(aes(x = reorder(injury_reason, avg_days), y = avg_days)) +
  geom_segment(aes(xend = injury_reason, yend = 0), color = "#a0b23d") +
  geom_point(size = 4, color = "#2b2f60") +
  coord_flip() +
  labs(
    title = "Most Severe Injuries by Average Days Missed",
    subtitle = "Not all injuries shown are football related",
    x = "Injury Type",
    y = "Average Days Missed",
    caption = "Source: Transfermarkt"
  ) +
  theme_minimal()

Correlation

Predicting days_missed using games_missed and height

cor(clean_players$games_missed, clean_players$days_missed)
[1] 0.848437
fit1 <- lm(days_missed ~ games_missed + height, data = clean_players)
summary(fit1)

Call:
lm(formula = days_missed ~ games_missed + height, data = clean_players)

Residuals:
   Min     1Q Median     3Q    Max 
-427.6  -10.0   -5.0    1.8 3360.4 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.15281    2.92920   1.076    0.282    
games_missed  6.66967    0.01145 582.266   <2e-16 ***
height        0.00790    0.01606   0.492    0.623    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 38.22 on 131951 degrees of freedom
Multiple R-squared:  0.7198,    Adjusted R-squared:  0.7198 
F-statistic: 1.695e+05 on 2 and 131951 DF,  p-value: < 2.2e-16

Remove height because it has the biggest p-value

The p-value on games_missed is really small, confirming it is the predictor of days missed. The Adjusted R-squared of 0.7198 meaning that 71% of the variation in days missed is explained by games missed alone.

fit2 <- lm(days_missed ~ games_missed, data = clean_players)
summary(fit2)

Call:
lm(formula = days_missed ~ games_missed, data = clean_players)

Residuals:
   Min     1Q Median     3Q    Max 
-427.5   -9.9   -5.0    1.7 3360.4 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.59219    0.12875   35.67   <2e-16 ***
games_missed  6.66964    0.01145  582.28   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 38.22 on 131952 degrees of freedom
Multiple R-squared:  0.7198,    Adjusted R-squared:  0.7198 
F-statistic: 3.39e+05 on 1 and 131952 DF,  p-value: < 2.2e-16

Equation for your model

days_missed = (6.669 × games_missed) + 4.592

This means that for every additional game a player misses, they are predicted to miss around 6.669 more days.

Checking the assumption models

autoplot(fit2, 1:4, nrow=2, ncol=2)
Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.

Bubble Graph

ggplot(clean_players, aes(x = games_missed, y = days_missed, size = height)) +
  geom_point(alpha = 0.5, color = "#aa70bf") +
  labs(
    title = "Days Missed vs Games Missed",
    caption = "Source: Transfermarkt",
    x = "Games Missed",
    y = "Days Missed",
    size = "Height (cm)"
  ) +
  theme_minimal()

Categorizing by the position of the players

This chunk groups the players by their position and the total number of games missed to see if there is correlation between their position on the field and the injuries they sustain.

position_injury <- clean_players |>
  group_by(main_position, injury_reason) |>
  summarise(num_players = n_distinct(player_id),avg_days = mean(days_missed),total_games_lost = sum(games_missed))
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by main_position and injury_reason.
ℹ Output is grouped by main_position.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(main_position, injury_reason))` for per-operation
  grouping (`?dplyr::dplyr_by`) instead.

Final Plot

p <- ggplot(position_injury, aes(
    x     = avg_days,
    y     = num_players,
    size  = total_games_lost,
    color = main_position,
    text  = paste(
      "Injury:", injury_reason,
      "<br>Position:", main_position,
      "<br>Players Affected:", num_players,
      "<br>Avg Days Missed:", round(avg_days, 1),
      "<br>Total Games Lost:", total_games_lost
    )
  )) +
  geom_point(alpha = 0.7) +
  scale_color_manual(values = c(
    "Goalkeeper" = "#e74c3c",
    "Defender"   = "#2980b9",
    "Midfield" = "#27ae60",
    "Attack"    = "#acfa70"
  )) +
  scale_size_continuous(range = c(2, 15)) +
  labs(
    title   = "The Full Cost of Football Injuries by Position",
    x       = "Average Days Missed per Injury",
    y       = "Number of Players Affected",
    color   = "Position",
    size    = "Total Games Lost",
    caption = "Source: Transfermarkt", 
    subtitle = "Bubble size represents total games lost across all players"
  ) +
  theme_minimal()

ggplotly(p, tooltip = "text")

Reflection

I began this project with two different files, player_profiles and player_injuries, and I joined them into one single dataset called injured_players using a left join on a variable called player_id that both datasets had. This made every player name and all the information about them match up across both datasets. I had to do a lot of cleaning since most of the columns were not needed or were not helpful in what I was going for. I used filter() to clean up NA’s in the columns and select() to remove the columns I didn’t need. To make the dataset look presentable, I grouped them by injuries using the group_by() function and summarized with summarise() to calculate the number of players, average days missed, and average games missed per injury type. I used these two techniques summarise() and group_by() multiple times throughout the project to summarize and group different variables.

The bar chart displays the top 30 most frequent injuries among professional football (soccer) players, ranked by the number of players affected. The most obvious pattern I noticed is that the injuries most players got were muscle-related ones. What surprises me the most, however, is the number of players that were affected by injuries not directly related to football. The lollipop graph shows the most severe injuries by average days missed, the bubble graph displays games missed vs days missed with the size of the bubble being the height of the players and the final plot is an interactive plot that displays the cost of football injuries by position, average days missed per Injury, number of players affected, their positions, and the total games lost.

With this project, there were many things I wished could have added or that I tried but couldn’t get to work. One major idea would have been a time-lapse or video-type visualization to show which injuries became more dominant as the years went on. One thing I tried to add was a heatmap of all the injuries with their size showing a specific injury but some of the text was not showing in the map and the sizes wasn’t matching.