The dataset I chose contains information about Bundesliga soccer players, including variables such as their ages, nationalities, positions, and clubs. As a soccer player myself, I thought it would be interesting to dig deeper and learn more about the players I watch almost every day.
# VISUALIZATION 1
getwd()
## [1] "C:/Users/deano/Documents/RStudioFiles"
setwd("C:/Users/deano/Documents/RStudioFiles")
if(!file.exists("R_datafiles")) dir.create("R_datafiles")
df <- read.csv("C:/Users/deano/Documents/RStudioFiles/R_datafiles/bundesliga_player.csv")
options(repos = c(CRAN = "https://cloud.r-project.org"))
# Install ggplot2 only if it is not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)
df <- read.csv("C:/Users/deano/Documents/RStudioFiles/R_datafiles/bundesliga_player.csv")
df$age <- as.numeric(df$age)
german_players <- df[df$nationality == "Germany" & !is.na(df$age), ]
if (nrow(german_players) < 10) {
  warning("Fewer than 10 German players found in the dataset.")
} else {
  # Sort by age (oldest first) and keep the first ten rows
  oldest_germans <- german_players[order(german_players$age, decreasing = TRUE), ][1:10, ]
  # Horizontal bar chart; reorder() sorts the bars by age
  ggplot(oldest_germans, aes(x = reorder(name, age), y = age)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    labs(title = "Top 10 Oldest German Players in Bundesliga",
         x = "Player Name",
         y = "Age") +
    theme_minimal()
}
My first visualization shows the ten oldest German players currently active in the Bundesliga, the top tier of German soccer. I filtered the dataset to players whose nationality is Germany and whose age is not missing, sorted them by age in descending order, and kept the first ten for the bar chart.
The bar chart shows the ages of these chosen players. Because the chart is flipped with coord_flip(), the player names appear on the vertical axis and their ages on the horizontal axis.
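The filter, sort, and select steps described above can be sketched on a small made-up data frame. The player names and ages below are invented for illustration, the helper name `oldest_by_nationality` is hypothetical, and only the column names `name`, `nationality`, and `age` come from the code above.

```r
# Toy data: names and ages are invented for illustration only.
players <- data.frame(
  name        = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  nationality = c("Germany", "France", "Germany", "Germany", "Germany"),
  age         = c(34, 36, NA, 29, 31),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep one nationality, drop missing ages,
# sort oldest first, and return at most n rows.
oldest_by_nationality <- function(data, country, n = 10) {
  keep <- !is.na(data$nationality) & data$nationality == country & !is.na(data$age)
  kept_rows <- data[keep, ]
  kept_rows <- kept_rows[order(kept_rows$age, decreasing = TRUE), ]
  head(kept_rows, n)  # head() returns fewer rows if fewer are available
}

oldest_germans <- oldest_by_nationality(players, "Germany", n = 2)
print(oldest_germans$name)
```

Using `head()` instead of `[1:n, ]` is a design choice in this sketch: indexing past the last row pads the result with `NA` rows, while `head()` simply returns what is there.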
# VISUALIZATION 2
library(ggplot2)
df <- read.csv("C:/Users/deano/Documents/RStudioFiles/R_datafiles/bundesliga_player.csv")
df$height <- as.numeric(df$height)
tallest_players <- df[order(df$height, decreasing = TRUE), ][1:10, ]
ggplot(tallest_players, aes(x = reorder(name, height), y = height)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Tallest Bundesliga Players",
       x = "Player Name",
       y = "Height (m)") +
  theme_minimal()
For the second visualization, I showcased the ten tallest players in the league. Height gives players a physical advantage in soccer, especially in aerial duels, defending set pieces, and scoring headers, so I was curious to see who the tallest players were. The chart makes it easy to see who stands out physically, and it is a starting point for asking how height relates to a player's role within his team.
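One way to follow up on the link between height and role is to average height within each position. The sketch below is not part of the original analysis: the heights and position labels are invented, and the real dataset's position names may differ.

```r
# Toy data: heights (in metres) and positions are invented for illustration.
players <- data.frame(
  name     = c("Player A", "Player B", "Player C", "Player D", "Player E", "Player F"),
  position = c("Goalkeeper", "Goalkeeper", "Centre-Back", "Centre-Back", "Winger", "Winger"),
  height   = c(1.93, 1.89, 1.90, 1.88, 1.75, NA),
  stringsAsFactors = FALSE
)

# Average height per position; the formula interface drops rows with NA height.
height_by_position <- aggregate(height ~ position, data = players, FUN = mean)

# Sort so the tallest position group comes first.
height_by_position <- height_by_position[order(height_by_position$height, decreasing = TRUE), ]
print(height_by_position)
```

On the real data, the same two steps would show whether the tallest players cluster in particular roles.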
# VISUALIZATION 3
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
df <- read.csv("C:/Users/deano/Documents/RStudioFiles/R_datafiles/bundesliga_player.csv")
# One point per player: position on the y-axis, club as colour
ggplot(df, aes(x = name, y = position, color = club)) +
  geom_point() +
  theme_minimal() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  labs(x = NULL) +
  ggtitle("Player Position Distribution by Club")
My third visualization shows how player positions are distributed across the different clubs. Each point stands for a player, with the position on the y-axis and the club shown by color. The x-axis originally displayed the player names, but because they overlapped and cluttered the plot, I hid them for better readability. The graph allows a comparison of positional trends across clubs and shows which teams have more players in which positions. It can also be read alongside the previous chart to check which clubs the tallest players belong to and which positions each club stocks most heavily.
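The comparison of positions across clubs can also be read off a simple cross-tabulation. This is a minimal sketch on invented data (the club and position labels are placeholders), using base R's `table()` and `prop.table()`.

```r
# Toy data: clubs and positions are invented for illustration.
players <- data.frame(
  club     = c("Club X", "Club X", "Club X", "Club Y", "Club Y", "Club Y"),
  position = c("Defender", "Defender", "Forward", "Defender", "Forward", "Forward"),
  stringsAsFactors = FALSE
)

# Rows are clubs, columns are positions, cells are player counts.
position_counts <- table(players$club, players$position)
print(position_counts)

# Share of each position within a club (each row sums to 1).
position_shares <- prop.table(position_counts, margin = 1)
print(round(position_shares, 2))
```

A table like this gives exact counts for the same club-by-position comparison that the scatter plot shows visually.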
# VISUALIZATION 4
library(ggplot2)
library(dplyr)
# NOTE: month is a randomly assigned placeholder, not a value from the dataset
set.seed(123)  # fixed seed so the random months are reproducible
df <- df %>%
  mutate(month = sample(1:12, nrow(df), replace = TRUE))
# Count players for each position/month combination
df_heatmap <- df %>%
  group_by(position, month) %>%
  summarise(player_count = n(), .groups = 'drop')
# Label months via factor levels so names stay aligned even if a month is empty
ggplot(df_heatmap, aes(x = factor(month, levels = 1:12, labels = month.abb), y = position, fill = player_count)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  theme_minimal() +
  labs(title = "Player Position by Month",
       x = "Month",
       y = "Position",
       fill = "Player Count")
The fourth visualization is a heatmap of player positions across the 12 months of the year. Note that the month values are not taken from the dataset: the code assigns each player a random month with sample(), so the counts are spread roughly evenly and any apparent pattern is noise. The chart therefore demonstrates the heatmap technique rather than real seasonality. With genuine monthly data (for example, match appearances by month), the same layout could reveal how player roles or team strategies shift over a season.
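Because the months in the code above come from `sample()`, the heatmap changes on every run unless the random seed is fixed. The sketch below illustrates this with a hypothetical player count of 500; the seed value is arbitrary.

```r
n_players <- 500  # hypothetical number of players, for illustration only

# Drawing with the same seed twice gives identical "months".
set.seed(42)
months_first <- sample(1:12, n_players, replace = TRUE)
set.seed(42)
months_second <- sample(1:12, n_players, replace = TRUE)
print(identical(months_first, months_second))

# Tabulate the draws with month names; levels = 1:12 keeps all twelve months
# in the table even if one happens to receive no players.
month_counts <- table(factor(months_first, levels = 1:12, labels = month.abb))
print(month_counts)
```

Calling `set.seed()` once before the `sample()` step makes a randomly generated column reproducible, although it does not make the months any more meaningful.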
# VISUALIZATION 5
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
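library(dplyr)
# Count players per club and position
position_count <- df %>%
  group_by(club, position) %>%
  summarise(player_count = n(), .groups = 'drop')
# Slices are keyed by position only; plotly sums the values of rows that
# share a label, so the chart shows league-wide totals per position
plot_ly(position_count,
        labels = ~position,
        values = ~player_count,
        type = 'pie',
        textinfo = 'label+percent',
        title = 'Player Position Distribution') %>%
  layout(showlegend = TRUE)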
For my final visualization, I used the Plotly library to create an interactive pie chart of player positions. Each slice represents a position, and its size corresponds to the number of players in that position. Although the counts are computed per club, the slices are labeled by position only, so Plotly sums the clubs together and the chart shows league-wide shares rather than a breakdown for each club. Each slice displays the position label and its percentage, giving a clear visual summary of how roles are distributed across the league.
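The percentages printed on the slices can be checked without plotting. This is a minimal base R sketch on invented data: the position labels are placeholders, and `prop.table()` converts the counts into shares.

```r
# Toy data: positions are invented for illustration.
players <- data.frame(
  position = c("Goalkeeper", "Defender", "Defender",
               "Midfielder", "Midfielder", "Midfielder",
               "Forward", "Forward"),
  stringsAsFactors = FALSE
)

# Count players per position, convert to proportions, then to percentages.
position_share <- round(100 * prop.table(table(players$position)), 1)
print(position_share)
```

Running the same lines on `df$position` would give the league-wide percentages to compare against the pie chart labels.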
In this project, I explored a dataset of Bundesliga soccer players with variables such as age, height, position, and club. The visualizations showed the oldest German players, the tallest players, and how positions are distributed across clubs and the league, using bar charts, a scatter plot, a heatmap, and a pie chart. Looking at these patterns gave me insight into how players' roles and physical traits are spread across the teams in the league.