Brief Introduction

The dataset I selected is a collection of roughly 3,900 anonymous football players with their football and personal statistics from 2022. I chose this topic because I am a big football (soccer) fan and have been watching and playing the sport since I was 5 years old. Football, and sports in general, have had a major impact on my life, so finding a football dataset motivated me not only to choose it but also to create the best data visualizations I could. The dataset contains 3,907 observations and 8 variables: 5 categorical and 3 quantitative. The variables provide information about each player’s nationality, age, position, the team they play for, and their annual salary. The author of this dataset is Yash. He gathered the data from a video game called Football Manager 2022 (FM22). To learn more about the game itself, its community, or its additional features, visit this website: https://www.footballmanager.com/

The variables I plan to use in my data visualizations are age, appearances (the number of games played), and position. First, I will filter the dataset for the top 3 football (soccer) leagues in the world. Then, I will look for a correlation between age and appearances using a random sample of 500 players. Next, I will analyze the data to determine which player position has the most appearances; in other words, out of all the players in the top 3 leagues, which position accounts for the highest percentage of appearances. Finally, I will create a data visualization showing the locations of football clubs across Europe.

Load the libraries and set working directory

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.3.2
library(ggplot2)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
setwd("C:/Users/danyd/OneDrive/Desktop/data 110/week11hw")
footballsalary <- read.csv("salarypredictions.csv")

Let's take a look at the top and bottom of the dataset

head(footballsalary)
##         Wage Age      Club            League Nation   Position Apps Caps
## 1 46,427,000  23       PSG Ligue 1 Uber Eats    FRA    Forward  190   57
## 2 42,125,000  30       PSG Ligue 1 Uber Eats    BRA  Midfilder  324  119
## 3 34,821,000  35       PSG Ligue 1 Uber Eats    ARG    Forward  585  162
## 4 19,959,000  31 R. Madrid           La Liga    BEL    Forward  443  120
## 5 19,500,000  31   Man UFC    Premier League    ESP Goalkeeper  480   45
## 6 18,810,000  30 R. Madrid           La Liga    AUT   Defender  371   94
tail(footballsalary)
##       Wage Age      Club        League Nation   Position Apps Caps
## 3902 3,600  18      Vigo       La Liga    ESP   Defender    0    0
## 3903 3,400  19      Vigo       La Liga    ESP   Defender    0    0
## 3904 3,200  18 Famalicao Primiera Liga    BRA Goalkeeper    0    0
## 3905 2,900  18      Vigo       La Liga    ESP    Forward    0    0
## 3906 2,700  18      Vigo       La Liga    ESP   Defender    0    0
## 3907 1,400  18      Vigo       La Liga    ESP   Defender    0    0

Are there any NAs in this dataset?

has_na <- any(is.na(footballsalary))

print(has_na) # If this prints TRUE, there are NAs in the dataset. If it prints FALSE, there are none.
## [1] FALSE
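
The single TRUE/FALSE check above can be broken down per column. The sketch below is only an illustration: `toy` is a hypothetical stand-in for footballsalary with made-up values, and the blank-string check is an added assumption about what else might count as "missing" in a CSV.

```r
# Hypothetical toy data standing in for footballsalary (values are made up)
toy <- data.frame(
  Wage = c("46,427,000", "3,600", ""),
  Age  = c(23, NA, 18),
  Apps = c(190, 0, 0),
  stringsAsFactors = FALSE
)

# Count missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Empty strings are not NA, so count them separately for character columns
blank_per_column <- sapply(toy, function(col) {
  if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L
})
print(blank_per_column)
```

If every count is zero, the result agrees with the FALSE printed by any(is.na(...)).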

Since there are no NAs in the dataset, we will now filter it for the top 3 leagues in the world: the Premier League, Bundesliga, and La Liga, in that order. Then we will draw a random sample of 500 observations to get an unbiased, representative subset of the data.

top3leagues <- footballsalary |>
  filter(League %in% c("Premier League", "Bundesliga", "La Liga"))

# Sample 500 observations from the data set
random_sample <- top3leagues[sample(nrow(top3leagues), 500), ]
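
Because sample() draws different rows on every run, the scatterplot changes each time the report is knitted. One common way to make the draw repeatable is to set a seed first. The sketch below assumes a hypothetical `toy_leagues` data frame with made-up values, and the seed value 110 is arbitrary.

```r
# Hypothetical stand-in for top3leagues (made-up values)
toy_leagues <- data.frame(
  Age  = 18:37,
  Apps = seq(0, 570, by = 30)
)

set.seed(110)  # arbitrary seed, chosen only for illustration
sample_a <- toy_leagues[sample(nrow(toy_leagues), 5), ]

set.seed(110)  # resetting the same seed reproduces the same draw
sample_b <- toy_leagues[sample(nrow(toy_leagues), 5), ]

# Both samples contain the same rows in the same order
print(identical(sample_a, sample_b))
```

In the report itself, the equivalent would be a single set.seed() call placed just before the line that creates random_sample.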

Scatterplot

ggplot(random_sample, aes(x = Age, y = Apps, color = League)) +
  geom_point() +
  labs(title = "Relationship Between Age and Appearances",
       x = "Age",
       y = "Appearances",
       caption = "Reference Source: https://www.footballmanager.com/", 
       color = "League") +
  theme_minimal()

Correlation

cor(footballsalary$Age, footballsalary$Apps) 
## [1] 0.9263375
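
The call above uses the full footballsalary data frame, while the scatterplot shows only the 500-row sample. To have the coefficient describe the plotted points, the same cor() call could be run on the sampled rows instead. The sketch below uses a hypothetical `toy_sample` with made-up values in place of random_sample, and adds cor.test() as one possible way to get a confidence interval.

```r
# Hypothetical toy data (made-up values), standing in for random_sample
toy_sample <- data.frame(
  Age  = c(18, 20, 23, 26, 29, 31, 34),
  Apps = c(0, 25, 110, 190, 300, 370, 480)
)

# Pearson correlation computed on the same rows that are plotted
r_sample <- cor(toy_sample$Age, toy_sample$Apps)

# cor.test() adds a confidence interval and p-value for the estimate
test_result <- cor.test(toy_sample$Age, toy_sample$Apps)

print(r_sample)
print(test_result$conf.int)
```
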

Based on the scatterplot and the correlation coefficient, there is evidence of a strong positive linear relationship between the two quantitative variables, with r = 0.93 (a coefficient above 0.7 is commonly considered strong). Note that the cor() call above uses the full footballsalary dataset rather than the 500-player sample shown in the scatterplot. Now, let’s create an interactive bar graph to determine which position has the most appearances.

Highcharter Bar Graph

highchart() %>%
  hc_chart(type = "bar") %>%
  hc_title(text = "Most Appearances by Position and Age") %>%
  hc_xAxis(categories = 0:41, title = list(text = "Age")) %>%
  hc_yAxis(title = list(text = "Apps")) %>%
  hc_add_series(
    data = top3leagues,
    hcaes(x = Age, y = Apps, group = Position),
    type = "bar"
  ) %>%
  hc_legend(
    title = list(text = "Position"),
    layout = "vertical",  
    align = "right",      
    verticalAlign = "top"  
  ) %>%
  hc_caption(text = "Reference Source: https://www.footballmanager.com/")

The graph may be difficult to read at first, but clicking a position in the legend toggles it on or off, so you can isolate the position you want to see. Doing so shows that the positions are fairly close to one another, meaning the dataset contains roughly even proportions of each position. However, the forward position has the most appearances in the top 3 leagues.
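
The comparison between positions can also be checked numerically by summing appearances per position and converting the totals to percentages. This is a minimal base R sketch, not the method used for the chart above; `toy_leagues` is a hypothetical stand-in for top3leagues with made-up rows, and "Midfilder" follows the spelling used in the data.

```r
# Hypothetical toy data (made-up rows), standing in for top3leagues
toy_leagues <- data.frame(
  Position = c("Forward", "Forward", "Midfilder", "Defender", "Goalkeeper", "Defender"),
  Apps     = c(190, 443, 324, 371, 480, 0),
  stringsAsFactors = FALSE
)

# Total appearances per position
apps_by_position <- aggregate(Apps ~ Position, data = toy_leagues, FUN = sum)

# Share of all appearances held by each position, in percent
apps_by_position$Share <- round(100 * apps_by_position$Apps / sum(apps_by_position$Apps), 1)

# Sort so the position with the most appearances comes first
apps_by_position <- apps_by_position[order(-apps_by_position$Apps), ]
print(apps_by_position)
```

Run on the real top3leagues data, the first row of the sorted table would name the position with the highest share of appearances.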

Now, using a new CSV file, I will plot the locations of football clubs in Europe along with the city, country, and name of their stadiums. The author of this dataset is the user jokecamp, who gathered the information from the following website: https://football-data.co.uk/

Read new csv file

GIScoordinates <- read.csv("stadiums.csv")

World Map

footballclubs <- leaflet(GIScoordinates) |>
  addTiles() 

footballclubs <- footballclubs |>
  addMarkers(data = GIScoordinates, lat = ~Latitude, lng = ~Longitude, 
             popup = ~paste("Team: ", Team, "<br>",
                            "Stadium: ", Stadium, "<br>",
                             "City: ", City, "<br>",
                            "Country: ", Country, "<br>", 
                             "Latitude: ", Latitude, "<br>",
                             "Longitude: ", Longitude, "<br>"))

footballclubs

Brief Essay

This dataset covers anonymous players and their wages, nationality, football team, and appearances. The data comes from a video game called Football Manager 2022. A new version of the game is released every year with updated statistics for each player, and it usually comes out the year before the one in its title. For example, Football Manager 2024 has already been released even though it is still 2023. I determined the top 3 football leagues in the world using this website: https://www.globalfootballrankings.com/ The website offers global rankings of football clubs and is updated daily for accuracy. It shows the average rating of each league as well as the current top and bottom teams in the league. The teams listed in those categories also appear in the dataset.

The visualizations I created show the relationship between age and appearances, the most appearances by position, and a world map plotting the location of each team. To be honest, I originally wanted to create visualizations showing which club pays the most in wages. However, the variable ‘Wage’ was read in as a character variable, which affected how the plot looked. I was also unable to calculate a relationship between age and wage because the column was not numeric. I tried to convert it to a quantitative variable using as.numeric(as.character()), but that only turned the whole column into NAs. In addition, I wanted to add latitude and longitude for the clubs in the original dataset so I could plot them on a world map, but doing that required an API key and other tools because the dataset was large.

It was irritating to have to switch to a different analysis of the data, but I do like the world map I created. It looks similar to Google Maps, which was the look I was going for. I experienced several challenges in this project, but overall it was a good learning experience.
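
A likely cause of the NA column, assuming the Wage values are stored as strings with comma thousands separators as in the head() and tail() output, is that as.numeric() cannot parse the commas. The sketch below shows the failure and one possible fix that strips the commas first; the variable names are hypothetical.

```r
# Wage strings copied from the head() and tail() output earlier in this report
wage_raw <- c("46,427,000", "19,500,000", "3,600", "1,400")

# as.numeric() cannot parse the thousands separators, so every value becomes NA
wage_failed <- suppressWarnings(as.numeric(wage_raw))

# Removing the commas first yields proper numbers
wage_numeric <- as.numeric(gsub(",", "", wage_raw, fixed = TRUE))

print(wage_failed)
print(wage_numeric)
```

Applied to the real data, the same gsub() step on footballsalary$Wage should give a numeric column that can be used with cor() or plotted by club. readr::parse_number() is an alternative that handles the separators directly.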