The dataset I selected is a collection of roughly 3,900 anonymous football players with their football and personal statistics from 2022. I chose this topic because I am a big football (soccer) fan and have been watching and playing the sport since I was 5 years old. Football and sports in general have had a major impact on my life, so finding a dataset about football motivated me not only to choose it but also to create the best data visualizations I could. The dataset contains 3,907 observations and 8 variables: 5 categorical and 3 quantitative. The variables provide information about each player's nationality, age, position, team, and annual salary. The author of this dataset is Yash, who gathered the data from a video game called Football Manager 2022 (FM22). For more information about the game itself, its community, or additional features, visit this website: https://www.footballmanager.com/
The variables I plan to use in my data visualizations are age, appearances (the number of games played), and position. First, I will filter the dataset to the top 3 football (soccer) leagues in the world. Then, I will look for a correlation between age and appearances using a random sample of 500 players. Next, I will analyze the data to determine which player position has the most appearances. In other words, out of all the players in the top 3 leagues, which position accounts for the highest percentage of appearances? Finally, I will create a data visualization showing the locations of football clubs in that European region.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.3.2
library(ggplot2)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
setwd("C:/Users/danyd/OneDrive/Desktop/data 110/week11hw")
footballsalary <- read.csv("salarypredictions.csv")
head(footballsalary)
## Wage Age Club League Nation Position Apps Caps
## 1 46,427,000 23 PSG Ligue 1 Uber Eats FRA Forward 190 57
## 2 42,125,000 30 PSG Ligue 1 Uber Eats BRA Midfilder 324 119
## 3 34,821,000 35 PSG Ligue 1 Uber Eats ARG Forward 585 162
## 4 19,959,000 31 R. Madrid La Liga BEL Forward 443 120
## 5 19,500,000 31 Man UFC Premier League ESP Goalkeeper 480 45
## 6 18,810,000 30 R. Madrid La Liga AUT Defender 371 94
tail(footballsalary)
## Wage Age Club League Nation Position Apps Caps
## 3902 3,600 18 Vigo La Liga ESP Defender 0 0
## 3903 3,400 19 Vigo La Liga ESP Defender 0 0
## 3904 3,200 18 Famalicao Primiera Liga BRA Goalkeeper 0 0
## 3905 2,900 18 Vigo La Liga ESP Forward 0 0
## 3906 2,700 18 Vigo La Liga ESP Defender 0 0
## 3907 1,400 18 Vigo La Liga ESP Defender 0 0
has_na <- any(is.na(footballsalary))
print(has_na) # If this prints TRUE, there are NAs in this dataset. If it prints FALSE, there are none.
## [1] FALSE
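If a check like the one above ever returns TRUE, a per-column count shows where the missing values are. The sketch below uses a small made-up data frame (`toy_na` is hypothetical and is not the real dataset) just to illustrate the idea.

```r
# Hypothetical toy data; the values are made up for illustration only
toy_na <- data.frame(
  Age  = c(23, NA, 35),
  Apps = c(190, 324, NA),
  Club = c("A", "B", "C")
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy_na))
print(na_counts)
```

The same `colSums(is.na(...))` call could be applied to `footballsalary` directly.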
top3leagues <- footballsalary |>
filter(League %in% c("Premier League", "Bundesliga", "La Liga"))
# Sample 500 observations from the data set
random_sample <- top3leagues[sample(nrow(top3leagues), 500), ]
ggplot(random_sample, aes(x = Age, y = Apps, color = League)) +
geom_point() +
labs(title = "Relationship Between Age and Appearances",
x = "Age",
y = "Appearances",
caption = "Reference Source: https://www.footballmanager.com/",
color = "League") +
theme_minimal()
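# Note: this correlation uses the full dataset (all leagues), not the 500-player sample plotted above
cor(footballsalary$Age, footballsalary$Apps)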
## [1] 0.9263375
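The correlation above is computed on every player in `footballsalary`, while the scatter plot uses a random sample from the top 3 leagues. Below is a minimal sketch of computing the correlation on a reproducible sample instead. The seed value (110) is an arbitrary choice, and `toy_leagues` is a generated stand-in for `top3leagues`, so its numbers are not real results.

```r
# Hypothetical stand-in for top3leagues; the values are generated, not real
set.seed(110)  # arbitrary seed so the same rows are drawn on every run
toy_leagues <- data.frame(Age = sample(17:38, 200, replace = TRUE))
toy_leagues$Apps <- pmax(0, (toy_leagues$Age - 17) * 20 + round(rnorm(200, sd = 30)))

# Draw the sample once, then reuse it for both the plot and the correlation
toy_sample <- toy_leagues[sample(nrow(toy_leagues), 50), ]
sample_cor <- cor(toy_sample$Age, toy_sample$Apps)
print(sample_cor)
```

With the real data, the same pattern would be `set.seed()` before creating `random_sample`, followed by `cor(random_sample$Age, random_sample$Apps)`.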
highchart() %>%
hc_chart(type = "bar") %>%
hc_title(text = "Most Appearances by Position and Age") %>%
hc_xAxis(categories = 0:41, title = list(text = "Age")) %>%
hc_yAxis(title = list(text = "Apps")) %>%
hc_add_series(
data = top3leagues,
hcaes(x = Age, y = Apps, group = Position),
type = "bar"
) %>%
hc_legend(
title = list(text = "Position"),
layout = "vertical",
align = "right",
verticalAlign = "top"
) %>%
hc_caption(text = "Reference Source: https://www.footballmanager.com/")
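To answer the question of which position accounts for the highest percentage of appearances, the totals can also be computed directly. This is a minimal base R sketch on made-up data (`toy_positions` and its values are hypothetical), not the calculation used for the chart above.

```r
# Hypothetical toy data; positions and appearance counts are made up
toy_positions <- data.frame(
  Position = c("Forward", "Defender", "Defender", "Goalkeeper", "Midfielder"),
  Apps     = c(100, 150, 50, 40, 60)
)

# Total appearances per position
apps_by_position <- tapply(toy_positions$Apps, toy_positions$Position, sum)

# Convert the totals to percentages of all appearances, largest first
apps_share <- sort(100 * prop.table(apps_by_position), decreasing = TRUE)
print(round(apps_share, 1))
```

Replacing `toy_positions` with `top3leagues` would give the share of appearances for each position in the filtered data.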
GIScoordinates <- read.csv("stadiums.csv")
footballclubs <- leaflet(GIScoordinates) |>
addTiles()
footballclubs <- footballclubs |>
addMarkers(data = GIScoordinates, lat = ~Latitude, lng = ~Longitude,
popup = ~paste("Team: ", Team, "<br>",
"Stadium: ", Stadium, "<br>",
"City: ", City, "<br>",
"Country: ", Country, "<br>",
"Latitude: ", Latitude, "<br>",
"Longitude: ", Longitude, "<br>"))
footballclubs
This dataset covers anonymous players and their wages, nationality, football team, and appearances. The data comes from a video game called Football Manager 2022. A new version of the game is released every year with updated statistics for each player, and each edition usually comes out the year before the year in its title. For example, Football Manager 2024 has already been released even though it is still 2023.
I determined the top 3 football leagues in the world using this website: https://www.globalfootballrankings.com/ The website offers global rankings of football clubs and is updated daily for accuracy. It shows the average rating of each league as well as the league's current top and bottom teams. The teams listed in those categories were also found within the dataset. The visualizations I created show the relationship between age and appearances, the most appearances by position, and a world map plotting the location of each team.
To be honest, given the information in the dataset, I wanted to create data visualizations showing which club pays the most in wages. However, the variable ‘Wage’ was read in as a character variable, which affected how the plot looked, and I was unable to calculate a relationship between age and wage because the column was not numeric. I attempted to convert it to a quantitative variable using as.numeric(as.character()), but that only turned the whole column into NAs. In addition, I wanted to add latitude and longitude for the clubs in the original dataset so I could plot them on a world map, but doing that required an API key and other tools because the dataset was large. It was irritating to have to switch to a different analysis of the data, but I do like the world map I created. It looks similar to Google Maps, which was the look I was going for.
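On the Wage conversion problem described above: the `head()` output shows wages written with commas as thousands separators (for example 46,427,000), which is a likely reason `as.numeric()` returned NA. This has not been checked against the full file, so treat it as an assumption. A minimal sketch using a few example strings in that format:

```r
# Example strings in the same format as the Wage column shown by head()
wage_text <- c("46,427,000", "3,600", "1,400")

# Converting directly fails because of the commas (NA, normally with a warning)
direct <- suppressWarnings(as.numeric(wage_text))

# Remove the commas first, then convert
wage_numeric <- as.numeric(gsub(",", "", wage_text, fixed = TRUE))
print(wage_numeric)
```

Applied to the real data, this would look like `footballsalary$Wage <- as.numeric(gsub(",", "", footballsalary$Wage, fixed = TRUE))`. The `parse_number()` function from readr, which is loaded with tidyverse, should handle the same format.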
I experienced several challenges in this project, but overall it was a good learning experience.