library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
setwd("~/Documents/EC/Spring 2026/DATA 101/Project 1")
fifa_21 <- read.csv("FIFA-21 Complete.csv", sep = ";")
str(fifa_21)
## 'data.frame': 17981 obs. of 9 variables:
## $ player_id : int 158023 20801 190871 203376 200389 192985 188545 183277 212831 209331 ...
## $ name : chr "Lionel Messi" "Cristiano Ronaldo" "Neymar Jr" "Virgil van Dijk" ...
## $ nationality: chr "Argentina" "Portugal" "Brazil" "Netherlands" ...
## $ position : chr "ST|CF|RW" "ST|LW" "CAM|LW" "CB" ...
## $ overall : int 94 93 92 91 91 91 91 91 90 90 ...
## $ age : int 33 35 28 29 27 29 31 29 27 28 ...
## $ hits : int 299 276 186 127 47 119 89 66 53 94 ...
## $ potential : int 94 93 92 92 93 91 91 91 91 90 ...
## $ team : chr "FC Barcelona " "Juventus " "Paris Saint-Germain " "Liverpool " ...
head(fifa_21, n=6)
## player_id name nationality position overall age hits potential
## 1 158023 Lionel Messi Argentina ST|CF|RW 94 33 299 94
## 2 20801 Cristiano Ronaldo Portugal ST|LW 93 35 276 93
## 3 190871 Neymar Jr Brazil CAM|LW 92 28 186 92
## 4 203376 Virgil van Dijk Netherlands CB 91 29 127 92
## 5 200389 Jan Oblak Slovenia GK 91 27 47 93
## 6 192985 Kevin De Bruyne Belgium CM|CAM 91 29 119 91
## team
## 1 FC Barcelona
## 2 Juventus
## 3 Paris Saint-Germain
## 4 Liverpool
## 5 Atlético Madrid
## 6 Manchester City
Does a soccer player’s age affect their current and potential overall ratings? The data set I selected contains statistics for many soccer players in the video game FIFA 21. The statistics include each player’s in-game overall rating, current age, nationality, playing position, and potential rating, which reflects their room for growth in the game. I will answer the opening question over the course of this project using various coding techniques, focusing on the variables overall, age, and potential. I found the data set through the GitHub link on Blackboard, which pointed to Kaggle; the Kaggle page states that this data set was taken from fifaindex.com.
To find out whether a player’s age affects their current and potential overall ratings, I will build tables showing the mean overall rating and the mean potential rating at each age. I will then plot each table as a scatter plot to visualize the points. First, I will clean the data set and select the main variables for this project: age, overall, and potential. (I also kept the name column, since it is interesting to see how popular players are rated.)
names(fifa_21) <- gsub("[(). \\-]", "_", names(fifa_21))
names(fifa_21) <- gsub("_$", "", names(fifa_21))
names(fifa_21) <- tolower(names(fifa_21))
head(fifa_21, n=6)
## player_id name nationality position overall age hits potential
## 1 158023 Lionel Messi Argentina ST|CF|RW 94 33 299 94
## 2 20801 Cristiano Ronaldo Portugal ST|LW 93 35 276 93
## 3 190871 Neymar Jr Brazil CAM|LW 92 28 186 92
## 4 203376 Virgil van Dijk Netherlands CB 91 29 127 92
## 5 200389 Jan Oblak Slovenia GK 91 27 47 93
## 6 192985 Kevin De Bruyne Belgium CM|CAM 91 29 119 91
## team
## 1 FC Barcelona
## 2 Juventus
## 3 Paris Saint-Germain
## 4 Liverpool
## 5 Atlético Madrid
## 6 Manchester City
fifa_21_ratings <- fifa_21 |>
  select(name, age, overall, potential)
head(fifa_21_ratings, n=6)
## name age overall potential
## 1 Lionel Messi 33 94 94
## 2 Cristiano Ronaldo 35 93 93
## 3 Neymar Jr 28 92 92
## 4 Virgil van Dijk 29 91 92
## 5 Jan Oblak 27 91 93
## 6 Kevin De Bruyne 29 91 91
overall_rating_mean <- fifa_21_ratings |>
  group_by(age) |>
  summarize(mean_overall = mean(overall, na.rm = TRUE))
overall_rating_mean
## # A tibble: 27 × 2
## age mean_overall
## <int> <dbl>
## 1 17 60.7
## 2 18 60.9
## 3 19 61.4
## 4 20 63.5
## 5 21 63.6
## 6 22 65.2
## 7 23 66.1
## 8 24 67.2
## 9 25 67.5
## 10 26 68.1
## # ℹ 17 more rows
overall_rating_potential <- fifa_21_ratings |>
  group_by(age) |>
  summarize(mean_potential = mean(potential, na.rm = TRUE))
overall_rating_potential
## # A tibble: 27 × 2
## age mean_potential
## <int> <dbl>
## 1 17 78.5
## 2 18 78.0
## 3 19 76.5
## 4 20 75.6
## 5 21 74.6
## 6 22 74.5
## 7 23 73.8
## 8 24 72.8
## 9 25 72.0
## 10 26 71.0
## # ℹ 17 more rows
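The two summaries above could also be computed in a single grouped step, which keeps both means side by side for each age. The following is only a minimal sketch: `toy_ratings` is a hypothetical stand-in for `fifa_21_ratings`, and its values are made up for illustration rather than taken from the data set.

```r
library(dplyr)

# Hypothetical toy data standing in for fifa_21_ratings (values are made up)
toy_ratings <- data.frame(
  age       = c(17, 17, 25, 25, 33, 33),
  overall   = c(58, 62, 66, 70, 74, 78),
  potential = c(78, 80, 72, 74, 74, 78)
)

# One grouped summary that returns both means for each age
rating_means <- toy_ratings |>
  group_by(age) |>
  summarize(
    mean_overall   = mean(overall, na.rm = TRUE),
    mean_potential = mean(potential, na.rm = TRUE)
  )

rating_means
```

Replacing `toy_ratings` with `fifa_21_ratings` should give one table with the same per-age means as the two separate tables.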
scatterplot_overall <- ggplot(overall_rating_mean, aes(x = age, y = mean_overall)) +
  labs(title = "Correlation between Age and Overall Ratings in FIFA 21",
       caption = "Source: fifaindex.com",
       x = "Age of Players in FIFA 21",
       y = "Overall Rating of Players in FIFA 21") +
  theme_minimal(base_size = 10)
scatterplot_overall + geom_point()
scatterplot_potential <- ggplot(overall_rating_potential, aes(x = age, y = mean_potential)) +
  labs(title = "Correlation between Age and Potential Ratings in FIFA 21",
       caption = "Source: fifaindex.com",
       x = "Age of Players in FIFA 21",
       y = "Potential Rating of Players in FIFA 21") +
  theme_minimal(base_size = 10)
scatterplot_potential + geom_point()
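To compare the two trends directly, both sets of means could be drawn on one scatter plot, with color distinguishing overall from potential. This is a rough sketch rather than part of the original analysis: `toy_means` holds made-up per-age means, and the names `rating_type` and `mean_rating` are hypothetical column names chosen for this example.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical per-age means (values are made up for illustration)
toy_means <- data.frame(
  age            = c(17, 25, 33),
  mean_overall   = c(60, 68, 76),
  mean_potential = c(79, 73, 76)
)

# Reshape to long format: one row per age and rating type
toy_long <- toy_means |>
  pivot_longer(
    cols      = c(mean_overall, mean_potential),
    names_to  = "rating_type",
    values_to = "mean_rating"
  )

# One scatter plot with both series, separated by color
combined_plot <- ggplot(toy_long, aes(x = age, y = mean_rating, color = rating_type)) +
  geom_point() +
  labs(x = "Age", y = "Mean rating", color = "Rating type") +
  theme_minimal(base_size = 10)

combined_plot
```

With the real data, the two summary tables would first need to be joined by age (for example with `left_join()`) before reshaping.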
Looking at my findings, the youngest players have the lowest mean overall rating of any age but the highest mean potential rating. The mean overall rating peaks in the early to mid thirties, which reflects the prime of many players’ careers. It is low for younger players, who are still developing, and it starts to decrease as players get older. For potential, the younger a player is, the higher the potential rating tends to be, since there is still so much room to develop with many years left in a career; this explains why the mean potential declines as age increases. There is an outlier at age 42, where the point has an extremely high value in each graph. This may mean that there were very few players of that age in the data set, so a small number of ratings can heavily skew the mean for that age. Based on these findings, a soccer player’s age does appear to affect both the potential and overall rating. These findings could be improved by adding a variable that captures how healthy players are at each age, which would give an additional variable to investigate and may make the findings easier to interpret.
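One way to check the explanation for the outlier is to count how many players fall into each age group alongside the mean. The sketch below uses a hypothetical `toy_ratings` data frame with made-up values, and the cutoff of 3 players for a "small" group is an arbitrary assumption chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data (values are made up): a single 42-year-old player
toy_ratings <- data.frame(
  age     = c(20, 20, 20, 30, 30, 30, 42),
  overall = c(60, 62, 64, 70, 72, 74, 80)
)

# Count players per age next to the mean, then flag very small groups
age_summary <- toy_ratings |>
  group_by(age) |>
  summarize(
    n_players    = n(),
    mean_overall = mean(overall, na.rm = TRUE)
  ) |>
  mutate(small_group = n_players < 3)  # arbitrary illustrative cutoff

age_summary
```

If the oldest ages in the real data turn out to have only one or two players, their means rest on very little data and should be read with caution.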
Source: The data set was taken from fifaindex.com (found through the GitHub link on Blackboard, which linked to Kaggle). Source links: https://www.kaggle.com/datasets/aayushmishra1512/fifa-2021-complete-player-data, https://fifaindex.com/players/top/fifa21_486/