The following dataset is the player database from EA Sports FIFA 18, a popular soccer video game. The cleaned version used here contains 17,981 players. The database has basic biographical data such as nationality, age, and wage. It also has each player’s ratings along with the positions they play.
summary(df_cleaned)
## Name Age Photo Nationality
## Length:17981 Min. :16.00 Length:17981 Length:17981
## Class :character 1st Qu.:21.00 Class :character Class :character
## Mode :character Median :25.00 Mode :character Mode :character
## Mean :25.14
## 3rd Qu.:28.00
## Max. :47.00
## Flag Overall Potential Club
## Length:17981 Min. :46.00 Min. :46.00 Length:17981
## Class :character 1st Qu.:62.00 1st Qu.:67.00 Class :character
## Mode :character Median :66.00 Median :71.00 Mode :character
## Mean :66.25 Mean :71.19
## 3rd Qu.:71.00 3rd Qu.:75.00
## Max. :94.00 Max. :94.00
## Club.Logo Value Wage Special
## Length:17981 Length:17981 Length:17981 Min. : 728
## Class :character Class :character Class :character 1st Qu.:1449
## Mode :character Mode :character Mode :character Median :1633
## Mean :1594
## 3rd Qu.:1786
## Max. :2291
## Acceleration Aggression Agility Balance
## Length:17981 Length:17981 Length:17981 Length:17981
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Ball.control Composure Crossing Curve
## Length:17981 Length:17981 Length:17981 Length:17981
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Dribbling Finishing Free.kick.accuracy GK.diving
## Length:17981 Length:17981 Length:17981 Length:17981
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## GK.handling GK.kicking GK.positioning GK.reflexes
## Length:17981 Length:17981 Length:17981 Length:17981
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Heading.accuracy Interceptions Jumping Long.passing
## Length:17981 Length:17981 Length:17981 Length:17981
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Long.shots Marking Penalties Positioning
## Length:17981 Length:17981 Length:17981 Length:17981
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Reactions Short.passing Shot.power Sliding.tackle
## Length:17981 Length:17981 Length:17981 Length:17981
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Sprint.speed Stamina Standing.tackle Strength
## Length:17981 Length:17981 Length:17981 Length:17981
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Vision Volleys ID Preferred.Positions
## Length:17981 Length:17981 Min. : 16 Length:17981
## Class :character Class :character 1st Qu.:192622 Class :character
## Mode :character Mode :character Median :214057 Mode :character
## Mean :207659
## 3rd Qu.:231448
## Max. :241219
On the men’s side of the game, the United States does not produce strong talent. The United States Men’s National Team failed to qualify for the 2018 World Cup. I decided to examine how the United States compares to other countries in FIFA 18, while also exploring some general characteristics of the database. Below are five visualizations that I created using R.
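library(dplyr)    # count(), filter(), group_by(), summarise()
library(ggplot2)
library(scales)   # comma()
# Count players per nationality and keep the 15 largest groups
country_count = data.frame(count(df_cleaned, Nationality))
country_count = country_count[order(country_count$n, decreasing = TRUE),]
top_countries = country_count[1:15,]
# theme() comes after theme_light(); a complete theme added later would reset the centered title
p1 = ggplot(top_countries, aes(x = reorder(Nationality, -n), y = n)) +
  geom_bar(colour = "dark blue", fill = "light blue", stat = "identity") +
  labs(title = "15 Most Represented Countries in FIFA 18",
       x = "Country",
       y = "Number of Players") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = comma)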
This is a bar chart of the 15 most represented countries in FIFA 18. The US ranks 12th. England has the most players, and it’s not even close: England has about 500 more players than Germany, which ranks second.
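The gap between the top two countries can be computed from the counts rather than estimated from the bars. The sketch below uses a small made-up table (`toy_counts` and its numbers are illustrative assumptions, not the FIFA 18 counts); with the real data, the `country_count` data frame would take its place.

```r
# Toy stand-in for country_count; the numbers are invented for illustration
toy_counts <- data.frame(
  Nationality = c("A", "B", "C", "D"),
  n = c(1600, 1100, 1000, 950)
)

# Sort descending so rank 1 is the most represented country
toy_counts <- toy_counts[order(toy_counts$n, decreasing = TRUE), ]
toy_counts$rank <- seq_len(nrow(toy_counts))

# Difference between the first- and second-ranked countries
top_gap <- toy_counts$n[1] - toy_counts$n[2]

# Rank of one country of interest
rank_of_c <- toy_counts$rank[toy_counts$Nationality == "C"]
```

Reading the rank from the sorted table avoids counting bars by eye.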
x_axis_labels = min(df_cleaned$Age):max(df_cleaned$Age)
p2 = ggplot(df_cleaned, aes(x = Age)) +
geom_histogram(bins = 32, color= "darkgreen", fill = "lightgreen") +
labs(title= "Histogram of Ages in FIFA 18", x = "Age", y = "Frequency") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels=comma) +
stat_bin(binwidth = 1, geom = 'text', color = 'black', aes(label = after_stat(count)), vjust = -0.5) +
scale_x_continuous(labels = x_axis_labels, breaks = x_axis_labels)
This is a histogram of the ages in FIFA 18. The histogram clearly shows that the data are right-skewed. There may be outliers at the far right end of the histogram; for example, there is one 47-year-old in the database. The most common age, or mode, is 25, with 1,522 players being 25 years old.
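Both the mode and the direction of the skew can be checked numerically. The following is a minimal sketch in base R on an invented `ages` vector (not the FIFA 18 data); it uses the moment-based skewness formula, which is one of several common definitions.

```r
# Toy ages, invented for illustration (not the FIFA 18 data)
ages <- c(17, 19, 21, 21, 23, 25, 25, 25, 27, 29, 33, 41)

# Mode: the most frequent value and how often it occurs
age_table <- table(ages)
mode_age <- as.integer(names(age_table)[which.max(age_table)])
mode_count <- max(age_table)

# Moment-based skewness; a positive value indicates a longer right tail
m <- mean(ages)
skewness <- mean((ages - m)^3) / mean((ages - m)^2)^1.5
```

With the real data, `df_cleaned$Age` would replace the toy vector.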
df_cleaned$Overall_Group = ifelse(df_cleaned$Overall >= 90, "90's",
  ifelse(df_cleaned$Overall >= 80, "80's",
  ifelse(df_cleaned$Overall >= 70, "70's",
  ifelse(df_cleaned$Overall >= 60, "60's",
  ifelse(df_cleaned$Overall >= 50, "50's", "40's")))))
my_countries = c("England", "Spain", "United States", "France", "Italy", "Germany")
new_df = df_cleaned %>%
filter(Nationality %in% my_countries) %>%
select(Nationality, Overall_Group) %>%
group_by(Nationality, Overall_Group) %>%
summarise(n = length(Nationality), .groups = 'keep') %>%
data.frame()
agg_tot = new_df %>%
select(Nationality, n) %>%
group_by(Nationality) %>%
summarise(tot = sum(n), .groups = 'keep') %>%
data.frame()
new_df$Overall_Group = as.factor(new_df$Overall_Group)
# round_any() comes from plyr; calling it by namespace avoids attaching plyr over dplyr
max_y = plyr::round_any(max(agg_tot$tot), 500, ceiling)
p3 = ggplot(new_df, aes(x = reorder(Nationality, -n, sum), y = n, fill = Overall_Group)) +
geom_bar(stat = "identity", position = position_stack(reverse = TRUE)) +
labs(title = "Stacked Bar Chart of Overall Rating By Country", x = "Country", y = "Frequency", fill = "Overall Rating") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_brewer(palette = "Paired", guide = guide_legend(reverse = TRUE)) +
geom_text(data = agg_tot, aes(x = Nationality, y = tot, label = scales::comma(tot), fill = NULL), vjust = -0.3) +
scale_y_continuous(labels = comma, limits = c(0,max_y))
I decided to compare the United States with England, France, Germany, Spain, and Italy. These five countries are home to some of the top leagues in the world, and they also perform well in international competitions.
There are a few takeaways from this graph. First, the United States has only one player with an overall rating in the 80’s. Visually, the light red section (ratings in the 80’s) is easy to see on every bar except that of the United States. For Germany, even the darker red section (ratings in the 90’s) can be seen. Spain is impressive for how many quality players it produces: its two blue sections (players with overall ratings in the 40’s or 50’s) are smaller than those of every other country. Compared with the United States, France produces about three times as many players and still has fewer low-rated players.
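Because the countries differ so much in total player count, comparing each rating group as a share of the country's players can make these observations easier to verify. This is a minimal sketch in base R on invented counts (the `toy` data frame and its values are assumptions for illustration); `new_df` from the analysis has the same three columns.

```r
# Toy counts by country and rating group; values are invented for illustration
toy <- data.frame(
  Nationality   = c("X", "X", "X", "Y", "Y", "Y"),
  Overall_Group = c("50's", "60's", "70's", "50's", "60's", "70's"),
  n             = c(10, 60, 30, 100, 80, 20)
)

# Cross-tabulate, then convert each row (country) to shares that sum to 1
counts <- xtabs(n ~ Nationality + Overall_Group, data = toy)
shares <- prop.table(counts, margin = 1)

# Share of each country's players rated in the 50's
low_share <- shares[, "50's"]
```

Here country X has 10% of its players in the 50's group and country Y has 50%, a difference that raw stacked counts can hide.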
df_3 = df_cleaned %>%
filter(Nationality %in% my_countries) %>%
select(Age, Nationality) %>%
mutate(age_group = ifelse(Age < 20, "Under 20", ifelse(Age < 25, "20-24", ifelse(Age < 30, "25-29", ifelse(Age < 35, "30-34", ifelse(Age < 40, "35-39", "40+")))))) %>%
group_by(age_group, Nationality) %>%
summarise(n = length(Age), .groups='keep') %>%
data.frame()
# Store age_group as a factor so the x-axis follows age order instead of alphabetical order
df_3$age_group = factor(df_3$age_group, levels = c("Under 20", "20-24", "25-29", "30-34", "35-39", "40+"))
p4 = ggplot(df_3, aes(x = age_group, y = n, group = Nationality)) +
geom_line(aes(color = Nationality), size = 3) +
labs(title = "Number of Players by Age Group and Country", x = "Age Group", y = "Number of Players") +
theme_light() +
theme(plot.title = element_text(hjust = 0.5)) +
geom_point(shape = 21, size = 5, color = "black", fill = "white") +
scale_y_continuous(labels = comma) +
scale_color_brewer(palette = "Paired", name = "Country", guide = guide_legend(reverse = TRUE))
This is a line chart showing each country’s representation in various age groups. The countries are the same as in the previous graph. The age groups are as follows: under 20, 20-24, 25-29, 30-34, 35-39, and 40 and over.
A right-skewed pattern can be seen, just as in the histogram of ages. Germany seems to be the country with the most youth: it has the second-most players in each of the age groups under 30 but the second-fewest in the groups aged 30 and older. The United States has the fewest players in every age group. Its curve is also the least skewed of the six countries, while England’s is the most skewed.
df_4 = df_cleaned %>%
select(Age, Overall_Group) %>%
mutate(age_group = ifelse(Age < 20, "Under 20", ifelse(Age < 25, "20-24", ifelse(Age < 30, "25-29", ifelse(Age < 35, "30-34", ifelse(Age < 40, "35-39", "40+")))))) %>%
group_by(age_group, Overall_Group) %>%
summarise(n = length(Age), .groups='keep') %>%
data.frame()
# Store age_group as a factor so the x-axis follows age order instead of alphabetical order
df_4$age_group = factor(df_4$age_group, levels = c("Under 20", "20-24", "25-29", "30-34", "35-39", "40+"))
breaks = seq(0, max(df_4$n), by = 500)
p5 = ggplot(df_4, aes(x = age_group, y = Overall_Group, fill = n)) +
geom_tile(color = "black") +
geom_text(aes(label = comma(n))) +
coord_equal(ratio = 1) +
labs(title = "Heatmap of Overall Group and Age Group", x = "Age Group", y = "Overall Group", fill = "Player Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_continuous(low = "lightblue", high = "red", labels = comma, breaks = breaks) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(colour = "black")))
This graph is a heatmap that compares the age groups with the overall rating groups. The cell with the most players is the 20-24 age group with an overall rating in the 60’s. There is a bit of a trend where players under 30 continue to get better and hit their prime around their late 20’s and early 30’s. Once a player is in their late 30’s, their skills start to diminish and their overall rating goes down.
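The heatmap shows counts, so the prime-age trend is easier to confirm with the mean rating per age group. The sketch below is one way to do that in base R, using an invented `players` data frame (an assumption for illustration, not the FIFA 18 data) and `cut()` as a compact alternative to nested `ifelse()` calls.

```r
# Toy players; ages and ratings are invented for illustration
players <- data.frame(
  Age     = c(18, 19, 22, 24, 26, 28, 31, 33, 36, 38),
  Overall = c(58, 60, 64, 66, 70, 72, 71, 73, 68, 66)
)

# Bin ages into the same groups; right = FALSE makes each interval [a, b)
players$age_group <- cut(
  players$Age,
  breaks = c(-Inf, 20, 25, 30, 35, 40, Inf),
  labels = c("Under 20", "20-24", "25-29", "30-34", "35-39", "40+"),
  right = FALSE
)

# Mean rating per age group (NA for groups with no players)
mean_by_group <- tapply(players$Overall, players$age_group, mean)

# Age group with the highest mean rating; which.max() skips NA values
peak_group <- names(which.max(mean_by_group))
```

With the real data, `df_cleaned$Age` and `df_cleaned$Overall` would replace the toy columns.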
In conclusion, the main problem for the United States is that it does not have enough players playing professional soccer, and the ones who are playing are not talented enough. The United States needs to increase the number of young players in the sport. This would give the US a better chance of developing star players and, in turn, of competing better at the international level.