Data Visualization using ggplot2
I will be analyzing a dataset of the top 250 most expensive football transfers from the 2000-2001 season through 2018-2019. The output below shows the first six rows, followed by a summary of all 4,700 rows.
## Name Position Age Team_from League_from Team_to
## 1 Luís Figo Right Winger 27 FC Barcelona LaLiga Real Madrid
## 2 Hernán Crespo Centre-Forward 25 Parma Serie A Lazio
## 3 Marc Overmars Left Winger 27 Arsenal Premier League FC Barcelona
## 4 Gabriel Batistuta Centre-Forward 31 Fiorentina Serie A AS Roma
## 5 Nicolas Anelka Centre-Forward 21 Real Madrid LaLiga Paris SG
## 6 Rio Ferdinand Centre-Back 22 West Ham Premier League Leeds
## League_to Season Market_value Transfer_fee
## 1 LaLiga 2000-2001 NA 60000000
## 2 Serie A 2000-2001 NA 56810000
## 3 LaLiga 2000-2001 NA 40000000
## 4 Serie A 2000-2001 NA 36150000
## 5 Ligue 1 2000-2001 NA 34500000
## 6 Premier League 2000-2001 NA 26000000
## Name Position Age Team_from
## Length:4700 Length:4700 Min. : 0.00 Length:4700
## Class :character Class :character 1st Qu.:22.00 Class :character
## Mode :character Mode :character Median :24.00 Mode :character
## Mean :24.34
## 3rd Qu.:27.00
## Max. :35.00
##
## League_from Team_to League_to Season
## Length:4700 Length:4700 Length:4700 Length:4700
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Market_value Transfer_fee
## Min. : 50000 Min. : 825000
## 1st Qu.: 3500000 1st Qu.: 4000000
## Median : 6000000 Median : 6500000
## Mean : 8622469 Mean : 9447586
## 3rd Qu.: 10000000 3rd Qu.: 10820000
## Max. :120000000 Max. :222000000
## NA's :1260
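# Top 10 buying clubs by total transfer fees paid over the whole period
top_10_spenders <- dataset %>%
  group_by(Team_to) %>%
  summarize(Total_Fee = sum(Transfer_fee)) %>%
  top_n(10, Total_Fee) %>%
  arrange(desc(Total_Fee))
# Show the result as an interactive table
DT::datatable(top_10_spenders)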
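As a small illustration of the aggregation above, the sketch below runs the same group, sum and rank steps on a made-up data frame. The club names and fees in `toy` are invented for the example and are not taken from the dataset. It uses `slice_max()`, which dplyr documents as the replacement for the superseded `top_n()`, and assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data: club names and fees are invented for illustration
toy <- data.frame(
  Team_to      = c("Club A", "Club B", "Club A", "Club C", "Club B", "Club A"),
  Transfer_fee = c(10, 20, 5, 40, 15, 1),
  stringsAsFactors = FALSE
)

top_2_spenders <- toy %>%
  group_by(Team_to) %>%
  summarize(Total_Fee = sum(Transfer_fee), .groups = "drop") %>%
  slice_max(Total_Fee, n = 2)  # keeps the 2 largest totals, largest first

print(top_2_spenders)
```

Unlike `top_n()`, `slice_max()` returns the rows already sorted, so a separate `arrange(desc(...))` step is not needed in this variant.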
A graph showing the distribution of transfers in the dataset by selling league. Each bar gives the number of transfers in which the player left a club in that league.
# Data frame holding the number of transfers out of each league
league <- dataset %>%
  group_by(League_from) %>%
  summarise(leagues_freq = n())  # n() counts rows (transfers), not distinct teams
# Visualization using ggplot
ggplot(league, aes(x = reorder(League_from, leagues_freq), y = leagues_freq)) +
  geom_bar(stat = "identity", fill = "#76448a") +
  labs(title = "Distribution of Transfers by League", x = "League", y = "Number of Transfers") +
  coord_flip() +
  geom_text(aes(label = leagues_freq), vjust = 0.6, hjust = 1.2, size = 3, color = "black")
# From the visualization above, the Premier League is the leading league
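The frequencies in this plot come from `n()`, which counts rows. The following is a minimal sketch, on made-up data, of how that differs from counting distinct clubs with `n_distinct()`. The `toy` data frame and its club and league names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: three transfers out of two clubs in League X,
# and one transfer out of a single club in League Y
toy <- data.frame(
  Team_from   = c("Club A", "Club A", "Club B", "Club C"),
  League_from = c("League X", "League X", "League X", "League Y"),
  stringsAsFactors = FALSE
)

league_counts <- toy %>%
  group_by(League_from) %>%
  summarise(
    transfers = n(),                   # number of rows in the group
    teams     = n_distinct(Team_from)  # number of different clubs in the group
  )

print(league_counts)
```

A league with a few very active selling clubs can therefore rank high on `n()` while having a modest number of distinct teams.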
# Keep only transfers out of Premier League clubs
league_plot_with_outliers <- dataset %>%
  filter(League_from == 'Premier League')
# Drop rows with any missing value, such as the missing Market_value entries
league_plot_with_outliers <- league_plot_with_outliers[complete.cases(league_plot_with_outliers), ]
league_plot_with_outliers <- league_plot_with_outliers %>%
  group_by(Team_from)
# A plot with outliers
ggplot(league_plot_with_outliers, aes(x = Team_from, y = Age)) +
  geom_boxplot() +
  scale_y_continuous("Age (log2 scale)", trans = 'log2') +
  coord_flip() +
  ggtitle("Age by Team in the Premier League") +
  theme_classic() +
  theme(panel.background = element_rect(fill = "#abebc6"))
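# Removing the outliers: within each team, keep only ages inside the
# usual boxplot fences, [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
plot_without_outliers <- league_plot_with_outliers %>%
  group_by(Team_from) %>%
  filter(Age <= quantile(Age, 0.75) + 1.5 * IQR(Age) &
           Age >= quantile(Age, 0.25) - 1.5 * IQR(Age)) %>%
  mutate(avg_age = mean(Age))  # per-team mean age, used to order the teams in the plot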
ggplot(plot_without_outliers, aes(x = reorder(Team_from, avg_age), y = Age)) +
  geom_boxplot() +
  scale_y_continuous("Average, Median and Distribution of Age", trans = 'log2') +
  # Overlay the per-team mean; the argument is `fun` in ggplot2 >= 3.3.0 (`fun.y` in older versions), not `fun.x`
  stat_summary(fun = mean, geom = "point", shape = 20, size = 2, color = "purple", fill = "purple") +
  coord_flip() +
  ggtitle("Age by Team in the Premier League") +
  theme_classic() +
  theme(panel.background = element_rect(fill = "#abebc6"))
# The purple circles represent the mean. Because the y scale is log2-transformed
# before the statistics are computed, this is the mean of log2(Age) drawn back on the age axis.
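To make the outlier filter used for this plot concrete, here is a minimal sketch of the same 1.5 × IQR rule applied per group to a made-up data frame. The `toy` data and its club names are hypothetical; only the filtering expression mirrors the code above.

```r
library(dplyr)

# Hypothetical toy data: Club A contains one obvious outlier (age 40),
# Club B contains none
toy <- data.frame(
  Team_from = c(rep("Club A", 6), rep("Club B", 5)),
  Age       = c(22, 23, 24, 25, 26, 40, 21, 22, 23, 24, 25),
  stringsAsFactors = FALSE
)

kept <- toy %>%
  group_by(Team_from) %>%
  filter(
    Age <= quantile(Age, 0.75) + 1.5 * IQR(Age),  # upper fence, computed per team
    Age >= quantile(Age, 0.25) - 1.5 * IQR(Age)   # lower fence, computed per team
  ) %>%
  ungroup()

print(kept)
```

Because the fences are computed inside `group_by(Team_from)`, each club is judged against its own age distribution, so an age that is unusual for one club can be retained for another.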