Data Visualization using ggplot2

In this report I analyze a dataset of the 250 most expensive football transfers per season, from 2000-2001 through 2018-2019.

Source

Loading the dataset
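
The loading step is not reproduced in this write-up. As a minimal sketch, assuming the data was downloaded as a UTF-8 CSV file, it could look like the following. The helper `load_transfers` and the file name `transfers.csv` are hypothetical placeholders, not taken from the original analysis.

```r
# Packages used by the code in the rest of this report
library(dplyr)
library(ggplot2)

# Hypothetical helper: read the transfers CSV, keeping text columns as
# character vectors (the summary output lists them as class "character").
# The UTF-8 encoding is an assumption, made because names contain accents.
load_transfers <- function(path) {
  read.csv(path, stringsAsFactors = FALSE, fileEncoding = "UTF-8")
}

# Hypothetical file name; adjust to wherever the CSV was saved
csv_path <- "transfers.csv"
if (file.exists(csv_path)) {
  dataset <- load_transfers(csv_path)
  print(head(dataset))     # first six rows
  print(summary(dataset))  # per-column summary
}
```

The `file.exists()` guard only keeps the sketch from failing when the placeholder file is absent; in a real session the call can be made directly.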

Preview and Summary of the data

##                Name       Position Age    Team_from    League_from      Team_to
## 1        Luís Figo   Right Winger  27 FC Barcelona         LaLiga  Real Madrid
## 2    Hernán Crespo Centre-Forward  25        Parma        Serie A        Lazio
## 3     Marc Overmars    Left Winger  27      Arsenal Premier League FC Barcelona
## 4 Gabriel Batistuta Centre-Forward  31   Fiorentina        Serie A      AS Roma
## 5    Nicolas Anelka Centre-Forward  21  Real Madrid         LaLiga     Paris SG
## 6     Rio Ferdinand    Centre-Back  22     West Ham Premier League        Leeds
##        League_to    Season Market_value Transfer_fee
## 1         LaLiga 2000-2001           NA     60000000
## 2        Serie A 2000-2001           NA     56810000
## 3         LaLiga 2000-2001           NA     40000000
## 4        Serie A 2000-2001           NA     36150000
## 5        Ligue 1 2000-2001           NA     34500000
## 6 Premier League 2000-2001           NA     26000000
##      Name             Position              Age         Team_from        
##  Length:4700        Length:4700        Min.   : 0.00   Length:4700       
##  Class :character   Class :character   1st Qu.:22.00   Class :character  
##  Mode  :character   Mode  :character   Median :24.00   Mode  :character  
##                                        Mean   :24.34                     
##                                        3rd Qu.:27.00                     
##                                        Max.   :35.00                     
##                                                                          
##  League_from          Team_to           League_to            Season         
##  Length:4700        Length:4700        Length:4700        Length:4700       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   Market_value        Transfer_fee      
##  Min.   :    50000   Min.   :   825000  
##  1st Qu.:  3500000   1st Qu.:  4000000  
##  Median :  6000000   Median :  6500000  
##  Mean   :  8622469   Mean   :  9447586  
##  3rd Qu.: 10000000   3rd Qu.: 10820000  
##  Max.   :120000000   Max.   :222000000  
##  NA's   :1260
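
Two details in the summary deserve a check before any analysis: the minimum `Age` is 0, which is not a plausible player age, and `Market_value` is missing for 1,260 rows. Below is a minimal sketch of how such checks could be run, using a small made-up data frame (`toy`) in place of the real `dataset`.

```r
# Made-up rows that mimic the two issues seen in the summary output
toy <- data.frame(
  Name         = c("Player A", "Player B", "Player C", "Player D"),
  Age          = c(27, 0, 22, 31),
  Market_value = c(NA, 5e6, NA, 1.2e7),
  Transfer_fee = c(6e7, 4e6, 9e6, 1.5e7)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows with an implausible age (0 may be a placeholder for an unknown age)
suspicious_age <- toy[toy$Age <= 0, ]
print(suspicious_age)
```

Whether such rows should be dropped or corrected depends on the analysis; the sketch only surfaces them.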

Non-visual exploratory information

Top 10 players in terms of Market Value
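
The code for this table is not shown. One way it could be produced, sketched here on made-up data (`toy`) with `n = 3` instead of 10, is to drop rows without a market value and keep the largest remaining values with `dplyr::slice_max()`. This is an assumption about the approach, not the original code.

```r
library(dplyr)

# Made-up data; the real analysis would use `dataset` instead of `toy`
toy <- data.frame(
  Name         = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  Market_value = c(8e7, NA, 1.2e8, 5e7, 9e7)
)

# Drop missing market values explicitly, then keep the n largest
top_by_value <- toy %>%
  filter(!is.na(Market_value)) %>%
  slice_max(Market_value, n = 3) %>%
  select(Name, Market_value)

print(top_by_value)
```

Filtering out `NA` first matters for this dataset, because `Market_value` is missing for a large share of the rows.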

Top 10 teams that spend the most

# Total transfer fees paid by each buying club, keeping the ten largest totals
top_10_spenders <- dataset %>%
  group_by(Team_to) %>%
  summarise(Total_Fee = sum(Transfer_fee)) %>%
  top_n(10, Total_Fee) %>%
  arrange(desc(Total_Fee))
DT::datatable(top_10_spenders)

Visualizations

The first plot shows how the transfers in the dataset are distributed across selling leagues (`League_from`). Because `n()` counts rows, each bar gives the number of transfers out of that league, not the number of distinct teams.

# Data frame holding the number of transfers out of each league
league <- dataset %>%
  group_by(League_from) %>%
  summarise(leagues_freq = n())

# Horizontal bar chart, leagues ordered by frequency
ggplot(league, aes(x = reorder(League_from, leagues_freq), y = leagues_freq)) +
  geom_col(fill = "#76448a") +
  geom_text(aes(label = leagues_freq), vjust = 0.6, hjust = 1.2, size = 3, color = "black") +
  coord_flip() +
  labs(title = "Distribution of Transfers by League", x = "League", y = "Number of Transfers")
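
Note that `n()` counts rows, so every transfer adds one to its league's total; counting the distinct clubs in each league needs `n_distinct()` instead. A small sketch on made-up data (`toy`, with hypothetical club and league names) shows the difference:

```r
library(dplyr)

# Made-up transfers: three out of League X (two clubs), three out of League Y (one club)
toy <- data.frame(
  Team_from   = c("Club A", "Club A", "Club B", "Club C", "Club C", "Club C"),
  League_from = c("League X", "League X", "League X", "League Y", "League Y", "League Y")
)

league_counts <- toy %>%
  group_by(League_from) %>%
  summarise(
    transfers = n(),                   # rows (transfers) per league
    teams     = n_distinct(Team_from)  # unique selling clubs per league
  )

print(league_counts)
```

Both leagues have the same number of transfers here but a different number of clubs, so the two summaries answer different questions.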

Analysis of the Premier League

# From the visualization above, the Premier League is the leading league
# Note: complete.cases() drops rows with an NA in any column
# (including Market_value), not only rows with a missing Age
league_plot_with_outliers <- dataset %>%
  filter(League_from == "Premier League")
league_plot_with_outliers <- league_plot_with_outliers[complete.cases(league_plot_with_outliers), ]
league_plot_with_outliers <- league_plot_with_outliers %>%
  group_by(Team_from)

# Box plot of age by selling club, outliers included
ggplot(league_plot_with_outliers, aes(x = Team_from, y = Age)) +
  geom_boxplot() +
  scale_y_continuous("Age (log2 scale)", trans = "log2") +
  coord_flip() +
  ggtitle("Age by Team in the Premier League") +
  theme_classic() +
  theme(panel.background = element_rect(fill = "#abebc6"))

# Remove outliers within each team (1.5 * IQR rule), then store the team's mean age
plot_without_outliers <- league_plot_with_outliers %>%
  group_by(Team_from) %>%
  filter(
    Age <= quantile(Age, 0.75) + 1.5 * IQR(Age),
    Age >= quantile(Age, 0.25) - 1.5 * IQR(Age)
  ) %>%
  mutate(avg_age = mean(Age))
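
The filter above applies Tukey's rule: a value is kept only if it lies no more than 1.5 times the interquartile range below the first quartile or above the third quartile. A worked sketch on a made-up vector of ages (`ages` is illustrative, not taken from the dataset):

```r
# Made-up ages with one unusually old player
ages <- c(21, 22, 23, 24, 25, 26, 27, 40)

q1  <- quantile(ages, 0.25)
q3  <- quantile(ages, 0.75)
iqr <- IQR(ages)

# Tukey fences
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Keep only the values inside the fences; 40 lies above the upper fence
kept <- ages[ages >= lower & ages <= upper]
print(kept)
```

In the report's code the same rule runs once per team, because the data frame is grouped by `Team_from` before `filter()` is called.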

ggplot(plot_without_outliers, aes(x = reorder(Team_from, avg_age), y = Age)) +
  geom_boxplot() +
  scale_y_continuous("Mean, Median and Distribution of Age (log2 scale)", trans = "log2") +
  # `fun` is the argument name in ggplot2 >= 3.3.0 (older versions use `fun.y`); `fun.x` does not exist
  stat_summary(fun = mean, geom = "point", shape = 20, size = 2, color = "purple") +
  coord_flip() +
  ggtitle("Age by Team in the Premier League") +
  theme_classic() +
  theme(panel.background = element_rect(fill = "#abebc6"))

# The purple points mark each team's mean, which stat_summary() computes on the log2-transformed ages
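
Because ggplot2 transforms the data before computing statistics, a mean taken on a log2 axis corresponds to the geometric mean on the original scale, which is never larger than the arithmetic mean. A small base-R sketch with made-up ages (`ages` is illustrative) shows the size of the gap:

```r
# Made-up ages for one team
ages <- c(19, 22, 24, 35)

# Ordinary mean of the raw ages
arithmetic_mean <- mean(ages)

# Mean taken on the log2 scale, transformed back to years (the geometric mean)
geometric_mean <- 2^mean(log2(ages))

print(c(arithmetic = arithmetic_mean, geometric = geometric_mean))
```

If the arithmetic mean is the quantity of interest, one option is to drop the log2 transformation from the y scale, or to compute the mean beforehand and draw it with a separate `geom_point()` layer.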