In this tutorial, we are going to plot three graphs. The goal of this exercise is to test your ability to follow a script/tutorial that is given to you. For those who like soccer, we are going to plot some statistics about the top goal scorers in the UEFA Champions League.

How does it work?

This page format is called ‘R Markdown’. It combines text chunks (like the one you are reading now) with code chunks (the grey boxes) and output chunks (the white ones, where each line starts with ##). To follow along, copy and paste the code (only what is in the grey boxes!) into RStudio and run it.

Libraries

The first step in any script is to load the libraries it needs. We need ggplot2 to make nice plots. In addition, we need a package called ‘curl’ to download the data directly from the internet into the script, and a package called ‘ggrepel’ to add non-overlapping text labels to our ggplot. We do not have these two packages installed yet, so we are going to install them too.

library(ggplot2) # load ggplot2 (you should already have it installed)

if(!require("curl")){      # an if condition, if curl is not installed:
  install.packages("curl") #install curl
}
## Loading required package: curl
## Using libcurl 7.79.1 with LibreSSL/3.3.6
if(!require("ggrepel")){      # an if condition, if ggrepel is not installed:
  install.packages("ggrepel") #install ggrepel
}
## Loading required package: ggrepel
library("ggrepel")
library(curl) # finally, load the library curl
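
The install-if-missing pattern above can also be written once for several packages. The following is a minimal sketch, not part of the original script: `missing_packages()` is a hypothetical helper name, and it relies on base R's `requireNamespace()`, which checks whether a package can be loaded without attaching it.

```r
# Hypothetical helper (not in the original tutorial): return the names of
# the packages in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("ggplot2", "curl", "ggrepel")
to_install <- missing_packages(needed)
to_install  # empty once everything is installed

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

After installing, each package still has to be loaded with `library()`, as in the script above.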

Get the data

Now we are going to import the data into our R environment using a function from the ‘curl’ package.

x <- curl("https://github.com/mrgambero/lessons_microbiome/raw/main/uefa_champions_league_top_scorers.csv")
data_frame_cl <- read.delim(x,  sep = ",")

If you look on the right side of the screen, in the Environment tab, you should see data_frame_cl, which has 20 observations of 4 variables. Great! Let’s now explore the data a little.
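
The same check can be done from the console instead of the Environment tab. The sketch below is only an illustration: it parses three rows of the data from an inline string (an assumption made so that it runs without an internet connection) and stores them in a hypothetical object called `sample_cl`.

```r
# Three rows copied from the tutorial's data, parsed from an inline string
# (an assumption made so that the sketch runs offline).
csv_text <- "Name,goals,games,Nationality
Cristiano Ronaldo,141,183,Portugal
Lionel Messi,125,159,Argentina
Robert Lewandowski,86,110,Poland"

sample_cl <- read.delim(text = csv_text, sep = ",")

dim(sample_cl)  # number of rows, then number of columns
str(sample_cl)  # column names and types
```

Running `dim(data_frame_cl)` on the full data should report the 20 rows and 4 columns mentioned above.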

Look at the data

data_frame_cl # print the data frame in the console
##                    Name goals games Nationality
## 1     Cristiano Ronaldo   141   183    Portugal
## 2          Lionel Messi   125   159   Argentina
## 3    Robert Lewandowski    86   110      Poland
## 4         Karim Benzema    86   145      France
## 5         Raul Gonzalez    71   142       Spain
## 6   Ruud van Nistelrooy    60    73 Netherlands
## 7     Andriy Shevchenko    59   100     Ukraine
## 8         Thomas Muller    52   137     Germany
## 9         Thierry Henry    51   112      France
## 10      Filippo Inzaghi    50    81       Italy
## 11   Alfredo Di Stefano    49    58       Spain
## 12   Zlatan Ibrahimovic    49   124      Sweden
## 13        Sergio Aguero    47    79   Argentina
## 14              Eusebio    47    65    Portugal
## 15        Didier Drogba    44    92 Ivory coast
## 16 Alessandro Del Piero    43    89       Italy
## 17               Neymar    41    79      Brasil
## 18        Mohamed Salah    40    75       Egypt
## 19   Fernando Morientes    39    93       Spain
## 20        Ferenc Puskas    36    41     Hungary
mean(data_frame_cl$goals) # mean number of Champions League goals for the 20 all-time top scorers
## [1] 60.8
mean(data_frame_cl$games) # mean number of Champions League games for the 20 all-time top scorers
## [1] 101.85
sort(table(data_frame_cl$Nationality), decreasing = TRUE) # all nationalities sorted by the number of players
## 
##       Spain   Argentina      France       Italy    Portugal      Brasil 
##           3           2           2           2           2           1 
##       Egypt     Germany     Hungary Ivory coast Netherlands      Poland 
##           1           1           1           1           1           1 
##      Sweden     Ukraine 
##           1           1
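
To see what `sort(table(...))` does step by step, here is a small sketch that rebuilds the Nationality column by hand (copied from the printed data frame above) and counts the values. The object names `nationality`, `counts` and `sorted_counts` are only illustrative.

```r
# Nationality column copied from the printed data frame above.
nationality <- c("Portugal", "Argentina", "Poland", "France", "Spain",
                 "Netherlands", "Ukraine", "Germany", "France", "Italy",
                 "Spain", "Sweden", "Argentina", "Portugal", "Ivory coast",
                 "Italy", "Brasil", "Egypt", "Spain", "Hungary")

counts <- table(nationality)                      # one count per distinct value
sorted_counts <- sort(counts, decreasing = TRUE)  # largest count first

names(sorted_counts)[1]   # most frequent nationality
sorted_counts[["Spain"]]  # number of Spanish players
length(counts)            # number of distinct nationalities
```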

First plot

Great! We have 20 players from multiple nationalities. Now let’s create a barplot with the number of goals as the height of the bars, using ggplot. ggplot orders a text x-axis alphabetically, so to keep the players sorted by number of goals we convert Name to a factor whose levels follow the row order of the data frame, which is already sorted by goals.

data_frame_cl$Name = factor(data_frame_cl$Name, levels = data_frame_cl$Name ) # fix the player order: levels follow the row order (already sorted by goals)
p<-ggplot(data=data_frame_cl, aes(x=Name, y=goals)) + # set the data frame and the columns for x and y
  geom_bar(stat="identity") + # draw a barplot with bar height equal to the goals value
  theme(axis.text.x = element_text(angle = 45, hjust=1)) + # rotate the x-axis labels so they do not overlap
  xlab("Player") + # change x label
  ylab("CL goals scored") # change y label
p
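
The effect of the `factor(..., levels = ...)` line can be checked without plotting. The sketch below uses four rows copied from the data; `players` and `goals` are illustrative vector names, and `reorder()` is shown as one possible alternative rather than as the method used in this tutorial.

```r
# Four rows copied from the tutorial's data, already sorted by goals.
players <- c("Cristiano Ronaldo", "Lionel Messi", "Karim Benzema", "Raul Gonzalez")
goals   <- c(141, 125, 86, 71)

levels(factor(players))                    # default: alphabetical order
levels(factor(players, levels = players))  # explicit levels: keeps the row order

# Alternative: let reorder() derive the level order from the goals
# (negated so that the highest scorer comes first).
levels(reorder(players, -goals))
```

ggplot draws the bars in the order of the factor levels, which is why the second form keeps the top scorer on the left.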

Second plot

Now, let’s determine which nationality scored the most goals. We need to create a new dataset where we sum together the goals for each nationality.

data_frame_nationality = aggregate(goals ~ Nationality,data_frame_cl, FUN = "sum")  # We create the new dataframe
# We sort it in decreasing order
data_frame_nationality = data_frame_nationality[order(data_frame_nationality$goals, decreasing = TRUE),]
data_frame_nationality
##    Nationality goals
## 11    Portugal   188
## 1    Argentina   172
## 12       Spain   159
## 4       France   137
## 7        Italy    93
## 10      Poland    86
## 9  Netherlands    60
## 14     Ukraine    59
## 5      Germany    52
## 13      Sweden    49
## 8  Ivory coast    44
## 2       Brasil    41
## 3        Egypt    40
## 6      Hungary    36

It seems that Portugal is leading the list! Let’s create a barplot, like before.

data_frame_nationality$Nationality = factor(data_frame_nationality$Nationality, levels = data_frame_nationality$Nationality ) # fix the order: levels follow the row order (sorted by goals above)
p<-ggplot(data=data_frame_nationality, aes(x=Nationality, y=goals)) + # set the data frame and the columns for x and y
  geom_bar(stat="identity") + # draw a barplot with bar height equal to the goals value
  xlab("Nationality") + # change x label
  ylab("CL goals scored") + # change y label
  theme_classic() + # change the way the plot looks
  theme(axis.text.x = element_text(angle = 45, hjust=1)) # rotate the x-axis labels
p

Very cool! Portugal, Argentina and Spain are on the podium.
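
As a cross-check of the aggregation step, the totals for Portugal and Argentina can be reproduced from the four players of those two nationalities. This is a small sketch with rows copied from the data; `sample_cl`, `by_formula` and `by_tapply` are illustrative names, and `tapply()` is shown only as an alternative to `aggregate()`.

```r
# The four Portuguese and Argentinian players, copied from the tutorial's data.
sample_cl <- data.frame(
  Name        = c("Cristiano Ronaldo", "Lionel Messi", "Sergio Aguero", "Eusebio"),
  goals       = c(141, 125, 47, 47),
  Nationality = c("Portugal", "Argentina", "Argentina", "Portugal")
)

# Formula interface: sum the goals within each Nationality.
by_formula <- aggregate(goals ~ Nationality, data = sample_cl, FUN = sum)
by_formula

# The same totals as a named vector, using tapply().
by_tapply <- tapply(sample_cl$goals, sample_cl$Nationality, sum)
by_tapply
```

The totals match the first two rows of the sorted table: 188 for Portugal and 172 for Argentina.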

Third plot

For the next plot, let’s check whether the number of games these players played correlates with the number of goals they scored.

p<-ggplot(data=data_frame_cl, aes(x=games, y=goals)) + # set the data frame and the columns for x and y
  geom_point() + # draw one point per player
  xlab("Games played") + # change x label
  ylab("CL goals scored") + # change y label
  theme_light() + # change the look of the plot
  stat_smooth(method="lm", se=FALSE) + # add a linear regression line, without confidence band
  geom_text_repel(aes(label = Name),
                  size = 3.5) # label each point with the player's name, avoiding overlaps
p
## `geom_smooth()` using formula = 'y ~ x'
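
The plot shows the trend visually. To put a number on it, one option (not part of the original script) is to compute the Pearson correlation with `cor()` and to fit the same kind of linear model with `lm()`. The sketch below retypes the games and goals columns from the printed data frame so that it runs on its own.

```r
# games and goals columns copied from the printed data frame.
games <- c(183, 159, 110, 145, 142, 73, 100, 137, 112, 81,
           58, 124, 79, 65, 92, 89, 79, 75, 93, 41)
goals <- c(141, 125, 86, 86, 71, 60, 59, 52, 51, 50,
           49, 49, 47, 47, 44, 43, 41, 40, 39, 36)

r   <- cor(games, goals)   # Pearson correlation coefficient
fit <- lm(goals ~ games)   # linear model of goals as a function of games

r
coef(fit)                  # intercept and slope of the fitted line
```

With the full data frame loaded, the equivalent call would be `cor(data_frame_cl$games, data_frame_cl$goals)`.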

Finished

That was the last graph. We are done! Did it all work? If so, congratulations! If not, try troubleshooting, for example by googling the error message.