In this tutorial, we are going to plot three graphs. The goal of this exercise is to test your ability to follow a script/tutorial that is given to you. For those who like soccer, we are going to plot some statistics about the top goal scorers in the UEFA Champions League.
This webpage format is called ‘Markdown’. It combines text chunks (like the one you are reading now), code chunks (the grey boxes), and output chunks (the white ones). To follow along, copy and paste the code (only what is in the grey boxes!) into RStudio and run it.
The first step in each script is to load the libraries it needs. We need ggplot2 to make nice plots. We also need a package called ‘curl’ to download the data directly from the internet into the script, and a package called ‘ggrepel’, which adds non-overlapping text labels to ggplot. You may not have curl and ggrepel installed yet, so the code below installs them if they are missing.
library(ggplot2) # load ggplot2 (you should already have it installed)
if(!require("curl")){ # an if condition: if curl cannot be loaded because it is not installed:
  install.packages("curl") # install curl
}
## Loading required package: curl
## Using libcurl 7.79.1 with LibreSSL/3.3.6
if(!require("ggrepel")){ # an if condition: if ggrepel cannot be loaded because it is not installed:
  install.packages("ggrepel") # install ggrepel
}
## Loading required package: ggrepel
library("ggrepel")
library(curl) # finally, load the library curl
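If you want to check which of the three packages are missing before installing anything, a minimal sketch could look like the one below. The variable names `needed`, `installed` and `missing_pkgs` are made up for this illustration. It uses `requireNamespace()`, which only tests whether a package is available and does not attach it.

```r
# Packages used in this tutorial
needed <- c("ggplot2", "curl", "ggrepel")

# TRUE/FALSE for each package: is it available on this machine?
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that are not available
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment this line to install them
} else {
  message("All packages are installed.")
}
```

Nothing is installed unless you uncomment the `install.packages()` line, so the check is safe to run more than once.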
Now we are going to import the data into our R environment using a function from the ‘curl’ package.
x <- curl("https://github.com/mrgambero/lessons_microbiome/raw/main/uefa_champions_league_top_scorers.csv")
data_frame_cl <- read.delim(x, sep = ",")
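If the download does not work (for example, when you are offline), you can still try the parsing step on a few rows typed in by hand. The sketch below assumes the file has a header line with the column names shown in the printed table further down; `csv_text` and `sample_cl` are names invented for this example.

```r
# Three rows copied from the table in this tutorial, written as CSV text
csv_text <- "Name,goals,games,Nationality
Cristiano Ronaldo,141,183,Portugal
Lionel Messi,125,159,Argentina
Robert Lewandowski,86,110,Poland"

# read.csv() can parse a string directly through its 'text' argument
sample_cl <- read.csv(text = csv_text)

str(sample_cl)  # show the column names and types
```

`read.csv()` behaves like `read.delim(..., sep = ",")`, so the result has the same structure as the full data frame, only with fewer rows.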
If you look on the right side of the screen, in the Environment tab, you should see data_frame_cl, which has 20 observations of 4 variables. Great! Let’s now explore the data a little.
data_frame_cl # print the data frame in the console
##                    Name goals games Nationality
## 1     Cristiano Ronaldo   141   183    Portugal
## 2          Lionel Messi   125   159   Argentina
## 3    Robert Lewandowski    86   110      Poland
## 4         Karim Benzema    86   145      France
## 5         Raul Gonzalez    71   142       Spain
## 6   Ruud van Nistelrooy    60    73 Netherlands
## 7     Andriy Shevchenko    59   100     Ukraine
## 8         Thomas Muller    52   137     Germany
## 9         Thierry Henry    51   112      France
## 10      Filippo Inzaghi    50    81       Italy
## 11   Alfredo Di Stefano    49    58       Spain
## 12   Zlatan Ibrahimovic    49   124      Sweden
## 13        Sergio Aguero    47    79   Argentina
## 14              Eusebio    47    65    Portugal
## 15        Didier Drogba    44    92 Ivory coast
## 16 Alessandro Del Piero    43    89       Italy
## 17               Neymar    41    79      Brasil
## 18        Mohamed Salah    40    75       Egypt
## 19   Fernando Morientes    39    93       Spain
## 20        Ferenc Puskas    36    41     Hungary
mean(data_frame_cl$goals) # calculate the mean number of Champions League goals for the 20 top scorers of all time
## [1] 60.8
mean(data_frame_cl$games) # calculate the mean number of Champions League games for the 20 top scorers of all time
## [1] 101.85
sort(table(data_frame_cl$Nationality), decreasing = TRUE) # all nationalities sorted by the number of players
##
##       Spain   Argentina      France       Italy    Portugal      Brasil
##           3           2           2           2           2           1
##       Egypt     Germany     Hungary Ivory coast Netherlands      Poland
##           1           1           1           1           1           1
##      Sweden     Ukraine
##           1           1
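As a quick sanity check on the mean number of goals reported above, the average can be recomputed by hand as the sum divided by the count. The sketch below retypes the goals column from the printed table so that it runs without the download; the name `goals_by_hand` is only for illustration.

```r
# Goals column retyped from the table printed above (illustrative copy)
goals_by_hand <- c(141, 125, 86, 86, 71, 60, 59, 52, 51, 50,
                   49, 49, 47, 47, 44, 43, 41, 40, 39, 36)

total_goals <- sum(goals_by_hand)    # 1216 goals in total
n_players <- length(goals_by_hand)   # 20 players

total_goals / n_players              # 60.8, the same value mean() returns
```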
Great! We have 20 players from 14 nationalities. Now let’s create a barplot with the number of goals as the height of the bars. To create this barplot, we are going to use ggplot! By default, ggplot orders text values on the x axis alphabetically, so we first turn the Name column into a factor whose levels follow the row order of the data frame, which is already sorted by the number of goals.
data_frame_cl$Name = factor(data_frame_cl$Name, levels = data_frame_cl$Name) # keep the bars in row order (already sorted by goals)
p <- ggplot(data = data_frame_cl, aes(x = Name, y = goals)) + # set the data frame and the columns for x and y
  geom_bar(stat = "identity") + # draw a barplot with bar height equal to the y value
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # rotate the x labels so they do not overlap
  xlab("Player") + # change the x label
  ylab("CL goals scored") # change the y label
p
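To see why the `factor()` step changes the bar order, it can help to look at factor levels on their own. This is a small sketch with three player names picked from the table; the vector `players` is made up for the example.

```r
# Three names in "sorted by goals" order, which is not alphabetical
players <- c("Lionel Messi", "Cristiano Ronaldo", "Karim Benzema")

# Without 'levels', factor() sorts the levels alphabetically
default_levels <- levels(factor(players))

# With 'levels = players', the levels keep the order we supplied
custom_levels <- levels(factor(players, levels = players))

default_levels  # "Cristiano Ronaldo" "Karim Benzema" "Lionel Messi"
custom_levels   # "Lionel Messi" "Cristiano Ronaldo" "Karim Benzema"
```

ggplot draws discrete axes in the order of the factor levels, which is why setting the levels controls the order of the bars.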
Now, let’s determine which nationality scored the most goals. We need to create a new dataset where we sum together the goals for each nationality.
data_frame_nationality = aggregate(goals ~ Nationality,data_frame_cl, FUN = "sum") # We create the new dataframe
# We sort it in decreasing order
data_frame_nationality = data_frame_nationality[order(data_frame_nationality$goals, decreasing = TRUE),]
data_frame_nationality
##    Nationality goals
## 11    Portugal   188
## 1    Argentina   172
## 12       Spain   159
## 4       France   137
## 7        Italy    93
## 10      Poland    86
## 9  Netherlands    60
## 14     Ukraine    59
## 5      Germany    52
## 13      Sweden    49
## 8  Ivory coast    44
## 2       Brasil    41
## 3        Egypt    40
## 6      Hungary    36
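If it is not obvious what `aggregate()` and `order()` are doing, it may help to run them on a tiny data frame where the sums can be checked by hand. The sketch below uses only the four Portuguese and Argentinian rows from the table; `small` and `totals` are names invented for this example.

```r
# Four rows copied from the table: two players each for Portugal and Argentina
small <- data.frame(
  Nationality = c("Portugal", "Argentina", "Argentina", "Portugal"),
  goals = c(141, 125, 47, 47)
)

# Sum the goals within each nationality (one row per nationality)
totals <- aggregate(goals ~ Nationality, data = small, FUN = sum)

# Reorder the rows from the highest to the lowest total
totals <- totals[order(totals$goals, decreasing = TRUE), ]

totals  # Portugal: 141 + 47 = 188, Argentina: 125 + 47 = 172
```

These totals match the first two rows of the full result above.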
It seems that Portugal is leading the list! Let’s create a barplot, like before.
data_frame_nationality$Nationality = factor(data_frame_nationality$Nationality, levels = data_frame_nationality$Nationality) # keep the bars in row order (sorted by goals)
p <- ggplot(data = data_frame_nationality, aes(x = Nationality, y = goals)) + # set the data frame and the columns for x and y
  geom_bar(stat = "identity") + # draw a barplot with bar height equal to the y value
  xlab("Nationality") + # change the x label
  ylab("CL goals scored") + # change the y label
  theme_classic() + # change the way the plot looks!
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # rotate the x labels so they do not overlap
p
Very cool! Portugal, Argentina and Spain are on the podium.
For the last plot, let’s check whether the number of games these players played correlates with the number of goals they scored.
p <- ggplot(data = data_frame_cl, aes(x = games, y = goals)) + # set the data frame and the columns for x and y
  geom_point() + # draw one point per player
  xlab("Games played") + # change the x label
  ylab("CL goals scored") + # change the y label
  theme_light() + # change the look
  stat_smooth(method = "lm", se = FALSE) + # add a linear regression line without a confidence band
  geom_text_repel(aes(label = Name), # label each point with the player name, avoiding overlaps
                  size = 3.5)
p
## `geom_smooth()` using formula = 'y ~ x'
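To put a number on what the plot shows, one option is Pearson's correlation coefficient, together with the straight line that `stat_smooth(method = "lm")` fits. The sketch below retypes the two columns from the table so that it runs on its own; with the downloaded data you could use `data_frame_cl$games` and `data_frame_cl$goals` instead. The names `games`, `goals`, `r` and `fit` are only for illustration.

```r
# Columns retyped from the table printed earlier (illustrative copies)
games <- c(183, 159, 110, 145, 142, 73, 100, 137, 112, 81,
           58, 124, 79, 65, 92, 89, 79, 75, 93, 41)
goals <- c(141, 125, 86, 86, 71, 60, 59, 52, 51, 50,
           49, 49, 47, 47, 44, 43, 41, 40, 39, 36)

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear relation
r <- cor(games, goals)
r

# The same kind of linear model that stat_smooth(method = "lm") draws
fit <- lm(goals ~ games)
coef(fit)  # intercept and slope (extra goals per extra game)
```

A positive slope means that, among these 20 players, those who played more games also tended to score more goals.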
We have made the last graph, so we are done! Did it all work? If so, congratulations! If not, try troubleshooting, for example by googling the error message.