Import the data into R.
mydata <- read.table("./Football.csv", header=TRUE, sep=";", dec=".")
head(mydata)
## Rank Name Position Age Value Club
## 1 1 Kylian Mbappe 4 22 144 Paris Saint-Germain
## 2 2 Erling Haaland 4 21 135 Borussia Dortmund
## 3 3 Harry Kane 4 28 108 Tottenham Hotspur
## 4 4 Jack Grealish 1 26 90 Manchester City
## 5 5 Mohamed Salah 2 29 90 Liverpool FC
## 6 6 Romelu Lukaku 4 28 90 Chelsea FC
## Games_played Goals Assists Card_yellow
## 1 16 7 11 3
## 2 10 13 4 1
## 3 16 7 2 2
## 4 15 2 3 1
## 5 15 15 6 1
## 6 11 4 1 0
Description of the variables:
- Rank: Rank of the player by estimated value.
- Name: Name of the player.
- Position: Player position (1: midfielder, 2: winger, 3: defender, 4: striker, 5: goalkeeper).
- Age: Age of the player.
- Value: Estimated value of the player in million EUR for the year 2021.
- Club: The club where the player plays.
- Games_played: Number of games played in 2021.
- Goals: Number of goals scored in 2021.
- Assists: Number of assists given in 2021.
- Card_yellow: Number of yellow cards received in 2021.
# Convert the numeric position codes into a labelled factor
mydata$Position <- factor(mydata$Position,
                          levels = c(1, 2, 3, 4, 5),
                          labels = c("midfielder", "winger", "defender", "striker", "goalkeeper"))
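As a small self-contained illustration of this recoding, the sketch below applies the same levels and labels to a hypothetical vector of position codes. The vector `codes` is made up for the example and is not taken from Football.csv.

```r
# Hypothetical position codes, for illustration only (not from Football.csv)
codes <- c(4, 4, 1, 2, 3, 5, 1)

# Same mapping as in the analysis: numeric code -> position label
position <- factor(codes,
                   levels = c(1, 2, 3, 4, 5),
                   labels = c("midfielder", "winger", "defender",
                              "striker", "goalkeeper"))

levels(position)   # label order follows the order given in `levels`
table(position)    # counts per position for the toy vector
```

Because the labels are attached in the order of `levels`, code 4 becomes "striker" regardless of where it appears in the data.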
Display the frequency distribution of the players’ positions.
library(ggplot2)
# Bar chart: one bar per position, height = number of players
ggplot(mydata, aes(x = Position)) +
  geom_bar() +
  ylab("Frequency") +
  xlab("Position")
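The bar heights can also be read as numbers. The sketch below rebuilds the frequency table from the position counts that `table(mydata$Position)` reports later in this document and converts them to shares. The object names `counts`, `n_players`, and `shares` are introduced only for this example.

```r
# Position counts as reported by table(mydata$Position) in this document
counts <- c(midfielder = 48, winger = 9, defender = 23,
            striker = 14, goalkeeper = 6)

n_players <- sum(counts)        # total number of players
shares <- counts / n_players    # relative frequency of each position

round(shares, 2)
names(which.max(counts))        # most frequent position
```

With the data frame loaded, `prop.table(table(mydata$Position))` should give the same shares directly.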
Draw a scatterplot between the number of games played and the number of goals scored and explain it.
library(car)
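# Scatterplot with car::scatterplot (adds a regression line and marginal boxplots)
scatterplot(y = mydata$Goals,
            x = mydata$Games_played,
            ylab = "Goals scored",
            xlab = "Games played",
            smooth = FALSE)
# The same scatterplot with ggplot2
library(ggplot2)
ggplot(mydata, aes(x = Games_played, y = Goals)) +
  geom_point(color = "chocolate1")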
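The relationship between the two variables is positive: players who played more games tend to have scored more goals. The plot shows an association only; it does not show that playing more games causes a player to score more.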
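A correlation coefficient puts a number on the direction and strength of such a relationship. The sketch below is a minimal example on hypothetical values: the `toy` data frame is invented for illustration and is not the football data. With the real data, the same calls would use `mydata$Games_played` and `mydata$Goals`.

```r
# Hypothetical data for illustration only; not taken from Football.csv
toy <- data.frame(Games_played = c(5, 8, 10, 12, 15, 16),
                  Goals        = c(1, 2, 4, 3, 6, 7))

# Pearson correlation: the sign gives the direction, the magnitude the strength
r <- cor(toy$Games_played, toy$Goals)

# One-sided test of H0: rho = 0 against H1: rho > 0
res <- cor.test(toy$Games_played, toy$Goals, alternative = "greater")

r
res$p.value
```

A positive `r` together with a small p-value would support the visual impression of a positive relationship.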
Estimate the average number of yellow cards for defenders. Can you say that the average number of yellow cards for strikers is lower?
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:onewaytests':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## The following object is masked from 'package:car':
##
## logit
describeBy(mydata$Card_yellow, group = mydata$Position)
##
## Descriptive statistics by group
## group: midfielder
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 48 1.56 1.44 1 1.45 1.48 0 5 5 0.52 -0.9
## se
## X1 0.21
## ----------------------------------------------------
## group: winger
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 9 0.44 0.73 0 0.44 0 0 2 2 1.04 -0.5
## se
## X1 0.24
## ----------------------------------------------------
## group: defender
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 23 1.91 1.24 2 1.84 1.48 0 5 5 0.7 0
## se
## X1 0.26
## ----------------------------------------------------
## group: striker
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 14 1.64 1.28 1.5 1.58 1.48 0 4 4 0.22 -1.29
## se
## X1 0.34
## ----------------------------------------------------
## group: goalkeeper
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 6 1 0.89 1 1 1.48 0 2 2 0 -1.96
## se
## X1 0.37
The average number of yellow cards for defenders is 1.91. Let mu denote the mean number of yellow cards for strikers.
H0: mu = 1.91
H1: mu < 1.91
This is a one-sided test.
# One-sample, one-sided t-test on the strikers' yellow cards
t.test(mydata[mydata$Position == "striker", ]$Card_yellow,
       mu = 1.91,
       alternative = "less")
##
## One Sample t-test
##
## data: mydata[mydata$Position == "striker", ]$Card_yellow
## t = -0.78247, df = 13, p-value = 0.224
## alternative hypothesis: true mean is less than 1.91
## 95 percent confidence interval:
## -Inf 2.247475
## sample estimates:
## mean of x
## 1.642857
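The t statistic reported above can be reproduced by hand from the striker summary statistics, using t = (sample mean - mu0) / (s / sqrt(n)). The sketch below plugs in the standard deviation as rounded in the `describeBy` output (1.28), so it matches the `t.test` result only approximately. The variable names are introduced for this example.

```r
# Striker summary statistics taken from the output in this document
n     <- 14         # number of strikers
x_bar <- 1.642857   # sample mean of Card_yellow
s     <- 1.28       # sample sd, rounded to two decimals by describeBy
mu0   <- 1.91       # hypothesised mean (the defenders' average)

se     <- s / sqrt(n)              # standard error of the mean
t_stat <- (x_bar - mu0) / se       # one-sample t statistic
p_less <- pt(t_stat, df = n - 1)   # lower-tail p-value for H1: mu < mu0

c(t = t_stat, df = n - 1, p = p_less)
```

The lower tail is used because the alternative hypothesis is "less"; a two-sided test would double this probability.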
table(mydata$Position)
##
## midfielder winger defender striker goalkeeper
## 48 9 23 14 6
We cannot reject H0 at the 5% significance level (p-value = 0.224). Based on this sample, we cannot say that the average number of yellow cards for strikers is lower than 1.91, the average for defenders.
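The one-sample test treats 1.91 as a known constant, although it is itself estimated from 23 defenders. One possible alternative, not used in the analysis above, is Welch's two-sample t-test. The sketch below computes it from the rounded group summaries in the `describeBy` output, so the numbers are approximate. With the raw data, calling `t.test()` on the striker and defender values with `alternative = "less"` would be the usual route.

```r
# Rounded group summaries from the describeBy output
n_s <- 14; mean_s <- 1.64; sd_s <- 1.28   # strikers
n_d <- 23; mean_d <- 1.91; sd_d <- 1.24   # defenders

v_s <- sd_s^2 / n_s   # estimated variance of the striker mean
v_d <- sd_d^2 / n_d   # estimated variance of the defender mean

# Welch t statistic for H1: mean(strikers) < mean(defenders)
t_welch <- (mean_s - mean_d) / sqrt(v_s + v_d)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v_s + v_d)^2 / (v_s^2 / (n_s - 1) + v_d^2 / (n_d - 1))

p_welch <- pt(t_welch, df = df_welch)   # lower-tail p-value

c(t = t_welch, df = df_welch, p = p_welch)
```

In this sketch the p-value also stays above 0.05, which is in line with the one-sample result.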