# Clearing workspace
rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 523038 28.0 1164148 62.2 660491 35.3
## Vcells 950402 7.3 8388608 64.0 1769514 13.6
cat("\f") # Clear the console
mydata <- read.csv("./2022_world_cup_squads.csv", fileEncoding = "latin1") # Read the World Cup squad data; the file appears to be Latin-1 encoded
# Preview the first rows
head(mydata)
## ID Team Position Player Age Caps Goals WC.Goals
## 1 1 Ecuador Goalkeeper Hernán Galíndez 35 12 0 0
## 2 2 Ecuador Defender Félix Torres 25 17 2 0
## 3 3 Ecuador Defender Piero Hincapié 20 21 1 0
## 4 4 Ecuador Defender Robert Arboleda 31 33 2 0
## 5 5 Ecuador Midfielder José Cifuentes 23 11 0 0
## 6 6 Ecuador Defender William Pacho 21 0 0 0
## League Club
## 1 Ecuador Aucas
## 2 Mexico Santos Laguna
## 3 Germany Bayer Leverkusen
## 4 Brazil São Paulo
## 5 United States Los Angeles FC
## 6 Belgium Antwerp
In my model, I will use Goals as the response (dependent, y) variable. I will use Position, Caps (games played), and Age as the predictor (independent, x) variables.
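The three predictors can also be entered into a single model rather than one at a time. The following is a minimal sketch on simulated stand-in data: the data frame `toy` and every value in it are invented for illustration and are not taken from the World Cup squads file.

```r
# Hypothetical stand-in data; values are simulated, not the squads data
set.seed(1)
n <- 200
toy <- data.frame(
  Position = sample(c("Defender", "Forward", "Goalkeeper", "Midfielder"),
                    n, replace = TRUE),
  Caps     = rpois(n, lambda = 30),
  Age      = sample(18:40, n, replace = TRUE)
)
toy$Goals <- rpois(n, lambda = ifelse(toy$Position == "Forward", 8, 2))

# One model with all three predictors; lm() converts the character
# column Position into dummy variables automatically
fit_all <- lm(Goals ~ Position + Caps + Age, data = toy)
coef_names <- names(coef(fit_all))
print(coef_names)
```

With the real data, the same formula could be used with `data = df` once the data frame below has been created.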
# Creating Data Frame: keep the response and the three predictors
df <- mydata[, c("Goals", "Position", "Caps", "Age")]
# Preview the Data Frame
head(df)
## Goals Position Caps Age
## 1 0 Goalkeeper 12 35
## 2 2 Defender 17 25
## 3 1 Defender 21 20
## 4 2 Defender 33 31
## 5 0 Midfielder 11 23
## 6 0 Defender 0 21
# Examining the Charts
plot(df)
# Load GGally
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Generate Pairs Plot
ggpairs(df)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Load Stargazer and Psych
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# Looking at our Descriptive Statistics
describe(df)
## vars n mean sd median trimmed mad min max range skew kurtosis
## Goals 1 831 4.71 10.39 1 2.25 1.48 0 117 117 4.80 32.08
## Position* 2 831 2.45 1.25 2 2.44 1.48 1 4 3 0.12 -1.62
## Caps 3 831 34.04 33.95 24 28.32 26.69 0 191 191 1.50 2.16
## Age 4 831 26.97 4.25 27 26.84 4.45 18 40 22 0.30 -0.35
## se
## Goals 0.36
## Position* 0.04
## Caps 1.18
## Age 0.15
# Summary
summary(df)
## Goals Position Caps Age
## Min. : 0.000 Length:831 Min. : 0.00 Min. :18.00
## 1st Qu.: 0.000 Class :character 1st Qu.: 8.00 1st Qu.:24.00
## Median : 1.000 Mode :character Median : 24.00 Median :27.00
## Mean : 4.705 Mean : 34.04 Mean :26.97
## 3rd Qu.: 5.000 3rd Qu.: 47.00 3rd Qu.:30.00
## Max. :117.000 Max. :191.00 Max. :40.00
# LM Goals Position
lm.goal.pos <- lm(df$Goals~df$Position)
summary(lm.goal.pos)
##
## Call:
## lm(formula = df$Goals ~ df$Position)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.714 -3.905 -1.633 0.367 104.286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.6333 0.5701 2.865 0.00427 **
## df$PositionForward 11.0810 0.8884 12.473 < 2e-16 ***
## df$PositionGoalkeeper -1.6333 1.1006 -1.484 0.13819
## df$PositionMidfielder 2.2714 0.8040 2.825 0.00484 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.367 on 827 degrees of freedom
## Multiple R-squared: 0.1901, Adjusted R-squared: 0.1872
## F-statistic: 64.72 on 3 and 827 DF, p-value: < 2.2e-16
# Plot Goals Positions
par(mfrow = c(2,2))
plot(lm.goal.pos)
# LM Goals Caps
lm.goal.cap <- lm(df$Goals~df$Caps)
summary(lm.goal.cap)
##
## Call:
## lm(formula = df$Goals ~ df$Caps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.335 -2.827 0.230 1.475 82.940
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.661816 0.404457 -4.109 4.37e-05 ***
## df$Caps 0.187026 0.008415 22.224 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.23 on 829 degrees of freedom
## Multiple R-squared: 0.3734, Adjusted R-squared: 0.3726
## F-statistic: 493.9 on 1 and 829 DF, p-value: < 2.2e-16
# Plotting Goals Caps
par(mfrow = c(2,2))
plot(lm.goal.cap)
# LM Goals Age
lm.goal.age <- lm(df$Goals~df$Age)
summary(lm.goal.age)
##
## Call:
## lm(formula = df$Goals ~ df$Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.514 -4.478 -1.972 1.040 104.745
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.6066 2.2035 -7.083 3.02e-12 ***
## df$Age 0.7530 0.0807 9.332 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.89 on 829 degrees of freedom
## Multiple R-squared: 0.09506, Adjusted R-squared: 0.09397
## F-statistic: 87.08 on 1 and 829 DF, p-value: < 2.2e-16
# Plotting Goals Age
par(mfrow = c(2,2))
plot(lm.goal.age)
Relationship Between Goals and Position: Positions are separated into four categories: Forward, Midfielder, Defender, and Goalkeeper. These appear as separate levels on our scatter plot, ordered alphabetically: 1. Defender, 2. Forward, 3. Goalkeeper, and 4. Midfielder. We can see that goalkeepers are at 0 while forwards have the most goals. This makes sense and is what we would expect.
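The position coefficients can be read as differences in average goals from the reference level. A short sketch, using only the estimates printed by `summary(lm.goal.pos)` above (the object names `b`, `mean_goals`, and `pos_levels` are introduced here for illustration):

```r
# Coefficients copied from summary(lm.goal.pos) above
b <- c(Intercept = 1.6333, Forward = 11.0810,
       Goalkeeper = -1.6333, Midfielder = 2.2714)

# Defender is the reference level, so its mean is the intercept;
# every other mean is the intercept plus that position's coefficient
mean_goals <- c(
  Defender   = unname(b["Intercept"]),
  Forward    = unname(b["Intercept"] + b["Forward"]),
  Goalkeeper = unname(b["Intercept"] + b["Goalkeeper"]),
  Midfielder = unname(b["Intercept"] + b["Midfielder"])
)
print(round(mean_goals, 4))

# R orders the levels of a character predictor alphabetically
pos_levels <- levels(factor(c("Goalkeeper", "Midfielder", "Forward", "Defender")))
print(pos_levels)
```

The goalkeeper coefficient cancels the intercept exactly, which is why goalkeepers sit at 0 goals on average.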
Relationship between goals and Games Played (Caps): The relationship between games played and goals scored looks positive and moderately strong. This tells us that the more games a player has played in their career, the more goals they have scored. This makes sense, as better players are likely to have longer careers. The calculated correlation is 0.6110, which confirms a moderate correlation, but we also need to factor in that some of these players are defenders who may have a very long career and not score many goals. With an Adjusted R-squared of 0.3726, caps explain only about 37% of the variation in goals, so a simple linear regression on caps alone may not be the best fit.
cor(x = df$Caps,
y = df$Goals)
## [1] 0.6110264
Relationship between goals and age: The relationship between goals and age looks weak. The calculated correlation is a positive r value of 0.3083, which suggests a weak relationship. This tells us that getting older does not necessarily mean a player will score more goals.
cor(x = df$Age,
y = df$Goals)
## [1] 0.3083152
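In a regression with a single numeric predictor, R-squared is the square of the correlation, so the two `cor()` results can be checked against the model summaries above. A small sketch using only the printed values (the variable names are introduced here for illustration):

```r
# Correlations copied from the cor() output above
r_caps <- 0.6110264
r_age  <- 0.3083152

# With one numeric predictor, R-squared equals the squared correlation
r2_caps <- r_caps^2
r2_age  <- r_age^2
print(round(r2_caps, 4))  # compare with Multiple R-squared of lm.goal.cap
print(round(r2_age, 5))   # compare with Multiple R-squared of lm.goal.age
```

Both values should agree with the Multiple R-squared lines reported for `lm.goal.cap` and `lm.goal.age`.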
Goals and Positions:
par(mfrow = c(2,2))
plot(lm.goal.pos)
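Residuals vs Fitted: I have replotted our residual graphs here. The first graph tells us whether we are using the correct model for our data set. We are looking for residuals scattered evenly above and below zero across the whole graph. We can quickly see this is not the case for goals against position: because position is categorical, the fitted values take only four values, and the residual summary above runs from -12.714 to 104.286, so the residuals are far from balanced. This tells us a straight-line model is a poor description of this relationship.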
QQ Plot: The second graph, the QQ plot, tells us whether the residuals are normally distributed by comparing their quantiles with those of an actual normal distribution. The x axis shows theoretical quantiles in standard deviations from the mean, and in a normal distribution about 68% of the data lies within 1 standard deviation of the mean in either direction. If the residuals were normal, the points would stay close to the reference line along its whole length. In our example they do not: the points pull away from the line, and at the upper end of our data we have some extreme values. The residuals do not appear to be normally distributed.
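To make the comparison concrete, the sketch below checks the 68% figure and draws QQ plots for two simulated samples, one normal and one right-skewed. The samples are hypothetical and are not the model residuals.

```r
# Share of a normal distribution within one standard deviation of the mean
within_1sd <- pnorm(1) - pnorm(-1)
print(round(within_1sd, 4))

# Simulated residual-like samples (hypothetical, not the model residuals)
set.seed(42)
normal_sample <- rnorm(500)
skewed_sample <- rexp(500) - 1   # right-skewed, shifted to have mean near 0

# Compare sample quantiles with theoretical normal quantiles
par(mfrow = c(1, 2))
qqnorm(normal_sample, main = "Normal sample"); qqline(normal_sample)
qqnorm(skewed_sample, main = "Skewed sample"); qqline(skewed_sample)
```

The normal sample should hug the reference line, while the skewed sample should bend away from it at the upper end, which is the pattern described above.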
Scale-Location: Our third graph, the scale-location plot, shows whether the residuals are spread equally across our predictions, which is a check for homoskedasticity. We can see that the red line has an upward trend and that the residuals get more spread out as the fitted values increase. The residuals are not spread equally.
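The quantity on the vertical axis of a scale-location plot is the square root of the absolute standardized residuals. A minimal sketch on simulated data whose noise grows with the predictor (all values are hypothetical, chosen only to produce the upward trend):

```r
# Simulated data whose noise grows with x (hypothetical, for illustration only)
set.seed(7)
x <- runif(300, min = 0, max = 100)
y <- 0.2 * x + rnorm(300, sd = 0.1 * x + 0.5)
fit <- lm(y ~ x)

# The scale-location plot shows sqrt(|standardized residuals|) against fitted values
scale_loc <- sqrt(abs(rstandard(fit)))

# A clearly positive correlation means the spread grows with the fitted values
spread_trend <- cor(fitted(fit), scale_loc)
print(spread_trend)
```

A value near zero would be consistent with constant variance; a positive value matches the upward red line described above. This is a rough numeric summary, not a formal test.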
Residuals vs Leverage Plot: The residuals vs leverage plot shows influential data points, those that have a big effect on the linear model. Data points that are far away from the rest of our data can have a large effect on the fit. Points on this graph that sit far from the others combine a lot of leverage with large residuals, and they help us identify outliers. For goals against position, modelling all positions together gives us many more outliers, which affect the model itself, than we would see by looking at forward goals alone.
Goals and Caps (Games Played):
par(mfrow = c(2,2))
plot(lm.goal.cap)
In our first graph, we can see that we have some heteroscedasticity, since the variability increases as games played increases. The residuals start off very tight, but as the fitted values get larger and further from 0, they get more spread apart.
In our QQ graph, as with the position model, the residuals do not appear to be normally distributed. They follow the line through the middle of the plot but show more extreme values on the right side of the graph.
In our third graph, we can see that the red line has an upward trend and that the residuals get more spread out as the fitted values increase. The residuals are not spread equally.
In our fourth graph, we see that we have outliers that have a large effect on our linear model. We should look at removing these.
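One way to identify such points is Cook's distance, which combines leverage and residual size. The sketch below uses simulated data with one deliberately extreme observation. The data, the object names, and the 4/n cutoff are assumptions for illustration, not part of the original analysis.

```r
# Simulated data with one deliberately extreme point (hypothetical values)
set.seed(3)
x <- c(rnorm(50, mean = 30, sd = 10), 190)
y <- c(0.2 * x[1:50] + rnorm(50), 117)
fit <- lm(y ~ x)

# Cook's distance combines leverage and residual size for each observation
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / length(cd))   # 4/n is a common rule of thumb, not a hard rule
print(flagged)

# Refit without the flagged points and compare slopes
fit_trim <- lm(y ~ x, subset = -flagged)
print(c(all = coef(fit)[["x"]], trimmed = coef(fit_trim)[["x"]]))
```

Comparing the two slopes shows how much a few influential points can move the fit. Whether removing them is justified depends on whether they are errors or genuine observations.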
Goals and Age:
par(mfrow = c(2,2))
plot(lm.goal.age)
In our first graph, the residuals are not balanced around the 0 line: the residual summary shows a median of -1.972 and a maximum of 104.745, so most points sit a little below the line while a few sit far above it. There is also a little heteroscedasticity as age increases.
As with the other models, our QQ graph does not follow a normal distribution for our residuals, for the same reasons as above.
Our scale-location graph has less of an upward trend than the other two, with residuals more evenly spread, but there are still a lot more residuals above our line than below.
Our fourth graph also shows that there are some large outliers having a large effect on our model.
The Gauss-Markov assumptions, also known as conditions, are as follows.
Linearity: The relationship between the dependent and independent variables is linear. As we saw from our graphs and calculations, the results are mixed. It holds for games played to goals but not for position to goals.
Independence: The errors are independent across all observations. In summary, one error is not related to another. This does not hold.
Homoscedasticity: The variance of the errors should be constant across all levels of the independent variables. This does not hold, as the scale-location plots show the spread of the residuals increasing with the fitted values.
No Perfect Multicollinearity: The independent variables should not be perfectly correlated with one another. This holds, as they are not perfectly correlated.
Zero Conditional Mean: The expected value of the errors is zero for all values of the independent variables. This does not hold.
Normality: The error term is normally distributed. Strictly, this is not one of the Gauss-Markov conditions; it is an additional assumption used for exact t and F tests. It does not hold, as seen from our QQ graphs and residual section.
OLS stands for Ordinary Least Squares. It estimates the parameters of a linear regression model by minimizing the sum of squared differences between the observed and predicted values. BLUE stands for Best Linear Unbiased Estimator: when the Gauss-Markov assumptions hold, the OLS estimator has the smallest variance among all linear unbiased estimators.
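The least-squares idea can be checked directly: minimizing the sum of squared differences leads to the normal equations (X'X) b = X'y, and solving them should reproduce what `lm()` reports. A minimal sketch on simulated data (all values and names are hypothetical):

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(10)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# Design matrix with an intercept column
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# OLS solves the normal equations (X'X) b = X'y,
# which minimizes sum((y - X %*% b)^2)
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))

# lm() should give the same estimates
beta_lm <- coef(lm(y ~ x1 + x2))
print(rbind(normal_equations = beta_hat, lm = beta_lm))
```

The two rows should match up to floating-point error. `lm()` uses a QR decomposition internally rather than forming X'X, which is numerically more stable but gives the same estimates here.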