# Clearing workspace  
rm(list = ls()) # Clear environment 
gc() # Clear unused memory
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 523038 28.0    1164148 62.2   660491 35.3
## Vcells 950402  7.3    8388608 64.0  1769514 13.6
cat("\f") # Clear console

Loading Data

df <- read.csv('./2022_world_cup_squads.csv') # Loading World Cup squad data

Goals and Positions

# LM Goals Position
lm.goal.pos <- lm(df$Goals~df$Position)
summary(lm.goal.pos)
## 
## Call:
## lm(formula = df$Goals ~ df$Position)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.714  -3.905  -1.633   0.367 104.286 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.6333     0.5701   2.865  0.00427 ** 
## df$PositionForward     11.0810     0.8884  12.473  < 2e-16 ***
## df$PositionGoalkeeper  -1.6333     1.1006  -1.484  0.13819    
## df$PositionMidfielder   2.2714     0.8040   2.825  0.00484 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.367 on 827 degrees of freedom
## Multiple R-squared:  0.1901, Adjusted R-squared:  0.1872 
## F-statistic: 64.72 on 3 and 827 DF,  p-value: < 2.2e-16
# Plot Goals Positions
par(mfrow = c(2,2))
plot(lm.goal.pos)

Goals and Caps (Games Played)

# LM Goals Caps
lm.goal.cap <- lm(df$Goals~df$Caps)
summary(lm.goal.cap)
## 
## Call:
## lm(formula = df$Goals ~ df$Caps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.335  -2.827   0.230   1.475  82.940 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.661816   0.404457  -4.109 4.37e-05 ***
## df$Caps      0.187026   0.008415  22.224  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.23 on 829 degrees of freedom
## Multiple R-squared:  0.3734, Adjusted R-squared:  0.3726 
## F-statistic: 493.9 on 1 and 829 DF,  p-value: < 2.2e-16
# Plotting Goals Caps
par(mfrow = c(2,2))
plot(lm.goal.cap)

Goals and Age

# LM Goals Age
lm.goal.age <- lm(df$Goals~df$Age)
summary(lm.goal.age)
## 
## Call:
## lm(formula = df$Goals ~ df$Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.514  -4.478  -1.972   1.040 104.745 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -15.6066     2.2035  -7.083 3.02e-12 ***
## df$Age        0.7530     0.0807   9.332  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.89 on 829 degrees of freedom
## Multiple R-squared:  0.09506,    Adjusted R-squared:  0.09397 
## F-statistic: 87.08 on 1 and 829 DF,  p-value: < 2.2e-16
# Plotting Goals Age
par(mfrow = c(2,2))
plot(lm.goal.age)

2 Talk about what you find in a few lines, i.e. interpret a few slopes. Is the sign in the expected direction, and is the magnitude meaningful? What about the statistical significance?

Relationship Between Goals and Position: Positions are separated into four categories: Defender, Forward, Goalkeeper, and Midfielder, and these appear as separate levels in the plot and in the regression output. Because Defender is the baseline level, the intercept (1.63) is the mean number of goals for defenders, and each coefficient is the difference from that baseline: forwards average about 11 more goals than defenders, midfielders about 2.3 more, and goalkeepers about 1.6 fewer, i.e. essentially zero goals. The forward and midfielder differences are statistically significant (p < 0.01), while the goalkeeper difference is not. Forwards scoring the most and goalkeepers scoring essentially none is exactly what we would expect.
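
As a quick sanity check on these group differences, here is a sketch of the mean goals per position; the Defender mean should match the intercept above.

# Mean goals by position (sketch); Defender mean should equal the intercept
aggregate(Goals ~ Position, data = df, FUN = mean)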

Relationship between goals and games played (Caps): The relationship between games played and goals scored looks positive and moderately strong. The more games a player has played in their career, the more goals they tend to have scored, which makes sense because better players generally have longer careers. The estimated slope of 0.187 means roughly one additional goal for every five or so extra caps, and it is highly statistically significant (p < 2e-16). The calculated correlation is 0.6110, confirming a moderate positive association, though we also need to keep in mind that some of these players are defenders who may have very long careers without scoring many goals. With an adjusted R-squared of 0.3726, caps explain a meaningful but far from complete share of the variation in goals.

cor(x = df$Caps,
    y = df$Goals)
## [1] 0.6110264

Relationship between goals and age: Goals and age do not appear to be strongly related. The calculated correlation is a positive but weak 0.3083. The slope of 0.75 goals per year of age is statistically significant, yet the model explains less than 10% of the variation in goals (adjusted R-squared 0.094): as a player gets older, it does not necessarily mean they will score more goals.

cor(x = df$Age,
    y = df$Goals)
## [1] 0.3083152

3 More importantly, interpret the residuals.

Goals and Positions:

par(mfrow = c(2,2))
plot(lm.goal.pos)

Residual vs Fitted: I have reattached the residual diagnostic plots here. The first plot tells us whether a linear model is appropriate for the data: we want the residuals to be scattered both above and below zero with no pattern across the fitted values. We can clearly see this is not the case for goals versus position, which tells us the relationship is not well captured by this linear model.
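
The same panel can be rebuilt by hand, which makes explicit what is being plotted (a sketch using base R):

# Residuals against fitted values, with a reference line at zero (sketch)
plot(fitted(lm.goal.pos), resid(lm.goal.pos),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)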

QQ Plot: The second plot, the QQ plot, tells us whether the residuals are normally distributed by comparing their quantiles with those of an actual normal distribution. If the residuals were normal, most points would sit close to the reference line, with roughly 68% of the standardized residuals falling within one standard deviation of zero on the theoretical-quantile axis. In our case the points pull well away from the line, many standardized residuals extend past two standard deviations, and the upper tail contains some extreme values. The residuals do not appear to be normally distributed.
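
For reference, the same check can be produced directly (a sketch):

# Compare residual quantiles against a theoretical normal distribution (sketch)
qqnorm(resid(lm.goal.pos))
qqline(resid(lm.goal.pos))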

Scale-Location: The third plot, the scale-location plot, shows whether the residuals are spread equally across the fitted values, which is a check of homoskedasticity. Our red line trends upward and the residuals become more spread out as the fitted values increase, so the spread is clearly not constant.
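
A more formal check of constant variance is the Breusch-Pagan test; this sketch assumes the lmtest package is installed, and a small p-value would agree with the visual impression of non-constant spread.

# Breusch-Pagan test for heteroskedasticity (requires the lmtest package)
library(lmtest)
bptest(lm.goal.pos)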

Residuals vs Leverage Plot: The residuals vs leverage plot highlights influential observations that have a large effect on the fitted model. Points that combine high leverage with large residuals can pull the regression line toward themselves, and this plot helps identify such outliers. For goals versus position we can see that a handful of players, mostly high-scoring forwards, stand out as far more influential on the model than the rest of the data.
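
One common way to quantify that influence is Cook's distance; this sketch flags observations above the rough 4/n rule of thumb.

# Cook's distance highlights observations with outsized influence (sketch)
cd.pos <- cooks.distance(lm.goal.pos)
which(cd.pos > 4 / length(cd.pos))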

Goals and Caps (Games Played):

par(mfrow = c(2,2))
plot(lm.goal.cap)

In our first graph, we can see some heteroskedasticity, since the variability of the residuals increases as games played increases. The residuals start off very tight but become more spread out as the fitted values get larger and move away from zero.
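
If we wanted inference that allows for this, one option (a sketch, assuming the sandwich and lmtest packages are installed) is heteroskedasticity-consistent standard errors:

# Heteroskedasticity-robust (HC1) standard errors for the caps model (sketch)
library(lmtest)
library(sandwich)
coeftest(lm.goal.cap, vcov. = vcovHC(lm.goal.cap, type = "HC1"))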

In our QQ graph, similar to the first model, the residuals do not appear to be normally distributed. The points stretch well past two standard deviations on the theoretical-quantile axis, and while the middle of the distribution follows the line reasonably well, the right side of the graph shows more extreme values.

In our third graph, the red line has an upward trend and the residuals become more spread out as the fitted values increase, so again the spread is not equal.

In our fourth graph, we see outliers that have a large effect on the linear model. We should look at how to handle these, as sketched below.
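
Rather than deleting points by eye, one option is to refit the model excluding observations with large Cook's distance. This is only a sketch (it assumes no missing values in Goals or Caps), and the dropped points are real players, typically long-career high scorers, so any trimmed fit should be reported alongside the full one.

# Refit the caps model without high-influence observations (sketch)
cd.cap <- cooks.distance(lm.goal.cap)
keep   <- cd.cap <= 4 / length(cd.cap)   # rows aligned with df, assuming no NAs
lm.goal.cap.trim <- lm(Goals ~ Caps, data = df[keep, ])
summary(lm.goal.cap.trim)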

Goals and Age:

par(mfrow = c(2,2))
plot(lm.goal.age)

In our first graph, most of the residuals sit above the zero line, and there is a little bit of heteroskedasticity as age increases.

Similar to the other models, the QQ graph shows that the residuals do not follow a normal distribution, for the same reasons as above.

On our scale-location graph, there is less of an upward trend than in the other two models and the residuals are more evenly spread, but there are still far more residuals above the line than below it.

Our fourth graph also shows some large outliers that have a large effect on our model.

4 What are the Gauss-Markov assumptions, and did they hold?

The Gauss-Markov assumptions, also known as the Gauss-Markov conditions, are as follows.

Linearity: The relationship between the dependent and independent variables is linear in the parameters. From our graphs and calculations we have a mix: it roughly holds for games played versus goals but not for position versus goals.

Independence: The errors are independent across observations; in other words, one observation's error is unrelated to another's. This does not appear to hold.

Homoskedasticity: The variance of the errors (residuals) is constant across all levels of the independent variables. This does not hold, as the residual spread is clearly not constant.

No perfect multicollinearity: The independent variables are not perfectly correlated with one another. This holds, as the predictors are not perfectly correlated (a quick check is sketched below).
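
Since each model here uses a single predictor, perfect multicollinearity is not really at stake, but a quick sketch of how related the two numeric predictors are:

# Correlation between the two numeric predictors (sketch)
cor(df$Age, df$Caps)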

Zero conditional mean (expected value of the errors): The expected value of the error term is zero for every value of the independent variables. This does not appear to hold.

Normality: The error term is normally distributed, which is what the QQ plot examines. This does not hold, as seen in the QQ plots and the residuals section; a formal check is sketched below. (Strictly speaking, normality is needed for exact inference rather than for the Gauss-Markov theorem itself.)
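
As a complement to the QQ plots, a formal normality test on the residuals (a sketch using base R's Shapiro-Wilk test; a small p-value rejects normality):

# Shapiro-Wilk test of normality on the position-model residuals (sketch)
shapiro.test(resid(lm.goal.pos))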

5 What does "OLS is BLUE" mean?

OLS stands for Ordinary Least Squares, the estimator that chooses the regression coefficients to minimize the sum of squared differences between the observed and predicted values. BLUE stands for Best Linear Unbiased Estimator. "OLS is BLUE" is the statement of the Gauss-Markov theorem: when the assumptions above hold, OLS has the smallest variance ("best") among all estimators that are linear in the data and unbiased.
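
To make the "least squares" part concrete, the estimator can be computed directly from the normal equations; this sketch (assuming no missing values in Goals or Caps) should reproduce coef(lm.goal.cap).

# OLS by hand: beta-hat = (X'X)^(-1) X'y minimizes the sum of squared residuals
X <- cbind(1, df$Caps)   # design matrix: intercept column plus caps
y <- df$Goals
beta.hat <- solve(t(X) %*% X, t(X) %*% y)
beta.hat                 # should match coef(lm.goal.cap)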