Homework 3 Correlation and Regression

IMPORTING DATA

data21 <- read.table("./players_21.csv", fill=TRUE, header=TRUE, sep=",")

data <- data21[,c(3,5,9,13,14,15,18,34,35,36,37,38,39)]

data[data == ''] <- NA

#Dropping rows containing missing values
data <- data %>% drop_na()

#Convert character objects to integer objects / factors
data$short_name <- factor(data$short_name)
data$age <- as.integer(data$age)
data$overall <- as.integer(data$overall)
data$nationality <- factor(data$nationality)
data$potential <- as.integer(data$potential)
data$value_eur <- as.integer(data$value_eur)
data$preferred_foot <- factor(data$preferred_foot)
data$pace <- as.integer(data$pace)
data$shooting <- as.integer(data$shooting) 
data$passing <- as.integer(data$passing)
data$dribbling <- as.integer(data$dribbling)
data$defending <- as.integer(data$defending)
data$physic <- as.integer(data$physic)

colnames(data) <- c("Name", "Age", "Nationality", "Overall", "Potential", "Value", "PreferredFoot", "Pace", "Shooting", "Passing", "Dribbling", "Defending", "Physic")

data$Value <- data$Value/1000000

# Selecting only player from Spain (smaller sample size)
mydata <- subset(data, Nationality == "Spain")

mydata <- mydata[,c(1,2,5,7,8,10,13)]

head(mydata)

##            Name Age Potential PreferredFoot Pace Passing Physic
## 11 Sergio Ramos  34        89         Right   71      76     85
## 23        Pedro  32        80         Right   73      76     50
## 26 Illarramendi  30        80         Right   52      78     69
## 29     Escudero  30        80          Left   72      75     71
## 30 Mario Gaspar  29        80         Right   75      71     77
## 37       Bartra  29        80         Right   72      72     74

Unit of observation: Football Player
Sample size: 493
Source: Kaggle | FIFA 21 complete player dataset
Main goal of the data analysis (e.g., the research question): How players’ age, preferred foot, pace, passing and physic rating affect their potential rating.

DESCRIPTION OF VARIABLES

Name: Name of a player

Age: Age of a player in years

Potential: Potential rating (1-100)

PreferredFoot: The natural preference of player’s left or right foot

Pace: Pace rating (1-100)

Passing: Passing rating (1-100)

Physic: Physic rating (1-100)

round(stat.desc(mydata[,c(-1,-4)]),2)

##                   Age Potential     Pace  Passing   Physic
## nbr.val        493.00    493.00   493.00   493.00   493.00
## nbr.null         0.00      0.00     0.00     0.00     0.00
## nbr.na           0.00      0.00     0.00     0.00     0.00
## min             17.00     59.00    30.00    31.00    33.00
## max             40.00     90.00    96.00    82.00    85.00
## range           23.00     31.00    66.00    51.00    52.00
## sum          12770.00  36550.00 32882.00 30759.00 31350.00
## median          25.00     74.00    68.00    63.00    65.00
## mean            25.90     74.14    66.70    62.39    63.59
## SE.mean          0.20      0.21     0.51     0.41     0.44
## CI.mean.0.95     0.39      0.42     1.00     0.81     0.86
## var             19.13     22.20   127.93    83.58    94.01
## std.dev          4.37      4.71    11.31     9.14     9.70
## coef.var         0.17      0.06     0.17     0.15     0.15

DEFINITION OF THE MULTIPLE REGRESSION MODEL

Potential - dependent variable
Age - independent variable
PrefferedFoot - independent variable (categorical)
Pace - independent variable
Passing - independent variable
Physic - independent variable

The term player potential in FIFA refers to the upward ceiling of the card of a particular footballer and indicates how high it can reach.

Based on our data we will show the relationship between potential and listed independent variables and determine if it is statistically significant.

VISUAL PRESENTATION OF THE DATA

scatterplotMatrix(mydata[,c(3,2,5,6,7)], smooth = FALSE)

According to the scaterplot matrix, almost all independent variables (pace, passing and physic rating) are positively correlated, age being the only independent variable negatively correlated.

REGRESSION FUNCTION ESTIMATION

reg_model <- lm(Potential ~ Age + Pace + Passing + Physic + PreferredFoot,
           data = mydata)

summary(reg_model)

## 
## Call:
## lm(formula = Potential ~ Age + Pace + Passing + Physic + PreferredFoot, 
##     data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.1011  -2.3675  -0.0701   2.0496  12.9847 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        62.48176    1.81700  34.387   <2e-16 ***
## Age                -0.73540    0.04555 -16.145   <2e-16 ***
## Pace                0.02528    0.01475   1.713   0.0873 .  
## Passing             0.28381    0.01906  14.891   <2e-16 ***
## Physic              0.17027    0.01865   9.131   <2e-16 ***
## PreferredFootRight  0.69953    0.34524   2.026   0.0433 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.508 on 487 degrees of freedom
## Multiple R-squared:  0.4513, Adjusted R-squared:  0.4457 
## F-statistic: 80.13 on 5 and 487 DF,  p-value: < 2.2e-16

ASSUMPTIONS

MULTICOLINEARITY

vif(reg_model)

##           Age          Pace       Passing        Physic PreferredFoot 
##      1.586951      1.113011      1.213755      1.306848      1.018179

mean(vif(reg_model))

## [1] 1.247749

There are no strong relationships between explanatory variables - Multicolinearity assumption is met.

VIF < 5

mean(VIF) > 1

NORMALITY

# Standardized residuals (finding outliers)
mydata$StdResid <- round(rstandard(reg_model), 3)
hist(mydata$StdResid, main = "Distribution of Standardized residuals")

# SHAPIRO-WILK NORMALITY TEST
# H0: Variable is normally distributed
# H1: Variable is NOT normally distributed
shapiro.test(mydata$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.9953, p-value = 0.1421

There is normal distribution (We cannot reject H0, p > 0.05). Regression model does not violate the normality assumption.

LINEARITY

mydata$StdFittedValue <- scale(fitted(reg_model)) 

scatterplot(mydata$StdResid, mydata$StdFittedValue,
            regLine = TRUE,
            smooth = FALSE, 
            xlab = "Standardized residuals",
            ylab = "Standardized fitted values")

Based on the scatterplot, we can assume that the linearity assumption is met.

HOMOSKEDASTICITY

# Test for homoskedasticity
ols_test_breusch_pagan(reg_model)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                 Data                  
##  -------------------------------------
##  Response : Potential 
##  Variables: fitted values of Potential 
## 
##         Test Summary          
##  -----------------------------
##  DF            =    1 
##  Chi2          =    4.609884 
##  Prob > Chi2   =    0.03178819

Assumption of homoscedasticity is not met (We can reject H0, p = 0.032).

#Calculating cooks distances (units with high impact)
mydata$CooksD <- round(cooks.distance(reg_model), 3)

hist(mydata$CooksD, main = "Distribution of Cooks distances")

head(mydata[order(-mydata$CooksD),c(1,10)],10)

##              Name CooksD
## 169  Jorge Molina  0.043
## 11   Sergio Ramos  0.041
## 2222         Nino  0.036
## 124     Juan Mata  0.033
## 1920 Rubén Castro  0.027
## 859      Llorente  0.025
## 23          Pedro  0.021
## 461     Ansu Fati  0.021
## 585  Diego Castro  0.021
## 920        Brahim  0.021

We can detect some outliers.

CORRECTIONS

# Removing outliers and units with high impact
mydata <- mydata[!(mydata$StdResid < -3),]
mydata <- mydata[!(mydata$StdResid > 3),]

mydata <- mydata[!(mydata$CooksD > 0.021),]

head(mydata[order(-mydata$CooksD),c(1,10)])

##              Name CooksD
## 23          Pedro  0.021
## 461     Ansu Fati  0.021
## 585  Diego Castro  0.021
## 920        Brahim  0.021
## 7846       Tamayo  0.019
## 2910    Hugo Duro  0.018

REESTIMATED REGRESSION MODEL

reg_model <- lm(Potential ~ Age + Pace + Passing + Physic + PreferredFoot,
           data = mydata)

RECHECKING ASSUMPTIONS

MULTICOLINEARITY

vif(reg_model)

##           Age          Pace       Passing        Physic PreferredFoot 
##      1.620006      1.102971      1.217990      1.349680      1.016350

mean(vif(reg_model))

## [1] 1.2614

There are no strong relationships between explanatory variables - Multicolinearity assumption is met.

VIF < 5

mean(VIF) > 1

NORMALITY

# Standardized residuals (finding outliers)
mydata$StdResid <- round(rstandard(reg_model), 3)
hist(mydata$StdResid, main = "Distribution of Standardized residuals")

# SHAPIRO-WILK NORMALITY TEST
# H0: Variable is normally distributed
# H1: Variable is NOT normally distributed
shapiro.test(mydata$StdResid)

## 
##  Shapiro-Wilk normality test
## 
## data:  mydata$StdResid
## W = 0.9964, p-value = 0.3443

There is normal distribution (We cannot reject H0: Normal distribution, p > 0.05). Regression model does not violate the normality assumption.

LINEARITY

mydata$StdFittedValue <- scale(fitted(reg_model)) 

scatterplot(mydata$StdResid, mydata$StdFittedValue,
            regLine = TRUE,
            smooth = FALSE, 
            xlab = "Standardized residuals",
            ylab = "Standardized fitted values")

Based on the scatterplot, we can assume that the linearity assumption is met.

HOMOSKEDASTICITY

# Test for homoskedasticity
ols_test_breusch_pagan(reg_model)

## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                 Data                  
##  -------------------------------------
##  Response : Potential 
##  Variables: fitted values of Potential 
## 
##         Test Summary         
##  ----------------------------
##  DF            =    1 
##  Chi2          =    2.466219 
##  Prob > Chi2   =    0.1163174

Assumption of homoscedasticity is now met (We cannot reject H0, p > 0.05).

EXPLAINATION OF FINAL RESULTS

summary(reg_model)

## 
## Call:
## lm(formula = Potential ~ Age + Pace + Passing + Physic + PreferredFoot, 
##     data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0378 -2.2619  0.0081  1.9161  9.9033 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        63.06174    1.73839  36.276   <2e-16 ***
## Age                -0.81977    0.04540 -18.056   <2e-16 ***
## Pace                0.03053    0.01420   2.149   0.0321 *  
## Passing             0.28696    0.01824  15.737   <2e-16 ***
## Physic              0.18530    0.01818  10.195   <2e-16 ***
## PreferredFootRight  0.70533    0.32869   2.146   0.0324 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.326 on 480 degrees of freedom
## Multiple R-squared:  0.4968, Adjusted R-squared:  0.4916 
## F-statistic: 94.78 on 5 and 480 DF,  p-value: < 2.2e-16

\[ Potential = 63.1 - 0.82 Age + 0.03 Pace + 0.29 Passing + 0.19 Physic + 0.71 PreferredFootRight \]

PARTIAL REGRESSION COEFFICIENTS

Age - If age increases by 1 year, potential rating decrease on average by 0.82, assuming everything else remains unchanged.
Pace - If pace rating increases by 1, potential rating increase on average by 0.03, assuming everything else remains unchanged.
Passing - If passing rating increases by 1, potential rating increase on average by 0.29, assuming everything else remains unchanged.
Physic - If physic rating increases by 1, potential rating increase on average by 0.19, assuming everything else remains unchanged.
PrefferedFoot - Given the values of all other variables, players who prefer right foot have on average 0.71 more potential then those, who prefer left.

MULTIPLE COEFFICIENT OF DETERMINATION

The adjusted R-squared indicates that 49.2% of the variability of dependent variable, is explained by linear effect of all explanatory variables.

MULTIPLE CORRELATION COEFFICIENT

sqrt(summary(reg_model)$r.squared)

## [1] 0.7048447

\[ sqrt(R^{2}) = 0.70 \]

The maximum degree of linear relationship between the variable Potential, and independent variables is 0.70 - fairly strong relationship.

F STATISTICS

H0: \(ρ^{2} = 0\)

H1: \(ρ^{2} > 0\)

We can reject H0 at p < 0.001.

We can conclude that our model is valid because we proved a relationship between the dependent variable and at least one explanatory variable.

Homework 3 Correlation and Regression

Martin Ferjan

2023-01-18

IMPORTING DATA

DESCRIPTION OF VARIABLES

DEFINITION OF THE MULTIPLE REGRESSION MODEL

VISUAL PRESENTATION OF THE DATA

REGRESSION FUNCTION ESTIMATION

ASSUMPTIONS

MULTICOLINEARITY

NORMALITY

LINEARITY

HOMOSKEDASTICITY

CORRECTIONS

REESTIMATED REGRESSION MODEL

RECHECKING ASSUMPTIONS

MULTICOLINEARITY

NORMALITY

LINEARITY

HOMOSKEDASTICITY

EXPLAINATION OF FINAL RESULTS

PARTIAL REGRESSION COEFFICIENTS

MULTIPLE COEFFICIENT OF DETERMINATION

MULTIPLE CORRELATION COEFFICIENT

F STATISTICS