INTRO TO R - Data Dive 8

Dataset: English Premier League

#Importing libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)

Importing dataset

data <- read.csv('C:/Downloads/final_dataset.csv')
colnames(data)
##  [1] "X"             "Date"          "HomeTeam"      "AwayTeam"     
##  [5] "FTHG"          "FTAG"          "FTR"           "HTGS"         
##  [9] "ATGS"          "HTGC"          "ATGC"          "HTP"          
## [13] "ATP"           "HM1"           "HM2"           "HM3"          
## [17] "HM4"           "HM5"           "AM1"           "AM2"          
## [21] "AM3"           "AM4"           "AM5"           "MW"           
## [25] "HTFormPtsStr"  "ATFormPtsStr"  "HTFormPts"     "ATFormPts"    
## [29] "HTWinStreak3"  "HTWinStreak5"  "HTLossStreak3" "HTLossStreak5"
## [33] "ATWinStreak3"  "ATWinStreak5"  "ATLossStreak3" "ATLossStreak5"
## [37] "HTGD"          "ATGD"          "DiffPts"       "DiffFormPts"

Response variable: FTHG (full-time home goals)

Explanatory variable: HomeTeam

Null hypothesis: The mean number of full-time home goals is the same for every home team.

Significance level: 0.05

ANOVA Test

# One-way ANOVA: do mean home goals differ across home teams?
response_variable <- data$FTHG
explanatory_variable <- data$HomeTeam
anova_result <- aov(response_variable ~ explanatory_variable, data = data)
summary(anova_result)
##                        Df Sum Sq Mean Sq F value Pr(>F)    
## explanatory_variable   43    901  20.957   13.41 <2e-16 ***
## Residuals            6796  10620   1.563                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA gives a large F-statistic of 13.41 on 43 and 6796 degrees of freedom.

The p-value (< 2e-16) is far below the 0.05 significance level, so we reject the null hypothesis: mean home goals differ between at least some home teams.
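
As a rough check, the F-statistic and an effect size can be recomputed from the rounded values printed in the ANOVA table. The sketch below uses only base R. The variable names are illustrative, and eta-squared is an extra measure that the analysis above does not report.

```r
# Values copied from the printed ANOVA table (rounded, so results are approximate)
ss_between <- 901     # Sum Sq for explanatory_variable
ss_within  <- 10620   # Sum Sq for Residuals
df_between <- 43
df_within  <- 6796

# F = mean square between groups / mean square within groups
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within
f_stat <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Eta-squared: share of total variation in FTHG attributable to HomeTeam
eta_sq <- ss_between / (ss_between + ss_within)

round(c(F = f_stat, eta_squared = eta_sq), 3)
p_value < 0.05
```

With these inputs eta-squared comes out near 0.08, so although the team differences are statistically significant, home-team identity accounts for a modest share of the variation in home goals.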

Linear regression model

Second explanatory variable: HTGD (home team goal difference)

Dependent variable: FTHG

Independent variable: HTGD

# Simple linear regression of full-time home goals on home team goal difference
response_variable <- data$FTHG
explanatory_variable <- data$HTGD
linear_model <- lm(response_variable ~ explanatory_variable, data = data)
summary(linear_model)
## 
## Call:
## lm(formula = response_variable ~ explanatory_variable, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4413 -0.9746 -0.2366  0.6939  7.2639 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.53178    0.01525  100.44   <2e-16 ***
## explanatory_variable  0.44280    0.02199   20.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.261 on 6838 degrees of freedom
## Multiple R-squared:  0.05599,    Adjusted R-squared:  0.05586 
## F-statistic: 405.6 on 1 and 6838 DF,  p-value: < 2.2e-16
coefficients <- coef(linear_model)
print(coefficients)
##          (Intercept) explanatory_variable 
##            1.5317761            0.4428045
ggplot(data, aes(x=explanatory_variable, y=response_variable)) +
  geom_point() +                     
  geom_smooth(method="lm", se=FALSE) + 
  labs(x="Home Team Goal Difference (HTGD)", y="Full-Time Home Goals (FTHG)") +
  ggtitle("Scatter Plot with Linear Regression Line")
## `geom_smooth()` using formula = 'y ~ x'

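The slope estimate for HTGD is 0.443 (p < 2e-16), so a one-unit increase in HTGD is associated with about 0.44 more full-time home goals on average. The relationship is statistically significant at the 0.05 level.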
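
To make the fitted line concrete, the sketch below plugs the printed coefficient estimates into the regression equation and builds an approximate 95% confidence interval for the slope. The helper `predict_fthg` and the example HTGD values are hypothetical, chosen only for illustration.

```r
# Estimates copied from the summary(linear_model) output
intercept <- 1.53178
slope     <- 0.44280
slope_se  <- 0.02199
resid_df  <- 6838

# Hypothetical helper: expected full-time home goals for a given HTGD
predict_fthg <- function(htgd) intercept + slope * htgd

# Illustrative HTGD values (not taken from the dataset)
example_htgd <- c(-1, 0, 1)
predict_fthg(example_htgd)

# 95% confidence interval for the slope: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df = resid_df)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se
round(slope_ci, 3)
```

The interval excludes zero, which agrees with the very small p-value reported for the slope.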

Assessing model fit using MSE and R-squared.

predicted_values <- predict(linear_model)
mse <- mean((response_variable - predicted_values)^2)
mse
## [1] 1.59002
rsquared <- summary(linear_model)$r.squared
rsquared
## [1] 0.05599406

The R-squared value of about 0.056 means that HTGD explains only about 5.6% of the variance in FTHG. The relationship is statistically significant, but HTGD alone is a weak predictor of home goals.
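
The figures printed by `summary(linear_model)` are linked to the MSE and R-squared above, which allows a quick consistency check. The sketch below, using values copied from the output, recovers R-squared from the F-statistic and the residual standard error from the MSE. The variable names are illustrative.

```r
# Values copied from the output above
n       <- 6840      # observations: 6838 residual df + 2 estimated coefficients
mse     <- 1.59002
f_stat  <- 405.6
df_res  <- 6838
t_slope <- 20.14

# With one predictor, R^2 = F / (F + residual df)
r2_from_f <- f_stat / (f_stat + df_res)

# MSE divides the residual sum of squares by n; the residual standard
# error divides by the residual degrees of freedom instead
rss <- mse * n
rse <- sqrt(rss / df_res)

# With one predictor, the squared t value of the slope equals the F-statistic
t_squared <- t_slope^2

round(c(r2 = r2_from_f, rse = rse, t_squared = t_squared), 4)
```

The recovered values line up with the reported R-squared and residual standard error up to rounding in the printed output.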