INTRO TO R - Data Dive 8
Dataset: English Premier League dataset
#Importing libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)
Importing dataset
data <- read.csv('C:/Downloads/final_dataset.csv')
colnames(data)
## [1] "X" "Date" "HomeTeam" "AwayTeam"
## [5] "FTHG" "FTAG" "FTR" "HTGS"
## [9] "ATGS" "HTGC" "ATGC" "HTP"
## [13] "ATP" "HM1" "HM2" "HM3"
## [17] "HM4" "HM5" "AM1" "AM2"
## [21] "AM3" "AM4" "AM5" "MW"
## [25] "HTFormPtsStr" "ATFormPtsStr" "HTFormPts" "ATFormPts"
## [29] "HTWinStreak3" "HTWinStreak5" "HTLossStreak3" "HTLossStreak5"
## [33] "ATWinStreak3" "ATWinStreak5" "ATLossStreak3" "ATLossStreak5"
## [37] "HTGD" "ATGD" "DiffPts" "DiffFormPts"
Response variable: FTHG (full-time home goals)
Explanatory variable: HomeTeam (the home team)
Null hypothesis: the mean number of goals scored at home is the same
for every home team.
Significance level: 0.05
ANOVA Test
# Response: full-time home goals; explanatory: home team (categorical)
response_variable <- data$FTHG
explanatory_variable <- data$HomeTeam
anova_result <- aov(response_variable ~ explanatory_variable , data = data )
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## explanatory_variable 43 901 20.957 13.41 <2e-16 ***
## Residuals 6796 10620 1.563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA gives a large F-statistic (F = 13.41 on 43 and 6796 degrees
of freedom). The p-value (< 2e-16) is far below the 0.05 significance
level, so we reject the null hypothesis: mean home goals differ between
at least some home teams.
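As a quick check on the table above, the F-statistic is the ratio of the between-group mean square to the residual mean square. The sketch below recomputes it from the rounded values printed in the summary. The variable names are illustrative, and eta-squared is an extra effect-size summary that the original analysis does not compute.

```r
# Values copied from the printed ANOVA table (rounded as shown there)
df_between <- 43
df_within  <- 6796
ss_between <- 901
ss_within  <- 10620
ms_between <- 20.957
ms_within  <- 1.563

# F is the between-group mean square divided by the residual mean square
f_stat <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Eta-squared: share of total variation attributable to the grouping factor
eta_sq <- ss_between / (ss_between + ss_within)

round(f_stat, 2)
round(eta_sq, 3)
p_value < 0.05
```

With these inputs the recomputed F rounds to 13.41, matching the table, and eta-squared comes out at roughly 0.08. So although the differences between home teams are clearly significant, team identity accounts for only a modest share of the variation in home goals.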
Linear regression model
Second explanatory variable: HTGD (home team goal difference)
Dependent variable: FTHG
Independent variable: HTGD
# Response: full-time home goals; explanatory: home team goal difference
response_variable <- data$FTHG
explanatory_variable <- data$HTGD
linear_model <- lm(response_variable ~ explanatory_variable , data = data)
summary(linear_model)
##
## Call:
## lm(formula = response_variable ~ explanatory_variable, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4413 -0.9746 -0.2366 0.6939 7.2639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.53178 0.01525 100.44 <2e-16 ***
## explanatory_variable 0.44280 0.02199 20.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.261 on 6838 degrees of freedom
## Multiple R-squared: 0.05599, Adjusted R-squared: 0.05586
## F-statistic: 405.6 on 1 and 6838 DF, p-value: < 2.2e-16
coefficients <- coef(linear_model)
print(coefficients)
## (Intercept) explanatory_variable
## 1.5317761 0.4428045
ggplot(data, aes(x=explanatory_variable, y=response_variable)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(x="Home Team Goal Difference (HTGD)", y="Full-Time Home Goals (FTHG)") +
ggtitle("Scatter Plot with Linear Regression Line")
## `geom_smooth()` using formula = 'y ~ x'

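HTGD is a statistically significant predictor of FTHG (slope = 0.443,
t = 20.14, p < 2e-16): each one-unit increase in HTGD is associated
with about 0.44 more home goals on average.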
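To make the fitted line concrete, the sketch below plugs the printed coefficients into the regression equation FTHG = intercept + slope * HTGD. The helper `predict_fthg` is a hypothetical name introduced here for illustration, and the HTGD values are arbitrary examples rather than rows from the dataset.

```r
# Coefficients copied from the printed model summary
intercept <- 1.5317761
slope     <- 0.4428045

# Hypothetical helper: predicted home goals for a given HTGD value
predict_fthg <- function(htgd) {
  intercept + slope * htgd
}

# Predictions at a few illustrative HTGD values
predict_fthg(c(-1, 0, 1))
```

At HTGD = 0 the prediction is just the intercept (about 1.53 goals), and moving HTGD up or down by one unit shifts the prediction by the slope (about 0.44 goals).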
Assessing model fit using MSE and R-squared.
predicted_values <- predict(linear_model)
mse <- mean((response_variable - predicted_values)^2)
mse
## [1] 1.59002
rsquared <- summary(linear_model)$r.squared
rsquared
## [1] 0.05599406
The R-squared of about 0.056 means that HTGD explains only about 5.6%
of the variance in FTHG. The relationship is statistically significant
but weak, so this model leaves most of the variation in home goals
unexplained.
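The link between MSE, the residual standard error, and R-squared can be checked by hand. The sketch below uses simulated stand-in data (an assumption, not the Premier League dataset) to show that R-squared equals 1 - RSS/TSS, that MSE divides the residual sum of squares by n while the residual standard error divides by n - 2, and that in a one-predictor model R-squared is the squared correlation.

```r
set.seed(1)

# Simulated stand-in data (hypothetical; not the Premier League dataset)
n <- 200
x <- rnorm(n)
y <- 1.5 + 0.4 * x + rnorm(n, sd = 1.2)

fit <- lm(y ~ x)

# Residual and total sums of squares
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)

# R-squared is the proportion of total variation the model accounts for
r2_manual <- 1 - rss / tss
r2_model  <- summary(fit)$r.squared

# MSE divides RSS by n; the residual standard error divides by n - 2
mse <- rss / n
rse <- sqrt(rss / (n - 2))

all.equal(r2_manual, r2_model)
all.equal(rse, summary(fit)$sigma)
all.equal(r2_model, cor(x, y)^2)
```

This is why the MSE reported above is slightly smaller than the square of the residual standard error in the model summary: the two differ only in the divisor.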