English Premier League Data Analysis and Modeling

Introduction

In this project, we aim to derive actionable insights for better performance in the English Premier League by leveraging data analytics and building predictive models. We will use several modeling techniques, including Linear Regression, Decision Tree Regression, and Random Forest, to predict the number of points earned by each team over a league season. The models will be trained on data from the 2021-22 season and tested on data from the subsequent season, 2022-23.

Steps Overview

  1. Data Collection: Gather data from the English Premier League for the 2021-22 and 2022-23 seasons, including information about the number of points earned by each team and relevant numeric features.

  2. Data Preprocessing: Clean and preprocess the data to handle missing values, outliers, and inconsistencies. Convert categorical variables into numerical format if necessary.

  3. Model Building:

    • Linear Regression: Build a linear regression model to predict the number of points earned by each team based on selected numeric features.
    • Decision Tree Regression: Construct a decision tree regression model to capture nonlinear relationships between features and the target variable.
    • Random Forest: Develop a random forest regression model leveraging ensemble learning for improved predictive accuracy and robustness.
  4. Model Evaluation: Evaluate the performance of each model using appropriate metrics such as RMSE, R-squared, etc. Compare performance on both training and testing datasets to ensure generalization.

  5. Insights and Recommendations: Analyze model results to identify actionable insights for performance improvement in the English Premier League. This includes understanding feature importance and developing strategies based on model predictions.

  6. Deployment and Monitoring: Deploy the best-performing model for real-world use and continuously monitor its performance. Update the model as needed based on new data and insights.

Conclusion

By following this approach, we aim to utilize data-driven insights to optimize team performance and drive success in the English Premier League.

Theoretical Aspects

Linear Regression

Linear regression is a statistical method used to model the relationship between a dependent variable \(Y\) and one or more independent variables \(X\). It assumes that this relationship can be approximated by a linear equation:

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \varepsilon \]

where:

  • \(Y\) is the dependent variable (response),
  • \(X_1, X_2, \ldots, X_n\) are the independent variables (features),
  • \(\beta_0, \beta_1, \ldots, \beta_n\) are the coefficients, and
  • \(\varepsilon\) is the error term.
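To make the notation concrete, here is a minimal sketch on simulated data (not the league data): lm() estimates the coefficients, and the same estimates can be recovered by hand from the normal equations \(\hat{\beta} = (X^TX)^{-1}X^Ty\).

# Minimal sketch on simulated data: lm() vs. the normal equations
set.seed(1)
X1 <- rnorm(50); X2 <- rnorm(50)
Y  <- 3 + 2 * X1 - 1.5 * X2 + rnorm(50)    # true beta0 = 3, beta1 = 2, beta2 = -1.5
fit <- lm(Y ~ X1 + X2)
coef(fit)                                   # estimated beta_0, beta_1, beta_2
Xmat <- cbind(1, X1, X2)                    # design matrix with an intercept column
solve(t(Xmat) %*% Xmat) %*% t(Xmat) %*% Y   # same estimates, computed by hand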

Decision Tree Regression

A decision tree is a tree-like structure where each internal node represents a test on a feature, each branch represents the outcome of the test, and each leaf node represents a decision or prediction. Decision tree regression works by partitioning the feature space into regions and predicting the response variable based on the average value of the training instances within each region.
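As a quick illustration of that "average within each region" behavior, the sketch below (simulated data, not the league data) fits a one-feature tree and shows that its predictions are simply the leaf means.

# Sketch: a tree's prediction is the mean response within the matching leaf
library(rpart)
set.seed(2)
x <- runif(100, 0, 10)
y <- ifelse(x < 5, 10, 20) + rnorm(100)   # a step function plus noise
toy <- data.frame(x, y)
tree <- rpart(y ~ x, data = toy)
predict(tree, data.frame(x = c(2, 8)))    # close to the leaf means, ~10 and ~20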

Random Forest Regression

Random forest is an ensemble learning method that constructs multiple decision trees during training and outputs the average prediction of the individual trees. It improves upon decision tree regression by reducing overfitting and increasing predictive accuracy.
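That averaging can be verified directly: in the sketch below (again simulated data), predict() with predict.all = TRUE returns every tree's prediction, and the per-tree mean reproduces the ensemble output.

# Sketch: the forest's prediction equals the average of its trees
library(randomForest)
set.seed(3)
toy <- data.frame(x = runif(200), z = runif(200))
toy$y <- sin(2 * pi * toy$x) + rnorm(200, sd = 0.2)
rf <- randomForest(y ~ x + z, data = toy, ntree = 100)
p <- predict(rf, toy[1:3, ], predict.all = TRUE)
rowMeans(p$individual)   # per-tree predictions averaged ...
p$aggregate              # ... match the ensemble's predictions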

library(rpart)

Loading the rpart Package

The library(rpart) statement is used in R to load the rpart package into the current working environment. The rpart package provides functions for building and visualizing decision trees. By loading the package with this command, we gain access to its functionality, including the ability to create decision tree models using the rpart() function and visualize them using plotting functions provided by the package.

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.

Loading the randomForest Package

The randomForest package is loaded using the library() function. This package provides functions for building random forest models, a popular ensemble learning technique used for classification and regression tasks.

I will be commenting on my syntax and output right below each cell.

# read the CSV with headers
data2022<-read.csv("season22.csv", header=T,sep =",")
str(data2022)
## 'data.frame':    20 obs. of  11 variables:
##  $ teams         : chr  "Arsenal" "Aston Villa" "Brentford" "Brighton" ...
##  $ avg.age       : num  24.4 26.1 25.4 26.2 28.5 27.4 27.1 26.7 26.1 26.5 ...
##  $ possession    : num  52.8 46.5 44.8 54.4 40.2 61.8 51.1 40 52.3 52 ...
##  $ redcards      : int  4 2 3 2 2 1 1 6 3 1 ...
##  $ prog.carries  : int  734 653 450 676 388 894 592 512 671 622 ...
##  $ prog.passes   : int  1655 1300 1189 1547 972 1967 1229 1100 1483 1356 ...
##  $ points        : int  69 45 46 51 35 74 48 39 38 52 ...
##  $ avgdist.gk    : num  7.3 15.4 15.8 9.3 17 18.8 11.2 10.7 23.7 10.7 ...
##  $ miscontrols   : int  557 595 563 598 540 591 671 640 676 601 ...
##  $ aerialswon.pct: num  44.7 49 49.6 53.3 53 52.5 49.7 46.7 42.9 48.9 ...
##  $ overall       : chr  "europe" "safe" "safe" "safe" ...

In this cell, we use the read.csv() function to read a CSV file named “season22.csv” into R, specifying that the file has a header row (header = T) and that values are comma-separated (sep = ","). The result is assigned to a variable named data2022. Finally, the str() function displays the structure of the data2022 object: the number of observations and variables, along with each variable's name, type, and first few values.

data2023<-read.csv("season23.csv", header=T,sep =",")
str(data2023)
## 'data.frame':    20 obs. of  11 variables:
##  $ teams         : chr  "Arsenal" "Aston Villa" "Bournemouth" "Brentford" ...
##  $ avg.age       : num  24.7 27 26.3 26.2 26.3 26.3 26.7 26.6 28.2 25.3 ...
##  $ possession    : num  59.3 49.3 40.4 43.8 60.2 58.7 46.3 42.8 48.8 47 ...
##  $ redcards      : int  0 1 0 1 0 3 3 2 1 3 ...
##  $ prog.carries  : int  824 637 504 392 809 829 556 500 662 555 ...
##  $ prog.passes   : int  2049 1242 1006 1131 1849 1737 1243 1052 1326 1429 ...
##  $ points        : int  84 61 39 59 62 44 45 36 52 31 ...
##  $ avgdist.gk    : num  16.1 14.2 10.5 15.7 12.3 13.7 13.1 5 14 11.6 ...
##  $ miscontrols   : int  526 565 589 520 558 568 643 605 535 602 ...
##  $ aerialswon.pct: num  46.5 48.7 49.2 51.8 53.4 52.1 46.2 48.7 48.2 49.1 ...
##  $ overall       : chr  "europe" "europe" "safe" "safe" ...

We use the read.csv() function to read another CSV file named “season23.csv” into R, again specifying a header row (header = T) and comma-separated values (sep = ","). The resulting dataset is assigned to a variable named data2023, and str() displays its structure: each variable's name, type, and first few values.

# Load required libraries
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin

Histograms/Density Plots

ggplot(data2022, aes(x = points)) +
  geom_histogram(fill = "skyblue", color = "blue", bins = 20) +  # Change fill color, outline color, and number of bins
  ggtitle("Distribution of Number of Points by Teams") +
  xlab("Number of Points") +  # Add x-axis label
  ylab("Frequency") +    # Add y-axis label
  theme_minimal()        # Change theme to minimal

Histogram Visualization of Points Distribution

In this section, we used the ggplot2 package to create a histogram to visualize the distribution of the number of points obtained by teams in the Premier League dataset for the 2021-22 season.

Code Explanation:

  • ggplot(data2022, aes(x = points)): Specifies the dataset data2022 and maps the variable points to the x-axis.
  • geom_histogram(): Adds the histogram layer to the plot, indicating that we want to create a histogram.
  • fill = "skyblue", color = "blue": Specifies the fill color of the bars as “skyblue” and the outline color as “blue”.
  • bins = 20: Sets the number of bins (bars) in the histogram to 20.
  • ggtitle("Distribution of Number of Points by Teams"): Sets the title of the plot.
  • xlab("Number of Points"): Sets the label for the x-axis.
  • ylab("Frequency"): Sets the label for the y-axis.
  • theme_minimal(): Applies the minimal theme to the plot, which removes unnecessary background elements and gridlines.
ggplot(data2023, aes(x = possession, y = points)) +
  geom_point(color = "darkblue", size = 3, shape = 16) +  # Change point color, size, and shape
  ggtitle("Scatter Plot of Points vs. Possession") +
  xlab("Possession") +  # Add x-axis label
  ylab("Points") +      # Add y-axis label
  theme_minimal()       # Change theme to minimal

Scatter Plot of Points vs. Possession

This code creates a scatter plot to visualize the relationship between possession and points using the ggplot2 package.

Code Explanation:

  • ggplot(data2023, aes(x = possession, y = points)): Sets up the initial ggplot object with the data2023 data frame and specifies that ‘possession’ should be plotted on the x-axis and ‘points’ on the y-axis.
  • geom_point(color = "darkblue", size = 3, shape = 16): Adds a layer of points to the plot, specifying the color as “darkblue”, size as 3, and shape as 16.
  • ggtitle("Scatter Plot of Points vs. Possession"): Adds a title to the plot.
  • xlab("Possession"): Adds a label to the x-axis.
  • ylab("Points"): Adds a label to the y-axis.
  • theme_minimal(): Changes the theme of the plot to a minimalistic style.

Overall, this code creates a scatter plot showing the relationship between possession and points, with specified colors, labels, and a minimalistic theme.

ggplot(data2023, aes(x = possession, y = points, label = teams)) +
  geom_point(color = "darkblue", size = 3, shape = 16) +  # Change point color, size, and shape
  geom_text(hjust = -0.1, vjust = -0.5, size = 3, color = "black") +  # Add team names
  ggtitle("Scatter Plot of Points vs. Possession") +
  xlab("Possession") +  # Add x-axis label
  ylab("Points") +      # Add y-axis label
  theme_minimal()       # Change theme to minimal

Here we observe a positive association between possession and success (points) in the English Premier League.

Scatter Plot of Points vs. Possession with Team Labels

This code creates a scatter plot to visualize the relationship between possession and points, with team names as labels, using the ggplot2 package.

Code Explanation:

  • ggplot(data2023, aes(x = possession, y = points, label = teams)): Sets up the initial ggplot object with the data2023 data frame and specifies that ‘possession’ should be plotted on the x-axis, ‘points’ on the y-axis, and ‘teams’ as the labels for each point.
  • geom_point(color = "darkblue", size = 3, shape = 16): Adds a layer of points to the plot, specifying the color as “darkblue”, size as 3, and shape as 16.
  • geom_text(hjust = -0.1, vjust = -0.5, size = 3, color = "black"): Adds labels (team names) to the plot at the coordinates specified by the ‘possession’ and ‘points’ variables, adjusting the horizontal and vertical justification, font size, and color.
  • ggtitle("Scatter Plot of Points vs. Possession"): Adds a title to the plot.
  • xlab("Possession"): Adds a label to the x-axis.
  • ylab("Points"): Adds a label to the y-axis.
  • theme_minimal(): Changes the theme of the plot to a minimalistic style.

Overall, this code creates a scatter plot showing the relationship between possession and points, with team names as labels, specified colors, labels, and a minimalistic theme.

ggplot(data2022, aes(x = possession, fill = "Possession")) +
  geom_density(alpha = 0.5, color = "darkblue", fill = "lightblue") +  # Change line and fill colors
  ggtitle("Density Plot of Possession") +
  xlab("Possession") +  # Add x-axis label
  ylab("Density") +     # Add y-axis label
  theme_minimal()       # Change theme to minimal

Density Plot of Possession

This code creates a density plot to visualize the distribution of possession using the ggplot2 package.

Code Explanation:

  • ggplot(data2022, aes(x = possession, fill = "Possession")): Sets up the initial ggplot object with the data2022 data frame and specifies ‘possession’ to be plotted on the x-axis. The fill aesthetic is set to a constant value “Possession”, which will create a single density plot.
  • geom_density(alpha = 0.5, color = "darkblue", fill = "lightblue"): Adds a density layer to the plot, specifying the transparency (alpha) as 0.5, line color as “darkblue”, and fill color as “lightblue”.
  • ggtitle("Density Plot of Possession"): Adds a title to the plot.
  • xlab("Possession"): Adds a label to the x-axis.
  • ylab("Density"): Adds a label to the y-axis.
  • theme_minimal(): Changes the theme of the plot to a minimalistic style.

Overall, this code creates a density plot showing the distribution of possession, with specified colors, labels, and a minimalistic theme.

Barplots

# Reorder the levels of the "overall" variable
data2022$overall <- factor(data2022$overall, levels = c("europe", "safe", "relegation"))

# Plot the bar chart with reordered "overall" variable
ggplot(data2022, aes(x = overall)) + 
  geom_bar(fill = "skyblue", color = "blue") + 
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5, size = 3, position = position_stack(vjust = 0.5)) +  # Corrected aesthetics
  ggtitle("Distribution of Overall Categories") + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  # Adjust x-axis labels
  labs(x = "Overall Category", y = "Frequency")  # Add axis labels
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Distribution of Overall Categories

This code creates a bar chart to visualize the distribution of the “overall” variable in the data2022 data frame.

Code Explanation:

  • data2022$overall <- factor(data2022$overall, levels = c("europe", "safe", "relegation")): Reorders the levels of the “overall” variable in the data2022 data frame, ensuring that “europe” appears first, followed by “safe”, and then “relegation”.
  • ggplot(data2022, aes(x = overall)): Sets up the initial ggplot object with the data2022 data frame and specifies ‘overall’ to be plotted on the x-axis.
  • geom_bar(fill = "skyblue", color = "blue"): Adds a bar layer to the plot, specifying the fill color as “skyblue” and the outline color as “blue”.
  • geom_text(stat = "count", aes(label = ..count..), vjust = -0.5, size = 3, position = position_stack(vjust = 0.5)): Adds text labels to each bar, indicating the count of observations in each category. The stat = "count" argument computes the counts, and position_stack(vjust = 0.5) centers each label vertically within its bar. Note that the ..count.. notation is deprecated since ggplot2 3.4.0 in favor of after_stat(count), which is what the warning above refers to.
  • ggtitle("Distribution of Overall Categories"): Adds a title to the plot.
  • theme_minimal(): Changes the theme of the plot to a minimalistic style.
  • theme(axis.text.x = element_text(angle = 45, hjust = 1)): Adjusts the angle of the x-axis labels to 45 degrees and aligns them to the right for better readability.
  • labs(x = "Overall Category", y = "Frequency"): Adds labels to the x-axis and y-axis.

Overall, this code creates a bar chart showing the distribution of the “overall” variable, with reordered categories, specified colors, labels, and a minimalistic theme.

Boxplots

ggplot(data2022, aes(x = overall, y = points)) + geom_boxplot() + ggtitle("Distribution of Points by Overall Category")

Distribution of Points by Overall Category

This code creates a boxplot to visualize the distribution of points across different categories of the ‘overall’ variable in the data2022 data frame.

Code Explanation:

  • ggplot(data2022, aes(x = overall, y = points)): Sets up the initial ggplot object with the data2022 data frame and specifies ‘overall’ to be plotted on the x-axis and ‘points’ on the y-axis.
  • geom_boxplot(): Adds a layer of boxplots to the plot, showing the distribution of ‘points’ within each category of ‘overall’.
  • ggtitle("Distribution of Points by Overall Category"): Adds a title to the plot.

Overall, this code creates a boxplot showing the distribution of points across different categories of the ‘overall’ variable, providing a visual representation of the spread, central tendency, and outliers in the data.

# Define colors for each category
category_colors <- c("europe" = "blue", "safe" = "green", "relegation" = "red")

# Plot the boxplot with colors for each category
ggplot(data2022, aes(x = overall, y = points, fill = overall)) +
  geom_boxplot() +
  scale_fill_manual(values = category_colors) +  # Apply custom colors
  ggtitle("Distribution of Points by Overall Category") +
  xlab("Overall Category") +
  ylab("Points") +
  theme_minimal()

Distribution of Points by Overall Category

This code creates a boxplot to visualize the distribution of points across different categories of the ‘overall’ variable in the data2022 data frame, with custom colors assigned to each category.

Code Explanation:

  • category_colors <- c("europe" = "blue", "safe" = "green", "relegation" = "red"): Defines a vector category_colors specifying custom colors for each category of the ‘overall’ variable.
  • ggplot(data2022, aes(x = overall, y = points, fill = overall)): Sets up the initial ggplot object with the data2022 data frame and specifies ‘overall’ to be plotted on the x-axis, ‘points’ on the y-axis, and ‘overall’ to fill the colors.
  • geom_boxplot(): Adds a layer of boxplots to the plot, showing the distribution of ‘points’ within each category of ‘overall’.
  • scale_fill_manual(values = category_colors): Applies the custom colors defined in category_colors to fill the boxes based on the ‘overall’ variable.
  • ggtitle("Distribution of Points by Overall Category"): Adds a title to the plot.
  • xlab("Overall Category") and ylab("Points"): Add labels to the x-axis and y-axis, respectively.
  • theme_minimal(): Changes the theme of the plot to a minimalistic style.

Overall, this code creates a boxplot showing the distribution of points across different categories of the ‘overall’ variable, with custom colors assigned to each category for better visualization.

Scatterplots

# Plot the relationship between average age and points
ggplot(data2022, aes(x = avg.age, y = points, fill = overall, label = teams)) +
  geom_point() +
  ggtitle("Relationship between Average Age and Points") +
  xlab("Average Age") +
  ylab("Points") +
  theme_minimal() +
  scale_fill_manual(values = category_colors) +  # Apply custom colors
  geom_text(aes(color = overall), check_overlap = TRUE, vjust = -0.5) +  # Add team names as labels with color based on 'overall'
  scale_color_manual(values = category_colors)  # Apply custom colors to text labels

Relationship between Average Age and Points

This code creates a scatter plot to visualize the relationship between the average age and points of teams in the data2022 data frame.

Code Explanation:

  • ggplot(data2022, aes(x = avg.age, y = points, fill = overall, label = teams)): Sets up the initial ggplot object with the data2022 data frame. It specifies ‘avg.age’ on the x-axis, ‘points’ on the y-axis, ‘overall’ to fill the colors, and ‘teams’ to be displayed as labels.
  • geom_point(): Adds a layer of points to the plot, representing the relationship between average age and points.
  • ggtitle("Relationship between Average Age and Points"): Adds a title to the plot.
  • xlab("Average Age") and ylab("Points"): Add labels to the x-axis and y-axis, respectively.
  • theme_minimal(): Changes the theme of the plot to a minimalistic style.
  • scale_fill_manual(values = category_colors): Applies custom colors defined in category_colors to fill the points based on the ‘overall’ variable.
  • geom_text(aes(color = overall), check_overlap = TRUE, vjust = -0.5): Adds text labels to the points, displaying the team names. The color = overall argument assigns the label color based on the ‘overall’ variable. The check_overlap = TRUE argument ensures that labels do not overlap, and vjust = -0.5 adjusts the vertical position of the labels.
  • scale_color_manual(values = category_colors): Applies custom colors defined in category_colors to the text labels.

Overall, this code creates a scatter plot showing the relationship between average age and points, with custom colors for points and labels based on the ‘overall’ variable. Team names are also displayed as labels.

ggplot(data2022, aes(x = prog.passes, y = prog.carries, fill = overall, label = teams)) + geom_point() + ggtitle("Relationship between Progressive Carries and  Number of Passes")+
    xlab("Passing the Ball") +
  ylab("Carrying the Ball") +
  theme_minimal() +
  scale_fill_manual(values = category_colors) +  # Apply custom colors
  geom_text(aes(color = overall), check_overlap = TRUE, vjust = -0.5) +  # Add team names as labels with color based on 'overall'
  scale_color_manual(values = category_colors)  # Apply custom colors to text labels

Relationship between Progressive Carries and Number of Passes

This code generates a scatter plot to visualize the relationship between the number of progressive passes (x-axis) and the number of progressive carries (y-axis) for each team.

Code Explanation:

  • ggplot(data2022, aes(x = prog.passes, y = prog.carries, fill = overall, label = teams)): Specifies the data frame (data2022) and aesthetic mappings, including the x and y variables (prog.passes and prog.carries, respectively), fill color based on the overall variable, and labels based on the teams variable.
  • geom_point(): Adds points to the plot.
  • ggtitle("Relationship between Progressive Carries and Number of Passes"): Adds a title to the plot.
  • xlab("Passing the Ball") + ylab("Carrying the Ball"): Labels the x and y axes.
  • theme_minimal(): Sets the plot theme to minimal.
  • scale_fill_manual(values = category_colors): Applies custom colors to the fill based on the category_colors defined earlier.
  • geom_text(aes(color = overall), check_overlap = TRUE, vjust = -0.5): Adds team names as labels with colors based on the overall variable. The check_overlap = TRUE argument ensures that the labels do not overlap, and vjust = -0.5 adjusts the vertical positioning of the labels.
  • scale_color_manual(values = category_colors): Applies custom colors to the text labels based on the category_colors.

This visualization helps explore how the number of progressive passes and carries relate to each other for different teams, with colors representing different overall categories (e.g., Europe, Safe, Relegation).

Overall, this code creates a scatter plot showing the relationship between the number of progressive carries and the number of progressive passes, with custom colors for points and labels based on the ‘overall’ variable. Team names are also displayed as labels.

Barplots/Stacked Barplots

ggplot(data2022, aes(x = overall, fill = factor(redcards))) + geom_bar(position = "stack") + ggtitle("Average Red Cards by Overall Category")

Average Red Cards by Overall Category

This code generates a stacked bar plot showing, for each category of the ‘overall’ variable, how many teams recorded each number of red cards.

Code Explanation:

  • ggplot(data2022, aes(x = overall, fill = factor(redcards))): Sets up the initial ggplot object with the data2022 data frame. It specifies ‘overall’ on the x-axis and ‘redcards’ (converted to a factor) to fill the colors of the bars.
  • geom_bar(position = "stack"): Adds a layer of stacked bars to the plot; each bar corresponds to a category of the ‘overall’ variable, and its segments show how many teams in that category recorded each red-card count.
  • ggtitle("Average Red Cards by Overall Category"): Adds a title to the plot.

This visualization helps compare red-card counts across the categories of the ‘overall’ variable. Note that, despite the title, the chart displays counts of teams per red-card value (stacked within each category) rather than an average number of red cards.

# Pie Chart
ggplot(data2022, aes(x = "", fill = overall)) + geom_bar(width = 1) + coord_polar("y", start = 0) + ggtitle("Proportion of Teams by Overall Category")

Proportion of Teams by Overall Category

This code generates a pie chart to illustrate the proportion of teams in each overall category.

Code Explanation:

  • ggplot(data2022, aes(x = "", fill = overall)): Sets up the initial ggplot object with the data2022 data frame. It creates a categorical variable for the x-axis with an empty string “” and fills the colors based on the ‘overall’ variable.
  • geom_bar(width = 1): Adds a layer of bars to the plot, each representing a proportion of teams in each category.
  • coord_polar("y", start = 0): Converts the Cartesian coordinates to polar coordinates, creating a pie chart. The "y" argument specifies that the y aesthetic (here, the bar counts) is mapped to the angle of each slice. The start = 0 argument specifies the starting angle of the chart.
  • ggtitle("Proportion of Teams by Overall Category"): Adds a title to the plot.

This visualization helps understand the distribution of teams across different overall categories. Each category is represented by a portion of the pie, with the size of each portion indicating the proportion of teams in that category.

Overall, this code generates a pie chart illustrating the proportion of teams in each overall category.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Calculate minimum points for each category
min_points <- data2022 %>%
  group_by(overall) %>%
  summarise(min_points = min(points, na.rm = TRUE))

# Plot the proportion of teams by overall category with minimum points as labels
ggplot(data2022, aes(x = "", fill = overall)) +
  geom_bar(width = 1, position = "fill") +  # Adjust position to fill the entire pie
  geom_text(data = min_points, aes(y = min_points, label = min_points), 
            position = position_fill(vjust = 0.5), size = 6, color = "white") +  # Increase label size and set color
  coord_polar("y", start = 0) +
  scale_fill_manual(values = c("europe" = "blue", "safe" = "green", "relegation" = "red")) +  # Customize colors
  ggtitle("Proportion of Teams by Overall Category") +
  theme_minimal()  # Adjust theme for better visualization

Proportion of Teams by Overall Category with Minimum Points

This code generates a pie chart to illustrate the proportion of teams in each overall category, along with text labels indicating the minimum points for each category.

Code Explanation:

  • library(dplyr): Loads the dplyr package, which is used for data manipulation.
  • min_points <- data2022 %>% ...: Calculates the minimum points for each overall category using the group_by and summarise functions from dplyr.
  • ggplot(data2022, aes(x = "", fill = overall)) + ...: Sets up the initial ggplot object with the data2022 data frame. It creates a categorical variable for the x-axis with an empty string “” and fills the colors based on the ‘overall’ variable.
  • geom_bar(width = 1, position = "fill"): Adds a layer of bars to the plot, each representing the proportion of teams in each category. The position = “fill” argument stacks the bars to fill the entire pie.
  • geom_text(data = min_points, aes(y = min_points, label = min_points), ...: Adds text labels to the plot, displaying the minimum points for each category. The position_fill(vjust = 0.5) argument positions the labels in the center of each filled area.
  • coord_polar("y", start = 0): Converts the Cartesian coordinates to polar coordinates, creating a pie chart. The "y" argument maps the y aesthetic (the filled proportions) to the angle of each slice. The start = 0 argument specifies the starting angle of the chart.
  • scale_fill_manual(values = c("europe" = "blue", "safe" = "green", "relegation" = "red")): Customizes the fill colors of the pie chart for each overall category.
  • ggtitle("Proportion of Teams by Overall Category"): Adds a title to the plot.
  • theme_minimal(): Adjusts the theme of the plot for better visualization.

This visualization helps understand the distribution of teams across different overall categories, along with the minimum points total recorded within each category.

Overall, this code generates a pie chart illustrating the proportion of teams in each overall category, with text labels indicating the minimum points for each category.

Correlations

#Heatmap
# Compute the correlation matrix
cor_matrix <- cor(data2022[, -c(1, 11)])

# Convert the correlation matrix to long format
cor_df <- reshape2::melt(cor_matrix)

# Plot the correlation heatmap
ggplot(data = cor_df, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() + 
  scale_fill_gradient(low = "blue", high = "red") +  # Customize the color gradient
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Correlation Heatmap")

Correlation Heatmap

This code generates a heatmap to visualize the pairwise correlations between variables in the dataset.

Code Explanation:

  • cor_matrix <- cor(data2022[, -c(1, 11)]): Calculates the correlation matrix of the dataset data2022, excluding the first and eleventh columns.
  • cor_df <- reshape2::melt(cor_matrix): Converts the correlation matrix into a long format dataframe suitable for plotting.
  • ggplot(data = cor_df, aes(x = Var1, y = Var2, fill = value)) + ...: Sets up the initial ggplot object with the melted correlation dataframe cor_df. It maps the variables Var1 and Var2 to the x and y axes respectively, and the correlation values to the fill color.
  • geom_tile(): Adds a layer of tiles to the plot, where each tile represents a pairwise correlation value.
  • scale_fill_gradient(low = "blue", high = "red"): Customizes the color gradient for the fill scale, with blue indicating lower correlation values and red indicating higher correlation values.
  • theme(axis.text.x = element_text(angle = 45, hjust = 1)): Adjusts the angle of the x-axis labels for better readability.
  • ggtitle("Correlation Heatmap"): Adds a title to the plot.

This visualization helps understand the relationships between variables in the dataset by showing the pairwise correlations. Each cell in the heatmap represents the correlation between two variables, with color indicating the strength and direction of the relationship: blue shades represent negative correlations, red shades represent positive correlations, and intermediate colors correspond to correlations near zero.

Scatter Plot and Scatter Plot Matrixes

# exploring relationships among features: correlation matrix
cor(data2022[c("avg.age", "possession", "prog.carries", "prog.passes","avgdist.gk",
                 "miscontrols","aerialswon.pct")])
##                    avg.age  possession prog.carries prog.passes  avgdist.gk
## avg.age         1.00000000 -0.08204764  -0.04631838 -0.04162586  0.08846055
## possession     -0.08204764  1.00000000   0.90328340  0.95951365  0.24674533
## prog.carries   -0.04631838  0.90328340   1.00000000  0.91078538  0.23966734
## prog.passes    -0.04162586  0.95951365   0.91078538  1.00000000  0.24166903
## avgdist.gk      0.08846055  0.24674533   0.23966734  0.24166903  1.00000000
## miscontrols    -0.07008462 -0.25956326  -0.30569156 -0.35372264  0.09466812
## aerialswon.pct  0.50243236  0.41629951   0.39431301  0.42961627 -0.01494381
##                miscontrols aerialswon.pct
## avg.age        -0.07008462     0.50243236
## possession     -0.25956326     0.41629951
## prog.carries   -0.30569156     0.39431301
## prog.passes    -0.35372264     0.42961627
## avgdist.gk      0.09466812    -0.01494381
## miscontrols     1.00000000    -0.42984815
## aerialswon.pct -0.42984815     1.00000000

Correlation Analysis

This code calculates the correlation matrix for a subset of features in the dataset data2022, including ‘avg.age’, ‘possession’, ‘prog.carries’, ‘prog.passes’, ‘avgdist.gk’, ‘miscontrols’, and ‘aerialswon.pct’.

Code Explanation:

  • cor(data2022[c("avg.age", "possession", "prog.carries", "prog.passes", "avgdist.gk", "miscontrols", "aerialswon.pct")]): This function calculates the correlation matrix for the specified subset of features in the dataset data2022.

The resulting correlation matrix will show the pairwise correlations between these features. Each cell in the matrix will contain the correlation coefficient, indicating the strength and direction of the linear relationship between two variables. The values range from -1 to 1, where:

  • 1 indicates a perfect positive correlation,
  • -1 indicates a perfect negative correlation, and
  • 0 indicates no linear correlation.

Positive values imply a positive correlation (as one variable increases, the other tends to increase), while negative values imply a negative correlation (as one variable increases, the other tends to decrease). A value close to 0 suggests no linear relationship between the variables.
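As a quick numeric illustration of that range (toy vectors, not the league data):

cor(1:10, 1:10)                 #  1: the two vectors increase together perfectly
cor(1:10, 10:1)                 # -1: the two vectors are perfectly opposed
set.seed(4)
cor(rnorm(1000), rnorm(1000))   # near 0 for unrelated variables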

#visualizing relationships among features scatterplot
pairs(data2022[c("avg.age", "possession", "prog.carries", "prog.passes","avgdist.gk",
                 "miscontrols","aerialswon.pct")])

Pairwise Scatter Plot Analysis

This code generates a matrix of scatter plots to visualize the pairwise relationships between selected features in the dataset data2022, including ‘avg.age’, ‘possession’, ‘prog.carries’, ‘prog.passes’, ‘avgdist.gk’, ‘miscontrols’, and ‘aerialswon.pct’.

Code Explanation:

  • pairs(data2022[c("avg.age", "possession", "prog.carries", "prog.passes", "avgdist.gk", "miscontrols", "aerialswon.pct")]): This function creates a matrix of scatter plots showing the pairwise relationships between the selected features.

Each cell in the matrix represents a scatter plot of one feature against another. This visualization allows for a quick examination of potential relationships, trends, and patterns between variables. Points on the scatter plots represent individual data points, and the position of these points relative to each other indicates the relationship between the variables.

# Customizing the scatterplot matrix
pairs(data2022[c("avg.age", "possession", "prog.carries", "prog.passes", "avgdist.gk", "miscontrols", "aerialswon.pct")],
      col = "#3399FF",  # Set point color to blue
      pch = 16,         # Set point shape to filled circle
      cex = 0.7,        # Set point size
      bg = "white",     # Set background color to white
      lower.panel = panel.smooth,  # Add smooth trend lines in the lower panels
      diag.panel = NULL,  # Remove diagonal panels
      labels = c("Avg. Age", "Possession", "Prog. Carries", "Prog. Passes", "Avg. Dist. GK", "Miscontrols", "Aerials Won %"),  # Set axis labels
      font.labels = 2,   # Set font weight for labels
      col.axis = "#666666",  # Set axis color to gray
      cex.axis = 0.8,    # Set axis label size
      las = 1,           # Set axis label orientation to horizontal
      main = "Scatterplot Matrix of Features"  # Set main title
)

Pairwise Scatter Plot Matrix

This code generates a scatterplot matrix to visualize the pairwise relationships between features.

Code Explanation:

  • pairs(): This function generates a scatterplot matrix.
  • col = "#3399FF": Sets the point color to blue.
  • pch = 16: Sets the point shape to filled circles.
  • cex = 0.7: Sets the point size to 0.7.
  • bg = "white": Sets the background color to white.
  • lower.panel = panel.smooth: Adds smooth trend lines in the lower panels.
  • diag.panel = NULL: Removes the diagonal panels.
  • labels: Sets the axis labels for each feature.
  • font.labels = 2: Sets the font weight for labels to bold.
  • col.axis = "#666666": Sets the axis color to gray.
  • cex.axis = 0.8: Sets the axis label size to 0.8.
  • las = 1: Sets the axis label orientation to horizontal.
  • main = "Scatterplot Matrix of Features": Sets the main title of the plot.
# Create faceted scatter plots with ggplot2
ggplot(data2022, aes(x = avg.age, y = possession)) +
  geom_point(aes(color = overall), size = 3) +  # Color points based on 'overall' category
  geom_smooth(method = "lm", se = FALSE, color = "black") +  # Add linear regression lines
  facet_grid(. ~ overall) +  # Separate plots by 'overall' category
  labs(x = "Average Age", y = "Possession", color = "Overall") +  # Adjust axis labels and color legend title
  ggtitle("Relationships Among Features") +  # Add title
  theme_minimal()  # Adjust theme for better visualization
## `geom_smooth()` using formula = 'y ~ x'

Scatter Plot with Linear Regression Lines by Overall Category

This code generates a scatter plot with linear regression lines for each category in the “overall” variable.

Code Explanation:

  • ggplot(): Initializes the ggplot object with the dataset data2022 and aesthetic mappings.
  • geom_point(): Adds points to the plot, with color mapped to the “overall” category.
  • geom_smooth(): Adds a linear regression line to each plot without confidence intervals.
  • facet_grid(. ~ overall): Separates plots into facets based on the “overall” category.
  • labs(): Sets the labels for x-axis, y-axis, and color legend.
  • ggtitle(): Sets the title of the plot.
  • theme_minimal(): Applies a minimal theme to the plot for better visualization.

Regression Analysis to predict points

# Splitting the data into predictors and response variable
X_train <- data2022[, -c(1, 11)] # Exclude 'teams' and 'overall' columns
y_train <- data2022$points

X_test <- data2023[, -c(1, 11)] # Exclude 'teams' and 'overall' columns
y_test <- data2023$points

Splitting Data into Train and Test Sets

This code snippet splits the data into predictor variables (X_train, X_test) and the response variable (y_train, y_test).

  • X_train: Contains the variables for the training set, excluding columns 1 and 11 (‘teams’ and ‘overall’) from the data2022 dataset. Note that the ‘points’ column remains in X_train; the model formulas below use it as the response.
  • y_train: Contains the response variable (target) for the training set, which is the ‘points’ column from the data2022 dataset.
  • X_test: Contains the corresponding variables for the test set, excluding ‘teams’ and ‘overall’ from the data2023 dataset.
  • y_test: Contains the response variable for the test set, which is the ‘points’ column from the data2023 dataset.
# Model building
# We can use different regression models such as linear regression, decision tree regression, random forest regression, etc
# Linear Regression Model
lm_model <- lm(points ~ ., data = X_train)

Building Linear Regression Model

This code snippet builds a linear regression model using the training data X_train and y_train.

  • lm_model: Represents the linear regression model object fitted to the training data. It uses the lm() function, where points is the response variable, and . denotes using all other variables in X_train as predictors.
summary(lm_model)
## 
## Call:
## lm(formula = points ~ ., data = X_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9517  -2.6876   0.6733   4.1046  11.8263 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -22.91666   87.94433  -0.261    0.799
## avg.age         -1.81747    2.66170  -0.683    0.509
## possession      -0.05721    1.57284  -0.036    0.972
## redcards        -0.26928    3.10938  -0.087    0.933
## prog.carries     0.00691    0.03111   0.222    0.828
## prog.passes      0.03787    0.03176   1.193    0.258
## avgdist.gk      -0.86768    0.57210  -1.517    0.158
## miscontrols      0.01243    0.05724   0.217    0.832
## aerialswon.pct   1.47297    0.94482   1.559    0.147
## 
## Residual standard error: 9.078 on 11 degrees of freedom
## Multiple R-squared:  0.8725, Adjusted R-squared:  0.7798 
## F-statistic: 9.411 on 8 and 11 DF,  p-value: 0.0005902

Summary of Linear Regression Model

The summary() function provides a detailed summary of the linear regression model lm_model, including coefficients, standard errors, t-values, p-values, and goodness-of-fit statistics.

  • Coefficients: Estimates of the coefficients for each predictor variable.
  • Standard Error: The standard error associated with each coefficient estimate.
  • t-value: The t-statistic associated with each coefficient estimate.
  • Pr(>|t|): The p-value associated with each t-value, indicating the significance of each predictor.
  • Residual Standard Error: An estimate of the standard deviation of the residuals.
  • R-squared: The coefficient of determination, indicating the proportion of variance in the response variable explained by the predictors.
  • Adjusted R-squared: A version of R-squared adjusted for the number of predictors in the model.
  • F-statistic: The F-statistic for the overall significance of the model.
  • p-value: The p-value associated with the F-statistic.

This summary provides valuable information about the model’s performance and the significance of individual predictors in explaining the variability in the response variable.
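For reference, the same quantities can also be pulled out of the summary object programmatically (these are standard components of summary.lm), which is convenient for building tables or reports:

s <- summary(lm_model)
s$coefficients     # estimates, standard errors, t values, p values
s$r.squared        # multiple R-squared
s$adj.r.squared    # adjusted R-squared
s$sigma            # residual standard error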

# Decision Tree Regression Model
dt_model <- rpart(points ~ ., data = X_train)

Decision Tree Regression Model

The rpart() function fits a decision tree regression model to the training data X_train with the response variable points. This model will be used to predict the points based on the predictor variables.

The decision tree regression model recursively partitions the predictor space into distinct regions, where each region corresponds to a specific prediction. This is achieved by splitting the predictor variables based on their values, aiming to minimize the variance of the response variable within each partition.

Once the model is trained, it can be used to make predictions on new data by traversing the decision tree from the root node to the leaf nodes, where each leaf node represents a specific prediction. After fitting the model, we can use it to predict the points for the test data X_test using the predict() function.

summary(dt_model)
## Call:
## rpart(formula = points ~ ., data = X_train)
##   n= 20 
## 
##          CP nsplit rel error   xerror      xstd
## 1 0.5127455      0 1.0000000 1.078438 0.3261939
## 2 0.0100000      1 0.4872545 1.078438 0.3261939
## 
## Variable importance
##   prog.carries    prog.passes     possession aerialswon.pct    miscontrols 
##             32             23             18             14              9 
##        avg.age 
##              5 
## 
## Node number 1: 20 observations,    complexity param=0.5127455
##   mean=52.6, MSE=355.54 
##   left son=2 (13 obs) right son=3 (7 obs)
##   Primary splits:
##       prog.carries   < 725.5  to the left,  improve=0.5127455, (0 missing)
##       prog.passes    < 1402.5 to the left,  improve=0.4698490, (0 missing)
##       possession     < 51.45  to the left,  improve=0.4410750, (0 missing)
##       aerialswon.pct < 52.35  to the left,  improve=0.3459724, (0 missing)
##       miscontrols    < 593    to the right, improve=0.2278225, (0 missing)
##   Surrogate splits:
##       prog.passes    < 1557.5 to the left,  agree=0.90, adj=0.714, (0 split)
##       possession     < 52.5   to the left,  agree=0.85, adj=0.571, (0 split)
##       aerialswon.pct < 52.35  to the left,  agree=0.80, adj=0.429, (0 split)
##       miscontrols    < 593    to the right, agree=0.75, adj=0.286, (0 split)
##       avg.age        < 25.9   to the right, agree=0.70, adj=0.143, (0 split)
## 
## Node number 2: 13 observations
##   mean=42.69231, MSE=107.4438 
## 
## Node number 3: 7 observations
##   mean=71, MSE=295.4286

Summary of Decision Tree Regression Model

The output above comes from summary() applied to an rpart object: it prints the complexity-parameter (cp) table, the variable importance scores, and the details of each node. For a more compact view, the printcp() function prints just the cp table, showing the cross-validated error (xerror) for each candidate subtree; this table helps assess the complexity of the decision tree model and guides the choice of a cp value for pruning.
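A sketch of that pruning workflow is shown below; here the tree has a single split and a flat cross-validated error, so the rule would prune back toward the root, and the call is included only to illustrate the usual steps.

printcp(dt_model)   # compact cp table only
# pick the cp with the lowest cross-validated error, then prune
best_cp <- dt_model$cptable[which.min(dt_model$cptable[, "xerror"]), "CP"]
dt_pruned <- prune(dt_model, cp = best_cp)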

# Random Forest Regression Model
rf_model <- randomForest(points ~ ., data = X_train)

Building Random Forest Model

The randomForest function is used to build a random forest regression model (rf_model). This model is trained to predict the number of points (points) based on all other variables in the training dataset (X_train). The formula points ~ . specifies that points is the response variable to be predicted, and . represents all other variables in the dataset. The data argument specifies the training dataset (X_train).

rf_model
## 
## Call:
##  randomForest(formula = points ~ ., data = X_train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 149.4146
##                     % Var explained: 57.98

Random Forest Regression Model Summary

The rf_model object contains the Random Forest Regression Model that was built using the randomForest package in R. This model is capable of predicting the number of points for Premier League teams based on various predictor variables.

Model Information:

  • Type of Random Forest: Regression
  • Number of Trees: 500
  • No. of Variables Tried at Each Split: 2

Model Performance:

  • Mean of Squared Residuals: 149.41
  • % Variance Explained: 57.98% (as reported in the output above; random forest results vary slightly between runs unless a random seed is set)

This summary provides information about the model’s performance and characteristics based on the training data.
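To see which features drive the forest's predictions (useful for the insights step outlined in the introduction), the randomForest package provides variable-importance utilities:

importance(rf_model)   # IncNodePurity per predictor (regression forests)
varImpPlot(rf_model)   # dot chart of the same importance values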

Model Evaluation

# We can evaluate the performance of each model using metrics such as RMSE, R-squared, etc.

# Function to calculate RMSE
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Predictions
lm_pred <- predict(lm_model, newdata = X_test)
dt_pred <- predict(dt_model, newdata = X_test)
rf_pred <- predict(rf_model, newdata = X_test)

RMSE Function

The rmse function calculates the Root Mean Squared Error (RMSE) between the actual and predicted values.

  • actual: Represents the actual values.
  • predicted: Represents the predicted values.
  • Calculation: Squared differences between the actual and predicted values are computed, followed by the mean of these squared differences. Finally, the square root of this mean is returned, which is the RMSE.

After defining the rmse function, predictions are made using three different models on the test dataset:

  • Linear Regression Model (lm_model)
  • Decision Tree Model (dt_model)
  • Random Forest Model (rf_model)

The predictions are stored in the following variables:

  • lm_pred: Predictions made by the Linear Regression model.
  • dt_pred: Predictions made by the Decision Tree model.
  • rf_pred: Predictions made by the Random Forest model.
# Calculate RMSE for each model
lm_rmse <- rmse(y_test, lm_pred)
dt_rmse <- rmse(y_test, dt_pred)
rf_rmse <- rmse(y_test, rf_pred)

RMSE Calculation and Comparison

This code calculates the Root Mean Squared Error (RMSE) for each model’s predictions on the test data and displays the results. Lower RMSE values indicate better model performance. You can compare the RMSE values to determine which model performs best for your regression task.

# Compare RMSE
cat("Linear Regression RMSE:", lm_rmse, "\n")
## Linear Regression RMSE: 13.2977
cat("Decision Tree Regression RMSE:", dt_rmse, "\n")
## Decision Tree Regression RMSE: 16.35151
cat("Random Forest Regression RMSE:", rf_rmse, "\n")
## Random Forest Regression RMSE: 11.53972
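Alongside RMSE, a test-set R-squared is sometimes reported. A small helper in the same style as the rmse() function above (a sketch, not part of the original script) could be:

# Sketch: test-set R-squared, defined as 1 - SSE/SST
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}
r_squared(y_test, lm_pred)   # e.g., for the linear regression predictions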

Model Evaluation and Selection

The RMSE indicates how far, on average, the predicted number of points deviates from the actual number of points. By this metric, our best model is Random Forest Regression; however, I would choose Linear Regression given that the three RMSE values are quite close, and Linear Regression is a technique that is simpler to explain to the project stakeholders.

# Convert all columns to numeric in the training dataset
X_train<- data.frame(lapply(X_train, as.numeric))

# Convert all columns to numeric in the testing dataset
X_test<- data.frame(lapply(X_test, as.numeric))

This code uses the lapply function to apply the as.numeric function to each column of the datasets X_train and X_test, converting them to numeric type. The result is stored in new data frames X_train and X_test, respectively, with all columns now in numeric format.

str(X_train)
## 'data.frame':    20 obs. of  9 variables:
##  $ avg.age       : num  24.4 26.1 25.4 26.2 28.5 27.4 27.1 26.7 26.1 26.5 ...
##  $ possession    : num  52.8 46.5 44.8 54.4 40.2 61.8 51.1 40 52.3 52 ...
##  $ redcards      : num  4 2 3 2 2 1 1 6 3 1 ...
##  $ prog.carries  : num  734 653 450 676 388 894 592 512 671 622 ...
##  $ prog.passes   : num  1655 1300 1189 1547 972 ...
##  $ points        : num  69 45 46 51 35 74 48 39 38 52 ...
##  $ avgdist.gk    : num  7.3 15.4 15.8 9.3 17 18.8 11.2 10.7 23.7 10.7 ...
##  $ miscontrols   : num  557 595 563 598 540 591 671 640 676 601 ...
##  $ aerialswon.pct: num  44.7 49 49.6 53.3 53 52.5 49.7 46.7 42.9 48.9 ...
str(X_test)
## 'data.frame':    20 obs. of  9 variables:
##  $ avg.age       : num  24.7 27 26.3 26.2 26.3 26.3 26.7 26.6 28.2 25.3 ...
##  $ possession    : num  59.3 49.3 40.4 43.8 60.2 58.7 46.3 42.8 48.8 47 ...
##  $ redcards      : num  0 1 0 1 0 3 3 2 1 3 ...
##  $ prog.carries  : num  824 637 504 392 809 829 556 500 662 555 ...
##  $ prog.passes   : num  2049 1242 1006 1131 1849 ...
##  $ points        : num  84 61 39 59 62 44 45 36 52 31 ...
##  $ avgdist.gk    : num  16.1 14.2 10.5 15.7 12.3 13.7 13.1 5 14 11.6 ...
##  $ miscontrols   : num  526 565 589 520 558 568 643 605 535 602 ...
##  $ aerialswon.pct: num  46.5 48.7 49.2 51.8 53.4 52.1 46.2 48.7 48.2 49.1 ...

The code run above reports the data types and dimensions of the data frames, along with the first few values of each variable, giving us a better understanding of their structure.

Now I will create a new variable to add up features that are highly correlated.

X_train$combined_effect <- X_train$possession + X_train$prog.carries + X_train$prog.passes
X_test$combined_effect <- X_test$possession + X_test$prog.carries + X_test$prog.passes

The lines above create a new variable called combined_effect in both the training and testing datasets. The combined_effect variable is calculated as the sum of three existing, highly correlated variables: possession, prog.carries, and prog.passes.

X_train$combined_effect <- X_train$possession + X_train$prog.carries + X_train$prog.passes: This line computes the combined_effect variable for the training dataset (X_train) by adding the values of possession, prog.carries, and prog.passes for each observation.

X_test$combined_effect <- X_test$possession + X_test$prog.carries + X_test$prog.passes: Similarly, this line calculates the combined_effect variable for the testing dataset (X_test) using the same formula.

After executing these lines, both the training and testing datasets will contain a new variable combined_effect, which represents the combined effect of possession, progressive carries, and progressive passes. This new variable can be used as a predictor in subsequent analyses or models.
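As a quick sanity check (a sketch run at this point, while the component columns are still present), the new variable should correlate strongly with each of its components, which is why those components are dropped below:

cor(X_train$combined_effect,
    X_train[, c("possession", "prog.carries", "prog.passes")])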

str(X_train)
## 'data.frame':    20 obs. of  10 variables:
##  $ avg.age        : num  24.4 26.1 25.4 26.2 28.5 27.4 27.1 26.7 26.1 26.5 ...
##  $ possession     : num  52.8 46.5 44.8 54.4 40.2 61.8 51.1 40 52.3 52 ...
##  $ redcards       : num  4 2 3 2 2 1 1 6 3 1 ...
##  $ prog.carries   : num  734 653 450 676 388 894 592 512 671 622 ...
##  $ prog.passes    : num  1655 1300 1189 1547 972 ...
##  $ points         : num  69 45 46 51 35 74 48 39 38 52 ...
##  $ avgdist.gk     : num  7.3 15.4 15.8 9.3 17 18.8 11.2 10.7 23.7 10.7 ...
##  $ miscontrols    : num  557 595 563 598 540 591 671 640 676 601 ...
##  $ aerialswon.pct : num  44.7 49 49.6 53.3 53 52.5 49.7 46.7 42.9 48.9 ...
##  $ combined_effect: num  2442 2000 1684 2277 1400 ...
str(X_test)
## 'data.frame':    20 obs. of  10 variables:
##  $ avg.age        : num  24.7 27 26.3 26.2 26.3 26.3 26.7 26.6 28.2 25.3 ...
##  $ possession     : num  59.3 49.3 40.4 43.8 60.2 58.7 46.3 42.8 48.8 47 ...
##  $ redcards       : num  0 1 0 1 0 3 3 2 1 3 ...
##  $ prog.carries   : num  824 637 504 392 809 829 556 500 662 555 ...
##  $ prog.passes    : num  2049 1242 1006 1131 1849 ...
##  $ points         : num  84 61 39 59 62 44 45 36 52 31 ...
##  $ avgdist.gk     : num  16.1 14.2 10.5 15.7 12.3 13.7 13.1 5 14 11.6 ...
##  $ miscontrols    : num  526 565 589 520 558 568 643 605 535 602 ...
##  $ aerialswon.pct : num  46.5 48.7 49.2 51.8 53.4 52.1 46.2 48.7 48.2 49.1 ...
##  $ combined_effect: num  2932 1928 1550 1567 2718 ...
# Drop the components of combined_effect and re-extract the response variable
X_train <- X_train[, -c(2,4, 5)] 
y_train <- X_train$points

X_test <- X_test[, -c(2,4,5)] 
y_test <- X_test$points

In the cell above, we have just removed the three variables used to create the new calculated feature, combined_effect.

str(X_train)
## 'data.frame':    20 obs. of  7 variables:
##  $ avg.age        : num  24.4 26.1 25.4 26.2 28.5 27.4 27.1 26.7 26.1 26.5 ...
##  $ redcards       : num  4 2 3 2 2 1 1 6 3 1 ...
##  $ points         : num  69 45 46 51 35 74 48 39 38 52 ...
##  $ avgdist.gk     : num  7.3 15.4 15.8 9.3 17 18.8 11.2 10.7 23.7 10.7 ...
##  $ miscontrols    : num  557 595 563 598 540 591 671 640 676 601 ...
##  $ aerialswon.pct : num  44.7 49 49.6 53.3 53 52.5 49.7 46.7 42.9 48.9 ...
##  $ combined_effect: num  2442 2000 1684 2277 1400 ...
str(X_test)
## 'data.frame':    20 obs. of  7 variables:
##  $ avg.age        : num  24.7 27 26.3 26.2 26.3 26.3 26.7 26.6 28.2 25.3 ...
##  $ redcards       : num  0 1 0 1 0 3 3 2 1 3 ...
##  $ points         : num  84 61 39 59 62 44 45 36 52 31 ...
##  $ avgdist.gk     : num  16.1 14.2 10.5 15.7 12.3 13.7 13.1 5 14 11.6 ...
##  $ miscontrols    : num  526 565 589 520 558 568 643 605 535 602 ...
##  $ aerialswon.pct : num  46.5 48.7 49.2 51.8 53.4 52.1 46.2 48.7 48.2 49.1 ...
##  $ combined_effect: num  2932 1928 1550 1567 2718 ...

Now I will complete the same steps as above, but this time building models with the new variable, combined_effect, in place of its components. The goal is to reduce the multicollinearity introduced by possession, prog.carries, and prog.passes (their pairwise correlations are all above 0.9), which inflates the variance of the coefficient estimates.
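That multicollinearity can be quantified with variance inflation factors, for instance via the car package (an assumption here, since car is not loaded anywhere in this script); values well above 5 to 10 are a common warning sign:

# Sketch, assuming the car package is installed
library(car)
vif(lm_model)   # the original model should show inflated VIFs for the
                # correlated trio: possession, prog.carries, prog.passes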

lm_model2 <- lm(points ~ ., data = X_train)
summary(lm_model2)
## 
## Call:
## lm(formula = points ~ ., data = X_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.043  -3.376   1.403   5.028  10.014 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -30.925574  65.400274  -0.473   0.6441    
## avg.age          -1.828732   2.388803  -0.766   0.4576    
## redcards          0.305240   1.898912   0.161   0.8748    
## avgdist.gk       -0.817785   0.511347  -1.599   0.1338    
## miscontrols       0.008488   0.046871   0.181   0.8591    
## aerialswon.pct    1.572217   0.873868   1.799   0.0952 .  
## combined_effect   0.027663   0.004370   6.330 2.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.523 on 13 degrees of freedom
## Multiple R-squared:  0.8672, Adjusted R-squared:  0.8059 
## F-statistic: 14.15 on 6 and 13 DF,  p-value: 4.992e-05
dt_model2 <- rpart(points ~ ., data = X_train)
summary(dt_model2)
## Call:
## rpart(formula = points ~ ., data = X_train)
##   n= 20 
## 
##          CP nsplit rel error   xerror      xstd
## 1 0.6040661      0 1.0000000 1.092622 0.3379016
## 2 0.0100000      1 0.3959339 1.092622 0.3379016
## 
## Variable importance
## combined_effect  aerialswon.pct     miscontrols      avgdist.gk        redcards 
##              44              31              12               6               6 
## 
## Node number 1: 20 observations,    complexity param=0.6040661
##   mean=52.6, MSE=355.54 
##   left son=2 (13 obs) right son=3 (7 obs)
##   Primary splits:
##       combined_effect < 2241.85 to the left,  improve=0.6040661, (0 missing)
##       aerialswon.pct  < 52.35   to the left,  improve=0.3459724, (0 missing)
##       miscontrols     < 593     to the right, improve=0.2278225, (0 missing)
##       redcards        < 1.5     to the right, improve=0.2170492, (0 missing)
##       avg.age         < 26.35   to the left,  improve=0.1011259, (0 missing)
##   Surrogate splits:
##       aerialswon.pct < 52.35   to the left,  agree=0.90, adj=0.714, (0 split)
##       miscontrols    < 593     to the right, agree=0.75, adj=0.286, (0 split)
##       redcards       < 1.5     to the right, agree=0.70, adj=0.143, (0 split)
##       avgdist.gk     < 9.9     to the right, agree=0.70, adj=0.143, (0 split)
## 
## Node number 2: 13 observations
##   mean=41.84615, MSE=101.9763 
## 
## Node number 3: 7 observations
##   mean=72.57143, MSE=212.8163
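
Since the fitted tree makes a single split, on combined_effect at roughly 2242, it is easy to inspect visually; below is a quick sketch using rpart's base plotting functions.

# Draw the single-split tree and label the nodes with means and observation counts
plot(dt_model2, uniform = TRUE, margin = 0.1)
text(dt_model2, use.n = TRUE)
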
rf_model2 <- randomForest(points ~ ., data = X_train)
rf_model2
## 
## Call:
##  randomForest(formula = points ~ ., data = X_train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 204.3953
##                     % Var explained: 42.51
# Predictions
lm_pred <- predict(lm_model2, newdata = X_test)
dt_pred <- predict(dt_model2, newdata = X_test)
rf_pred <- predict(rf_model2, newdata = X_test)
# Calculate RMSE for each model
lm_rmse <- rmse(y_test, lm_pred)
dt_rmse <- rmse(y_test, dt_pred)
rf_rmse <- rmse(y_test, rf_pred)
# Compare RMSE
cat("Linear Regression RMSE:", lm_rmse, "\n")
## Linear Regression RMSE: 13.28647
cat("Decision Tree Regression RMSE:", dt_rmse, "\n")
## Decision Tree Regression RMSE: 12.23912
cat("Random Forest Regression RMSE:", rf_rmse, "\n")
## Random Forest Regression RMSE: 11.79779

These values are similar to the ones obtained above, with the notable difference that the Decision Tree Regression now performs much better. I will continue analyzing the results to identify the optimal model for this project.
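
For easier comparison, the three test RMSE values can be collected into a single data frame and ranked from best to worst (a minimal sketch; rmse_results is a new, illustrative name):

# Gather the test RMSEs into one table and sort by accuracy
rmse_results <- data.frame(
  model = c("Linear Regression", "Decision Tree", "Random Forest"),
  test_rmse = c(lm_rmse, dt_rmse, rf_rmse)
)
rmse_results[order(rmse_results$test_rmse), ]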

# Residual analysis for the first linear regression model
lm_resid <- resid(lm_model)
plot(lm_resid ~ fitted(lm_model), main = "Residuals vs Fitted Values",
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")  # Add a horizontal reference line at y = 0

This plot helps us check whether the residuals have constant variance across the range of fitted values. Ideally, we want to see a random scatter of points around the horizontal line at y = 0, indicating homoskedasticity.
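
Base R can also produce this diagnostic directly from the fitted object; the first panel of plot.lm() is exactly Residuals vs Fitted, which guarantees the residuals are paired with their own fitted values:

# Equivalent built-in diagnostic: which = 1 selects the Residuals vs Fitted panel
plot(lm_model, which = 1)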

# Residual analysis for the second linear regression model
lm_resid2 <- resid(lm_model2)
plot(lm_resid2 ~ fitted(lm_model2), main = "Residuals vs Fitted Values",
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")  # Add a horizontal reference line at y = 0

# Calculate adjusted R-squared for linear regression model
summary(lm_model)$adj.r.squared
## [1] 0.7798001
summary(lm_model2)$adj.r.squared
## [1] 0.8059005

Adjusted R-squared penalizes model complexity by accounting for the number of predictors, so it does not automatically increase when variables are added the way plain R-squared does. Higher values indicate a better fit per predictor used. Based on our results (0.806 vs. 0.780), lm_model2 is the more reliable of the two.
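
As a sanity check, the adjusted value can be recomputed by hand from \( R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1} \), where \(n\) is the number of observations and \(p\) the number of predictors (a minimal sketch for lm_model2):

# Manual adjusted R-squared for lm_model2: n = 20 teams, p = 6 predictors
r2 <- summary(lm_model2)$r.squared
n <- nrow(X_train)
p <- length(coef(lm_model2)) - 1
1 - (1 - r2) * (n - 1) / (n - p - 1)  # matches the 0.8059 reported above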

# Analysis of Variance (ANOVA) for linear regression model
anova(lm_model)
anova(lm_model2)

ANOVA assesses the contribution of the model terms by comparing the variance they explain to the residual variance. The anova() function in R performs this analysis for a fitted linear regression model; here is what its output represents.

For lm_model and lm_model2:

The ANOVA table lists, for each predictor of the response variable (points), its sum of squares (Sum Sq), degrees of freedom (Df), mean square (Mean Sq), and an F-statistic with its p-value. Because anova() applied to a single model uses sequential (Type I) sums of squares, each row's F-test assesses the additional variation explained by that predictor given the predictors entered before it, rather than the overall model F-test reported by summary(). Here is the interpretation of each table:

For the first ANOVA table (lm_model):

  • The predictor variable possession has a highly significant effect on the points obtained by teams (p < 0.001).
  • The avgdist.gk variable also shows a significant effect on points (p = 0.04482).
  • The other predictor variables (avg.age, redcards, prog.carries, prog.passes, miscontrols, aerialswon.pct) do not show significant effects on points (p > 0.05).

For the second ANOVA table (lm_model2):

  • The predictor variables redcards, miscontrols, aerialswon.pct, and combined_effect show significant effects on points (p < 0.05).
  • The other predictor variables (avg.age, avgdist.gk) do not show significant effects on points (p > 0.05).

In summary, the ANOVA tests help identify which predictor variables are statistically significant in explaining the variation in points obtained by teams.
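
The per-predictor p-values can also be extracted from the table programmatically rather than read off by eye (a minimal sketch; anova_tab is an illustrative name):

# Pull each predictor's sequential F-test p-value out of the ANOVA table
anova_tab <- anova(lm_model2)
data.frame(term = rownames(anova_tab), p_value = anova_tab$`Pr(>F)`)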

#install.packages("lmtest")  # I always comment out package installation after the first run; this simplifies knitting to HTML and publishing to RPubs
# Breusch-Pagan test for heteroskedasticity
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.3
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(lm_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm_model
## BP = 7.8788, df = 8, p-value = 0.4454
bptest(lm_model2)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm_model2
## BP = 6.6748, df = 6, p-value = 0.352

The code above conducts the Breusch-Pagan test for heteroskedasticity on our two linear regression models, lm_model and lm_model2. The test determines whether the variance of the residuals depends on the values of the independent variables, which would indicate heteroskedasticity.

  • library(lmtest): Loads the lmtest package, which contains functions for conducting hypothesis tests related to linear models.

  • bptest(lm_model) and bptest(lm_model2): Conduct the studentized Breusch-Pagan test on each model, checking whether the variance of its residuals depends on the values of the independent variables.

A p-value below the chosen significance level (e.g., 0.05) indicates evidence of heteroskedasticity in the model; a larger p-value means we fail to reject the null hypothesis of homoskedasticity.

In this case, the results of the Breusch-Pagan test for heteroskedasticity are as follows:

For lm_model:

  • Test statistic (BP): 7.8788
  • Degrees of freedom (df): 8
  • p-value: 0.4454

For lm_model2:

  • Test statistic (BP): 6.6748
  • Degrees of freedom (df): 6
  • p-value: 0.352

In both cases, the p-values are greater than 0.05, which means we fail to reject the null hypothesis of homoskedasticity. Therefore, there is no significant evidence of heteroskedasticity in the models lm_model and lm_model2.
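
Had either test rejected the null hypothesis, a common remedy would be heteroskedasticity-consistent (robust) standard errors. Below is a sketch using the sandwich package (which would need to be installed) together with coeftest() from the already-loaded lmtest:

# Robust (HC3) standard errors, as a fallback if heteroskedasticity were detected
# install.packages("sandwich")
library(sandwich)
coeftest(lm_model2, vcov = vcovHC(lm_model2, type = "HC3"))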

Now, let’s proceed to save lm_model2 for future deployment and development.

# Save the trained model
saveRDS(lm_model2, "linear_regression_model.rds")
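
For completeness, the saved model can later be restored with readRDS() and used for scoring; a minimal sketch follows (X_test stands in here for whatever future-season data becomes available):

# Reload the saved model in a future session and generate predictions
loaded_model <- readRDS("linear_regression_model.rds")
head(predict(loaded_model, newdata = X_test))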