In this project, we aim to derive actionable insights for better performance in the English Premier League by leveraging data analytics and building predictive models. We will use multiple modeling techniques (Linear Regression, Decision Tree Regression, and Random Forest) to predict the number of points earned by each team over a league season. The models will be trained on data from the 2021-22 season and tested on data from the subsequent season, 2022-23.
Data Collection: Gather data from the English Premier League for the 2021-22 and 2022-23 seasons, including information about the number of points earned by each team and relevant numeric features.
Data Preprocessing: Clean and preprocess the data to handle missing values, outliers, and inconsistencies. Convert categorical variables into numerical format if necessary.
Model Building: Train the candidate regression models (Linear Regression, Decision Tree Regression, and Random Forest Regression) on the 2021-22 training data to predict points.
Model Evaluation: Evaluate the performance of each model using appropriate metrics such as RMSE, R-squared, etc. Compare performance on both training and testing datasets to ensure generalization.
Insights and Recommendations: Analyze model results to identify actionable insights for performance improvement in the English Premier League. This includes understanding feature importance and developing strategies based on model predictions.
Deployment and Monitoring: Deploy the best-performing model for real-world use and continuously monitor its performance. Update the model as needed based on new data and insights.
By following this approach, we aim to utilize data-driven insights to optimize team performance and drive success in the English Premier League.
Linear regression is a statistical method used to model the relationship between a dependent variable \(Y\) and one or more independent variables \(X\). It assumes that this relationship can be approximated by a linear equation:
\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_nX_n + \varepsilon \]
where:

- \(Y\) is the dependent variable (response),
- \(X_1, X_2, \ldots, X_n\) are the independent variables (features),
- \(\beta_0, \beta_1, \ldots, \beta_n\) are the coefficients, and
- \(\varepsilon\) is the error term.
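The coefficients are typically estimated by ordinary least squares, that is, by choosing the values of \(\beta_0, \ldots, \beta_n\) that minimize the sum of squared residuals over the \(m\) training observations:

\[ \hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{m} \left( y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_n x_{in} \right)^2 \]

This is what R's lm() function computes when we fit the models below.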
A decision tree is a tree-like structure where each internal node represents a test on a feature, each branch represents the outcome of the test, and each leaf node represents a decision or prediction. Decision tree regression works by partitioning the feature space into regions and predicting the response variable based on the average value of the training instances within each region.
Random forest is an ensemble learning method that constructs multiple decision trees during training and outputs the average prediction of the individual trees. It improves upon decision tree regression by reducing overfitting and increasing predictive accuracy.
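As a quick illustration of how these three model families share the same formula interface in R, here is a minimal sketch on simulated data (not the league data used below):

# Illustrative only: toy data with a known linear signal plus noise
library(rpart)
library(randomForest)

set.seed(42)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(100)

lm_fit <- lm(y ~ ., data = toy)            # linear regression
dt_fit <- rpart(y ~ ., data = toy)         # decision tree regression
rf_fit <- randomForest(y ~ ., data = toy)  # random forest regression

All three fitted objects can then be passed to predict() with new data, which is the pattern used throughout this analysis.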
library(rpart)
The library(rpart) statement loads the rpart package into the current working environment. The rpart package provides functions for building and visualizing decision trees; loading it gives us access to the rpart() function for creating decision tree models and to the plotting functions the package provides.
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
The randomForest package is loaded with the library() function. This package provides functions for building random forest models, a popular ensemble learning technique used for classification and regression tasks.
I will be commenting on my syntax and output right below each cell.
# read the CSV with headers
data2022<-read.csv("season22.csv", header=T,sep =",")
str(data2022)
## 'data.frame': 20 obs. of 11 variables:
## $ teams : chr "Arsenal" "Aston Villa" "Brentford" "Brighton" ...
## $ avg.age : num 24.4 26.1 25.4 26.2 28.5 27.4 27.1 26.7 26.1 26.5 ...
## $ possession : num 52.8 46.5 44.8 54.4 40.2 61.8 51.1 40 52.3 52 ...
## $ redcards : int 4 2 3 2 2 1 1 6 3 1 ...
## $ prog.carries : int 734 653 450 676 388 894 592 512 671 622 ...
## $ prog.passes : int 1655 1300 1189 1547 972 1967 1229 1100 1483 1356 ...
## $ points : int 69 45 46 51 35 74 48 39 38 52 ...
## $ avgdist.gk : num 7.3 15.4 15.8 9.3 17 18.8 11.2 10.7 23.7 10.7 ...
## $ miscontrols : int 557 595 563 598 540 591 671 640 676 601 ...
## $ aerialswon.pct: num 44.7 49 49.6 53.3 53 52.5 49.7 46.7 42.9 48.9 ...
## $ overall : chr "europe" "safe" "safe" "safe" ...
In this cell, we use the read.csv() function to read the file "season22.csv" into R, specifying that the file has a header row (header = T) and comma-separated values (sep = ","). The result is assigned to the variable data2022. Finally, the str() function displays the structure of data2022: the number of observations, each variable's type, and its first few values.
data2023<-read.csv("season23.csv", header=T,sep =",")
str(data2023)
## 'data.frame': 20 obs. of 11 variables:
## $ teams : chr "Arsenal" "Aston Villa" "Bournemouth" "Brentford" ...
## $ avg.age : num 24.7 27 26.3 26.2 26.3 26.3 26.7 26.6 28.2 25.3 ...
## $ possession : num 59.3 49.3 40.4 43.8 60.2 58.7 46.3 42.8 48.8 47 ...
## $ redcards : int 0 1 0 1 0 3 3 2 1 3 ...
## $ prog.carries : int 824 637 504 392 809 829 556 500 662 555 ...
## $ prog.passes : int 2049 1242 1006 1131 1849 1737 1243 1052 1326 1429 ...
## $ points : int 84 61 39 59 62 44 45 36 52 31 ...
## $ avgdist.gk : num 16.1 14.2 10.5 15.7 12.3 13.7 13.1 5 14 11.6 ...
## $ miscontrols : int 526 565 589 520 558 568 643 605 535 602 ...
## $ aerialswon.pct: num 46.5 48.7 49.2 51.8 53.4 52.1 46.2 48.7 48.2 49.1 ...
## $ overall : chr "europe" "europe" "safe" "safe" ...
We use read.csv() to read a second file, "season23.csv", into R, again specifying a header row (header = T) and comma-separated values (sep = ","). The resulting dataset is assigned to data2023, and str() displays its structure: the number of observations, each variable's type, and its first few values.
# Load required libraries
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
Histograms/Density Plots
ggplot(data2022, aes(x = points)) +
geom_histogram(fill = "skyblue", color = "blue", bins = 20) + # Change fill color, outline color, and number of bins
ggtitle("Distribution of Number of Points by Teams") +
xlab("Number of Points") + # Add x-axis label
ylab("Frequency") + # Add y-axis label
theme_minimal() # Change theme to minimal
In this section, we use the ggplot2 package to create a histogram visualizing the distribution of points earned by Premier League teams in the 2021-22 season (data2022).
- `ggplot(data2022, aes(x = points))`: specifies the dataset data2022 and maps the variable points to the x-axis.
- `geom_histogram()`: adds the histogram layer to the plot.
- `fill = "skyblue", color = "blue"`: specifies the fill color of the bars as "skyblue" and the outline color as "blue".
- `bins = 20`: sets the number of bins (bars) in the histogram to 20.
- `ggtitle("Distribution of Number of Points by Teams")`: sets the title of the plot.
- `xlab("Number of Points")`: sets the label for the x-axis.
- `ylab("Frequency")`: sets the label for the y-axis.
- `theme_minimal()`: applies the minimal theme, which removes unnecessary background elements and gridlines.

ggplot(data2023, aes(x = possession, y = points)) +
geom_point(color = "darkblue", size = 3, shape = 16) + # Change point color, size, and shape
ggtitle("Scatter Plot of Points vs. Possession") +
xlab("Possession") + # Add x-axis label
ylab("Points") + # Add y-axis label
theme_minimal() # Change theme to minimal
This code created a scatter plot to visualize the relationship between possession and points using the ggplot2 package.
- `ggplot(data2023, aes(x = possession, y = points))`: sets up the initial ggplot object with the data2023 data frame, plotting 'possession' on the x-axis and 'points' on the y-axis.
- `geom_point(color = "darkblue", size = 3, shape = 16)`: adds a layer of points, specifying the color as "darkblue", size as 3, and shape as 16.
- `ggtitle("Scatter Plot of Points vs. Possession")`: adds a title to the plot.
- `xlab("Possession")`: adds a label to the x-axis.
- `ylab("Points")`: adds a label to the y-axis.
- `theme_minimal()`: changes the theme of the plot to a minimalistic style.

Overall, this code creates a scatter plot showing the relationship between possession and points, with specified colors, labels, and a minimalistic theme.
ggplot(data2023, aes(x = possession, y = points, label = teams)) +
geom_point(color = "darkblue", size = 3, shape = 16) + # Change point color, size, and shape
geom_text(hjust = -0.1, vjust = -0.5, size = 3, color = "black") + # Add team names
ggtitle("Scatter Plot of Points vs. Possession") +
xlab("Possession") + # Add x-axis label
ylab("Points") + # Add y-axis label
theme_minimal() # Change theme to minimal
Here we observe a positive correlation between possession and success (points) in the English Premier League.
This code creates a scatter plot to visualize the relationship between possession and points, with team names as labels, using the ggplot2 package.
- `ggplot(data2023, aes(x = possession, y = points, label = teams))`: sets up the initial ggplot object with the data2023 data frame, plotting 'possession' on the x-axis and 'points' on the y-axis, with 'teams' as the label for each point.
- `geom_point(color = "darkblue", size = 3, shape = 16)`: adds a layer of points, specifying the color as "darkblue", size as 3, and shape as 16.
- `geom_text(hjust = -0.1, vjust = -0.5, size = 3, color = "black")`: adds the team-name labels at the coordinates given by 'possession' and 'points', adjusting the horizontal and vertical justification, font size, and color.
- `ggtitle("Scatter Plot of Points vs. Possession")`: adds a title to the plot.
- `xlab("Possession")`: adds a label to the x-axis.
- `ylab("Points")`: adds a label to the y-axis.
- `theme_minimal()`: changes the theme of the plot to a minimalistic style.

Overall, this code creates a scatter plot showing the relationship between possession and points, with team names as labels, specified colors, and a minimalistic theme.
ggplot(data2022, aes(x = possession, fill = "Possession")) +
geom_density(alpha = 0.5, color = "darkblue", fill = "lightblue") + # Change line and fill colors
ggtitle("Density Plot of Possession") +
xlab("Possession") + # Add x-axis label
ylab("Density") + # Add y-axis label
theme_minimal() # Change theme to minimal
This code creates a density plot to visualize the distribution of possession using the ggplot2 package.
- `ggplot(data2022, aes(x = possession, fill = "Possession"))`: sets up the initial ggplot object with the data2022 data frame, plotting 'possession' on the x-axis. The fill aesthetic is set to the constant "Possession", which produces a single density curve.
- `geom_density(alpha = 0.5, color = "darkblue", fill = "lightblue")`: adds a density layer, with transparency (alpha) 0.5, line color "darkblue", and fill color "lightblue".
- `ggtitle("Density Plot of Possession")`: adds a title to the plot.
- `xlab("Possession")`: adds a label to the x-axis.
- `ylab("Density")`: adds a label to the y-axis.
- `theme_minimal()`: changes the theme of the plot to a minimalistic style.

Overall, this code creates a density plot showing the distribution of possession, with specified colors, labels, and a minimalistic theme.
Barplots
# Reorder the levels of the "overall" variable
data2022$overall <- factor(data2022$overall, levels = c("europe", "safe", "relegation"))
# Plot the bar chart with reordered "overall" variable
ggplot(data2022, aes(x = overall)) +
geom_bar(fill = "skyblue", color = "blue") +
geom_text(stat = "count", aes(label = ..count..), vjust = -0.5, size = 3, position = position_stack(vjust = 0.5)) + # Corrected aesthetics
ggtitle("Distribution of Overall Categories") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # Adjust x-axis labels
labs(x = "Overall Category", y = "Frequency") # Add axis labels
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
This code creates a bar chart to visualize the distribution of the “overall” variable in the data2022 data frame.
- `data2022$overall <- factor(data2022$overall, levels = c("europe", "safe", "relegation"))`: reorders the levels of the "overall" variable so that "europe" appears first, followed by "safe" and then "relegation".
- `ggplot(data2022, aes(x = overall))`: sets up the initial ggplot object with 'overall' plotted on the x-axis.
- `geom_bar(fill = "skyblue", color = "blue")`: adds a bar layer, with fill color "skyblue" and outline color "blue".
- `geom_text(stat = "count", aes(label = ..count..), vjust = -0.5, size = 3, position = position_stack(vjust = 0.5))`: adds a text label to each bar showing the count of observations in that category; stat = "count" computes the counts, and position_stack(vjust = 0.5) centers the labels on the bars.
- `ggtitle("Distribution of Overall Categories")`: adds a title to the plot.
- `theme_minimal()`: changes the theme of the plot to a minimalistic style.
- `theme(axis.text.x = element_text(angle = 45, hjust = 1))`: rotates the x-axis labels 45 degrees and right-aligns them for better readability.
- `labs(x = "Overall Category", y = "Frequency")`: adds labels to the x-axis and y-axis.

Overall, this code creates a bar chart showing the distribution of the "overall" variable, with reordered categories, specified colors, labels, and a minimalistic theme.
Boxplots
ggplot(data2022, aes(x = overall, y = points)) + geom_boxplot() + ggtitle("Distribution of Points by Overall Category")
This code creates a boxplot to visualize the distribution of points across different categories of the ‘overall’ variable in the data2022 data frame.
- `ggplot(data2022, aes(x = overall, y = points))`: sets up the initial ggplot object with 'overall' on the x-axis and 'points' on the y-axis.
- `geom_boxplot()`: adds a layer of boxplots showing the distribution of 'points' within each category of 'overall'.
- `ggtitle("Distribution of Points by Overall Category")`: adds a title to the plot.

Overall, this code creates a boxplot showing the distribution of points across the categories of the 'overall' variable, providing a visual summary of the spread, central tendency, and outliers in the data.
# Define colors for each category
category_colors <- c("europe" = "blue", "safe" = "green", "relegation" = "red")
# Plot the boxplot with colors for each category
ggplot(data2022, aes(x = overall, y = points, fill = overall)) +
geom_boxplot() +
scale_fill_manual(values = category_colors) + # Apply custom colors
ggtitle("Distribution of Points by Overall Category") +
xlab("Overall Category") +
ylab("Points") +
theme_minimal()
This code creates a boxplot to visualize the distribution of points across different categories of the ‘overall’ variable in the data2022 data frame, with custom colors assigned to each category.
- `category_colors <- c("europe" = "blue", "safe" = "green", "relegation" = "red")`: defines a vector category_colors specifying custom colors for each category of the 'overall' variable.
- `ggplot(data2022, aes(x = overall, y = points, fill = overall))`: sets up the initial ggplot object with 'overall' on the x-axis, 'points' on the y-axis, and 'overall' mapped to the fill colors.
- `geom_boxplot()`: adds boxplots showing the distribution of 'points' within each category of 'overall'.
- `scale_fill_manual(values = category_colors)`: applies the custom colors defined in category_colors to fill the boxes.
- `ggtitle("Distribution of Points by Overall Category")`: adds a title to the plot.
- `xlab("Overall Category")` and `ylab("Points")`: add labels to the x-axis and y-axis, respectively.
- `theme_minimal()`: changes the theme of the plot to a minimalistic style.

Overall, this code creates a boxplot showing the distribution of points across the categories of the 'overall' variable, with custom colors assigned to each category for better visualization.
Scatterplots
# Plot the relationship between average age and points
ggplot(data2022, aes(x = avg.age, y = points, fill = overall, label = teams)) +
geom_point() +
ggtitle("Relationship between Average Age and Points") +
xlab("Average Age") +
ylab("Points") +
theme_minimal() +
scale_fill_manual(values = category_colors) + # Apply custom colors
geom_text(aes(color = overall), check_overlap = TRUE, vjust = -0.5) + # Add team names as labels with color based on 'overall'
scale_color_manual(values = category_colors) # Apply custom colors to text labels
This code creates a scatter plot to visualize the relationship between the average age and points of teams in the data2022 data frame.
- `ggplot(data2022, aes(x = avg.age, y = points, fill = overall, label = teams))`: sets up the initial ggplot object with 'avg.age' on the x-axis, 'points' on the y-axis, 'overall' mapped to the fill colors, and 'teams' as the labels.
- `geom_point()`: adds points representing the relationship between average age and points.
- `ggtitle("Relationship between Average Age and Points")`: adds a title to the plot.
- `xlab("Average Age")` and `ylab("Points")`: add labels to the x-axis and y-axis, respectively.
- `theme_minimal()`: changes the theme of the plot to a minimalistic style.
- `scale_fill_manual(values = category_colors)`: applies the custom colors defined in category_colors to the points, based on the 'overall' variable.
- `geom_text(aes(color = overall), check_overlap = TRUE, vjust = -0.5)`: adds the team names as text labels, colored by the 'overall' variable; check_overlap = TRUE prevents overlapping labels, and vjust = -0.5 adjusts their vertical position.
- `scale_color_manual(values = category_colors)`: applies the custom colors to the text labels.

Overall, this code creates a scatter plot showing the relationship between average age and points, with custom colors for points and labels based on the 'overall' variable and team names displayed as labels.
ggplot(data2022, aes(x = prog.passes, y = prog.carries, fill = overall, label = teams)) + geom_point() + ggtitle("Relationship between Progressive Carries and Number of Passes")+
xlab("Passing the Ball") +
ylab("Carrying the Ball") +
theme_minimal() +
scale_fill_manual(values = category_colors) + # Apply custom colors
geom_text(aes(color = overall), check_overlap = TRUE, vjust = -0.5) + # Add team names as labels with color based on 'overall'
scale_color_manual(values = category_colors) # Apply custom colors to text labels
This code generates a scatter plot to visualize the relationship between the number of progressive passes (x-axis) and the number of progressive carries (y-axis) for each team.
- `ggplot(data2022, aes(x = prog.passes, y = prog.carries, fill = overall, label = teams))`: specifies the data frame (data2022) and the aesthetic mappings: prog.passes on the x-axis, prog.carries on the y-axis, fill color based on 'overall', and labels based on 'teams'.
- `geom_point()`: adds points to the plot.
- `ggtitle("Relationship between Progressive Carries and Number of Passes")`: adds a title to the plot.
- `xlab("Passing the Ball")` and `ylab("Carrying the Ball")`: label the x and y axes.
- `theme_minimal()`: sets the plot theme to minimal.
- `scale_fill_manual(values = category_colors)`: applies the custom colors defined earlier in category_colors.
- `geom_text(aes(color = overall), check_overlap = TRUE, vjust = -0.5)`: adds team names as labels colored by the 'overall' variable; check_overlap = TRUE prevents overlapping labels, and vjust = -0.5 adjusts their vertical position.
- `scale_color_manual(values = category_colors)`: applies the custom colors to the text labels.

This visualization helps explore how progressive passes and progressive carries relate to each other for different teams, with colors representing the overall categories (Europe, Safe, Relegation).
Overall, this code creates a scatter plot showing the relationship between the number of progressive carries and the number of progressive passes, with custom colors for points and labels based on the ‘overall’ variable. Team names are also displayed as labels.
Barplots/Stacked Barplots
ggplot(data2022, aes(x = overall, fill = factor(redcards))) + geom_bar(position = "stack") + ggtitle("Average Red Cards by Overall Category")
This code generates a stacked bar plot showing how red-card counts are distributed within each category of the 'overall' variable.
- `ggplot(data2022, aes(x = overall, fill = factor(redcards)))`: sets up the initial ggplot object with 'overall' on the x-axis and 'redcards' (converted to a factor) mapped to the bar fill colors.
- `geom_bar(position = "stack")`: adds stacked bars; within each 'overall' category, the segments show how many teams recorded each red-card count.
- `ggtitle("Average Red Cards by Overall Category")`: adds a title to the plot.

This visualization helps compare red cards across the categories of the 'overall' variable: each bar's height is the number of teams in that category, and its segments break that count down by red-card total.

Overall, this code creates a stacked bar plot of red-card counts by 'overall' category, with the segments stacked to show the composition of each category.
# Pie Chart
ggplot(data2022, aes(x = "", fill = overall)) + geom_bar(width = 1) + coord_polar("y", start = 0) + ggtitle("Proportion of Teams by Overall Category")
This code generates a pie chart to illustrate the proportion of teams in each overall category.
ggplot(data2022, aes(x = "", fill = overall))
: Sets up
the initial ggplot object with the data2022 data frame. It creates a
categorical variable for the x-axis with an empty string “” and fills
the colors based on the ‘overall’ variable.geom_bar(width = 1)
: Adds a layer of bars to the plot,
each representing a proportion of teams in each category.coord_polar("y", start = 0)
: Converts the Cartesian
coordinates to polar coordinates, creating a pie chart. The “y” argument
specifies that the width of the bars should be scaled according to the
y-axis values. The start = 0 argument specifies the starting angle of
the chart.ggtitle("Proportion of Teams by Overall Category")
:
Adds a title to the plot.This visualization helps understand the distribution of teams across different overall categories. Each category is represented by a portion of the pie, with the size of each portion indicating the proportion of teams in that category.
Overall, this code generates a pie chart illustrating the proportion of teams in each overall category.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Calculate minimum points for each category
min_points <- data2022 %>%
group_by(overall) %>%
summarise(min_points = min(points, na.rm = TRUE))
# Plot the proportion of teams by overall category with minimum points as labels
ggplot(data2022, aes(x = "", fill = overall)) +
geom_bar(width = 1, position = "fill") + # Adjust position to fill the entire pie
geom_text(data = min_points, aes(y = min_points, label = min_points),
position = position_fill(vjust = 0.5), size = 6, color = "white") + # Increase label size and set color
coord_polar("y", start = 0) +
scale_fill_manual(values = c("europe" = "blue", "safe" = "green", "relegation" = "red")) + # Customize colors
ggtitle("Proportion of Teams by Overall Category") +
theme_minimal() # Adjust theme for better visualization
This code generates a pie chart to illustrate the proportion of teams in each overall category, along with text labels indicating the minimum points for each category.
- `library(dplyr)`: loads the dplyr package, which is used for data manipulation.
- `min_points <- data2022 %>% ...`: calculates the minimum points for each overall category using dplyr's group_by() and summarise().
- `ggplot(data2022, aes(x = "", fill = overall)) + ...`: sets up the initial ggplot object, using the empty string "" as a single x-axis category and filling the colors based on the 'overall' variable.
- `geom_bar(width = 1, position = "fill")`: adds bars representing the proportion of teams in each category; position = "fill" stacks the bars so that they fill the entire pie.
- `geom_text(data = min_points, aes(y = min_points, label = min_points), ...)`: adds text labels showing the minimum points for each category; position_fill(vjust = 0.5) centers each label within its filled area.
- `coord_polar("y", start = 0)`: converts the Cartesian coordinates to polar coordinates, turning the stacked bar into a pie chart; the "y" argument scales the slices by the y-axis values, and start = 0 sets the starting angle.
- `scale_fill_manual(values = c("europe" = "blue", "safe" = "green", "relegation" = "red"))`: customizes the fill colors for each overall category.
- `ggtitle("Proportion of Teams by Overall Category")`: adds a title to the plot.
- `theme_minimal()`: adjusts the theme of the plot for better visualization.

This visualization helps understand the distribution of teams across the overall categories, along with the minimum points observed in each category.
Overall, this code generates a pie chart illustrating the proportion of teams in each overall category, with text labels indicating the minimum points for each category.
Correlations
#Heatmap
# Compute the correlation matrix
cor_matrix <- cor(data2022[, -c(1, 11)])
# Convert the correlation matrix to long format
cor_df <- reshape2::melt(cor_matrix)
# Plot the correlation heatmap
ggplot(data = cor_df, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient(low = "blue", high = "red") + # Customize the color gradient
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Correlation Heatmap")
This code generates a heatmap to visualize the pairwise correlations between variables in the dataset.
- `cor_matrix <- cor(data2022[, -c(1, 11)])`: calculates the correlation matrix of data2022, excluding the first and eleventh columns ('teams' and 'overall').
- `cor_df <- reshape2::melt(cor_matrix)`: converts the correlation matrix into a long-format data frame suitable for plotting.
- `ggplot(data = cor_df, aes(x = Var1, y = Var2, fill = value)) + ...`: sets up the initial ggplot object with the melted correlation data frame cor_df, mapping Var1 and Var2 to the x and y axes and the correlation values to the fill color.
- `geom_tile()`: adds a layer of tiles, each representing one pairwise correlation value.
- `scale_fill_gradient(low = "blue", high = "red")`: customizes the color gradient for the fill scale, with blue indicating lower correlation values and red indicating higher ones.
- `theme(axis.text.x = element_text(angle = 45, hjust = 1))`: angles the x-axis labels for better readability.
- `ggtitle("Correlation Heatmap")`: adds a title to the plot.

This visualization summarizes the relationships between variables: each cell represents the correlation between two variables, with colors toward blue indicating lower (more negative) correlations and colors toward red indicating higher (more positive) correlations.
Scatter Plot and Scatter Plot Matrixes
# exploring relationships among features: correlation matrix
cor(data2022[c("avg.age", "possession", "prog.carries", "prog.passes","avgdist.gk",
"miscontrols","aerialswon.pct")])
## avg.age possession prog.carries prog.passes avgdist.gk
## avg.age 1.00000000 -0.08204764 -0.04631838 -0.04162586 0.08846055
## possession -0.08204764 1.00000000 0.90328340 0.95951365 0.24674533
## prog.carries -0.04631838 0.90328340 1.00000000 0.91078538 0.23966734
## prog.passes -0.04162586 0.95951365 0.91078538 1.00000000 0.24166903
## avgdist.gk 0.08846055 0.24674533 0.23966734 0.24166903 1.00000000
## miscontrols -0.07008462 -0.25956326 -0.30569156 -0.35372264 0.09466812
## aerialswon.pct 0.50243236 0.41629951 0.39431301 0.42961627 -0.01494381
## miscontrols aerialswon.pct
## avg.age -0.07008462 0.50243236
## possession -0.25956326 0.41629951
## prog.carries -0.30569156 0.39431301
## prog.passes -0.35372264 0.42961627
## avgdist.gk 0.09466812 -0.01494381
## miscontrols 1.00000000 -0.42984815
## aerialswon.pct -0.42984815 1.00000000
This code calculates the correlation matrix for a subset of features in data2022: 'avg.age', 'possession', 'prog.carries', 'prog.passes', 'avgdist.gk', 'miscontrols', and 'aerialswon.pct'. The resulting matrix shows the pairwise correlations between these features: each cell contains a correlation coefficient indicating the strength and direction of the linear relationship between two variables, with values ranging from -1 to 1.
Positive values imply a positive correlation (as one variable increases, the other tends to increase), while negative values imply a negative correlation (as one variable increases, the other tends to decrease). A value close to 0 suggests no linear relationship between the variables.
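For example:

# Sign convention of the correlation coefficient, on made-up vectors:
cor(c(1, 2, 3, 4), c(2, 4, 6, 8))  # returns  1: perfect positive correlation
cor(c(1, 2, 3, 4), c(8, 6, 4, 2))  # returns -1: perfect negative correlation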
#visualizing relationships among features scatterplot
pairs(data2022[c("avg.age", "possession", "prog.carries", "prog.passes","avgdist.gk",
"miscontrols","aerialswon.pct")])
This code generates a matrix of scatter plots to visualize the pairwise relationships between the selected features in data2022 ('avg.age', 'possession', 'prog.carries', 'prog.passes', 'avgdist.gk', 'miscontrols', and 'aerialswon.pct'). Each cell in the matrix shows one feature plotted against another, allowing a quick examination of potential relationships, trends, and patterns; each point represents a single team.
# Customizing the scatterplot matrix
pairs(data2022[c("avg.age", "possession", "prog.carries", "prog.passes", "avgdist.gk", "miscontrols", "aerialswon.pct")],
col = "#3399FF", # Set point color to blue
pch = 16, # Set point shape to filled circle
cex = 0.7, # Set point size
bg = "white", # Set background color to white
lower.panel = panel.smooth, # Add smooth trend lines in the lower panels
diag.panel = NULL, # Remove diagonal panels
labels = c("Avg. Age", "Possesion", "Prog. Carries", "Prog. Passes", "Avg. Dist. GK", "Miscontrols", "Aerials Won %"), # Set axis labels
font.labels = 2, # Set font weight for labels
col.axis = "#666666", # Set axis color to gray
cex.axis = 0.8, # Set axis label size
las = 1, # Set axis label orientation to horizontal
main = "Scatterplot Matrix of Features" # Set main title
)
This code generates a scatterplot matrix to visualize the pairwise relationships between features.
- `pairs()`: generates the scatterplot matrix.
- `col = "#3399FF"`: sets the point color to blue.
- `pch = 16`: sets the point shape to filled circles.
- `cex = 0.7`: sets the point size to 0.7.
- `bg = "white"`: sets the background color to white.
- `lower.panel = panel.smooth`: adds smooth trend lines in the lower panels.
- `diag.panel = NULL`: removes the diagonal panels.
- `labels`: sets the axis labels for each feature.
- `font.labels = 2`: sets the label font weight to bold.
- `col.axis = "#666666"`: sets the axis color to gray.
- `cex.axis = 0.8`: sets the axis label size to 0.8.
- `las = 1`: orients the axis labels horizontally.
- `main = "Scatterplot Matrix of Features"`: sets the main title of the plot.

# Create scatterplot matrix with ggplot2
ggplot(data2022, aes(x = avg.age, y = possession)) +
geom_point(aes(color = overall), size = 3) + # Color points based on 'overall' category
geom_smooth(method = "lm", se = FALSE, color = "black") + # Add linear regression lines
facet_grid(. ~ overall) + # Separate plots by 'overall' category
labs(x = "Average Age", y = "Possession", color = "Overall") + # Adjust axis labels and color legend title
ggtitle("Relationships Among Features") + # Add title
theme_minimal() # Adjust theme for better visualization
## `geom_smooth()` using formula = 'y ~ x'
This code generates a scatter plot with linear regression lines for each category in the “overall” variable.
- `ggplot()`: initializes the ggplot object with the data2022 dataset and the aesthetic mappings.
- `geom_point()`: adds points to the plot, with color mapped to the "overall" category.
- `geom_smooth(method = "lm", se = FALSE)`: adds a linear regression line to each panel without confidence intervals.
- `facet_grid(. ~ overall)`: separates the plot into facets by the "overall" category.
- `labs()`: sets the labels for the x-axis, y-axis, and color legend.
- `ggtitle()`: sets the title of the plot.
- `theme_minimal()`: applies a minimal theme to the plot for better visualization.

Regression Analysis to predict points
# Splitting the data into predictors and response variable
X_train <- data2022[, -c(1, 11)] # Exclude 'teams' and 'overall' columns
y_train <- data2022$points
X_test <- data2023[, -c(1, 11)] # Exclude 'teams' and 'overall' columns
y_test <- data2023$points
This code snippet splits the data into the predictor variables (X_train, X_test) and the response variable (y_train, y_test).

- `X_train`: the modelling frame for the training season, obtained by dropping the first and eleventh columns ('teams' and 'overall') from data2022. Note that it still contains 'points', which the model formulas below use as the response.
- `y_train`: the response variable (target) for the training set, i.e., the 'points' column of data2022.
- `X_test`: the corresponding frame for the test season, dropping the same columns from data2023.
- `y_test`: the response variable for the test set, i.e., the 'points' column of data2023.

# Model building
# We can use different regression models such as linear regression, decision tree regression, random forest regression, etc
# Linear Regression Model
lm_model <- lm(points ~ ., data = X_train)
This code snippet builds a linear regression model on the training data.

- `lm_model`: the linear regression model object fitted with the lm() function, where points is the response variable and . denotes using all other variables in X_train as predictors.

summary(lm_model)
##
## Call:
## lm(formula = points ~ ., data = X_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9517 -2.6876 0.6733 4.1046 11.8263
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.91666 87.94433 -0.261 0.799
## avg.age -1.81747 2.66170 -0.683 0.509
## possession -0.05721 1.57284 -0.036 0.972
## redcards -0.26928 3.10938 -0.087 0.933
## prog.carries 0.00691 0.03111 0.222 0.828
## prog.passes 0.03787 0.03176 1.193 0.258
## avgdist.gk -0.86768 0.57210 -1.517 0.158
## miscontrols 0.01243 0.05724 0.217 0.832
## aerialswon.pct 1.47297 0.94482 1.559 0.147
##
## Residual standard error: 9.078 on 11 degrees of freedom
## Multiple R-squared: 0.8725, Adjusted R-squared: 0.7798
## F-statistic: 9.411 on 8 and 11 DF, p-value: 0.0005902
The summary() function provides a detailed summary of the linear regression model lm_model, including coefficients, standard errors, t-values, p-values, and goodness-of-fit statistics.
This summary provides valuable information about the model’s performance and the significance of individual predictors in explaining the variability in the response variable.
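For instance, if we wanted to inspect or filter the coefficients programmatically, the coefficient table can be extracted as a matrix; a small sketch (the 0.05 threshold is illustrative):

# coef(summary(...)) returns the Estimate / Std. Error / t value / Pr(>|t|)
# matrix shown above.
coefs <- coef(summary(lm_model))
coefs[coefs[, "Pr(>|t|)"] < 0.05, , drop = FALSE]  # keep rows with p < 0.05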
# Decision Tree Regression Model
dt_model <- rpart(points ~ ., data = X_train)
The rpart() function fits a decision tree regression model to the training data X_train with points as the response variable. This model will be used to predict points from the predictor variables.
The decision tree regression model recursively partitions the predictor space into distinct regions, where each region corresponds to a specific prediction. This is achieved by splitting the predictor variables based on their values, aiming to minimize the variance of the response variable within each partition.
Once the model is trained, it can be used to make predictions on new data by traversing the decision tree from the root node to a leaf node, where each leaf holds a specific prediction. After fitting the model, we can predict the points for the test data X_test using the predict() function.
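As an aside, the fitted tree itself can be drawn with rpart's base plotting functions; a minimal sketch (the layout arguments are illustrative):

# Draw the tree skeleton, then add split labels and per-node counts
plot(dt_model, uniform = TRUE, margin = 0.1)
text(dt_model, use.n = TRUE, cex = 0.8)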
function.
summary(dt_model)
## Call:
## rpart(formula = points ~ ., data = X_train)
## n= 20
##
## CP nsplit rel error xerror xstd
## 1 0.5127455 0 1.0000000 1.078438 0.3261939
## 2 0.0100000 1 0.4872545 1.078438 0.3261939
##
## Variable importance
## prog.carries prog.passes possession aerialswon.pct miscontrols
## 32 23 18 14 9
## avg.age
## 5
##
## Node number 1: 20 observations, complexity param=0.5127455
## mean=52.6, MSE=355.54
## left son=2 (13 obs) right son=3 (7 obs)
## Primary splits:
## prog.carries < 725.5 to the left, improve=0.5127455, (0 missing)
## prog.passes < 1402.5 to the left, improve=0.4698490, (0 missing)
## possession < 51.45 to the left, improve=0.4410750, (0 missing)
## aerialswon.pct < 52.35 to the left, improve=0.3459724, (0 missing)
## miscontrols < 593 to the right, improve=0.2278225, (0 missing)
## Surrogate splits:
## prog.passes < 1557.5 to the left, agree=0.90, adj=0.714, (0 split)
## possession < 52.5 to the left, agree=0.85, adj=0.571, (0 split)
## aerialswon.pct < 52.35 to the left, agree=0.80, adj=0.429, (0 split)
## miscontrols < 593 to the right, agree=0.75, adj=0.286, (0 split)
## avg.age < 25.9 to the right, agree=0.70, adj=0.143, (0 split)
##
## Node number 2: 13 observations
## mean=42.69231, MSE=107.4438
##
## Node number 3: 7 observations
## mean=71, MSE=295.4286
Beyond the summary() output above, the printcp() function reports the complexity parameter (cp) table, including the cross-validated error at each number of splits. This table helps gauge the complexity of the decision tree model and guides the choice of a cp value for pruning the tree.
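A brief sketch of that pruning workflow, using the cp table stored on the fitted model:

printcp(dt_model)  # display the cp table for dt_model
# Pick the cp with the lowest cross-validated error and prune to it
best_cp <- dt_model$cptable[which.min(dt_model$cptable[, "xerror"]), "CP"]
dt_pruned <- prune(dt_model, cp = best_cp)

With only one split in this tree, pruning changes little here, but the same steps matter for deeper trees.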
# Random Forest Regression Model
rf_model <- randomForest(points ~ ., data = X_train)
The randomForest() function builds a random forest regression model (rf_model), trained to predict the number of points (points) from all other variables in the training dataset. The formula points ~ . specifies points as the response variable and . as all other variables in the dataset; the data argument supplies the training dataset (X_train).
rf_model
##
## Call:
## randomForest(formula = points ~ ., data = X_train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 149.4146
## % Var explained: 57.98
The rf_model object contains the random forest regression model built with the randomForest package; it can predict the number of points for Premier League teams from the various predictor variables.
This summary provides information about the model’s performance and characteristics based on the training data.
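To connect this back to the project's goal of understanding feature importance, the fitted forest can be queried directly; a short sketch:

# IncNodePurity scores for a regression forest, numerically and as a plot
importance(rf_model)
varImpPlot(rf_model, main = "Variable Importance (rf_model)")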
Model Evaluation
# We can evaluate the performance of each model using metrics such as RMSE, R-squared, etc.
# Function to calculate RMSE
rmse <- function(actual, predicted) {
sqrt(mean((actual - predicted)^2))
}
# Predictions
lm_pred <- predict(lm_model, newdata = X_test)
dt_pred <- predict(dt_model, newdata = X_test)
rf_pred <- predict(rf_model, newdata = X_test)
The rmse function calculates the Root Mean Squared Error (RMSE) between the actual and predicted values.

- `actual`: the actual values.
- `predicted`: the predicted values.

After defining the rmse function, predictions are made on the test dataset with the three models, Linear Regression (lm_model), Decision Tree (dt_model), and Random Forest (rf_model), and stored in:

- `lm_pred`: predictions from the Linear Regression model.
- `dt_pred`: predictions from the Decision Tree model.
- `rf_pred`: predictions from the Random Forest model.

# Calculate RMSE for each model
lm_rmse <- rmse(y_test, lm_pred)
dt_rmse <- rmse(y_test, dt_pred)
rf_rmse <- rmse(y_test, rf_pred)
This code calculates the Root Mean Squared Error (RMSE) for each model’s predictions on the test data and displays the results. Lower RMSE values indicate better model performance. You can compare the RMSE values to determine which model performs best for your regression task.
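Since the project plan also lists R-squared as an evaluation metric, a helper in the same style as rmse() could compute it on the held-out season; a sketch:

# R-squared on test data: 1 minus residual sum of squares over total sum of squares
rsq <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}
rsq(y_test, lm_pred)  # e.g., test-set R-squared for the linear model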
# Compare RMSE
cat("Linear Regression RMSE:", lm_rmse, "\n")
## Linear Regression RMSE: 13.2977
cat("Decision Tree Regression RMSE:", dt_rmse, "\n")
## Decision Tree Regression RMSE: 16.35151
cat("Random Forest Regression RMSE:", rf_rmse, "\n")
## Random Forest Regression RMSE: 11.53972
The RMSE indicates how far, on average, the predicted number of points deviates from the actual number of points, so the model with the lowest RMSE is preferred. In this case, our best model is Random Forest Regression; however, I would choose Linear Regression given that the RMSE values are quite close, and Linear Regression is a technique that is simpler for the project stakeholders to understand.
# Convert all columns to numeric in the training dataset
X_train<- data.frame(lapply(X_train, as.numeric))
# Convert all columns to numeric in the testing dataset
X_test<- data.frame(lapply(X_test, as.numeric))
This code uses lapply to apply the as.numeric function to each column of X_train and X_test, converting them to numeric type. The results are wrapped in data.frame() and stored back in X_train and X_test, now with every column in numeric format.
str(X_train)
## 'data.frame': 20 obs. of 9 variables:
## $ avg.age : num 24.4 26.1 25.4 26.2 28.5 27.4 27.1 26.7 26.1 26.5 ...
## $ possession : num 52.8 46.5 44.8 54.4 40.2 61.8 51.1 40 52.3 52 ...
## $ redcards : num 4 2 3 2 2 1 1 6 3 1 ...
## $ prog.carries : num 734 653 450 676 388 894 592 512 671 622 ...
## $ prog.passes : num 1655 1300 1189 1547 972 ...
## $ points : num 69 45 46 51 35 74 48 39 38 52 ...
## $ avgdist.gk : num 7.3 15.4 15.8 9.3 17 18.8 11.2 10.7 23.7 10.7 ...
## $ miscontrols : num 557 595 563 598 540 591 671 640 676 601 ...
## $ aerialswon.pct: num 44.7 49 49.6 53.3 53 52.5 49.7 46.7 42.9 48.9 ...
str(X_test)
## 'data.frame': 20 obs. of 9 variables:
## $ avg.age : num 24.7 27 26.3 26.2 26.3 26.3 26.7 26.6 28.2 25.3 ...
## $ possession : num 59.3 49.3 40.4 43.8 60.2 58.7 46.3 42.8 48.8 47 ...
## $ redcards : num 0 1 0 1 0 3 3 2 1 3 ...
## $ prog.carries : num 824 637 504 392 809 829 556 500 662 555 ...
## $ prog.passes : num 2049 1242 1006 1131 1849 ...
## $ points : num 84 61 39 59 62 44 45 36 52 31 ...
## $ avgdist.gk : num 16.1 14.2 10.5 15.7 12.3 13.7 13.1 5 14 11.6 ...
## $ miscontrols : num 526 565 589 520 558 568 643 605 535 602 ...
## $ aerialswon.pct: num 46.5 48.7 49.2 51.8 53.4 52.1 46.2 48.7 48.2 49.1 ...
The output above shows the dimensions and data types of the two data frames, along with the first few values of each variable, giving us a better understanding of their structure.
Now I will create a new variable to add up features that are highly correlated.
X_train$combined_effect <- X_train$possession + X_train$prog.carries + X_train$prog.passes
X_test$combined_effect <- X_test$possession + X_test$prog.carries + X_test$prog.passes
The lines of code created above create a new variable called combined_effect in both the training and testing datasets. The combined_effect variable is calculated as the sum of three existing variables: possession, prog.carries, and prog.passes.
X_train$combined_effect <- X_train$possession + X_train$prog.carries + X_train$prog.passes: this line computes the combined_effect variable for the training dataset (X_train) by adding the values of possession, prog.carries, and prog.passes for each observation.

X_test$combined_effect <- X_test$possession + X_test$prog.carries + X_test$prog.passes: similarly, this line calculates the combined_effect variable for the testing dataset (X_test) using the same formula.
After executing these lines, both the training and testing datasets will contain a new variable combined_effect, which represents the combined effect of possession, progressive carries, and progressive passes. This new variable can be used as a predictor in subsequent analyses or models.
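As a quick optional sanity check (not part of the original analysis), the new feature should correlate strongly with each of its components; note this must run before those columns are dropped below:

# Correlation of combined_effect with its three components
cor(X_train$combined_effect,
    X_train[, c("possession", "prog.carries", "prog.passes")])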
str(X_train)
## 'data.frame': 20 obs. of 10 variables:
## $ avg.age : num 24.4 26.1 25.4 26.2 28.5 27.4 27.1 26.7 26.1 26.5 ...
## $ possession : num 52.8 46.5 44.8 54.4 40.2 61.8 51.1 40 52.3 52 ...
## $ redcards : num 4 2 3 2 2 1 1 6 3 1 ...
## $ prog.carries : num 734 653 450 676 388 894 592 512 671 622 ...
## $ prog.passes : num 1655 1300 1189 1547 972 ...
## $ points : num 69 45 46 51 35 74 48 39 38 52 ...
## $ avgdist.gk : num 7.3 15.4 15.8 9.3 17 18.8 11.2 10.7 23.7 10.7 ...
## $ miscontrols : num 557 595 563 598 540 591 671 640 676 601 ...
## $ aerialswon.pct : num 44.7 49 49.6 53.3 53 52.5 49.7 46.7 42.9 48.9 ...
## $ combined_effect: num 2442 2000 1684 2277 1400 ...
str(X_test)
## 'data.frame': 20 obs. of 10 variables:
## $ avg.age : num 24.7 27 26.3 26.2 26.3 26.3 26.7 26.6 28.2 25.3 ...
## $ possession : num 59.3 49.3 40.4 43.8 60.2 58.7 46.3 42.8 48.8 47 ...
## $ redcards : num 0 1 0 1 0 3 3 2 1 3 ...
## $ prog.carries : num 824 637 504 392 809 829 556 500 662 555 ...
## $ prog.passes : num 2049 1242 1006 1131 1849 ...
## $ points : num 84 61 39 59 62 44 45 36 52 31 ...
## $ avgdist.gk : num 16.1 14.2 10.5 15.7 12.3 13.7 13.1 5 14 11.6 ...
## $ miscontrols : num 526 565 589 520 558 568 643 605 535 602 ...
## $ aerialswon.pct : num 46.5 48.7 49.2 51.8 53.4 52.1 46.2 48.7 48.2 49.1 ...
## $ combined_effect: num 2932 1928 1550 1567 2718 ...
# Removing the three components of combined_effect from the model data
X_train <- X_train[, -c(2,4, 5)]
y_train <- X_train$points
X_test <- X_test[, -c(2,4,5)]
y_test <- X_test$points
In the cell above, we have just removed the three variables used to create the new calculated feature, combined_effect.
str(X_train)
## 'data.frame': 20 obs. of 7 variables:
## $ avg.age : num 24.4 26.1 25.4 26.2 28.5 27.4 27.1 26.7 26.1 26.5 ...
## $ redcards : num 4 2 3 2 2 1 1 6 3 1 ...
## $ points : num 69 45 46 51 35 74 48 39 38 52 ...
## $ avgdist.gk : num 7.3 15.4 15.8 9.3 17 18.8 11.2 10.7 23.7 10.7 ...
## $ miscontrols : num 557 595 563 598 540 591 671 640 676 601 ...
## $ aerialswon.pct : num 44.7 49 49.6 53.3 53 52.5 49.7 46.7 42.9 48.9 ...
## $ combined_effect: num 2442 2000 1684 2277 1400 ...
str(X_test)
## 'data.frame': 20 obs. of 7 variables:
## $ avg.age : num 24.7 27 26.3 26.2 26.3 26.3 26.7 26.6 28.2 25.3 ...
## $ redcards : num 0 1 0 1 0 3 3 2 1 3 ...
## $ points : num 84 61 39 59 62 44 45 36 52 31 ...
## $ avgdist.gk : num 16.1 14.2 10.5 15.7 12.3 13.7 13.1 5 14 11.6 ...
## $ miscontrols : num 526 565 589 520 558 568 643 605 535 602 ...
## $ aerialswon.pct : num 46.5 48.7 49.2 51.8 53.4 52.1 46.2 48.7 48.2 49.1 ...
## $ combined_effect: num 2932 1928 1550 1567 2718 ...
Now I will complete the same steps as above, but this time building models that use the new combined_effect variable in place of its three components. The aim is to reduce the multicollinearity introduced by those highly correlated predictors; the residuals are also checked for heteroskedasticity further below.
lm_model2 <- lm(points ~ ., data = X_train)
summary(lm_model2)
##
## Call:
## lm(formula = points ~ ., data = X_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.043 -3.376 1.403 5.028 10.014
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30.925574 65.400274 -0.473 0.6441
## avg.age -1.828732 2.388803 -0.766 0.4576
## redcards 0.305240 1.898912 0.161 0.8748
## avgdist.gk -0.817785 0.511347 -1.599 0.1338
## miscontrols 0.008488 0.046871 0.181 0.8591
## aerialswon.pct 1.572217 0.873868 1.799 0.0952 .
## combined_effect 0.027663 0.004370 6.330 2.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.523 on 13 degrees of freedom
## Multiple R-squared: 0.8672, Adjusted R-squared: 0.8059
## F-statistic: 14.15 on 6 and 13 DF, p-value: 4.992e-05
dt_model2 <- rpart(points ~ ., data = X_train)
summary(dt_model2)
## Call:
## rpart(formula = points ~ ., data = X_train)
## n= 20
##
## CP nsplit rel error xerror xstd
## 1 0.6040661 0 1.0000000 1.092622 0.3379016
## 2 0.0100000 1 0.3959339 1.092622 0.3379016
##
## Variable importance
## combined_effect aerialswon.pct miscontrols avgdist.gk redcards
## 44 31 12 6 6
##
## Node number 1: 20 observations, complexity param=0.6040661
## mean=52.6, MSE=355.54
## left son=2 (13 obs) right son=3 (7 obs)
## Primary splits:
## combined_effect < 2241.85 to the left, improve=0.6040661, (0 missing)
## aerialswon.pct < 52.35 to the left, improve=0.3459724, (0 missing)
## miscontrols < 593 to the right, improve=0.2278225, (0 missing)
## redcards < 1.5 to the right, improve=0.2170492, (0 missing)
## avg.age < 26.35 to the left, improve=0.1011259, (0 missing)
## Surrogate splits:
## aerialswon.pct < 52.35 to the left, agree=0.90, adj=0.714, (0 split)
## miscontrols < 593 to the right, agree=0.75, adj=0.286, (0 split)
## redcards < 1.5 to the right, agree=0.70, adj=0.143, (0 split)
## avgdist.gk < 9.9 to the right, agree=0.70, adj=0.143, (0 split)
##
## Node number 2: 13 observations
## mean=41.84615, MSE=101.9763
##
## Node number 3: 7 observations
## mean=72.57143, MSE=212.8163
rf_model2 <- randomForest(points ~ ., data = X_train)
rf_model2
##
## Call:
## randomForest(formula = points ~ ., data = X_train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 204.3953
## % Var explained: 42.51
# Predictions
lm_pred <- predict(lm_model2, newdata = X_test)
dt_pred <- predict(dt_model2, newdata = X_test)
rf_pred <- predict(rf_model2, newdata = X_test)
# Calculate RMSE for each model
lm_rmse <- rmse(y_test, lm_pred)
dt_rmse <- rmse(y_test, dt_pred)
rf_rmse <- rmse(y_test, rf_pred)
# Compare RMSE
cat("Linear Regression RMSE:", lm_rmse, "\n")
## Linear Regression RMSE: 13.28647
cat("Decision Tree Regression RMSE:", dt_rmse, "\n")
## Decision Tree Regression RMSE: 12.23912
cat("Random Forest Regression RMSE:", rf_rmse, "\n")
## Random Forest Regression RMSE: 11.79779
Our values are similar to the ones obtained above, with the difference that the Decision Tree Regression is performing much better. I will continue analyzing our results to identify the optimal model for this project.
# Residual analysis for linear regression model
lm_resid <- resid(lm_model)
plot(lm_resid ~ fitted(lm_model), main = "Residuals vs Fitted Values") # Pair training residuals with the model's own fitted values
abline(h = 0, col = "red") # Add a horizontal line at y = 0
This plot helps us check if the residuals have constant variance across the range of fitted values. Ideally, we want to see a random scatter of points around the horizontal line at y = 0, indicating homoscedasticity.
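For comparison, base R's built-in diagnostics for lm objects draw the same panel directly; a one-line sketch:

# which = 1 selects the residuals-vs-fitted panel of plot.lm()
plot(lm_model, which = 1)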
# Residual analysis for linear regression model
lm_resid2 <- resid(lm_model2)
plot(lm_resid2 ~ fitted(lm_model2), main = "Residuals vs Fitted Values") # Pair training residuals with the model's own fitted values
abline(h = 0, col = "red") # Add a horizontal line at y = 0
# Calculate adjusted R-squared for linear regression model
summary(lm_model)$adj.r.squared
## [1] 0.7798001
summary(lm_model2)$adj.r.squared
## [1] 0.8059005
Adjusted R-squared penalizes for overfitting by considering the number of predictors in the model. Higher adjusted R-squared values indicate better model fit. Based on our results, lm_model2 would be more reliable.
# Analysis of Variance (ANOVA) for linear regression model
anova(lm_model)
anova(lm_model2)
ANOVA assesses the overall significance of the linear regression model by comparing the variance explained by the model to the residual variance.
The anova() function in R performs an analysis of variance for a linear regression model. For lm_model and lm_model2, the ANOVA table reports, for each predictor, the Sum of Squares (SS), Degrees of Freedom (df), Mean Squares (MS), and an F-statistic with its p-value; each p-value tests the null hypothesis that the corresponding term adds no explanatory power to the model.
In summary, the ANOVA tests help identify which predictor variables are statistically significant in explaining the variation in points obtained by teams.
#install.packages("lmtest") I always make package installation a comment after running my code in order to facilitate the process of saving to HTML and publishing to RPubs.
# Breusch-Pagan test for heteroskedasticity
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.3
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(lm_model)
##
## studentized Breusch-Pagan test
##
## data: lm_model
## BP = 7.8788, df = 8, p-value = 0.4454
bptest(lm_model2)
##
## studentized Breusch-Pagan test
##
## data: lm_model2
## BP = 6.6748, df = 6, p-value = 0.352
The code above is used for conducting the Breusch-Pagan test for heteroskedasticity on our two linear regression models, lm_model and lm_model2. This test is used to determine whether the variance of the residuals in the regression model is dependent on the values of the independent variables, which indicates heteroskedasticity.
- `library(lmtest)`: loads the lmtest package, which contains functions for hypothesis tests on linear models.
- `bptest(lm_model)`: conducts the Breusch-Pagan test on lm_model, testing whether the variance of its residuals depends on the values of the independent variables.
- `bptest(lm_model2)`: conducts the same test on lm_model2.
The output of these tests will indicate whether there is evidence of heteroskedasticity in the respective models. If the p-value associated with the test is below a certain significance level (e.g., 0.05), it suggests that there is heteroskedasticity present in the model.
In this case, the results of the Breusch-Pagan test for heteroskedasticity are as follows:

For lm_model: BP = 7.8788, df = 8, p-value = 0.4454.

For lm_model2: BP = 6.6748, df = 6, p-value = 0.352.
In both cases, the p-values are greater than 0.05, which means we fail to reject the null hypothesis of homoskedasticity. Therefore, there is no significant evidence of heteroskedasticity in the models lm_model and lm_model2.
Now, let’s proceed to save lm_model2 for future deployment and development.
# Save the trained model
saveRDS(lm_model2, "linear_regression_model.rds")
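A minimal sketch of how a deployment script might later reload and use the saved model (new_season_data is a hypothetical data frame with the same predictor columns, including combined_effect):

# Reload the saved model and score a new season's data
deployed_model <- readRDS("linear_regression_model.rds")
predicted_points <- predict(deployed_model, newdata = new_season_data)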