NOTE: This is a long document. If you are already familiar with the concepts of sports analytics and machine learning, feel free to skip to the “Getting Started” section. If you have R and RStudio already set up on your device, and understand the basics of coding, you may skip directly to the “Case Study” section for a hands-on experience.
Introduction
Sports analytics and machine learning have become increasingly intertwined in recent years, revolutionising the way teams, coaches, and analysts approach decision-making in sports. This document aims to provide a beginner-friendly introduction to machine learning in sports analytics, covering fundamental concepts, techniques, and applications.
If you’re wondering what sports analytics and machine learning are all about, let’s start with some definitions.
If you’re just interested in the practical aspects of machine learning in sports analytics, feel free to skip to the “What you need” section.
What is Sports Analytics?
Sports analytics involves the collection, analysis, and interpretation of data related to sports performance, player statistics, and game outcomes. Historically, sports analytics focused on basic statistics such as goals or points scored, and win-loss records. However, with advancements in technology and data collection methods, sports analytics has evolved to encompass a wide range of data types, including player tracking data, biometric data, and even social media sentiment analysis.
In recent years, sports analytics has increasingly used statistical methods and machine learning algorithms to extract insights that can inform strategies, improve player performance, and enhance fan engagement.
What is Machine Learning?
Machine learning (ML) is a subset of Artificial Intelligence (AI). It focuses on developing algorithms and models that enable computers to learn from and make predictions or decisions based on data. Unlike traditional programming, where explicit instructions are provided, machine learning algorithms identify patterns in data and use these patterns to make informed decisions or predictions.
Machine learning can be broadly categorized into three main types:
Supervised Learning: In supervised learning, the algorithm is trained on a labeled dataset, where each input is associated with a corresponding output. The goal is to learn a mapping from inputs to outputs, allowing the model to make predictions on new, unseen data. Common algorithms include linear regression, decision trees, and support vector machines.
Unsupervised Learning: Unsupervised learning involves training the algorithm on an unlabeled dataset, where the goal is to identify patterns or structures within the data. Common techniques include clustering (e.g., k-means) and dimensionality reduction (e.g., principal component analysis).
Reinforcement Learning: In reinforcement learning, an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties based on its actions, and the goal is to learn a policy that maximizes cumulative rewards over time.
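To make the first two categories more concrete, here is a minimal sketch in R. The data (minutes played and goals scored) is made up purely for illustration and is not part of the case study; `lm()` stands in for a supervised algorithm and `kmeans()` for an unsupervised one.

```r
# Toy data (made up for illustration): minutes played and goals scored
set.seed(1)
minutes <- runif(50, min = 200, max = 3000)
goals   <- 0.005 * minutes + rnorm(50, sd = 1)
players <- data.frame(minutes = minutes, goals = goals)

# Supervised learning: the label (goals) is known, so we learn a mapping
# from the input (minutes) to the output (goals).
fit <- lm(goals ~ minutes, data = players)
new_player <- data.frame(minutes = 1500)
predicted_goals <- predict(fit, newdata = new_player)
predicted_goals

# Unsupervised learning: no label is singled out; k-means simply groups
# similar players together based on their (scaled) statistics.
clusters <- kmeans(scale(players), centers = 2, nstart = 10)
table(clusters$cluster)
```

In the supervised part the model is told what the right answer looks like (goals), whereas in the unsupervised part the algorithm only receives the numbers and decides on its own which players belong together.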
How does Machine Learning add value to existing practices?
Machine learning enhances sports analytics by providing advanced tools and techniques for data analysis, prediction, and decision-making.
Here are some ways machine learning adds value to existing practices:
Improved Predictive Accuracy: Machine learning algorithms can analyze large datasets and identify complex patterns that traditional statistical methods may miss, leading to more accurate predictions of player performance, game outcomes, and injury risks.
Real-time Analysis: Machine learning models can process data in real-time, allowing teams and coaches to make informed decisions during games based on up-to-date information.
Personalized Training and Strategy: Machine learning can help tailor training programs and game strategies to individual players based on their unique strengths, weaknesses, and performance data.
Enhanced Fan Engagement: By analyzing fan behavior and preferences, machine learning can help teams and organizations create personalized marketing campaigns and improve fan experiences.
Applications of Machine Learning in Sports Analytics
Machine learning has found numerous applications in sports analytics, including but not limited to:
Player Performance Analysis: Machine learning models can analyze player statistics and biometric data to assess performance, identify strengths and weaknesses, and predict future performance.
Injury Prediction and Prevention: By analyzing historical injury data and player workload, machine learning algorithms can identify risk factors and predict the likelihood of injuries, allowing teams to implement preventive measures.
Fan Engagement and Marketing: Machine learning can analyze fan behavior and preferences to personalize marketing efforts, improve fan experiences, and increase engagement.
With this comprehensive overview, let’s dive into the practical aspects of machine learning in sports analytics!
Getting Started
What you need
We will be using the R programming language and RStudio as our integrated development environment (IDE) for this guide. Please ensure you have the following set up on your device:
R and RStudio: Make sure you have R and RStudio installed on your computer. You can download R from CRAN and RStudio from RStudio’s website.
Datasets: For any analysis, we would need a dataset to use. R allows us to create mock datasets, which we will be using, based on real-world sports data.
How to use this guide
This guide is structured to provide a step-by-step approach to understanding and applying the concepts discussed. Each section builds upon the previous one, so it’s recommended to follow along sequentially.
You may also choose to copy the pre-written code chunks and paste them into your R script (more of this in a bit) in RStudio to run them.
The code chunks are grey-shaded boxes that look like this:
print("Hello, World!")
## [1] "Hello, World!"
You may also choose to click the green triangle button on the top right of each code chunk to run the code directly in this document. After clicking the button, you should see the output below the code chunk. In this case, you should see [1] “Hello, World!” as the output below the code chunk.
The hashtag # followed by text is called a comment. Comments are not executed as part of the code but are there to provide explanations or context about what the code does.
How to code in R
After installing R and RStudio on your device, open RStudio. You should see four general panels:
the script editor (top left),
the console (bottom left),
the environment/history (top right), and
the files/plots/packages/help/viewer (bottom right).
Generally, the script editor is where you write and save your code (i.e. top left panel), and the console is where you can run your code (i.e. bottom left panel).
TRY IT OUT:
Create a new R script by going to File > New File > R Script. This is where you can write and save your code.
After opening the R script, type the following code in the script editor (i.e. the top left panel): print("Hello, World!")
Click the “Run” button or press “Ctrl + Enter” to execute the code.
You should see the output [1] "Hello, World!" in the console (i.e. bottom left panel).
Congratulations! You’ve just written and executed your first R code. Practice makes permanent, so let’s try writing and running more code by going straight to the case study below.
Case Study
Understanding the Problem Statement
You are an aspiring analyst interested in understanding fan engagement in Singapore football. You have been given a data set containing various factors that may influence fan attendance at live football matches in the Singapore Premier League (SPL).
Problem statement: Using machine learning techniques, predict fan attendance numbers at SPL football matches and identify the key determinants that drive these attendance numbers.
We will explore how predictive analytics can be applied to this problem statement using a mock data set. In predictive analytics in general, there are nine steps to follow:
- Data Preparation
- Load required packages on R
- Data Exploration and Visualisation (including Data Splitting, Step 3.5)
- Model Building
- Model Inspection
- Prediction
- Evaluation
- Hyperparameter Tuning
- Interpretation and Results
Step 1: Data preparation
Typically, the data set would be stored as a CSV file, an Excel file, or a similar format.
To import the data, you would use the following code:
# import dataset
# dataset <- read.csv("path/to/your/dataset.csv") # Uncomment this line and replace with the actual path to your dataset
# view the first few rows of your dataset
# head(dataset) # Uncomment this line
The dataset variable will now contain your data, and you can use the head() function to view the first few rows of the data set.
If the file is an excel file, you would use the following code:
# install.packages("readxl") # Uncomment this line if you haven't installed the readxl package
# library(readxl)
# dataset <- read_excel("path/to/your/dataset.xlsx") # Uncomment this line and replace with the actual path to your dataset
# head(dataset) # Uncomment this line
However, since we do not have access to a real-world data set for this case study, we will create a mock data set instead.
The mock data set contains different facets that may influence fan attendance at live games. Copy and paste the following code chunk into your R script and run it to create the mock dataset.
library(tidyverse)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
# Set seed for reproducibility. This means that every time anyone runs the code, they will get the same set of random numbers.
# The number in the set.seed() function can be any integer; in this case, we used 123.
set.seed(123)
# SPL season structure info
spl_seasons <- data.frame(
Season_Start = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025),
Season_End = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2025, 2026),
n_teams = c(10, 11, 12, 12, 12, 12, 12, 13, 12, 12, 10, 9, 9, 9, 9, 8, 8, 8, 9, 9, 8),
n_rounds = c(3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 1, 3, 4, 3, 4, 3),
total_games = c(135, 165, 198, 198, 198, 198, 198, 156, 162, 162, 135, 108, 108, 108, 108, 56, 84, 112, 108, 144, 84)
)
# Modern SPL teams and abbreviations (as placeholders for all years)
teams_modern <- c("Lion City Sailors", "DPMM", "Balestier Khalsa", "Tanjong Pagar",
"BG Tampines Rovers", "Hougang United", "Geylang International",
"Young Lions", "Albirex Niigata (S)")
team_abbr_modern <- c("LCS", "DPMM", "BAL", "TPG", "BGT", "HGU", "GEY", "YLS", "ALB")
names(team_abbr_modern) <- teams_modern
# Stadium details
stadiums <- c("JBS", "Bishan", "OTH", "JE")
stadium_capacity <- c("JBS" = 7100, "Bishan" = 10000, "OTH" = 5100, "JE" = 2700)
stadium_location <- c("JBS"="Central", "Bishan"="Central", "OTH"="East", "JE"="West")
stadium_identity <- c("JBS"="iconic/heritage", "Bishan"="standard/neutral",
"OTH"="standard/neutral", "JE"="standard/neutral")
stadium_capacity_type <- c("JBS"="small", "Bishan"="medium", "OTH"="small", "JE"="small")
# Attendance factor levels
team_ranking <- c("high", "medium", "low")
history_local <- c("champion", "top-half", "mid-table", "bottom-half", "relegation zone")
history_continental<- c("frequent participant", "occasional participant", "never participated")
national_capped <- c("none", "1-2", "3+")
star_marquee <- c("absent", "present")
league_form <- c("good form", "mixed form", "poor form")
rivalry <- c("no", "yes")
popularity_away <- c("low", "medium", "high")
match_time_cat <- c("afternoon", "evening", "night")
match_importance <- c("friendly", "regular season", "cup knockout", "cup final/title decider", "bottom table clash")
broadcast_avail <- c("none", "local TV", "international TV/streaming")
match_style <- c("defensive/slow", "balanced", "attacking/fast-paced")
comfort_access <- c("poor", "average", "good")
fan_facilities <- c("basic", "moderate", "extensive")
ticket_pricing <- c("low", "medium", "high")
promotions <- c("none", "group/family", "student/youth", "bundled with merchandise")
season_pass <- c("no", "yes")
merch_tiein <- c("no", "yes")
weather_cat <- c("sunny", "rainy", "hazy", "humid", "thunderstorm")
competing_event <- c("none", "minor local event", "major local event", "international event")
interest_climate <- c("low", "medium", "high")
safety_concern <- c("none", "moderate", "high")
fan_culture <- c("weak", "moderate", "strong")
comm_engage <- c("low", "medium", "high")
nat_pride <- c("not applicable", "local heroes playing", "national representation")
aud_div <- c("mostly locals", "mix of locals & expats", "mostly expats")
ad_promo <- c("weak", "moderate", "strong")
media_cover <- c("low", "moderate", "high")
ticket_info <- c("difficult", "average", "easy")
partnership <- c("none", "school groups", "corporate groups")
sm_followers_club <- c("low", "medium", "high")
sm_followers_players <- c("low", "medium", "high")
# Initialize master dataframe
all_spl_df <- data.frame()
for (season_ix in 1:nrow(spl_seasons)) {
season_row <- spl_seasons[season_ix, ]
n_teams <- season_row$n_teams
n_rounds <- season_row$n_rounds
total_games <- season_row$total_games
year_start <- season_row$Season_Start
year_end <- season_row$Season_End
# Use the first n_teams from the modern list for each season
teams_this_season <- teams_modern[1:n_teams]
team_abbr_this_season<- team_abbr_modern[teams_this_season]
# Build fixtures (all pairs home/away excluding self, repeated n_rounds)
fixtures <- expand.grid(Home_Team=teams_this_season, Away_Team=teams_this_season, stringsAsFactors=FALSE)
fixtures <- fixtures[fixtures$Home_Team != fixtures$Away_Team, ]
fixtures <- fixtures[rep(1:nrow(fixtures), n_rounds), ]
fixtures <- fixtures[1:total_games, ] # truncate extra if any
# Dates, times, and random details
sd <- as.Date(paste0(year_end, "-01-01"))
ed <- as.Date(paste0(year_end, "-12-31"))
rand_dates <- sample(seq.Date(sd, ed, by="day"), total_games, replace=FALSE)
rand_days <- weekdays(rand_dates)
rand_unix <- as.numeric(as.POSIXct(rand_dates + hours(sample(15:21, total_games, replace=TRUE))))
match_stadiums<- sample(stadiums, total_games, replace=TRUE)
stadium_caps <- stadium_capacity[match_stadiums]
stadium_locs <- stadium_location[match_stadiums]
stadium_ids <- stadium_identity[match_stadiums]
stadium_types <- stadium_capacity_type[match_stadiums]
spl_ranks <- sample(1:n_teams, total_games, replace=TRUE)
continental <- ifelse(spl_ranks %in% c(1,2), "Yes", "No")
# Create the season's dataframe
season_df <- data.frame(
Season_Start = year_start,
Season_End = year_end,
Game_Date = rand_dates,
GameID = paste0(format(rand_dates, "%d%m%Y"), "-", team_abbr_this_season[fixtures$Home_Team], team_abbr_this_season[fixtures$Away_Team]),
Match_Day = rand_days,
Match_Time = rand_unix,
Match_Temperature = sample(27:32, total_games, replace=TRUE),
Match_Humidity = sample(70:93, total_games, replace=TRUE),
Match_Weather = sample(weather_cat, total_games, replace=TRUE),
Home_Team = fixtures$Home_Team,
Away_Team = fixtures$Away_Team,
HomeTeam_GoalsFor = sample(0:5, total_games, replace=TRUE),
HomeTeam_GoalsAgainst = sample(0:5, total_games, replace=TRUE),
AwayTeam_GoalsFor = sample(0:5, total_games, replace=TRUE),
AwayTeam_GoalsAgainst = sample(0:5, total_games, replace=TRUE),
Match_Stadium = match_stadiums,
Stadium_Capacity = stadium_caps,
SPL_Rank = spl_ranks,
Continental_Competition = continental,
Geographic_Location = stadium_locs,
Stadium_Identity_History = stadium_ids,
Stadium_Capacity_Type = stadium_types,
Team_Ranking = sample(team_ranking, total_games, replace=TRUE),
Historical_Performance_Local = sample(history_local, total_games, replace=TRUE),
Historical_Performance_Continental = sample(history_continental, total_games, replace=TRUE),
National_Capped_Players = sample(national_capped, total_games, replace=TRUE),
Star_Foreign_Marquee_Players = sample(star_marquee, total_games, replace=TRUE),
Current_League_Form = sample(league_form, total_games, replace=TRUE),
Rivalry_Derby = sample(rivalry, total_games, replace=TRUE),
Popularity_Away_Team = sample(popularity_away, total_games, replace=TRUE),
Time_of_Match = sample(match_time_cat, total_games, replace=TRUE),
Day_of_Match = sample(c("weekday", "weekend", "public holiday"), total_games, replace=TRUE),
Match_Importance = sample(match_importance, total_games, replace=TRUE),
Broadcast_Availability = sample(broadcast_avail, total_games, replace=TRUE),
Match_Style_Entertainment_Factor = sample(match_style, total_games, replace=TRUE),
Comfort_Accessibility = sample(comfort_access, total_games, replace=TRUE),
Fan_Facilities = sample(fan_facilities, total_games, replace=TRUE),
Ticket_Pricing_Tier = sample(ticket_pricing, total_games, replace=TRUE),
Promotions = sample(promotions, total_games, replace=TRUE),
Season_Pass_Membership_Perks = sample(season_pass, total_games, replace=TRUE),
Merchandise_Tie_ins = sample(merch_tiein, total_games, replace=TRUE),
Competing_Events = sample(competing_event, total_games, replace=TRUE),
Public_Interest_Climate = sample(interest_climate, total_games, replace=TRUE),
Safety_Concerns = sample(safety_concern, total_games, replace=TRUE),
Fan_Culture_Presence = sample(fan_culture, total_games, replace=TRUE),
Community_Engagement = sample(comm_engage, total_games, replace=TRUE),
National_Pride_Factor = sample(nat_pride, total_games, replace=TRUE),
Audience_Diversity = sample(aud_div, total_games, replace=TRUE),
Advertising_Promotion = sample(ad_promo, total_games, replace=TRUE),
Media_Coverage = sample(media_cover, total_games, replace=TRUE),
Ticket_Info_Accessibility = sample(ticket_info, total_games, replace=TRUE),
Partnerships = sample(partnership, total_games, replace=TRUE),
Social_Media_Followers_Club = sample(sm_followers_club, total_games, replace=TRUE),
Social_Media_Followers_Players = sample(sm_followers_players, total_games, replace=TRUE),
stringsAsFactors=FALSE
)
# Add this season to the master df
all_spl_df <- rbind(all_spl_df, season_df)
}
#Remove rows where teams are NA
all_spl_df <- all_spl_df %>%
filter(!is.na(Home_Team) & !is.na(Away_Team))
#check for missing values
colSums(is.na(all_spl_df))
# all_spl_df contains all SPL seasons from 2005 to 2025/26
cat("Total games generated:", nrow(all_spl_df), "\n")
print(table(all_spl_df$Season_Start))
head(all_spl_df, 5)
# Attendance percent ranges for each type
attendance_ranges <- list(
"cup final/title decider" = c(0.80, 1.0),
"regular season" = c(0.35, 0.70),
"cup knockout" = c(0.40, 0.80),
"relegation battle" = c(0.40, 0.75),
"bottom table clash" = c(0.20, 0.45)
)
all_spl_df$Attendance <- mapply(function(importance, capacity) {
rng <- attendance_ranges[[tolower(importance)]]
if (is.null(rng)) rng <- c(0.3, 0.6)
as.integer(sample(seq(floor(rng[1]*as.numeric(capacity)), ceiling(rng[2]*as.numeric(capacity))), 1))
}, as.character(all_spl_df$Match_Importance), all_spl_df$Stadium_Capacity)
all_spl_df$Attendance <- pmin(all_spl_df$Attendance, all_spl_df$Stadium_Capacity)
Step 2: Load required packages on R
What is a library?
A library, also known as a package in R, is a collection of code and functions that helps you perform specific tasks without having to write the code from scratch.
A simple analogy: Imagine you want to bake a cake. Instead of starting from scratch, you can use a cake mix (i.e. library/package) that contains all the necessary ingredients and instructions to make the cake quickly and easily.
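As a minimal sketch, the packages for this case study could be checked and loaded as follows. The package list is an assumption, inferred from the functions called in the code chunks of this guide (for example corrplot(), createDataPartition(), kable() and kable_styling()); adjust it to your own needs.

```r
# Packages assumed to be needed for this case study (inferred from the
# functions used in the code chunks of this guide)
required_packages <- c("tidyverse", "data.table", "corrplot",
                       "caret", "knitr", "kableExtra")

# Check which packages are already installed on this device
is_installed <- vapply(required_packages, requireNamespace,
                       logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]

# Report anything that still needs install.packages("package_name")
if (length(missing_packages) > 0) {
  message("Not installed yet: ", paste(missing_packages, collapse = ", "))
}

# Load every package that is available
for (pkg in required_packages[is_installed]) {
  library(pkg, character.only = TRUE)
}
```

A package only needs to be installed once per device with install.packages(), but it has to be loaded with library() in every new R session.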
Step 3: Data Exploration and Visualisation
Exploring the data set is one of the fundamentals of data science. It helps you understand the data better and identify any missing values, outliers, or patterns that may exist.
Some machine learning algorithms also require the data to be in a specific format (e.g. changing characters to numeric or factors), so data exploration helps you prepare the data accordingly.
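One detail worth knowing when changing characters to factors and then to numbers: by default, R orders factor levels alphabetically, so "high" comes before "low". The toy example below (the `rating` vector is made up for illustration) sketches how stating the level order explicitly keeps the numeric codes in their natural order.

```r
# Toy example: an ordinal rating stored as text
rating <- c("low", "high", "medium", "low", "high")

# Default factor levels are alphabetical: high = 1, low = 2, medium = 3
default_codes <- as.numeric(as.factor(rating))
default_codes
# → 2 1 3 2 1

# Stating the level order explicitly gives low = 1, medium = 2, high = 3
rating_ordered <- factor(rating, levels = c("low", "medium", "high"),
                         ordered = TRUE)
ordered_codes <- as.numeric(rating_ordered)
ordered_codes
# → 1 3 2 1 3
```

This matters for correlations: with alphabetical codes, a larger number does not necessarily mean "more" of something, so the sign and size of a correlation can be misleading for low/medium/high variables.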
summary(all_spl_df)
str(all_spl_df)
# Check for missing values
colSums(is.na(all_spl_df))
#Transform relevant variables into factors
attendance_factor_vars <- c(
"Match_Day", "Match_Weather", "Home_Team", "Away_Team", "Match_Stadium", "Geographic_Location",
"Stadium_Identity_History", "Stadium_Capacity_Type", "Team_Ranking", "Historical_Performance_Local",
"Historical_Performance_Continental", "National_Capped_Players", "Star_Foreign_Marquee_Players",
"Current_League_Form", "Rivalry_Derby", "Popularity_Away_Team", "Time_of_Match", "Day_of_Match",
"Match_Importance", "Broadcast_Availability", "Match_Style_Entertainment_Factor", "Comfort_Accessibility",
"Fan_Facilities", "Ticket_Pricing_Tier", "Promotions", "Season_Pass_Membership_Perks",
"Merchandise_Tie_ins", "Competing_Events", "Public_Interest_Climate", "Safety_Concerns",
"Fan_Culture_Presence", "Community_Engagement", "National_Pride_Factor", "Audience_Diversity",
"Advertising_Promotion", "Media_Coverage", "Ticket_Info_Accessibility", "Partnerships",
"Social_Media_Followers_Club", "Social_Media_Followers_Players", "Continental_Competition"
)
# Transform selected columns in all_spl_df to factors
all_spl_df[attendance_factor_vars] <- lapply(all_spl_df[attendance_factor_vars], as.factor)
# Now convert to numeric variables for correlation:
all_spl_df$Team_Ranking_num <- as.numeric(all_spl_df$Team_Ranking)
all_spl_df$Historical_Performance_Local_num <- as.numeric(all_spl_df$Historical_Performance_Local)
all_spl_df$Historical_Performance_Continental_num <- as.numeric(all_spl_df$Historical_Performance_Continental)
all_spl_df$National_Capped_Players_num <- as.numeric(all_spl_df$National_Capped_Players)
all_spl_df$Star_Foreign_Marquee_Players_num <- as.numeric(all_spl_df$Star_Foreign_Marquee_Players)
all_spl_df$Current_League_Form_num <- as.numeric(all_spl_df$Current_League_Form)
all_spl_df$Popularity_Away_Team_num <- as.numeric(all_spl_df$Popularity_Away_Team)
all_spl_df$Match_Importance_num <- as.numeric(all_spl_df$Match_Importance)
all_spl_df$Broadcast_Availability_num <- as.numeric(all_spl_df$Broadcast_Availability)
all_spl_df$Comfort_Accessibility_num <- as.numeric(all_spl_df$Comfort_Accessibility)
all_spl_df$Fan_Facilities_num <- as.numeric(all_spl_df$Fan_Facilities)
all_spl_df$Ticket_Pricing_Tier_num <- as.numeric(all_spl_df$Ticket_Pricing_Tier)
all_spl_df$Promotions_num <- as.numeric(all_spl_df$Promotions)
all_spl_df$Season_Pass_Membership_Perks_num <- as.numeric(all_spl_df$Season_Pass_Membership_Perks)
all_spl_df$Merchandise_Tie_ins_num <- as.numeric(all_spl_df$Merchandise_Tie_ins)
all_spl_df$Public_Interest_Climate_num <- as.numeric(all_spl_df$Public_Interest_Climate)
all_spl_df$Safety_Concerns_num <- as.numeric(all_spl_df$Safety_Concerns)
all_spl_df$Fan_Culture_Presence_num <- as.numeric(all_spl_df$Fan_Culture_Presence)
all_spl_df$Community_Engagement_num <- as.numeric(all_spl_df$Community_Engagement)
all_spl_df$National_Pride_Factor_num <- as.numeric(all_spl_df$National_Pride_Factor)
all_spl_df$Advertising_Promotion_num <- as.numeric(all_spl_df$Advertising_Promotion)
all_spl_df$Media_Coverage_num <- as.numeric(all_spl_df$Media_Coverage)
all_spl_df$Ticket_Info_Accessibility_num <- as.numeric(all_spl_df$Ticket_Info_Accessibility)
all_spl_df$Social_Media_Followers_Club_num <- as.numeric(all_spl_df$Social_Media_Followers_Club)
all_spl_df$Social_Media_Followers_Players_num <- as.numeric(all_spl_df$Social_Media_Followers_Players)
# Check the structure again
str(all_spl_df)
#Check for correlation between variables
correlation_vars <- c(
"Attendance", "Stadium_Capacity", "Match_Temperature", "Match_Humidity",
"Team_Ranking_num", "Historical_Performance_Local_num", "Historical_Performance_Continental_num",
"National_Capped_Players_num", "Star_Foreign_Marquee_Players_num", "Current_League_Form_num",
"Popularity_Away_Team_num", "Match_Importance_num", "Broadcast_Availability_num",
"Comfort_Accessibility_num", "Fan_Facilities_num", "Ticket_Pricing_Tier_num",
"Promotions_num", "Season_Pass_Membership_Perks_num", "Merchandise_Tie_ins_num",
"Public_Interest_Climate_num", "Safety_Concerns_num", "Fan_Culture_Presence_num",
"Community_Engagement_num", "National_Pride_Factor_num", "Advertising_Promotion_num",
"Media_Coverage_num", "Ticket_Info_Accessibility_num", "Social_Media_Followers_Club_num",
"Social_Media_Followers_Players_num"
)
correlation_matrix <- cor(all_spl_df[correlation_vars], use="complete.obs")
#Visualise corrplot with values
corrplot(correlation_matrix, method="color", type="upper", tl.col="black", tl.srt=45, addCoef.col = "black", number.cex=0.7)
#Identify the moderate-to-strong correlations with Attendance
correlation_with_attendance <- correlation_matrix[,"Attendance"]
correlation_with_attendance <- sort(correlation_with_attendance, decreasing=TRUE)
correlation_with_attendance <- correlation_with_attendance[abs(correlation_with_attendance) > 0.3 & names(correlation_with_attendance) != "Attendance"]
correlation_with_attendance
#Visualise the correlations with Attendance
corrplot(as.matrix(correlation_with_attendance), method="color", tl.col="black", tl.srt=45, addCoef.col = "black", number.cex=0.7)
Step 3.5: Data Splitting
Before building the model, we need to split the data into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate the model’s performance on unseen data.
set.seed(123) # For reproducibility
train_index <- createDataPartition(all_spl_df$Attendance, p=0.8, list=FALSE)
train_data <- all_spl_df[train_index, ]
test_data <- all_spl_df[-train_index, ]
Setting the seed means that every time anyone runs the code, they will get the same set of random numbers, provided they use the same numbers in the bracket. The numbers in the set.seed() function can be any integer, in this case, we used 123.
The createDataPartition() function from the caret package is used to create a stratified random sample of the data, ensuring that the distribution of the target variable (Attendance) is similar in both the training and testing sets.
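To illustrate what "stratified" means here, the sketch below builds a similar split by hand in base R: attendance is grouped into quartile bins, and 80% of the rows are sampled within each bin. This is an illustration of the idea on made-up numbers, not caret's actual implementation, and the object names (`toy`, `toy_train`, `toy_test`) are hypothetical.

```r
set.seed(123)
# Toy attendance figures (made up for illustration, all distinct)
toy <- data.frame(Attendance = sample(500:9000, 200))

# Group attendance into four quartile bins
bins <- cut(toy$Attendance,
            breaks = quantile(toy$Attendance, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

# Sample 80% of the row numbers within each bin
train_rows <- unname(unlist(lapply(
  split(seq_len(nrow(toy)), bins),
  function(rows) sample(rows, size = ceiling(0.8 * length(rows)))
)))

toy_train <- toy[train_rows, , drop = FALSE]
toy_test  <- toy[-train_rows, , drop = FALSE]

# The two summaries should look alike if the split is well balanced
summary(toy_train$Attendance)
summary(toy_test$Attendance)
```

Because every bin contributes the same share of rows to both sets, neither set ends up with only the low-attendance or only the high-attendance games.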
Step 4: Model Building
In this step, we will build a machine learning model to predict fan attendance at live games based on the various factors in our data set.
We will use a linear regression model for this purpose, as it is a simple yet effective algorithm for predicting continuous outcomes. We will build three models: a null model, a full model, and a stepwise model.
A null model is a model that only includes the intercept term, while a full model includes all the predictors in the data set.
A stepwise model is built by adding or removing predictors based on their statistical significance, using a criterion such as the Akaike Information Criterion (AIC).
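# Note: the models below are fitted on the full data set (all_spl_df), which includes the rows in test_data.
# For a strict evaluation on unseen data, fit them with data=train_data instead.
lm_null <- lm(Attendance ~ 1, data=all_spl_df)
lm_full <- lm(Attendance ~ .-GameID, data=all_spl_df)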
#Stepwise feature selection based on AIC
lm_step <- step(lm_null, scope=list(lower=lm_null, upper=lm_full), direction="both", trace=0)
#Compare summaries of all models
summary(lm_null)
summary(lm_full)
summary(lm_step)
The summaries of the three models will provide information on the coefficients, R-squared values, and p-values for each predictor. The stepwise model will help us identify the most important predictors that significantly influence fan attendance at live games.
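Besides reading the summaries, the three models can be compared directly on AIC, where a lower value indicates a better trade-off between fit and complexity. The sketch below uses a small made-up data set (`toy`, with a hypothetical `capacity` predictor that matters and a `humidity` predictor that is pure noise) rather than the case study data.

```r
set.seed(123)
n <- 200
toy <- data.frame(
  capacity = sample(c(2700, 5100, 7100, 10000), n, replace = TRUE),
  humidity = sample(70:93, n, replace = TRUE)
)
# In this toy data, attendance depends on capacity only; humidity is noise
toy$attendance <- 0.5 * toy$capacity + rnorm(n, sd = 500)

toy_null <- lm(attendance ~ 1, data = toy)
toy_full <- lm(attendance ~ capacity + humidity, data = toy)
toy_step <- step(toy_null,
                 scope = list(lower = formula(toy_null),
                              upper = formula(toy_full)),
                 direction = "both", trace = 0)

# Lower AIC is better
AIC(toy_null, toy_full, toy_step)

# Which predictors did the stepwise search keep?
formula(toy_step)
```

For the case study, the equivalent call would be AIC(lm_null, lm_full, lm_step).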
Step 5: Model Inspection
It is now time to inspect the model we have built. This step is crucial to ensure that the model is valid and reliable. To do so, we will check the model diagnostics for the stepwise model. Model diagnostics help us assess the assumptions of the linear regression model, such as linearity, normality of residuals, homoscedasticity, and independence of errors.
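The four diagnostic plots discussed in this step can be produced by calling plot() on a fitted lm object. The sketch below does this on a small made-up model (`toy_fit`); for the case study, the assumed equivalent is plot(lm_step).

```r
set.seed(123)
# Toy data (made up for illustration)
toy <- data.frame(
  capacity = sample(c(2700, 5100, 7100, 10000), 200, replace = TRUE)
)
toy$attendance <- 0.5 * toy$capacity + rnorm(200, sd = 500)
toy_fit <- lm(attendance ~ capacity, data = toy)

# Arrange the four diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(toy_fit)
par(old_par)  # restore the previous plotting layout
```

In RStudio, the plots appear in the Plots tab of the bottom right panel.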
What am I looking at?
These are your model diagnostic plots. Think of them as a health check-up for the linear regression model you just built.
They help you check if your model meets the key “assumptions” of linear regression. If these plots look good, you can be more confident in your model’s results (like your p-values and R²).
Here’s a breakdown of each plot:
1. Residuals vs Fitted
What it is: This plot shows your model’s prediction errors (the residuals) on the y-axis against its predicted attendance numbers (the fitted values) on the x-axis.
What we want to see: A random “shotgun blast” of points. The red line should be mostly flat and centered on zero.
What it tells us: If the red line is flat at zero, it means our model’s errors are random, which is good. If the line has a clear curve (like the slight “U” shape we see here), it suggests our model might be missing something. For example, the relationship between stadium size and attendance might not be perfectly linear, and a more complex model could be slightly better.
2. Q-Q Residuals
What it is: This is the Normal Q-Q plot. It checks if your model’s errors (residuals) are “normally distributed” (i.e., follow a classic bell curve). This is a key assumption for linear regression.
What we want to see: All the black dots should fall perfectly along the dashed straight line.
What it tells us: Your plot looks excellent. The dots stick to the line almost perfectly. This means the “normality” assumption is met, and you can trust the p-values and confidence intervals your model is giving you.
3. Scale-Location
What it is: This plot is similar to the first one, but it checks if the “spread” (or variance) of your errors is consistent across all predictions. This is the “homoscedasticity” assumption.
What we want to see: A random scatter of points with a flat red line. We don’t want to see a “funnel” or “megaphone” shape (where the points get more spread out from left to right).
What it tells us: Your plot looks good. The red line is relatively flat, and the spread of the points seems consistent. This assumption is also met.
4. Residuals vs Leverage
What it is: This is the “outlier detector.” It helps you find individual data points that might be having a large and potentially negative influence on your model.
What we want to see: We want all our points to be clustered together and (most importantly) inside the dashed red lines. Those dashed lines represent “Cook’s distance,” which is a measure of influence.
What it tells us: Your plot is perfect. All the points are well inside the Cook’s distance lines. This means you don’t have any single “super-outlier” games that are skewing your entire model.
Step 6: Prediction
After checking the model diagnostics, we can now use the model to make predictions on the test data. This is where we see how well our model performs on unseen data, i.e. data that was not used to train the model (remember the test data we created in Step 3.5?).
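Predictions from a fitted model are obtained with predict(), passing the held-out rows as newdata. The sketch below shows the pattern on made-up data; for the case study, the assumed equivalent is predictions <- predict(lm_step, newdata = test_data), which creates the `predictions` object used in the evaluation step.

```r
set.seed(123)
# Toy data (made up for illustration)
toy <- data.frame(
  capacity = sample(c(2700, 5100, 7100, 10000), 200, replace = TRUE)
)
toy$attendance <- 0.5 * toy$capacity + rnorm(200, sd = 500)

# Hold out the last 40 rows as "unseen" test data
toy_train <- toy[1:160, ]
toy_test  <- toy[161:200, ]

# Train on the training rows only, then predict the held-out rows
toy_fit <- lm(attendance ~ capacity, data = toy_train)
toy_predictions <- predict(toy_fit, newdata = toy_test)

# Compare the first few predictions with the actual values
head(data.frame(actual = toy_test$attendance,
                predicted = toy_predictions))
```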
Step 7: Evaluation
Now that we have made predictions on the test data, we need to evaluate the model’s performance. We will use two common metrics for regression models: Root Mean Squared Error (RMSE) and R-squared (R²).
#Calculate RMSE and R-squared for stepwise model
rmse <- sqrt(mean((test_data$Attendance - predictions)^2))
ss_total <- sum((test_data$Attendance - mean(test_data$Attendance))^2)
ss_residual <- sum((test_data$Attendance - predictions)^2)
r_squared <- 1 - (ss_residual / ss_total)
cat("RMSE:", rmse, "\n")
cat("R-squared:", r_squared, "\n")
## RMSE: 838.0364
## R-squared: 0.8481487
The RMSE value summarises the typical size of the difference between the predicted and actual attendance values. It is the square root of the mean squared error, so large errors are penalised more heavily than small ones. A lower RMSE value indicates better model performance. In this case, an RMSE of 838.04 suggests that the model’s predictions are typically off by roughly 838 attendees.
The R-squared value indicates the proportion of variance in the dependent variable (Attendance) that can be explained by the independent variables in the model. An R-squared value closer to 1 indicates a better fit. In this case, an R-squared value of 0.848 suggests that the model explains approximately 84.8% of the variance in fan attendance at live games in the Singapore Premier League (SPL).
Step 8: Hyperparameter Tuning
Step 9: Interpretation and Results
So we have built, inspected, predicted, evaluated, and tuned our model. Now it’s time to interpret the results and understand the key determinants that drive fan attendance at live games in the Singapore Premier League (SPL).
# Display coefficients of the stepwise model
coefficients <- summary(lm_step)$coefficients
coefficients_table <- as.data.frame(coefficients)
colnames(coefficients_table) <- c("Estimate", "Std. Error", "t value", "p-value")
kable(coefficients_table, caption="Stepwise Model Coefficients") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)

| | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | -1509.1134833 | 105.4550238 | -14.3104940 | 0.0000000 |
| Stadium_Capacity | 0.5557288 | 0.0062897 | 88.3551263 | 0.0000000 |
| Match_Importancecup final/title decider | 3586.1696075 | 51.9989676 | 68.9661694 | 0.0000000 |
| Match_Importancecup knockout | 1709.4633820 | 51.6719327 | 33.0830161 | 0.0000000 |
| Match_Importancefriendly | 696.3456140 | 52.5027323 | 13.2630357 | 0.0000000 |
| Match_Importanceregular season | 1285.6777768 | 52.5525724 | 24.4646022 | 0.0000000 |
| Ticket_Info_Accessibility_num | -50.6530504 | 20.2065235 | -2.5067672 | 0.0122558 |
| Popularity_Away_Team_num | 45.3350522 | 20.1640451 | 2.2483114 | 0.0246551 |
| Social_Media_Followers_Playerslow | 100.8514031 | 40.8256671 | 2.4702941 | 0.0135756 |
| Social_Media_Followers_Playersmedium | 84.1251212 | 40.1557788 | 2.0949692 | 0.0362880 |
| Broadcast_Availability_num | 37.2356311 | 20.1076250 | 1.8518165 | 0.0641863 |
| Season_Pass_Membership_Perksyes | -61.0983065 | 33.0687326 | -1.8476156 | 0.0647923 |
| Home_TeamBalestier Khalsa | 20.8554330 | 72.6598943 | 0.2870281 | 0.7741179 |
| Home_TeamBG Tampines Rovers | 31.7570772 | 72.5941261 | 0.4374607 | 0.6618203 |
| Home_TeamDPMM | -32.2078506 | 72.6137062 | -0.4435506 | 0.6574112 |
| Home_TeamGeylang International | -89.8306717 | 72.0230084 | -1.2472496 | 0.2124389 |
| Home_TeamHougang United | 144.1162549 | 72.3490190 | 1.9919587 | 0.0464992 |
| Home_TeamLion City Sailors | -15.7999305 | 72.7865440 | -0.2170721 | 0.8281723 |
| Home_TeamTanjong Pagar | 27.3243349 | 72.6511094 | 0.3761035 | 0.7068763 |
| Home_TeamYoung Lions | -94.0505247 | 72.0385154 | -1.3055589 | 0.1918393 |
This table is the answer to the second part of our problem statement: “find out key determinants that drives these attendance numbers.” Here is a breakdown of what each column means.
How to Read This Table
Let’s focus on the two most important columns for our analysis: Estimate and p-value.
1. Estimate (The “Effect”)
This is the most important number. It tells you the size and direction of each variable’s effect on Attendance.
For a numeric variable (like Stadium_Capacity): The Estimate is 0.5557.
Interpretation: For every 1 additional seat of capacity in a stadium, the model predicts attendance will increase by about 0.56 people (holding all other factors equal). For example, 1,000 extra seats correspond to roughly 556 more predicted attendees.
For a categorical variable (like Match_Importance): This is a bit different. R automatically picks one level to be the “baseline” or “default”, which by default is the first level in alphabetical order (in this case, it was “bottom table clash,” which is why it’s not in the list).
The Estimate for Match_Importancecup final/title decider is 3586.17.
Interpretation: The model predicts a “cup final/title decider” will have 3,586 more fans than a “bottom table clash” (holding all other factors equal).
Similarly, a “regular season” game is predicted to have 1,286 more fans than a “bottom table clash.”
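The baseline behaviour described above can be reproduced on a tiny example. The data frame `toy` and its attendance figures are invented for illustration only. The sketch shows which level R treats as the baseline, how each coefficient is a difference from that baseline, and how `relevel()` changes the comparison group.

```r
# Made-up toy data: two games per match type (NOT the SPL data)
toy <- data.frame(
  Attendance = c(1000, 1200, 2300, 2500, 4600, 4800),
  Match_Importance = factor(c(
    "bottom table clash", "bottom table clash",
    "regular season", "regular season",
    "cup final/title decider", "cup final/title decider"
  ))
)

# The first factor level is the baseline that disappears from the table
baseline <- levels(toy$Match_Importance)[1]
print(baseline)

# Each coefficient is the difference in mean attendance versus the baseline
fit <- lm(Attendance ~ Match_Importance, data = toy)
print(coef(fit))

# Choose a different baseline and refit: the estimates are now relative to it
toy$Match_Importance <- relevel(toy$Match_Importance, ref = "regular season")
fit_releveled <- lm(Attendance ~ Match_Importance, data = toy)
print(coef(fit_releveled))
```

In the first fit the intercept is the mean of the “bottom table clash” games (1,100) and the “regular season” estimate is 1,300, the gap between the two group means. After releveling, the same gap shows up as -1,300 on the “bottom table clash” row.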
2. p-value (The “Significance”)
This column tells you if the variable is a statistically significant predictor. It answers the question: “Is this variable’s effect real, or did it just show up due to random chance?”
The Rule of Thumb: A p-value less than 0.05 is considered “significant.”
Interpretation: When you see a p-value with a lot of zeros (like 0.0000000 for Stadium_Capacity), the true value is simply too small to show at this rounding. That is very strong evidence that the variable has a real, measurable association with attendance rather than one that appeared by chance.
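A p-value printed as 0.0000000 is rounded, not literally zero. One way to see the underlying values is base R's `format.pval()`. The sketch below uses the built-in mtcars data and a made-up formula as a stand-in for the stepwise model.

```r
# Illustrative sketch on built-in mtcars data (NOT the SPL attendance data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The p-values live in the "Pr(>|t|)" column of the coefficient matrix
coef_table <- summary(fit)$coefficients
p_values   <- coef_table[, "Pr(>|t|)"]

# Show very small values in scientific notation instead of rounding to zero
print(format.pval(p_values, digits = 3))

# Or report anything below a threshold as "<0.001"
print(format.pval(p_values, digits = 3, eps = 0.001))
```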
Conclusion
What are our Key Determinants?
Based on the table above, our model has found several significant drivers of attendance.
The “Heavy Hitters” (p-value < 0.05):
Stadium_Capacity
Match_Importance (all levels)
Ticket_Info_Accessibility_num (This is negative, suggesting “easier” access (a higher number) decreases attendance, which is interesting!)
Popularity_Away_Team_num
Social_Media_Followers_Players
Home_TeamHougang United (This is just barely significant, suggesting games at Hougang’s stadium have a small positive effect compared to the baseline team).
Not as Important (p-value > 0.05):
Broadcast_Availability_num
Season_Pass_Membership_Perksyes
Most of the other Home_Team variables (like Balestier Khalsa, DPMM, etc.). This means that once you account for stadium size and match importance, the specific home team (other than Hougang) doesn’t have a statistically significant effect.
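Rather than scanning the table by eye, the split into significant and non-significant predictors can be done in code. This is a minimal sketch on the built-in mtcars data with a made-up formula. The `Significant` column is a hypothetical addition, and 0.05 is the rule-of-thumb threshold used above.

```r
# Illustrative sketch on built-in mtcars data (NOT the SPL attendance data).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Same table-building steps as in Step 9
coef_df <- as.data.frame(summary(fit)$coefficients)
colnames(coef_df) <- c("Estimate", "Std. Error", "t value", "p-value")

# Drop the intercept, sort by p-value, and flag the 0.05 rule of thumb
coef_df <- coef_df[rownames(coef_df) != "(Intercept)", ]
coef_df <- coef_df[order(coef_df[["p-value"]]), ]
coef_df$Significant <- coef_df[["p-value"]] < 0.05

print(coef_df)
```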
So there you have it! In this guide, we’ve successfully walked through an entire predictive modeling project, from creating a data set to interpreting our final model. You now have a foundational workflow for tackling linear regression problems in R. Look out for Part 2, where we dive into more complex models that can possibly capture more non-linear interactions!