NOTE: This might be a long document for some. If you are already familiar with the concepts of sports analytics and machine learning, feel free to skip to the “Getting Started” section. If you have R and RStudio already set up on your device, and understand the basics of coding, you may skip directly to the “Case Study” section for a hands-on experience.

Introduction

Sports analytics and machine learning have become increasingly intertwined in recent years, revolutionising the way teams, coaches, and analysts approach decision-making in sports. This document aims to provide a beginner-friendly introduction to machine learning in sports analytics, covering fundamental concepts, techniques, and applications.

If you’re wondering what sports analytics and machine learning are all about, let’s start with some definitions.

If you’re just interested in the practical aspects of machine learning in sports analytics, feel free to skip to the “What you need” section.

What is Sports Analytics?

Sports analytics involves the collection, analysis, and interpretation of data related to sports performance, player statistics, and game outcomes. Historically, sports analytics focused on basic statistics such as goals or points scored, and win-loss records. However, with advancements in technology and data collection methods, sports analytics has evolved to encompass a wide range of data types, including player tracking data, biometric data, and even social media sentiment analysis.

In recent years, sports analytics has come to rely on statistical methods and machine learning algorithms to extract insights that can inform strategies, improve player performance, and enhance fan engagement.

What is Machine Learning?

Machine learning (ML) is a subset of Artificial Intelligence (AI). It focuses on developing algorithms and models that enable computers to learn from and make predictions or decisions based on data. Unlike traditional programming, where explicit instructions are provided, machine learning algorithms identify patterns in data and use these patterns to make informed decisions or predictions.

Machine learning can be broadly categorized into three main types:

  1. Supervised Learning: In supervised learning, the algorithm is trained on a labeled dataset, where each input is associated with a corresponding output. The goal is to learn a mapping from inputs to outputs, allowing the model to make predictions on new, unseen data. Common algorithms include linear regression, decision trees, and support vector machines.

  2. Unsupervised Learning: Unsupervised learning involves training the algorithm on an unlabeled dataset, where the goal is to identify patterns or structures within the data. Common techniques include clustering (e.g., k-means) and dimensionality reduction (e.g., principal component analysis).

  3. Reinforcement Learning: In reinforcement learning, an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties based on its actions, and the goal is to learn a policy that maximizes cumulative rewards over time.
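To make the first two categories concrete, here is a minimal sketch in base R. The player numbers are made up purely for illustration: the supervised part learns goals from minutes played, while the unsupervised part groups the same players without being told any outcome.

```r
# Toy data (hypothetical values): minutes played and goals scored by 8 players
minutes <- c(300, 600, 900, 1200, 1500, 1800, 2100, 2400)
goals   <- c(1, 2, 4, 5, 6, 8, 9, 11)

# Supervised learning: the outcome (goals) is known, so the model learns
# a mapping from the input (minutes) to the output (goals)
fit <- lm(goals ~ minutes)
predict(fit, newdata = data.frame(minutes = 2000))

# Unsupervised learning: no outcome is supplied; k-means simply groups
# players that look similar to each other
players  <- data.frame(minutes = minutes, goals = goals)
set.seed(1)  # k-means starts from random centres
clusters <- kmeans(scale(players), centers = 2, nstart = 10)
clusters$cluster
```

Reinforcement learning is left out here because it needs an environment for the agent to interact with, which goes beyond a short snippet.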

How does Machine Learning add value to existing practices?

Machine learning enhances sports analytics by providing advanced tools and techniques for data analysis, prediction, and decision-making.

Here are some ways machine learning adds value to existing practices:

  1. Improved Predictive Accuracy: Machine learning algorithms can analyze large datasets and identify complex patterns that traditional statistical methods may miss, leading to more accurate predictions of player performance, game outcomes, and injury risks.

  2. Real-time Analysis: Machine learning models can process data in real-time, allowing teams and coaches to make informed decisions during games based on up-to-date information.

  3. Personalized Training and Strategy: Machine learning can help tailor training programs and game strategies to individual players based on their unique strengths, weaknesses, and performance data.

  4. Enhanced Fan Engagement: By analyzing fan behavior and preferences, machine learning can help teams and organizations create personalized marketing campaigns and improve fan experiences.

Applications of Machine Learning in Sports Analytics

Machine learning has found numerous applications in sports analytics, including but not limited to:

  1. Player Performance Analysis: Machine learning models can analyze player statistics and biometric data to assess performance, identify strengths and weaknesses, and predict future performance.

  2. Injury Prediction and Prevention: By analyzing historical injury data and player workload, machine learning algorithms can identify risk factors and predict the likelihood of injuries, allowing teams to implement preventive measures.

  3. Fan Engagement and Marketing: Machine learning can analyze fan behavior and preferences to personalize marketing efforts, improve fan experiences, and increase engagement.

With this comprehensive overview, let’s dive into the practical aspects of machine learning in sports analytics!

Getting Started

What you need

We will be using the R programming language and RStudio as our integrated development environment (IDE) for this guide. Please ensure you have the following set up on your device:

  1. R and RStudio: Make sure you have R and RStudio installed on your computer. You can download R from CRAN and RStudio from RStudio’s website.

  2. Datasets: For any analysis, we would need a dataset to use. R allows us to create mock datasets, which we will be using, based on real-world sports data.

How to use this guide

This guide is structured to provide a step-by-step approach to understanding and applying the concepts discussed. Each section builds upon the previous one, so it’s recommended to follow along sequentially.

You may also choose to copy the pre-written code chunks and paste them into your R script (more of this in a bit) in RStudio to run them.

The code chunks are grey-shaded boxes that look like this:

# This is an example code chunk
print("Hello, World!")
## [1] "Hello, World!"

You may also choose to click the green triangle button on the top right of each code chunk to run the code directly in this document. After clicking the button, you should see the output below the code chunk. In this case, you should see [1] “Hello, World!” as the output below the code chunk.

The hashtag # followed by text is called a comment. Comments are not executed as part of the code but are there to provide explanations or context about what the code does.

How to code in R

After installing R and RStudio on your device, open RStudio. You should see four general panels:

  1. the script editor (top left),

  2. the console (bottom left),

  3. the environment/history (top right), and

  4. the files/plots/packages/help/viewer (bottom right).

Generally, the script editor is where you write and save your code (i.e. top left panel), and the console is where you can run your code (i.e. bottom left panel).

TRY IT OUT:

Create a new R script by going to File > New File > R Script. This is where you can write and save your code.

After opening the R script, type the following code in the script editor (i.e. the top left panel): print("Hello, World!")

Click the “Run” button or press “Ctrl + Enter” (“Cmd + Return” on a Mac) to execute the code.

You should see the output [1] "Hello, World!" in the console (i.e. the bottom left panel).

Congratulations! You’ve just written and executed your first R code. Practice makes permanent, so let’s try writing and running more code by going straight to the case study below.

Case Study

Understanding the Problem Statement

You are an aspiring analyst interested in understanding fan engagement in Singapore football. You have been given a data set containing various factors that may influence fan attendance at live football matches in the Singapore Premier League (SPL).

Problem statement: Using machine learning techniques, predict fan attendance numbers at SPL football matches and identify the key determinants that drive these attendance numbers.

We will explore how predictive analytics can be applied to this problem statement using a mock data set. In predictive analytics in general, there are nine steps to follow:

  1. Data Preparation
  2. Load required packages in R
  3. Data Exploration and Visualisation, followed by Data Splitting (Step 3.5)
  4. Model Building
  5. Model Inspection
  6. Prediction
  7. Evaluation
  8. Hyperparameter Tuning
  9. Interpretation and Results

Step 1: Data preparation

Typically, the data set would be stored as a CSV file, an Excel file, or similar.

To import the data, you would use the following code:

# import dataset
# dataset <- read.csv("path/to/your/dataset.csv") # Uncomment this line and replace with the actual path to your dataset
# view the first few rows of your dataset
# head(dataset) # Uncomment this line 

The dataset variable will now contain your data, and you can use the head() function to view the first few rows of the data set.
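If you do not have a file at hand yet, the import step can still be tried out. The sketch below writes a tiny, made-up table to a temporary CSV file and reads it back, so the path and the values are assumptions for illustration only.

```r
# Hypothetical example data, written to a temporary file so the sketch is self-contained
demo_path <- tempfile(fileext = ".csv")
demo_data <- data.frame(
  Home_Team  = c("Team A", "Team B", "Team C"),
  Attendance = c(1500, 2300, 1800)
)
write.csv(demo_data, demo_path, row.names = FALSE)

# Import the file exactly as you would import a real data set
dataset <- read.csv(demo_path)

# View the first few rows and the column types
head(dataset)
str(dataset)
```

With a real file, only the path passed to read.csv() changes.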

If the file is an excel file, you would use the following code:

# install.packages("readxl") # Uncomment this line if you haven't installed the readxl package
# library(readxl)
# dataset <- read_excel("path/to/your/dataset.xlsx") # Uncomment this line and replace with the actual path to your dataset
# head(dataset) # Uncomment this line

However, since we do not have access to a real-world data set for this case study, we will create a mock data set instead.

The mock data set contains different factors that may influence fan attendance at live games. Copy and paste the following code chunk into your R script and run it to create the mock data set.

library(tidyverse)
library(data.table)

# Set seed for reproducibility. This means that every time anyone runs the code,
# they will get the same set of random numbers. The number in set.seed() can be
# any integer; in this case, we used 123.
set.seed(123)

# SPL season structure info
spl_seasons <- data.frame(
  Season_Start = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025),
  Season_End   = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2025, 2026),
  n_teams      = c(10, 11, 12, 12, 12, 12, 12, 13, 12, 12, 10, 9, 9, 9, 9, 8, 8, 8, 9, 9, 8),
  n_rounds     = c(3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 1, 3, 4, 3, 4, 3),
  total_games  = c(135, 165, 198, 198, 198, 198, 198, 156, 162, 162, 135, 108, 108, 108, 108, 56, 84, 112, 108, 144, 84)
)

# Modern SPL teams and abbreviations (as placeholders for all years)
teams_modern <- c("Lion City Sailors", "DPMM", "Balestier Khalsa", "Tanjong Pagar",
                  "BG Tampines Rovers", "Hougang United", "Geylang International",
                  "Young Lions", "Albirex Niigata (S)")
team_abbr_modern <- c("LCS", "DPMM", "BAL", "TPG", "BGT", "HGU", "GEY", "YLS", "ALB")
names(team_abbr_modern) <- teams_modern

# Stadium details
stadiums <- c("JBS", "Bishan", "OTH", "JE")
stadium_capacity <- c("JBS" = 7100, "Bishan" = 10000, "OTH" = 5100, "JE" = 2700)
stadium_location  <- c("JBS"="Central", "Bishan"="Central", "OTH"="East", "JE"="West")
stadium_identity  <- c("JBS"="iconic/heritage", "Bishan"="standard/neutral",
                       "OTH"="standard/neutral", "JE"="standard/neutral")
stadium_capacity_type <- c("JBS"="small", "Bishan"="medium", "OTH"="small", "JE"="small")

# Attendance factor levels
team_ranking       <- c("high", "medium", "low")
history_local      <- c("champion", "top-half", "mid-table", "bottom-half", "relegation zone")
history_continental<- c("frequent participant", "occasional participant", "never participated")
national_capped    <- c("none", "1-2", "3+")
star_marquee       <- c("absent", "present")
league_form        <- c("good form", "mixed form", "poor form")
rivalry            <- c("no", "yes")
popularity_away    <- c("low", "medium", "high")
match_time_cat     <- c("afternoon", "evening", "night")
match_importance   <- c("friendly", "regular season", "cup knockout", "cup final/title decider", "bottom table clash")
broadcast_avail    <- c("none", "local TV", "international TV/streaming")
match_style        <- c("defensive/slow", "balanced", "attacking/fast-paced")
comfort_access     <- c("poor", "average", "good")
fan_facilities     <- c("basic", "moderate", "extensive")
ticket_pricing     <- c("low", "medium", "high")
promotions         <- c("none", "group/family", "student/youth", "bundled with merchandise")
season_pass        <- c("no", "yes")
merch_tiein        <- c("no", "yes")
weather_cat        <- c("sunny", "rainy", "hazy", "humid", "thunderstorm")
competing_event    <- c("none", "minor local event", "major local event", "international event")
interest_climate   <- c("low", "medium", "high")
safety_concern     <- c("none", "moderate", "high")
fan_culture        <- c("weak", "moderate", "strong")
comm_engage        <- c("low", "medium", "high")
nat_pride          <- c("not applicable", "local heroes playing", "national representation")
aud_div            <- c("mostly locals", "mix of locals & expats", "mostly expats")
ad_promo           <- c("weak", "moderate", "strong")
media_cover        <- c("low", "moderate", "high")
ticket_info        <- c("difficult", "average", "easy")
partnership        <- c("none", "school groups", "corporate groups")
sm_followers_club  <- c("low", "medium", "high")
sm_followers_players <- c("low", "medium", "high")

# Initialize master dataframe
all_spl_df <- data.frame()

for (season_ix in 1:nrow(spl_seasons)) {
  season_row    <- spl_seasons[season_ix, ]
  n_teams       <- season_row$n_teams
  n_rounds      <- season_row$n_rounds
  total_games   <- season_row$total_games
  year_start    <- season_row$Season_Start
  year_end      <- season_row$Season_End

  # Use the first n_teams from the modern list for each season
  teams_this_season    <- teams_modern[1:n_teams]
  team_abbr_this_season<- team_abbr_modern[teams_this_season]

  # Build fixtures (all pairs home/away excluding self, repeated n_rounds)
  fixtures <- expand.grid(Home_Team=teams_this_season, Away_Team=teams_this_season, stringsAsFactors=FALSE)
  fixtures <- fixtures[fixtures$Home_Team != fixtures$Away_Team, ]
  fixtures <- fixtures[rep(1:nrow(fixtures), n_rounds), ]
  fixtures <- fixtures[1:total_games, ] # truncate extra if any

  # Dates, times, and random details
  sd <- as.Date(paste0(year_end, "-01-01"))
  ed <- as.Date(paste0(year_end, "-12-31"))
  rand_dates    <- sample(seq.Date(sd, ed, by="day"), total_games, replace=FALSE)
  rand_days     <- weekdays(rand_dates)
  rand_unix     <- as.numeric(as.POSIXct(rand_dates + hours(sample(15:21, total_games, replace=TRUE))))
  match_stadiums<- sample(stadiums, total_games, replace=TRUE)
  stadium_caps  <- stadium_capacity[match_stadiums]
  stadium_locs  <- stadium_location[match_stadiums]
  stadium_ids   <- stadium_identity[match_stadiums]
  stadium_types <- stadium_capacity_type[match_stadiums]
  spl_ranks     <- sample(1:n_teams, total_games, replace=TRUE)
  continental   <- ifelse(spl_ranks %in% c(1,2), "Yes", "No")

  # Create the season's dataframe
  season_df <- data.frame(
    Season_Start                  = year_start,
    Season_End                    = year_end,
    Game_Date                     = rand_dates,
    GameID                        = paste0(format(rand_dates, "%d%m%Y"), "-", team_abbr_this_season[fixtures$Home_Team], team_abbr_this_season[fixtures$Away_Team]),
    Match_Day                     = rand_days,
    Match_Time                    = rand_unix,
    Match_Temperature             = sample(27:32, total_games, replace=TRUE),
    Match_Humidity                = sample(70:93, total_games, replace=TRUE),
    Match_Weather                 = sample(weather_cat, total_games, replace=TRUE),
    Home_Team                     = fixtures$Home_Team,
    Away_Team                     = fixtures$Away_Team,
    HomeTeam_GoalsFor             = sample(0:5, total_games, replace=TRUE),
    HomeTeam_GoalsAgainst         = sample(0:5, total_games, replace=TRUE),
    AwayTeam_GoalsFor             = sample(0:5, total_games, replace=TRUE),
    AwayTeam_GoalsAgainst         = sample(0:5, total_games, replace=TRUE),
    Match_Stadium                 = match_stadiums,
    Stadium_Capacity              = stadium_caps,
    SPL_Rank                      = spl_ranks,
    Continental_Competition       = continental,
    Geographic_Location           = stadium_locs,
    Stadium_Identity_History      = stadium_ids,
    Stadium_Capacity_Type         = stadium_types,
    Team_Ranking                  = sample(team_ranking, total_games, replace=TRUE),
    Historical_Performance_Local  = sample(history_local, total_games, replace=TRUE),
    Historical_Performance_Continental = sample(history_continental, total_games, replace=TRUE),
    National_Capped_Players       = sample(national_capped, total_games, replace=TRUE),
    Star_Foreign_Marquee_Players = sample(star_marquee, total_games, replace=TRUE),
    Current_League_Form           = sample(league_form, total_games, replace=TRUE),
    Rivalry_Derby                 = sample(rivalry, total_games, replace=TRUE),
    Popularity_Away_Team          = sample(popularity_away, total_games, replace=TRUE),
    Time_of_Match                 = sample(match_time_cat, total_games, replace=TRUE),
    Day_of_Match                  = sample(c("weekday", "weekend", "public holiday"), total_games, replace=TRUE),
    Match_Importance              = sample(match_importance, total_games, replace=TRUE),
    Broadcast_Availability        = sample(broadcast_avail, total_games, replace=TRUE),
    Match_Style_Entertainment_Factor = sample(match_style, total_games, replace=TRUE),
    Comfort_Accessibility         = sample(comfort_access, total_games, replace=TRUE),
    Fan_Facilities                = sample(fan_facilities, total_games, replace=TRUE),
    Ticket_Pricing_Tier           = sample(ticket_pricing, total_games, replace=TRUE),
    Promotions                    = sample(promotions, total_games, replace=TRUE),
    Season_Pass_Membership_Perks  = sample(season_pass, total_games, replace=TRUE),
    Merchandise_Tie_ins           = sample(merch_tiein, total_games, replace=TRUE),
    Competing_Events              = sample(competing_event, total_games, replace=TRUE),
    Public_Interest_Climate       = sample(interest_climate, total_games, replace=TRUE),
    Safety_Concerns               = sample(safety_concern, total_games, replace=TRUE),
    Fan_Culture_Presence          = sample(fan_culture, total_games, replace=TRUE),
    Community_Engagement          = sample(comm_engage, total_games, replace=TRUE),
    National_Pride_Factor         = sample(nat_pride, total_games, replace=TRUE),
    Audience_Diversity            = sample(aud_div, total_games, replace=TRUE),
    Advertising_Promotion         = sample(ad_promo, total_games, replace=TRUE),
    Media_Coverage                = sample(media_cover, total_games, replace=TRUE),
    Ticket_Info_Accessibility     = sample(ticket_info, total_games, replace=TRUE),
    Partnerships                  = sample(partnership, total_games, replace=TRUE),
    Social_Media_Followers_Club   = sample(sm_followers_club, total_games, replace=TRUE),
    Social_Media_Followers_Players = sample(sm_followers_players, total_games, replace=TRUE),
    stringsAsFactors=FALSE
  )

  # Add this season to the master df
  all_spl_df <- rbind(all_spl_df, season_df)
}

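# Remove rows where teams are NA. These arise in seasons with more than nine teams,
# because only nine placeholder team names are available in teams_modern.
all_spl_df <- all_spl_df %>%
  filter(!is.na(Home_Team) & !is.na(Away_Team))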

#check for missing values
colSums(is.na(all_spl_df))

# all_spl_df contains all SPL seasons from 2005 to 2025/26
cat("Total games generated:", nrow(all_spl_df), "\n")
print(table(all_spl_df$Season_Start))
head(all_spl_df, 5)

# Attendance ranges (as a fraction of stadium capacity) for each match type.
# Match types without an entry here (e.g. "friendly") use the fallback range below.
attendance_ranges <- list(
  "cup final/title decider" = c(0.80, 1.0),
  "regular season"          = c(0.35, 0.70),
  "cup knockout"            = c(0.40, 0.80),
  "relegation battle"       = c(0.40, 0.75),
  "bottom table clash"      = c(0.20, 0.45)
)

all_spl_df$Attendance <- mapply(function(importance, capacity) {
  rng <- attendance_ranges[[tolower(importance)]]
  if (is.null(rng)) rng <- c(0.3, 0.6)
  as.integer(sample(seq(floor(rng[1]*as.numeric(capacity)), ceiling(rng[2]*as.numeric(capacity))), 1))
}, as.character(all_spl_df$Match_Importance), all_spl_df$Stadium_Capacity)

all_spl_df$Attendance <- pmin(all_spl_df$Attendance, all_spl_df$Stadium_Capacity)
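To see what the attendance step does for a single match, the sketch below repeats the same logic on its own. The helper name draw_attendance and the shortened range list are hypothetical and exist only for this illustration.

```r
# Shortened copy of the range list, for illustration only
attendance_ranges_demo <- list(
  "regular season"          = c(0.35, 0.70),
  "cup final/title decider" = c(0.80, 1.00)
)

# Hypothetical helper: draw one attendance figure for a match
draw_attendance <- function(importance, capacity) {
  rng <- attendance_ranges_demo[[importance]]
  if (is.null(rng)) rng <- c(0.3, 0.6)  # fallback when the match type has no entry
  lo <- floor(rng[1] * capacity)
  hi <- ceiling(rng[2] * capacity)
  sample(seq(lo, hi), 1)                # one whole number between lo and hi
}

set.seed(123)
regular_draw  <- draw_attendance("regular season", 5100)  # 35% to 70% of 5,100 seats
friendly_draw <- draw_attendance("friendly", 5100)        # no entry, so 30% to 60%
regular_draw
friendly_draw
```

Because attendance is drawn from ranges tied to match importance and stadium capacity, those two variables are the ones a model can realistically pick up in this mock data set.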

Step 2: Load required packages in R

What is a library?

A library, also known as a package in R, is a collection of code and functions that helps you perform specific tasks without having to write the code from scratch.

A simple analogy: Imagine you want to bake a cake. Instead of starting from scratch, you can use a cake mix (i.e. library/package) that contains all the necessary ingredients and instructions to make the cake quickly and easily.
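A package must be installed once before library() can load it. As a minimal sketch, the code below checks which of the packages used in this step are not yet installed. The install line is left commented out so that nothing is downloaded unless you choose to.

```r
# Packages loaded in this step
required_pkgs <- c("tidyverse", "caret", "ggpubr", "GGally", "corrplot",
                   "gridExtra", "knitr", "kableExtra", "flextable")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]
missing_pkgs

# install.packages(missing_pkgs) # Uncomment this line to install the missing packages
```

The same check can be used for data.table, which is loaded in Step 1.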

library(tidyverse)
library(caret)
library(ggpubr)
library(GGally)
library(corrplot)
library(gridExtra)
library(knitr)
library(kableExtra)
library(flextable)

Step 3: Data Exploration and Visualisation

Exploring the data set is one of the fundamentals of data science. It helps you understand the data better and identify any missing values, outliers, or patterns that may exist.

Some machine learning algorithms also require the data to be in a specific format (e.g. changing characters to numeric or factors), so data exploration helps you prepare the data accordingly.

summary(all_spl_df)
str(all_spl_df)
# Check for missing values
colSums(is.na(all_spl_df))

#Transform relevant variables into factors
attendance_factor_vars <- c(
  "Match_Day", "Match_Weather", "Home_Team", "Away_Team", "Match_Stadium", "Geographic_Location",
  "Stadium_Identity_History", "Stadium_Capacity_Type", "Team_Ranking", "Historical_Performance_Local",
  "Historical_Performance_Continental", "National_Capped_Players", "Star_Foreign_Marquee_Players",
  "Current_League_Form", "Rivalry_Derby", "Popularity_Away_Team", "Time_of_Match", "Day_of_Match",
  "Match_Importance", "Broadcast_Availability", "Match_Style_Entertainment_Factor", "Comfort_Accessibility",
  "Fan_Facilities", "Ticket_Pricing_Tier", "Promotions", "Season_Pass_Membership_Perks",
  "Merchandise_Tie_ins", "Competing_Events", "Public_Interest_Climate", "Safety_Concerns",
  "Fan_Culture_Presence", "Community_Engagement", "National_Pride_Factor", "Audience_Diversity",
  "Advertising_Promotion", "Media_Coverage", "Ticket_Info_Accessibility", "Partnerships",
  "Social_Media_Followers_Club", "Social_Media_Followers_Players", "Continental_Competition"
)

# Transform selected columns in all_spl_df to factors
all_spl_df[attendance_factor_vars] <- lapply(all_spl_df[attendance_factor_vars], as.factor)

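# Now convert to numeric variables for correlation.
# Note: as.numeric() on a factor returns its level codes, and as.factor() orders
# levels alphabetically (e.g. high = 1, low = 2, medium = 3), so these codes do
# not follow the natural low < medium < high order.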
all_spl_df$Team_Ranking_num <- as.numeric(all_spl_df$Team_Ranking)
all_spl_df$Historical_Performance_Local_num <- as.numeric(all_spl_df$Historical_Performance_Local)
all_spl_df$Historical_Performance_Continental_num <- as.numeric(all_spl_df$Historical_Performance_Continental)
all_spl_df$National_Capped_Players_num <- as.numeric(all_spl_df$National_Capped_Players)
all_spl_df$Star_Foreign_Marquee_Players_num <- as.numeric(all_spl_df$Star_Foreign_Marquee_Players)
all_spl_df$Current_League_Form_num <- as.numeric(all_spl_df$Current_League_Form)
all_spl_df$Popularity_Away_Team_num <- as.numeric(all_spl_df$Popularity_Away_Team)
all_spl_df$Match_Importance_num <- as.numeric(all_spl_df$Match_Importance)
all_spl_df$Broadcast_Availability_num <- as.numeric(all_spl_df$Broadcast_Availability)
all_spl_df$Comfort_Accessibility_num <- as.numeric(all_spl_df$Comfort_Accessibility)
all_spl_df$Fan_Facilities_num <- as.numeric(all_spl_df$Fan_Facilities)
all_spl_df$Ticket_Pricing_Tier_num <- as.numeric(all_spl_df$Ticket_Pricing_Tier)
all_spl_df$Promotions_num <- as.numeric(all_spl_df$Promotions)
all_spl_df$Season_Pass_Membership_Perks_num <- as.numeric(all_spl_df$Season_Pass_Membership_Perks)
all_spl_df$Merchandise_Tie_ins_num <- as.numeric(all_spl_df$Merchandise_Tie_ins)
all_spl_df$Public_Interest_Climate_num <- as.numeric(all_spl_df$Public_Interest_Climate)
all_spl_df$Safety_Concerns_num <- as.numeric(all_spl_df$Safety_Concerns)
all_spl_df$Fan_Culture_Presence_num <- as.numeric(all_spl_df$Fan_Culture_Presence)
all_spl_df$Community_Engagement_num <- as.numeric(all_spl_df$Community_Engagement)
all_spl_df$National_Pride_Factor_num <- as.numeric(all_spl_df$National_Pride_Factor)
all_spl_df$Advertising_Promotion_num <- as.numeric(all_spl_df$Advertising_Promotion)
all_spl_df$Media_Coverage_num <- as.numeric(all_spl_df$Media_Coverage)
all_spl_df$Ticket_Info_Accessibility_num <- as.numeric(all_spl_df$Ticket_Info_Accessibility)
all_spl_df$Social_Media_Followers_Club_num <- as.numeric(all_spl_df$Social_Media_Followers_Club)
all_spl_df$Social_Media_Followers_Players_num <- as.numeric(all_spl_df$Social_Media_Followers_Players)

# Check the structure again
str(all_spl_df)
#Check for correlation between variables 
correlation_vars <- c(
  "Attendance", "Stadium_Capacity", "Match_Temperature", "Match_Humidity",
  "Team_Ranking_num", "Historical_Performance_Local_num", "Historical_Performance_Continental_num",
  "National_Capped_Players_num", "Star_Foreign_Marquee_Players_num", "Current_League_Form_num",
  "Popularity_Away_Team_num", "Match_Importance_num", "Broadcast_Availability_num",
  "Comfort_Accessibility_num", "Fan_Facilities_num", "Ticket_Pricing_Tier_num",
  "Promotions_num", "Season_Pass_Membership_Perks_num", "Merchandise_Tie_ins_num",
  "Public_Interest_Climate_num", "Safety_Concerns_num", "Fan_Culture_Presence_num",
  "Community_Engagement_num", "National_Pride_Factor_num", "Advertising_Promotion_num",
  "Media_Coverage_num", "Ticket_Info_Accessibility_num", "Social_Media_Followers_Club_num",
  "Social_Media_Followers_Players_num"
)
correlation_matrix <- cor(all_spl_df[correlation_vars], use="complete.obs")
#Visualise corrplot with values
corrplot(correlation_matrix, method="color", type="upper", tl.col="black", tl.srt=45, addCoef.col = "black", number.cex=0.7)

# Identify the moderate-to-strong correlations with Attendance (|r| > 0.3)
correlation_with_attendance <- correlation_matrix[,"Attendance"]
correlation_with_attendance <- sort(correlation_with_attendance, decreasing=TRUE)
correlation_with_attendance <- correlation_with_attendance[abs(correlation_with_attendance) > 0.3 & names(correlation_with_attendance) != "Attendance"]
correlation_with_attendance

#Visualise the correlations with Attendance
corrplot(as.matrix(correlation_with_attendance), method="color", tl.col="black", tl.srt=45, addCoef.col = "black", number.cex=0.7)
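A caveat on the numeric conversion above: as.factor() sorts levels alphabetically, so the numeric codes do not necessarily follow the natural order of categories such as low, medium and high. The sketch below uses a small made-up vector to show the difference and one way to state the order explicitly.

```r
ranking <- c("low", "high", "medium", "low")

# as.factor() sorts the levels alphabetically: high, low, medium
as.numeric(as.factor(ranking))
## [1] 2 1 3 2

# Stating the order explicitly gives low = 1, medium = 2, high = 3
ranking_ordered <- factor(ranking, levels = c("low", "medium", "high"), ordered = TRUE)
as.numeric(ranking_ordered)
## [1] 1 3 2 1
```

For ordered categories, setting the levels this way before converting makes the sign of a correlation interpretable. For categories with no natural order (e.g. weather), a numeric code has no meaningful direction either way.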

Step 3.5: Data Splitting

Before building the model, we need to split the data into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate the model’s performance on unseen data.

set.seed(123) # For reproducibility
train_index <- createDataPartition(all_spl_df$Attendance, p=0.8, list=FALSE)
train_data <- all_spl_df[train_index, ]
test_data  <- all_spl_df[-train_index, ]

Setting the seed means that every time anyone runs the code, they will get the same set of random numbers, provided they use the same number in the brackets. The number in the set.seed() function can be any integer; in this case, we used 123.

The createDataPartition() function from the caret package is used to create a stratified random sample of the data, ensuring that the distribution of the target variable (Attendance) is similar in both the training and testing sets.
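The effect of set.seed() and the idea of an 80/20 split can be checked on a toy example. Note that this sketch uses a simple random split in base R rather than caret's stratified createDataPartition(), and the toy data frame is made up for illustration.

```r
# Same seed, same random numbers
set.seed(123)
first_draw <- sample(1:100, 5)
set.seed(123)
second_draw <- sample(1:100, 5)
identical(first_draw, second_draw)
## [1] TRUE

# A simple (non-stratified) 80/20 split on toy data
toy <- data.frame(id = 1:50, attendance = seq(1000, 5900, by = 100))
set.seed(123)
train_rows <- sample(nrow(toy), size = round(0.8 * nrow(toy)))
toy_train  <- toy[train_rows, ]
toy_test   <- toy[-train_rows, ]

nrow(toy_train)
## [1] 40
nrow(toy_test)
## [1] 10
```

Every row ends up in exactly one of the two sets, which is what keeps the test set unseen during training.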

Step 4: Model Building

In this step, we will build a machine learning model to predict fan attendance at live games based on the various factors in our data set.

We will use a linear regression model for this purpose, as it is a simple yet effective algorithm for predicting continuous outcomes. We will build three models: a null model, a full model, and a stepwise model.

A null model is a model that only includes the intercept term, while a full model includes all the predictors in the data set.

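A stepwise model is built by adding or removing predictors one at a time, keeping a change only if it improves a model-selection criterion such as the Akaike Information Criterion (AIC). AIC rewards goodness of fit but penalises extra predictors, so a lower AIC indicates a better trade-off between fit and complexity.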

# Fit the models on the training set only, so that the test set remains unseen data
lm_null <- lm(Attendance ~ 1, data=train_data)

lm_full <- lm(Attendance ~ .-GameID, data=train_data)

#Stepwise feature selection based on AIC
lm_step <- step(lm_null, scope=list(lower=lm_null, upper=lm_full), direction="both", trace=0)


#Compare summaries of all models
summary(lm_null)
summary(lm_full)
summary(lm_step)

The summaries of the three models will provide information on the coefficients, R-squared values, and p-values for each predictor. The stepwise model will help us identify the most important predictors that significantly influence fan attendance at live games.
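The same null, full and stepwise workflow can be tried on a small data set that ships with R. The sketch below uses mtcars (predicting fuel economy, mpg) purely as a stand-in, as it is not related to the SPL data.

```r
# Null model: intercept only
cars_null <- lm(mpg ~ 1, data = mtcars)

# Full model: all other columns as predictors
cars_full <- lm(mpg ~ ., data = mtcars)

# Stepwise selection between the two, guided by AIC
cars_step <- step(cars_null,
                  scope = list(lower = formula(cars_null), upper = formula(cars_full)),
                  direction = "both", trace = 0)

# Which predictors were kept, and how the three models compare (lower AIC is better)
formula(cars_step)
AIC(cars_null, cars_full, cars_step)
```

Because step() starts from the null model and only accepts changes that lower the AIC, the stepwise model can never have a higher AIC than the null model.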

Step 5: Model Inspection

It is now time to inspect the model we have built. This step is crucial to ensure that the model is valid and reliable. To do so, we will check the model diagnostics for the stepwise model. Model diagnostics help us assess the assumptions of the linear regression model, such as linearity, normality of residuals, homoscedasticity, and independence of errors.

#Check model diagnostics for stepwise model
par(mfrow=c(2,2))
plot(lm_step)

What am I looking at?

These are your model diagnostic plots. Think of them as a health check-up for the linear regression model you just built.

They help you check if your model meets the key “assumptions” of linear regression. If these plots look good, you can be more confident in your model’s results (like your p-values and R²).

Here’s a breakdown of each plot:

1. Residuals vs Fitted

  • What it is: This plot shows your model’s prediction errors (the residuals) on the y-axis against its predicted attendance numbers (the fitted values) on the x-axis.

  • What we want to see: A random “shotgun blast” of points. The red line should be mostly flat and centered on zero.

  • What it tells us: If the red line is flat at zero, it means our model’s errors are random, which is good. If the line has a clear curve (like the slight “U” shape we see here), it suggests our model might be missing something. For example, the relationship between stadium size and attendance might not be perfectly linear, and a more complex model could be slightly better.

2. Q-Q Residuals

  • What it is: This is the Normal Q-Q plot. It checks if your model’s errors (residuals) are “normally distributed” (i.e., follow a classic bell curve). This is a key assumption for linear regression.

  • What we want to see: The dots should fall close to the dashed straight line, with only minor deviations at the two ends.

  • What it tells us: Your plot looks excellent. The dots stick to the line almost perfectly. This means the “normality” assumption is met, and you can trust the p-values and confidence intervals your model is giving you.

3. Scale-Location

  • What it is: This plot is similar to the first one, but it checks if the “spread” (or variance) of your errors is consistent across all predictions. This is the “homoscedasticity” assumption.

  • What we want to see: A random scatter of points with a flat red line. We don’t want to see a “funnel” or “megaphone” shape (where the points get more spread out from left to right).

  • What it tells us: Your plot looks good. The red line is relatively flat, and the spread of the points seems consistent. This assumption is also met.

4. Residuals vs Leverage

  • What it is: This is the “outlier detector.” It helps you find individual data points that might be having a large and potentially negative influence on your model.

  • What we want to see: We want all our points to be clustered together and (most importantly) inside the dashed red lines. Those dashed lines represent “Cook’s distance,” which is a measure of influence.

  • What it tells us: Your plot is perfect. All the points are well inside the Cook’s distance lines. This means you don’t have any single “super-outlier” games that are skewing your entire model.
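The visual checks above can be backed up with a few numbers. The sketch below is illustrative only: it fits a toy model on simulated data (toy_data and toy_model are made-up names for this example, not part of the case study), runs a Shapiro-Wilk test on the residuals, and looks at the largest Cook's distance. The 4/n cut-off is one common rule of thumb, not a hard rule.

```r
# Simulated stand-in data (an assumption for illustration);
# in the case study you would pass lm_step to these functions instead
set.seed(42)
n <- 200
toy_data <- data.frame(capacity = runif(n, 2000, 10000))
toy_data$attendance <- 500 + 0.5 * toy_data$capacity + rnorm(n, sd = 400)

toy_model <- lm(attendance ~ capacity, data = toy_data)

# Normality of residuals (complements the Q-Q plot)
shapiro_result <- shapiro.test(residuals(toy_model))
cat("Shapiro-Wilk p-value:", shapiro_result$p.value, "\n")

# Influence of individual points (complements Residuals vs Leverage)
cooks_d <- cooks.distance(toy_model)
cat("Largest Cook's distance:", max(cooks_d), "\n")

# Rows flagged by the 4/n rule of thumb
flagged <- which(cooks_d > 4 / n)
cat("Points above 4/n:", length(flagged), "\n")
```

A large Shapiro-Wilk p-value means the test found no evidence against normality. Flagged points are worth a closer look, but they are not automatically errors.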

Step 6: Prediction

After checking the model diagnostics, we can now use the model to make predictions on the test data. This is where we see how well our model performs on unseen data, i.e. data that was not used to train the model (remember the test data we created in Step 3.5?).

#Predict on test data using stepwise model
predictions <- predict(lm_step, newdata=test_data)
head(predictions)
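A point prediction says nothing about how uncertain it is. As a hedged sketch, predict() for lm models can also return a prediction interval through its interval argument. The data and names below (toy_train, toy_model, toy_new) are simulated for illustration and are not the case-study data.

```r
# Simulated training data (an assumption for illustration only)
set.seed(1)
toy_train <- data.frame(capacity = runif(100, 2000, 10000))
toy_train$attendance <- 500 + 0.5 * toy_train$capacity + rnorm(100, sd = 400)

toy_model <- lm(attendance ~ capacity, data = toy_train)

# Three hypothetical new games to predict
toy_new <- data.frame(capacity = c(3000, 6000, 9000))

# Returns a matrix with columns fit (prediction), lwr and upr (95% bounds)
pred_with_interval <- predict(toy_model, newdata = toy_new,
                              interval = "prediction", level = 0.95)
print(pred_with_interval)
```

In the case study, the same call with lm_step and test_data should give a lower and upper bound for each predicted attendance figure.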

Step 7: Evaluation

Now that we have made predictions on the test data, we need to evaluate the model’s performance. We will use two common metrics for regression models: Root Mean Squared Error (RMSE) and R-squared (R²).

#Calculate RMSE and R-squared for stepwise model
rmse <- sqrt(mean((test_data$Attendance - predictions)^2))
ss_total <- sum((test_data$Attendance - mean(test_data$Attendance))^2)
ss_residual <- sum((test_data$Attendance - predictions)^2)
r_squared <- 1 - (ss_residual / ss_total)
cat("RMSE:", rmse, "\n")
## RMSE: 838.0364
cat("R-squared:", r_squared, "\n")
## R-squared: 0.8481487

RMSE is the square root of the average squared difference between the predicted and actual attendance values, so it can be read as the typical size of a prediction error, with large misses weighted more heavily. A lower RMSE value indicates better model performance. In this case, an RMSE value of 838.04 suggests that the model’s predictions are typically off by roughly 838 attendees.

The R-squared value indicates the proportion of variance in the dependent variable (Attendance) that can be explained by the independent variables in the model. An R-squared value closer to 1 indicates a better fit. In this case, an R-squared value of 0.848 suggests that the model explains approximately 84.8% of the variance in fan attendance in the test data for live games in the Singapore Premier League (SPL).
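To make the two metrics concrete, here is a small worked example with made-up numbers (not the case-study data). The helper names rmse_fn, r_squared_fn and mae_fn are hypothetical. Mean Absolute Error (MAE) is added only for comparison and is not used elsewhere in this guide.

```r
# Helper functions mirroring the formulas used above
rmse_fn <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
r_squared_fn <- function(actual, predicted) {
  ss_residual <- sum((actual - predicted)^2)
  ss_total <- sum((actual - mean(actual))^2)
  1 - ss_residual / ss_total
}
mae_fn <- function(actual, predicted) mean(abs(actual - predicted))

# Made-up attendance figures
actual <- c(100, 200, 300, 400)

# Case 1: every prediction is off by 10
even_errors <- c(110, 190, 310, 390)
cat("RMSE:", rmse_fn(actual, even_errors), "\n")            # 10
cat("MAE:", mae_fn(actual, even_errors), "\n")              # 10
cat("R-squared:", r_squared_fn(actual, even_errors), "\n")  # 0.992

# Case 2: three perfect predictions and one miss of 40
one_big_miss <- c(100, 200, 300, 440)
cat("RMSE:", rmse_fn(actual, one_big_miss), "\n")           # 20
cat("MAE:", mae_fn(actual, one_big_miss), "\n")             # 10
```

Both cases have the same MAE, but the RMSE doubles in the second one because squaring penalises the single large miss more heavily.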

Step 8: Hyperparameter Tuning

#For linear regression, there are no hyperparameters to tune. However, if using other models
#like decision trees or random forests, hyperparameter tuning can be performed using 
#the caret package's train() function with tuneGrid argument.
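As a rough sketch of what that could look like, the example below tunes the complexity parameter (cp) of a decision tree with 5-fold cross-validation. It assumes the caret and rpart packages are installed and skips itself otherwise. The data, the grid values, and the names toy_data, cp_grid, cv_control and tree_fit are all made up for illustration.

```r
# Assumes the caret and rpart packages are installed; skipped otherwise
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  library(caret)

  # Simulated data (an assumption for illustration only)
  set.seed(123)
  toy_data <- data.frame(
    capacity = runif(300, 2000, 10000),
    away_popularity = sample(1:5, 300, replace = TRUE)
  )
  toy_data$attendance <- 500 + 0.5 * toy_data$capacity +
    50 * toy_data$away_popularity + rnorm(300, sd = 400)

  # Candidate values for rpart's complexity parameter (cp)
  cp_grid <- expand.grid(cp = c(0.001, 0.01, 0.1))

  # 5-fold cross-validation to compare the candidates
  cv_control <- trainControl(method = "cv", number = 5)

  tree_fit <- train(attendance ~ ., data = toy_data,
                    method = "rpart",
                    trControl = cv_control,
                    tuneGrid = cp_grid)

  # Cross-validated performance for each cp, and the winning value
  print(tree_fit$results[, c("cp", "RMSE", "Rsquared")])
  print(tree_fit$bestTune)
}
```

The grid here is deliberately tiny. In practice you would usually try more values and compare the cross-validated RMSE before settling on one.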

Step 9: Interpretation and Results

So we have built and inspected our model, made predictions, and evaluated them (with no hyperparameters to tune for linear regression). Now it’s time to interpret the results and understand the key determinants that drive fan attendance at live games in the Singapore Premier League (SPL).

# Display coefficients of the stepwise model
coefficients <- summary(lm_step)$coefficients
coefficients_table <- as.data.frame(coefficients)
colnames(coefficients_table) <- c("Estimate", "Std. Error", "t value", "p-value")
kable(coefficients_table, caption="Stepwise Model Coefficients") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Stepwise Model Coefficients

Term                                          Estimate  Std. Error     t value   p-value
(Intercept)                              -1509.1134833 105.4550238 -14.3104940 0.0000000
Stadium_Capacity                             0.5557288   0.0062897  88.3551263 0.0000000
Match_Importancecup final/title decider   3586.1696075  51.9989676  68.9661694 0.0000000
Match_Importancecup knockout              1709.4633820  51.6719327  33.0830161 0.0000000
Match_Importancefriendly                   696.3456140  52.5027323  13.2630357 0.0000000
Match_Importanceregular season            1285.6777768  52.5525724  24.4646022 0.0000000
Ticket_Info_Accessibility_num              -50.6530504  20.2065235  -2.5067672 0.0122558
Popularity_Away_Team_num                    45.3350522  20.1640451   2.2483114 0.0246551
Social_Media_Followers_Playerslow          100.8514031  40.8256671   2.4702941 0.0135756
Social_Media_Followers_Playersmedium        84.1251212  40.1557788   2.0949692 0.0362880
Broadcast_Availability_num                  37.2356311  20.1076250   1.8518165 0.0641863
Season_Pass_Membership_Perksyes            -61.0983065  33.0687326  -1.8476156 0.0647923
Home_TeamBalestier Khalsa                   20.8554330  72.6598943   0.2870281 0.7741179
Home_TeamBG Tampines Rovers                 31.7570772  72.5941261   0.4374607 0.6618203
Home_TeamDPMM                              -32.2078506  72.6137062  -0.4435506 0.6574112
Home_TeamGeylang International             -89.8306717  72.0230084  -1.2472496 0.2124389
Home_TeamHougang United                    144.1162549  72.3490190   1.9919587 0.0464992
Home_TeamLion City Sailors                 -15.7999305  72.7865440  -0.2170721 0.8281723
Home_TeamTanjong Pagar                      27.3243349  72.6511094   0.3761035 0.7068763
Home_TeamYoung Lions                       -94.0505247  72.0385154  -1.3055589 0.1918393

This table is the answer to the second part of our problem statement: “find out key determinants that drives these attendance numbers.” Here is how to read it.

How to Read This Table

Let’s focus on the two most important columns for our analysis: Estimate and p-value.

1. Estimate (The “Effect”)

This is the most important number. It tells you the size and direction of each variable’s effect on Attendance.

  • For a numeric variable (like Stadium_Capacity):

    • The Estimate is 0.5557.

    • Interpretation: For every 1 additional seat of capacity in a stadium, the model predicts attendance will increase by 0.56 people (holding all other factors equal).

  • For a categorical variable (like Match_Importance):

    • This is a bit different. R automatically picks one level to be the “baseline” or “default” (in this case, it was “bottom table clash,” which is why it’s not in the list).

    • The Estimate for Match_Importancecup final/title decider is 3586.17.

    • Interpretation: The model predicts a “cup final/title decider” will have 3,586 more fans than a “bottom table clash” (holding all other factors equal).

    • Similarly, a “regular season” game is predicted to have 1,286 more fans than a “bottom table clash.”
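The baseline idea is easier to see on a tiny example. By default R sorts factor levels alphabetically and treats the first one as the baseline, which fits “bottom table clash” being the reference level here. The sketch below uses made-up attendance figures (toy_matches is a hypothetical data set) and shows how relevel() changes the baseline and, with it, the meaning of each coefficient.

```r
# Made-up data: two games per match type, no noise, so the numbers are easy to follow
toy_matches <- data.frame(
  importance = factor(rep(c("bottom table clash", "friendly", "regular season"),
                          each = 2)),
  attendance = c(1000, 1000, 1700, 1700, 2300, 2300)
)

# The first level is the baseline
print(levels(toy_matches$importance))

# Intercept = baseline mean; other coefficients = difference from the baseline
toy_fit <- lm(attendance ~ importance, data = toy_matches)
print(coef(toy_fit))
# (Intercept) 1000, friendly +700, regular season +1300

# Switch the baseline to "regular season"
toy_matches$importance2 <- relevel(toy_matches$importance, ref = "regular season")
toy_fit2 <- lm(attendance ~ importance2, data = toy_matches)
print(coef(toy_fit2))
# (Intercept) 2300, bottom table clash -1300, friendly -600
```

The predictions of the two models are identical. Only the reference point for the comparisons changes.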

2. p-value (The “Significance”)

This column tells you if the variable is a statistically significant predictor. It answers the question: “Is this variable’s effect real, or did it just show up due to random chance?”

  • The Rule of Thumb: A p-value less than 0.05 is considered “significant.”

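  • Interpretation: A p-value shown as 0.0000000 (as for Stadium_Capacity) is not exactly zero; it has been rounded to seven decimal places. It means an effect this large would be extremely unlikely to show up by chance if the variable truly had no relationship with attendance, so we have strong evidence that it has a real, measurable impact.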
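As a hedged sketch of how to apply this rule of thumb in code, the example below filters a coefficient table down to the rows with p < 0.05 and prints 95% confidence intervals alongside. The data are simulated (the variables strong and noise are made up), so the output is illustrative only. With the case-study model you would start from summary(lm_step)$coefficients instead.

```r
# Simulated data: "strong" truly affects y, "noise" does not
set.seed(7)
n <- 200
toy_data <- data.frame(strong = rnorm(n), noise = rnorm(n))
toy_data$y <- 10 + 5 * toy_data$strong + rnorm(n)

toy_fit <- lm(y ~ strong + noise, data = toy_data)

# Coefficient table; the p-value column is named "Pr(>|t|)" by default
coef_table <- as.data.frame(summary(toy_fit)$coefficients)

# Keep only the rows that pass the 0.05 rule of thumb
significant <- coef_table[coef_table[["Pr(>|t|)"]] < 0.05, ]
print(significant)

# Confidence intervals show the plausible range of each effect
print(confint(toy_fit, level = 0.95))

# format.pval() shows very small p-values as "<..." rather than a row of zeros
print(format.pval(coef_table[["Pr(>|t|)"]], digits = 3))
```

A confidence interval that excludes zero tells the same story as a p-value below 0.05, and it also shows how large or small the effect could plausibly be.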

Conclusion

What are our Key Determinants?

Based on the table above, our model has found several significant drivers of attendance.

  • The “Heavy Hitters” (p-value < 0.05):

    • Stadium_Capacity

    • Match_Importance (all levels)

    • Ticket_Info_Accessibility_num (This is negative, suggesting “easier” access (a higher number) decreases attendance, which is interesting!)

    • Popularity_Away_Team_num

    • Social_Media_Followers_Players

    • Home_TeamHougang United (This is just barely significant, suggesting games at Hougang’s stadium have a small positive effect compared to the baseline team).

  • Not Statistically Significant (p-value > 0.05):

    • Broadcast_Availability_num

    • Season_Pass_Membership_Perksyes

    • Most of the other Home_Team variables (like Balestier Khalsa, DPMM, etc.). This means that once you account for stadium size and match importance, the specific home team (other than Hougang) doesn’t have a statistically significant effect.


So there you have it! In this guide, we’ve successfully walked through an entire predictive modeling project, from creating a data set to interpreting our final model. You now have a foundational workflow for tackling linear regression problems in R. Look out for Part 2, where we dive into more complex models that can possibly capture more non-linear interactions!