Ever since I’ve embarked on the statistician’s path, I find myself continuously dumbfounded at the power that statistics bestows upon those who dare brave the confusion it produces. Statistical models have been used to predict, categorize, classify, reduce, and learn anything, and everything, quantifiable in our modern world. Statisticians believe themselves to be capable of foresight, peering into the future using their mathematical wizardry in the hopes of extract hidden knowledge and bountiful profit. This particular Statistical ability, predicting the future, has fascinated me the most when applied to professional sports.
I have always seen professional sports as games of chance and unpredictability. Surely there are too many variables to take into account in order to reliably and consistently predict the outcome of a particular game or match….right? Well, that may be so but it hasn’t stopped statisticians and entrepreneurs alike from trying, and it won’t stop me either.
What follows is my attempt at producing, and training, a linear regression model to predict the outcomes of horse races in Hong Kong using data from the 2014 to 2017 seasons. The datasets used in this project have been acquired from user Lantana Camara off his/her “Hong Kong Horse Racing Results 2014-17 Seasons” datasets page hosted on Kaggle.com which can be found here. The origin of the dataset is The Hong Kong Jockey Club. But, before we dive into the data, lets take a quick look at the Hong Kong horse racing scene.
It is no secret that the popularity of horse racing has steadily declined in the past decades. Although, while this may be true in the United States, horse racing is booming in Hong Kong. The Hong Kong Jockey Club is a horse racing operator and not-for-profit organization in Hong Kong, China founded in 1884 while under British colonial rule. According to Bloomberg.com, the club attracts HK$138.8 million (US$17.86 million) per race, more than any other track in the world, and it has been projected that HK$108 billion (US$13 billion) was bet during the 2014 and 2015 racing season. The reason for such enthusiastic participation by Hong Kong residents could be that The Hong Kong Jockey Club holds a government-granted monopoly on horse racing betting, fixed odds betting on overseas soccer events, and the Mark Six Lottery. In exchange, the club is the government’s largest taxpayer providing 72.5 cents for every dollar of its winnings in taxes. Not only that, unlike the United States, a restricted number of horse trainers and jockeys, about 24 each, participate in the races. As a result, the sport produces well recognized celebrities and crowd favorites and who wouldn’t come back to see their favorite win again? For these, and many other, reasons, horse racing remains both a lucrative and loved sport in the the heart of Hong Kong.
# Load necessary packages
library(caTools)
library(dummies)
library(Hmisc)
library(dplyr)
library(usdm)
library(fmsb)
# Reading in the datasets.
horse.data <- read.csv("race-result-horse.csv")
race.data <- read.csv("race-result-race.csv")
# Upon examination of the current datasets; they both contain a "race_id" column. Lets combine the dataframes along this column to produce a central, holistic view of the data.
horserace.data <- merge(horse.data, race.data, by = "race_id")
# Lets get a look at this new dataframe.
str(horserace.data)
## 'data.frame': 30189 obs. of 30 variables:
## $ race_id : Factor w/ 2367 levels "2014-001","2014-002",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ finishing_position : Factor w/ 36 levels "1","1 DH","10",..: 1 11 13 15 17 19 21 23 25 3 ...
## $ horse_number : int 1 2 10 3 7 9 13 4 6 11 ...
## $ horse_name : Factor w/ 2162 levels "A BEAUTIFUL",..: 445 1474 781 1888 1935 2098 256 348 1900 1860 ...
## $ horse_id : Factor w/ 2162 levels "A001","A002",..: 294 991 672 782 291 582 855 970 679 337 ...
## $ jockey : Factor w/ 106 levels "---","A Atzeni",..: 9 30 104 46 106 6 22 42 55 92 ...
## $ trainer : Factor w/ 95 levels "A Bull","A de Watrigant",..: 18 19 94 14 52 8 50 54 87 15 ...
## $ actual_weight : Factor w/ 32 levels "-","103","104",..: 32 32 20 31 24 22 14 28 26 18 ...
## $ declared_horse_weight: Factor w/ 399 levels "-","1000","1001",..: 34 77 67 224 138 102 55 205 75 139 ...
## $ draw : Factor w/ 16 levels "---","1","10",..: 2 6 10 9 16 4 5 15 13 14 ...
## $ length_behind_winner : Factor w/ 217 levels "-","---","+1/2",..: 1 51 51 51 138 168 168 170 183 184 ...
## $ running_position_1 : int 1 8 2 6 9 12 4 5 7 11 ...
## $ running_position_2 : int 2 9 1 4 10 13 3 6 7 11 ...
## $ running_position_3 : int 2 9 1 5 10 13 3 6 7 12 ...
## $ running_position_4 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ finish_time : Factor w/ 4175 levels "---","0.55.16",..: 1202 1234 1235 1235 1271 1289 1291 1294 1302 1310 ...
## $ win_odds : Factor w/ 181 levels "---","1","1.1",..: 50 142 89 103 103 36 181 33 12 39 ...
## $ running_position_5 : int NA NA NA NA NA NA NA NA NA NA ...
## $ running_position_6 : int NA NA NA NA NA NA NA NA NA NA ...
## $ src : Factor w/ 2367 levels "20140914-1.html",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ race_date : Factor w/ 254 levels "2014-09-14","2014-09-17",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ race_course : Factor w/ 2 levels "Happy Valley",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ race_number : int 1 1 1 1 1 1 1 1 1 1 ...
## $ race_class : Factor w/ 16 levels "Class 1","Class 2",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ race_distance : int 1400 1400 1400 1400 1400 1400 1400 1400 1400 1400 ...
## $ track_condition : Factor w/ 9 levels "FAST","GOOD",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ race_name : Factor w/ 1084 levels "A FORCE FOR GOOD HANDICAP",..: 983 983 983 983 983 983 983 983 983 983 ...
## $ track : Factor w/ 7 levels "ALL WEATHER TRACK",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ sectional_time : Factor w/ 2367 levels "12.39 21.25 23.53",..: 526 526 526 526 526 526 526 526 526 526 ...
## $ incident_report : Factor w/ 2367 levels "\n A FAST ONE was slow to begin and throughout the race was detached from the field. A veterina"| __truncated__,..: 2178 2178 2178 2178 2178 2178 2178 2178 2178 2178 ...
Lets reduce the data frame down to only include those variables we will use for identification and those variables that we will include in our regression model. The following columns will be removed, the reason for which follows:
“finishing_position”: We are concerned with predicting ending race times not directly predicting the finishing position of a horse.
“horse_number”: Redundant identification of horse.
“horse_id”: Redundant identification of horse.
“length_behind_winner”: Only care about end times of horses.
“running_position_X”: We are concerned about each horse’s ending position only, which is already reflected, relatively, through the “finish_time” and explicitly through the “finishing_position”.
“src”: Provides no additional, useful information.
“race_date”: Assumed to have little to no connection to horse performance; I have considered the possibility that a thoroughbred horse may go through monthly, cyclical behavioral changes especially when it is still going through puberty; but, this would be difficult to confirm and considering most race horses are between 3-6 years of age, its likely that all the horses will be psychologically equivalent at this stage in life.
“race_number”: Multiple identification varaibles for each race are already present; redundant.
“race_name”: I am not concerned with a single race; races will not be analysed individually.
“sectional_time”: Provides insight into how fast the horse ran and when but, again, we are only concerned with the end result.
“incident_report”: Will not be included in model BUT will be crucial in determining outliers and explaining missing data.
keep <- c("race_id", "horse_name", "jockey", "trainer", "actual_weight", "declared_horse_weight", "draw", "finish_time", "win_odds", "race_course", "race_class", "race_distance", "track_condition", "track")
newhorserace.data <- horserace.data[keep]
Here is a brief description of what each of the remaining variables define:
“race_id”: Identification code assigned to each individual race.
“horse_name”: The name of the race horse included in the race.
“jockey”: The name of the jockey involved in a race, riding a particular horse.
“trainer”: The trainer of each horse involved in a race.
“actual_weight”: The weight of the corresponding jockey in pounds (lbs).
“declared_horse_weight”: The weight of the corresponding horse, before the race, in pounds (lbs).
“draw”: The post position/lane in which the corresponding horse ran in; assigned randomly before the race.
“finish_time”: The amount of time in which a horse completed a race; horse times are reported in the format of Min.Sec.MilliSec
“win_odds”: The odds of a horse winning the race; provided before each race.
“race_course”: The name of the race course where the race had taken place.
“race_class”: The class of the race.
“race_distance”: Total distance of the race in meters (m).
“track_condition”: Details the condition of the track during the race.
“track”: Specifies the track on which the race took place.
A majority of the columns are of the class “Factor” when in fact we require some of them to be integers, numbers, or dates. In addition, there are some columns that are missing crucial data which are populated with “—” or “-”; lets replace all such instances with “NA” to better identify which rows are missing data.
newhorserace.data[newhorserace.data == "---" ] <- NA
newhorserace.data[newhorserace.data == "-" ] <- NA
# Lets obtain some information explaining the absence of "finish_time" in some of the data.
missing <- which(is.na(newhorserace.data$finish_time))
missing.data <- horserace.data[missing, c("race_id", "horse_name", "finish_time", "incident_report")]
# What proportion of our total data does this missing data represent?
nrow(missing.data)/nrow(newhorserace.data)
## [1] 0.02216039
Upon closer inspection, it seems that there are a total of 669 instances in which a horse does not possess a finish time. According to the accompanying “incident_report”, it seems that the horses in question were mostly, if not all, withdrawn from the race in accordance with the opinion of the veterinarian officer who was present at the time. The main reason being that the horses turned fractious as the race neared closer. According to my calculations, all of these entries only represent about 2% of the total data.
Because this is relatively insignificant, considering the size of our dataset, I decided to remove each entry from the data set completely. If the data was omitted due to a conversion error, I would have made an effort to retrieve the original data. The removal of data must be done always be done cautiously and with care. Otherwise, the analyst risks injecting a bias into the dataset.
newhorserace.data <- newhorserace.data[which(!is.na(newhorserace.data$finish_time)),]
# Are there any missing values remaining in the dataset?
sum(is.na(newhorserace.data))
## [1] 0
# ...Nope
What was, initially, a huge dataset froth with missing values is now cleaned of non-vital variables and filled with usable data. Our linear model is going to be built on our dependent variable, “finish_time”, and our independent variables. Currently, a majority of our variables are of the “Factor” data type. For successful analysis, we will need to coerce certain variables into integers/numbers while the remaining factor variables will need to be turned into dummy variables in order to monitor the impact of their levels on “finish_time”. Speaking of “finish_time”, we need to transform this information into a form that can be easily analyzed. Seeing as every millisecond counts in a race, we will transform each finish time into decimal units of seconds.
# Function to turn "finish_time" to seconds.
ToSeconds <- function(x){
y = gsub("[.]", "" , x, perl=TRUE)
a = as.numeric(substr(y, 1, 1))
b = as.numeric(substr(y, 2, 3))
c = as.numeric(substr(y, 4, 5))
result = (a*60) + b + (c/100)
return(result)
}
fintimes.list <- lapply(newhorserace.data$finish_time, ToSeconds)
fin.times <- matrix(unlist(fintimes.list), ncol = 1, byrow = TRUE)
# Create a new, official dataframe and set original "finish_time" to new list with each time transformed into a numerical value.
final.data <- newhorserace.data
final.data[,"finish_time"] <- c(fin.times)
# Coercing each varaible to its appropriate form.
final.data$actual_weight <- as.numeric(as.character(final.data$actual_weight))
final.data$declared_horse_weight <- as.numeric(as.character(final.data$declared_horse_weight))
final.data$win_odds <- as.numeric(as.character(final.data$win_odds))
final.data[,"race_distance"] <- factor(final.data[,"race_distance"])
Before we create our dummy variables, I believe that now would be the best time to deal with the problem of multicollinearity. Multicollinearity is the issue of our independent variables being correlated with one another which leads to incorrect results and false conclusions due to the redundancy introduced by including variables that are correlated with anything other than the dependent variable. I’m choosing to do this now while we still have relatively few variables as opposed to the greater number of variables we will have once we introduce all of our dummy variables.
In order to test for multicollinearity, we are going to use 2 different tests to determine the correlation between each of our different variable classes. The break down is as follows:
Numeric - Numeric: Pearson Correlation
Numeric - Factor: ANOVA (Analysis of Variance)
Factor - Factor: Variance Inflation Factor (VIF)
By doing this, we will confirm if our independent variables have any affect on our dependent variable and whether any multicollinearity is present between our variables.
# First, Pearson to test for the correlation between numeric varaibles.
num <- select_if(final.data, is.numeric)
rcorr(as.matrix(num), type = "pearson")
## actual_weight declared_horse_weight finish_time
## actual_weight 1.00 0.03 0.02
## declared_horse_weight 0.03 1.00 -0.13
## finish_time 0.02 -0.13 1.00
## win_odds -0.22 -0.09 -0.06
## win_odds
## actual_weight -0.22
## declared_horse_weight -0.09
## finish_time -0.06
## win_odds 1.00
##
## n= 29520
##
##
## P
## actual_weight declared_horse_weight finish_time
## actual_weight 0.0000 0.0078
## declared_horse_weight 0.0000 0.0000
## finish_time 0.0078 0.0000
## win_odds 0.0000 0.0000 0.0000
## win_odds
## actual_weight 0.0000
## declared_horse_weight 0.0000
## finish_time 0.0000
## win_odds
According to our output, all of our correlations are negligible or, for all intents and purposes, non-existent. As confirmation, we can see that the p-values of each correlation is below the 0.05 significance level meaning that none of them are statistically significant. This is great news as it rules out multicollinearity between the numerical variables, but the output also includes each independent variable’s correlation with the dependent variable. Often, the fact that our independent variables lack correlation with our dependent variable would be a red flag hinting towards the probability that our resulting model will fail to produce accurate predictions. Although, seeing as we’re dealing with the outcome of horse races and, more specifically, predicting the end time of a horse down to the final second, enormous variance needs to be expected and taken into consideration.
# ANOVA for the "correlation" between numeric and factor varaibles.
summary(aov(finish_time ~ horse_name + jockey + trainer + draw + race_course + race_class + race_distance + track_condition + track, data=final.data))
## Df Sum Sq Mean Sq F value Pr(>F)
## horse_name 2154 6559664 3045 2812.2 <2e-16 ***
## jockey 83 56093 676 624.1 <2e-16 ***
## trainer 23 16455 715 660.7 <2e-16 ***
## draw 14 1578 113 104.1 <2e-16 ***
## race_course 1 65456 65456 60444.1 <2e-16 ***
## race_class 15 106498 7100 6556.2 <2e-16 ***
## race_distance 8 3433293 429162 396300.5 <2e-16 ***
## track_condition 8 2461 308 284.1 <2e-16 ***
## track 6 715 119 110.0 <2e-16 ***
## Residuals 27207 29463 1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The output just produced, above, is the result of a one-way ANOVA test between “finish_time” and each independent factor variable. “race_id” was excluded because it is just for identification purposes only. The null hypothesis purposed by the ANOVA test, in this case, is that the mean “finish_time” of each level of a factor variable are equivalent to each other. A significant p-value signals that the null hypothesis should be rejected and that the means are, in fact, different. By consulting the p-values above, we can see that each result is statistically significant meaning that each variable has a non-negligible affect on “finish_time”. Now, we can pat ourselves on the back but this is no time for celebration. This tells us very little about the effect that each independent factor variable has on the other. This is the purpose of our next calculation; teh Variance Inflation Factor.
# First, lets create a dataframe called "factor" which only includes "finish_time" and each of our factor varaibles.
factor <- (select_if(final.data, is.factor))[,2:10]
factor["finish_time"] <- final.data["finish_time"]
# Lets include a function, created by User fawda123 which can be found on the GitHub repository (https://gist.github.com/fawda123/4717702#file-vif_fun-r)
vif_func<-function(in_frame,thresh=10,trace=T,...){
if(any(!'data.frame' %in% class(in_frame))) in_frame<-data.frame(in_frame)
# Get initial vif value for all comparisons of variables
vif_init<-NULL
var_names <- names(in_frame)
for(val in var_names){
regressors <- var_names[-which(var_names == val)]
form <- paste(regressors, collapse = '+')
form_in <- formula(paste(val, '~', form))
vif_init<-rbind(vif_init, c(val, VIF(lm(form_in, data = in_frame, ...))))
}
vif_max<-max(as.numeric(vif_init[,2]), na.rm = TRUE)
if(vif_max < thresh){
if(trace==T){ # Print output of each iteration
prmatrix(vif_init,collab=c('var','vif'),rowlab=rep('',nrow(vif_init)),quote=F)
cat('\n')
cat(paste('All variables have VIF < ', thresh,', max VIF ',round(vif_max,2), sep=''),'\n\n')
}
return(var_names)
}
else{
in_dat<-in_frame
# Backwards selection of explanatory variables, stops when all VIF values are below 'thresh'
while(vif_max >= thresh){
vif_vals<-NULL
var_names <- names(in_dat)
for(val in var_names){
regressors <- var_names[-which(var_names == val)]
form <- paste(regressors, collapse = '+')
form_in <- formula(paste(val, '~', form))
vif_add<-VIF(lm(form_in, data = in_dat, ...))
vif_vals<-rbind(vif_vals,c(val,vif_add))
}
max_row<-which(vif_vals[,2] == max(as.numeric(vif_vals[,2]), na.rm = TRUE))[1]
vif_max<-as.numeric(vif_vals[max_row,2])
if(vif_max<thresh) break
if(trace==T){ # Print output of each iteration
prmatrix(vif_vals,collab=c('var','vif'),rowlab=rep('',nrow(vif_vals)),quote=F)
cat('\n')
cat('removed: ',vif_vals[max_row,1],vif_max,'\n\n')
flush.console()
}
in_dat<-in_dat[,!names(in_dat) %in% vif_vals[max_row,1]]
}
return(names(in_dat))
}
}
# Now, lets correct for collinearity:
vif_func(in_frame=factor[,1:9],thresh=5,trace=T)
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
## Warning in Ops.factor(r, 2): '^' not meaningful for factors
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
## Warning in Ops.factor(r, 2): '^' not meaningful for factors
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
## Warning in Ops.factor(r, 2): '^' not meaningful for factors
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
## Warning in Ops.factor(r, 2): '^' not meaningful for factors
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
## Warning in Ops.factor(r, 2): '^' not meaningful for factors
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
## Warning in Ops.factor(r, 2): '^' not meaningful for factors
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
## Warning in Ops.factor(r, 2): '^' not meaningful for factors
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
## Warning in Ops.factor(r, 2): '^' not meaningful for factors
## Warning in model.response(mf, "numeric"): using type = "numeric" with a
## factor response will be ignored
## Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
## Warning in Ops.factor(r, 2): '^' not meaningful for factors
## Warning in max(as.numeric(vif_init[, 2]), na.rm = TRUE): no non-missing
## arguments to max; returning -Inf
## var vif
## horse_name <NA>
## jockey <NA>
## trainer <NA>
## draw <NA>
## race_course <NA>
## race_class <NA>
## race_distance <NA>
## track_condition <NA>
## track <NA>
##
## All variables have VIF < 5, max VIF -Inf
## [1] "horse_name" "jockey" "trainer" "draw"
## [5] "race_course" "race_class" "race_distance" "track_condition"
## [9] "track"
Here, I wanted to determine if the independent variables, which are all factors, were independent of each other. In a word, I’m trying to detect any “collinearity” between them. This is best achieved using the Chi-Squared test or by calculating the Variance Inflation Factors (VIF). In R, the function needed to perform the Chi-Squared test is ‘chisq.test()’ which requires a contingency table to pass through it. In addition, the contingency table can be made up of no more than 2 variables which means that if I were to perform a Chi-Squared test on all of my data, I would have to create 9C2 or 36 individual contingency tables, one for each possible combination of 2 factor variables, and then pass each of them through ‘chisq.test()’. Because this is very time consuming and labor intensive, I have decided to use the VIF instead.
A VIF for a single explanatory variable is obtained using the r-squared value of the regression of that variable against all other explanatory variables. It is known that multicollinearity inflates the variance of the coefficients of a linear regression. What the VIF formula does, is calculate the factor by which the coefficients are being inflated. Once we have this information, we decide, arbitrarily, on a suitable “cut-off” measurement. Any variable with a VIF above the cut-off signals a collinear effect and is removed from the model.
The function “vif_func” is a clever way of dealing with the variables which cross the aforementioned cut-off. The function calculates the VIF for each variable and removes the one with the highest VIF. It continues to do this until all of the variables possess a VIF below the threshold. The result is a list of the variables which were not removed during this process and which, as a result, are free of any collinear effects. This function has little documentation, all of which can be found on this blog post. Now, according to multiple comments, the function seems to work just fine. As you can see, when I ran the data through the function, the output is all of the variables that were original passed through the function. What precedes it is a segment of rather cryptic output. It is unclear whether this is a warning or simply a comment outputted at regular intervals as the function is run. I believe it to be the ladder seeing as the data frame I inputted was free of numeric data; all of the variables were of the type “Factor”. Additionally, the output failed to provide VIF values defaulting the max VIF to be negative infinity which, frankly, is not a good sign. Additional research will need to be done to determine what went wrong or, alternatively, to confirm that everything went right. From here on I will assume that the VIFs for each variable are so low as to be undetectable and the variables are, therefore, non-collinear.
At this point, I am satisfied with the variables I have on hand and will, warily, press onward by producing dummy variables or each level of each factor variable. When it comes to fitting variables into linear regression models, we must ensure that each variable included is in a quantitative form. By creating a linear regression model, we aim to squeeze all the information we have into a single linear equation; we can only do so with numbers. So, we must encode each level of a factor variable with a number.
It is at this point that I realize the, possible, failure of my VIF analysis was due to the fact that I had not yet encoded the factor variables into dummy variables. I hesitated to do this in the beginning because I wanted to refrain from complicating the situation by adding additional variables prematurely. The VIF formula is dependent on r-square which is obtained from the linear regression of one variable modeled by all the others. I do not see how the vif_func() could have done this without numerical dummy variables.
# Lets create some dummy variables.
dummy.df <- dummy.data.frame(data = final.data[2:14], names = c("horse_name", "jockey", "trainer", "draw", "race_course", "race_class", "race_distance", "track_condition", "track"), sep = "_")
Now that we have created our dummy variables, we will need to split our data into a training set and a test set. In this way, we set aside a portion of the data in order to have something to feed into our model, once it is created, to test how well it predicts the finish time depending on the variables it is provided.
# Splitting the data into a train set and test set.
set.seed(64)
ind <- sample(2, nrow(dummy.df), replace = TRUE, prob = c(0.7,0.3))
train <- dummy.df[ind == 1, ]
test <- dummy.df[ind == 2, ]
Above, I have split the data into a training and test set using a 70%-30% split. This is an arbitrary decision. I have seen other recommendations such as 66%-33% as well as 80%-20%. And so, I decided 70%-30% would be a fair compromise. Now, it’s time to create the model.
# Creating the linear regression model
fit <- lm(finish_time ~ ., data = train)
(summary(fit))$r.squared # R-Squared of the model
## [1] 0.997476
(summary(fit))$adj.r.squared # Adjusted R-squared of the model
## [1] 0.9971671
(summary(fit))$fstatistic # F-statistic of the model
## value numdf dendf
## 3228.731 2231.000 18227.000
# Extracting the p-value of the F-statistic
lmp <- function (modelobject) {
if (class(modelobject) != "lm") stop("Not an object of class 'lm' ")
f <- summary(modelobject)$fstatistic
p <- pf(f[1],f[2],f[3],lower.tail=F)
attributes(p) <- NULL
return(p)
}
lmp(fit) # uses the above function to capture the F-test p-value.
## [1] 0
Looking at the summary of our resulting linear model, one can easily become overwhelmed by the sheer number of independent variables. But, the headliner of our show, as always, is the p-value. Scanning down the significance value, p-value, of our variables, of the entire summary output, one may spot a few that are above our standard threshold of 0.05 but they are far and few in between. In output above, I have extracted the adjusted R-squared and p-value of the F-statistic.
Our adjusted R-squared is 0.9972 which means that the inclusion of our independent variables have accounted for 99% of the variance seen in “finish_time”. At first this seems like great news, but we must take this value with a grain of salt. Real life is full of immense variability and the fact that we have almost completely accounted for all of the potential variability in a horse race seems just too good to be true. As a matter of fact, I speculate that the model is in real trouble of being overfitted; This means that we have over complicated the model resulting in a model that does not generalize well for data which we have not seen yet. This may be, somewhat, corrected by removing any and all non statistically significant independent variables from the model which does not seem feasible given the size of our dataset.
As a side note, it came across my mind that I could utilize the step() function to remove non-statistically significant variables from the model, as it is built, in a step-wise manner. Although, due to the density of this dataset, it proved impractical taking an excessive amount of time to complete.
Nevertheless, the recorded significance of our F-statistic is less than 2.2x10^-16. This is incredibly small; so small, in fact, that it comes up as 0 when extracted from the summary. The F-statistic is an F-test, kind of like a specific type of t-test. The F-statistic compares our model to one with no predictors/independent variables at all which is called an “intercept-only model”. It works off of the null hypothesis that the fit of the intercept-only model and your model are equal. If the statistic is statistically significant, then we can reject the null hypothesis and conclude that the intercept-only model produces a fit that is inferior to that of our own model. That is precisely the case here. Our F-statistic is statistically significant giving credibility to our model.
Lets poke and prod our model by analyzing the variance of our model with two plots in particular. The Residual Plot and The Normal Q-Q Plot.
# Residual plot.
res <- resid(fit)
plot(fit)
## Warning: not plotting observations with leverage one:
## 8, 93, 291, 310, 321, 758, 792, 862, 981, 986, 1010, 1339, 1354, 1415, 1551, 1614, 1643, 1760, 1867, 1999, 2003, 2015, 2134, 2222, 2245, 2504, 2529, 2868, 3841, 3844, 5004, 5321, 5703, 8060, 8517, 8525, 8849, 8858, 8874, 8878, 8881, 8885, 8886, 9255, 10711, 10713, 11817, 12247, 12977, 13480, 13617, 14764, 15718, 15719, 15720, 15723, 15729, 15743, 15744, 15755, 15772, 16577, 16603, 17588, 17805, 18025, 18325, 18981, 19167, 19398, 19451, 19471, 19492, 19584, 19595, 19684, 19832, 19978, 19985, 20011, 20128, 20133, 20136, 20183, 20214, 20215, 20258, 20291, 20326, 20384, 20442, 20459
## Warning: not plotting observations with leverage one:
## 8, 93, 291, 310, 321, 758, 792, 862, 981, 986, 1010, 1339, 1354, 1415, 1551, 1614, 1643, 1760, 1867, 1999, 2003, 2015, 2134, 2222, 2245, 2504, 2529, 2868, 3841, 3844, 5004, 5321, 5703, 8060, 8517, 8525, 8849, 8858, 8874, 8878, 8881, 8885, 8886, 9255, 10711, 10713, 11817, 12247, 12977, 13480, 13617, 14764, 15718, 15719, 15720, 15723, 15729, 15743, 15744, 15755, 15772, 16577, 16603, 17588, 17805, 18025, 18325, 18981, 19167, 19398, 19451, 19471, 19492, 19584, 19595, 19684, 19832, 19978, 19985, 20011, 20128, 20133, 20136, 20183, 20214, 20215, 20258, 20291, 20326, 20384, 20442, 20459
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
This is called a ‘Residual Plot’. The x-axis marks the predicted values of ‘finish_time’. The y-axis marks the residuals or errors between the predicted values on the regression line and the corresponding observed values at that same x position. Between each observed and predicted value, there is some error and that error is the residual.
This plot is meant to confirm one of several assumptions that we make, whether or not we know it, when we construct a linear regression model. The assumption here is that of linearity. We assume that the data we have will fit in a linear model. When we consult the residual plot, we want to see a dispersion of points that is seemingly random around the red horizontal line. In addition, we want to see that the red line is in fact horizontal with little to no sloping. Both of these things we see in this plot. The points look to be slightly more concentrated above the line but they don’t follow any geometric pattern and the red horizontal line is completely flat if not a little wavy. This means that a linear regression model is appropriate for the data. Our assumption of linearity has been met.
Lets address the fact that the points look to be clustered into 8 groups. I would not consider this to be a pattern which disproves linearity. I believe the culprit to be the fact that within any official horse race, you have a specific race of a specific class with a specific breed of horses racing on a specific type of track of a specific distance. And due to the niche of each race, you are bound to have horses that will finish the race at relativity similar times which could explain why the points cluster around the finish times that they do in the residual plot above.
The y-axis of this plot marks the ordered, observed, standardized residuals while the x-axis marks the ordered, theoretical residuals. This plot serves to check the assumption that the data we acquired comes from a normally distributed population. Ideally, the points would align along a straight, diagonal path from the origin with a slope of +1. This would confirm that the data is, more or less, from a normally distributed population.
Our plot shows a practically horizontal line with subtle curves on the ends. This plot is described to have “heavy tails”. The points fall along a line in the middle of the graph but curve off in the extremities. A plot that exhibits this behavior usually means that the data has more extreme values than if they truly came from a normal distribution.
Despite the fact that our data does not seem to come from a normal distribution, this is of little consequence to us. Least square regression models provide the best linear, unbiased estimator no matter the distribution it comes from. This plot is used more so to show that the data is from a normal distribution but the assumption of normality is not a requirement for usage of a linear regression model.
# Prediction.
pred <- predict(fit, test)
## Warning in predict.lm(fit, test): prediction from a rank-deficient fit may
## be misleading
head(pred)
## 2 4 6 7 13 18
## 83.75984 81.97338 84.82159 84.88150 84.35503 70.17863
head(test$finish_time)
## [1] 82.65 82.66 83.20 83.22 84.10 70.31
# RMSE
RMSE = function(m, o){
sqrt(mean((m - o)^2))
}
RMSE(pred,test$finish_time)
## [1] 1.209265
Now, using out linear model, I have used the rest of our data, the test data, to predict what the finish time of a horse would be depending on a wide array of independent variables. In order to gauge how well the model does in the prediction of finish times, I have also calculated the Root Mean Squared Error (RMSE). The RMSE is the square root of the variance of the residuals. It indicates how close the observed data points are to the model’s predicted values, on average. The RMSE can be interpreted as the standard deviation of the unexplained variance and has the convenient property of possessing the same units as the dependent variable which in our case is in seconds. Lower values of RMSE indicate better fit.
Now, RMSE is a good measure of how accurately the model predicts the response but doesn’t really mean anything concrete on its own. There are no agreed upon standards across the field of statistics as to what is a “good” RMSE value. The interpretation of the RMSE, and whether it is good or not, is dependent on the the analyst and his knowledge of the domain he is analyzing. In our case, we have a RMSE of 1.2 which can be translated as meaning that, on average, our predictions are off by 1.2 seconds. Now, considering that horse races often come down to the millisecond, this result is less than ideal. A second is often the difference between first and second place or even first and last place.
Despite our best efforts, it seems like our linear model came up short in the end. Although I won’t be betting big on my model, I’m confident that I have learned a lot and that there is still much for me to learn in linear modeling as well as horse racing. This won’t be the last you’ll here from me on the topic of linear modeling and horse racing . . . you can bet on it.