For my horse system, I use data.frames in R to store the data I train the model on. This works because my dataset is relatively small: around 60,000 horses and around 50 features. As the dataset grows, it will become advantageous to use a database instead of R’s native in-memory storage.
The following is an example of how I could load a table with the same data that I currently store in a data.frame, and how I could then access that data through SQL queries. I will process text files in R, then send the results to the database. The key difference between the method I present here and the method I have used before is that this method only ever holds one day’s worth of information in memory at a time, whereas my other method loaded years’ worth of information all at once.
Before I begin the data transfer, I need to make some preparations. I will have R list all of the files in my directory that need to be loaded; the files I am interested in have the extensions .DRF, .DR2, .DR3, and .DR4. I also need to establish a connection to the database I am working with, and of course I will need the related libraries.
library(plyr)
library(dplyr)
library(stringr)
library(RPostgreSQL)
library(knitr)
#List every data file in the working directory by extension
DRF_files = list.files(pattern = '\\.DRF$')
DR2_files = list.files(pattern = '\\.DR2$')
DR3_files = list.files(pattern = '\\.DR3$')
DR4_files = list.files(pattern = '\\.DR4$')
#Connect to db
#The database and user were created ahead of time; R just sends the credentials to connect
pw = 'myPass'
drv = dbDriver("PostgreSQL")
con = dbConnect(drv, dbname = "postgres",
host = "localhost", port = 5432,
user = "openpg", password = pw)
Now that we have a list of files to transfer and a connection to a database, we are ready to proceed. Note that I created the database table called ‘horse_level_combined’ ahead of time, so I will be writing to that table rather than creating a new one. I will write a for loop that processes each file, creates new calculated fields, combines all 4 files into 1 data.frame, then writes that data to the database.
#Clear out any existing rows so the transfer starts from an empty table
dbGetQuery(con, 'delete from horse_level_combined')
## NULL
#Process each day's files and append the results to the table
for(a in seq_along(DRF_files)) {
#process each file
drf = read.csv(DRF_files[a], header = F, stringsAsFactors = F)
RaceData = drf[,c(1,2,3,4,5,7, 23)]
names(RaceData) = c("Track", "RDate", "RNum", "Dist","Surface","RType", "PostTime")
RaceData$Surface = substr(as.character(RaceData$Surface), 1,1) #Fix import error: the turf code "T" was read as logical TRUE
dr2 = read.csv(DR2_files[a], header = F)
dr2_ = read.csv(DR2_files[a], header = F, quote = "") #unquoted copy, used to recover the raw program number
HorseData = dr2[,c(1,2,3,4,19,21,24,25,41,43,64,65,69,71,75,189,201,202)]
names(HorseData) = c("Track", "RDate", "RNum", "PP", "TrainRecordCurYear", "JockRecordCurYear", "HName", "YOB", "Medication",
"EquipChange", "Year", "StartsYear", "EarnYear", "StartsPrev", "EarnPrev", "BRISRunstyle", "ProgramNum", "MLOdds")
#Strip the quote characters surrounding the raw program number
HorseData$ProgramNum = str_replace_all(dr2_[,201], "[:punct:]", "")
dr3 = read.csv(DR3_files[a], header = F)
PastPerf = dr3[,c(1,2,3,4,5,14,35,41,56,57,61,64,84)]
names(PastPerf) = c("Track", "RDate", "RNum", "PP", "PrevRDate", "NumHor", "Purse", "FinishPos", "BRIS2fPace", "BRIS4fPace",
"BRISLatePace", "BRISSpeed", "ClaimedFlag")
dr4 = read.csv(DR4_files[a], header = F)
BRISStats = dr4[,c(1,2,3,4,5,10)]
names(BRISStats) = c("Track", "RDate", "RNum", "PP", "PowerRating", "BestSpeedLife")
#build features and combine the data
RaceData$PostTime = gsub("[(),/ ]", "", substr(RaceData$PostTime, 1, 7)) #strip parentheses, slashes, commas, and spaces
#Trainer and jockey win percentages
HorseData$TrainStartCurYear = as.numeric(substr(HorseData$TrainRecordCurYear, 1, 4))
HorseData$TrainWinCurYear = as.numeric(substr(HorseData$TrainRecordCurYear, 5, 8))
HorseData$TrainWinPct = HorseData$TrainWinCurYear/(HorseData$TrainStartCurYear + 1) # +1 avoids division by zero for trainers with no starts
HorseData$JocStartCurYear = as.numeric(substr(HorseData$JockRecordCurYear, 1, 4))
HorseData$JocWinCurYear = as.numeric(substr(HorseData$JockRecordCurYear, 5, 8))
HorseData$JockWinPct = HorseData$JocWinCurYear/(HorseData$JocStartCurYear + 1) # +1 avoids division by zero for jockeys with no starts
# Equipment changes
HorseData$addBlinks = ifelse(HorseData$EquipChange == 1, 1, 0)
HorseData$remBlinks = ifelse(HorseData$EquipChange == 2, 1, 0)
# Horse earnings
HorseData$earnPer = (HorseData$EarnYear + HorseData$EarnPrev)/(HorseData$StartsYear+HorseData$StartsPrev) #horses with no starts yield NaN, zeroed out later
HorseData$logEarn = log(HorseData$earnPer + 1)
# Age
HorseData$birthYear = 2000+HorseData$YOB
HorseData$age = as.numeric(HorseData$Year) - HorseData$birthYear
# Find days since previous race
PastPerf$RDateD = as.Date(as.character(PastPerf$RDate), "%Y%m%d")
PastPerf$PrevDateD = as.Date(as.character(PastPerf$PrevRDate), "%Y%m%d")
PastPerf$DaysOld = PastPerf$RDateD - PastPerf$PrevDateD
# Finish position is messy, including records for horses that didn't finish.
# We will say in these cases that they simply finished last.
PastPerf$FinishPos = ifelse(is.na(as.numeric(as.character(PastPerf$FinishPos))),
PastPerf$NumHor,
as.numeric(as.character(PastPerf$FinishPos)))
PastPerf$FinishPos = ifelse(PastPerf$FinishPos %in% c(89,99), PastPerf$NumHor, PastPerf$FinishPos)
# subset just races in the past 120 days for recency
pastPerf120 = filter(PastPerf, DaysOld <= 120)
pastPerf120 = group_by(pastPerf120, Track, RDate, RNum, PP)
pastPerf120 = summarize(pastPerf120, countRaces120 = n())
# Subset just races in past 365 days (note: the summary columns below keep a 120 suffix even though this window is 365 days)
pastPerf365 = filter(PastPerf, DaysOld <= 365, !is.na(BRISSpeed))
pastPerf365 = group_by(pastPerf365, Track, RDate, RNum, PP)
pastPerf365 = summarize(pastPerf365, avgSpeed120 = mean(BRISSpeed), avg2FSpeed120 = mean(BRIS2fPace),
avg4FSpeed120 = mean(BRIS4fPace), avgLateSpeed120 = mean(BRISLatePace))
# subset just races in past 500 days
pastPerf500 = filter(PastPerf, DaysOld <= 500)
pastPerf500 = group_by(pastPerf500, Track, RDate, RNum, PP)
pastPerf500 = summarize(pastPerf500, logPurse = log(mean(Purse)))
#Clean up, take only what we want
HorseData = select(HorseData, Track, RDate, RNum, PP, HName, BRISRunstyle, TrainWinPct, JockWinPct, addBlinks, remBlinks, logEarn, age, MLOdds,
ProgramNum)
PastPerf = left_join(pastPerf500, pastPerf365) #dplyr joins match on the shared Track, RDate, RNum, PP columns
PastPerf = left_join(PastPerf, pastPerf120)
PastPerf[is.na(PastPerf)] = 0
PastPerf$countsq = PastPerf$countRaces120*PastPerf$countRaces120
#Combine the sets
h = inner_join(HorseData, BRISStats)
h = inner_join(h, RaceData)
h = left_join(h, PastPerf)
h = unique(h)
h[is.na(h)] = 0
h$Track = str_trim(h$Track)
#Build features that compare each horse to others in the same race
i = group_by(h, Track, RDate, RNum)
j = summarise(i, meanEarn = mean(logEarn), meanPurse = mean(logPurse),
meanSpeed120 = mean(avgSpeed120), meanES = mean(avg2FSpeed120), mean4f = mean(avg4FSpeed120),
meanLS = mean(avgLateSpeed120), numHor = length(PP), meanPower = mean(PowerRating))
h = inner_join(i, j)
h$compEarn = ifelse(h$logEarn == 0, 0, h$logEarn - h$meanEarn) #earnings compared to competition
h$compPurse = ifelse(h$logPurse == 0, 0, h$logPurse- h$meanPurse) #average purse compared to competition
h$compSpeed120 = ifelse(h$avgSpeed120 == 0, 0, h$avgSpeed120 - h$meanSpeed120) #speed compared to competition
h$compES = ifelse(h$avg2FSpeed120 == 0, 0, h$avg2FSpeed120 - h$meanES) #early speed compared to competition
h$comp4f = ifelse(h$avg4FSpeed120 == 0, 0, h$avg4FSpeed120 - h$mean4f) #middle speed compared to competition
h$compLS = ifelse(h$avgLateSpeed120 == 0, 0, h$avgLateSpeed120 - h$meanLS) #late speed compared to competition
h$compPower = ifelse(h$PowerRating == 0, 0, h$PowerRating - h$meanPower) #BRIS Power rating compared to competition
h$numHorsq = h$numHor*h$numHor
h$agesq = h$age * h$age
h$raceID = factor(paste(h$Track, h$RDate, h$RNum, sep = ""))
h = as.data.frame(h)
dbWriteTable(con, 'horse_level_combined', value = h, append = T, row.names = F)
}
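Before moving on, it’s worth verifying that the loop actually populated the table. A simple count query along these lines would do the job (a sketch; the exact total depends on the files processed):
#Sanity check: how many rows made it into the database?
dbGetQuery(con, 'select count(*) from "horse_level_combined";')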
Great, so now I’ve processed all of my files, and I have a database holding all of my data. From here on, I need to write queries to access it.
Let’s say I want to do some exploratory analysis and see how many races were run at each track in my data. I will write a query to find that information.
races_per_track = dbGetQuery(con, 'select "Track" , Count(distinct "raceID") as Count from "horse_level_combined" group by "Track" order by "Track" asc;')
kable(races_per_track)
| Track | count |
|---|---|
| AJX | 18 |
| BEL | 56 |
| CON | 54 |
| CT | 32 |
| DED | 55 |
| EVD | 45 |
| FL | 53 |
| GG | 33 |
| GP | 12 |
| GPW | 58 |
| GRM | 9 |
| HAW | 41 |
| HST | 18 |
| IND | 56 |
| KEE | 58 |
| LA | 26 |
| LRL | 23 |
| LS | 18 |
| MED | 12 |
| NP | 28 |
| PEN | 40 |
| PM | 30 |
| PRX | 44 |
| RP | 36 |
| SA | 44 |
| TDN | 34 |
| TUP | 33 |
| WO | 44 |
| WRD | 36 |
| ZIA | 55 |
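The same pattern extends to any aggregate. For example, a query like this one (a sketch, not part of my original analysis) would return the total number of distinct races across every track:
#Total number of distinct races in the table
dbGetQuery(con, 'select count(distinct "raceID") as total_races from "horse_level_combined";')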
Maybe I’m interested in a specific race, say the 5th race at Belmont Park on October 24th, 2015, and I want to see the horses ordered by their BRIS Power Rating. Note that I have a raceID field I could use for this query, but for the sake of simplicity I will use a combination of variables instead.
Belmont5 = dbGetQuery(con, 'select "Track", "RNum", "RDate", "HName", "PowerRating" from "horse_level_combined" where "Track" = \'BEL\' and "RNum" = 5 and "RDate" = \'20151024\' order by "PowerRating" desc;')
kable(Belmont5)
| Track | RNum | RDate | HName | PowerRating |
|---|---|---|---|---|
| BEL | 5 | 20151024 | WARRIOROFTHEROSES | 134.41 |
| BEL | 5 | 20151024 | ROYAL POSSE | 131.92 |
| BEL | 5 | 20151024 | BEYOND EMPIRE | 131.72 |
| BEL | 5 | 20151024 | SIOUX | 131.52 |
| BEL | 5 | 20151024 | SARATOGA SNACKS | 131.51 |
| BEL | 5 | 20151024 | GOOD LUCK GUS | 130.84 |
| BEL | 5 | 20151024 | EMPIRE DREAMS | 129.64 |
| BEL | 5 | 20151024 | JOHANNESBURG SMILE | 124.34 |
| BEL | 5 | 20151024 | FULL OF MINE | 117.66 |
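For reference, since raceID is just the track, date, and race number pasted together, the same race could be pulled with a single key. This is a sketch assuming the date is stored as 20151024, so the key becomes ‘BEL201510245’; Belmont5_byID is just an illustrative name:
#Equivalent query using the single raceID key
Belmont5_byID = dbGetQuery(con, 'select "HName", "PowerRating" from "horse_level_combined" where "raceID" = \'BEL201510245\' order by "PowerRating" desc;')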
I populated my database with data from text files, then queried some of that information back to demonstrate how you might use the RPostgreSQL library. This was just a simple exercise, but if my data held 1,000,000 observations instead of this small demonstration set, loading everything into memory at once would put R at risk of crashing with every computation due to the burden on my computer’s memory. If you’re doing any project at an industrial scale, it is important to be comfortable using databases with the analysis software of your choice.
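One last bit of housekeeping: when the session is finished, it is good practice to close the connection and release the driver.
#Close the database connection and unload the driver
dbDisconnect(con)
dbUnloadDriver(drv)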