DAT 301 Midterm Slides

Background

Tim Donaghy is a former NBA referee who gambled on games he officiated, mostly betting on the “spread”, the number of points a team is expected to win a game by. Most of his gambling took place from 2003 to 2007, when he was caught and resigned from the league. This analysis asks whether teams were affected, measured as a larger change in win percentage, when the final spread in games Tim refereed differed more from the team’s average final spread. In short, did Tim’s betting actually benefit or harm the teams he refereed? The analysis uses an expansive NBA data set downloaded from Kaggle.

Data Source

The data was downloaded from Kaggle as a .sqlite file. The RSQLite library was used to read two tables from the database into R data frames: the catalog of games and the catalog of referees who called each game.
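As a rough sketch of that step, the example below builds a tiny in-memory SQLite database and reads its tables back as data frames. The table names `game` and `officials` and the column `ref_name` are placeholders chosen for illustration; the names in the real Kaggle file may differ.

```r
library(DBI)
library(RSQLite)

# Stand-in for the Kaggle file: an in-memory database with two toy tables.
# Table names and the ref_name column are assumptions, not confirmed names.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "game",
             data.frame(GAME_ID = c("0020300426", "0020300642"),
                        WL_HOME = c("W", "L")))
dbWriteTable(con, "officials",
             data.frame(refGAME_ID = "0020300426",
                        ref_name = "Tim Donaghy"))

# Pull each table into an ordinary R data frame.
game_df <- dbReadTable(con, "game")
ref_df  <- dbReadTable(con, "officials")

# Keep only the rows for one referee.
tim_df <- ref_df[ref_df$ref_name == "Tim Donaghy", ]

dbDisconnect(con)
```

With the real data, the `":memory:"` argument would be replaced by the path to the downloaded .sqlite file and the `dbWriteTable` calls dropped.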

Data Setup

Once the data was in a format compatible with R, a basic function called findOverlap was used to find the games Tim Donaghy called for each of the 10 teams analyzed. The function is a nested for loop that returns a vector of the row numbers of the overlapping games, which is then used to create a data frame for each team containing the games that Tim reffed.

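# Returns the row numbers of df whose GAME_ID matches a game in tim_df.
# tim_df (the games Tim refereed) is read from the global environment.
findOverlap = function(df) {
    x = c()
    # seq_len() avoids looping over 1:0 when a data frame is empty
    for (row in seq_len(nrow(df))) {
        game = df[row, "GAME_ID"]
        for (row2 in seq_len(nrow(tim_df))) {
            refGame = tim_df[row2, "refGAME_ID"]
            if (game == refGame) {
                x = c(x, row)
            }
        }
    }
    return(x)
}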
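The same lookup can also be written without explicit loops. The sketch below is an alternative, not the version used in the analysis; `findOverlapVec` and the toy data frames are made up for illustration.

```r
# Toy stand-ins for the real data frames (values are illustrative only).
tim_df  <- data.frame(refGAME_ID = c("0020300426", "0020400871"))
team_df <- data.frame(GAME_ID = c("0020300426", "0020300999", "0020400871"))

# Row numbers of df whose GAME_ID also appears among the referee's games.
findOverlapVec <- function(df, ref_ids = tim_df$refGAME_ID) {
  which(df$GAME_ID %in% ref_ids)
}

overlap_rows <- findOverlapVec(team_df)

# Use the row numbers to build the per-team data frame.
tim_team_df <- team_df[overlap_rows, , drop = FALSE]
```

One difference from the loop: if a game ID appeared more than once in `tim_df`, the loop would record the row once per match, while `%in%` records it once.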

Data Setup

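Here is an example, the Suns data frame, which contains the Phoenix Suns games that Tim Donaghy refereed. The margin and result columns are from the home team’s point of view, so rows where the home team is not PHX are Suns away games.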

##       GAME_ID SEASON_ID PLUS_MINUS_HOME WL_HOME TEAM_ABBREVIATION_HOME
## 1  0020300426     22003               8       W                    PHX
## 2  0020300642     22003              -4       L                    ATL
## 3  0020400871     22004              13       W                    PHX
## 4  0020401207     22004              14       W                    PHX
## 5  0020500267     22005              -8       L                    GSW
## 6  0020500832     22005             -17       L                    HOU
## 7  0020500925     22005               8       W                    PHX
## 8  0020600301     22006             -14       L                    ORL
## 9  0020600488     22006              28       W                    PHX
## 10 0020600575     22006              -9       L                    HOU

Data Setup

These data frames were then used to find each team’s overall win percentage and spread, both with and without Tim refereeing, for all 10 teams. The absolute differences between the two cases were combined into two vectors that form the base of the analysis, shown below.

spreadDiscrepancy = abs(timSpreads - spreads)
winDiscrepancy = 100 * abs(timWinPercents - winPercents)
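To make the inputs to these vectors concrete, here is one way the per-team values could be computed, using the Suns games from the table shown earlier. The helper `teamSummary` is hypothetical, and treating the "spread" as the team's average point margin is an assumption about how the original values were derived.

```r
# Suns games refereed by Tim, copied from the table above (home-team view).
suns <- data.frame(
  PLUS_MINUS_HOME = c(8, -4, 13, 14, -8, -17, 8, -14, 28, -9),
  TEAM_ABBREVIATION_HOME = c("PHX", "ATL", "PHX", "PHX", "GSW",
                             "HOU", "PHX", "ORL", "PHX", "HOU")
)

# Hypothetical helper: flip the sign of the home margin when the team was away,
# then summarize wins (as a fraction) and the average margin.
teamSummary <- function(df, team) {
  margin <- ifelse(df$TEAM_ABBREVIATION_HOME == team,
                   df$PLUS_MINUS_HOME, -df$PLUS_MINUS_HOME)
  c(winPercent = mean(margin > 0), spread = mean(margin))
}

suns_summary <- teamSummary(suns, "PHX")
```

Running the same helper on all of a team's games would give the matching entries of `winPercents` and `spreads`, under the same assumption.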

Plot of Win % and Spread Discrepancies

Number of Games Refereed for each Team

This is a simple pie chart, coded with ggplot2, showing that the teams chosen for the analysis were all refereed a similar number of times by Tim.
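A pie chart like this can be sketched in ggplot2 as a single stacked bar drawn in polar coordinates. The counts below are placeholders: only the 10 Suns games come from the table shown earlier, and the other teams and values are made up.

```r
library(ggplot2)

# Placeholder counts: PHX = 10 is from the Suns table, the rest are made up.
games_per_team <- data.frame(
  team  = c("PHX", "Team B", "Team C"),
  games = c(10, 9, 11)
)

# One stacked bar (x = "") wrapped around the y axis gives a pie chart.
pie <- ggplot(games_per_team, aes(x = "", y = games, fill = team)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Games refereed per team", fill = "Team") +
  theme_void()
```

Printing `pie` draws the chart; slices of similar size indicate similar game counts.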

Interactive Plot with 3 Variables

Linear Regression of Data

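While the RMSE and MAE values are somewhat low, the R^2 value of 1 should be read with caution. By default, caret's R2 is the squared correlation between predicted and observed values, which comes out as exactly 1 whenever it is computed on only two points, so it likely reflects a very small test set rather than a perfect fit. The trend may be stronger if more data is analyzed, but there does not seem to be a strong linear correlation showing that teams benefit from Tim’s betting.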

library(caret)     # createDataPartition, R2, RMSE, MAE
library(magrittr)  # %>% pipe

set.seed(123)
trainvals <- dfc3$spreadDiscrepancy %>% 
  createDataPartition(p = 0.6, list = FALSE)
traindf  <- dfc3[trainvals, ]
testdf <- dfc3[-trainvals, ]

model <- lm(spreadDiscrepancy ~ winDiscrepancy, data = traindf)
predictions <- model %>% predict(testdf)
data.frame( R2 = R2(predictions, testdf$spreadDiscrepancy),
            RMSE = RMSE(predictions, testdf$spreadDiscrepancy),
            MAE = MAE(predictions, testdf$spreadDiscrepancy))

##   R2     RMSE       MAE
## 1  1 1.159162 0.8506676
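The three metrics can be reproduced by hand, which also shows why an R2 of exactly 1 can appear alongside nonzero errors. The two observations and predictions below are made up for illustration and are not taken from the analysis.

```r
# Two made-up held-out observations and two made-up predictions.
observed  <- c(2.0, 5.0)
predicted <- c(3.5, 4.0)

# Squared correlation, the default formula behind caret's R2().
# With two points that both rise (or both fall), this is 1.
r2_corr <- cor(predicted, observed)^2

# Error-based metrics still show that the predictions are off.
rmse <- sqrt(mean((predicted - observed)^2))
mae  <- mean(abs(predicted - observed))
```

Here `r2_corr` is 1 even though both predictions miss by at least a point, so RMSE and MAE are the more informative numbers on a test set this small.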

Graphical Representation of Linear Regression

Conclusion

As the graph shows, there is a linear trend in the data analyzed, but the R^2 value, computed on a very small test set, is not strong evidence for it. Again, the data set is small (one side of the train/test split has only two data points), and the model would benefit from an extended analysis. In conclusion, Tim Donaghy’s decision to bet on games he called seemed to affect teams, with a larger difference in spread going along with a larger difference in a team’s win percentage. Further analysis could expand the set of teams from 10 to the full 30, and could look at whether Tim led to teams losing or gaining wins, not just at the absolute change in win percentage.