1 Introduction

In this project we build a binary predictive model using logistic regression. We check whether any variables need to be transformed and use cross-validation to select the final model. The response variable is whether the home team wins, and we examine which betting-odds factors help predict it.

1.1 Data and Variable Descriptions

This data set covers NBA game betting odds and outcomes for the 2014-2015 season. There are 1230 observations and 17 variables. The variables include

  • Datenum (numerical)- The number of days since January 1, 1960
  • Team (categorical)- Where the home team is from
  • OppTeam (categorical)- Where the away team is from
  • TeamPts (numerical)- Home team points scored
  • OppPts (numerical)- Away team points scored
  • Wins (binary response)- If the home team won (1 means they won, 0 means they lost)
  • TeamCov (categorical)- If the home team covered the spread (1 means they covered, 0 means a “push”, and -1 means they didn’t cover)
  • TeamSprd (numerical)- The Vegas point spread for the home team
  • OvrUndr (numerical)- The over/under Vegas line for the total points in the game
  • TeamDiff (numerical)- Home Points minus Away Points
  • TotalPts (numerical)- Home Points plus Away Points

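We used only a subset of these variables: TeamPts, OppPts, TeamWin (the win indicator described above as Wins), TeamSprd, and OvrUndr, chosen based on our past experience. The other variables were either not important for this project or were combinations of other variables (for example, TeamDiff and TotalPts are computed from TeamPts and OppPts).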
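
To illustrate what "a combination of others" means here, the following minimal Python sketch derives TeamDiff and TotalPts from TeamPts and OppPts, following the definitions in the variable list. The scores are made-up values for illustration, not rows from the data set.

```python
# Hypothetical scores for three games (illustrative values, not from the data set).
games = [
    {"TeamPts": 102, "OppPts": 98},
    {"TeamPts": 95, "OppPts": 110},
    {"TeamPts": 120, "OppPts": 115},
]

for g in games:
    g["TeamDiff"] = g["TeamPts"] - g["OppPts"]  # home points minus away points
    g["TotalPts"] = g["TeamPts"] + g["OppPts"]  # home points plus away points

team_diffs = [g["TeamDiff"] for g in games]
total_pts = [g["TotalPts"] for g in games]
```

Because both derived columns are exact functions of the two score columns, they carry no information beyond TeamPts and OppPts.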

1.2 Research Question

The objective of this case study is to build a logistic regression model that predicts home-team wins from factors associated with the game.

2 Exploratory Analysis

We first make the following pairwise scatter plots to inspect potential issues with the predictor variables.

Looking at the scatter plots, none of the variables appear skewed, and all are unimodal except the binary response variable, team wins. This means that we do not need to transform any of the predictor variables.

2.1 Standardizing Numerical Predictor Variables

Since this is a predictive model, we are not concerned with interpreting the coefficients, so the numerical predictors can be put on a standardized scale. The objective is to identify the model with the best predictive performance.
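
As a rough illustration of the standardization named in this section's heading, here is a minimal Python sketch. It assumes the usual z-score (subtract the mean, divide by the sample standard deviation), which the report does not spell out. The `standardize` helper and the spread values are hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(x - m) / s for x in values]

# Hypothetical TeamSprd values (illustrative, not from the data set).
spreads = [-5.5, 3.0, -1.5, 7.0, -3.0]
z_spreads = standardize(spreads)
```

After this transformation the values have mean 0 and sample standard deviation 1, which puts predictors measured on different scales (such as TeamSprd and OvrUndr) on a common footing.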

2.2 Data Split - Training and Testing Data

We randomly split the data into two subsets. 80% of the data is used as training data: we use it to search among the candidate models, validate them, and identify the final model by cross-validation. The remaining 20%, the holdout sample, is used to assess the performance of the final model.
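
The split can be sketched in Python as follows. This is a minimal illustration, not the code used for the report: the `train_test_split` helper, the seed, and the integer stand-ins for the game rows are all assumptions.

```python
import random

def train_test_split(rows, train_frac=0.8, seed=123):
    """Randomly split rows into a training part and a holdout part."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_train = int(round(train_frac * len(rows)))
    train = [rows[i] for i in idx[:n_train]]
    test = [rows[i] for i in idx[n_train:]]
    return train, test

rows = list(range(1230))               # stand-ins for the 1230 games
train, test = train_test_split(rows)
```

With 1230 observations, an 80/20 split gives 984 training rows and 246 holdout rows.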

2.3 Best Model Identification

In past modules, we introduced full and reduced models to set the scope of the search for the final model. In this case study, the three candidate models are the full model, the reduced model, and the model obtained by step-wise variable selection.

2.3.1 Cross-Validation for Model Identification

Since our training data set is relatively small, we use 5-fold cross-validation so that each validation fold contains enough wins and losses.
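
A minimal Python sketch of how the five folds might be formed is given below. It assumes a training set of 984 rows (80% of 1230) and a hypothetical helper `k_fold_indices`. Model fitting is left as a comment because the candidate models are not specified in detail here.

```python
import random

def k_fold_indices(n, k=5, seed=123):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

n_train = 984                          # assumed size of the training set
folds = k_fold_indices(n_train, k=5)

splits = []
for i, valid_idx in enumerate(folds):
    # Every fold except the i-th is used for fitting in this round.
    train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
    splits.append((train_idx, valid_idx))
    # Here each candidate model would be fit on train_idx and scored on valid_idx.
```

Each observation lands in exactly one validation fold, so every candidate model is scored once on every training row, and the per-fold scores are then averaged.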

Average of prediction errors of candidate models

  PE1   PE2   PE3
    1     1     1

All three candidate models have the same average prediction error. Since model 2 is simpler than model 1, we choose model 2 as the final predictive model. This selection is based on a cut-off probability of 0.5.
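
To make the role of the cut-off concrete, the following Python sketch turns predicted probabilities into 0/1 predictions at 0.5 and scores them. The helper names, probabilities, and outcomes are hypothetical, and the error rate is assumed to be one minus the accuracy.

```python
def classify(probabilities, cutoff=0.5):
    """Predict a win (1) when the probability exceeds the cut-off, else a loss (0)."""
    return [1 if p > cutoff else 0 for p in probabilities]

def accuracy(predicted, actual):
    """Fraction of predictions that match the observed outcomes."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical predicted win probabilities and observed outcomes.
probs = [0.91, 0.12, 0.55, 0.40]
actual = [1, 0, 0, 0]

pred = classify(probs)
acc = accuracy(pred, actual)
error = 1 - acc
```

In this toy example the third game is misclassified, so the accuracy is 0.75 and the error rate is 0.25. A different cut-off could change which games are misclassified.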

2.3.2 Final Model Reporting

The previous cross-validation procedure identified the best model using the pre-selected cut-off probability of 0.5. The actual accuracy of the final model is given below.

The actual accuracy of the final model

  Accuracy
         1

These results tell us that the final model predicts the outcome with 100% accuracy. This is expected: how many points a team scores and how many points it holds its opponent to determine the outcome of the game, since the home team wins exactly when TeamPts exceeds OppPts.
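
The reasoning above can be checked with a small Python sketch. It assumes a logistic model whose linear predictor is a positive multiple of TeamPts minus OppPts. The coefficient `w` and the scores are hypothetical, chosen only to show that such a model reproduces the win indicator at a 0.5 cut-off.

```python
import math

def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical (TeamPts, OppPts) pairs, not rows from the data set.
games = [(102, 98), (95, 110), (120, 115), (88, 89)]
wins = [1 if team > opp else 0 for team, opp in games]

w = 5.0  # hypothetical coefficient: equal and opposite weights on TeamPts and OppPts
probs = [sigmoid(w * (team - opp)) for team, opp in games]
pred = [1 if p > 0.5 else 0 for p in probs]
```

Because the sign of TeamPts minus OppPts fixes the outcome, the predicted probability is above 0.5 exactly for the home wins, so `pred` matches `wins` on every game.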

3 Conclusion

In this project we set out to build a model that predicts wins. The final model achieved an accuracy of 100%, meaning it correctly classified every observation it was evaluated on as a win (1) or a loss (0).

