Introduction
In this project we build a binary predictive model using logistic
regression. We check whether any variables need to be transformed and
use cross-validation to choose the final model. The response variable is
whether the home team won, and we want to find out which betting-odds
factors help predict it.
Data and Variable Descriptions
This data set contains NBA game betting odds and outcomes for the
2014-2015 season. There are 1,230 observations and 17 variables. The
variables in this data set include
- Datenum (categorical)- This is the number of days since January 1,
1960
- Team (categorical)- Where the home team is from
- OppTeam (categorical)- Where the away team is from
- TeamPts (numerical)- Home team points scored
- OppPts (numerical)- Away team points scored
- Wins (binary response)- Whether the home team won (1 means they won, 0
means they lost)
- TeamCov (categorical) - Whether the home team covered the spread (1
means they covered, 0 means a “push”, and -1 means they didn’t
cover)
- TeamSprd (numerical)- The Vegas point spread for the home team
- OvrUndr (numerical)- The over/under Vegas line for the total points
in the game
- TeamDiff (numerical)- Home Points minus Away Points
- TotalPts (numerical)- Home Points plus Away Points
We used only a subset of these variables (TeamPts, OppPts, TeamWin,
TeamSprd, and OvrUndr), chosen based on past experience. The other
variables were either not important for this project or were
combinations of other variables.
Research Question
The objective of this case study is to build a logistic regression
model that predicts home-team wins from predictors associated with the
game.
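As a minimal sketch of the model form, logistic regression passes a linear combination of the predictors through the logistic function to obtain a win probability. The function name `win_probability` and the coefficient values below are hypothetical, chosen only for illustration; they are not fitted values from this data set.

```python
import math

def win_probability(intercept, coefs, x):
    """Logistic model: P(win) = 1 / (1 + exp(-(b0 + sum(b_j * x_j))))."""
    linear = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficient for one standardized predictor (not a fitted value).
p = win_probability(0.0, [-0.8], [-1.5])
print(round(p, 3))
```

A linear predictor of zero gives a probability of exactly 0.5, which is why 0.5 is a natural default cut-off for classifying a game as a win.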
Exploratory Analysis
We first make the following pairwise scatter plots to inspect the
potential issues with predictor variables.

Looking at the scatter plots, none of the predictor variables appear
skewed and all are unimodal; the only exception is the binary response
variable, team wins. We therefore do not need to transform any of the
predictor variables.
Standardizing Numerical Predictor Variables
Since this is a predictive model, we don’t worry about the
interpretation of the coefficients. The objective is to identify a model
that has the best predictive performance.
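Standardizing a numerical predictor could look roughly like the sketch below, which converts values to z-scores. The helper name `standardize` and the sample values are made up for illustration and are not taken from the data set.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    return [(v - mean) / sd for v in values]

# Made-up over/under lines, for illustration only.
ovr_undr = [195.5, 201.0, 188.5, 210.0, 199.5]
z = standardize(ovr_undr)
print([round(v, 2) for v in z])
```

After this step each predictor has mean 0 and standard deviation 1, so the predictors are on a common scale even though the coefficients are no longer in the original units.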
Data Split - Training and Testing Data
We randomly split the data into two subsets. 80% of the data will be
used as training data. We will use the training data to search the
candidate models, validate them, and identify the final model using
cross-validation. The remaining 20%, the hold-out sample, will be used
to assess the performance of the final model.
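A random 80/20 split of the 1,230 observations can be sketched as follows. The function name `train_test_split_indices` and the seed are assumptions made for this example, not the settings used in the analysis.

```python
import random

def train_test_split_indices(n_rows, train_fraction=0.8, seed=123):
    """Shuffle row indices and cut them into training and hold-out sets."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = int(round(train_fraction * n_rows))
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = train_test_split_indices(1230)
print(len(train_idx), len(test_idx))  # 984 246
```

The hold-out indices are set aside until the final model has been chosen, so they play no part in model selection.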
Best Model Identification
In the past modules, we introduced full and reduced models to set up
the scope for searching for the final model. In this case study, we use
the full, reduced, and final models obtained based on the step-wise
variable selection as the three candidate models.
Cross-Validation for Model Identification
Since our training data is relatively small, we use 5-fold
cross-validation to ensure each validation set contains enough
observations of both outcomes (wins and losses).
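The fold structure can be sketched as below, assuming the training set holds 984 rows (80% of 1,230). The helper `k_fold_indices` and the seed are hypothetical; fitting the candidate models within each fold is omitted here.

```python
import random

def k_fold_indices(n_rows, k=5, seed=123):
    """Partition shuffled row indices into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(984, k=5)
for i, validation in enumerate(folds):
    # Every fold serves once as validation data; the rest is training data.
    training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    print(i, len(training), len(validation))
```

Each candidate model would be fitted on `training` and scored on `validation` in every iteration, and the five scores averaged per model.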
Average of prediction errors of candidate models
| 1 | 1 | 1 |
The average predictive errors of both model 1 and model 2 are the
same. Since model 2 is simpler than model 1, we choose model 2 as the
final predictive model. This selection of the final model is based on
the cut-off probability 0.5.
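To make the role of the cut-off concrete, the sketch below turns predicted probabilities into win/loss classes at 0.5 and computes the share of misclassified games. The function name `prediction_error` and the probabilities and outcomes are made up for illustration.

```python
def prediction_error(probabilities, outcomes, cutoff=0.5):
    """Share of observations whose predicted class differs from the outcome."""
    predicted = [1 if p > cutoff else 0 for p in probabilities]
    wrong = sum(1 for yhat, y in zip(predicted, outcomes) if yhat != y)
    return wrong / len(outcomes)

# Made-up probabilities and outcomes for one validation fold.
probs = [0.91, 0.12, 0.67, 0.48, 0.55]
wins = [1, 0, 1, 1, 0]
fold_error = prediction_error(probs, wins)
print(fold_error)  # 0.4
```

Accuracy is one minus this error, so a different cut-off could change which candidate model looks best.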
Final Model Reporting
The previous cross-validation procedure identified the best model
with pre-selected cut-off 0.5. The actual accuracy of the final model is
given by
The actual accuracy of the final model
| 1 |
These results tell us that the final model predicts the outcome with
100% accuracy. This is expected because the home team wins exactly when
TeamPts exceeds OppPts, so how many points a team scores and how many
points it holds its opponent to together determine the outcome of the
game.
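The reasoning above can be checked on synthetic data: when the win indicator is defined from the two score variables, a rule that compares them classifies every game correctly. The scores below are randomly generated for illustration and are not the NBA data.

```python
import random

rng = random.Random(2015)
games = []
for _ in range(200):
    team_pts = rng.randint(80, 125)
    opp_pts = rng.randint(80, 125)
    if team_pts == opp_pts:
        opp_pts += 1  # simplification: a game cannot end tied
    games.append((team_pts, opp_pts, 1 if team_pts > opp_pts else 0))

# Classify with a rule that uses only the two score variables.
predicted = [1 if team > opp else 0 for team, opp, _ in games]
accuracy = sum(p == win for p, (_, _, win) in zip(predicted, games)) / len(games)
print(accuracy)  # 1.0
```

A logistic model that is given both score variables can approximate this rule, which would be consistent with the accuracy reported above.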
Conclusion
In this project we set out to build a model that predicts wins. The
final model reached an accuracy of 100%, which means it correctly
classified every hold-out observation as a win (1) or a loss (0).
