This assignment has two parts.
First, I analyze NCAA football coaching changes using logit and probit models. The dependent variable is `cochange`, where 1 means a coaching change occurred and 0 means no coaching change occurred.
Second, I use machine learning to predict Chicago Cubs game outcomes with a Random Forest model. This satisfies the machine learning portion of the assignment using my own sports dataset.
The first model estimates whether a coaching change occurred using recent winning percentage, games coached at the current school, winning percentage at the current school, and total career games coached.
| Observations | Coaching Changes | No Coaching Change | Coaching Change Rate |
|---|---|---|---|
| 4053 | 690 | 3363 | 17.0% |
| Variable | Odds Ratio | Lower 95% CI | Upper 95% CI | p-value | Significance |
|---|---|---|---|---|---|
| Intercept | 0.4349 | 0.3485 | 0.5412 | 0.0000 | *** |
| Recent Season Win % | 0.0666 | 0.0348 | 0.1268 | 0.0000 | *** |
| Games at Current School | 1.0061 | 1.0033 | 1.0091 | 0.0000 | *** |
| Win % at Current School | 2.4066 | 1.1081 | 5.2335 | 0.0265 | * |
| Career Games Coached | 0.9972 | 0.9949 | 0.9995 | 0.0178 | * |
This logit model evaluates whether coaching turnover is associated with short-term performance, school-specific success, tenure at the current school, and overall career experience.
Odds ratios above 1 mean that higher values are associated with higher odds of a coaching change. Odds ratios below 1 mean that higher values are associated with lower odds of a coaching change.
Recent winning percentage captures immediate performance pressure. Winning percentage at the current school captures the coach’s broader record at that institution. Games coached at the current school measures tenure, while career games coached measures overall experience.
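The odds ratio for Recent Season Win % (0.0666) describes a full change from 0 to 1 in winning percentage, which is larger than any realistic season-to-season swing. The sketch below, in Python rather than the software used for this report, rescales it to a 10-percentage-point change. It assumes winning percentage is stored on a 0-to-1 scale, as the scenario values in this report suggest; the function name `rescale_odds_ratio` is illustrative only.

```python
import math

# Odds ratio for Recent Season Win % as reported in the table above.
# It corresponds to a one-unit (0 to 1) change in winning percentage.
OR_RECENT_WIN_PCT = 0.0666


def rescale_odds_ratio(odds_ratio: float, delta: float) -> float:
    """Odds ratio implied by a change of `delta` units in the predictor."""
    beta = math.log(odds_ratio)  # recover the logit coefficient
    return math.exp(beta * delta)


# Assumption: win % is on a 0-1 scale, so 0.10 is a 10-point improvement.
or_10_points = rescale_odds_ratio(OR_RECENT_WIN_PCT, 0.10)
print(f"Odds ratio per 10-point gain in recent win %: {or_10_points:.3f}")
```

Under these assumptions, a 10-point improvement in recent winning percentage multiplies the odds of a coaching change by roughly 0.76, holding the other variables fixed.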
| Scenario | Recent Win % | School Win % | Predicted Probability |
|---|---|---|---|
| Poor Recent Performance | 0.25 | 0.30 | 23.7% |
| Average Performance | 0.50 | 0.50 | 15.9% |
| Strong Performance | 0.75 | 0.70 | 10.2% |
The predicted probability table translates the logit model into a more intuitive form. Instead of only reading coefficients or odds ratios, the table shows how the estimated probability of a coaching change changes under poor, average, and strong performance scenarios.
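As a rough illustration of how such scenario probabilities are obtained, the Python sketch below converts the reported odds ratios back to logit coefficients and applies the inverse-logit function. The report does not state which values of games at the current school and career games were held fixed, so `GAMES_AT_SCHOOL` and `CAREER_GAMES` are hypothetical placeholders, and the printed probabilities are not expected to match the table exactly.

```python
import math

# Odds ratios copied from the winning-percentage logit table above.
ODDS_RATIOS = {
    "intercept": 0.4349,
    "recent_win_pct": 0.0666,
    "games_current_school": 1.0061,
    "school_win_pct": 2.4066,
    "career_games": 0.9972,
}

# Logit coefficients are the natural logs of the odds ratios.
COEFS = {name: math.log(value) for name, value in ODDS_RATIOS.items()}


def predicted_probability(recent_win_pct, school_win_pct,
                          games_current_school, career_games):
    """Inverse logit of the linear predictor for one coach-season."""
    linear_predictor = (
        COEFS["intercept"]
        + COEFS["recent_win_pct"] * recent_win_pct
        + COEFS["school_win_pct"] * school_win_pct
        + COEFS["games_current_school"] * games_current_school
        + COEFS["career_games"] * career_games
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# HYPOTHETICAL tenure values: the report does not say what it held fixed.
GAMES_AT_SCHOOL = 40
CAREER_GAMES = 80

# (recent win %, school win %) pairs taken from the scenario table.
scenarios = {
    "Poor Recent Performance": (0.25, 0.30),
    "Average Performance": (0.50, 0.50),
    "Strong Performance": (0.75, 0.70),
}

probabilities = {
    label: predicted_probability(recent, school, GAMES_AT_SCHOOL, CAREER_GAMES)
    for label, (recent, school) in scenarios.items()
}
for label, prob in probabilities.items():
    print(f"{label}: {prob:.1%}")
```

The ordering of the three scenarios does not depend on the placeholder tenure values, because they shift every scenario's linear predictor by the same amount.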
| Variable | Odds Ratio | Lower 95% CI | Upper 95% CI | p-value | Significance |
|---|---|---|---|---|---|
| Intercept | 0.2088 | 0.1235 | 0.3519 | 0.0000 | *** |
| Recent Points For | 0.9971 | 0.9958 | 0.9983 | 0.0000 | *** |
| Recent Points Against | 1.0011 | 0.9997 | 1.0025 | 0.1128 | |
| Games at Current School | 1.0186 | 1.0055 | 1.0319 | 0.0053 | ** |
| Points For at Current School | 0.9991 | 0.9988 | 0.9995 | 0.0000 | *** |
| Points Against at Current School | 1.0007 | 1.0003 | 1.0010 | 0.0002 | *** |
| Career Games Coached | 0.9971 | 0.9948 | 0.9994 | 0.0149 | * |
The second logit model replaces winning percentage with points for and points against. This gives a more detailed performance model because it separates offensive production from defensive performance.
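Both `PF` (Recent Points For) and `pfsc` (Points For at Current School) have odds ratios below 1 (0.9971 and 0.9991), so stronger offensive production is associated with lower odds of a coaching change. Both `PA` (Recent Points Against) and `pasc` (Points Against at Current School) have odds ratios above 1 (1.0011 and 1.0007), so allowing more points is associated with higher odds of a coaching change, although the `PA` estimate is not statistically significant (p = 0.1128).
One limitation is that `pfsc` and `pasc` are cumulative school totals. Coaches who remain at a school longer naturally accumulate more points for and against, so those variables partially overlap with `gamesc`.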
| Model | AIC | Residual Deviance | Observations |
|---|---|---|---|
| Winning Percentage Logit | 3572.873 | 3562.873 | 4053 |
| Scoring-Based Logit | 3525.792 | 3511.792 | 4053 |
The scoring-based model is preferred on two grounds. It separates offense and defense instead of using only winning percentage, and it has the lower AIC (3525.792 versus 3572.873). A lower AIC indicates better relative model fit after accounting for model complexity.
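The AIC values in the comparison table can be reproduced from the residual deviances. The Python sketch below assumes the standard relationship for a binary model fit to ungrouped data, where the residual deviance equals minus twice the log-likelihood, so AIC = residual deviance + 2k with k the number of estimated coefficients including the intercept. The parameter counts (5 and 7) are read off the two odds ratio tables.

```python
def aic_from_deviance(residual_deviance: float, n_params: int) -> float:
    """AIC for an ungrouped binary model: deviance plus 2 per parameter."""
    return residual_deviance + 2 * n_params


# (residual deviance, number of coefficients including the intercept)
models = {
    "Winning Percentage Logit": (3562.873, 5),
    "Scoring-Based Logit": (3511.792, 7),
}

aic = {name: aic_from_deviance(dev, k) for name, (dev, k) in models.items()}
for name, value in aic.items():
    print(f"{name}: AIC = {value:.3f}")

# Positive difference means the scoring-based model has the lower AIC.
aic_gap = aic["Winning Percentage Logit"] - aic["Scoring-Based Logit"]
print(f"AIC difference: {aic_gap:.3f}")
```

The scoring-based model pays a penalty for two extra parameters but still ends up about 47 AIC points lower, which is why the fit improvement is not explained by added complexity alone.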
| Model | Variable | Coefficient | Std. Error | z-statistic | p-value | Significance |
|---|---|---|---|---|---|---|
| Logit Model | | | | | | |
| Logit | Intercept | -1.56655 | 0.26703 | -5.86648 | 0.00000 | *** |
| Logit | Recent Points For | -0.00294 | 0.00063 | -4.70098 | 0.00000 | *** |
| Logit | Recent Points Against | 0.00111 | 0.00070 | 1.58579 | 0.11279 | |
| Logit | Games at Current School | 0.01841 | 0.00660 | 2.78771 | 0.00531 | ** |
| Logit | Points For at Current School | -0.00086 | 0.00017 | -5.02210 | 0.00000 | *** |
| Logit | Points Against at Current School | 0.00069 | 0.00018 | 3.78241 | 0.00016 | *** |
| Logit | Career Games Coached | -0.00286 | 0.00117 | -2.43412 | 0.01493 | * |
| Probit Model | | | | | | |
| Probit | Intercept | -0.95107 | 0.14754 | -6.44619 | 0.00000 | *** |
| Probit | Recent Points For | -0.00154 | 0.00034 | -4.53451 | 0.00001 | *** |
| Probit | Recent Points Against | 0.00060 | 0.00039 | 1.52538 | 0.12716 | |
| Probit | Games at Current School | 0.00976 | 0.00369 | 2.64666 | 0.00813 | ** |
| Probit | Points For at Current School | -0.00045 | 0.00009 | -4.76121 | 0.00000 | *** |
| Probit | Points Against at Current School | 0.00037 | 0.00010 | 3.66577 | 0.00025 | *** |
| Probit | Career Games Coached | -0.00164 | 0.00064 | -2.57415 | 0.01005 | * |
| Model | AIC | Residual Deviance | Observations |
|---|---|---|---|
| Logit | 3525.792 | 3511.792 | 4053 |
| Probit | 3532.655 | 3518.655 | 4053 |
| Scenario | PF | PA | pfsc | pasc | Logit Predicted Probability | Probit Predicted Probability |
|---|---|---|---|---|---|---|
| Weak Scoring Profile | 205 | 306 | 409 | 1445 | 33.8% | 32.3% |
| Average Scoring Profile | 261 | 252 | 845 | 853 | 15.7% | 16.0% |
| Strong Scoring Profile | 333 | 202 | 1614 | 445 | 5.2% | 5.1% |
The logit and probit models are both appropriate for binary dependent variables. The main difference is the assumed distribution behind the model: logit uses the logistic distribution, while probit uses the standard normal distribution.
The raw coefficients should not be compared directly because they are on different scales. Instead, the important comparison is whether the signs, significance levels, model fit, and predicted probabilities tell the same story.
The logit and probit predicted probabilities are close across the scoring scenarios (33.8% vs. 32.3%, 15.7% vs. 16.0%, and 5.2% vs. 5.1%), differing by at most 1.5 percentage points, so the results are robust to the choice of binary response model.
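The scale difference between the two sets of coefficients can be made concrete with a short Python sketch. It takes the logit and probit coefficients from the table above, computes their ratios, and defines the two link functions using only the standard library. The variable and function names are illustrative and are not taken from the original analysis code.

```python
import math

# Coefficients copied from the logit and probit rows of the table above.
LOGIT = {
    "Intercept": -1.56655,
    "Recent Points For": -0.00294,
    "Recent Points Against": 0.00111,
    "Games at Current School": 0.01841,
    "Points For at Current School": -0.00086,
    "Points Against at Current School": 0.00069,
    "Career Games Coached": -0.00286,
}
PROBIT = {
    "Intercept": -0.95107,
    "Recent Points For": -0.00154,
    "Recent Points Against": 0.00060,
    "Games at Current School": 0.00976,
    "Points For at Current School": -0.00045,
    "Points Against at Current School": 0.00037,
    "Career Games Coached": -0.00164,
}


def logistic_cdf(x: float) -> float:
    """Link used by the logit model."""
    return 1.0 / (1.0 + math.exp(-x))


def normal_cdf(x: float) -> float:
    """Standard normal CDF, the link used by the probit model."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# A positive ratio means both models agree on the sign of the effect.
ratios = {name: LOGIT[name] / PROBIT[name] for name in LOGIT}
for name, ratio in ratios.items():
    print(f"{name}: logit/probit = {ratio:.2f}")
```

Every ratio is positive and falls between roughly 1.6 and 1.9, which reflects the wider spread of the logistic distribution rather than any substantive disagreement between the models.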
For the machine learning section, I use a Random Forest classification model to predict whether the Chicago Cubs win a game.
The model uses only pregame information based on recent team form, including 10-game rolling win percentage, recent scoring, recent runs allowed, recent run differential, home/away status, and month.
This avoids data leakage because the model does not use final score information from the game it is trying to predict.
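One way to keep a rolling feature free of leakage is to compute it only from games played before the one being predicted. The Python sketch below shows this for a rolling win percentage. It is a minimal illustration, not the feature code used for this report: the function name `pregame_rolling_win_pct` is hypothetical, and returning `None` before any games have been played is an assumption about how early-season games might be handled.

```python
from collections import deque


def pregame_rolling_win_pct(results, window=10):
    """Rolling win % for each game using only earlier games.

    `results` is a chronological list of 1 (win) and 0 (loss).
    The feature for game i is computed BEFORE game i's result is
    added to the history, so the outcome being predicted never
    enters its own feature.
    """
    history = deque(maxlen=window)  # keeps at most `window` prior results
    features = []
    for won in results:
        if history:
            features.append(sum(history) / len(history))
        else:
            features.append(None)  # assumption: no prior games, no feature
        history.append(won)  # update only after the feature is recorded
    return features


# Small illustrative sequence with a 2-game window.
example = pregame_rolling_win_pct([1, 0, 1, 1], window=2)
print(example)
```

The same pattern extends to recent runs scored, runs allowed, and run differential: record the feature first, then add the current game to the history.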
| Total Games | Training Games: 2014-2022 | Testing Games: 2023-2025 | Overall Win Rate |
|---|---|---|---|
| 1009 | 677 | 332 | 48.3% |
| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 0.5 |
| Predicted | Actual Loss | Actual Win |
|---|---|---|
| Loss | 101 | 105 |
| Win | 61 | 65 |
| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 0.5000 |
| precision | binary | 0.5159 |
| recall | binary | 0.3824 |
| roc_auc | binary | 0.5118 |
The Random Forest model predicts Cubs wins using only information available before each game. This makes the model more realistic than using final runs scored or final runs allowed from the same game.
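The accuracy table shows the overall share of correct classifications: 0.50, or 166 of 332 test games. The confusion matrix shows how often the model correctly identifies wins and losses; it finds only 65 of the 170 actual wins (recall 0.3824). Precision, recall, and ROC AUC provide additional detail about classification quality, and the ROC AUC of 0.5118 indicates performance only marginally better than chance on the 2023-2025 test games.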
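The reported accuracy, precision, and recall can be recomputed directly from the confusion matrix. The Python sketch below assumes that rows are predicted classes, columns are actual outcomes, and Win is the positive class; that reading is an inference, adopted because it reproduces the reported precision and recall. The majority-class baseline is an added comparison, not a figure from the report's tables.

```python
# Counts from the confusion matrix (assumed: rows = predicted, cols = actual).
true_pos = 65    # predicted Win, actual Win
false_pos = 61   # predicted Win, actual Loss
false_neg = 105  # predicted Loss, actual Win
true_neg = 101   # predicted Loss, actual Loss

total = true_pos + false_pos + false_neg + true_neg

accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)  # share of predicted wins that were wins
recall = true_pos / (true_pos + false_neg)     # share of actual wins that were found

# Baseline: always predict the more common actual outcome in the test set.
actual_wins = true_pos + false_neg
actual_losses = true_neg + false_pos
majority_baseline = max(actual_wins, actual_losses) / total

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"majority-class baseline = {majority_baseline:.4f}")
```

Comparing accuracy with the majority-class baseline is a quick check on whether the classifier adds anything beyond always predicting the more frequent outcome.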
The feature importance plot shows which pregame variables the Random Forest relied on most heavily. If recent run differential, recent win percentage, or recent run prevention rank highly, that suggests short-term team quality is important for predicting future Cubs outcomes.
The NCAA coaching analysis shows that coaching changes can be studied systematically using binary response models. The winning percentage model captures overall performance, while the scoring-based model separates offensive and defensive performance.
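The logit and probit comparison shows that the preferred scoring-based model is robust across two common binary response specifications: the coefficient signs, significance levels, and predicted probabilities are similar under both.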
The machine learning section extends the assignment by using a Random Forest model to predict Chicago Cubs wins from pregame information. Together, the assignment demonstrates both traditional econometric modeling and machine learning classification in sports analytics.