Introduction

This assignment has two parts.

First, I analyze NCAA football coaching changes using logit and probit models. The dependent variable is cochange, where 1 means a coaching change occurred and 0 means no coaching change occurred.

Second, I use machine learning to predict Chicago Cubs game outcomes with a Random Forest model. This satisfies the machine learning portion of the assignment using my own sports dataset.

Question 1: NCAA Football Coaching Changes

The first model estimates the probability of a coaching change as a function of recent winning percentage, games coached at the current school, winning percentage at the current school, and total career games coached.

Coaching Change Summary

| Observations | Coaching Changes | No Coaching Change | Coaching Change Rate |
|---|---|---|---|
| 4053 | 690 | 3363 | 17.0% |

Logit Model Results Using Winning Percentage

| Variable | Odds Ratio | Lower 95% CI | Upper 95% CI | p-value | Significance |
|---|---|---|---|---|---|
| Intercept | 0.4349 | 0.3485 | 0.5412 | 0.0000 | *** |
| Recent Season Win % | 0.0666 | 0.0348 | 0.1268 | 0.0000 | *** |
| Games at Current School | 1.0061 | 1.0033 | 1.0091 | 0.0000 | *** |
| Win % at Current School | 2.4066 | 1.1081 | 5.2335 | 0.0265 | * |
| Career Games Coached | 0.9972 | 0.9949 | 0.9995 | 0.0178 | * |

This logit model evaluates whether coaching turnover is associated with short-term performance, school-specific success, tenure at the current school, and overall career experience.

Odds ratios above 1 mean that higher values of a variable are associated with higher odds of a coaching change, holding the other variables constant. Odds ratios below 1 mean that higher values are associated with lower odds of a coaching change.
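Because winning percentage runs from 0 to 1, a one-unit odds ratio describes a jump from a winless record to an undefeated one. As a rough sketch, an odds ratio can be rescaled to a smaller step by raising it to the size of that step. The helper name `rescale_odds_ratio` and the 0.10 step are illustrative assumptions, not part of the original analysis; 0.0666 is the Recent Season Win % odds ratio from the table above.

```python
def rescale_odds_ratio(odds_ratio: float, step: float) -> float:
    """Odds ratio for a change of `step` units instead of one full unit.

    The effect is linear on the log-odds scale, so exp(beta * step)
    equals exp(beta) ** step.
    """
    return odds_ratio ** step


# Recent Season Win % odds ratio from the winning-percentage logit table.
recent_win_pct_or = 0.0666

# Hypothetical step: a 10-percentage-point rise in recent win percentage.
or_per_10_points = rescale_odds_ratio(recent_win_pct_or, 0.10)
print(round(or_per_10_points, 3))
```

Under these assumptions, a 10-point rise in recent winning percentage multiplies the odds of a coaching change by roughly 0.76, which is easier to reason about than the full-unit figure.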

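Recent winning percentage captures immediate performance pressure, and its odds ratio of 0.0666 is far below 1. Winning percentage at the current school captures the coach’s broader record at that institution; its odds ratio of 2.4066 is above 1, so once recent results are held constant a stronger school record is associated with higher, not lower, odds of a change. Games coached at the current school measures tenure, while career games coached measures overall experience.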

Predicted Probability of Coaching Change

| Scenario | Recent Win % | School Win % | Predicted Probability |
|---|---|---|---|
| Poor Recent Performance | 0.25 | 0.30 | 23.7% |
| Average Performance | 0.50 | 0.50 | 15.9% |
| Strong Performance | 0.75 | 0.70 | 10.2% |

The predicted probability table translates the logit model into a more intuitive form. Rather than reading only coefficients or odds ratios, it shows the estimated probability of a coaching change falling from 23.7% under poor performance to 15.9% under average performance and 10.2% under strong performance.
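The link between the odds ratios and the scenario probabilities can be checked with a short sketch. It starts from the tabulated 23.7% for the poor scenario, converts it to odds, applies the two winning-percentage odds ratios, and converts back. It assumes that tenure and career games are held at the same values in all three scenarios, which the report does not state explicitly; the function names are illustrative.

```python
def prob_to_odds(p: float) -> float:
    return p / (1.0 - p)


def odds_to_prob(odds: float) -> float:
    return odds / (1.0 + odds)


# Odds ratios from the winning-percentage logit table.
OR_RECENT = 0.0666  # Recent Season Win %
OR_SCHOOL = 2.4066  # Win % at Current School


def shift_probability(base_prob: float, d_recent: float, d_school: float) -> float:
    """Move from a baseline probability to a new scenario.

    Assumes every other covariate (tenure, career games) is unchanged.
    """
    multiplier = (OR_RECENT ** d_recent) * (OR_SCHOOL ** d_school)
    return odds_to_prob(prob_to_odds(base_prob) * multiplier)


poor = 0.237  # Poor Recent Performance row: 0.25 recent, 0.30 school
average = shift_probability(poor, 0.50 - 0.25, 0.50 - 0.30)
strong = shift_probability(poor, 0.75 - 0.25, 0.70 - 0.30)
print(f"average: {average:.3f}, strong: {strong:.3f}")
```

If the assumption holds, the two results should land close to the tabulated 15.9% and 10.2%, with small differences due to rounding in the published odds ratios.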

Scoring-Based Logit Model Results

| Variable | Odds Ratio | Lower 95% CI | Upper 95% CI | p-value | Significance |
|---|---|---|---|---|---|
| Intercept | 0.2088 | 0.1235 | 0.3519 | 0.0000 | *** |
| Recent Points For | 0.9971 | 0.9958 | 0.9983 | 0.0000 | *** |
| Recent Points Against | 1.0011 | 0.9997 | 1.0025 | 0.1128 | |
| Games at Current School | 1.0186 | 1.0055 | 1.0319 | 0.0053 | ** |
| Points For at Current School | 0.9991 | 0.9988 | 0.9995 | 0.0000 | *** |
| Points Against at Current School | 1.0007 | 1.0003 | 1.0010 | 0.0002 | *** |
| Career Games Coached | 0.9971 | 0.9948 | 0.9994 | 0.0149 | * |

The second logit model replaces winning percentage with points for and points against. This gives a more detailed performance model because it separates offensive production from defensive performance.

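In this model, PF and PA are recent points for and points against, and pfsc and pasc are the corresponding totals at the current school. Both PF (0.9971) and pfsc (0.9991) have odds ratios below 1 and are statistically significant, so stronger offensive production is associated with lower odds of a coaching change. Both PA (1.0011) and pasc (1.0007) have odds ratios above 1, so allowing more points is associated with higher odds of a change, although the recent points against estimate is not statistically significant (p = 0.1128).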

One limitation is that pfsc and pasc are cumulative school totals. Coaches who remain at a school longer naturally accumulate more points for and against, so those variables partially overlap with gamesc, the games-at-current-school variable.
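The overlap can be illustrated with a small simulation. The data below are entirely synthetic (the tenure range, scoring rate, and seed are made-up values, not drawn from the NCAA dataset), and the per-game rate is only one possible remedy rather than something the original analysis used.

```python
import random


def pearson(xs, ys):
    """Pearson correlation computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic coaches: tenure in games and an unrelated scoring rate.
games = [rng.randint(12, 240) for _ in range(500)]
points_per_game = [rng.gauss(26.0, 5.0) for _ in range(500)]

# A cumulative total, analogous to pfsc, grows mechanically with tenure.
cumulative_points = [g * r for g, r in zip(games, points_per_game)]

corr_total = pearson(games, cumulative_points)
corr_rate = pearson(games, points_per_game)
print(f"tenure vs cumulative points: {corr_total:.2f}")
print(f"tenure vs points per game:   {corr_rate:.2f}")
```

In this synthetic setup the cumulative total is strongly correlated with tenure even though scoring ability is unrelated to it, while the per-game rate is not. Dividing pfsc and pasc by gamesc would be one way to separate performance from tenure.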

Comparison of Logit Model Specifications

| Model | AIC | Residual Deviance | Observations |
|---|---|---|---|
| Winning Percentage Logit | 3572.873 | 3562.873 | 4053 |
| Scoring-Based Logit | 3525.792 | 3511.792 | 4053 |

The scoring-based model is preferred. It separates offense and defense instead of using only winning percentage, and it also fits better: its AIC of 3525.792 is about 47 points lower than the winning percentage model's 3572.873. A lower AIC indicates better relative model fit after accounting for model complexity.
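The complexity penalty can be made concrete. For a model fit to individual 0/1 outcomes, the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus two times the number of estimated parameters. The sketch below applies that identity to the reported deviances; the function name is illustrative, and the parameter counts are read off the two results tables (intercept plus predictors).

```python
def aic_from_deviance(residual_deviance: float, n_params: int) -> float:
    """AIC = residual deviance + 2k for a model of ungrouped 0/1 outcomes."""
    return residual_deviance + 2 * n_params


win_pct_aic = aic_from_deviance(3562.873, 5)  # intercept + 4 predictors
scoring_aic = aic_from_deviance(3511.792, 7)  # intercept + 6 predictors

print(round(win_pct_aic, 3))                # winning percentage logit
print(round(scoring_aic, 3))                # scoring-based logit
print(round(win_pct_aic - scoring_aic, 3))  # AIC advantage of the scoring model
```

The scoring-based model pays a larger penalty for its two extra parameters, yet its deviance is low enough that it still has the smaller AIC.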

Logit and Probit Coefficient Comparison

| Model | Variable | Coefficient | Std. Error | z-statistic | p-value | Significance |
|---|---|---|---|---|---|---|
| Logit | Intercept | -1.56655 | 0.26703 | -5.86648 | 0.00000 | *** |
| Logit | Recent Points For | -0.00294 | 0.00063 | -4.70098 | 0.00000 | *** |
| Logit | Recent Points Against | 0.00111 | 0.00070 | 1.58579 | 0.11279 | |
| Logit | Games at Current School | 0.01841 | 0.00660 | 2.78771 | 0.00531 | ** |
| Logit | Points For at Current School | -0.00086 | 0.00017 | -5.02210 | 0.00000 | *** |
| Logit | Points Against at Current School | 0.00069 | 0.00018 | 3.78241 | 0.00016 | *** |
| Logit | Career Games Coached | -0.00286 | 0.00117 | -2.43412 | 0.01493 | * |
| Probit | Intercept | -0.95107 | 0.14754 | -6.44619 | 0.00000 | *** |
| Probit | Recent Points For | -0.00154 | 0.00034 | -4.53451 | 0.00001 | *** |
| Probit | Recent Points Against | 0.00060 | 0.00039 | 1.52538 | 0.12716 | |
| Probit | Games at Current School | 0.00976 | 0.00369 | 2.64666 | 0.00813 | ** |
| Probit | Points For at Current School | -0.00045 | 0.00009 | -4.76121 | 0.00000 | *** |
| Probit | Points Against at Current School | 0.00037 | 0.00010 | 3.66577 | 0.00025 | *** |
| Probit | Career Games Coached | -0.00164 | 0.00064 | -2.57415 | 0.01005 | * |

Logit and Probit Model Fit Comparison

| Model | AIC | Residual Deviance | Observations |
|---|---|---|---|
| Logit | 3525.792 | 3511.792 | 4053 |
| Probit | 3532.655 | 3518.655 | 4053 |

Predicted Probability Comparison

| Scenario | PF | PA | pfsc | pasc | Logit Predicted Probability | Probit Predicted Probability |
|---|---|---|---|---|---|---|
| Weak Scoring Profile | 205 | 306 | 409 | 1445 | 33.8% | 32.3% |
| Average Scoring Profile | 261 | 252 | 845 | 853 | 15.7% | 16.0% |
| Strong Scoring Profile | 333 | 202 | 1614 | 445 | 5.2% | 5.1% |

The logit and probit models are both appropriate for binary dependent variables. The main difference is the assumed distribution behind the model: logit uses the logistic distribution, while probit uses the standard normal distribution.
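The two link functions can be compared directly. The sketch below evaluates the logistic and standard normal cumulative distribution functions on a grid, after stretching the logistic input by 1.702. That constant is a conventional rescaling value chosen for illustration, not something estimated from the coaching data.

```python
import math


def logistic_cdf(x: float) -> float:
    """Link used by the logit model."""
    return 1.0 / (1.0 + math.exp(-x))


def normal_cdf(x: float) -> float:
    """Link used by the probit model (standard normal CDF)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


SCALE = 1.702  # conventional rescaling constant, not fitted to this data

# Compare the two curves on a grid from -5 to 5 in steps of 0.01.
grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(logistic_cdf(SCALE * z) - normal_cdf(z)) for z in grid)
print(f"largest gap on the grid: {max_gap:.4f}")
```

Once the scales are aligned, the two curves differ by less than one percentage point everywhere on the grid, which is why logit and probit models usually give similar predicted probabilities.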

The raw coefficients should not be compared directly because they are on different scales. Instead, the important comparison is whether the signs, significance levels, model fit, and predicted probabilities tell the same story.
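A quick way to see the scale difference is to divide each logit coefficient by its probit counterpart. The sketch below uses the coefficients from the comparison table above; the dictionary layout and variable names are illustrative.

```python
# (logit, probit) coefficients copied from the comparison table.
coefficients = {
    "Intercept": (-1.56655, -0.95107),
    "Recent Points For": (-0.00294, -0.00154),
    "Recent Points Against": (0.00111, 0.00060),
    "Games at Current School": (0.01841, 0.00976),
    "Points For at Current School": (-0.00086, -0.00045),
    "Points Against at Current School": (0.00069, 0.00037),
    "Career Games Coached": (-0.00286, -0.00164),
}

# Ratio of logit to probit coefficient for each variable.
ratios = {name: logit / probit for name, (logit, probit) in coefficients.items()}

# A positive product means the two models agree on the sign.
same_sign = all(logit * probit > 0 for logit, probit in coefficients.values())

for name, ratio in ratios.items():
    print(f"{name:<34} {ratio:.2f}")
print("signs agree:", same_sign)
```

Every ratio falls between roughly 1.6 and 1.9 and every sign matches, so the two sets of coefficients differ mainly by a scale factor rather than in substance.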

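The logit and probit predicted probabilities are close across the scoring scenarios, differing by at most 1.5 percentage points (33.8% versus 32.3% for the weak scoring profile). The signs and significance patterns also match, and the logit AIC (3525.792) is slightly lower than the probit AIC (3532.655). The results are therefore robust to the choice of binary response model.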

Question 2: Chicago Cubs Random Forest Model

For the machine learning section, I use a Random Forest classification model to predict whether the Chicago Cubs win a game.

The model uses only pregame information based on recent team form, including 10-game rolling win percentage, recent scoring, recent runs allowed, recent run differential, home/away status, and month.

This avoids data leakage because the model does not use final score information from the game it is trying to predict.
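One way to build such features is to average only the games played before the one being predicted. The sketch below is a minimal illustration of that idea, not the code used for this report: the function name, the short window, and the sample results are hypothetical, and returning None for the first game is one of several reasonable choices.

```python
from typing import List, Optional


def prior_rolling_mean(values: List[float], window: int) -> List[Optional[float]]:
    """Rolling mean over up to `window` earlier games, excluding the current one."""
    features: List[Optional[float]] = []
    for i in range(len(values)):
        history = values[max(0, i - window):i]  # slice stops before game i
        features.append(sum(history) / len(history) if history else None)
    return features


# Hypothetical results: 1 = win, 0 = loss, in chronological order.
wins = [1, 0, 1, 1, 0]
print(prior_rolling_mean(wins, window=2))
# prints [None, 1.0, 0.5, 0.5, 1.0]
```

The feature for each game depends only on earlier results, so the outcome being predicted never enters its own predictors. A 10-game window would use `window=10`.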

Chicago Cubs Machine Learning Dataset Summary

| Total Games | Training Games (2014-2022) | Testing Games (2023-2025) | Overall Win Rate |
|---|---|---|---|
| 1009 | 677 | 332 | 48.3% |

Random Forest Test Set Accuracy

| Metric | Estimator | Estimate |
|---|---|---|
| accuracy | binary | 0.5000 |

Random Forest Confusion Matrix

| Predicted | Actual Loss | Actual Win |
|---|---|---|
| Loss | 101 | 105 |
| Win | 61 | 65 |

Random Forest Classification Metrics

| Metric | Estimator | Estimate |
|---|---|---|
| accuracy | binary | 0.5000 |
| precision | binary | 0.5159 |
| recall | binary | 0.3824 |
| roc_auc | binary | 0.5118 |

The Random Forest model predicts Cubs wins using only information available before each game. This makes the model more realistic than using final runs scored or final runs allowed from the same game.

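Accuracy is the overall share of correct classifications: 166 of 332 test games, or 50.0%. That is no better than a coin flip and slightly below the 51.2% that would result from predicting a win in every test game, since the Cubs won 170 of the 332. The confusion matrix shows where the errors fall: the model correctly flags only 65 of the 170 actual wins (recall 0.3824), and 65 of its 126 win predictions are correct (precision 0.5159). The ROC AUC of 0.5118 is barely above 0.5, so the pregame form variables carry little predictive signal in this test set.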
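The reported metrics can be reproduced from the confusion matrix counts. The sketch below treats "Win" as the positive class and reads the matrix with predictions in rows and actual outcomes in columns. That reading is an inference rather than something the report states, but it is the one that reproduces the reported precision and recall.

```python
# Confusion-matrix counts from the test set, with "Win" as the positive class.
tp, fp = 65, 61    # predicted Win: actual Win, actual Loss
fn, tn = 105, 101  # predicted Loss: actual Win, actual Loss

total = tp + fp + fn + tn

accuracy = (tp + tn) / total   # share of all games classified correctly
precision = tp / (tp + fp)     # share of predicted wins that were wins
recall = tp / (tp + fn)        # share of actual wins that were predicted

# Baseline: accuracy from predicting a win in every test game.
always_win_accuracy = (tp + fn) / total

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"always-win baseline: {always_win_accuracy:.4f}")
```

The first three values match the metrics table. The baseline shows that always predicting a win would have scored slightly higher than the model on these 332 games.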

The feature importance plot shows which pregame variables the Random Forest relied on most heavily. If recent run differential, recent win percentage, or recent run prevention rank highly, that suggests the model leans on short-term team quality. Because the test-set ROC AUC is only 0.5118, these rankings describe what the model used rather than confirming that those variables predict future Cubs outcomes.

Final Conclusion

The NCAA coaching analysis shows that coaching changes can be studied systematically using binary response models. The winning percentage model captures overall performance, while the scoring-based model separates offensive and defensive performance.

The logit and probit comparison shows that the scoring-based model is robust across two common binary response specifications: the signs, significance levels, and predicted probabilities agree closely.

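The machine learning section extends the assignment by using a Random Forest model to predict Chicago Cubs wins from pregame information. On the 2023-2025 test games the model reaches only 50.0% accuracy and a ROC AUC of 0.5118, so recent form alone gives little ability to predict individual game outcomes. Together, the assignment demonstrates both traditional econometric modeling and machine learning classification in sports analytics.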