Introduction

This assignment has two parts.

First, I analyze NCAA football coaching changes using logit and probit models. The dependent variable is cochange, where 1 means a coaching change occurred and 0 means no coaching change occurred.

Second, I use machine learning to predict Chicago Cubs game outcomes with a Random Forest model. This satisfies the machine learning portion of the assignment using my own sports dataset.

Question 1: NCAA Football Coaching Changes

The first model estimates the probability of a coaching change as a function of recent winning percentage, games coached at the current school, winning percentage at the current school, and total career games coached.

Coaching Change Summary
Observations	Coaching Changes	No Coaching Change	Coaching Change Rate
4053	690	3363	17.0%

Logit Model Results Using Winning Percentage
Variable	Odds Ratio	Lower 95% CI	Upper 95% CI	p-value	Significance
Intercept	0.4349	0.3485	0.5412	0.0000	***
Recent Season Win %	0.0666	0.0348	0.1268	0.0000	***
Games at Current School	1.0061	1.0033	1.0091	0.0000	***
Win % at Current School	2.4066	1.1081	5.2335	0.0265	*
Career Games Coached	0.9972	0.9949	0.9995	0.0178	*

This logit model evaluates whether coaching turnover is associated with short-term performance, school-specific success, tenure at the current school, and overall career experience.

Odds ratios above 1 mean that higher values are associated with higher odds of a coaching change. Odds ratios below 1 mean that higher values are associated with lower odds of a coaching change.
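Because an odds ratio describes a one-unit change, it helps to rescale it to a change that is realistic for each variable. The sketch below does this for two rows of the table above. It assumes winning percentage is measured on a 0 to 1 scale (consistent with the scenario values used later in this section), and the 12-game step for tenure is a hypothetical choice made only for illustration.

```python
import math

# Odds ratios copied from the winning-percentage logit table.
odds_ratios = {
    "Recent Season Win %": 0.0666,
    "Games at Current School": 1.0061,
    "Win % at Current School": 2.4066,
    "Career Games Coached": 0.9972,
}


def odds_multiplier(odds_ratio, delta):
    """Multiplicative change in the odds for a change of `delta` units.

    A logit coefficient is log(odds_ratio), so a change of `delta` units
    multiplies the odds by exp(log(odds_ratio) * delta).
    """
    return math.exp(math.log(odds_ratio) * delta)


# Assumption: win % is on a 0-1 scale, so 0.10 means ten percentage points.
recent_10pts = odds_multiplier(odds_ratios["Recent Season Win %"], 0.10)

# Hypothetical step: 12 additional games at the current school.
tenure_12games = odds_multiplier(odds_ratios["Games at Current School"], 12)

print(f"Odds multiplier, +0.10 recent win %: {recent_10pts:.3f}")
print(f"Odds multiplier, +12 games at school: {tenure_12games:.3f}")
```

Under these assumptions, a ten-point improvement in recent winning percentage multiplies the odds of a change by roughly 0.76, while 12 more games at the school multiplies them by a little under 1.08.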

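Recent winning percentage captures immediate performance pressure, and it has by far the strongest estimated association: an odds ratio of 0.0666 means the odds of a change fall sharply as recent winning percentage rises. Winning percentage at the current school captures the coach’s broader record at that institution; its odds ratio of 2.4066 is above 1, so once recent performance is held fixed, a stronger long-run school record is associated with higher odds of a change. Games coached at the current school measures tenure, while career games coached measures overall experience; both effects are small per game (odds ratios of 1.0061 and 0.9972).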

Predicted Probability of Coaching Change
Scenario	Recent Win %	School Win %	Predicted Probability
Poor Recent Performance	0.25	0.3	23.7%
Average Performance	0.50	0.5	15.9%
Strong Performance	0.75	0.7	10.2%

The predicted probability table translates the logit model into a more intuitive form. Instead of only reading coefficients or odds ratios, the table shows how the estimated probability of a coaching change changes under poor, average, and strong performance scenarios.
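The link between the odds ratios and the scenario table can be checked by hand. The sketch below starts from the reported average-scenario probability (15.9%), shifts its log-odds by the change in the two winning percentage variables, and converts back to a probability. It assumes the tenure and career variables are held at the same values in all three scenarios, which the table does not state explicitly; the function names are illustrative.

```python
import math


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Probability implied by a log-odds value."""
    return 1.0 / (1.0 + math.exp(-x))


# Coefficients recovered from the reported odds ratios.
b_recent = math.log(0.0666)   # Recent Season Win %
b_school = math.log(2.4066)   # Win % at Current School

# Reported average scenario: recent win % 0.50, school win % 0.5.
base_logodds = logit(0.159)


def shifted_probability(d_recent, d_school):
    """Probability after moving both win % variables away from the average.

    Assumes every other predictor stays at its average-scenario value.
    """
    return inv_logit(base_logodds + b_recent * d_recent + b_school * d_school)


p_poor = shifted_probability(0.25 - 0.50, 0.3 - 0.5)
p_strong = shifted_probability(0.75 - 0.50, 0.7 - 0.5)

print(f"Poor recent performance: {p_poor:.1%}")
print(f"Strong performance:      {p_strong:.1%}")
```

The results land within about a tenth of a percentage point of the 23.7% and 10.2% in the table; the small gap is what rounding in the published odds ratios and probabilities would produce.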

Scoring-Based Logit Model Results
Variable	Odds Ratio	Lower 95% CI	Upper 95% CI	p-value	Significance
Intercept	0.2088	0.1235	0.3519	0.0000	***
Recent Points For	0.9971	0.9958	0.9983	0.0000	***
Recent Points Against	1.0011	0.9997	1.0025	0.1128
Games at Current School	1.0186	1.0055	1.0319	0.0053	**
Points For at Current School	0.9991	0.9988	0.9995	0.0000	***
Points Against at Current School	1.0007	1.0003	1.0010	0.0002	***
Career Games Coached	0.9971	0.9948	0.9994	0.0149	*

The second logit model replaces winning percentage with points for and points against. This gives a more detailed performance model because it separates offensive production from defensive performance.

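Here PF and PA denote recent points for and against, and pfsc and pasc denote points for and against at the current school. Both offensive measures have odds ratios below 1 (0.9971 for PF and 0.9991 for pfsc), so stronger offensive production is associated with lower odds of a coaching change. Both defensive measures have odds ratios above 1, so allowing more points is associated with higher odds of a change, although only pasc (1.0007, p = 0.0002) is statistically significant; PA (1.0011, p = 0.1128) is not.

One limitation is that pfsc and pasc are cumulative school totals. Coaches who remain at a school longer naturally accumulate more points for and against, so those variables partially overlap with gamesc, the number of games coached at the current school.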

Comparison of Logit Model Specifications
Model	AIC	Residual Deviance	Observations
Winning Percentage Logit	3572.873	3562.873	4053
Scoring-Based Logit	3525.792	3511.792	4053

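The scoring-based model is preferred on two grounds. It separates offense and defense instead of relying only on winning percentage, and it fits better: its AIC of 3525.792 is about 47 points lower than the winning percentage model’s 3572.873. A lower AIC indicates better relative fit after accounting for model complexity, which matters here because the scoring-based model estimates two additional parameters.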
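The AIC values in the comparison table can be reproduced from the residual deviances. For a model fit to individual 0/1 outcomes, the residual deviance equals minus twice the log-likelihood, so AIC is the deviance plus twice the number of estimated parameters. The sketch below assumes the parameter counts are the number of rows (including the intercept) in each coefficient table: 5 and 7.

```python
def aic_from_deviance(residual_deviance, n_params):
    """AIC for a binary-outcome model: deviance plus a penalty of 2 per parameter."""
    return residual_deviance + 2 * n_params


# Residual deviances from the model comparison table.
win_pct_aic = aic_from_deviance(3562.873, n_params=5)
scoring_aic = aic_from_deviance(3511.792, n_params=7)

# Positive gap means the scoring-based model has the lower (better) AIC.
aic_gap = win_pct_aic - scoring_aic

print(f"Winning percentage logit AIC: {win_pct_aic:.3f}")
print(f"Scoring-based logit AIC:      {scoring_aic:.3f}")
print(f"AIC gap:                      {aic_gap:.3f}")
```

The deviance falls by about 51 points for two extra parameters, which leaves a net AIC improvement of about 47 after the penalty.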

Logit and Probit Coefficient Comparison
Model	Variable	Coefficient	Std. Error	z-statistic	p-value	Significance
Logit Model
Logit	Intercept	-1.56655	0.26703	-5.86648	0.00000	***
Logit	Recent Points For	-0.00294	0.00063	-4.70098	0.00000	***
Logit	Recent Points Against	0.00111	0.00070	1.58579	0.11279
Logit	Games at Current School	0.01841	0.00660	2.78771	0.00531	**
Logit	Points For at Current School	-0.00086	0.00017	-5.02210	0.00000	***
Logit	Points Against at Current School	0.00069	0.00018	3.78241	0.00016	***
Logit	Career Games Coached	-0.00286	0.00117	-2.43412	0.01493	*
Probit Model
Probit	Intercept	-0.95107	0.14754	-6.44619	0.00000	***
Probit	Recent Points For	-0.00154	0.00034	-4.53451	0.00001	***
Probit	Recent Points Against	0.00060	0.00039	1.52538	0.12716
Probit	Games at Current School	0.00976	0.00369	2.64666	0.00813	**
Probit	Points For at Current School	-0.00045	0.00009	-4.76121	0.00000	***
Probit	Points Against at Current School	0.00037	0.00010	3.66577	0.00025	***
Probit	Career Games Coached	-0.00164	0.00064	-2.57415	0.01005	*

Logit and Probit Model Fit Comparison
Model	AIC	Residual Deviance	Observations
Logit	3525.792	3511.792	4053
Probit	3532.655	3518.655	4053

Predicted Probability Comparison
Scenario	PF	PA	pfsc	pasc	Logit Predicted Probability	Probit Predicted Probability
Weak Scoring Profile	205	306	409	1445	33.8%	32.3%
Average Scoring Profile	261	252	845	853	15.7%	16.0%
Strong Scoring Profile	333	202	1614	445	5.2%	5.1%

The logit and probit models are both appropriate for binary dependent variables. The main difference is the assumed distribution behind the model: logit uses the logistic distribution, while probit uses the standard normal distribution.
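The two link functions can be written in a few lines. The sketch below applies each one to its own model’s intercept from the coefficient table. The intercept corresponds to all predictors being zero, which is not a realistic coach, so the numbers are only meant to show that different index values map to similar probabilities; the function names are illustrative.

```python
import math


def logistic_cdf(x):
    """Logit link: cumulative distribution function of the standard logistic."""
    return 1.0 / (1.0 + math.exp(-x))


def normal_cdf(x):
    """Probit link: cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Intercepts from the logit and probit coefficient table.
logit_intercept = -1.56655
probit_intercept = -0.95107

p_logit = logistic_cdf(logit_intercept)
p_probit = normal_cdf(probit_intercept)

print(f"Logit probability at the intercept:  {p_logit:.4f}")
print(f"Probit probability at the intercept: {p_probit:.4f}")
```

Both links return 0.5 at an index of zero and are symmetric around it; the logistic distribution simply has heavier tails and a larger variance, which is why the logit index is on a wider scale.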

The raw coefficients should not be compared directly because they are on different scales. Instead, the important comparison is whether the signs, significance levels, model fit, and predicted probabilities tell the same story.
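One way to see the scale difference is to divide each logit coefficient by its probit counterpart. A commonly cited rule of thumb puts that ratio somewhere around 1.6 to 1.8, so ratios in that neighborhood with matching signs suggest the two models agree. The sketch below uses the coefficients from the comparison table; the dictionary layout is just a convenient illustration.

```python
# Coefficients copied from the logit and probit comparison table.
logit_coefs = {
    "Intercept": -1.56655,
    "Recent Points For": -0.00294,
    "Recent Points Against": 0.00111,
    "Games at Current School": 0.01841,
    "Points For at Current School": -0.00086,
    "Points Against at Current School": 0.00069,
    "Career Games Coached": -0.00286,
}
probit_coefs = {
    "Intercept": -0.95107,
    "Recent Points For": -0.00154,
    "Recent Points Against": 0.00060,
    "Games at Current School": 0.00976,
    "Points For at Current School": -0.00045,
    "Points Against at Current School": 0.00037,
    "Career Games Coached": -0.00164,
}

# Ratio of logit to probit coefficient for each variable.
ratios = {name: logit_coefs[name] / probit_coefs[name] for name in logit_coefs}

# True when every variable has the same sign in both models.
same_sign = all(logit_coefs[n] * probit_coefs[n] > 0 for n in logit_coefs)

for name, ratio in ratios.items():
    print(f"{name:35s} {ratio:.2f}")
print("Signs agree for every variable:", same_sign)
```

Here every ratio falls between about 1.6 and 1.9 and every sign matches. Several coefficients are reported to only two or three significant digits, so the individual ratios are approximate.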

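The logit and probit predicted probabilities are close across the scoring scenarios, differing by at most 1.5 percentage points (33.8% versus 32.3% for the weak profile). The signs and significance levels of the coefficients also agree, so the results are robust to the choice of binary response model. The logit has the slightly lower AIC (3525.792 versus 3532.655).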

Question 2: Chicago Cubs Random Forest Model

For the machine learning section, I use a Random Forest classification model to predict whether the Chicago Cubs win a game.

The model uses only pregame information based on recent team form, including 10-game rolling win percentage, recent scoring, recent runs allowed, recent run differential, home/away status, and month.

This avoids data leakage because the model does not use final score information from the game it is trying to predict.
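The leakage rule comes down to the order of two steps: record the feature for a game first, and only then add that game’s result to the rolling window. The sketch below illustrates this for a rolling win percentage. The function name, the toy results, and the choice to use a partial window for early games are all assumptions made for illustration; the actual feature pipeline may handle the first games of a season differently.

```python
from collections import deque


def rolling_pregame_win_pct(results, window=10):
    """Win percentage over the previous `window` games, known before each game.

    `results` is a chronological list of 1 (win) and 0 (loss). The value for
    game i uses only games before i; it is None when there is no prior game.
    """
    recent = deque(maxlen=window)  # holds at most `window` past results
    features = []
    for won in results:
        # Step 1: compute the feature from past games only.
        features.append(sum(recent) / len(recent) if recent else None)
        # Step 2: only now reveal the current game's result to the window.
        recent.append(won)
    return features


# Hypothetical five-game stretch with a 3-game window for readability.
toy_results = [1, 0, 1, 1, 0]
toy_features = rolling_pregame_win_pct(toy_results, window=3)
print(toy_features)
```

Swapping the two steps inside the loop would let each game’s own result enter its feature, which is exactly the leakage the model is designed to avoid.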

Chicago Cubs Machine Learning Dataset Summary
Total Games	Training Games: 2014-2022	Testing Games: 2023-2025	Overall Win Rate
1009	677	332	48.3%

Random Forest Test Set Accuracy
Metric	Estimator	Estimate
accuracy	binary	0.5

Random Forest Confusion Matrix
Predicted	Actual Loss	Actual Win
Loss	101	105
Win	61	65

Random Forest Classification Metrics
Metric	Estimator	Estimate
accuracy	binary	0.5000
precision	binary	0.5159
recall	binary	0.3824
roc_auc	binary	0.5118

The Random Forest model predicts Cubs wins using only information available before each game. This makes the model more realistic than using final runs scored or final runs allowed from the same game.

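The accuracy table shows the overall share of correct classifications: 0.5000, or 166 of 332 test games, which is no better than a coin flip. The confusion matrix shows how often the model correctly identifies wins and losses; it is right on 101 losses and 65 wins. Precision (0.5159) is the share of predicted wins that were actual wins, recall (0.3824) is the share of actual wins the model predicted, and the ROC AUC of 0.5118 is only marginally above the 0.5 expected from random ranking.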
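The reported metrics can be recomputed directly from the confusion matrix. The sketch below assumes the rows are predictions, the columns are actual outcomes, and Win is the positive class; that reading is an inference, but it is the one that reproduces the reported precision and recall. The comparison against always predicting a win is an added reference point, not a figure from the original tables.

```python
# Confusion matrix cells, read as rows = predicted, columns = actual.
true_neg, false_neg = 101, 105   # predicted Loss: actual Loss, actual Win
false_pos, true_pos = 61, 65     # predicted Win:  actual Loss, actual Win

total = true_neg + false_neg + false_pos + true_pos

accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)   # predicted wins that were wins
recall = true_pos / (true_pos + false_neg)      # actual wins that were predicted

# Reference point: accuracy of a rule that predicts a win for every game.
always_win_accuracy = (true_pos + false_neg) / total

print(f"Test games:          {total}")
print(f"Accuracy:            {accuracy:.4f}")
print(f"Precision:           {precision:.4f}")
print(f"Recall:              {recall:.4f}")
print(f"Always-win accuracy: {always_win_accuracy:.4f}")
```

The Cubs won 170 of the 332 test games, so the always-win rule would have scored slightly above the model’s 0.5000.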

The feature importance plot shows which pregame variables the Random Forest relied on most heavily. If recent run differential, recent win percentage, or recent run prevention rank highly, that suggests short-term team quality is important for predicting future Cubs outcomes.

Final Conclusion

The NCAA coaching analysis shows that coaching changes can be studied systematically using binary response models. The winning percentage model captures overall performance, while the scoring-based model separates offensive and defensive performance.

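The logit and probit comparison shows that the preferred scoring-based model is robust across two common binary response specifications: coefficient signs and significance levels agree, and predicted probabilities differ by at most 1.5 percentage points.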

The machine learning section extends the assignment by using a Random Forest model to predict Chicago Cubs wins from pregame information, although its test accuracy of 50% suggests that recent form alone carries little predictive signal in this sample. Together, the assignment demonstrates both traditional econometric modeling and machine learning classification in sports analytics.

Econometrics Assignment #8

Coaching Turnover and Machine Learning Analysis

DJ Barry
