Quantitative and Statistical Methods Summary

Types of regressions

Method	Process	Manual_Calculation	Assumptions	Advantages	Disadvantages	Parameter_Testing
OLS	Minimizes the sum of squared residuals to estimate parameters.	Specify the regression equation Y = Xβ + ε. Form X’X and X’Y, invert X’X, and compute β = (X’X)^(-1)X’Y.	Linearity, exogenous regressors (E[ε|X] = 0), no perfect multicollinearity, homoscedasticity, uncorrelated errors, normally distributed errors (needed only for exact small-sample inference).	Simple, easy to interpret; BLUE under Gauss-Markov assumptions.	Sensitive to outliers and multicollinearity; inefficient under heteroscedasticity or autocorrelation.	Use `regress`. Each coefficient is the expected change in Y for a one-unit change in that regressor, holding the others fixed. Test with t-tests and assess significance with p-values.
GLS	Adjusts for heteroscedasticity/autocorrelation by transforming the data or using weights.	Estimate the error variance structure (Ω). Transform the model using Ω^(-1/2): Y* = Ω^(-1/2)Y, X* = Ω^(-1/2)X. Apply OLS to the transformed model.	Correctly specified structure of heteroscedasticity or autocorrelation.	Efficient under heteroscedasticity or autocorrelation.	Requires knowledge/estimation of error structure; loses its efficiency advantage and gives unreliable standard errors if the structure is misspecified.	Use `prais` or `xtgls`. Test similar to OLS but interpret results within the transformed context.
2SLS	Stage 1: Regress endogenous variables on instruments. Stage 2: Use fitted values in the main regression.	Stage 1: Regress endogenous variables (Z) on instruments (W): Z = Wπ + ν. Obtain predicted values (Z-hat). Replace Z in the main equation: Y = Z-hatβ + ε. Apply OLS. The coefficients match 2SLS, but the standard errors from a manual second stage are not valid.	Instruments must be relevant (correlated with endogenous variables) and exogenous (uncorrelated with errors).	Corrects endogeneity bias; consistent under valid instruments.	Sensitive to weak instruments; relies on instrument validity.	Use `ivregress 2sls`. Test instrument relevance with first-stage F-statistics (`estat firststage`) and test overidentifying restrictions with `estat overid` (Sargan and Basmann tests after 2SLS; Hansen J-test after `ivregress gmm`).
GMM	Minimizes a weighted quadratic form in the sample moment conditions derived from data and model.	Define moment conditions E[g(Z, θ)] = 0. Choose weighting matrix W. Minimize J(θ) = g(Z, θ)’Wg(Z, θ), where g(Z, θ) is the sample average of the moments. Solve for θ (parameters).	Valid moment conditions; a positive definite weighting matrix (the optimal choice matters for efficiency, not consistency).	Flexible under heteroscedasticity or autocorrelation; handles overidentified models.	Computationally intensive; sensitive to weight matrix choice in finite samples.	Use `gmm` or the user-written `xtabond2`. Test overidentifying restrictions with Hansen J-test. Interpret coefficients based on moment conditions.
GME	Maximizes entropy subject to constraints (data and prior information).	Define entropy H = -∑p_i ln(p_i). Specify constraints (e.g., Xβ = Y). Maximize H subject to constraints. Solve for β and p_i (entropy weights).	Requires carefully chosen constraints; small or ill-posed data.	Handles multicollinearity and small samples; incorporates prior information.	Relatively uncommon; computationally demanding; interpretation depends on entropy weights.	Limited support in Stata. Often requires external packages or manual programming. Coefficients depend on entropy constraints and weights.
MLE	Maximizes the likelihood function of the data given the model.	Define the likelihood L(θ) = Πf(y_i | X_i, θ). Take log: ln(L(θ)). Differentiate ln(L(θ)) w.r.t. θ, set to zero. Solve for θ (MLE estimates).	Correct specification of likelihood function; errors are i.i.d.	Asymptotically efficient and consistent; flexible for non-linear models.	Sensitive to misspecified likelihood; computationally intensive.	Use `ml`, `logit`, `probit`. Interpret likelihood values and use LR tests for model comparison. Interpret coefficients based on likelihood estimation.
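
The OLS and 2SLS manual calculations in the table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated, the helper name `ols` and the coefficient values (intercept 1, slope 2) are hypothetical, and the model has a single regressor and a single instrument so the normal equations reduce to a 2x2 system.

```python
import random

def ols(x, y):
    """Solve the normal equations (X'X) b = X'y for X = [1, x].
    Returns (intercept, slope)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    # X'X = [[n, sx], [sx, sxx]] and X'y = [sy, sxy]; invert the 2x2 by hand.
    det = n * sxx - sx * sx
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

# Simulated data (hypothetical): x is endogenous because it shares the
# unobserved component u with the error term of y.
random.seed(0)
n = 20000
z = [random.gauss(0, 1) for _ in range(n)]      # instrument
u = [random.gauss(0, 1) for _ in range(n)]      # unobserved confounder
x = [zi + ui for zi, ui in zip(z, u)]           # endogenous regressor
y = [1.0 + 2.0 * xi + ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

# Plain OLS: the slope is pushed away from the true value of 2.
b_ols = ols(x, y)

# 2SLS. Stage 1: regress x on z and form fitted values.
a0, a1 = ols(z, x)
x_hat = [a0 + a1 * zi for zi in z]
# Stage 2: regress y on the fitted values.
b_2sls = ols(x_hat, y)

print("OLS slope: ", round(b_ols[1], 3))
print("2SLS slope:", round(b_2sls[1], 3))
```

In this setup the OLS slope converges to 2.5 rather than 2, while the 2SLS slope is close to 2. The second-stage standard errors from a manual calculation like this are not valid, which is one reason to prefer `ivregress 2sls` in practice.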

Common Problems in Regression Analysis

Problem	Meaning	Consequences	Solution
Endogeneity	Occurs when an explanatory variable is correlated with the error term, often due to reverse causality, omitted variables, or measurement error.	Biased and inconsistent coefficient estimates; incorrect inference and policy recommendations.	Use instrumental variables (IV) or two-stage least squares (2SLS); include omitted variables; improve data quality.
Multicollinearity	Occurs when two or more independent variables are highly correlated, making it hard to estimate their individual effects.	Inflated standard errors, leading to low statistical significance and difficulty in determining the effect of each variable.	Center variables, drop one variable, or use ridge regression or principal component analysis (PCA).
Omitted Variable Bias	Happens when a relevant variable is excluded from the model, causing biased and inconsistent estimates.	Biased coefficient estimates; results cannot reliably reflect the true relationship between variables.	Include the omitted variable if data is available; use proxy variables; apply sensitivity analysis.
Heteroskedasticity	Occurs when the variance of the error term is not constant across observations.	Inefficient estimates, invalid hypothesis tests, and incorrect standard errors.	Use robust standard errors (e.g., White’s robust estimator) or generalized least squares (GLS).
Autocorrelation	Happens when error terms are correlated across observations, often in time-series data.	Biased standard errors, leading to invalid hypothesis tests and inefficient estimates.	Use Newey-West standard errors; model the autocorrelation structure (e.g., ARMA or Prais-Winsten regression).
Measurement Error	Occurs when the observed variables contain measurement errors, leading to biased and inconsistent parameter estimates.	Bias and inconsistency in parameter estimates; loss of reliability in results.	Use methods like instrumental variables (IV) to address measurement error; improve data collection methods.
Non-Linearity	Occurs when the relationship between the dependent and independent variables is not linear, violating the linearity assumption.	Incorrect model specification leads to biased estimates and poor predictive accuracy.	Apply non-linear models such as polynomial regression, log-transformation, or generalized additive models (GAM).
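
As one illustration of the consequences listed above, the sketch below simulates classical measurement error in a regressor. All numbers are hypothetical: the true slope is 2, and the noise added to the regressor has the same variance as the regressor itself, so the OLS slope should shrink toward roughly half its true value (attenuation bias).

```python
import random

def slope(x, y):
    """OLS slope of y on x (with intercept): cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

random.seed(1)
n = 20000
x_true = [random.gauss(0, 1) for _ in range(n)]          # unobserved true regressor
y = [2.0 * xt + random.gauss(0, 1) for xt in x_true]     # outcome, true slope = 2
x_obs = [xt + random.gauss(0, 1) for xt in x_true]       # regressor measured with error

b_true = slope(x_true, y)   # uses the error-free regressor
b_obs = slope(x_obs, y)     # uses the noisy regressor

print("slope with true x:    ", round(b_true, 3))
print("slope with observed x:", round(b_obs, 3))
```

With equal signal and noise variances the expected attenuation factor is 1 / (1 + 1) = 0.5, so the second slope should be near 1.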

Testing for the presence of regression problems

Problem	Test	Intuition_Process	Stata_Command	Statistic_and_Interpretation
Endogeneity	Durbin-Wu-Hausman Test	Compares OLS and IV estimates. If the IV estimates differ significantly from OLS, endogeneity is likely present.	`hausman` after storing the `ivregress` and `regress` estimates, or `estat endogenous` after `ivregress`	The test returns a chi-square statistic. Null: the regressors are exogenous, so OLS is consistent. A small p-value rejects the null and suggests endogeneity.
Multicollinearity	Variance Inflation Factor (VIF)	Checks if independent variables are highly correlated. A high VIF indicates multicollinearity.	`estat vif` after regression	VIF > 10 indicates high multicollinearity. Analyze the `VIF` values for each independent variable.
Omitted Variable Bias	No direct test, but look for model misfit and theoretical relevance.	Omitted variable bias cannot be directly tested but can be suspected when model fit is poor, residuals are large, or theoretical relationships are overlooked.	No specific command; examine model fit and theoretical relevance.	No direct statistic; look for patterns in residual plots, model misfit, or theoretical gaps.
Heteroskedasticity	Breusch-Pagan Test, White Test	Detects non-constant variance in the residuals. Breusch-Pagan tests variance as a function of independent variables; White’s test checks for heteroskedasticity without specifying a form.	`estat hettest` for Breusch-Pagan; `estat imtest, white` for White test	Breusch-Pagan: High chi-square values suggest heteroskedasticity. White: Similar chi-square interpretation, robust to forms of heteroskedasticity.
Autocorrelation	Durbin-Watson Test, Breusch-Godfrey LM Test	Tests whether error terms are serially correlated. Durbin-Watson focuses on adjacent residuals (first-order autocorrelation); Breusch-Godfrey handles higher-order autocorrelation.	`estat dwatson`; `estat bgodfrey`	The Durbin-Watson statistic ranges from 0 to 4. A value near 2 suggests no first-order autocorrelation; values well below 2 suggest positive autocorrelation and values well above 2 suggest negative autocorrelation. Breusch-Godfrey returns a chi-square statistic; a small p-value indicates autocorrelation.
Measurement Error	No direct test; look for inconsistent results or discrepancies in estimates.	Measurement error tests are often qualitative; look for issues in data collection or unexpected inconsistencies in results.	No specific command; address by improving data quality or using IV methods.	No formal statistic. Look for bias and inconsistencies in coefficients across models.
Non-Linearity	Ramsey RESET Test	Checks whether higher-order terms improve the fit of the model. Ramsey RESET uses powers of fitted values to test for specification errors.	`estat ovtest` for Ramsey RESET test	RESET: High F-statistic suggests non-linearity or omitted variable issues. Check the p-value.
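
To make the Durbin-Watson interpretation concrete, here is a minimal sketch of the statistic computed from a residual series. The function name and the two residual series are hypothetical; in Stata the same quantity comes from `estat dwatson`.

```python
def durbin_watson(resid):
    """Durbin-Watson statistic: sum of squared first differences of the
    residuals divided by the sum of squared residuals."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

# Residuals that stay on one side for a while: positive autocorrelation.
dw_pos = durbin_watson([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

# Residuals that flip sign every period: negative autocorrelation.
dw_neg = durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

print(round(dw_pos, 3))  # well below 2
print(round(dw_neg, 3))  # well above 2
```

The first series changes sign only once, so the numerator is small and the statistic falls well below 2. The second alternates every period, which pushes the statistic toward 4.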

Maximum Likelihood Estimation (MLE)

Summary of Process

Likelihood Function:
- Define \(L(\mu) = \prod_{i=1}^n f(y_i; \mu)\).
- For the exponential distribution: \(f(y_i; \mu) = \mu e^{-\mu y_i}\).
Log-Likelihood:
- Take the natural log: \(\ln L(\mu) = n \ln \mu - \mu \sum_{i=1}^n y_i\).
First-Order Condition:
- Differentiate \(\ln L(\mu)\) with respect to \(\mu\) and set it to zero: \[ \frac{\partial \ln L(\mu)}{\partial \mu} = \frac{n}{\mu} - \sum_{i=1}^n y_i = 0 \]
- Solve for \(\hat{\mu}\): \(\hat{\mu} = \frac{n}{\sum_{i=1}^n y_i} = \frac{1}{\bar{y}}\).
Fisher Information Matrix:
- Compute: \(I(\mu) = -E\left[\frac{\partial^2 \ln L(\mu)}{\partial \mu^2}\right] = \frac{n}{\mu^2}\).
Variance of the Estimator:
- The asymptotic variance of \(\hat{\mu}\): \(\text{Var}(\hat{\mu}) \approx \frac{1}{I(\mu)} = \frac{\mu^2}{n}\), evaluated at \(\hat{\mu}\) in practice.
Asymptotic Distribution:
- As \(n \to \infty\), the MLE satisfies: \[ \sqrt{n}(\hat{\mu} - \mu) \xrightarrow{d} N(0, \mu^2) \]
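
The steps above can be traced on a small data set. The following is a sketch, not a general-purpose routine: the four observations are made up, and the function names are hypothetical. It computes \(\hat{\mu} = 1/\bar{y}\), checks that the score is zero there, and derives the standard error from the Fisher information.

```python
import math

def loglik(mu, y):
    """Exponential log-likelihood: n ln(mu) - mu * sum(y)."""
    return len(y) * math.log(mu) - mu * sum(y)

def score(mu, y):
    """First derivative of the log-likelihood: n / mu - sum(y)."""
    return len(y) / mu - sum(y)

def exp_mle(y):
    """Return the MLE of the rate and its asymptotic standard error."""
    n = len(y)
    mu_hat = n / sum(y)               # solves the first-order condition
    fisher = n / mu_hat ** 2          # I(mu) evaluated at mu_hat
    se = math.sqrt(1.0 / fisher)      # sqrt(mu_hat^2 / n)
    return mu_hat, se

y = [0.5, 1.0, 1.5, 2.0]              # hypothetical data, mean 1.25
mu_hat, se = exp_mle(y)

print("mu_hat:", mu_hat)
print("score at mu_hat:", score(mu_hat, y))
print("standard error:", se)
```

For these values the sample mean is 1.25, so the estimate is 0.8 and the standard error is 0.8 / sqrt(4) = 0.4. Evaluating `loglik` slightly below and above 0.8 gives lower values, as expected at a maximum.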

Step-by-Step Summary

Write the likelihood function \(L(\mu)\).
Take the log-likelihood \(\ln L(\mu)\).
Differentiate \(\ln L(\mu)\) and solve for \(\hat{\mu}\).
Compute the Fisher Information \(I(\mu)\).
The asymptotic variance of \(\hat{\mu}\) is \(I(\mu)^{-1}\).
For large \(n\), \(\hat{\mu} \overset{a}{\sim} N(\mu, \mu^2 / n)\).
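
A small Monte Carlo experiment gives a rough check on the variance formula. This is a sketch under assumed settings (true rate 2, samples of size 200, 2000 replications, fixed seed), none of which come from the notes. It compares the empirical variance of \(\hat{\mu}\) across replications with \(\mu^2 / n\).

```python
import random
import statistics

random.seed(42)
mu, n, reps = 2.0, 200, 2000           # assumed simulation settings

estimates = []
for _ in range(reps):
    # random.expovariate takes the rate, so the mean of each draw is 1 / mu.
    sample = [random.expovariate(mu) for _ in range(n)]
    estimates.append(n / sum(sample))  # MLE: 1 / sample mean

emp_mean = statistics.mean(estimates)
emp_var = statistics.variance(estimates)
theory_var = mu ** 2 / n               # asymptotic variance from I(mu)^{-1}

print("mean of estimates:   ", round(emp_mean, 4))
print("empirical variance:  ", round(emp_var, 5))
print("theoretical variance:", round(theory_var, 5))
```

The two variances should be close but not identical. The formula is asymptotic, and in finite samples \(\hat{\mu}\) is slightly biased upward.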

Method of Moments Estimation (MME)

Summary of Process

Moment Condition:
- Use \(E[y_i] = \frac{1}{\mu}\).
Set the Moment Condition:
- Replace the population mean with the sample mean, \(\bar{y} = \frac{1}{\hat{\mu}}\), and solve for \(\mu\): \(\hat{\mu}_{MME} = \frac{1}{\bar{y}}\).
Variance of the Estimator:
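- By the delta method with \(g(\bar{y}) = 1/\bar{y}\) and \(\text{Var}(\bar{y}) = \frac{1}{n\mu^2}\): \[ \text{Var}(\hat{\mu}) \approx \left(g'(1/\mu)\right)^2 \text{Var}(\bar{y}) = \mu^4 \cdot \frac{1}{n\mu^2} = \frac{\mu^2}{n} \]
Asymptotic Distribution:
- For large \(n\), the MME estimator follows: \[ \hat{\mu}_{MME} \overset{a}{\sim} N(\mu, \frac{\mu^2}{n}) \]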
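
The method-of-moments variance can also be estimated without using the exponential form, by plugging the sample variance of \(y\) into the delta method. The sketch below does this on simulated data; the true rate, sample size, and seed are assumptions for illustration. It then compares the result with the model-based value \(\hat{\mu}^2 / n\).

```python
import random

random.seed(7)
mu_true, n = 2.0, 5000                 # assumed simulation settings
y = [random.expovariate(mu_true) for _ in range(n)]

ybar = sum(y) / n
mu_mme = 1.0 / ybar                    # solves ybar = 1 / mu

# Sample variance of y (divisor n - 1).
s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)

# Delta method: g(t) = 1 / t, g'(t) = -1 / t^2, so g'(ybar)^2 = mu_mme^4.
var_delta = mu_mme ** 4 * s2 / n

# Model-based variance, which uses Var(y) = 1 / mu^2 for the exponential.
var_model = mu_mme ** 2 / n

print("mu_mme:              ", round(mu_mme, 4))
print("delta-method variance:", var_delta)
print("model-based variance: ", var_model)
```

When the data really are exponential the two variance estimates should roughly agree. A large gap would suggest that the variance implied by the model does not match the data.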

Step-by-Step Summary

Take the moment condition \(E[y_i] = \frac{1}{\mu}\).
Solve for \(\hat{\mu}_{MME} = \frac{1}{\bar{y}}\).
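Compute the variance:
- \(m_\mu = \frac{\partial m(y_i, \mu)}{\partial \mu} = \frac{1}{\mu^2}\), with \(m(y_i, \mu) = y_i - \frac{1}{\mu}\).
- \(\text{Var}(m(y_i, \mu)) = \text{Var}(y_i) = \frac{1}{\mu^2}\).
Use \(\text{Var}(\hat{\mu}) = \frac{\text{Var}(m(y_i, \mu))}{n \, m_\mu^2} = \mu^2 / n\).
For large \(n\), \(\hat{\mu}_{MME} \overset{a}{\sim} N(\mu, \mu^2 / n)\).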

Generalized Method of Moments (GMM)

Summary of Process

Moment Conditions:
- Define, using the exponential mean \(E[y_i] = \frac{1}{\mu}\): \(E[m(y_i, \mu)] = E\left[y_i - \frac{1}{\mu}\right] = 0\).
Minimization Problem:
- Solve: \[ \min_\mu Q(\mu) = m_n(\mu)' W_n^{-1} m_n(\mu) \]
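- \(m_n(\mu) = \frac{1}{n} \sum_{i=1}^n m(y_i, \mu)\) is the vector of sample moments, and \(W_n^{-1}\) is the weighting matrix.
Optimal Weights:
- Set \(W_n = V_n\), where \(V_n = \text{Cov}[m(y_i, \mu)]\), so that the weighting matrix is \(W_n^{-1} = V_n^{-1}\).
Variance of the Estimator:
- With \(m_\mu = E\left[\frac{\partial m(y_i, \mu)}{\partial \mu}\right]\) and the optimal weights, compute: \[ \text{Var}(\hat{\mu}_{GMM}) = \frac{1}{n} (m_\mu' W_n^{-1} m_\mu)^{-1} \]
Asymptotic Distribution:
- For large \(n\), the GMM estimator follows: \[ \hat{\mu}_{GMM} \overset{a}{\sim} N(\mu, \text{Var}(\hat{\mu}_{GMM})) \]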
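
A two-step GMM estimate for the exponential rate can be sketched as follows. Several choices here are assumptions made for illustration, not part of the notes: a second moment condition \(E[y_i^2] = 2/\mu^2\) is added so that the model is overidentified, the data are simulated, and the minimization uses a simple grid search rather than a numerical optimizer. In the code, `v` is the estimated covariance of the moments and its inverse plays the role of the weighting matrix.

```python
import math
import random

def q_value(m, w):
    """Quadratic form m' W m for a 2-vector m and symmetric 2x2 matrix W."""
    a, b = m
    return w[0][0] * a * a + 2.0 * w[0][1] * a * b + w[1][1] * b * b

def inv2(v):
    """Inverse of a 2x2 matrix."""
    det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    return [[v[1][1] / det, -v[0][1] / det],
            [-v[1][0] / det, v[0][0] / det]]

def minimise(y1, y2, w, grid):
    """Grid search for the mu that minimizes Q(mu).
    Sample moments: (ybar - 1/mu, mean(y^2) - 2/mu^2)."""
    return min(grid, key=lambda mu: q_value((y1 - 1.0 / mu, y2 - 2.0 / mu ** 2), w))

random.seed(0)
n = 5000
y = [random.expovariate(2.0) for _ in range(n)]   # simulated, true rate 2
y1 = sum(y) / n                                   # sample mean of y
y2 = sum(yi * yi for yi in y) / n                 # sample mean of y^2
grid = [0.5 + 0.001 * k for k in range(4501)]     # candidate values 0.5 .. 5.0

# Step 1: identity weighting matrix gives a consistent first estimate.
mu1 = minimise(y1, y2, [[1.0, 0.0], [0.0, 1.0]], grid)

# Step 2: estimate the moment covariance at mu1 and use its inverse as weights.
m = [(yi - 1.0 / mu1, yi * yi - 2.0 / mu1 ** 2) for yi in y]
v11 = sum(a * a for a, _ in m) / n
v12 = sum(a * b for a, b in m) / n
v22 = sum(b * b for _, b in m) / n
w_opt = inv2([[v11, v12], [v12, v22]])
mu_gmm = minimise(y1, y2, w_opt, grid)

# Variance: (1/n) (G' V^{-1} G)^{-1}, with G the derivative of the moments.
g = (1.0 / mu_gmm ** 2, 4.0 / mu_gmm ** 3)
se_gmm = math.sqrt(1.0 / (n * q_value(g, w_opt)))

print("first-step estimate:", round(mu1, 3))
print("two-step estimate:  ", round(mu_gmm, 3))
print("standard error:     ", round(se_gmm, 4))
```

For the exponential distribution the population value of \((G' V^{-1} G)^{-1}\) works out to \(\mu^2\), so the standard error here should be near \(\mu / \sqrt{n}\), the same as for the MLE.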

Step-by-Step Summary

Define the moment condition vector \(m(y_i, \mu)\).
Set the minimization problem \(\min_\mu Q(\mu)\).
Form the weighting matrix \(A = W_n^{-1}\).
Use optimal weights \(W_n = V_n\), so that \(A = V_n^{-1}\).
Compute \(\text{Var}(\hat{\mu}) = \frac{1}{n}(m_\mu' W_n^{-1} m_\mu)^{-1}\).
For large \(n\), \(\hat{\mu}_{GMM} \overset{a}{\sim} N(\mu, \text{Var}(\hat{\mu}))\).

Key Comparisons

Method	Estimator	Variance	Key_Assumptions
MLE	\(\hat{\mu}_{MLE} = \frac{1}{\bar{y}}\)	\(\frac{\mu^2}{n}\)	Correctly specified likelihood function
MME	\(\hat{\mu}_{MME} = \frac{1}{\bar{y}}\)	\(\frac{\mu^2}{n}\)	Validity of the moment condition
GMM	\(\hat{\mu}_{GMM} = \arg \min_\mu Q(\mu)\)	\(\frac{1}{n}(m_\mu' W_n^{-1} m_\mu)^{-1}\)	Correctly specified moment conditions