- The sums of squares are calculated in the same way as for simple regression: SS_T, SS_M, and SS_R.
- Assessing the Regression model:
- R^2 has the same interpretation (how much variation is accounted for by the model), except that now R produces the Multiple R^2.
- However, R^2 will always increase with the addition of more variables.
- The Akaike Information Criterion (AIC) is a measure of fit that penalises models for having more variables, serving a similar purpose to adjusted R^2.
- It is a measure of parsimony-adjusted model fit.
- Another is the Bayesian Information Criterion (BIC).
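A minimal sketch of comparing these fit measures in R, assuming a hypothetical data frame `df` with outcome `y` and predictors `x1`, `x2` (all names are illustrative, not from the original notes):

```r
# Fit two nested models on the hypothetical data frame df
m1 <- lm(y ~ x1, data = df)
m2 <- lm(y ~ x1 + x2, data = df)

summary(m2)$adj.r.squared  # adjusted R^2: penalised for extra predictors
AIC(m1); AIC(m2)           # lower AIC = better parsimony-adjusted fit
BIC(m1); BIC(m2)           # BIC penalises additional parameters more heavily
```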
- Assessing individual predictors — Feature Selection:
- Add predictors based on past research
- Add predictors based on theoretical/conceptual/logical significance, in order of importance.
- Order may matter if predictors are correlated.
- Forced entry: all predictors are entered simultaneously, in no particular order.
- Step-wise regression: forward, backward, both, all subsets
- Step-wise methods assess the fit of each candidate variable given the variables already selected.
- The all-subsets method assesses every combination of predictors using Mallows’ C_p.
- Step-wise methods are best suited to exploratory analysis; for other types of analyses, it is important to cross-validate. (These selection methods are sketched in R below.)
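A sketch of the selection methods, reusing the hypothetical `df` and adding a third illustrative predictor `x3`; `step()` is base R (it selects on AIC), and all-subsets search is available via the `leaps` package:

```r
null <- lm(y ~ 1, data = df)             # intercept-only model
full <- lm(y ~ x1 + x2 + x3, data = df)  # model with all candidate predictors

step(null, scope = formula(full), direction = "forward")  # forward selection
step(full, direction = "backward")                        # backward elimination
step(null, scope = formula(full), direction = "both")     # both directions

# All subsets, assessed with Mallows' C_p:
# library(leaps)
# summary(regsubsets(y ~ x1 + x2 + x3, data = df))$cp
```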
- Final assessment of the Regression model:
- Check if the model represents all the data well or just some influential data. Such cases may not be found by looking at residuals, since the regression line may have been pulled towards the datum, leaving it with a low residual. Cook’s distance, however, may show that the datum lies away from the other data.
- Check if the model generalises well (a balance between bias and variance).
- Bias: error from a model that is too simple, underfitting and missing real structure in the data.
- Variance: error from a model that is too sensitive to the particular sample, overfitting and failing to generalise.
- Residuals: unstandardised residuals are in the same units as the outcome variable, making them difficult to compare across different models. Hence, standardised residuals are used.
- Outliers: points that lie away from the general trend. They may bias our model, since they affect the values of the regression coefficients.
- Influential cases: May bias our model.
- Cook’s distance checks the influence of a single case on the model as a whole; values > 1 = bad!
- Leverage, which measures the influence of the observed value of the outcome variable over the predicted values; the average leverage is (k + 1)/N, where k is the number of predictors.
- DFBeta: the change in a regression coefficient when a given case is excluded.
- DFFit: the difference between a case’s predicted value with and without that case included in estimation.
- Covariance ratio: the effect of a case on the variance of the regression parameters. (All of these diagnostics are sketched in R after this list.)
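All of these diagnostics are available in base R for the hypothetical model `m2` fitted earlier; a minimal sketch:

```r
rstandard(m2)       # standardised residuals
cooks.distance(m2)  # Cook's distance; values > 1 are a concern
hatvalues(m2)       # leverage; average leverage is (k + 1)/N
dfbeta(m2)          # change in each coefficient when a case is deleted
dffits(m2)          # change in the fitted value when a case is deleted
covratio(m2)        # effect of each case on the variance of the estimates
```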
- To check that the model generalises well: cross-validate.
- Cross-validation:
- Refers to assessing the accuracy of the model across different samples from the same population; used to check whether the model generalises properly.
- Method #1: Adjusted R^2
- Method #2: Data Splitting
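A minimal data-splitting sketch for the hypothetical `df`: fit on a training subset, then assess predictive accuracy on the held-out subset (the 80/20 split is an arbitrary choice for illustration):

```r
set.seed(1)  # make the split reproducible
idx   <- sample(nrow(df), size = floor(0.8 * nrow(df)))
train <- df[idx, ]
test  <- df[-idx, ]

fit  <- lm(y ~ x1 + x2, data = train)
pred <- predict(fit, newdata = test)

cor(pred, test$y)^2  # cross-validated R^2: squared correlation of predicted vs observed
```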
- Confidence intervals: given the estimate, the standard error of the estimate, and the degrees of freedom, we can calculate confidence intervals.
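In R, `confint()` computes these directly; the by-hand version below makes the estimate/standard-error/degrees-of-freedom recipe explicit (again using the hypothetical `m2`):

```r
confint(m2, level = 0.95)  # 95% CIs for all coefficients

# Equivalent by hand: estimate +/- t_crit * SE
est   <- coef(m2)
se    <- coef(summary(m2))[, "Std. Error"]
tcrit <- qt(0.975, df = df.residual(m2))
cbind(lower = est - tcrit * se, upper = est + tcrit * se)
```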
- Models can be compared using the F-ratio, which tests the significance of the change in R^2, allowing the R^2 of different models to be compared.
- To compare models like so, they must be nested: model 1 contains predictor x; model 2, predictors x and y; model 3, predictors x, y, and z.
- Subtract the Multiple R^2 values to see the improvement. Use anova() to compare the models, and interpret the F-statistic.
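A sketch of this comparison with the hypothetical nested models from above:

```r
m3 <- lm(y ~ x1 + x2 + x3, data = df)  # m1, m2, m3 are nested

summary(m2)$r.squared - summary(m1)$r.squared  # improvement in Multiple R^2

anova(m1, m2, m3)  # F-test for each step; a significant F means the added
                   # predictors genuinely improve the model
```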
- Multicollinearity: if observed, it implies that unique regression coefficients cannot be obtained for the correlated variables.
- It creates difficulties in assessing feature importance
- It increases standard errors of regression coefficients
- Limits the size of R, the multiple correlation coefficient.
- To detect pairwise multicollinearity, inspect the correlation matrix; otherwise, use the Variance Inflation Factor (VIF).
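A sketch of both checks, assuming the hypothetical predictors from above; `vif()` comes from the `car` package, and VIF > 10 is a common rule-of-thumb threshold for concern:

```r
cor(df[, c("x1", "x2", "x3")])  # pairwise correlation matrix of the predictors

library(car)  # install.packages("car") if needed
vif(m3)       # variance inflation factors; values > 10 suggest a problem
```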
- If all the assumptions for regression are met and the analysis is conducted successfully, the sample regression model is, on average, likely to be the same as the population model; thus, the model is generalisable to the population.
- If the assumptions are not met, use robust regression via bootstrapping, which allows us to relax the assumptions.
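A hedged sketch of bootstrapped regression coefficients using the `boot` package (the model formula and R = 2000 replicates are illustrative):

```r
library(boot)

# Statistic: refit the model on a resampled data set, return the coefficients
boot_coef <- function(data, indices) {
  coef(lm(y ~ x1 + x2, data = data[indices, ]))
}

res <- boot(data = df, statistic = boot_coef, R = 2000)
boot.ci(res, type = "bca", index = 2)  # bootstrap CI for the x1 slope
```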