Regression models
\(\beta\) estimates
\[X \cdot \overrightarrow{\beta} = \overrightarrow{y}\] \[X^TX \cdot \overrightarrow{\beta} = X^T\overrightarrow{y}\] \[(X^TX)^{-1}X^TX \cdot \overrightarrow{\beta} = (X^TX)^{-1}X^T\overrightarrow{y}\] \[\overrightarrow{\beta} = (X^TX)^{-1}X^T\overrightarrow{y}\]
A unique inverse of the matrix \(X^TX\) exists when
- no predictor can be determined from a combination of one or more of the other predictors, and
- the number of samples is greater than the number of predictors.
When these conditions are violated, possible remedies include removing pairwise correlated predictors, using VIF to diagnose multicollinearity, and using PCA to reduce the dimension of the predictor space.
For univariate regressions, the coefficients can be calculated using sum of squares as follows: \[\hat{\beta_1} = \frac{SSXY}{SSX} = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n(x_i - \bar{x})^2}\] \[\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\]
The standard errors of the \(\beta\) estimates are: \[SE(\beta_0)^2 = \sigma^2\Big[\frac{1}{n} + \frac{\bar{x}^2}{SSX}\Big] = \sigma^2\Big[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n(x_i - \bar{x})^2}\Big]\] \[SE(\beta_1)^2 = \frac{\sigma^2}{SSX} = \frac{\sigma^2}{\sum_{i=1}^n(x_i - \bar{x})^2}\] where \[\sigma^2 = \frac{SSE}{n - 2}\]
Standard errors can be used to compute confidence intervals. A 95% confidence interval is defined as a range of values such that with 95% probability, the range will contain the true unknown value of the parameter.
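As a quick check, the coefficient, standard-error, and confidence-interval formulas above can be reproduced by hand in R and compared against lm() and confint(); the simulated data below are purely illustrative.

set.seed(1)
n = 100
x = rnorm(n)
y = 2 + 3 * x + rnorm(n)                        # true beta0 = 2, beta1 = 3

ssx  = sum((x - mean(x))^2)
ssxy = sum((x - mean(x)) * (y - mean(y)))
b1 = ssxy / ssx                                 # slope from sums of squares
b0 = mean(y) - b1 * mean(x)                     # intercept

sse    = sum((y - (b0 + b1 * x))^2)
sigma2 = sse / (n - 2)                          # residual variance estimate
se_b1  = sqrt(sigma2 / ssx)
se_b0  = sqrt(sigma2 * (1 / n + mean(x)^2 / ssx))

b1 + c(-1, 1) * qt(.975, n - 2) * se_b1         # 95% CI for beta1

fit = lm(y ~ x)                                 # should match the values above
summary(fit)$coefficients
confint(fit)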
F test
\[H_0: \beta_1 = \beta_2 = ... = \beta_p = 0\] \[H_1: \text{at least one }\beta\text{ is significantly different from 0}.\] \[F = \frac{\frac{SSR}{\text{model df}}}{\frac{SSE}{\text{error df}}} = \frac{\frac{SSY - SSE}{p}}{\frac{SSE}{n - p - 1}}\] The F-statistic relies on the normality assumption, but even if the errors are not normally-distributed, the F-statistic approximately follows an F-distribution provided that the sample size \(n\) is large.
To test whether a particular variable, or a particular group of variables, is significant, use an F-test of whether excluding it causes a significant increase in the residual sum of squares: \[F = \frac{\frac{SSE_{restricted} - SSE_{unrestricted}}{\Delta_{df}}}{\frac{SSE_{unrestricted}}{n - p - 1}}\] When only one variable is tested, the square of its t-statistic equals the corresponding F-statistic.
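A sketch of both tests in R on simulated data (variable names are illustrative): summary() reports the overall F-statistic, and anova() on nested models gives the partial F-test.

set.seed(2)
n = 200
x1 = rnorm(n); x2 = rnorm(n); x3 = rnorm(n)
y = 1 + 2 * x1 + rnorm(n)                 # x2 and x3 are pure noise

full = lm(y ~ x1 + x2 + x3)
restricted = lm(y ~ x1)

summary(full)                             # overall F test of all slopes = 0
anova(restricted, full)                   # partial F test for dropping x2 and x3

# for a single variable, t^2 equals its partial F statistic
summary(full)$coefficients['x2', 't value']^2
anova(lm(y ~ x1 + x3), full)$F[2]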
Problems with regression models
- Non-linearity of the data
- Residual plots are a useful graphical tool for identifying non-linearity.
- Correlation of error terms (autocorrelation)
- If there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors.
- Methods to detect autocorrelation:
- Durbin-Watson test \[d = {\sum_{t=2}^T (e_t - e_{t-1})^2 \over {\sum_{t=1}^T e_t^2}} \approx 2(1 - \rho)\] where \(\rho\) is the sample autocorrelation of the residuals.
- Plot the residuals and use acf, as in the example below:
suppressMessages(library(dplyr))
suppressMessages(library(quantmod))
suppressMessages(library(lmtest))
options('getSymbols.warning4.0' = F)
getSymbols('AAPL')  # download AAPL prices from Yahoo Finance
## [1] "AAPL"
aapl = data.frame(AAPL)
aapl$date = row.names(aapl)
aapl_return = diff(aapl$AAPL.Adjusted) / aapl$AAPL.Adjusted[-nrow(aapl)]  # simple daily returns
aapl$aapl_ret = c(NA, aapl_return)
aapl = select(aapl, date, aapl_ret)
getSymbols('SPX')
## [1] "SPX"
spx = data.frame(SPX)
spx$date = row.names(spx)
spx_return = diff(spx$SPX.Adjusted) / spx$SPX.Adjusted[-nrow(spx)]
spx$spx_ret = c(NA, spx_return)
spx = select(spx, date, spx_ret)
returns = inner_join(x = aapl, y = spx, by = 'date')
returns = filter(returns, date >= as.Date('2012-01-01'))
model = lm(aapl_ret ~ spx_ret, data = returns)  # regress AAPL returns on index returns
resid = residuals(model)
plot.ts(resid)  # residuals over time

acf(resid)  # autocorrelation function of the residuals

dwtest(model)  # Durbin-Watson test (lmtest)
##
## Durbin-Watson test
##
## data: model
## DW = 1.9363, p-value = 0.1958
## alternative hypothesis: true autocorrelation is greater than 0
par(mfrow = c(2, 2))
plot(model)  # standard lm diagnostic plots

- Non-constant variance of error terms (heteroskedasticity)
- Methods to detect heteroskedasticity:
- Breusch-Pagan test, White test
- Residual plot
- Methods to correct heteroskedasticity:
- Transform the response \(Y\) using a concave function such as \(\text{log}Y\) or \(\sqrt{Y}\). Such a transformation results in a greater amount of shrinkage of the larger responses, leading to a reduction in heteroscedasticity.
- If the form of the heteroskedasticity is known, we can use WLS to assign less weight to observations with a higher error variance.
- For example, the \(i\)th response could be an average of \(n_i\) raw observations. If each of these raw observations is uncorrelated with variance \(\sigma^2\), then their average has variance \(\sigma_i^2 = \frac{\sigma^2}{n_i}\). In this case a simple remedy is to fit our model using weights proportional to the inverse variances, i.e., \(w_i = n_i\) in this case.
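A minimal sketch of detecting and correcting heteroskedasticity, assuming a simulated setup in which the error variance grows with the predictor; bptest() comes from lmtest (loaded above) and the weights argument of lm() implements WLS.

set.seed(3)
n = 200
x = runif(n, 1, 10)
y = 1 + 2 * x + rnorm(n, sd = x)          # error sd grows with x

ols = lm(y ~ x)
plot(fitted(ols), resid(ols))             # funnel shape in the residual plot
bptest(ols)                               # Breusch-Pagan test (lmtest)

# WLS, assuming Var(e_i) proportional to x_i^2, i.e., w_i = 1 / x_i^2
wls = lm(y ~ x, weights = 1 / x^2)
summary(wls)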
- Outliers (in the y-direction)
- A point for which \(y_i\) is far from the value predicted by the model.
- Residual plots can be used to identify outliers, but instead of using residuals themselves, we can plot the studentized residuals, computed by dividing each residual by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.
- Linear regression is prone to chasing observations that lie far from the overall trend of the majority of the data. To mitigate this problem, alternatives to SSE can be used:
- minimizing the sum of the absolute errors
- Huber function (using the squared residuals when they are small and the absolute ones when they are above a threshold)
- High leverage points (in the x-direction)
- Methods to detect high leverage points:
- Leverage statistic
- The \(i\)th diagonal element of the hat matrix \(\mathbf{X}\left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\); for simple linear regression this reduces to \[h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^n(x_{i'} - \bar{x})^2}\]
- The leverage statistic \(h_i\) is always between \(\frac{1}{n}\) and 1, and the average leverage for all the observations is always equal to \(\frac{p + 1}{n}\). Hence, if a given observation has a leverage statistic that greatly exceeds \(\frac{p + 1}{n}\), then we may suspect that the corresponding point has high leverage.
- Cook’s distance
- Combines both leverage and outliers: \[C_i = \frac{1}{p}r_{studentized}^2\frac{h_i}{1 - h_i}\]
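These outlier and leverage diagnostics are available directly in base R; the sketch below reuses the model object fitted in the returns regression above.

rstu = rstudent(model)                    # studentized residuals
lev  = hatvalues(model)                   # leverage statistics h_i
cook = cooks.distance(model)              # Cook's distance

which(abs(rstu) > 3)                      # possible outliers
p = length(coef(model)) - 1
which(lev > 3 * (p + 1) / length(lev))    # unusually high leverage
plot(model, which = 5)                    # residuals vs leverage, with Cook's distance contours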
- Collinearity
- Collinearity reduces the accuracy of the estimates of the regression coefficients, and, hence, causes the standard error for \(\beta_j\) to grow.
- Method to detect collinearity: VIF
- Methods to correct collinearity: orthogonalization, PCA, or heuristically removing highly correlated predictors
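A sketch of VIF using the car package (an added dependency, not loaded above) on simulated collinear predictors.

suppressMessages(library(car))
set.seed(4)
n = 200
x1 = rnorm(n)
x2 = x1 + rnorm(n, sd = 0.1)              # nearly collinear with x1
x3 = rnorm(n)
y = 1 + x1 + x3 + rnorm(n)

vif(lm(y ~ x1 + x2 + x3))                 # values above ~5-10 flag problematic collinearity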
Partial least squares
- PCR: pre-processing predictors via PCA prior to performing regression (unsupervised)
- PLS: while the PCA linear combinations are chosen to maximally summarize predictor-space variability, the PLS linear combinations are chosen to maximally summarize covariance with the response (supervised). That is, PLS finds components that summarize the variation of the predictors while simultaneously requiring these components to have maximal correlation with the response.
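A sketch comparing PCR and PLS with the pls package (an added dependency) on simulated data; pcr() and plsr() share the same interface.

suppressMessages(library(pls))
set.seed(5)
n = 100; p = 20
X = matrix(rnorm(n * p), n, p)
y = drop(X[, 1:3] %*% c(3, 2, 1)) + rnorm(n)
dat = data.frame(y = y, X = I(X))         # keep X as a matrix column

pcr_fit = pcr(y ~ X, ncomp = 10, data = dat, validation = 'CV')   # unsupervised components
pls_fit = plsr(y ~ X, ncomp = 10, data = dat, validation = 'CV')  # supervised components
summary(pls_fit)                          # PLS typically needs fewer components than PCR
validationplot(pcr_fit, val.type = 'RMSEP')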
Penalized models to prevent overfitting
- Ridge regression: \[SSE_{L2} = \sum_{i=1}^{n}(y_i - \hat{y_i})^2 + \lambda\sum_{j=1}^{P}\beta_j^2\]
- Lasso regression: \[SSE_{L1} = \sum_{i=1}^{n}(y_i - \hat{y_i})^2 + \lambda\sum_{j=1}^{P}|\beta_j|\]
- Ridge regression is known to shrink the coefficients of correlated predictors towards each other, allowing them to borrow strength from each other. In the extreme case of \(k\) identical predictors, they each get identical coefficients with \(\frac{1}{k}\)th the size that any single one would get if fit alone.
- Lasso, on the other hand, is somewhat indifferent to very correlated predictors, and will tend to pick one and ignore the rest (i.e., feature selection).
- Elastic net: \[SSE_{Enet} = \sum_{i=1}^{n}(y_i - \hat{y_i})^2 + \lambda_1\sum_{j=1}^{P}\beta_j^2 + \lambda_2\sum_{j=1}^{P}|\beta_j|\]
- Predictors need to be centered and scaled before applying regularization, so that the penalty acts on coefficients measured on a common scale.
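A sketch with the glmnet package (an added dependency); alpha = 0 gives ridge, alpha = 1 the lasso, and intermediate values the elastic net. glmnet standardizes the predictors internally by default.

suppressMessages(library(glmnet))
set.seed(6)
n = 100; p = 20
X = matrix(rnorm(n * p), n, p)
y = 3 * X[, 1] - 2 * X[, 2] + rnorm(n)

ridge = cv.glmnet(X, y, alpha = 0)        # L2 penalty only
lasso = cv.glmnet(X, y, alpha = 1)        # L1 penalty only
enet  = cv.glmnet(X, y, alpha = 0.5)      # elastic net mix

coef(lasso, s = 'lambda.min')             # most coefficients shrunk exactly to zero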
Support vector machines
Hyperplane
- In a \(p\)-dimensional space, a hyperplane is a flat affine subspace of dimension \(p - 1\). The word affine indicates that the subspace need not pass through the origin. \[\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p = 0\]
- The maximal margin hyperplane is the separating hyperplane for which the margin is largest—that is, it is the hyperplane that has the farthest minimum distance to the training observations.
- The maximal margin hyperplane depends directly on the support vectors, but not on the other observations.
SVM for classification
- Optimization: \[max_{\beta_0, \beta_1, \dots, \beta_p}M\] \[\text{subject to }\sum_{j=1}^p\beta_j^2 = 1\] \[y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip}) \ge M \text{ } \forall i = 1, 2, \dots, n.\]
- The first constraint (\(\sum_{j=1}^p\beta_j^2 = 1\)) ensures that the perpendicular distance from the \(i\)th observation to the hyperplane is given by \[y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip})\]
- Together, the two constraints ensure that each observation is on the correct side of the hyperplane and at least a distance \(M\) from it.
- Sometimes a separating hyperplane does not exist; even if it does, it may lead to sensitivity to individual observations. Hence, an alternative classifier that allows misclassification solves the following optimization problem: \[max_{\beta_0, \beta_1, \dots, \beta_p, \epsilon_1, \epsilon_2, \dots, \epsilon_n}M\] \[\text{subject to }\sum_{j=1}^p\beta_j^2 = 1\] \[y_i(\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \dots + \beta_px_{ip}) \ge M(1 - \epsilon_i)\] \[\epsilon_i \ge 0, \sum_{i=1}^n\epsilon_i \le C\]
- \(\epsilon_1, \epsilon_2, \dots, \epsilon_n\) are slack variables that allow individual observations to be on the wrong side of the margin or the hyperplane.
- If \(\epsilon_i > 0\) then the \(i\)th observation is on the wrong side of the margin.
- If \(\epsilon_i > 1\) then it is on the wrong side of the hyperplane.
- \(C\) is the budget for the amount that the margin can be violated by the \(n\) observations. As the budget \(C\) increases, we become more tolerant of violations to the margin, and so the margin will widen.
- Observations that lie directly on the margin, or on the wrong side of the margin for their class, are the support vectors; only they affect the support vector classifier. Hence, the classifier is quite robust to the behavior of observations that are far away from the hyperplane.
- The classification function for a new sample can be written as \[f(\mathbf{u}) = \beta_0 + \mathbf{\beta}'\mathbf{u} = \beta_0 + \sum_{j=1}^P\beta_ju_j = \beta_0 + \sum_{i=1}^ny_i\alpha_i\mathbf{x}_i'\mathbf{u}\] where \(\alpha_i\) is nonzero only for the support vectors.
- The dot product, \(x_i'\mathbf{u}\), can be written as a product of the distance of \(x_i\) from the origin, the distance of \(\mathbf{u}\) from the origin, and the cosine of the angle between \(x_i\) and \(\mathbf{u}\). In other words, it measures the similarity between \(x_i\) and \(\mathbf{u}\).
- More generally, given the dot products of the new sample with the training samples, the equation can be written as: \[f(\mathbf{u}) = \beta_0 + \sum_{i=1}^ny_i\alpha_iK(\mathbf{x_i}, \mathbf{u})\] where \(K(\cdot)\) is called the kernel function, which quantifies the similarity of two observations. Other types of kernel functions can be used to encompass nonlinear functions of the predictors: \[\text{polynomial} = (\phi(\mathbf{x'u}) + 1)^{degree}\] \[\text{radial basis function} = \exp(-\sigma||\mathbf{x} - \mathbf{u}||^2)\] \[\text{hyperbolic tangent} = \tanh(\phi(\mathbf{x'u}) + 1)\] where \(\phi\) and \(\sigma\) are scaling parameters.
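A sketch with the e1071 package (an added dependency); note that its cost argument penalizes margin violations, so it plays the opposite role of the budget \(C\) above (a larger cost gives a narrower margin).

suppressMessages(library(e1071))
set.seed(7)
n = 100
X = matrix(rnorm(n * 2), n, 2)
y = factor(ifelse(X[, 1]^2 + X[, 2]^2 > 1.5, 'out', 'in'))   # circular boundary
dat = data.frame(X, y)

svm_lin = svm(y ~ ., data = dat, kernel = 'linear', cost = 1)
svm_rbf = svm(y ~ ., data = dat, kernel = 'radial', cost = 1, gamma = 1)
table(predict(svm_rbf, dat), dat$y)       # RBF kernel captures the nonlinear boundary

# tune cost and gamma by cross-validation
tuned = tune(svm, y ~ ., data = dat, kernel = 'radial',
             ranges = list(cost = c(0.1, 1, 10), gamma = c(0.5, 1, 2)))
summary(tuned)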
Relationship with logistic regression
- It can be shown that optimization criterion for SVM can be rewritten to the form of “Loss + Penalty” as follows: \[minimize_{\beta_0, \beta_1, \dots, \beta_p}{\sum_{i=1}^nmax[0, 1 - y_if(x_i)] + \lambda\sum_{j=1}^p\beta_j^2}\]
- The hinge loss function is closely related to the loss function used in logistic regression as shown below:
library(reshape2)
library(ggplot2)
yx = seq(-5, 4, .01)                      # values of y * f(x)
l1 = log(1 + exp(-yx))                    # logistic (log-likelihood) loss
l2 = ifelse(1 - yx < 0, 0, 1 - yx)        # hinge loss: max(0, 1 - y * f(x))
df = data.frame(cbind(yx, l1, l2))
df = melt(df, id = 'yx')
ggplot(df, aes(x = yx, y = value, colour = variable)) + geom_line() +
scale_colour_discrete(name = 'Model', labels = c('Logistic Regression Loss', 'SVM Loss')) + xlab('yf(x)') + ylab('Loss')

- Due to the similarities between their loss functions, logistic regression and the support vector classifier often give very similar results. When the classes are well separated, SVMs tend to behave better than logistic regression; in more overlapping regimes, logistic regression is often preferred.
SVM for regression
- \(\epsilon\)-insensitive loss function:
- data points with residuals within the threshold (i.e., samples that the model fits well) do not contribute to the regression fit.
- data points with an absolute difference greater than the threshold contribute a linear-scale amount.
- As a result, the poorly-predicted points (the “outliers” in the extreme case) are the only points that define the regression line.
- The SVM regression coefficients minimize \[cost\sum_{i=1}^{n}L_\epsilon(y_i - \hat{y_i}) + \sum_{j=1}^{P}\beta_j^2\] where \(L_\epsilon(\cdot)\) is the \(\epsilon\)-insensitive function and \(cost\) is the cost penalty, which plays the inverse role of the \(\lambda\) used in regularized regression or neural networks.
- The linear SVM predictor function for a new sample \(u\) is written as: \[\hat{y} = \beta_0 + \beta_1u_1 + \dots + \beta_Pu_P = \beta_0 + \sum_{j=1}^{P}\beta_ju_j = \beta_0 + \sum_{j=1}^{P}\sum_{i=1}^{n}\alpha_ix_{ij}u_j = \beta_0 + \sum_{i=1}^{n}\alpha_i\Big(\sum_{j=1}^{P}x_{ij}u_j\Big)\]
- Every data point potentially enters the prediction; however,
- For training set samples that are within \(\pm\epsilon\) of the regression line, the \(\alpha\) parameters are 0, indicating they have no impact on prediction.
- The rest of the data points with non-zero \(\alpha\) are called the support vectors.
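A sketch of \(\epsilon\)-insensitive regression with the same assumed e1071 dependency; epsilon sets the insensitivity threshold and cost the penalty on residuals outside it.

suppressMessages(library(e1071))
set.seed(8)
x = seq(-3, 3, length.out = 150)
y = sin(x) + rnorm(length(x), sd = 0.2)
dat = data.frame(x, y)

svr = svm(y ~ x, data = dat, type = 'eps-regression',
          kernel = 'radial', epsilon = 0.1, cost = 1)

plot(x, y)
lines(x, predict(svr, dat), col = 'red')  # fitted curve
length(svr$index)                         # number of support vectors defining the fit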
Linear classification models
Logistic regression
- Binomial likelihood: \[L(p) = C_n^r p^r (1 - p)^{n-r}\] where \(n\) is the total number of trials and \(r\) is the number of successes.
- Log odds of the event as a linear function: \[log\Big(\frac{p}{1 - p}\Big) = \beta_0 + \beta_1x_1 + \dots + \beta_Px_P\]
- Event probability: \[p = \frac{1}{1 + exp[-(\beta_0 + \beta_1x_1 + \dots + \beta_Px_P)]}\]
- Using maximum likelihood to fit a logistic regression model (given the data and given our choice of model, what value of the parameters of that model make the observed data most likely?): \[L(\beta) = \prod_{i=1}^nPr(Y = y_i|X = x_i) = \prod_{i=1}^np_i^{y_i}(1 - p_i)^{1-y_i}\] \[l(\beta) = \sum_{i=1}^n[y_i\text{log}p(x_i) + (1 - y_i)\text{log}(1 - p(x_i))]\]
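In R, glm() with family = binomial fits this model by maximum likelihood; a minimal sketch on simulated data:

set.seed(9)
n = 200
x = rnorm(n)
y = rbinom(n, 1, plogis(-1 + 2 * x))      # true beta0 = -1, beta1 = 2

fit = glm(y ~ x, family = binomial)
summary(fit)                              # coefficients are on the log-odds scale
predict(fit, data.frame(x = 0.5), type = 'response')   # fitted event probability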
Linear discriminant analysis
Given Bayes’ rule, \[Pr[Y = C_l|X] = \frac{Pr[Y = C_l] \cdot Pr[X|Y = C_l]}{Pr[X]} = \frac{Pr[Y = C_l] \cdot Pr[X|Y = C_l]}{\sum_{k=1}^{C}Pr[Y = C_k] \cdot Pr[X|Y = C_k]}\] \(X\) is classified into group \(C_l\) if \(Pr[Y = C_l] \cdot Pr[X|Y = C_l]\) has the largest value across all \(C\) classes.
Assuming the distribution of the predictors is multivariate normal with mean vector \(\mathbf{\mu}_l\) and covariance matrix \(\Sigma_l\), \[\text{log}(Pr[X|Y = C_l]) = \text{log} f_l(x) = \text{log} \Big(\frac{1}{(2\pi)^\frac{p}{2}|\Sigma_l|^\frac{1}{2}}\exp\big(-\frac{1}{2}(x - \mu_l)^T\Sigma_l^{-1}(x - \mu_l)\big) \Big)\]
Further assuming the covariance matrices are identical across groups, maximizing the above expression is equivalent (up to terms constant across classes) to maximizing \[x^T\Sigma^{-1}\mu_l - \frac{1}{2}\mu_l^T\Sigma^{-1}\mu_l\] Adding the log prior probability gives the linear discriminant function \[x^T\Sigma^{-1}\mu_l - \frac{1}{2}\mu_l^T\Sigma^{-1}\mu_l + \text{log}(Pr[Y = C_l])\] Without the assumption of constant covariance, the discriminant is no longer linear in \(X\); this is QDA.
LDA advantages over logistic regression:
- When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
- If \(n\) is small and the distribution of the predictors \(X\) is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
- Linear discriminant analysis is popular when we have more than two response classes.
LDA limitations:
- The LDA solution depends on inverting a covariance matrix - a unique solution exists only when the data contain more samples than predictors and the predictors are independent (just like in regression).
- Assumption of normal distribution and equal covariance.
Extensions to LDA:
- QDA
- RDA: \[\tilde{\Sigma_l}(\lambda) = \lambda\Sigma_l + (1 - \lambda)\Sigma\] where \(\Sigma_l\) is the covariance matrix of the \(l\)th class and \(\Sigma\) is the pooled covariance matrix across all classes.
- MDA: allows each class to be represented by multiple multivariate normal distributions. These distributions can have different means but, like LDA, the covariance structures are assumed to be the same.
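A sketch of LDA and QDA with the MASS package (an added dependency), using the built-in iris data:

suppressMessages(library(MASS))
lda_fit = lda(Species ~ ., data = iris)   # common covariance matrix
qda_fit = qda(Species ~ ., data = iris)   # class-specific covariance matrices

table(predict(lda_fit)$class, iris$Species)   # resubstitution confusion matrix
table(predict(qda_fit)$class, iris$Species)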
Partial least squares discriminant analysis (PLSDA)
- PLSDA seeks to find optimal group separation, guided by the between-groups covariance matrix, whereas PCA seeks to reduce dimension using the total variation, as directed by the overall covariance matrix of the predictors.
- However, if dimension reduction is not necessary and classification is the goal, LDA will always provide a lower misclassification rate than PLS.
Penalized models
glmnet uses ridge and lasso penalties simultaneously, like the elastic net, but structures the penalty slightly differently; it maximizes \[\text{log}L(p) - \lambda\Big[(1 - \alpha)\frac{1}{2}\sum_{j=1}^{P}\beta_j^2 + \alpha\sum_{j=1}^{P}|\beta_j|\Big]\] where \(\text{log}L(p)\) is the binomial or multinomial log likelihood and \(\alpha\) mixes the two penalties.
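The same interface handles classification; a sketch with family = 'binomial' on simulated data (the alpha value is chosen arbitrarily here):

suppressMessages(library(glmnet))
set.seed(10)
n = 200; p = 10
X = matrix(rnorm(n * p), n, p)
y = rbinom(n, 1, plogis(X[, 1] - X[, 2]))

cv_fit = cv.glmnet(X, y, family = 'binomial', alpha = 0.5)   # elastic net logistic regression
coef(cv_fit, s = 'lambda.1se')                               # sparse log-odds coefficients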